
Regular expressions

Started by December 28, 2021 03:46 AM
5 comments, last by frob 2 years, 11 months ago

I have a regular expression that produces non-empty sentence-sized fragments. Why does the \s* work?

vector<string> vs = std_strtok(
    "bla1 bla1 bla1... bla2 bla22bla2?! bla3 bla3bla3!",
    "[.?!]\\s*");

The full code is at https://github.com/sjhalayka/regex/blob/main/main.cpp

It produces:

3

'bla1 bla1 bla1'

'bla2 bla22bla2'

'bla3 bla3bla3'

The standard seems very unclear about it, but usually this happens because the “*” operator is greedy. RE implementations in some languages offer several forms of the repetition operators with different degrees of greediness (from as few as possible to as many as possible, even if that breaks matching).
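To make the greediness point concrete, here is a minimal sketch, assuming the default ECMAScript grammar of std::regex, which also accepts the lazy form “*?”. The helper name first_match is made up for illustration and is not part of the linked main.cpp.

```cpp
#include <regex>
#include <string>

// Hypothetical helper (not from the linked main.cpp): return the first
// substring of `text` matched by `pattern`, or an empty string if none.
std::string first_match(const std::string& text, const std::string& pattern)
{
    const std::regex re(pattern);   // default grammar: ECMAScript
    std::smatch m;
    return std::regex_search(text, m, re) ? m.str() : std::string();
}
```

Called on "end.   next", the greedy pattern "[.?!]\\s*" should match the dot plus all three spaces, while the lazy pattern "[.?!]\\s*?" should match the dot alone, because nothing after the quantifier forces it to take more.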


Tokenization looks correct, also when compared with regexer.com.

It looks like on line 14, end is uninitialized. Fortunately for your code it happens to work, but it is definitely a bug.

Parsing it myself, I see:

Searching for any of dot, question mark, or exclamation mark, followed by zero or more whitespace characters.

Found token: bla1 bla1 bla1

Matches that aren't tokens: “.” “.” “. ” (that has the space)

Found token: bla2 bla22bla2

Matches that aren't tokens: “?” “! ” (note the space)

Found token: bla3 bla3bla3

Matches that aren't tokens: “!”
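The hand-parse above can be checked with a short sketch that lists every separator match. The helper name separator_matches is hypothetical and only illustrates the idea; it is not the code from the linked repository.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper (not from the linked main.cpp): collect, in order,
// every substring of `text` that the separator pattern matches.
std::vector<std::string> separator_matches(const std::string& text,
                                           const std::string& pattern)
{
    const std::regex re(pattern);
    std::vector<std::string> out;
    // A default-constructed std::sregex_iterator is the end-of-sequence iterator.
    for (std::sregex_iterator it(text.begin(), text.end(), re), end; it != end; ++it)
        out.push_back(it->str());
    return out;
}
```

For the string in question and the pattern "[.?!]\\s*", this should return six matches: ".", ".", ". ", "?", "! " and "!".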

What results were you expecting?

Well, I was expecting it to tokenize based on [.?!], which it does, but I'm not certain why it strips the leading whitespace when you put \s* in after the [.?!]. For fun, I tried putting \s* in before [.?!], and it doesn't work. I'm at a loss as to why it must be put after, instead of before.

P.S. Does it strip out the [.?!] tokens first, and then feed those tokens one-by-one into \s* …?

No bug: https://www.cplusplus.com/reference/regex/regex_token_iterator/regex_token_iterator/

It doesn't “strip” a leading whitespace, it “skips” it. Your expression is not anchored on the left, so it will skip whatever it has to skip before it finds a match.

enum Bool { True, False, FileNotFound };

taby said:
I'm at a loss as to why it must be put after, instead of before.

What is “it”?

The expression [.?!]\s* is as described: any of dot, question, or exclamation, followed by zero or more spaces.

If you mean “it” is the \s* (zero or more whitespace characters) of \s*[.?!], it would first consume any whitespace that is immediately followed by a dot, question mark, or exclamation mark. In your string, no whitespace is ever followed by those characters, so the whitespace would never be removed.
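A minimal sketch of the difference, assuming a tokenizer that keeps only the non-empty pieces between matches. The name split_nonempty is made up here; the std_strtok in the linked code may be written differently.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper (an assumption, not the linked std_strtok): split
// `text` on every match of `pattern` and keep only the non-empty pieces.
std::vector<std::string> split_nonempty(const std::string& text,
                                        const std::string& pattern)
{
    const std::regex re(pattern);
    std::vector<std::string> tokens;
    // Submatch index -1 selects the text between matches, not the matches.
    std::sregex_token_iterator it(text.begin(), text.end(), re, -1);
    const std::sregex_token_iterator end;  // default-constructed: end-of-sequence
    for (; it != end; ++it)
        if (it->length() > 0)
            tokens.push_back(it->str());
    return tokens;
}
```

With "[.?!]\\s*" the three tokens should come back without leading whitespace. With "\\s*[.?!]" the second and third tokens should each keep one leading space (" bla2 bla22bla2" and " bla3 bla3bla3"), since the whitespace sits after the punctuation rather than before it.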

If you mean something else, please describe it better, and include an example of how you expected it to match. As given, the string you have contains six matches that are considered token separators, leaving behind three separated tokens.

