Regular expressions

Started by taby December 28, 2021 03:46 AM

5 comments, last by frob 3 years, 1 month ago

taby

Author

1,557

December 28, 2021 03:46 AM

I have a regular expression that produces non-empty sentence-sized fragments. Why does the \s* work?

vector<string> vs = std_strtok(
    "bla1 bla1 bla1... bla2 bla22bla2?! bla3 bla3bla3!",
    "[.?!]\\s*");

The full code is at https://github.com/sjhalayka/regex/blob/main/main.cpp

It produces:

'bla1 bla1 bla1'

'bla2 bla22bla2'

'bla3 bla3bla3'
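For readers without the linked source at hand, here is a minimal sketch of how a splitter like this could be written with std::sregex_token_iterator. It is not the code from the linked repository: the name split_nonempty and the decision to drop empty pieces are assumptions made for illustration.

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: split `text` on every match of `pattern`
// and keep only the non-empty pieces between the matches.
std::vector<std::string> split_nonempty(const std::string& text,
                                        const std::string& pattern)
{
    const std::regex separator(pattern);
    // Submatch index -1 selects the text *between* matches.
    std::sregex_token_iterator it(text.begin(), text.end(), separator, -1);
    const std::sregex_token_iterator end; // default-constructed: end of sequence

    std::vector<std::string> pieces;
    for (; it != end; ++it) {
        const std::string piece = it->str();
        if (!piece.empty())
            pieces.push_back(piece);
    }
    return pieces;
}

// Self-check using the string from the post above.
void check_split_nonempty()
{
    const std::string text = "bla1 bla1 bla1... bla2 bla22bla2?! bla3 bla3bla3!";

    const std::vector<std::string> with_ws = {
        "bla1 bla1 bla1", "bla2 bla22bla2", "bla3 bla3bla3"};
    assert(split_nonempty(text, "[.?!]\\s*") == with_ws);

    // Without \s* the whitespace after the punctuation stays in the piece.
    const std::vector<std::string> without_ws = {
        "bla1 bla1 bla1", " bla2 bla22bla2", " bla3 bla3bla3"};
    assert(split_nonempty(text, "[.?!]") == without_ws);
}
```

The second check shows what the \s* buys: without it, the space that follows each separator becomes the first character of the next piece.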

Alberth

10,257

December 28, 2021 05:34 PM

The standard seems very unclear about it, but usually this happens due to the greediness of the “*” operator. Regular expression implementations in some languages offer several forms of the repetition operators with different degrees of greediness, from matching as few characters as possible to matching as many as possible, even if that makes the overall match fail.
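A small sketch of the greedy versus lazy difference, assuming the default ECMAScript grammar of std::regex, where a trailing “?” makes a repetition lazy. The helper name first_match and the sample text are made up for this example.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper: return the text of the first match of `pattern`
// in `text`, or an empty string when nothing matches.
std::string first_match(const std::string& text, const std::string& pattern)
{
    const std::regex re(pattern); // ECMAScript grammar by default
    std::smatch m;
    return std::regex_search(text, m, re) ? m.str() : std::string();
}

void check_greediness()
{
    const std::string text = "end.   next";
    // Greedy: \s* takes every whitespace character after the dot.
    assert(first_match(text, "[.?!]\\s*") == ".   ");
    // Lazy: \s*? takes as few as possible, which here is none at all.
    assert(first_match(text, "[.?!]\\s*?") == ".");
}
```

With the greedy form the whole run of whitespace becomes part of the separator, which is why it no longer shows up at the start of the next token.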

frob

46,319

December 29, 2021 03:50 AM

Tokenization looks correct, also when compared with regexer.com.

It looks like on line 14, end is uninitialized. Fortunately for your code it happens to work, but it is definitely a bug.

Parsing it myself, I see:

Searching for any of dot, question mark, or exclamation mark, followed by zero or more whitespace characters.

Found token: bla1 bla1 bla1

Matches that aren't tokens: “.” “.” “. ” (that has the space)

Found token: bla2 bla22bla2

Matches that aren't tokens: “?” “! ” (note the space)

Found token: bla3 bla3bla3

Matches that aren't tokens: “!”

What results were you expecting?
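The list of matches above can be reproduced with std::sregex_iterator. This is only a sketch; the helper name all_matches is hypothetical and not taken from the linked code.

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: collect the text of every match of `pattern` in `text`.
std::vector<std::string> all_matches(const std::string& text,
                                     const std::string& pattern)
{
    const std::regex re(pattern);
    std::vector<std::string> found;
    // A default-constructed sregex_iterator marks the end of the sequence.
    for (std::sregex_iterator it(text.begin(), text.end(), re), end;
         it != end; ++it)
        found.push_back(it->str());
    return found;
}

void check_separators()
{
    const std::string text = "bla1 bla1 bla1... bla2 bla22bla2?! bla3 bla3bla3!";
    // Six separator matches; two of them include the trailing space.
    const std::vector<std::string> expected = {".", ".", ". ", "?", "! ", "!"};
    assert(all_matches(text, "[.?!]\\s*") == expected);
}
```

Each “.” in “...” is its own match because \s* is allowed to match zero characters, so only the last dot picks up the space.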

taby

Author

1,557

December 29, 2021 05:05 PM

Well, I was expecting it to tokenize based on [.?!], which it does, but I'm not certain why it strips the leading whitespace when you put \s* in after the [.?!]. For fun, I tried putting \s* in before [.?!], and it doesn't work. I'm at a loss as to why it must be put after, instead of before.

P.S. Does it strip out the [.?!] tokens first, and then feed those tokens one-by-one into \s* …?

No bug: https://www.cplusplus.com/reference/regex/regex_token_iterator/regex_token_iterator/

hplus0603

11,940

December 29, 2021 05:35 PM

It doesn't “strip” leading whitespace, it “skips” it. Your expression is not anchored on the left, so it will skip whatever it has to skip before it finds a match.

enum Bool { True, False, FileNotFound };

frob

46,319

December 29, 2021 06:02 PM

taby said:
I'm at a loss as to why it must be put after, instead of before.

What is “it”?

The expression [.?!]\s* is as described: any of dot, question mark, or exclamation mark, followed by zero or more whitespace characters.

If “it” means the \s* (zero or more whitespace characters) in \s*[.?!], then the pattern would consume any run of whitespace that is immediately followed by a dot, question mark, or exclamation mark. In your string no whitespace is ever followed by one of those characters, so the whitespace would never be removed.
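To make the difference between the two orderings visible, here is a small sketch that replaces every separator match with a “|” using std::regex_replace. The helper name mark_separators and the “|” marker are assumptions for illustration only.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper: replace every match of `pattern` with a '|'
// so it is easy to see exactly which characters each match consumed.
std::string mark_separators(const std::string& text, const std::string& pattern)
{
    return std::regex_replace(text, std::regex(pattern), "|");
}

void check_orderings()
{
    const std::string text = "bla1 bla1 bla1... bla2 bla22bla2?! bla3 bla3bla3!";

    // \s* after the punctuation: the following space is part of the match.
    assert(mark_separators(text, "[.?!]\\s*") ==
           "bla1 bla1 bla1|||bla2 bla22bla2||bla3 bla3bla3|");

    // \s* before the punctuation: the space after it is left in place.
    assert(mark_separators(text, "\\s*[.?!]") ==
           "bla1 bla1 bla1||| bla2 bla22bla2|| bla3 bla3bla3|");

    // Whitespace *before* punctuation is what \s*[.?!] would remove.
    assert(mark_separators("bla1 . bla2", "\\s*[.?!]") == "bla1| bla2");
}
```

Both patterns find six separators in the original string. They differ only in which side of the punctuation the whitespace is taken from.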

If you mean something else, please describe it better, and include an example of how you expected it to match. As given, your string contains six matches that are treated as token separators, leaving behind three separated tokens.

This topic is closed to new replies.
