Discussion:
[regex] difficulties with the "Leftmost Longest" Rule
Detlef Meyer-Eltz
2006-12-18 17:05:18 UTC
Permalink
I have a difficulty to predict, which part of a regular expression
will match.

Example:
I have a regular expression for a general HTML tag: <[^>]*>
combined with an expression for the body tag: <body([^>]*)>

to: (<[^>]*>)|(<body([^>]*)>)

This expression matches the text: <body bgcolor="white">

As both alternatives can match the input with the same length, I
expected, that the repeated fouth part of the "Leftmost Longest" Rule
would determine, which alternatve is chosen:

4. Find the match which has matched the first sub-expression in the
leftmost position, along with any ties. If there is only on(e)
such match possible then return it.

// note the missing 'e'

As the tag-expression has no sub-expression at all, the
body-expression should win. Its sub-expression could match, but
doesn't. It seems to me, that the sequence of the alternatives
determines the match.

Now I guess, that I misinterpreted 4.: its not a means to predict the
matching alternative but only to find the one that matched
accidentally? My software constructs lexers from elementary
expressions automatically. So it's important for me to direct and
predict the expected matching alternative. Are there any other rules?
Does the sequence of the alternatives determine the match
unmistakably?


With Kind Regards,

Detlef Meyer-Eltz
John Maddock
2006-12-18 17:31:20 UTC
Permalink
Post by Detlef Meyer-Eltz
I have a difficulty to predict, which part of a regular expression
will match.
I have a regular expression for a general HTML tag: <[^>]*>
combined with an expression for the body tag: <body([^>]*)>
to: (<[^>]*>)|(<body([^>]*)>)
This expression matches the text: <body bgcolor="white">
As both alternatives can match the input with the same length, I
expected, that the repeated fouth part of the "Leftmost Longest" Rule
4. Find the match which has matched the first sub-expression in the
leftmost position, along with any ties. If there is only on(e)
such match possible then return it.
// note the missing 'e'
As the tag-expression has no sub-expression at all, the
body-expression should win. Its sub-expression could match, but
doesn't. It seems to me, that the sequence of the alternatives
determines the match.
Now I guess, that I misinterpreted 4.: its not a means to predict the
matching alternative but only to find the one that matched
accidentally? My software constructs lexers from elementary
expressions automatically. So it's important for me to direct and
predict the expected matching alternative. Are there any other rules?
Does the sequence of the alternatives determine the match
unmistakably?
Which Boost.Regex version are you using, and how are you compiling the
expression?

Recent versions default to the Perl matching rules: *which do not use the
leftmost longest rule*. They match based on a "first match found" rule, so
if the first alternative leads to a match then subsequent alternatives are
never examined.

If you really want leftmost-longest semantics, then compile the expression
as a POSIX extended regex, but of course then you loose the ability to use
Perl-like regex extensions.

HTH, John.

PS, your analysis of the leftmost-longest rule looks correct however.
Detlef Meyer-Eltz
2006-12-18 19:56:56 UTC
Permalink
Thank you for your quick reply. It's just because of the
leftmost-longest rule, that I'm indeed using POSIX extended style
(from boost 1.33.1 with Borland CBuilder 6). To assert, that I didn't
do anything wrong im my software, I now have made an extra commandline
program:


///////////////////////////////////////////

#pragma hdrstop

#include <iostream>
#include <string>
#include "boost\regex.hpp"

using namespace std;
using namespace boost;
//---------------------------------------------------------------------------

int main(int argc, char* argv[])
{
// my usual option, but makes no difference here, I think
regex_constants::syntax_option_type syntax_flags = regex_constants::extended
& ~regex_constants::no_escape_in_lists;

regex expression1("(<[^>]*>)|(<body([^>]*)>)", regex_constants::extended);
regex expression2("(<body([^>]*)>)|(<[^>]*>)", regex_constants::extended);
string testtext = "<body bgcolor=\"white\">";
std::string::const_iterator start, end;
start = testtext.begin();
end = testtext.end();

match_results<std::string::const_iterator> what;
match_flag_type flags = match_default;

cout << "testing: " << expression1 << endl;
if(regex_search(start, end, what, expression1, flags))
{
int iCount = 0;
for(unsigned int i = 0 ; i < what.size() ; ++i )
cout << "sub-expression" << iCount++ << ": \"" << what[i] << "\"" << endl;
}

cout << endl;

cout << "testing: " << expression2 << endl;
if(regex_search(start, end, what, expression2, flags))
{
int iCount = 0;
for(unsigned int i = 0 ; i < what.size() ; ++i )
cout << "sub-expression" << iCount++ << ": \"" << what[i] << "\"" << endl;
}

return 0;
}

/////////////////////////////////////////////////////////


That's the result:

testing: (<[^>]*>)|(<body([^>]*)>)
sub-expression0: "<body bgcolor="white">" // (<[^>]*>)|(<body([^>]*)>)
sub-expression1: "<body bgcolor="white">" // <[^>]*>
sub-expression2: "" // <body([^>]*)>
sub-expression3: "" // [^>]*

testing: (<body([^>]*)>)|(<[^>]*>)
sub-expression0: "<body bgcolor="white">" // (<body([^>]*)>)|(<[^>]*>)
sub-expression1: "<body bgcolor="white">" // <body([^>]*)>
sub-expression2: " bgcolor="white"" // [^>]*
sub-expression3: "" // <[^>]*>


I hoped, for both arrangements of the alternatives, the matching parts
would be the same. I would be very happy, if I had made a mistake ;-)


Best regards

Detlef Meyer-Eltz

--
Post by John Maddock
Post by Detlef Meyer-Eltz
I have a difficulty to predict, which part of a regular expression
will match.
I have a regular expression for a general HTML tag: <[^>]*>
combined with an expression for the body tag: <body([^>]*)>
to: (<[^>]*>)|(<body([^>]*)>)
This expression matches the text: <body bgcolor="white">
As both alternatives can match the input with the same length, I
expected, that the repeated fouth part of the "Leftmost Longest" Rule
4. Find the match which has matched the first sub-expression in the
leftmost position, along with any ties. If there is only on(e)
such match possible then return it.
// note the missing 'e'
As the tag-expression has no sub-expression at all, the
body-expression should win. Its sub-expression could match, but
doesn't. It seems to me, that the sequence of the alternatives
determines the match.
Now I guess, that I misinterpreted 4.: its not a means to predict the
matching alternative but only to find the one that matched
accidentally? My software constructs lexers from elementary
expressions automatically. So it's important for me to direct and
predict the expected matching alternative. Are there any other rules?
Does the sequence of the alternatives determine the match
unmistakably?
Which Boost.Regex version are you using, and how are you compiling the
expression?
Recent versions default to the Perl matching rules: *which do not use the
leftmost longest rule*. They match based on a "first match found" rule, so
if the first alternative leads to a match then subsequent alternatives are
never examined.
If you really want leftmost-longest semantics, then compile the expression
as a POSIX extended regex, but of course then you loose the ability to use
Perl-like regex extensions.
HTH, John.
PS, your analysis of the leftmost-longest rule looks correct however.
_______________________________________________
Boost-users mailing list
http://lists.boost.org/mailman/listinfo.cgi/boost-users
John Maddock
2006-12-19 10:24:12 UTC
Permalink
Post by Detlef Meyer-Eltz
Thank you for your quick reply. It's just because of the
leftmost-longest rule, that I'm indeed using POSIX extended style
(from boost 1.33.1 with Borland CBuilder 6). To assert, that I didn't
do anything wrong im my software, I now have made an extra commandline
I see the same output, here's what's happening with the leftmost longest
rule:

1 Find the leftmost match, if there is only one match possible at this
location then return it.

There are two matches at position 0, carry on...

2) Find the longest of the possible matches, along with any ties. If there
is only one such possible match then return it.

Both are equal length carry on....

3) If there are no marked sub-expressions, then all the remaining
alternatives are indistinguishable; return the first of these found.

Doesn't apply, carry on...

4) Find the match which has matched the first sub-expression in the leftmost
position, along with any ties. If there is only on such match possible then
return it.

There is only one match with $1 matched: the first alternative. So we
return it.

We don't get to these:

5) Find the match which has the longest match for the first sub-expression,
along with any ties. If there is only one such match then return it.
6) Repeat steps 3 and 4 for each additional marked sub-expression.
If there is still more than one possible match remaining, then they are
indistinguishable; return the first one found.

If you change the expression to

"<[^>]*>|<body([^>]*)>"

The the presence of a marked sub-expression on the second alternative breaks
the tie, and the second alternative is returned.

I know this wasn't what you were looking for, but I hope this explanation
helps!

John.
Detlef Meyer-Eltz
2006-12-19 16:07:06 UTC
Permalink
Post by John Maddock
I know this wasn't what you were looking for, but I hope this
explanation
helps!
Thank you, the truth always helps.

A hint for you: there is another writing mistake: "that" instead of
"than" in
"Often there is more that one way of matching".

And the boost::regex::egrep option doesn't work. No match is found
with:

regex expression1("<[^>]*>\n<body([^>]*)>", boost::regex::egrep);

I guess, even a working egrep option would not change anything for me.

Best regards

Detlef.
John Maddock
2006-12-20 16:51:12 UTC
Permalink
Post by Detlef Meyer-Eltz
Post by John Maddock
I know this wasn't what you were looking for, but I hope this
explanation
helps!
Thank you, the truth always helps.
A hint for you: there is another writing mistake: "that" instead of
"than" in
"Often there is more that one way of matching".
That'll be fixed in cvs in a moment: thanks for pointing that out.
Post by Detlef Meyer-Eltz
And the boost::regex::egrep option doesn't work. No match is found
regex expression1("<[^>]*>\n<body([^>]*)>", boost::regex::egrep);
Yikes, that's a bad one to slip through, I'm testing (the fairly trivial)
fix now.
Post by Detlef Meyer-Eltz
I guess, even a working egrep option would not change anything for me.
Nope, you need to either:

Use the perl-compatible expressions and place the alternatives in the order
you want them searched,

or

Use POSIX expressions, and either put brackets around all the alternatives
*and* put them in the order you want. Or don't put braces around those of
lower priority, and do put them around those of higher priority.

HTH, John.
Detlef Meyer-Eltz
2006-12-21 13:36:31 UTC
Permalink
I didn't want to steal your time, so I didn't tell you, what really is
my problem. You were too kind that I can resist any more.

I've built a parser generator IDE where the user can define single
tokens as regular expressions. These expressions are combined into a
bigger expression automatically, which then is used as a scanner. For
this, it is necessary, that the token, which matched can be calculated
from the scanner sub-expressions which matched. Further the
sub-expressions of the matching token should be accessible easily. So
I put every alternative token into parenthesis.


Example:

Integer ::= \d+
Real ::= (\d+\.\d*|\.\d+)([eE][+-]*\d+)?
Identifier ::= \w+

Scanner :: (\d+)|((\d+\.\d*|\.\d+)([eE][+-]*\d+)?)|(\w+)

or, to get the token at the actual location of the input:

Scanner :: \A((\d+)|((\d+\.\d*|\.\d+)([eE][+-]*\d+)?)|(\w+))

Parallel to the construction of the scanner expression a vector
(m_vSymbols) is filled with the numbers of (and symbols to) the
subexpressions, which are representing the tokens (2,3,6). (These
numbers are calculated by means of mark_count()). You now get the
matching token by

for(t = m_vSymbols.begin(); t != tEnd; ++t)
if(xMatch[t->first].matched)
return *t;

The sub-expressions of the matching token can be accessed similar as
in an isolated regular expression by adding the offset of the whole
sub-expression: Token.str(i) == Scanner.str(i + offset). This goes
behind the scene and worked fine for a long time.

Now you can imagine, that it is a shock for me, to discover, that I
misinterpreted the leftmost longest rule in the manner I liked. I
didn't stumble over this error, because the matching of two
alternatives with the same length seems to be a rare case.
Post by John Maddock
Use the perl-compatible expressions and place the alternatives in the order
you want them searched,
POSIX regular expressions are preferable for a parser, because the
leftmost longest rule helps to solve conflicts. For example a label
can be distinguished from an ordinary identifier easily: \w+|\w+\s*:
Perhaps I will provide the option in the future to use Perl regexs.
Post by John Maddock
or
Use POSIX expressions, and either put brackets around all the
alternatives
*and* put them in the order you want. Or don't put braces around
those of
lower priority, and do put them around those of higher priority.
Yes, the first suggestion seems feasible: I simply could arrange the
tokens in the reversed order of their mark_count. This should have
exactly the intended result. If I am right, this technique could be
used for an extension of the regex library. Something like:

template <class symbol_type, class charT, class traits =
regex_traits<charT> >
class lexregex {
...
void add_symbol(const charT* p, symbol_type s);

I don't know, whether there is a chance to write code for such an
addition to a lexregex which is already compiled. Otherwise such a
lexregex had to be compiled in an extra step before use. In this form
I could make it on top of the existing regex class.

An according lexmatch_results is needed too, with

string_type str(int sub = 0) const; // returning the subexpressions
of the matching token
symbol_type symbol() const;



In this context there are two other points I'm interested in:

In my parser generator there is a preference for literal tokens
already (they aren't treated as regular expressions but by a ternary
tree), and I have a vague idea, that generally a token should be
preferred the more, the more literally it is. In your documentation
you mention some experimental non-member comparison operators. What is
the idea behind these comparisions? Could they be used, to define
preferences?


I guess, that testing one token after the other would be much more
expensive, than testing them together. All the more as there is a
special feature of my parser generator not only to test for tokens at
the actual location in the input as to look for the next location,
where one of several tokens occur. Can you tell me something about
these differences of costs?


I am very interested to know your opinion about my ideas.
(It isn't urgent.)


Best Regards

Detlef
John Maddock
2007-01-11 17:59:49 UTC
Permalink
Post by Detlef Meyer-Eltz
Now you can imagine, that it is a shock for me, to discover, that I
misinterpreted the leftmost longest rule in the manner I liked. I
didn't stumble over this error, because the matching of two
alternatives with the same length seems to be a rare case.
Nod.
Post by Detlef Meyer-Eltz
void add_symbol(const charT* p, symbol_type s);
I don't know, whether there is a chance to write code for such an
addition to a lexregex which is already compiled. Otherwise such a
lexregex had to be compiled in an extra step before use. In this form
I could make it on top of the existing regex class.
I think you would have to recompile the whole regex in order to add an
arbitrary extension to the expression.
Post by Detlef Meyer-Eltz
In my parser generator there is a preference for literal tokens
already (they aren't treated as regular expressions but by a ternary
tree), and I have a vague idea, that generally a token should be
preferred the more, the more literally it is. In your documentation
you mention some experimental non-member comparison operators. What is
the idea behind these comparisions? Could they be used, to define
preferences?
The comparison operators aren't used anywhere to determine matching: they
can be used by the user to compare the result to a specific string for
example.
Post by Detlef Meyer-Eltz
I guess, that testing one token after the other would be much more
expensive, than testing them together. All the more as there is a
special feature of my parser generator not only to test for tokens at
the actual location in the input as to look for the next location,
where one of several tokens occur. Can you tell me something about
these differences of costs?
It's likely to be more expensive yes: "impossible" branches in the state
machine get eliminated quite quickly in the regex internals during matching,
where as the "one expression at a time" approach necessarily tests all the
candidates. The difference would depend very much upon the particular
expressions though.

HTH, John.

Continue reading on narkive:
Loading...