-Wundefined-control-flow¶
With the `-Wundefined-control-flow` warning, re2c checks that every path in the
generated DFA contains at least one accepting state. When the input matches
such a path, the lexer will eventually stop and execute the corresponding
semantic action. However, if some path has no accepting state, then the lexer's
behavior is undefined: it may loop forever, read past the end of the buffer, or
jump to some other semantic action by accident. For example, consider this
simple piece of code (a.re) that is supposed to match the letter `a`:
/*!re2c
"a" { return 'a'; }
*/
The generated code looks like this:
{
    YYCTYPE yych;
    if (YYLIMIT <= YYCURSOR) YYFILL(1);
    yych = *YYCURSOR;
    switch (yych) {
    case 'a': goto yy3;
    default: goto yy2;
    }
yy2:
yy3:
    ++YYCURSOR;
    { return 'a'; }
}
Clearly this is not what we want: this code matches any character, not just `a`!
This happens because we did not specify any handler for the remaining input symbols.
If we run re2c with `-Wundefined-control-flow`, we will see that it complains about undefined control flow and recommends using the default rule `*`:
a.re:3:2: warning: control flow is undefined for strings that match '[\x0-\x60\x62-\xFF]', use the default '*' rule [-Wundefined-control-flow]
Let’s follow the advice and change the code:
/*!re2c
* { return '*'; }
"a" { return 'a'; }
*/
Now the generated code looks much better:
{
    YYCTYPE yych;
    if (YYLIMIT <= YYCURSOR) YYFILL(1);
    yych = *YYCURSOR;
    switch (yych) {
    case 'a': goto yy4;
    default: goto yy2;
    }
yy2:
    ++YYCURSOR;
    { return '*'; }
yy4:
    ++YYCURSOR;
    { return 'a'; }
}
Note that the default rule brings no overhead: it simply binds code to the default label. It should always be used, unless you are absolutely sure that your grammar covers all possible cases.
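To make the control flow concrete, here is a hand-written sketch of what the fixed lexer does for a single input character. It is not re2c output: the function name `lex_one` and the cursor-pointer interface are assumptions made for illustration, and the buffer refill logic (`YYLIMIT`, `YYFILL`) is left out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical hand-written counterpart of the generated code above.
 * Consumes exactly one character at *cursor and returns the value that
 * the matching rule's semantic action would return. */
static char lex_one(const char **cursor)
{
    assert(cursor != NULL && *cursor != NULL);

    char yych = **cursor;   /* peek at the current character */
    ++*cursor;              /* both rules consume exactly one character */

    switch (yych) {
    case 'a': return 'a';   /* rule "a" */
    default:  return '*';   /* default rule: every other character */
    }
}
```

Every input character now reaches exactly one semantic action, which is what the warning asks for. The caller is still responsible for stopping at the end of input, since this sketch performs no bounds checking.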
The old default rule¶
When the world was young and re2c didn’t have the default `*` rule (that is,
before re2c-0.13.7), everyone used `[^]` as the default rule, as in this
example (any.re):
/*!re2c
// ... normal rules ...
[^] { return "any"; }
*/
`[^]` is just an ordinary rule: it matches any character and has normal
priority (so it should be the last rule). If no other rule matches, `[^]`
matches and consumes one character.
But exactly what is a character? First, it is an abstract number that is assigned some sacred meaning within the current encoding: a code point. Second, it is a minimal piece of information (say, a combination of bits) that can represent a unit of encoded text: a code unit. Rules are defined in terms of code points. Input is measured in code units. In fixed-width encodings (such as ASCII, EBCDIC, UCS-2, UTF-32, etc.), there is a one-to-one correspondence between code points and code units. In variable-width encodings (such as UTF-8, UTF-16, etc.), code points map to code unit sequences of different lengths.
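As a rough illustration of the difference, the following sketch counts how many code units a single code point occupies in UTF-8 and in UTF-16. The function names `utf8_units` and `utf16_units` are hypothetical and have nothing to do with re2c itself; the thresholds are the standard UTF-8 and UTF-16 encoding boundaries.

```c
#include <assert.h>
#include <stdint.h>

/* Number of 8-bit code units needed to encode code point cp in UTF-8. */
static int utf8_units(uint32_t cp)
{
    assert(cp <= 0x10FFFF);         /* largest Unicode code point */
    if (cp <= 0x7F)   return 1;     /* ASCII range */
    if (cp <= 0x7FF)  return 2;
    if (cp <= 0xFFFF) return 3;
    return 4;
}

/* Number of 16-bit code units needed to encode code point cp in UTF-16. */
static int utf16_units(uint32_t cp)
{
    assert(cp <= 0x10FFFF);
    return cp <= 0xFFFF ? 1 : 2;    /* above the BMP: a surrogate pair */
}
```

For example, U+20AC takes three code units in UTF-8 but only one in UTF-16, while U+1F600 takes four and two respectively.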
The `[^]` rule matches any code point. In fixed-width encodings, it covers all
code units and consumes exactly one of them. In variable-width encodings, it
consumes a variable number of code units and does not match every code unit
sequence. The example above compiles without warnings with any fixed-width
encoding (ASCII by default). However, with the UTF-8 encoding,
`re2c -i8 -Wundefined-control-flow any.re` complains:
any.re:4:2: warning: control flow is undefined for strings that match
'[\x80-\xC1\xF5-\xFF]'
'\xF0 [\x0-\x8F\xC0-\xFF]'
'[\xE1-\xEF] [\x0-\x7F\xC0-\xFF]'
'\xF4 [\x0-\x7F\x90-\xFF]'
'\xE0 [\x0-\x9F\xC0-\xFF]'
'[\xF1-\xF3] [\x0-\x7F\xC0-\xFF]'
'[\xC2-\xDF] [\x0-\x7F\xC0-\xFF]'
'\xE0 [\xA0-\xBF] [\x0-\x7F\xC0-\xFF]'
... and 7 more, use default rule '*' [-Wundefined-control-flow]
It shows us the patterns that must never appear in valid UTF-8 encoded text. If
the input is not valid UTF-8, lexer behavior is undefined. One would expect that
with UTF-16 (another variable-width encoding), re2c would also report a warning,
but it doesn’t. This is because by default, re2c treats Unicode surrogates as
normal code points (for backwards compatibility reasons). If we tell re2c to
exclude surrogates (`re2c -ix --encoding-policy fail -Wundefined-control-flow`),
then we will get a warning:
any.re:4:2: warning: control flow is undefined for strings that match
'[\xDC00-\xDFFF]'
'[\xD800-\xDBFF] [\x0-\xDBFF\xE000-\xFFFF]'
, use default rule '*' [-Wundefined-control-flow]
As you can see, it can get quite subtle. Good advice is to always use the
default rule `*`: it matches any code unit regardless of encoding, always
consumes a single code unit, and always has the lowest priority. Note that `*`
is a built-in hack: it cannot be expressed through ordinary rules.
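The interplay between `[^]` and `*` on UTF-8 input can be sketched by hand. The code below is an assumption-laden illustration, not re2c output: `match_any` and `lex_step` are hypothetical names, the byte ranges are the complement of the patterns listed in the UTF-8 warning above (with surrogates accepted as normal code points, as re2c does by default), and buffer refilling is ignored.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if b is a UTF-8 continuation byte (0x80..0xBF). */
static int is_cont(unsigned char b) { return b >= 0x80 && b <= 0xBF; }

/* Hypothetical stand-in for the [^] rule: returns the length in bytes of
 * the code point starting at p, or 0 if the bytes do not form one. */
static size_t match_any(const unsigned char *p, size_t n)
{
    if (n == 0) return 0;
    unsigned char b = p[0];
    unsigned char lo = 0x80, hi = 0xBF;   /* allowed range for the 2nd byte */
    size_t len;

    if (b <= 0x7F) return 1;
    else if (b >= 0xC2 && b <= 0xDF) len = 2;
    else if (b == 0xE0) { len = 3; lo = 0xA0; }
    else if (b >= 0xE1 && b <= 0xEF) len = 3;
    else if (b == 0xF0) { len = 4; lo = 0x90; }
    else if (b >= 0xF1 && b <= 0xF3) len = 4;
    else if (b == 0xF4) { len = 4; hi = 0x8F; }
    else return 0;                        /* 0x80..0xC1 and 0xF5..0xFF */

    if (n < len) return 0;                /* truncated sequence */
    if (p[1] < lo || p[1] > hi) return 0;
    for (size_t i = 2; i < len; ++i)
        if (!is_cont(p[i])) return 0;
    return len;
}

/* One lexer step over n >= 1 bytes: [^] consumes a whole code point;
 * otherwise the default rule consumes exactly one code unit. */
static size_t lex_step(const unsigned char *p, size_t n, int *matched_any)
{
    assert(p != NULL && matched_any != NULL && n > 0);
    size_t len = match_any(p, n);
    *matched_any = (len != 0);
    return len != 0 ? len : 1;
}
```

With both rules present, every byte sequence makes progress: well-formed input is consumed one code point at a time, and ill-formed input is consumed one byte at a time by the default rule.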