Warnings¶
-Wundefined-control-flow¶
With -Wundefined-control-flow
warning re2c checks that every path in the
generated DFA contains at least one accepting state. When the input matches
such a path, lexer will eventually stop and execute the corresponding semantic
action. However, if some path has no accepting state, then lexer behavior is
undefined: it may loop forever, or read past the end of buffer, or jump to some
other semantic action by accident. For example, consider this simple piece of
code (a.re) that is supposed to match letter a
:
/*!re2c
"a" { return 'a'; }
*/
The generated code looks like this:
{
YYCTYPE yych;
if (YYLIMIT <= YYCURSOR) YYFILL(1);
yych = *YYCURSOR;
switch (yych) {
case 'a': goto yy3;
default: goto yy2;
}
yy2:
yy3:
++YYCURSOR;
{ return 'a'; }
}
Clearly this is not what we want: this code matches any letter, not just a
!
This happens because we did not specify any handler for the remaining input symbols.
If we run re2c with -Wundefined-control-flow
, we will see that it complains about undefined control flow and recommends using default rule *
:
a.re:3:2: warning: control flow is undefined for strings that match '[\x0-\x60\x62-\xFF]', use the default '*' rule [-Wundefined-control-flow]
Let’s follow the advice and change the code:
/*!re2c
* { return '*'; }
"a" { return 'a'; }
*/
Now the generated code looks much better:
{
YYCTYPE yych;
if (YYLIMIT <= YYCURSOR) YYFILL(1);
yych = *YYCURSOR;
switch (yych) {
case 'a': goto yy4;
default: goto yy2;
}
yy2:
++YYCURSOR;
{ return '*'; }
yy4:
++YYCURSOR;
{ return 'a'; }
}
Note that the default rule brings no overhead: it simply binds code to the default label. It should always be used, unless you are absolutely sure that your grammar covers all possible cases.
The old default rule¶
When the world was young and re2c didn’t have the default *
rule (that is,
before re2c-0.13.7) everyone used [^]
as the default rule, as in this
example (any.re):
/*!re2c
// ... normal rules ...
[^] { return "any"; }
*/
[^]
is just an ordinary rule: it matches any character and has normal
priority (so it should be the last rule). If other rules didn’t match, [^]
will match and consume one character.
But exactly what is a character? First, an abstract number that is assigned some sacred meaning within the current encoding — a code point. Second, a minimal piece of information (say, combination of bits) that can represent a unit of encoded text — a code unit. Rules are defined in terms of code points. Input is measured in code units. In fixed-width encodings (such as ASCII, EBCDIC, UCS-2, UTF-32, etc.), there is a one-to-one correspondence between code points and code units. In variable-width encodings (such as UTF-8, UTF-16, etc.), code points map to code unit sequences of different lengths.
The [^]
rule matches any code point. In fixed-width encodings, it covers all
code units and consumes exactly one of them. In variable-width encodings, it
consumes variable number of code units and may not match some of them. The
example above compiles without warnings with any fixed-width encoding (ASCII by
default). However, with the UTF-8 encoding
`re2c -i8 -Wundefined-control-flow any.re`
complains:
any.re:4:2: warning: control flow is undefined for strings that match
'[\x80-\xC1\xF5-\xFF]'
'\xF0 [\x0-\x8F\xC0-\xFF]'
'[\xE1-\xEF] [\x0-\x7F\xC0-\xFF]'
'\xF4 [\x0-\x7F\x90-\xFF]'
'\xE0 [\x0-\x9F\xC0-\xFF]'
'[\xF1-\xF3] [\x0-\x7F\xC0-\xFF]'
'[\xC2-\xDF] [\x0-\x7F\xC0-\xFF]'
'\xE0 [\xA0-\xBF] [\x0-\x7F\xC0-\xFF]'
... and 7 more, use default rule '*' [-Wundefined-control-flow]
It shows us the patterns that must never appear in valid UTF-8 encoded text. If
the input is not valid UTF-8, lexer behavior is undefined. One would expect that
with UTF-16 (another variable-width encoding), re2c would also report a warning,
but it doesn’t. This is because by default, re2c treats Unicode surrogates as
normal code points (for backwards compatibility reasons). If we tell re2c to
exclude surrogates
(`re2c -ix --encoding-policy fail -Wundefined-control-flow`
), then we will
get a warning:
any.re:4:2: warning: control flow is undefined for strings that match
'[\xDC00-\xDFFF]'
'[\xD800-\xDBFF] [\x0-\xDBFF\xE000-\xFFFF]'
, use default rule '*' [-Wundefined-control-flow]
As you see, it can get quite subtle. A good advice is, always use default rule
*
: it matches any code unit regardless of encoding, consumes a single code
unit no matter what and always has the lowest priority. Note that *
is a
built-in hack: it cannot be expressed through ordinary rules.
-Wunreachable-rules¶
Sometimes the input grammar contains rules that will never match. This can
happen for two reasons. First, some rules may be shadowed by other rules that
match the same input, but have higher priority. Second, the rule itself may be
infinitely greedy: it may consume as many input characters as it can get and
never stop, and as a result never match. Both cases indicate a problem with
the grammar, and -Wunreachable-rules
detects and reports such rules.
Let’s see an example of the first kind: shadowed rules (shadowed.re).
/*!re2c
"" { return ""; }
* { return "*"; }
"a" | "b" { return "a | b"; }
"a" { return "a"; }
[\x00-\xFF] { return "[0 - 0xFF]"; }
[^] { return "[^]"; }
*/
In this example the empty rule ""
never matches, because any single code
unit is matched by other rules, which take precedence due to the longerst match.
Rule "a"
is shadowed by rule "a" | "b"
, which also matches a
, but
takes precedence because it comes first. Similarly, rule [^]
is shadowed by
rule [\x00-\xFF]
. Default rule *
is also shadowed, but it’s an exception
that is not reported (default case should always be handled). Shadowed rules
normally do not appear in the generated code: re2c removes them during its dead
code elimination pass.
$ re2c -Wunreachable-rules shadowed.re -o shadowed.c
shadowed.re:2:16: warning: unreachable rule (shadowed by rules at lines 4, 6) [-Wunreachable-rules]
shadowed.re:5:16: warning: unreachable rule (shadowed by rule at line 4) [-Wunreachable-rules]
shadowed.re:7:16: warning: unreachable rule (shadowed by rules at lines 4, 6) [-Wunreachable-rules]
Now let’s see an example of second kind: infinitely greedy rule (greedy.re).
/*!re2c
[^]* { return "greeedy"; }
*/
This rule will continue eating input characters until YYFILL
fails, or until
it reads past the end of buffer and causes memory access violation.
$ re2c -Wunreachable-rules greedy.re -o greedy.c
greedy.re:2:9: warning: unreachable rule [-Wunreachable-rules]
-Wcondition-order¶
Some older re2c programs that use -c --conditions
option rely on a fixed
condition order instead of using /*!types:re2c*/
directive or the
-t --type-header
option. This is incorrect and dangerous, as demonstrated by
the following example [fixorder.re]
. In this example
the lexer has two conditions: a
and b
. It starts in condition a
,
which expects a sequence of letters a
followed by a comma. The comma causes
transition to condition b
, which expects a sequence of letters b
followed by an exclamation mark. Anything other input is an error. Nothing
special, except that condition numbers are hardcoded manually (the mapping of
conditions to numbers is toggled by REVERSED_CONDITION_ORDER
define).
#include <stdio.h>
#ifdef REVERSED_CONDITION_ORDER
# define yyca 1
# define yycb 0
#else
# define yyca 0
# define yycb 1
#endif
int main()
{
const char * YYCURSOR = "aaaa,bbb!";
int c = yyca;
for (;;) {
/*!re2c
re2c:define:YYCTYPE = char;
re2c:yyfill:enable = 0;
re2c:define:YYSETCONDITION = "c = @@;";
re2c:define:YYSETCONDITION:naked = 1;
re2c:define:YYGETCONDITION = c;
re2c:define:YYGETCONDITION:naked = 1;
<*> * { printf ("error\n"); break; }
<a> "a" { printf ("a"); continue; }
<a> "," => b { printf (","); continue; }
<b> "!" { printf ("!\n"); break; }
<b> "b" { printf ("b"); continue; }
*/
}
return 0;
}
Let’s compile and run it.
Everything works fine: we get aaaa,bbb!
in both cases.
$ re2c -c -o fixorder.c -Wcondition-order fixorder.re
$
$ c++ -o fixorder fixorder.c && ./fixorder
aaaa,bbb!
$
$ c++ -o fixorder fixorder.c -DREVERSED_CONDITION_ORDER && ./fixorder
aaaa,bbb!
However, if we use the -s
re2c option, the lexer becomes sensitive to condition order:
$ re2c -cs -o fixorder.c -Wcondition-order fixorder.re
fixorder.re:31:6: warning: looks like you use hardcoded numbers instead of autogenerated condition names:
better add '/*!types:re2c*/' directive or '-t, --type-header' option and don't rely on fixed condition order. [-Wcondition-order]
$
$ c++ -o fixorder fixorder.c && ./fixorder
aaaa,bbb!
$
$ c++ -o fixorder fixorder.c -DREVERSED_CONDITION_ORDER && ./fixorder
error
And we get a warning from re2c. The same behavior remains if we use -g
or
-b
option. Why is that? A look at the generated code explains everything.
By default the initial dispatch on conditions is a switch
statement:
switch (c) {
case yyca: goto yyc_a;
case yycb: goto yyc_b;
}
Dispatch uses explicit condition names and works no matter what numbers are assigned to them.
However, with the -s
option, re2c generates an if
statement instead of a switch
:
if (c < 1) {
goto yyc_a;
} else {
goto yyc_b;
}
And with the -g
option, it uses a jump table (computed goto
):
static void *yyctable[2] = {
&&yyc_a,
&&yyc_b,
};
goto *yyctable[c];
The last two cases are sensitive to condition order. The fix is easy: as the
warning suggests, use the /*!types:re2c*/
directive or the
-t, --type-header
option.
-Wuseless-escape¶
Sometimes people escape characters that don’t need to be escaped — either
because they (mistakenly) think that the character is special and must be
escaped in the given context, or because they think that this escape sequence
means something (when it actually doesn’t), or just by pure accident. With
-Wuseless-escape
option re2c warns about ignored escapes. Consider this
example (escapes.re):
/*!re2c
* {}
"\a\A\"\'\[\]\-\x5d\377" {}
'\a\A\"\'\[\]\-\x5d\377' {}
[\a\A\"\'\[\]\-\x5d\377] {}
*/
re2c reports a bunch of warnings:
$ re2c -Wuseless-escape escapes.re -o escapes.c
escapes.re:3:7: warning: escape has no effect: '\A' [-Wuseless-escape]
escapes.re:3:11: warning: escape has no effect: '\'' [-Wuseless-escape]
escapes.re:3:13: warning: escape has no effect: '\[' [-Wuseless-escape]
escapes.re:3:15: warning: escape has no effect: '\]' [-Wuseless-escape]
escapes.re:3:17: warning: escape has no effect: '\-' [-Wuseless-escape]
escapes.re:4:7: warning: escape has no effect: '\A' [-Wuseless-escape]
escapes.re:4:9: warning: escape has no effect: '\"' [-Wuseless-escape]
escapes.re:4:13: warning: escape has no effect: '\[' [-Wuseless-escape]
escapes.re:4:15: warning: escape has no effect: '\]' [-Wuseless-escape]
escapes.re:4:17: warning: escape has no effect: '\-' [-Wuseless-escape]
escapes.re:5:7: warning: escape has no effect: '\A' [-Wuseless-escape]
escapes.re:5:9: warning: escape has no effect: '\"' [-Wuseless-escape]
escapes.re:5:11: warning: escape has no effect: '\'' [-Wuseless-escape]
escapes.re:5:13: warning: escape has no effect: '\[' [-Wuseless-escape]
This is because the \A
and \[
escapes are meaningless in all rules,
\-
makes sense only in a character class,
and each type of closing quotes ("
, '
and ]
) should only be escaped inside of a string delimited with the same quotes.
Useless escapes are ignored: the escaped symbol is treated as not escaped (\A
becomes A
, etc.).
The above example should be fixed as follows:
/*!re2c
* {}
"\aA\"'[]-\x5d\377" {}
'\aA"\'[]-\x5d\377' {}
[\aA"'[\]\-\x5d\377] {}
*/
More generally, re2c recognizes escapes in the following lexemes:
double-quoted strings
" ... "
single-quoted strings
' ... '
character classes
[ ... ]
and[^ ... ]
The following escapes are recognized:
Closing quotes (
\"
for double-quoted strings,\'
for single-quoted strings, and\]
for character classes).Dash
\-
in character classes.Octal escapes:
\ooo
, whereo
is in range[0 - 7]
(the largest octal escape is\377
, which equals0xFF
).Hexadecimal escapes:
\xhh
,\Xhhhh
,\uhhhh
, and\Uhhhhhhhh
, whereh
is in range[0 - 9]
,[a - f]
, or[A - F]
.Miscellaneous escapes:
\a
,\b
,\f
,\n
,\r
,\t
,\v
,\\
.
Ill-formed octal and hexadecimal escapes are treated as errors. An escape followed by a newline is also an error: multiline strings and character classes are not allowed. Any other ill-formed escapes are ignored.
-Wswapped-range¶
-Wswapped-range
warning is reported in cases when a character class contains
a range which lower bound is greater than the upper bound. For some strange
reason older versions of re2c did not consider this an error and silently
swapped range bounds. Consider the following example (swapped.re):
/*!re2c
* { return "*"; }
[a-Z] { return "is it what you wanted?"; }
*/
re2c interprets this code as [Z-a]
, but generates a warning:
$ re2c -Wswapped-range swapped.re -o swapped.c
swapped.re:3:5: warning: range lower bound (0x61) is greater than upper bound (0x5A), swapping [-Wswapped-range]
Use [-Werror-swapped-range]
to make it an error.
-Wempty-character-class¶
This warning is complementary to the
--empty-class <match-empty | match-none | error>
option.
For bakward compatibility reasons the default is match-empty
: empty
character class []
matches empty string (that is, it always matches without
consuming any input). This behaviour doesn’t make much sense, therefore re2c
provides a warning -Wempty-character-class
.
Note that empty character class can be constructed in many ways, for example as
a result of range negation or the difference operator. The code below (empty.re)
demonstrates some of the possibilities:
/*!re2c
[] { return 0; }
[^\x00-\xFF] { return 1; }
[^] \ [^] { return 2; }
[abc] \ ("a" | "b" | "c") { return 3; }
"a" \ [a-z] { return 4; }
*/
re2c gives the following warnings:
$ re2c -Wempty-character-class empty.re -o empty.c
empty.re:2:4: warning: empty character class [-Wempty-character-class]
empty.re:3:4: warning: empty character class [-Wempty-character-class]
empty.re:4:4: warning: empty character class [-Wempty-character-class]
empty.re:5:4: warning: empty character class [-Wempty-character-class]
empty.re:6:4: warning: empty character class [-Wempty-character-class]
-Wmatch-empty-string¶
[-Wmatch-empty-string]
warns when a rule is nullable (matches an empty
string). It was intended to prevent infinite looping in cases like the
[hang.re]
example below. The program loops over its arguments (the outer for
loop)
and tries to lex each argument (the inner for
loop). The lexer stops when
all input has been consumed and it sees the terminating NULL
. Arguments must
consist of lowercase letters only.
#include <stdio.h>
int main(int argc, char **argv)
{
for (int i = 1; i < argc; ++i) {
for (char *YYCURSOR = argv[i];;) {
/*!re2c
re2c:define:YYCTYPE = char;
re2c:yyfill:enable = 0;
"\x00" { break; }
[a-z]* { continue; }
*/
}
printf("argv[%d]: %s\n", i, argv[i]);
}
return 0;
}
On well-formed input the program runs as expected. However, if one of the arguments contains a symbol diffrerent from lowercase letter, the program hangs forever:
$ re2c -Wmatch-empty-string hang.re -o hang.c
hang.re:11:19: warning: rule matches empty string [-Wmatch-empty-string]
$ c++ -o hang hang.c
$
$ ./hang only lowercase letters
argv[1]: only
argv[2]: lowercase
argv[3]: letters
$
$ ./hang right ?
argv[1]: right
^C
Note that if we add default rule *
, the lexer won’t hang anymore: it will
match the default rule instead of the nullable rule. The fix is easy: make the
rule non-nullable (say, [a-z]+
) and add default rule *
.
In some cases matching an empty string makes perfect sense: for example, it
might be used as a non-consuming default rule, or it might be used to lex an
optional lexeme (if the corresponding rule doesn’t match, the lexer jumps to
another block and resumes lexing at the same input position). All these cases
are valid, so if [-Wmatch-empty-string]
becomes annoying, it can be silenced
with [-Wno-match-empty-string]
.
-Wsentinel-in-midrule¶
When using sentinel method of checking for the end of input, it is easy to
forget that the sentinel symbol must not be allowed in the middle of the rule.
For example, the following code tries to match single-quoted strings. It allows
any character except the single quote to occur in the string, including
terminating NULL
. As a result, the generated lexer works as expected on
well-formed input like 'aaa'\0
, but things go wrong on ill-formed input like
'aaa\0
(where the closing single quote is missing). Lexer reaches the
terminating NULL
and assumes it is a part of the single-quoted string, so
it continues reading bytes from memory. Eventually the lexer terminates due to
memory access violation, or worse — it accidentally hits a single quote and
assumes this to be the end of the string.
#include <assert.h>
int lex(const char *YYCURSOR)
{
/*!re2c
re2c:define:YYCTYPE = char;
re2c:yyfill:enable = 0;
['] [^']* ['] { return 0; }
* { return 1; }
*/
}
int main()
{
assert(lex("'good'") == 0);
assert(lex("'bad") == 1);
return 0;
}
On this code re2c reports a warning. It cannot be certain that NULL
is the sentinel
symbol, but this is by far the most common case.
$ re2c -Wsentinel-in-midrule example.re -oexample.c
example.re:9:18: warning: sentinel symbol 0 occurs in the middle of the rule
(note: if a different sentinel symbol is used, specify it with 're2c:sentinel' configuration) [-Wsentinel-in-midrule]
However, re2c suggests us to define the sentinel symbol using re2c:sentinel
configuration. Let’s do it.
#include <assert.h>
int lex(const char *YYCURSOR)
{
/*!re2c
re2c:define:YYCTYPE = char;
re2c:yyfill:enable = 0;
re2c:sentinel = 0;
['] [^']* ['] { return 0; }
* { return 1; }
*/
}
int main()
{
assert(lex("'good'") == 0);
assert(lex("'bad") == 1);
return 0;
}
The warning has turned into an error, as re2c is now certain that the code contains an error.
$ re2c -Wsentinel-in-midrule example.re -oexample.c
example.re:10:18: error: sentinel symbol 0 occurs in the middle of the rule [-Werror-sentinel-in-midrule]
The code can be fixed by excluding NULL
from the set of symbols allowed in
the middle of the string: ['] [^'\x00]* [']
. If it is necessary to allow
all symbols, a more powerful EOF handling method should be used.