Warnings

-Wundefined-control-flow

With -Wundefined-control-flow warning re2c checks that every path in the generated DFA contains at least one accepting state. When the input matches such a path, lexer will eventually stop and execute the corresponding semantic action. However, if some path has no accepting state, then lexer behavior is undefined: it may loop forever, or read past the end of buffer, or jump to some other semantic action by accident. For example, consider this simple piece of code (a.re) that is supposed to match letter a:

/*!re2c
    "a" { return 'a'; }
*/

The generated code looks like this:

{
        YYCTYPE yych;
        if (YYLIMIT <= YYCURSOR) YYFILL(1);
        yych = *YYCURSOR;
        switch (yych) {
        case 'a':       goto yy3;
        default:        goto yy2;
        }
yy2:
yy3:
        ++YYCURSOR;
        { return 'a'; }
}

Clearly this is not what we want: this code matches any letter, not just a! This happens because we did not specify any handler for the remaining input symbols. If we run re2c with -Wundefined-control-flow, we will see that it complains about undefined control flow and recommends using default rule *:

a.re:3:2: warning: control flow is undefined for strings that match '[\x0-\x60\x62-\xFF]', use the default '*' rule  [-Wundefined-control-flow]

Let’s follow the advice and change the code:

/*!re2c
    *   { return '*'; }
    "a" { return 'a'; }
*/

Now the generated code looks much better:

{
        YYCTYPE yych;
        if (YYLIMIT <= YYCURSOR) YYFILL(1);
        yych = *YYCURSOR;
        switch (yych) {
        case 'a':       goto yy4;
        default:        goto yy2;
        }
yy2:
        ++YYCURSOR;
        { return '*'; }
yy4:
        ++YYCURSOR;
        { return 'a'; }
}

Note that the default rule brings no overhead: it simply binds code to the default label. It should always be used, unless you are absolutely sure that your grammar covers all possible cases.

The old default rule

When the world was young and re2c didn’t have the default * rule (that is, before re2c-0.13.7) everyone used [^] as the default rule, as in this example (any.re):

/*!re2c
    // ... normal rules ...
    [^] { return "any"; }
*/

[^] is just an ordinary rule: it matches any character and has normal priority (so it should be the last rule). If other rules didn’t match, [^] will match and consume one character.

But exactly what is a character? First, an abstract number that is assigned some sacred meaning within the current encoding — a code point. Second, a minimal piece of information (say, combination of bits) that can represent a unit of encoded text — a code unit. Rules are defined in terms of code points. Input is measured in code units. In fixed-width encodings (such as ASCII, EBCDIC, UCS-2, UTF-32, etc.), there is a one-to-one correspondence between code points and code units. In variable-width encodings (such as UTF-8, UTF-16, etc.), code points map to code unit sequences of different lengths.

The [^] rule matches any code point. In fixed-width encodings, it covers all code units and consumes exactly one of them. In variable-width encodings, it consumes variable number of code units and may not match some of them. The example above compiles without warnings with any fixed-width encoding (ASCII by default). However, with the UTF-8 encoding `re2c -i8 -Wundefined-control-flow any.re` complains:

any.re:4:2: warning: control flow is undefined for strings that match
        '[\x80-\xC1\xF5-\xFF]'
        '\xF0 [\x0-\x8F\xC0-\xFF]'
        '[\xE1-\xEF] [\x0-\x7F\xC0-\xFF]'
        '\xF4 [\x0-\x7F\x90-\xFF]'
        '\xE0 [\x0-\x9F\xC0-\xFF]'
        '[\xF1-\xF3] [\x0-\x7F\xC0-\xFF]'
        '[\xC2-\xDF] [\x0-\x7F\xC0-\xFF]'
        '\xE0 [\xA0-\xBF] [\x0-\x7F\xC0-\xFF]'
 ... and 7 more, use default rule '*' [-Wundefined-control-flow]

It shows us the patterns that must never appear in valid UTF-8 encoded text. If the input is not valid UTF-8, lexer behavior is undefined. One would expect that with UTF-16 (another variable-width encoding), re2c would also report a warning, but it doesn’t. This is because by default, re2c treats Unicode surrogates as normal code points (for backwards compatibility reasons). If we tell re2c to exclude surrogates (`re2c -ix --encoding-policy fail -Wundefined-control-flow`), then we will get a warning:

any.re:4:2: warning: control flow is undefined for strings that match
        '[\xDC00-\xDFFF]'
        '[\xD800-\xDBFF] [\x0-\xDBFF\xE000-\xFFFF]'
, use default rule '*' [-Wundefined-control-flow]

As you see, it can get quite subtle. A good advice is, always use default rule *: it matches any code unit regardless of encoding, consumes a single code unit no matter what and always has the lowest priority. Note that * is a built-in hack: it cannot be expressed through ordinary rules.

-Wunreachable-rules

Sometimes the input grammar contains rules that will never match. This can happen for two reasons. First, some rules may be shadowed by other rules that match the same input, but have higher priority. Second, the rule itself may be infinitely greedy: it may consume as many input characters as it can get and never stop, and as a result never match. Both cases indicate a problem with the grammar, and -Wunreachable-rules detects and reports such rules.

Let’s see an example of the first kind: shadowed rules (shadowed.re).

/*!re2c
    ""          { return ""; }
    *           { return "*"; }
    "a" | "b"   { return "a | b"; }
    "a"         { return "a"; }
    [\x00-\xFF] { return "[0 - 0xFF]"; }
    [^]         { return "[^]"; }
*/

In this example the empty rule "" never matches, because any single code unit is matched by other rules, which take precedence due to the longerst match. Rule "a" is shadowed by rule "a" | "b", which also matches a, but takes precedence because it comes first. Similarly, rule [^] is shadowed by rule [\x00-\xFF]. Default rule * is also shadowed, but it’s an exception that is not reported (default case should always be handled). Shadowed rules normally do not appear in the generated code: re2c removes them during its dead code elimination pass.

$ re2c -Wunreachable-rules shadowed.re -o shadowed.c
shadowed.re:2:16: warning: unreachable rule (shadowed by rules at lines 4, 6) [-Wunreachable-rules]
shadowed.re:5:16: warning: unreachable rule (shadowed by rule at line 4) [-Wunreachable-rules]
shadowed.re:7:16: warning: unreachable rule (shadowed by rules at lines 4, 6) [-Wunreachable-rules]

Now let’s see an example of second kind: infinitely greedy rule (greedy.re).

/*!re2c
    [^]* { return "greeedy"; }
*/

This rule will continue eating input characters until YYFILL fails, or until it reads past the end of buffer and causes memory access violation.

$ re2c -Wunreachable-rules greedy.re -o greedy.c
greedy.re:2:9: warning: unreachable rule  [-Wunreachable-rules]

-Wcondition-order

Some older re2c programs that use -c --conditions option rely on a fixed condition order instead of using /*!types:re2c*/ directive or the -t --type-header option. This is incorrect and dangerous, as demonstrated by the following example [fixorder.re]. In this example the lexer has two conditions: a and b. It starts in condition a, which expects a sequence of letters a followed by a comma. The comma causes transition to condition b, which expects a sequence of letters b followed by an exclamation mark. Anything other input is an error. Nothing special, except that condition numbers are hardcoded manually (the mapping of conditions to numbers is toggled by REVERSED_CONDITION_ORDER define).

#include <stdio.h>

#ifdef REVERSED_CONDITION_ORDER
#    define yyca 1
#    define yycb 0
#else
#    define yyca 0
#    define yycb 1
#endif

int main()
{
    const char * YYCURSOR = "aaaa,bbb!";
    int c = yyca;
    for (;;) {
    /*!re2c
        re2c:define:YYCTYPE = char;
        re2c:yyfill:enable = 0;
        re2c:define:YYSETCONDITION = "c = @@;";
        re2c:define:YYSETCONDITION:naked = 1;
        re2c:define:YYGETCONDITION = c;
        re2c:define:YYGETCONDITION:naked = 1;

        <*> * { printf ("error\n"); break; }

        <a> "a"      { printf ("a"); continue; }
        <a> "," => b { printf (","); continue; }

        <b> "!" { printf ("!\n"); break; }
        <b> "b" { printf ("b"); continue; }
    */
    }
    return 0;
}

Let’s compile and run it. Everything works fine: we get aaaa,bbb! in both cases.

$ re2c -c -o fixorder.c -Wcondition-order fixorder.re
$
$ c++ -o fixorder fixorder.c && ./fixorder
aaaa,bbb!
$
$ c++ -o fixorder fixorder.c -DREVERSED_CONDITION_ORDER && ./fixorder
aaaa,bbb!

However, if we use the -s re2c option, the lexer becomes sensitive to condition order:

$ re2c -cs -o fixorder.c -Wcondition-order fixorder.re
fixorder.re:31:6: warning: looks like you use hardcoded numbers instead of autogenerated condition names:
better add '/*!types:re2c*/' directive or '-t, --type-header' option and don't rely on fixed condition order. [-Wcondition-order]
$
$ c++ -o fixorder fixorder.c && ./fixorder
aaaa,bbb!
$
$ c++ -o fixorder fixorder.c -DREVERSED_CONDITION_ORDER && ./fixorder
error

And we get a warning from re2c. The same behavior remains if we use -g or -b option. Why is that? A look at the generated code explains everything. By default the initial dispatch on conditions is a switch statement:

switch (c) {
case yyca: goto yyc_a;
case yycb: goto yyc_b;
}

Dispatch uses explicit condition names and works no matter what numbers are assigned to them. However, with the -s option, re2c generates an if statement instead of a switch:

if (c < 1) {
        goto yyc_a;
} else {
        goto yyc_b;
}

And with the -g option, it uses a jump table (computed goto):

static void *yyctable[2] = {
        &&yyc_a,
        &&yyc_b,
};
goto *yyctable[c];

The last two cases are sensitive to condition order. The fix is easy: as the warning suggests, use the /*!types:re2c*/ directive or the -t, --type-header option.

-Wuseless-escape

Sometimes people escape characters that don’t need to be escaped — either because they (mistakenly) think that the character is special and must be escaped in the given context, or because they think that this escape sequence means something (when it actually doesn’t), or just by pure accident. With -Wuseless-escape option re2c warns about ignored escapes. Consider this example (escapes.re):

/*!re2c
    *                        {}
    "\a\A\"\'\[\]\-\x5d\377" {}
    '\a\A\"\'\[\]\-\x5d\377' {}
    [\a\A\"\'\[\]\-\x5d\377] {}
*/

re2c reports a bunch of warnings:

$ re2c -Wuseless-escape escapes.re -o escapes.c
escapes.re:3:7: warning: escape has no effect: '\A' [-Wuseless-escape]
escapes.re:3:11: warning: escape has no effect: '\'' [-Wuseless-escape]
escapes.re:3:13: warning: escape has no effect: '\[' [-Wuseless-escape]
escapes.re:3:15: warning: escape has no effect: '\]' [-Wuseless-escape]
escapes.re:3:17: warning: escape has no effect: '\-' [-Wuseless-escape]
escapes.re:4:7: warning: escape has no effect: '\A' [-Wuseless-escape]
escapes.re:4:9: warning: escape has no effect: '\"' [-Wuseless-escape]
escapes.re:4:13: warning: escape has no effect: '\[' [-Wuseless-escape]
escapes.re:4:15: warning: escape has no effect: '\]' [-Wuseless-escape]
escapes.re:4:17: warning: escape has no effect: '\-' [-Wuseless-escape]
escapes.re:5:7: warning: escape has no effect: '\A' [-Wuseless-escape]
escapes.re:5:9: warning: escape has no effect: '\"' [-Wuseless-escape]
escapes.re:5:11: warning: escape has no effect: '\'' [-Wuseless-escape]
escapes.re:5:13: warning: escape has no effect: '\[' [-Wuseless-escape]

This is because the \A and \[ escapes are meaningless in all rules, \- makes sense only in a character class, and each type of closing quotes (", ' and ]) should only be escaped inside of a string delimited with the same quotes. Useless escapes are ignored: the escaped symbol is treated as not escaped (\A becomes A, etc.). The above example should be fixed as follows:

/*!re2c
    *                    {}
    "\aA\"'[]-\x5d\377"  {}
    '\aA"\'[]-\x5d\377'  {}
    [\aA"'[\]\-\x5d\377] {}
*/

More generally, re2c recognizes escapes in the following lexemes:

  • double-quoted strings " ... "

  • single-quoted strings ' ... '

  • character classes [ ... ] and [^ ... ]

The following escapes are recognized:

  • Closing quotes (\" for double-quoted strings, \' for single-quoted strings, and \] for character classes).

  • Dash \- in character classes.

  • Octal escapes: \ooo, where o is in range [0 - 7] (the largest octal escape is \377, which equals 0xFF).

  • Hexadecimal escapes: \xhh, \Xhhhh, \uhhhh, and \Uhhhhhhhh, where h is in range [0 - 9], [a - f], or [A - F].

  • Miscellaneous escapes: \a, \b, \f, \n, \r, \t, \v, \\.

Ill-formed octal and hexadecimal escapes are treated as errors. An escape followed by a newline is also an error: multiline strings and character classes are not allowed. Any other ill-formed escapes are ignored.

-Wswapped-range

-Wswapped-range warning is reported in cases when a character class contains a range which lower bound is greater than the upper bound. For some strange reason older versions of re2c did not consider this an error and silently swapped range bounds. Consider the following example (swapped.re):

/*!re2c
    *     { return "*"; }
    [a-Z] { return "is it what you wanted?"; }
*/

re2c interprets this code as [Z-a], but generates a warning:

$ re2c -Wswapped-range swapped.re -o swapped.c
swapped.re:3:5: warning: range lower bound (0x61) is greater than upper bound (0x5A), swapping [-Wswapped-range]

Use [-Werror-swapped-range] to make it an error.

-Wempty-character-class

This warning is complementary to the --empty-class <match-empty | match-none | error> option. For bakward compatibility reasons the default is match-empty: empty character class [] matches empty string (that is, it always matches without consuming any input). This behaviour doesn’t make much sense, therefore re2c provides a warning -Wempty-character-class. Note that empty character class can be constructed in many ways, for example as a result of range negation or the difference operator. The code below (empty.re) demonstrates some of the possibilities:

/*!re2c
    []                        { return 0; }
    [^\x00-\xFF]              { return 1; }
    [^] \ [^]                 { return 2; }
    [abc] \ ("a" | "b" | "c") { return 3; }
    "a" \ [a-z]               { return 4; }
*/

re2c gives the following warnings:

$ re2c -Wempty-character-class empty.re -o empty.c
empty.re:2:4: warning: empty character class [-Wempty-character-class]
empty.re:3:4: warning: empty character class [-Wempty-character-class]
empty.re:4:4: warning: empty character class [-Wempty-character-class]
empty.re:5:4: warning: empty character class [-Wempty-character-class]
empty.re:6:4: warning: empty character class [-Wempty-character-class]

-Wmatch-empty-string

[-Wmatch-empty-string] warns when a rule is nullable (matches an empty string). It was intended to prevent infinite looping in cases like the [hang.re] example below. The program loops over its arguments (the outer for loop) and tries to lex each argument (the inner for loop). The lexer stops when all input has been consumed and it sees the terminating NULL. Arguments must consist of lowercase letters only.

#include <stdio.h>

int main(int argc, char **argv)
{
    for (int i = 1; i < argc; ++i) {
        for (char *YYCURSOR = argv[i];;) {
        /*!re2c
            re2c:define:YYCTYPE = char;
            re2c:yyfill:enable = 0;
            "\x00" { break; }
            [a-z]* { continue; }
        */
        }
        printf("argv[%d]: %s\n", i, argv[i]);
    }
    return 0;
}

On well-formed input the program runs as expected. However, if one of the arguments contains a symbol diffrerent from lowercase letter, the program hangs forever:

$ re2c -Wmatch-empty-string hang.re -o hang.c
hang.re:11:19: warning: rule matches empty string [-Wmatch-empty-string]
$ c++ -o hang hang.c
$
$ ./hang only lowercase letters
argv[1]: only
argv[2]: lowercase
argv[3]: letters
$
$ ./hang right ?
argv[1]: right
^C

Note that if we add default rule *, the lexer won’t hang anymore: it will match the default rule instead of the nullable rule. The fix is easy: make the rule non-nullable (say, [a-z]+) and add default rule *.

In some cases matching an empty string makes perfect sense: for example, it might be used as a non-consuming default rule, or it might be used to lex an optional lexeme (if the corresponding rule doesn’t match, the lexer jumps to another block and resumes lexing at the same input position). All these cases are valid, so if [-Wmatch-empty-string] becomes annoying, it can be silenced with [-Wno-match-empty-string].

-Wnondeterministic-tags

Consider the following example, in which symbol a is repeated a few times, followed by a tag t, followed by three or more repetitions of a:

/*!re2c
    [a]* @t [a]{3,} {}
*/

Because the repetition is greedy, it should consume as many symbols a before the tag as possible, leaving only the last three a after the tag. With -Wnondeterministic-tags re2c gives a warning:

$ re2c --tags -Wnondeterministic-tags example.re -o/dev/null
example.re:2:24: warning: tag 't' has 4th degree of nondeterminism [-Wnondeterministic-tags]

Nondeterminism degree means the maximum number of different versions of a tag that must be tracked simultaneously during matching. Higher degrees require more tag variables and operations on DFA transitions. Generally it is fine if some tags have degrees 2 or 3, but higher degrees may be a point of concern, as the transitions get cluttered with operations and the execution gets slower.

To explain nondeterminism degrees in depth, let’s have a look at the generated NFA, its determinization process and the resulting DFA. NFA can be visualized with re2c --dump-nfa --tags example.re 2>&1 1>/dev/null | dot -Tsvg -o nfa.svg (note that debug options --dump-<xxx> only work if re2c was built in debug mode). Some of the NFA transitions are labeled with 97 (ASCII code for a); other are epsilon-transitions (they do not consume any symbols). Epsilon-transition from state 5 to state 4 is tagged.

../../_images/nfa.svg

Determinization process can be visualized with re2c --dump-dfa-raw --tags example.re 2>&1 1>/dev/null | dot -Tsvg -o determinization.svg. It shows all possible paths through the NFA when matching a string of a symbols. Transitions are labeled with 1, which means the first character class (in this case just the symbol a — there would be more classes if the regular expression contained other symbols). Some transitions have tag operations: for example, 3↑ on transition from state 4 to state 3 means that the version of tag t changes from 1 to 3 (upper arrow means that version 3 is set to the current input position). Boxes in-between transitions show subsets of NFA states at each step of determinization (after consuming one more a). Each NFA state is paired with a tag version. Exploring state sets at each step, we can see that the maximum number of different tag versions is 4 (t1, t5, t4, t3). Therefore tag t has 4th degree of nondeterminism.

../../_images/determinization.svg

The resulting DFA (before optimizations) can be visualized with re2c --dump-dfa-det --tags example.re 2>&1 1>/dev/null | dot -Tsvg -o dfa.svg. 4Th degree of nondeterminism causes the relatively high number of copy operations on the looping transition in state 3.

../../_images/dfa.svg

In this case it is sufficient to move tag t past the counted repetition (as in [a]*[a]{3} @t [a]* or introduce another tag u on which t can be fixed (as in [a]* @t [a]{3} @u [a]*) in order to decrease nondeterminism degree to 1 (in the latter case t will not require any tag operations at all, because it will be computed from u by subtracting a fixed offset).

-Wsentinel-in-midrule

When using sentinel method of checking for the end of input, it is easy to forget that the sentinel symbol must not be allowed in the middle of the rule. For example, the following code tries to match single-quoted strings. It allows any character except the single quote to occur in the string, including terminating NULL. As a result, the generated lexer works as expected on well-formed input like 'aaa'\0, but things go wrong on ill-formed input like 'aaa\0 (where the closing single quote is missing). Lexer reaches the terminating NULL and assumes it is a part of the single-quoted string, so it continues reading bytes from memory. Eventually the lexer terminates due to memory access violation, or worse — it accidentally hits a single quote and assumes this to be the end of the string.

#include <assert.h>

int lex(const char *YYCURSOR)
{
    /*!re2c
    re2c:define:YYCTYPE = char;
    re2c:yyfill:enable = 0;
    ['] [^']* ['] { return 0; }
    *             { return 1; }
    */
}

int main()
{
    assert(lex("'good'") == 0);
    assert(lex("'bad") == 1);
    return 0;
}

On this code re2c reports a warning. It cannot be certain that NULL is the sentinel symbol, but this is by far the most common case.

$ re2c -Wsentinel-in-midrule example.re -oexample.c
example.re:9:18: warning: sentinel symbol 0 occurs in the middle of the rule
    (note: if a different sentinel symbol is used, specify it with 're2c:sentinel' configuration) [-Wsentinel-in-midrule]

However, re2c suggests us to define the sentinel symbol using re2c:sentinel configuration. Let’s do it.

#include <assert.h>

int lex(const char *YYCURSOR)
{
    /*!re2c
    re2c:define:YYCTYPE = char;
    re2c:yyfill:enable = 0;
    re2c:sentinel = 0;
    ['] [^']* ['] { return 0; }
    *             { return 1; }
    */
}

int main()
{
    assert(lex("'good'") == 0);
    assert(lex("'bad") == 1);
    return 0;
}

The warning has turned into an error, as re2c is now certain that the code contains an error.

$ re2c -Wsentinel-in-midrule example.re -oexample.c
example.re:10:18: error: sentinel symbol 0 occurs in the middle of the rule [-Werror-sentinel-in-midrule]

The code can be fixed by excluding NULL from the set of symbols allowed in the middle of the string: ['] [^'\x00]* [']. If it is necessary to allow all symbols, a more powerful EOF handling method should be used.