-Wundefined-control-flow

With the -Wundefined-control-flow warning, re2c checks that every path in the generated DFA contains at least one accepting state. When the input matches such a path, the lexer will eventually stop and execute the corresponding semantic action. However, if some path has no accepting state, then the lexer's behavior is undefined: it may loop forever, read past the end of the buffer, or jump to some other semantic action by accident. For example, consider this simple piece of code (a.re) that is supposed to match the letter a:

/*!re2c
    "a" { return 'a'; }
*/

The generated code looks like this:

{
        YYCTYPE yych;
        if (YYLIMIT <= YYCURSOR) YYFILL(1);
        yych = *YYCURSOR;
        switch (yych) {
        case 'a':       goto yy3;
        default:        goto yy2;
        }
yy2:
yy3:
        ++YYCURSOR;
        { return 'a'; }
}

Clearly this is not what we want: this code matches any character, not just a! This happens because we did not specify any handler for the remaining input symbols. If we run re2c with -Wundefined-control-flow, we will see that it complains about undefined control flow and recommends using the default rule *:

a.re:3:2: warning: control flow is undefined for strings that match '[\x0-\x60\x62-\xFF]', use the default '*' rule  [-Wundefined-control-flow]

Let’s follow the advice and change the code:

/*!re2c
    *   { return '*'; }
    "a" { return 'a'; }
*/

Now the generated code looks much better:

{
        YYCTYPE yych;
        if (YYLIMIT <= YYCURSOR) YYFILL(1);
        yych = *YYCURSOR;
        switch (yych) {
        case 'a':       goto yy4;
        default:        goto yy2;
        }
yy2:
        ++YYCURSOR;
        { return '*'; }
yy4:
        ++YYCURSOR;
        { return 'a'; }
}

Note that the default rule brings no overhead: it simply binds code to the default label. It should always be used, unless you are absolutely sure that your grammar covers all possible cases.
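To see the difference concretely, the two generated fragments above can be wrapped in ordinary C functions. The sketch below is only an illustration: the function names, the YYCTYPE typedef, and the YYFILL definition (which simply reports end of input by returning -1) are assumptions, not something re2c generates.

```c
#include <stddef.h>

/* Assumed definitions: the generated code expects the user to supply these. */
typedef unsigned char YYCTYPE;
#define YYFILL(n) return -1   /* assumption: no refill, just report end of input */

/* Generated code for the single rule "a" (no default rule). */
static int lex_without_default(const char *input, size_t len)
{
    const YYCTYPE *YYCURSOR = (const YYCTYPE *)input;
    const YYCTYPE *YYLIMIT = YYCURSOR + len;
    YYCTYPE yych;
    if (YYLIMIT <= YYCURSOR) YYFILL(1);
    yych = *YYCURSOR;
    switch (yych) {
    case 'a':       goto yy3;
    default:        goto yy2;
    }
yy2:                          /* no action bound here: falls through to yy3 */
yy3:
    ++YYCURSOR;
    { return 'a'; }
}

/* Generated code for the rules * and "a". */
static int lex_with_default(const char *input, size_t len)
{
    const YYCTYPE *YYCURSOR = (const YYCTYPE *)input;
    const YYCTYPE *YYLIMIT = YYCURSOR + len;
    YYCTYPE yych;
    if (YYLIMIT <= YYCURSOR) YYFILL(1);
    yych = *YYCURSOR;
    switch (yych) {
    case 'a':       goto yy4;
    default:        goto yy2;
    }
yy2:
    ++YYCURSOR;
    { return '*'; }
yy4:
    ++YYCURSOR;
    { return 'a'; }
}
```

With these definitions, lex_without_default("b", 1) returns 'a', while lex_with_default("b", 1) returns '*'. The fall-through in the first function is just what this particular listing happens to do. In general the behavior without a default rule is undefined.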

The old default rule

When the world was young and re2c didn't have the default * rule (that is, before re2c-0.13.7), everyone used [^] as the default rule, as in this example (any.re):

/*!re2c
    // ... normal rules ...
    [^] { return "any"; }
*/

[^] is just an ordinary rule: it matches any character and has normal priority (so it should be the last rule). If no other rule matches, [^] will match and consume one character.

But exactly what is a character? First, an abstract number that is assigned some sacred meaning within the current encoding — a code point. Second, a minimal piece of information (say, combination of bits) that can represent a unit of encoded text — a code unit. Rules are defined in terms of code points. Input is measured in code units. In fixed-width encodings (such as ASCII, EBCDIC, UCS-2, UTF-32, etc.), there is a one-to-one correspondence between code points and code units. In variable-width encodings (such as UTF-8, UTF-16, etc.), code points map to code unit sequences of different lengths.
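The distinction between code points and code units can be made concrete by counting how many code units each encoding needs for a given code point. The helpers below are a minimal sketch with hypothetical names. They follow the standard UTF-8, UTF-16, and UTF-32 range boundaries and do not treat surrogates specially.

```c
/* Number of 8-bit code units UTF-8 needs for code point cp (0 if out of range). */
static int utf8_units(unsigned long cp)
{
    if (cp <= 0x7FUL)     return 1;
    if (cp <= 0x7FFUL)    return 2;
    if (cp <= 0xFFFFUL)   return 3;
    if (cp <= 0x10FFFFUL) return 4;
    return 0;
}

/* Number of 16-bit code units UTF-16 needs for code point cp (0 if out of range). */
static int utf16_units(unsigned long cp)
{
    if (cp <= 0xFFFFUL)   return 1;
    if (cp <= 0x10FFFFUL) return 2;   /* encoded as a surrogate pair */
    return 0;
}

/* UTF-32 is fixed-width: every code point is exactly one 32-bit code unit. */
static int utf32_units(unsigned long cp)
{
    return cp <= 0x10FFFFUL ? 1 : 0;
}
```

For example, U+20AC takes three code units in UTF-8 but one in UTF-16 and UTF-32, and U+1F600 takes four, two, and one respectively.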

The [^] rule matches any code point. In fixed-width encodings, it covers all code units and consumes exactly one of them. In variable-width encodings, it consumes a variable number of code units and may not match some of them. The example above compiles without warnings with any fixed-width encoding (ASCII by default). However, with the UTF-8 encoding, `re2c -i8 -Wundefined-control-flow any.re` complains:

any.re:4:2: warning: control flow is undefined for strings that match
        '[\x80-\xC1\xF5-\xFF]'
        '\xF0 [\x0-\x8F\xC0-\xFF]'
        '[\xE1-\xEF] [\x0-\x7F\xC0-\xFF]'
        '\xF4 [\x0-\x7F\x90-\xFF]'
        '\xE0 [\x0-\x9F\xC0-\xFF]'
        '[\xF1-\xF3] [\x0-\x7F\xC0-\xFF]'
        '[\xC2-\xDF] [\x0-\x7F\xC0-\xFF]'
        '\xE0 [\xA0-\xBF] [\x0-\x7F\xC0-\xFF]'
 ... and 7 more, use default rule '*' [-Wundefined-control-flow]

It shows us the patterns that must never appear in valid UTF-8 encoded text. If the input is not valid UTF-8, the lexer's behavior is undefined. One would expect that with UTF-16 (another variable-width encoding), re2c would also report a warning, but it doesn't. This is because by default, re2c treats Unicode surrogates as normal code points (for backwards compatibility reasons). If we tell re2c to exclude surrogates (`re2c -ix --encoding-policy fail -Wundefined-control-flow`), then we will get a warning:

any.re:4:2: warning: control flow is undefined for strings that match
        '[\xDC00-\xDFFF]'
        '[\xD800-\xDBFF] [\x0-\xDBFF\xE000-\xFFFF]'
, use default rule '*' [-Wundefined-control-flow]
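The two patterns in this warning correspond to the two ways a UTF-16 sequence can be ill-formed: a lone low surrogate, or a high surrogate that is not followed by a low surrogate. A hand-written check might look roughly like this. The function names are hypothetical, and unsigned short is assumed to hold one 16-bit code unit.

```c
#include <stddef.h>

static int is_high_surrogate(unsigned int u) { return u >= 0xD800 && u <= 0xDBFF; }
static int is_low_surrogate(unsigned int u)  { return u >= 0xDC00 && u <= 0xDFFF; }

/*
 * Returns the number of code units in the well-formed sequence starting at
 * s[0], or 0 if the input is empty, truncated, or matches one of the two
 * ill-formed patterns reported in the warning.
 */
static size_t utf16_sequence_length(const unsigned short *s, size_t n)
{
    if (n == 0) return 0;
    if (is_low_surrogate(s[0])) return 0;        /* '[\xDC00-\xDFFF]' */
    if (is_high_surrogate(s[0])) {
        /* '[\xD800-\xDBFF] [\x0-\xDBFF\xE000-\xFFFF]', or truncated input */
        if (n < 2 || !is_low_surrogate(s[1])) return 0;
        return 2;
    }
    return 1;                                    /* ordinary BMP code unit */
}
```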

As you can see, it can get quite subtle. Good advice is to always use the default rule *: it matches any code unit regardless of encoding, consumes a single code unit no matter what, and always has the lowest priority. Note that * is a built-in hack: it cannot be expressed through ordinary rules.
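As a rough illustration of how [^] and * complement each other under UTF-8, the following hand-written function approximates one lexer step for a specification containing both rules. It is a sketch under assumptions, not re2c output. The names are hypothetical, the byte ranges are taken from the UTF-8 warning above (so surrogates are accepted, as re2c does by default), and the fallback to a single code unit models the default rule.

```c
#include <assert.h>
#include <stddef.h>

/* Is b a UTF-8 continuation byte (0x80..0xBF)? */
static int is_continuation(unsigned char b) { return (b & 0xC0) == 0x80; }

/*
 * Consumes one token from s[0..n-1], n >= 1, and returns its length in code
 * units. *is_code_point is set to 1 if a whole code point was matched (the
 * [^] rule), or to 0 if only a single code unit was consumed (the * rule).
 */
static size_t lex_step(const unsigned char *s, size_t n, int *is_code_point)
{
    unsigned char b, lo = 0x80, hi = 0xBF;  /* allowed range of second unit */
    size_t len, i;

    assert(n >= 1);
    b = s[0];
    *is_code_point = 0;

    if (b <= 0x7F) { *is_code_point = 1; return 1; }
    else if (b >= 0xC2 && b <= 0xDF) { len = 2; }
    else if (b >= 0xE0 && b <= 0xEF) { len = 3; if (b == 0xE0) lo = 0xA0; }
    else if (b >= 0xF0 && b <= 0xF4) {
        len = 4;
        if (b == 0xF0) lo = 0x90;
        if (b == 0xF4) hi = 0x8F;
    }
    else return 1;                          /* 0x80..0xC1 or 0xF5..0xFF */

    if (n < len) return 1;                  /* truncated sequence */
    if (s[1] < lo || s[1] > hi) return 1;   /* bad second unit */
    for (i = 2; i < len; ++i)
        if (!is_continuation(s[i])) return 1;

    *is_code_point = 1;
    return len;
}
```

Whatever the input, the function always consumes at least one code unit, so a loop that calls it repeatedly makes progress even on ill-formed text.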