[-Wundefined-control-flow]

A simple example

Say, we want to match 'a':

1
2
3
/*!re2c
    "a" { return 'a'; }
*/

`re2c -i -Wundefined-control-flow`:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
/* Generated by re2c 0.14.1.dev on Thu Nov  5 14:35:46 2015*/

{
        YYCTYPE yych;
        if (YYLIMIT <= YYCURSOR) YYFILL(1);
        yych = *YYCURSOR;
        switch (yych) {
        case 'a':       goto yy3;
        default:        goto yy2;
        }
yy2:
yy3:
        ++YYCURSOR;
        { return 'a'; }
}

Clearly this is not what we want: this code matches any letter, not only 'a'. re2c grumbles something about undefined control flow and says that default * rule won’t hurt:

re2c: warning: line 3: control flow is undefined for strings that match '[\x0-\x60\x62-\xFF]', use default rule '*' [-Wundefined-control-flow]

Let’s add it:

1
2
3
4
/*!re2c
    *   { return '*'; }
    "a" { return 'a'; }
*/

Now that’s better:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
/* Generated by re2c 0.14.1.dev on Thu Nov  5 14:35:08 2015*/

{
        YYCTYPE yych;
        if (YYLIMIT <= YYCURSOR) YYFILL(1);
        yych = *YYCURSOR;
        switch (yych) {
        case 'a':       goto yy4;
        default:        goto yy2;
        }
yy2:
        ++YYCURSOR;
        { return '*'; }
yy4:
        ++YYCURSOR;
        { return 'a'; }
}

Note that default rule brings no overhead: it simply binds code to default label.

Difference between * and [^]

When the the world was young and re2c didn’t have default rule * (that is, before re2c-0.13.7), everyone used [^] as default rule:

/*!re2c
    // ... normal rules ...
    [^] { return "any"; }
*/

[^] is just an ordinary rule: it matches any character and has normal priority (so it should be the last rule). If other rules didn’t match, [^] will match and consume one character.

But exactly what is a character? First, an abstract number that is assigned some sacred meaning in current encoding — code point. Second, a minimal piece of information (say, combination of bits) that can represent a unit of encoded text — code unit. Rules are defined in terms of code points. Input is measured in code units. In fixed-width encodings (such as ASCII, EBCDIC, UCS-2, UTF-32, etc.) there is one-to-one correspondence between code points and code units. In variable-width encodings (such as UTF-8, UTF-16, etc.) code points map to code unit sequences of differing length.

[^] rule matches any code point. In fixed-width encodings it covers all code units and consumes exactly one of them. In variable-width encodings it consumes variable number of code units and may not match some of them. The example above compiles without warnings with any fixed-width encoding (ASCII by default). However, with UTF-8 encoding `re2c -i8 -Wundefined-control-flow` says:

re2c: warning: line 4: control flow is undefined for strings that match
        '[\x80-\xC1\xF5-\xFF]'
        '\xF0 [\x0-\x8F\xC0-\xFF]'
        '[\xE1-\xEF] [\x0-\x7F\xC0-\xFF]'
        '\xF4 [\x0-\x7F\x90-\xFF]'
        '\xE0 [\x0-\x9F\xC0-\xFF]'
        '[\xF1-\xF3] [\x0-\x7F\xC0-\xFF]'
        '[\xC2-\xDF] [\x0-\x7F\xC0-\xFF]'
        '\xE0 [\xA0-\xBF] [\x0-\x7F\xC0-\xFF]'
 ... and 7 more, use default rule '*' [-Wundefined-control-flow]

It shows us the patterns that must never appear in valid UTF-8 encoded text. If the input is not valid UTF-8, lexer behaviour is undefined (most likely it will end up with segfault). One would expect that with UTF-16 (another variable-width encoding) re2c will also report a warning, but it doesn’t. This is because by default re2c treats Unicode surrogates as normal code points (for backwards compatibility reasons). If we tell re2c to exclude surrogates, `re2c -ix --encoding-policy fail -Wundefined-control-flow` will warn:

re2c: warning: line 4: control flow is undefined for strings that match
        '[\xDC00-\xDFFF]'
        '[\xD800-\xDBFF] [\x0-\xDBFF\xE000-\xFFFF]'
, use default rule '*' [-Wundefined-control-flow]

As you see, it can get quite subtle. One should always use the true default rule * (it matches any code unit regardless of encoding, consumes a single code unit no matter what and always has the lowest priority). Note that * is a builtin hack: it cannot be expressed through ordinary rules.

How it works

Every path in the generated DFA must contain at least one accepting state, otherwise it causes undefined behaviour and should be reported. re2c walks DFA in deep-first search and checks all paths. Each branch of search aborts as soon as it meets accepting state. Most of the real-world programs only forget to handle a few cases, so almost all branches abort soon and search takes very little time even for a large DFA. In pathological cases re2c avoids exponential time and space consumption by placing an upper bound on the number of faulty patterns. The shortest patterns are reported first.

Note that the analyses is done anyway. The option -Wundefined-control-flow only controls if the warning is reported or not.

Real-world examples

Many real-world examples deal with preprocessed input, so they make strong assumptions about the input form or character set. These assumptions may or may not be valid under certain circumstances; however, double-check won’t hurt. Even it you are absolutely sure that default case is impossible, do handle it. It adds no overhead. No additional checks and transitions. It simply binds code to default label.

I found [-Wundefined-control-flow] warnings in many real-world programs (including re2c own lexer). Mostly these are minor issues like forgetting to handle newlines or zeroes in already preprocessed input, but it’s curious how they creeped into the code. I bet they were just forgotten and not omitted for a good reason. :)