[-Wmatch-empty-string]

A simple example

[-Wmatch-empty-string] warns when a rule is nullable, that is, when it can match the empty string. It is intended to prevent infinite loops in cases like this:

[wmatch_empty_string.re]

#include <stdio.h>

int main(int argc, char **argv)
{
    for (int i = 1; i < argc; ++i) {
        for (char *YYCURSOR = argv[i];;) {
        /*!re2c
            re2c:define:YYCTYPE = char;
            re2c:yyfill:enable = 0;
            "\x00" { break; }
            [a-z]* { continue; }
        */
        }
        printf("argv[%d]: %s\n", i, argv[i]);
    }
    return 0;
}

The program loops over its arguments (the outer for loop) and tries to lex each argument (the inner for loop). The lexer stops when all input has been consumed and it sees the terminating NUL. Arguments must consist of lowercase letters only. Generate, compile and run:

$ re2c -o example.c -Wmatch-empty-string wmatch_empty_string.re
re2c: warning: line 11: rule matches empty string [-Wmatch-empty-string]
$ g++ -o example example.c
$
$ ./example only lowercase letters are allowed
argv[1]: only
argv[2]: lowercase
argv[3]: letters
argv[4]: are
argv[5]: allowed
$
$ ./example oh really?
argv[1]: oh
^C

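The program hangs forever if one of the arguments is ill-formed. In the second run the lexer consumes really and stops at ?, which neither rule can consume. The rule [a-z]* then matches the empty string, YYCURSOR does not advance, and the loop repeats the same empty match forever.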
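The stall can be reproduced without re2c. The following is a minimal sketch, not re2c output: match_lower_star and lex_with_budget are hypothetical names, the matcher is a hand-written stand-in for [a-z]*, and the step budget exists only so that the stall can be observed instead of hanging.

```c
#include <stddef.h>

/* Hand-written stand-in for the rule [a-z]* (an assumption, not generated
   code): returns how many characters the rule consumes at s.
   Zero means the rule matched the empty string. */
static size_t match_lower_star(const char *s)
{
    size_t n = 0;
    while (s[n] >= 'a' && s[n] <= 'z') ++n;
    return n;
}

/* Same loop shape as the example above, but it gives up after max_steps
   iterations. Returns 1 if the terminating NUL was reached, 0 if the
   step budget ran out because the cursor stopped advancing. */
static int lex_with_budget(const char *s, int max_steps)
{
    const char *cur = s;
    for (int step = 0; step < max_steps; ++step) {
        if (*cur == '\0') return 1;      /* rule "\x00" */
        cur += match_lower_star(cur);    /* rule [a-z]*: may add 0 */
    }
    return 0;
}
```

With a budget of 1000 steps, lex_with_budget("oh", 1000) reaches the NUL on its second step, while lex_with_budget("really?", 1000) spends the remaining steps at ? without moving and runs out of budget.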

Note that [-Wundefined-control-flow] has no complaints about this particular case: all input patterns are covered by rules. Yet if we add the default rule *, the lexer no longer hangs: on an unexpected character it matches the default rule instead of the nullable rule.

The fix is easy: make the rule non-nullable (say, [a-z]+) and add the default rule *.
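The effect of the fix can be sketched in plain C. This is a hand-written approximation of the fixed rule set, not what re2c generates, and lex_lowercase is a hypothetical name. For simplicity the default branch rejects the argument at once, whereas a real default rule would consume one code unit and run its action.

```c
/* Hand-written stand-in (an assumption, not generated code) for:
       "\x00"  end of input, accept
       [a-z]+  consume a run of lowercase letters
       *       default rule: reject any other character
   Returns 1 if s consists of lowercase letters only, 0 otherwise. */
static int lex_lowercase(const char *s)
{
    for (const char *cur = s;;) {
        if (*cur == '\0') return 1;            /* "\x00" */
        if (*cur >= 'a' && *cur <= 'z') {      /* [a-z]+ */
            do ++cur; while (*cur >= 'a' && *cur <= 'z');
            continue;
        }
        return 0;                              /* default rule * */
    }
}
```

Every branch either consumes at least one character or leaves the loop, so the loop cannot stall on any input.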

False alarm

In many cases matching empty string makes perfect sense:

  • It might be used as a non-consuming default rule.
  • It might be used to lex an optional lexeme: if the lexeme rules do not match, the lexer must jump to another block and resume lexing at the same input position.

Other useful examples are easy to invent, and all of them are perfectly sane. If [-Wmatch-empty-string] becomes annoying, disable it with [-Wno-match-empty-string]. Maybe re2c should move this warning to some paranoid level.
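The optional-lexeme case can be sketched in plain C with two blocks, written here as two functions. This is an illustrative sketch with hypothetical names (lex_sign, lex_digits, lex_number), not re2c output. The last branch of lex_sign plays the role of an empty-match rule, and it is harmless because the block runs once per number instead of in a loop.

```c
#include <stddef.h>

/* Block 1: optional sign. The final branch acts like an empty-match rule:
   it consumes nothing and hands the same position to block 2. */
static const char *lex_sign(const char *cur, int *negative)
{
    if (*cur == '-') { *negative = 1; return cur + 1; }
    if (*cur == '+') { *negative = 0; return cur + 1; }
    *negative = 0;
    return cur;   /* empty match: position unchanged */
}

/* Block 2: one or more digits. Returns NULL if there is no digit.
   Overflow is not checked in this sketch. */
static const char *lex_digits(const char *cur, long *value)
{
    if (*cur < '0' || *cur > '9') return NULL;
    *value = 0;
    while (*cur >= '0' && *cur <= '9') *value = *value * 10 + (*cur++ - '0');
    return cur;
}

/* Returns 1 and stores the number on success, 0 on ill-formed input. */
static int lex_number(const char *s, long *out)
{
    int negative;
    long value;
    const char *cur = lex_sign(s, &negative);
    cur = lex_digits(cur, &value);
    if (cur == NULL || *cur != '\0') return 0;
    *out = negative ? -value : value;
    return 1;
}
```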

Real-world examples

In general, it is a common mistake to use * instead of + in repetitions. That is, to accept zero or more repetitions instead of one or more.

Typos in definitions

Here is the skeleton of a REXX lexer (the very lexer that motivated Peter to write re2c :)).

/*!re2c
    all    = [\000-\377];
    eof    = [\000];
    any    = all\eof;
    letter = [a-z]|[A-Z];
    digit  = [0-9];
    symchr = letter|digit|[.!?_];
    const  = (digit|[.])symchr*([eE][+-]?digit+)?;
    simple = (symchr\(digit|[.]))(symchr\[.])*;
    stem   = simple [.];
    symbol = symchr*;
    sqstr  = ['] ((any\['\n])|(['][']))* ['];
    dqstr  = ["] ((any\["\n])|(["]["]))* ["];
    str    = sqstr|dqstr;
    ob     = [ \t]*;
    not    = [\\~];
    A      = [aA];
    B      = [bB];
    C      = [cC];
    D      = [dD];
    E      = [eE];
    F      = [fF];
    G      = [gG];
    H      = [hH];
    I      = [iI];
    J      = [jJ];
    K      = [kK];
    L      = [lL];
    M      = [mM];
    N      = [nN];
    O      = [oO];
    P      = [pP];
    Q      = [qQ];
    R      = [rR];
    S      = [sS];
    T      = [tT];
    U      = [uU];
    V      = [vV];
    W      = [wW];
    X      = [xX];
    Y      = [yY];
    Z      = [zZ];

    "\n"                                  {}
    "|" ob "|"                            {}
    "+"                                   {}
    "-"                                   {}
    "*"                                   {}
    "/"                                   {}
    "%"                                   {}
    "/" ob "/"                            {}
    "*" ob "*"                            {}
    "="                                   {}
    not ob "=" | "<" ob ">" | ">" ob "<"  {}
    ">"                                   {}
    "<"                                   {}
    ">" ob "=" | not ob "<"               {}
    "<" ob "=" | not ob ">"               {}
    "=" ob "="                            {}
    not ob "=" ob "="                     {}
    ">" ob ">"                            {}
    "<" ob "<"                            {}
    ">" ob ">" ob "=" | not ob "<" ob "<" {}
    "<" ob "<" ob "=" | not ob ">" ob ">" {}
    "&"                                   {}
    "|"                                   {}
    "&" ob "&"                            {}
    not                                   {}
    ":"                                   {}
    ","                                   {}
    "("                                   {}
    ")"                                   {}
    ";"                                   {}
    A D D R E S S                         {}
    A R G                                 {}
    C A L L                               {}
    D O                                   {}
    D R O P                               {}
    E L S E                               {}
    E N D                                 {}
    E X I T                               {}
    I F                                   {}
    I N T E R P R E T                     {}
    I T E R A T E                         {}
    L E A V E                             {}
    N O P                                 {}
    N U M E R I C                         {}
    O P T I O N S                         {}
    O T H E R W I S E                     {}
    P A R S E                             {}
    P R O C E D U R E                     {}
    P U L L                               {}
    P U S H                               {}
    Q U E U E                             {}
    R E T U R N                           {}
    S A Y                                 {}
    S E L E C T                           {}
    S I G N A L                           {}
    T H E N                               {}
    T R A C E                             {}
    W H E N                               {}
    O F F                                 {}
    O N                                   {}
    B Y                                   {}
    D I G I T S                           {}
    E N G I N E E R I N G                 {}
    E R R O R                             {}
    E X P O S E                           {}
    F A I L U R E                         {}
    F O R                                 {}
    F O R E V E R                         {}
    F O R M                               {}
    F U Z Z                               {}
    H A L T                               {}
    L I N E I N                           {}
    N A M E                               {}
    N O T R E A D Y                       {}
    N O V A L U E                         {}
    S C I E N T I F I C                   {}
    S O U R C E                           {}
    S Y N T A X                           {}
    T O                                   {}
    U N T I L                             {}
    U P P E R                             {}
    V A L U E                             {}
    V A R                                 {}
    V E R S I O N                         {}
    W H I L E                             {}
    W I T H                               {}
    const                                 {}
    simple                                {}
    stem                                  {}
    symbol                                {}
    str                                   {}
    str [bB] / (all\symchr)               {}
    str [xX] / (all\symchr)               {}
    eof                                   {}
    any                                   {}
*/

`re2c -Wmatch-empty-string` warns:

re2c: warning: line 133: rule matches empty string [-Wmatch-empty-string]

The faulty rule is symbol. It is defined as symchr* and is clearly nullable. In this particular example (assuming ASCII encoding) the empty match is shadowed by other rules: together, eof and any cover all possible code units, so some rule always consumes at least one character. In this case there is no chance of hitting an infinite loop.

However, symbol should by no means be nullable: an empty symbol makes no sense. Surely it is just a typo and the author meant symchr+.

Skipping uninteresting stuff

One often needs to skip a variable number of, say, spaces:

/*!re2c
    TABS_AND_SPACES = [ \t]*;
*/

This definition is fine when used inside another (non-nullable) rule:

/*!re2c
    TABS_AND_SPACES = [ \t]*;
    "(" TABS_AND_SPACES ("int" | "integer") TABS_AND_SPACES ")" {}
*/

However, as a standalone rule it may cause an infinite loop on ill-formed input. And it is very common to reuse one definition for multiple purposes.
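One way to avoid the trap is to keep two variants of the whitespace pattern: zero or more for use inside larger rules, and one or more for the standalone rule. The sketch below shows the idea in hand-written C rather than re2c output, and the names blanks0, blanks1 and match_paren_int are hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* [ \t]* : zero or more blanks; safe only inside a larger rule. */
static const char *blanks0(const char *cur)
{
    while (*cur == ' ' || *cur == '\t') ++cur;
    return cur;
}

/* [ \t]+ : one or more blanks. Returns NULL when nothing matches, so a
   standalone rule built on it always consumes input when it succeeds. */
static const char *blanks1(const char *cur)
{
    const char *end = blanks0(cur);
    return end == cur ? NULL : end;
}

/* "(" [ \t]* ("int" | "integer") [ \t]* ")"
   The parentheses and the keyword make the whole rule non-nullable,
   so the nullable blanks0 is harmless here.
   Returns the position after the match, or NULL if there is no match. */
static const char *match_paren_int(const char *cur)
{
    if (*cur != '(') return NULL;
    cur = blanks0(cur + 1);
    if (strncmp(cur, "integer", 7) == 0) cur += 7;   /* longer keyword first */
    else if (strncmp(cur, "int", 3) == 0) cur += 3;
    else return NULL;
    cur = blanks0(cur);
    return *cur == ')' ? cur + 1 : NULL;
}
```

In re2c terms this corresponds to using the * form only inside definitions such as the parenthesized rule above, and the + form wherever the pattern stands as a rule of its own.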