Numbers

This example is simple, yet practical. We assume that the input is small (it fits in one contiguous block of memory). We also assume that some characters never occur in well-formed input (though they may occur in ill-formed input). This is often the case in simple real-world tasks such as parsing program options, converting strings to numbers, determining binary file types from a magic number in the first few bytes, or efficiently switching on a string. Our example program loops over its command-line arguments and tries to match each argument against one of four patterns: binary, octal, decimal, and hexadecimal integer literals. The numbers are not parsed (their numeric values are not computed); they are merely recognized.

[integers.re]

#include <stdio.h>

enum num_t { ERR, BIN, OCT, DEC, HEX };

static num_t lex(const char *YYCURSOR)
{
    const char *YYMARKER;
    /*!re2c
        re2c:define:YYCTYPE = char;
        re2c:yyfill:enable = 0;

        end = "\x00";
        bin = '0b' [01]+;
        oct = "0" [0-7]*;
        dec = [1-9][0-9]*;
        hex = '0x' [0-9a-fA-F]+;

        *       { return ERR; }
        bin end { return BIN; }
        oct end { return OCT; }
        dec end { return DEC; }
        hex end { return HEX; }
    */
}

int main(int argc, char **argv)
{
    for (int i = 1; i < argc; ++i) {
        switch (lex(argv[i])) {
            case ERR: printf("error\n"); break;
            case BIN: printf("binary\n"); break;
            case OCT: printf("octal\n"); break;
            case DEC: printf("decimal\n"); break;
            case HEX: printf("hexadecimal\n"); break;
        }
    }
    return 0;
}

A few things should be noted:

  • The default case (when none of the rules matches) is handled by the * rule (line 18). Never forget to handle the default case; otherwise the lexer's control flow is undefined for some input strings. Use the [-Wundefined-control-flow] re2c warning: it reports an unhandled default case and shows the input patterns that are not covered by the rules.
  • We use the sentinel method to stop at the end of input (re2c:yyfill:enable = 0; at line 10). A sentinel is a special character that can never occur in well-formed input. It is appended to the end of input and serves as a stop signal for the lexer. In our case, the sentinel is NUL (the zero byte, written "\x00" in the end definition): all arguments are NUL-terminated and none of the rules matches NUL in the middle. The lexer will inevitably stop when it sees a NUL. Note that we make no assumptions about the input; it may contain any characters. But do make sure that the sentinel character is not allowed in the middle of a rule.
  • YYMARKER (line 7) is needed because rules overlap: it saves the input position of the longest match found so far. Imagine we have overlapping rules "a" and "abc" and input string "abd": by the time "a" matches, there is still a chance to match "abc", but when the lexer sees 'd', it must roll back to the position just after "a". (You might wonder why YYMARKER is exposed at all: why not make it a local variable like yych? The reason is that all input pointers must be updated by YYFILL, as explained in the Large input example.)
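
The sentinel and the backup position described above can be illustrated with a small hand-written matcher for the overlapping rules "a" and "abc". This is only a sketch of the idea, not the code re2c generates: the names Tok, Match, and match_a_or_abc are hypothetical, and the local variables cursor and marker merely play the roles of YYCURSOR and YYMARKER. It assumes the input is NUL-terminated, so the NUL works as the sentinel and no bounds check is needed.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical token kinds for the two overlapping rules "a" and "abc".
enum class Tok { None, A, ABC };

struct Match {
    Tok tok;            // which rule matched (None if neither did)
    std::size_t length; // number of characters consumed
};

// Hand-written matcher for the rules "a" and "abc" (illustration only).
// Assumption: input is NUL-terminated. The NUL is the sentinel: it never
// equals 'a', 'b', or 'c', so every comparison below stops at end of input.
Match match_a_or_abc(const char *input)
{
    assert(input != nullptr);
    const char *cursor = input;   // plays the role of YYCURSOR
    const char *marker = nullptr; // plays the role of YYMARKER

    if (*cursor != 'a') return {Tok::None, 0};
    ++cursor;

    // "a" has matched, but "abc" is still possible: save this position.
    marker = cursor;

    if (*cursor == 'b') {
        ++cursor;
        if (*cursor == 'c') {
            ++cursor;
            return {Tok::ABC, static_cast<std::size_t>(cursor - input)};
        }
    }

    // The longer rule failed: roll back to the end of the shorter match.
    cursor = marker;
    return {Tok::A, static_cast<std::size_t>(cursor - input)};
}
```

For the input "abd", the matcher consumes 'a' and 'b', fails on 'd', and rolls back to the saved position, reporting a match of "a" with length 1. For "abc" it reports the longer rule with length 3.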

Compile:

$ re2c -o integers.cc integers.re
$ g++ -o integers integers.cc

Run:

$ ./integers 123 0xfF 010 0B101 ?
decimal
hexadecimal
octal
binary
error
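
To cross-check the expected output without running re2c, the four patterns can be mirrored with std::regex. This is a sketch under stated assumptions, not part of the example: the names Kind and classify are hypothetical, and the single-quoted re2c strings '0b' and '0x' are taken to be case-insensitive (consistent with 0B101 being reported as binary above), so they are written as 0[bB] and 0[xX]. Because std::regex_match requires the whole string to match, it stands in for the trailing end sentinel in each rule.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical result type mirroring num_t from integers.re.
enum class Kind { Err, Bin, Oct, Dec, Hex };

// Reference classifier mirroring the four patterns in integers.re.
// Assumption: arg is a NUL-terminated string, as argv entries are.
Kind classify(const char *arg)
{
    assert(arg != nullptr);
    static const std::regex bin("0[bB][01]+");
    static const std::regex oct("0[0-7]*");
    static const std::regex dec("[1-9][0-9]*");
    static const std::regex hex("0[xX][0-9a-fA-F]+");

    const std::string s(arg);
    // regex_match succeeds only if the entire string matches the pattern.
    if (std::regex_match(s, bin)) return Kind::Bin;
    if (std::regex_match(s, oct)) return Kind::Oct;
    if (std::regex_match(s, dec)) return Kind::Dec;
    if (std::regex_match(s, hex)) return Kind::Hex;
    return Kind::Err; // corresponds to the default * rule
}
```

Under these assumptions, classify agrees with the run above for 123, 0xfF, 010, 0B101, and ?. It also makes a few edge cases explicit: a lone 0 is octal, while 08, 0x, and the empty string are errors.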