Skeleton

With the -S, --skeleton option, re2c ignores all non-re2c code and generates a self-contained C program that can be further compiled and executed. The program consists of lexer code and input data. For each constructed DFA (block or condition), re2c generates a standalone lexer and two files: an .input file with strings derived from the DFA and a .keys file with expected match results. The program runs each lexer on the corresponding .input file and compares results with the expectations.

For encodings with 1-byte code units (such as ASCII, UTF-8 and EBCDIC) the generated data covers all DFA transitions (in other words, the skeleton program triggers each conditional jump in the lexer). For encodings with multibyte code units the generated data covers up to 256 transitions of each disjoint character range in the DFA (see Generating data section for details).

Example

[hex2.re]

/*!re2c
    *              {}
    [0-9a-fA-F]{2} {}
*/

Here is a very simple program (it tries to match two-digit hexadecimal numbers). We can see the generated DFA using `re2c -D hex2.re | dot -Gratio=0.5 -Tpng -o example.png`:

../../../_images/example.png

Given this program, `re2c -S -o example.c hex2.re` generates three files: example.c (main program), example.c.line4.input (input data) and example.c.line4.keys (expected match results). First, let’s look at the generated strings:

[example.c.line4.input]

$ hexdump -v -e '"%08_ax " 24/1 "%02x "' -e '" |" 24/1 "%_p" "|\n"' example.c.line4.input
00000000 00 01 02 03 04 05 06 07 08 09 0a 0b 0c 0d 0e 0f 10 11 12 13 14 15 16 17 |........................|
00000018 18 19 1a 1b 1c 1d 1e 1f 20 21 22 23 24 25 26 27 28 29 2a 2b 2c 2d 2e 2f |........ !"#$%&'()*+,-./|
00000030 3a 3b 3c 3d 3e 3f 40 47 48 49 4a 4b 4c 4d 4e 4f 50 51 52 53 54 55 56 57 |:;<=>?@GHIJKLMNOPQRSTUVW|
00000048 58 59 5a 5b 5c 5d 5e 5f 60 67 68 69 6a 6b 6c 6d 6e 6f 70 71 72 73 74 75 |XYZ[\]^_`ghijklmnopqrstu|
00000060 76 77 78 79 7a 7b 7c 7d 7e 7f 80 81 82 83 84 85 86 87 88 89 8a 8b 8c 8d |vwxyz{|}~...............|
00000078 8e 8f 90 91 92 93 94 95 96 97 98 99 9a 9b 9c 9d 9e 9f a0 a1 a2 a3 a4 a5 |........................|
00000090 a6 a7 a8 a9 aa ab ac ad ae af b0 b1 b2 b3 b4 b5 b6 b7 b8 b9 ba bb bc bd |........................|
000000a8 be bf c0 c1 c2 c3 c4 c5 c6 c7 c8 c9 ca cb cc cd ce cf d0 d1 d2 d3 d4 d5 |........................|
000000c0 d6 d7 d8 d9 da db dc dd de df e0 e1 e2 e3 e4 e5 e6 e7 e8 e9 ea eb ec ed |........................|
000000d8 ee ef f0 f1 f2 f3 f4 f5 f6 f7 f8 f9 fa fb fc fd fe ff 30 30 31 31 32 32 |..................001122|
000000f0 33 33 34 34 35 35 36 36 37 37 38 38 39 39 41 41 42 42 43 43 44 44 45 45 |33445566778899AABBCCDDEE|
00000108 46 46 61 61 62 62 63 63 64 64 65 65 66 66 30 00 31 01 32 02 33 03 34 04 |FFaabbccddeeff0.1.2.3.4.|
00000120 35 05 36 06 37 07 38 08 39 09 41 0a 42 0b 43 0c 44 0d 45 0e 46 0f 61 10 |5.6.7.8.9.A.B.C.D.E.F.a.|
00000138 62 11 63 12 64 13 65 14 66 15 30 16 31 17 32 18 33 19 34 1a 35 1b 36 1c |b.c.d.e.f.0.1.2.3.4.5.6.|
00000150 37 1d 38 1e 39 1f 41 20 42 21 43 22 44 23 45 24 46 25 61 26 62 27 63 28 |7.8.9.A B!C"D#E$F%a&b'c(|
00000168 64 29 65 2a 66 2b 30 2c 31 2d 32 2e 33 2f 34 3a 35 3b 36 3c 37 3d 38 3e |d)e*f+0,1-2.3/4:5;6<7=8>|
00000180 39 3f 41 40 42 47 43 48 44 49 45 4a 46 4b 61 4c 62 4d 63 4e 64 4f 65 50 |9?A@BGCHDIEJFKaLbMcNdOeP|
00000198 66 51 30 52 31 53 32 54 33 55 34 56 35 57 36 58 37 59 38 5a 39 5b 41 5c |fQ0R1S2T3U4V5W6X7Y8Z9[A\|
000001b0 42 5d 43 5e 44 5f 45 60 46 67 61 68 62 69 63 6a 64 6b 65 6c 66 6d 30 6e |B]C^D_E`Fgahbicjdkelfm0n|
000001c8 31 6f 32 70 33 71 34 72 35 73 36 74 37 75 38 76 39 77 41 78 42 79 43 7a |1o2p3q4r5s6t7u8v9wAxByCz|
000001e0 44 7b 45 7c 46 7d 61 7e 62 7f 63 80 64 81 65 82 66 83 30 84 31 85 32 86 |D{E|F}a~b.c.d.e.f.0.1.2.|
000001f8 33 87 34 88 35 89 36 8a 37 8b 38 8c 39 8d 41 8e 42 8f 43 90 44 91 45 92 |3.4.5.6.7.8.9.A.B.C.D.E.|
00000210 46 93 61 94 62 95 63 96 64 97 65 98 66 99 30 9a 31 9b 32 9c 33 9d 34 9e |F.a.b.c.d.e.f.0.1.2.3.4.|
00000228 35 9f 36 a0 37 a1 38 a2 39 a3 41 a4 42 a5 43 a6 44 a7 45 a8 46 a9 61 aa |5.6.7.8.9.A.B.C.D.E.F.a.|
00000240 62 ab 63 ac 64 ad 65 ae 66 af 30 b0 31 b1 32 b2 33 b3 34 b4 35 b5 36 b6 |b.c.d.e.f.0.1.2.3.4.5.6.|
00000258 37 b7 38 b8 39 b9 41 ba 42 bb 43 bc 44 bd 45 be 46 bf 61 c0 62 c1 63 c2 |7.8.9.A.B.C.D.E.F.a.b.c.|
00000270 64 c3 65 c4 66 c5 30 c6 31 c7 32 c8 33 c9 34 ca 35 cb 36 cc 37 cd 38 ce |d.e.f.0.1.2.3.4.5.6.7.8.|
00000288 39 cf 41 d0 42 d1 43 d2 44 d3 45 d4 46 d5 61 d6 62 d7 63 d8 64 d9 65 da |9.A.B.C.D.E.F.a.b.c.d.e.|
000002a0 66 db 30 dc 31 dd 32 de 33 df 34 e0 35 e1 36 e2 37 e3 38 e4 39 e5 41 e6 |f.0.1.2.3.4.5.6.7.8.9.A.|
000002b8 42 e7 43 e8 44 e9 45 ea 46 eb 61 ec 62 ed 63 ee 64 ef 65 f0 66 f1 30 f2 |B.C.D.E.F.a.b.c.d.e.f.0.|
000002d0 31 f3 32 f4 33 f5 34 f6 35 f7 36 f8 37 f9 38 fa 39 fb 41 fc 42 fd 43 fe |1.2.3.4.5.6.7.8.9.A.B.C.|
000002e8 44 ff                                                                   |D.|

Byte sequences correspond to the paths in DFA. All strings are glued together, so it’s hard to tell where is the end of one string and the beginning of another. For that re2c generates keys:

[example.c.line4.keys]

$hexdump -v -e '"%08_ax " 36/1 "%02x " "\n"' example.c.line4.keys
00000000 01 01 fe 01 01 fe 01 01 fe 01 01 fe 01 01 fe 01 01 fe 01 01 fe 01 01 fe 01 01 fe 01 01 fe 01 01 fe 01 01 fe
00000024 01 01 fe 01 01 fe 01 01 fe 01 01 fe 01 01 fe 01 01 fe 01 01 fe 01 01 fe 01 01 fe 01 01 fe 01 01 fe 01 01 fe
00000048 01 01 fe 01 01 fe 01 01 fe 01 01 fe 01 01 fe 01 01 fe 01 01 fe 01 01 fe 01 01 fe 01 01 fe 01 01 fe 01 01 fe
0000006c 01 01 fe 01 01 fe 01 01 fe 01 01 fe 01 01 fe 01 01 fe 01 01 fe 01 01 fe 01 01 fe 01 01 fe 01 01 fe 01 01 fe
00000090 01 01 fe 01 01 fe 01 01 fe 01 01 fe 01 01 fe 01 01 fe 01 01 fe 01 01 fe 01 01 fe 01 01 fe 01 01 fe 01 01 fe
000000b4 01 01 fe 01 01 fe 01 01 fe 01 01 fe 01 01 fe 01 01 fe 01 01 fe 01 01 fe 01 01 fe 01 01 fe 01 01 fe 01 01 fe
000000d8 01 01 fe 01 01 fe 01 01 fe 01 01 fe 01 01 fe 01 01 fe 01 01 fe 01 01 fe 01 01 fe 01 01 fe 01 01 fe 01 01 fe
000000fc 01 01 fe 01 01 fe 01 01 fe 01 01 fe 01 01 fe 01 01 fe 01 01 fe 01 01 fe 01 01 fe 01 01 fe 01 01 fe 01 01 fe
00000120 01 01 fe 01 01 fe 01 01 fe 01 01 fe 01 01 fe 01 01 fe 01 01 fe 01 01 fe 01 01 fe 01 01 fe 01 01 fe 01 01 fe
00000144 01 01 fe 01 01 fe 01 01 fe 01 01 fe 01 01 fe 01 01 fe 01 01 fe 01 01 fe 01 01 fe 01 01 fe 01 01 fe 01 01 fe
00000168 01 01 fe 01 01 fe 01 01 fe 01 01 fe 01 01 fe 01 01 fe 01 01 fe 01 01 fe 01 01 fe 01 01 fe 01 01 fe 01 01 fe
0000018c 01 01 fe 01 01 fe 01 01 fe 01 01 fe 01 01 fe 01 01 fe 01 01 fe 01 01 fe 01 01 fe 01 01 fe 01 01 fe 01 01 fe
000001b0 01 01 fe 01 01 fe 01 01 fe 01 01 fe 01 01 fe 01 01 fe 01 01 fe 01 01 fe 01 01 fe 01 01 fe 01 01 fe 01 01 fe
000001d4 01 01 fe 01 01 fe 01 01 fe 01 01 fe 01 01 fe 01 01 fe 01 01 fe 01 01 fe 01 01 fe 01 01 fe 01 01 fe 01 01 fe
000001f8 01 01 fe 01 01 fe 01 01 fe 01 01 fe 01 01 fe 01 01 fe 01 01 fe 01 01 fe 01 01 fe 01 01 fe 01 01 fe 01 01 fe
0000021c 01 01 fe 01 01 fe 01 01 fe 01 01 fe 01 01 fe 01 01 fe 01 01 fe 01 01 fe 01 01 fe 01 01 fe 01 01 fe 01 01 fe
00000240 01 01 fe 01 01 fe 01 01 fe 01 01 fe 01 01 fe 01 01 fe 01 01 fe 01 01 fe 01 01 fe 01 01 fe 01 01 fe 01 01 fe
00000264 01 01 fe 01 01 fe 01 01 fe 01 01 fe 01 01 fe 01 01 fe 01 01 fe 01 01 fe 01 01 fe 01 01 fe 01 01 fe 01 01 fe
00000288 01 01 fe 01 01 fe 01 01 fe 01 01 fe 01 01 fe 01 01 fe 01 01 fe 01 01 fe 01 01 fe 01 01 fe 01 01 fe 01 01 fe
000002ac 01 01 fe 01 01 fe 01 01 fe 01 01 fe 01 01 fe 01 01 fe 02 02 00 02 02 00 02 02 00 02 02 00 02 02 00 02 02 00
000002d0 02 02 00 02 02 00 02 02 00 02 02 00 02 02 00 02 02 00 02 02 00 02 02 00 02 02 00 02 02 00 02 02 00 02 02 00
000002f4 02 02 00 02 02 00 02 02 00 02 02 00 02 01 fe 02 01 fe 02 01 fe 02 01 fe 02 01 fe 02 01 fe 02 01 fe 02 01 fe
00000318 02 01 fe 02 01 fe 02 01 fe 02 01 fe 02 01 fe 02 01 fe 02 01 fe 02 01 fe 02 01 fe 02 01 fe 02 01 fe 02 01 fe
0000033c 02 01 fe 02 01 fe 02 01 fe 02 01 fe 02 01 fe 02 01 fe 02 01 fe 02 01 fe 02 01 fe 02 01 fe 02 01 fe 02 01 fe
00000360 02 01 fe 02 01 fe 02 01 fe 02 01 fe 02 01 fe 02 01 fe 02 01 fe 02 01 fe 02 01 fe 02 01 fe 02 01 fe 02 01 fe
00000384 02 01 fe 02 01 fe 02 01 fe 02 01 fe 02 01 fe 02 01 fe 02 01 fe 02 01 fe 02 01 fe 02 01 fe 02 01 fe 02 01 fe
000003a8 02 01 fe 02 01 fe 02 01 fe 02 01 fe 02 01 fe 02 01 fe 02 01 fe 02 01 fe 02 01 fe 02 01 fe 02 01 fe 02 01 fe
000003cc 02 01 fe 02 01 fe 02 01 fe 02 01 fe 02 01 fe 02 01 fe 02 01 fe 02 01 fe 02 01 fe 02 01 fe 02 01 fe 02 01 fe
000003f0 02 01 fe 02 01 fe 02 01 fe 02 01 fe 02 01 fe 02 01 fe 02 01 fe 02 01 fe 02 01 fe 02 01 fe 02 01 fe 02 01 fe
00000414 02 01 fe 02 01 fe 02 01 fe 02 01 fe 02 01 fe 02 01 fe 02 01 fe 02 01 fe 02 01 fe 02 01 fe 02 01 fe 02 01 fe
00000438 02 01 fe 02 01 fe 02 01 fe 02 01 fe 02 01 fe 02 01 fe 02 01 fe 02 01 fe 02 01 fe 02 01 fe 02 01 fe 02 01 fe
0000045c 02 01 fe 02 01 fe 02 01 fe 02 01 fe 02 01 fe 02 01 fe 02 01 fe 02 01 fe 02 01 fe 02 01 fe 02 01 fe 02 01 fe
00000480 02 01 fe 02 01 fe 02 01 fe 02 01 fe 02 01 fe 02 01 fe 02 01 fe 02 01 fe 02 01 fe 02 01 fe 02 01 fe 02 01 fe
000004a4 02 01 fe 02 01 fe 02 01 fe 02 01 fe 02 01 fe 02 01 fe 02 01 fe 02 01 fe 02 01 fe 02 01 fe 02 01 fe 02 01 fe
000004c8 02 01 fe 02 01 fe 02 01 fe 02 01 fe 02 01 fe 02 01 fe 02 01 fe 02 01 fe 02 01 fe 02 01 fe 02 01 fe 02 01 fe
000004ec 02 01 fe 02 01 fe 02 01 fe 02 01 fe 02 01 fe 02 01 fe 02 01 fe 02 01 fe 02 01 fe 02 01 fe 02 01 fe 02 01 fe
00000510 02 01 fe 02 01 fe 02 01 fe 02 01 fe 02 01 fe 02 01 fe 02 01 fe 02 01 fe 02 01 fe 02 01 fe 02 01 fe 02 01 fe
00000534 02 01 fe 02 01 fe 02 01 fe 02 01 fe 02 01 fe 02 01 fe 02 01 fe 02 01 fe 02 01 fe 02 01 fe 02 01 fe 02 01 fe
00000558 02 01 fe 02 01 fe 02 01 fe 02 01 fe 02 01 fe 02 01 fe 02 01 fe 02 01 fe 02 01 fe 02 01 fe 02 01 fe 02 01 fe
0000057c 02 01 fe 02 01 fe 02 01 fe 02 01 fe 02 01 fe 02 01 fe 02 01 fe 02 01 fe 02 01 fe 02 01 fe 02 01 fe 02 01 fe
000005a0 02 01 fe 02 01 fe 02 01 fe 02 01 fe 02 01 fe 02 01 fe 02 01 fe 02 01 fe 02 01 fe 02 01 fe

A key is a triplet: string length, the length of matching prefix and the number of matching rule. All three components occupy equal amount of memory (re2c uses unsigned integer type of minimal sufficient width). Keys are packed into array of length 3 * n (where n is the number of keys). In our case each triplet occupies three bytes.

And finally, the program itself:

[example.c]

  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
/* Generated by re2c 0.15.2 on Wed Dec  2 08:46:58 2015 */

#include <stdio.h>
#include <stdlib.h> /* malloc, free */

static void *read_file
    ( const char *fname
    , size_t unit
    , size_t padding
    , size_t *pfsize
    )
{
    void *buffer = NULL;
    size_t fsize = 0;

    /* open file */
    FILE *f = fopen(fname, "rb");
    if(f == NULL) {
        goto error;
    }

    /* get file size */
    fseek(f, 0, SEEK_END);
    fsize = (size_t) ftell(f) / unit;
    fseek(f, 0, SEEK_SET);

    /* allocate memory for file and padding */
    buffer = malloc(unit * (fsize + padding));
    if (buffer == NULL) {
        goto error;
    }

    /* read the whole file in memory */
    if (fread(buffer, unit, fsize, f) != fsize) {
        goto error;
    }

    fclose(f);
    *pfsize = fsize;
    return buffer;

error:
    fprintf(stderr, "error: cannot read file '%s'\n", fname);
    free(buffer);
    if (f != NULL) {
        fclose(f);
    }
    return NULL;
}

#define YYCTYPE unsigned char
#define YYKEYTYPE unsigned char
#define YYPEEK() *cursor
#define YYSKIP() ++cursor
#define YYLESSTHAN(n) (limit - cursor) < n
#define YYFILL(n) { break; }

static int action_line4
    ( unsigned int i
    , const YYKEYTYPE *keys
    , const YYCTYPE *start
    , const YYCTYPE *token
    , const YYCTYPE **cursor
    , YYKEYTYPE rule_act
    )
{
    const long pos = token - start;
    const long len_act = *cursor - token;
    const long len_exp = (long) keys [3 * i + 1];
    const YYKEYTYPE rule_exp = keys [3 * i + 2];
    if (rule_exp == 255) {
        fprintf
            ( stderr
            , "warning: lex_line4: control flow is undefined for input"
                " at position %ld, rerun re2c with '-W'\n"
            , pos
            );
    }
    if (len_act == len_exp && rule_act == rule_exp) {
        const YYKEYTYPE offset = keys[3 * i];
        *cursor = token + offset;
        return 0;
    } else {
        fprintf
            ( stderr
            , "error: lex_line4: at position %ld (iteration %u):\n"
                "\texpected: match length %ld, rule %u\n"
                "\tactual:   match length %ld, rule %u\n"
            , pos
            , i
            , len_exp
            , rule_exp
            , len_act
            , rule_act
            );
        return 1;
    }
}

int lex_line4()
{
    const size_t padding = 2; /* YYMAXFILL */
    int status = 0;
    size_t input_len = 0;
    size_t keys_count = 0;
    YYCTYPE *input = NULL;
    YYKEYTYPE *keys = NULL;
    const YYCTYPE *cursor = NULL;
    const YYCTYPE *limit = NULL;
    const YYCTYPE *token = NULL;
    const YYCTYPE *eof = NULL;
    unsigned int i = 0;

    input = (YYCTYPE *) read_file
        ("example.c.line4.input"
        , sizeof (YYCTYPE)
        , padding
        , &input_len
        );
    if (input == NULL) {
        status = 1;
        goto end;
    }

    keys = (YYKEYTYPE *) read_file
        ("example.c.line4.keys"
        , 3 * sizeof (YYKEYTYPE)
        , 0
        , &keys_count
        );
    if (keys == NULL) {
        status = 1;
        goto end;
    }

    cursor = input;
    limit = input + input_len + padding;
    eof = input + input_len;

    for (i = 0; status == 0 && i < keys_count; ++i) {
        token = cursor;
        YYCTYPE yych;

        if (YYLESSTHAN (2)) YYFILL(2);
        yych = YYPEEK ();
        switch (yych) {
        case '0':
        case '1':
        case '2':
        case '3':
        case '4':
        case '5':
        case '6':
        case '7':
        case '8':
        case '9':
        case 'A':
        case 'B':
        case 'C':
        case 'D':
        case 'E':
        case 'F':
        case 'a':
        case 'b':
        case 'c':
        case 'd':
        case 'e':
        case 'f':    goto yy4;
        default:    goto yy2;
        }
yy2:
        YYSKIP ();
yy3:
        status = action_line4(i, keys, input, token, &cursor, 254);
        continue;
yy4:
        YYSKIP ();
        yych = YYPEEK ();
        switch (yych) {
        case '0':
        case '1':
        case '2':
        case '3':
        case '4':
        case '5':
        case '6':
        case '7':
        case '8':
        case '9':
        case 'A':
        case 'B':
        case 'C':
        case 'D':
        case 'E':
        case 'F':
        case 'a':
        case 'b':
        case 'c':
        case 'd':
        case 'e':
        case 'f':    goto yy5;
        default:    goto yy3;
        }
yy5:
        YYSKIP ();
        status = action_line4(i, keys, input, token, &cursor, 0);
        continue;

    }
    if (status == 0) {
        if (cursor != eof) {
            status = 1;
            const long pos = token - input;
            fprintf(stderr, "error: lex_line4: unused input strings left at position %ld\n", pos);
        }
        if (i != keys_count) {
            status = 1;
            fprintf(stderr, "error: lex_line4: unused keys left after %u iterations\n", i);
        }
    }

end:
    free(input);
    free(keys);

    return status;
}

#undef YYCTYPE
#undef YYKEYTYPE
#undef YYPEEK
#undef YYSKIP
#undef YYLESSTHAN
#undef YYFILL

int main()
{
    if(lex_line4() != 0) {
        return 1;
    }
    return 0;
}

re2c generated two auxilary functions: read_file and action_line4. read_file is used to map .input and .keys files into memory (this function is shared between all lexers). action_line4 is a replacement for actions: it compares actual lexing results with the expected results. This function is specific to each lexer. Lexing is done by lex_line4: this function contains the generated DFA. main simply calls the lexer.

Compile and run:

$ gcc -o example example.c
$ ./example
$ echo $?
0

When everything is fine, the program outputs nothing and exits with zero.

Failures

Skeleton’s main purpose was to find and prevent errors in re2c code generation (which it did). I don’t have a ready example of failure (otherwise I would have fixed re2c), but we can trigger various errors by alternating the generated program, input and keys.

Artificial examples

Let’s take the first key in example.c.line4.keys (bytes 0x01, 0x01 and 0xFE) and mangle it in various ways. For example, let’s change the first byte (length of current input string) to 0xFF:

$ re2c -S -o example.c hex2.re
$ sed -i -e 's/\x01\x01\xfe/\xff\x01\xfe/' example.c.line4.keys
$ ./example
error: lex_line4: unused input strings left at position 255
error: lex_line4: unused keys left after 1 iterations

Lexer complains about unused keys and input strings. The alternated parameter controls offset of the next string: lexer was forced to jump past the end of input. Now let’s alter the second byte (length of the matching prefix):

$ re2c -S -o example.c hex2.re
$ sed -i -e 's/\x01\x01\xfe/\x01\x03\xfe/' example.c.line4.keys
$ ./example
error: lex_line4: at position 0 (iteration 0):
        expected: match length 3, rule 254
        actual:   match length 1, rule 254

Third byte stands for rule number:

$ re2c -S -o example.c hex2.re
$ sed -i -e 's/\x01\x01\xfe/\x01\x01\xfd/' example.c.line4.keys
$ ./example
error: lex_line4: at position 0 (iteration 0):
        expected: match length 1, rule 253
        actual:   match length 1, rule 254

Now let’s change the input string:

$ re2c -S -o example.c hex2.re
$ sed -i -e 's/@/00/' example.c.line4.input
$ ./example
error: lex_line4: at position 3 (iteration 3):
        expected: match length 1, rule 254
        actual:   match length 2, rule 0

Finally, let’s break the generated code:

$ re2c -S -o example.c hex2.re
$ sed -i -e "s/case '7'://" example.c
$ gcc -o example example.c
$ ./example
error: lex_line4: at position 248 (iteration 241):
        expected: match length 2, rule 0
        actual:   match length 1, rule 254

And so on.

Undefined control flow

One special case of failure is caused by undefined control flow in the generated lexer. Suppose, for example, that we forgot to handle default case in the example above:

1
2
3
/*!re2c
    [0-9a-fA-F]{2} {}
*/

In this case re2c generates code that is perfectly correct, but because of the undefined control flow skeleton program fails:

$ re2c -S -o example.c hex2.re
$ gcc -o example example.c
$ ./example
warning: lex_line3: control flow is undefined for input at position 72, rerun re2c with '-W'
error: lex_line3: at position 72 (iteration 36):
        expected: match length 0, rule 255
        actual:   match length 3, rule 0

In this case we are lucky: lexer erroneously hit an action and was terminated. We got a nice error and a warning that suggests that we should rerun re2c with -W:

$ re2c -W -S -o example.c hex2.re
re2c: warning: line 3: control flow is undefined for strings that match
        '[\x0-\x2F\x3A-\x40\x47-\x60\x67-\xFF]'
        '[\x30-\x39\x41-\x46\x61-\x66] [\x0-\x2F\x3A-\x40\x47-\x60\x67-\xFF]'
, use default rule '*' [-Wundefined-control-flow]

However, it could be much worse: segfault or eternal loop. One thing is for sure: the generated input would have triggered undefined control flow anyway.

Generating data

One cannot just run lexer on all possible inputs: for an alphabet of n characters and input length m, the number of input strings is n to the power of m. The problem can be mitigated by tailoring the input for a particular lexer.

First, re2c constructs skeleton of the given DFA. For encodings with 1-byte code units skeleton is an exact copy of the original DFA. For encodings with multibyte code units skeleton is a copy of DFA with certain edges omitted: for each continuous range of code units (that corresponds to a set of edges between two DFA states) re2c takes at most 256 code units. The chosen values are evenly distributed and include range bounds. (In case of 1-byte code units 256 values are enough to cover the whole range.)

Second, instead of trying to cover all possible inputs, re2c generates skeleton path cover: a set of paths from start node to end node such that each skeleton edge is covered by at least one path. The number of paths in path cover is bounded by the number of edges in skeleton. However, the total size of path cover depends quadratically on the number of edges: an example is shown below and upper bound follows from the fact that total size is bounded by the number of paths times maximal path length, both values obviously bounded by the number of edges.

The algorithm’s implementation is limited by ~1Gb of edges and consumes constant amount of RAM (re2c dumps data to file as soon as it is generated). We can study the algorithm simply alternating repetition counter in the example above:

/*!re2c
    *              {}
    [0-9a-fA-F]{1} {} // [0-9a-fA-F]{2}, [0-9a-fA-F]{3}, [0-9a-fA-F]{4}, ...
*/

Each increment of the repetition counter adds 256 new edges the DFA. Compare the following pictures of DFA with one, two and three repetitions:

../../../_images/example_next.png

We will use a simple script that dumps re2c source file, generates skeleton program and measures the size of .input and .keys files:

[gen.sh]

1
2
3
4
5
6
7
8
printf "%10s%10s%10s%10s\n\n" iters edges input keys
for i in `seq $1 $2`
do
    echo '/*!re2c * {} [0-9a-fA-F]{'$i'} {} */' | re2c -S -o example.c -
    input=`stat -c '%s' example.c.line1.input`
    keys=`stat -c '%s' example.c.line1.keys`
    printf "%10d%10d%10d%10d\n" $i $((256 * i)) $input $keys
done

Script runs in a loop and expects two arguments: lower and upped bounds of the iteration counter.

$ ./gen.sh 1 16
     iters     edges     input      keys

         1       256       256       768
         2       512       746      1470
         3       768      1470      2172
         4      1024      2428      2874
         5      1280      3620      3576
         6      1536      5046      4278
         7      1792      6706      4980
         8      2048      8600      5682
         9      2304     10728      6384
        10      2560     13090      7086
        11      2816     15686      7788
        12      3072     18516      8490
        13      3328     21580      9192
        14      3584     24878      9894
        15      3840     28410     10596
        16      4096     32176     11298

The following pictures show how .keys and .input file size depends on the number of edges (repetition counters 1 through 1024):

../../../_images/plot.png

Keys grow linearly with the number of edges in skeleton (a key corresponds to a single path in path cover). The change of angle is due to the change of key size: starting from repetition count 256 path length gets out of 1-byte range.

Size of input grows quadratically with the number of edges in skeleton. We can calculate the exact analythic function as follows. Each repetition increment adds 256 new edges: 22 edges corresponding to [0-9a-fA-F] that go to accepting state and 234 remaining edges that go to default state. With a bit of common sense (or knowledge of re2c algorithm) one can see that path cover must include 22 paths of length n (that go to accepting state) plus all the default paths (234 paths of lenght 1, plus 234 paths of length 2 and so on up to 234 paths of length n):

22*1 + 234*1
22*2 + (234*1 + 234*2)
22*3 + (234*1 + 234*2 + 234*3)
...
f(n) = 22*n + (234*1 + 234*2 + ... + 234*n)
     = 22*n + 234*n*(n - 1)/2
     = 117*n^2 + 139*n

However, in practice maximal path length is usually less than a hundred and the size of .input files is quite acceptable even for large graphs.

Use

Code verification

When re2c transforms regular expressions to code, it uses several internal representations:

  1. Regular expressions are parsed and transformed into abstract syntax tree (AST).
  2. AST is compiled to bytecode.
  3. Bytecode is used to construct deterministic finite automaton (DFA).
  4. DFA undergoes some transformations.
  5. Transformed DFA is compiled to C/C++ code.

Skeleton is constructed right after stage 3, before any additional changes have been made to DFA. Skeleton is in fact a simplified copy of DFA (better suited for deriving data and freed of all irrelevant information). It is used only to generate .input and .keys files. The rest of skeleton program is the usual re2c-generated code: -S, --skeleton option only resets environment bindings, but does not affect any code generation decisions.

Skeleton programs are therefore capable of catching various errors in code generation. In other words, it is much safer to rebalance nested if statements, add some novel dispatch mechanism, apply various code deduplication or inlining optimizations and otherwise reorganize and tweak the generated code.

Benchmarks

Another direct application of skeleton is benchmarking. There is no need to construct benchmarks by hand: re2c automatically converts any real-world program to a ready benchmark and provides input data. One only needs to measure execution time of the generated program (and perhaps adjust it by running the main loop multiple times).

Executable specs

Because of the fact that skeleton ignores all non-re2c code (including actions), skeleton programs don’t need any additional code. One can just sketch a regular expression specification and immediately get an executable program.