Arbitrary large input and YYFILLΒΆ

In this example we suppose that input cannot be mapped in memory at once: either it’s too large or its size cannot be determined in advance. The usual thing to do in such case is to allocate a buffer and lex input in chunks that fit into buffer. re2c allows us to refill buffer using YYFILL: see Recognizing strings: the need for YYMAXFILL example for details about program points and conditions that trigger YYFILL invocation. Currently re2c provides no way to combine YYFILL with the sentinel method: we have to enable YYLIMIT-based checks for the end of input and pad input with YYMAXFILL fake characters. This may be changed in later versions of re2c.

The idea of YYFILL is fairly simple: lexer is stuck upon the fact that (YYLIMIT - YYCURSOR) < n and YYFILL must either invert this condition or stop lexing. Disaster will happen if YYFILL fails to provide at least n characters, yet resumes lexing. Technically YYFILL must somehow “extend” input for at least n characters: after YYFILL all input pointers must point to exact same characters, except YYLIMIT: it must be advanced at least n positions. Since we want to use a fixed amount of memory, we have to shift buffer contents: discard characters that are already lexed, move the remaining characters at the beginning of the buffer and fill the vacant space with new characters. All the pointers must be decreased by the length of discarded input, except YYLIMIT (it must point at the end of buffer):

<--- discarded -->                              <----- n ----->
oxxxxxxxxxxxxxxxxxo----------o-----------------o--------o.....o..........o... (more input)
buffer          lexeme    YYMARKER          YYCURSOR YYLIMIT *          *
|             *          *                 *            | *          *
|          *          *                 *              *|         *
|       *          *                 *              *   |      *
|    *          *                 *              *      |   *
| *          *                 *              *         |*
o-----------o-----------------o--------------o----------o
buffer,  YYMARKER          YYCURSOR                  YYLIMIT
lexeme

End of input is a special case: as explained in Recognizing strings: the need for YYMAXFILL example, the input must be padded with YYMAXFILL fake characters. In this case YYLIMIT must point at the end of padding:

<--- discarded -->                              <----- n ----->
oxxxxxxxxxxxxxxxxxo----------o-----------------o---o0000000000000000o
buffer          lexeme    YYMARKER          YYCURSOR YYLIMIT       *
|             *          *                 *   *        |       *
|          *          *                 *   *           |    *
|       *          *                 *   *              | *
|    *          *                 *   *                *|
| *          *                 *   <-- YYMAXFILL -->*   |
o-----------o-----------------o---o0000000000000000o    |
buffer,  YYMARKER          YYCURSOR             YYLIMIT
lexeme

Which part of input can be discarded? The answer is, all input up to the leftmost meaningful pointer. Intuitively it seems that it must be YYMARKER: it backups input position of the latest match, so it’s always less than or equal to YYCURSOR. However, YYMARKER is not always used and even when it is, its usage depends on the input: not all control flow paths in lexer ever initialize it. Thus for some inputs YYMARKER is meaningless and should be used with care. In practice input rarely consists of one giant lexeme: it is usually a sequence of small lexemes. In that case lexer runs in a loop and it is convenient to have a special “lexeme start” pointer. It can be used as boundary in YYFILL.

Our example program reads stdin in chunks of 16 bytes (in real word buffer size is usually ~4Kb) and tries to lex numbers separated by newlines.

[03_arbitrary_large_input.re]

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
#include <stdio.h>
#include <string.h>

/*!max:re2c*/
static const size_t SIZE = 16;

struct input_t {
    char buf[SIZE + YYMAXFILL];
    char *lim;
    char *cur;
    char *tok;
    bool eof;

    input_t()
        : buf()
        , lim(buf + SIZE)
        , cur(lim)
        , tok(lim)
        , eof(false)
    {}
    bool fill(size_t need)
    {
        if (eof) {
            return false;
        }
        const size_t free = tok - buf;
        if (free < need) {
            return false;
        }
        memmove(buf, tok, lim - tok);
        lim -= free;
        cur -= free;
        tok -= free;
        lim += fread(lim, 1, free, stdin);
        if (lim < buf + SIZE) {
            eof = true;
            memset(lim, 0, YYMAXFILL);
            lim += YYMAXFILL;
        }
        return true;
    }
};

static bool lex(input_t & in, unsigned int &count)
{
    for (count = 0;;) {
        in.tok = in.cur;
        /*!re2c
            re2c:define:YYCTYPE = char;
            re2c:define:YYCURSOR = in.cur;
            re2c:define:YYLIMIT = in.lim;
            re2c:define:YYFILL = "if (!in.fill(@@)) return false;";
            re2c:define:YYFILL:naked = 1;

            end = "\x00";
            wsp = [\n]+;
            num = [0-9]+;

            *   { return false; }
            end { return YYMAXFILL == in.lim - in.tok; }
            wsp { continue; }
            num { ++count; continue; }
        */
    }
}

int main()
{
    input_t in;
    unsigned int count;
    if (lex(in, count)) {
        printf("glorious %u numbers!\n", count);
    } else {
        printf("error\n");
    }

    return 0;
}

Notes:

  • YYMAXFILL bytes at the end of buffer are reserved for padding. This memory is unused most of the time, but YYMAXFILL is usually negligably small compared to buffer size.
  • There is only one successsful way out (line 60): lexer must recognize a standalone “end of input” lexeme (NULL) exactly at the beginning of padding. YYFILL failure is an error: if the input was correct, lexer should have already stopped.
  • YYFILL may fail for two reasons: either there is no more input (line 23), or lexeme is too long: it occupies the whole buffer and nothing can be discarded (line 27). We treat both cases in the same way (as error), but a real-world program might handle them differently (resize buffer, cut long lexeme in two, etc.).
  • @@ in YYFILL definition (line 52) is a formal parameter: re2c substitutes it with the actual argument to YYFILL.
  • There is a special tok pointer: it points at the beginning of lexeme (line 47) and serves as a boundary in YYFILL.

Generate, compile and run:

$ re2c -o example.cc 03_arbitrary_large_input.re
$ g++ -o example example.cc
$ ./example
0
11
222
3333
44444
555555
6666666
77777777
888888888
9999999999
glorious 10 numbers!
$ seq 123456789 | ./example
glorious 123456789 numbers!
$ seq 123456789 | wc -l
123456789