Release 1.0

by Ulya Trofimovich

This release is an important milestone in the history of re2c: it adds submatch extraction. It is a long-awaited feature; even as Peter Bumbulis released the first version of re2c back in 1993, he mentioned submatch extraction in his paper together with Donald D. Cowan “RE2C: a more versatile scanner generator” as a useful feature that soon should be added to re2c. It seemed back then that submatch extraction is relatively easy to implement; like many other useful extensions, it was deemed to be no more than syntactic sugar on top of canonical regular expressions.

As it turned out, this is not so: the problem of parsing with regular expressions is inherently more complex than recognition, and submatch extraction cannot be implemented using traditional DFA-based approach. Many authors studied this subject and developed algorithms suitable for their particular settings and problem domains. Some of them used NFA-based approach, others suggested algorithms based on multiple DFA or multiple passes over a single DFA. For re2c most of these algorithms are too heavyweight: they require complication of the underlying computational model and incur overhead even in simple cases.

However, one algorithm is outstanding in this respect: it is the algorithm invented by Ville Laurikari in 2000 and described in his paper “NFAs with Tagged Transitions, their Conversion to Deterministic Automata and Application to Regular Expressions”. Laurikari algorithm is based on a single DFA extended with a fixed number of registers; it makes one pass over the input string, works in linear time and the required space does not depend on the input length. But what is most important, the added complexity depends on submatch detalization: on submatch-free regular expressions Laurikari automaton reduces to a simple DFA.

As I started working on the implementation of submatch extraction in re2c (in early 2016), at first I didn’t appreciate the elegance of Laurikari approach; it still seemed overly complicated for re2c. I studied many papers and tried to invent an algorithm that would support only partial submatch extraction, but the supported cases would be handled very efficiently. As I developed the algorithm, I came to realize that it was in fact but a special case of Laurikari algorithm. Naturally, I turned back to the original paper and started to adapt my implementation. To my surprise, I discovered a strange deficiency in Laurikari algorithm compared to my early implementation: when recording submatches, Laurikari automata totally ignore the lookahead symbol and perform a lot of redundant operations that could be easily avoided. It was only logical to fix Laurikari algorithm; but as I worked on it, a couple of other issues showed up, and so it happened that I ended up writing a whole paper on the subject: “Tagged Deterministic Finite Automata with Lookahead”

Submatch extraction in re2c comes in two flavors: POSIX-compliant capturing groups (with true POSIX semantics) and standalone tags that use leftmost greedy disambiguation. Both features require a number of new API primitives and configurations; and they are barely documented yet (though I managed to add a section in the manual). The paper cited above has some interesting benchmark results; articles and examples will follow in the nearest future. As usually, you may expect a number of small bug-fix releases in the 1.0.x series.

The following people contributed to this release:

  • Petr Skocik tidied up re2c documentation: fixed ubiquitous spelling errors and established synchronization system between the online documentation, manpage and help message displayed by re2c with --help option.
  • Sergei Trofimovich wrote a fuzzer that generates random regular expressions and checks them on input data generated with re2c --skeleton option. This fuzzer is most helpful; it has already found tens of bugs in the early implementation of submatch extraction. Sergei also helped me with testing re2c on numerous platforms and computer architectures.
  • Abs62, asmwarrior, Ben Smith, CRCinAU, Derick Rethans, Dimitri John Ledkov, Eldar Zakirov, jcfp, Jean-Claude Wippler, Jeff Trull, Jérôme Dumesnil, Jesse Buesking, Julian Andres Klode, Paulo Custodio, Perry E. Metzger, philippschaefer, Ross Burton, Rui Maciel, Ryan Mast, Samuel006, sirzooro, Tim Kelly — all these people contributed bug reports, patches and feedback. Thank you!
  • Many thanks to the early user of the new features João Paredes for his patience and encouraging comments on the subject.

Below is a coarse list of all changes (a raw excerpt from changelog):

  • Added options:
    • -P --posix-captures (POSIX-compliant capturing groups)
    • -T --tags (standalone tags with leftmost greedy disambiguation)
    • --no-lookahead
    • --no-optimize-tags
    • --eager-skip
    • --dump-nfa
    • --dump-dfa-raw
    • --dump-dfa-det
    • --dump-dfa-tagopt
    • --dump-dfa-min
    • --dump-adfa
  • Added new syntax:
    • @<stag>
    • #<mtag>
  • Added new directives:
    • /*!stags:re2c ... */
    • /*!mtags:re2c ... */
    • /*!maxnmatch:re2c ... */
  • Added new API:
    • YYSTAGN (t)
    • YYSTAGP (t)
    • YYMTAGN (t)
    • YYMTAGP (t)
    • YYRESTORETAG (t)
    • YYMAXNMATCH
    • yynmatch
    • yypmatch
  • Added inplace confgurations:
    • re2c:define:YYSTAGN
    • re2c:define:YYSTAGP
    • re2c:define:YYMTAGN
    • re2c:define:YYMTAGP
    • re2c:define:YYRESTORETAG
    • re2c:flags:8 or re2c:flags:utf-8``
    • re2c:flags:b or re2c:flags:bit-vectors
    • re2c:flags:case-insensitive
    • re2c:flags:case-inverted
    • re2c:flags:d or re2c:flags:debug-output
    • re2c:flags:dfa-minimization
    • re2c:flags:eager-skip
    • re2c:flags:e or re2c:flags:ecb
    • re2c:flags:empty-class
    • re2c:flags:encoding-policy
    • re2c:flags:g or re2c:flags:computed-gotos
    • re2c:flags:i or re2c:flags:no-debug-info
    • re2c:flags:input
    • re2c:flags:lookahead
    • re2c:flags:optimize-tags
    • re2c:flags:P or re2c:flags:posix-captures
    • re2c:flags:s or re2c:flags:nested-ifs
    • re2c:flags:T or re2c:flags:tags
    • re2c:flags:u or re2c:flags:unicode
    • re2c:flags:w or re2c:flags:wide-chars
    • re2c:flags:x or re2c:flags:utf-16
    • re2c:tags:expression
    • re2c:tags:prefix
  • Added warning -Wnondeterministic-tags

  • Added fuzz-testing scripts

  • Added paper “Tagged Deterministic Finite Automata with Lookahead”

  • Fixed bugs:
    • #121 “trailing contexts are fundamentally broken”
    • #135 “In installation make check give syntax error”
    • #137 “run_tests.sh fail when running configure script with absolute path”
    • #138 “website improvement”
    • #141 “Tests under Windows”
    • #142 “segvault with null terminated input”
    • #145 “Values for enum YYCONDTYPE are not generated when default rules with conditions are used”
    • #147 “Please add symbol name to “can’t find symbol” error message”
    • #152 “Line number in #line directive after enum YYCONDTYPE is 0-based”
    • #156 “Build with Visual Studio 14 2015: symbol name conflict”
    • #158 “Inconsistent forward declaration of struct/class vs definition”
    • #160 “Open text files with “wb” causes issues on Windows”
    • #162 “Reading files with “rb” causes issues in Windows”
    • #165 “Trailing context consumed if initial expression matches it”
    • #176 “re2c help message is too wide for most terminals”
    • #184 “Small documentation issue”
    • #186 “Difference operator sometimes doesn’t work with utf-8”
  • Merged pull requests:
    • #131 “Use bash-specific [[ builtin”
    • #136 “Added basic support for travis-ci.org integration”
    • #171 “Typo fix”
    • #172 “Grammar fixes in the docs”
    • #173 “Grammar fixes in the manpage”
    • #174 “more documentation fixes”
    • #175 “more manpage fixes”
    • #177 “sync –help output w/ manpage”
    • #178 “Moves rts used in the manpage to master”
    • #179 “compose manpage out of rsts from gh-pages-gen”
    • #189 “Typo fix and small grammatical change”
    • #191 “Makefile.am: create target directory before writing into it”