Release notes

Release 1.2.1

This is a minor release that improves some of the features introduced in 1.2. First, it is now possible to reset re2c:eof configuration to the default value by setting it to -1. Second, re2c now installs standard include files to $(datadir)/re2c/stdlib, where $(datadir) is /usr/share by default (but can be configured otherwise). When looking for include files, re2c automatically tries this directory — there is no need to add it manually with -I. Both issues have been reported by Wesley W. Terpstra. Thank you!

See the changelog 1.2.1 (2019-08-11) for details.

Release 1.2

This release adds a handful of useful features and finalizes two-year research effort on POSIX disambiguation.

First and foremost, EOF rule has been added to re2c. It is a new method of checking for the end of input, which aims at being both generic and simple to use. Historically re2c has a number of ways to handle EOF situation. The simplest and the most efficient method is using a sentinel symbol, but it has limited applicability. The most widely used and generic method is bounds checking with padding, but it is rather complex and requires input buffering. Finally, it is possible to define custom EOF handling methods by using the generic API. However, such methods are not necessarily correct and are likely to be slower than other methods. The new method, EOF rule, tries to combine the simplicity and efficiency of the sentinel symbol and the generality of bounds checking. Essentially, all the user is required to do is to specify an arbitrary “fake” sentinel using re2c:eof configuration (zero is usually a good choice), to define a special EOF rule $, and to make sure that the fake sentinel is always the last symbol in the input buffer. The lexer will then run without end of input checks until it hits the branch corresponding to the fake sentinel, at which point it will perform additional bounds checking. If indeed it is the end, the lexer will try to get more input. If this attempt fails, the lexer will either rollback to the last matched rule (or the default rule, if it has consumed some input but did not match), or else it will go to EOF rule. Depending on each particular case and the choice of sentinel, the performance of EOF rule may be slightly better or worse than bounds checking with padding. See the user manual for more details and examples.

Another important new feature is the possibility to include other files. Re2c provides a directive /*!include:re2c "file.re" */, where file.re is the name of the included file. Re2c looks for included files in the directory of the including file and in the include locations, which can be specified with -I option. Include files may have further includes of their own. Re2c provides some predefined include files that can be found in the include/ sub-directory of the project. These files contain useful definitions and form something like a standard library for re2c. Currently there is one available include, unicode_categories.re.

Re2c now allows to generate header file from the input file using option -t --type-header or re2c:flags:t and re2c:flags:type-header configurations and the newly added directives /*!header:re2c:on*/ and /*!header:re2c:off*/. Header files may be needed in cases when re2c is used to generate definitions of constants, variables and structs that must be visible from other translation units.

Re2c can now understand UTF8-encoded string literals and character classes in regular expressions. By default, re2c parses regular expressions like "∀x ∃y" as a sequence of 1-byte ASCII code points e2 88 80 78 20 e2 88 83 79 (hexadecimal), and the users have to escape Unicode symbols manually: "\u2200x \u2203y". This is not what most users would expect, as demonstrated by multiple issues on the bugtracker. The new option --input-encoding <ascii | utf8> allows to change the default behavior and parse "∀x ∃y" as a sequence of 4-byte Unicode code points 2200 78 20 2203 79 (hexadecimal). Note that the new option affects only parsing, and not the code generation.

Re2c now allows to mix reusable blocks with normal blocks when -r --reuse option is used. This is very useful in cases when the input file contains many re2c blocks, and only some of them need to be reused (a good example is re2c own lexer, which has to use certain rules twice in order to generate ASCII and UTF8 lexer variants for --input-encoding <ascii | utf8> option).

It is now possible to specify location format in re2c-generated messages with --location-format <gnu | msvc> option. As one might guess, possible arguments are GNU location format filename:line:column: (the default) and MSVC format filename(line,column). This option might help IDE users. Another new option is --verbose — it prints a short “success” message in case re2c exits successfully.

Flex compatibility mode (enabled with -F --flex-support option) has been improved: parsing errors and incorrect operator precedence in some rare cases have been fixed. Historically re2c allows to mix flex-style code with re2c-style code, which creates certain difficulties for the parser. Such difficulties can be resolved by passing parts of the parser context to the lexer.

The difference operator / is now applied to its operands before encoding expansion. Re2c supports difference only for character classes, and encoding expansion in case of a variable-length encoding may transform even a simple character class into a complex graph (see the graph for [^] in UTF8 for example). Applying expansion after difference makes / applicable in more cases.

Re2c now generates the output file atomically: it first creates a temporary file and writes the generated output into it, and then renames the temporary file to the output file.

Documentation has been rearranged and updated, and the website has been fixed to look somewhat better on mobile devices (hopefully).

From developers standpoint, re2c now has better debugging capabilities. Debug code has been moved in a separate subsystem and is no longer built in release binaries (debug builds are enabled with --enable-debug configure option). New debug options and passes have been added.

This release has taken almost a year, mostly because of the time needed to formalize the extended algorithm for POSIX disambiguation and write a formal paper Efficient POSIX Submatch Extraction on NFA. The algorithms described in the paper are implemented in an experimental library libre2c. The library is intended as a playground for experiments and research, not a replacement for existing libraries like RE2. Having re2c as a library is convenient for benchmarking, testing and creating bindings to other languages. To build the library and benchmarks, configure re2c with --enable-libs option.

Many people have contributed to this release. Angelo Borsotti co-authored the paper; his hard work has revealed multiple shortcomings and deficiencies in the algorithms used in re2c. Henri Salo helped with fuzz-testing re2c using the American Fuzzy Lop. Denis Naumov helped with Windows portability issues. Sergei Trofimovich fixed a number of bugs; he and Serghei Iakovlev helped with the Travis CI infrastructure. Wesley Terpstra used re2c in the open-source wake build tool. Many thanks to them and all the others who contributed to this release!

A lot of bugs have been fixed. See the changelog 1.2 (2019-08-02) for details.

Release 1.1.1

This is a minor bug-fix release. It fixes a crash with -V --vernum option. The crash was reported and fixed by Mike Gilbert, and later rediscovered by Daniel J. Luke and Perry E. Metzger. Sergei Trofimovich made a unit test, as the -V --vernum option has already caused some troubles in the past. See the changelog 1.1.1 (2018-08-30) for details. Thanks everyone!

Release 1.1

This release doesn’t bring in any serious user-visible changes; most of the work concentrated on the internal POSIX disambiguation algorithm used for submatch extraction with --posix-captures option. The old algorithm, introduced in release 1.0, was based on the work of Chris Kuklewicz (outlined on this haskell wiki page and formalized in the paper Tagged Deterministic Finite Automata with Lookahead). Kuklewicz algorithm was replaced with a totally different algorithm based on the work of Satoshi Okui and Taro Suzuki (described in the paper “Disambiguation in Regular Expression Matching via Position Automata with Augmented Transitions”, year 2013). The main effort in this release was in developing proper theoretical foundations for the new algorithm. As usual, there has been a couple of bug fixes.

See the changelog 1.1 (2018-08-27) for details.

Release 1.0.3

Yet another minor bug-fix release. It fixes a build failure on Mac OS X (thanks to Ryan Shmidt for the bug report). See the changelog 1.0.3 (2017-11-08) for details.

Release 1.0.2

This is another minor bug-fix release. It addresses some documentation issues (thanks to Ziris85 for the bug report). See the changelog 1.0.2 (2017-08-26) for details.

Release 1.0.1

This is a minor bug-fix release. It fixes a build failure on Mac OS X (thanks to ilovezf for the bug report). See the changelog 1.0.1 (2017-08-11) for details.

Release 1.0

This release is an important milestone in the history of re2c: it adds submatch extraction. It is a long-awaited feature; even as Peter Bumbulis released the first version of re2c back in 1993, he mentioned submatch extraction in his paper together with Donald D. Cowan “RE2C: a more versatile scanner generator” as a useful feature that soon should be added to re2c. It seemed back then that submatch extraction is relatively easy to implement; like many other useful extensions, it was deemed to be no more than syntactic sugar on top of canonical regular expressions.

As it turned out, this is not so: the problem of parsing with regular expressions is inherently more complex than recognition, and submatch extraction cannot be implemented using traditional DFA-based approach. Many authors studied this subject and developed algorithms suitable for their particular settings and problem domains. Some of them used NFA-based approach, others suggested algorithms based on multiple DFA or multiple passes over a single DFA. For re2c most of these algorithms are too heavyweight: they require complication of the underlying computational model and incur overhead even in simple cases.

However, one algorithm is outstanding in this respect: it is the algorithm invented by Ville Laurikari in 2000 and described in his paper “NFAs with Tagged Transitions, their Conversion to Deterministic Automata and Application to Regular Expressions”. Laurikari algorithm is based on a single DFA extended with a fixed number of registers; it makes one pass over the input string, works in linear time and the required space does not depend on the input length. But what is most important, the added complexity depends on submatch detalization: on submatch-free regular expressions Laurikari automaton reduces to a simple DFA.

When the work on submatch extraction feature started in early 2016, at first it seemed that Laurikari algorithm was still overly complicated. Re2c needed an algorithm that would support only partial submatch extraction, but the supported cases would be handled very efficiently. A lot of effort was spent on studying papers, and finally this lead to the development of a new algorithm. It soon became clear that the self-invented algorithm was in fact a special case of Laurikari algorithm, with one significant performance improvement: the use of lookahead symbol. Surprisingly, the original Laurikari algorithm has a strange deficiency: when recording submatches, the automata totally ignore the lookahead symbol and perform a lot of redundant operations that could be easily avoided. It was only logical to fix Laurikari algorithm. The new improved algorithm is described in the paper “Tagged Deterministic Finite Automata with Lookahead”

Submatch extraction in re2c comes in two flavors: POSIX-compliant capturing groups (with true POSIX semantics) and standalone tags that use leftmost greedy disambiguation. Both features require a number of new API primitives and configurations; and they are barely documented yet (though I managed to add a section in the manual). The paper cited above has some interesting benchmark results; articles and examples will follow in the nearest future. As usually, you may expect a number of small bug-fix releases in the 1.0.x series.

The following people contributed to this release:

  • Petr Skocik tidied up re2c documentation: fixed ubiquitous spelling errors and established synchronization system between the online documentation, manpage and help message displayed by re2c with --help option.
  • Sergei Trofimovich wrote a fuzzer that generates random regular expressions and checks them on input data generated with re2c --skeleton option. This fuzzer is most helpful; it has already found tens of bugs in the early implementation of submatch extraction. Sergei also helped me with testing re2c on numerous platforms and computer architectures.
  • Abs62, asmwarrior, Ben Smith, CRCinAU, Derick Rethans, Dimitri John Ledkov, Eldar Zakirov, jcfp, Jean-Claude Wippler, Jeff Trull, Jérôme Dumesnil, Jesse Buesking, Julian Andres Klode, Paulo Custodio, Perry E. Metzger, philippschaefer, Ross Burton, Rui Maciel, Ryan Mast, Samuel006, sirzooro, Tim Kelly — all these people contributed bug reports, patches and feedback. Thank you!
  • Many thanks to the early user of the new features João Paredes for his patience and encouraging comments on the subject.

See the changelog 1.0 (2017-08-11) for details.

Release 0.16

This release adds a very important step in the process of code generation: minimization of the underlying DFA (deterministic finite automaton). Simply speaking, this means that re2c now generates less code (while the generated code behaves exactly the same).

DFA minimization is a very well-known technique and one might expect that any self-respecting lexer generator would definitely use it. So how could it be that re2c didn’t? In fact, re2c did use a couple of self-invented tricks to compress the generated code (one interesting technique is the construction of a tunnel automaton). Some of these tricks were quite buggy (see this bug report for example). Now that re2c does canonical DFA minimization, all this stuff is obsolete and has been dropped.

A lot of attention has been paid to the correctness of the DFA minimization. Usually, re2c uses a very simple criterion to validate changes: the generated code for all the tests in testsuite must remain the same. However, in the case of DFA minimization the generated code changes dramatically. It is impossible to verify the changes manually.

One possible verification tool is the skeleton feature. Because skeleton is constructed prior to DFA minimization, it cannot be affected by any errors in its implementation.

Another way to verify DFA minimization is by implementing two different algorithms and comparing the results. The minimization procedure has a very useful property: the miminal DFA is unique (with respect to state relabeling). We used the Moore and the so-called table filling algorithms: The Moore algorithm is fast, while table filling is very simple to implement. The --dfa-minimization <moore | table> option that allows choosing the particular algorithm (defaults to moore), but it’s only useful for debugging DFA minimization.

A good side effect of messing with re2c internals is a significant speedup of code generation (see this issue for example). The test suite now runs twice as fast (at least).

See the changelog 0.16 (2016-01-21) for details.

Release 0.15.3

This release fixes multiple build-time and run-time failures on OS X, FreeBSD, and Windows. Most of the problems were reported and fixed by Oleksii Taran (on OS X) and Sergei Trofimovich (on FreeBSD and Windows). Thank you for your help!

Skeleton coverage has been improved.

See the changelog 0.15.3 (2015-12-02) for details.

Release 0.15.2

This release fixes a bug in the build system: it adds a missing dependency of the lexer on the bison-generated header (reported on Gentoo bugtracker). See the changelog 0.15.2 (2015-11-23) for details.

Release 0.15.1

This release fixes an error in the testing script (thanks to Sergei Trofimovich who reported the bug and suggested a quick fix). See the changelog 0.15.1 (2015-11-22) for details.

Release 0.15

This release started out in the spring of 2015 as a relatively simple code cleanup.

It focused on the following problem: re2c used to repeat the whole generation process multiple times. Some parts of the generated program depend on the overall input statistics; they cannot be generated until the whole input has been processed. The usual strategy is to make stubs for all such things and fix them later. Instead, re2c used to process the whole input, gather statistics, discard the generated output, and regenerate it from scratch. Moreover, each generation pass further duplicated certain calculations (for similar reasons). As a result, the code generation phase was repeated four times.

The problem here was not inefficiency: re2c is fast enough to allow the 4x overhead. The real problem was the complexity of the code: the developer has to think of multiple execution layers all the time. Some parts of code are only executed on certain layers and affect each other. Simple reasoning gets really hard.

So the main purpose of this release was to simplify the code and make it easier to fix bugs and add new features. Very soon it became clear that some of the changes in code generation are hard to verify by hand. For example, even a minor rebalancing of if and switch statements may change the generated code significantly. It was clear that re2c needed an automatic verification tool, which lead to the idea of generating skeleton programs.

Meanwhile some work was done on adding warnings, updating the build system and fixing various bugs. A heart-warming consequence of the code simplification is that re2c now uses re2c more extensively: not only the main program, but also the command-line options, inplace configurations, and all kinds of escaped strings are parsed with re2c. Website has also been updated (feel free to suggest improvements).

See the changelog 0.15 (2015-11-22) for details.