Release notes

Release 3.1

The main focus of this release has been on internal rather than user-facing changes. The codebase has been cleaned up and migrated to C++11, and the codegen subsystem has undergone considerable simplification and rewrite. Some new features have been added to improve submatch extraction with capturing groups (see below for more details).

Capturing groups. Since release 1.0, re2c supported two ways of submatch extraction: tags with leftmost greedy semantics (enabled with --tags option or re2c:tags configuration) and capturing groups with POSIX longest-match semantics (enabled with --posix-captures option or re2c:posix-captures configuration). Capturing group syntax is traditionally used in regular expression libraries, so it is preferable for some users. However, they found it problematic to use with large regular expressions: compile times are increased due to the complex POSIX disambiguation algorithm, and the generated code is bloated, as every pair of parentheses is a capturing group and it requires runtime operations to extract submatch boundaries. This release addresses both problems. First, it adds --leftmost-captures option and re2c:leftmost-captures configuration that allow using capturing group syntax with leftmost greedy semantics. Second, it adds new syntax (! ...) for non-capturing groups, so that the user can mark only a few selected groups for submatch extraction, and avoid unnecessary overhead on other groups. It is also possible to flip defaults with --invert-captures option or re2c:invert_captures configuration, so that (...) is a capturing group and (! ...) is a non-capturing one.

TDFA paper and removal of experimental algorithms. In previous releases, re2c added a few experimental algorithms for research purposes (alternative types of automata for submatch extraction, different approaches to POSIX disambiguation and epsilon-closure construction, etc.). All these algorithms have been shown less efficient in the final TDFA paper (see the benchmark results). In this release the following experimental algorithms have been removed: staDFA, TDFA(0), backwards-matching, Kuklewicz POSIX disambiguation and GTOP shortest path finding. The corresponding options are deprecated: they are still accepted for backwards compatibility, but do nothing. This is part of the general effort to keep re2c codebase simpler and easier to maintain.

Codebase improvements. Most of the work in this release has been dedicated to simplifying the codebase and migrating it to C++11 standard. Codegen subsystem in particular has been reorganized in four well-defined phases (analyze, generate, fixup, render) and separated from parsing phase. Errors are now handled in a uniform way by checking return codes and propagating them up the call stack. Memory allocation has been improved with the use of slab allocators instead of global free lists. Bison parsers now use pure API. Recursive functions that potentially cause stack overflow have been rewritten in iterative form, and re2c test suite now runs with 256K stack limit on CI. Finally, code style has been unified across the codebase.

Build system. This release adds minimal Bazel integration: it is now possible to use re2c as a tool in other projects with Bazel build systems (see commit 3205c867 for a usage example). This is not intended as a new build system alongside Autotools and CMake; Bazel integration is minimal and does not support developer build features like bootstrap, documentation, etc.

CI improvements. Continuous testing on GithubActions is more extensive now: it covers both build systems (Autotools and CMake) as well as special validation modes, such as ASan, UBSan and Valgrind.

For a full list of changes and bug fixes see the changelog 3.1 (2023-07-19). As always, thanks to all contributors, especially Sergei Trofimovich, Doug Cook and Marcus Boerger!

Release 3.0

This release adds Rust support, a uniform naming scheme for options and configurations, small improvements in code generation and a few bug fixes.

Rust backend. It has been on the wish-list for quite a while, but the real work has started on #372 with raekye doing the initial implementation and pmetzger porting the first examples. After a few weeks of brainstorming on (#re2c), followed by a few months of work, now the Rust backend is fully implemented and ready to use. See the re2rust manual with examples.

The main difficulty with Rust is that it has no goto statement, which makes the usual code generation approach impossible. Normally re2c uses goto-label approach: DFA states are blocks of code with labels, and transitions are goto statements. For Rust a different approach is needed: a DFA is represented with a switch on a numeric state variable, where each state is a separate case. Transitions set state variable to the target state index and loop to the start of the switch. The loop-switch approach can be used for other languages as well, therefore it has been added as an option --loop-switch (available for all backends and mandatory for Rust). The new approach is different enough to require multiple changes in code generation. In loop-switch mode it is impossible to jump into the middle of a state bypassing the skip statement, so the --eager-skip option is enforced, which moves skip statements to transitions. With conditions it is impossible to jump between different blocks, so DFAs for all conditions are merged into one switch, and condition numbers are the indices of the initial DFA states. In storable state mode it is impossible to jump from the YYGETSTATE switch to a DFA state, therefore a separate getstate:re2c detached from the lexer block is not supported. Most of these implementation details are hidden from the user.

Generic API is set by default for Rust, as the use of pointer arithmetic is unsafe and generally discouraged. The YYPEEK operation is automatically wrapped in unsafe because it should avoid bounds checks, which are done by the lexer in a more efficient way. To get rid of the automatic unsafe wrappers use option --no-unsafe or configuration re2c:unsafe on a per-block basis (this can help to verify lexer behavior in debug mode, for example).

Initialization of yych in the generated code is prefixed with #[allow(unused_assignments)]. In fact initialization is not needed (yych is always set before it is used), but the Rust compiler cannot figure it out in cases with complex control flow and emits errors if initialization is missing.

Rust backend and the new re2rust binary are enabled by default, but the build can be configured with --enable-rust, --disable-rust Autoconf options or -DRE2C_BUILD_RE2RUST CMake option.

New aliases. Some configurations that have been added over the years are at odds with the general naming scheme. For example, most of condition-related configurations have re2c:cond: prefix, as in re2c:cond:goto or re2c:cond:divider, but for some reason re2c:condprefix has no column after cond. It is illogical, and it would be better to have re2c:cond:prefix instead. This release adds a bunch of such aliases. Of course, the old names still work, and there is no plan to remove them (the change does not break existing code). Likewise some options have new aliases. For a full list see the changelog.

Code generation. The changes in code generation are minor. The main difference is in the state enumeration: states without incoming transitions do not have numbers anymore. This was required for the loop-switch approach, so that switch cases can have consecutive labels. Other code generation changes include removal of unused YYPEEK statements and slightly different formatting.

For a full list of changes and bug fixes see the changelog 3.0 (2022-01-27). Huge thanks to all contributors, especially raekye and pmetzger!

Release 2.2

This release adds scoping rules and improves code sharing between re2c blocks. It brings in a bunch of new features, most importantly named blocks, local blocks, scoped directives and the ability to include arbitrary blocks in the middle of another block. Much work has been done on improving testing and continuous integration.

Local blocks. This release adds a new kind of blocks: /*!local:re2c ... */, which are exactly like the usual /*!re2c ... */ blocks, except that they do not add their named definitions and configurations to the global scope. The contents of a local block are not visible outside of the block and do not affect subsequent blocks.

Named blocks. Previously all re2c blocks were anonymous. Now it is possible to give block a name: for example, /*!re2c:woot42 ... */ defines a global block named woot42. The name can be used to refer to this block in other parts of the program (for example in directives). Rules blocks can have names as well: a block /*!rules:re2c:woot42 ... */ can be mentioned in a use block /*!use:woot42 ... */ or in a use directive !use:woot42;. Use blocks cannot have their own name; if you need to have a named block that reuses another block, do that with the new !use directive.

Block lists in directives. Previously re2c directives such as max:re2c, stags:re2c, getstate:re2c etc. were global: they accumulated the information across all blocks in the file. Now it is possible to specify a list of colon-separated block names in each directive: for example /*!<directive>:re2c:<name1>:<name2>... */ where <name1>, <name2> and so on are the names of re2c blocks defined somewhere in the file. This allows one to have scoped directives, and also multiple directives per file for different groups of blocks. For example, if the file defines named blocks foo, bar and baz, one can define all s-tag variables in blocks foo and baz with a directive /*!stags:foo:baz ... */.

In-block use directive. Now it is possible to use a block in the middle of another block: for example, a rules block /*!rules:re2c:woot42 ... */ can be used in the middle of another block with a statement !use:woot42;. Rules, named definitions and configurations of the used block are added to those of the current block. The new use directive is more flexible than the old /*!use:re2c ... */ block, because there is no limitation on the number of used blocks, and they don’t have to be used at the beginning of a block.

In-block include directive. The existing /*!include ... */ directive now has an in-block variant. For example, !include "x/y/"; includes a file x/y/ in the middle of a block. The in-block include directive works in the same way as the out-of-block one: the contents of the file are inserted verbatim in place of the directive.

Configurable directives. More directives can now be customized with configurations that define the form of the generated code. Directives max:re2c and maxnmatch:re2c now have an optional format configuration that accepts a template string with interpolated variable @@{max} (or just @@ for short), for example /*!max:re2c format="int maxfill = @@;"; */ would generate code maxfill = <N>; where <N> is the YYMAXFILL value. Directive types:re2c now has optional configurations format and separator: the former accepts a template string with interpolated variables @@{cond} and @@{num} (the name and numeric index of each condition), and the latter specifies the glue text used to join different format pieces together.

Testing. Serghei Iakovlev ported the old and hairy testing script from Bash to Python 3. The new script covers more platforms and is much faster than the old one (and also easier to debug and modify).

See the changelog 2.2 (2021-08-01) for details. As always, a big thanks to all contributors!

Release 2.1.1

This is a minor release that adds a missing file to the release tarballs (thanks to david_david for the bug report). See the changelog 2.1.1 (2021-03-27) for details.

Release 2.1

This release adds continuous integration with GitHub Actions, portability improvements, benchmarks, depfile generation and bugfixes.

GitHub Actions CI. Serghei Iakovlev has added continuous integration with GitHub Actions for Linux, macOS and Windows platforms. Now every change in re2c gets build and tested in various configurations on all three platforms, which (hopefully!) leads to fewer bugs and better support for these platforms. GitHub Actions complement the existing Travis CI.

Benchmarks. A lot of work has been done on adding infrastructure for benchmarks. Previously re2c had some benchmarks, but they were not integrated into the build system, and as a result somewhat difficult to build and run. Now there is a new benchmark subsystem, which includes three groups. The first group contains AOT-compiled benchmarks (a.k.a. lexer generators): regular expressions that are compiled ahead of time to C code, which is further compiled to native code. The currently supported generators are re2c, ragel and kleenex. The second group contains JIT-compiled benchmarks: library algorithms that are based on deterministic automata. Finally, the third group contains interpreted benchmarks: library algorithms that are based on non-deterministic automata. Benchmarks are fully integrated in both build systems (they are enabled with Autotools options --enable-benchmarks, --enable-benchmarks-regenerate and CMake options -DRE2C_BUILD_BENCHMARKS, -DRE2C_REGEN_BENCHMARKS). All benchmarks are regularly built and ran on CI. There is also a script for rendering benchmark results as bar charts.

For lexer generators, benchmark results show that re2c-TDFA(1) and ragel are the fastest algorithms. Their performance is very close, but in non-deterministic cases ragel gives incorrect results due to its inability to remember multiple parallel versions of one tag. On the other hand, kleenex programs are much shorter and easier to read.

Dependency files. This release adds a --depfile option that allows one to generate dependency (.d) files. This is needed in order to track build dependencies in the presence of /*!include:re2c*/ directives. The generated depfiles have the same format as the ones generated with GCC -MF option.

There are also a few other improvements and bugfixes, see the full changelog 2.1 (2021-03-26) for details.

A big warm thank you to all the people who contributed to this release!

Release 2.0.3

This is a minor bugfix release. It fixes an issue when building re2c as a subproject of another CMake project (thanks to Ivo Hanak for the bugreport). See the changelog 2.0.3 (2020-08-22) for details.

Release 2.0.2

This is a minor bugfix release. It makes two changes. First, re2go is now built by default (can be turned off with --disable-golang for Autotools and -DRE2C_BUILD_RE2GO=no for CMake). Second, CMake build files are now packaged into release tarball. See the changelog 2.0.2 (2020-08-08) for details.

Release 2.0.1

This is a minor release that fixes re2c version (as output by re2c -v) for the CMake build system. Thanks marryjianjian for the bug report! See the changelog 2.0.1 (2020-07-29) for details.

Release 2.0

This release adds a Go code generation backend, an alternative CMake build system, free-form generic API, improvements in reuse mode and start conditions, and bugfixes in EOF rule implementation.

Go backend. The new Go code generation backend can be used either with a --lang go re2c option, or as a standalone re2go binary (which is identical to re2c with --lang go set by default). The documentation has been split into separate manpages for re2c and re2go; they are generated from the same text, but have language-specific examples. The re2go binary and manpage need to be enabled explicitly with a configuration parameter --enable-golang for Autotools or -DRE2C_BUILD_RE2GO=yes for CMake. Unlike C, Go does not have pointer arithmetics, therefore re2go programs use integer indices instead of pointers and have generic API enabled by default. Almost all re2c features have been ported to re2go, with the exception of code generation options like --computed-gotos and --case-ranges that need unsupported language features. Behind the scenes adding a new backend required a full rewrite of the code generation subsystem and quite a bit of work on documentation and testing. Thanks to meme for discussing the idea and providing an early implementation prototype. Feedback from the Go community is welcome!

CMake build system. Thanks to ligfx for adding a CMake-based build system. This has been discussed multiple times in the past, and a few attempts at an implementation have been made, but ligfx was the first to contribute a full working implementation. The existing Autotools-based build system continues to be used and maintained for the foreseeable future. In fact, both build systems are continuously tested on Travis CI. See the build instructions page for more details.

Free-form generic API. Generic API now supports two styles: function-like style is enabled with re2c:api:style = functions;, and free-form style is enabled with re2c:api:style = free-form;. Previously only the function-like style was supported, and free-form style could be enabled for a few selected APIs with configurations like re2c:define:YYFILL:naked, etc. Free-form style allows to define API primitives as arbitrary strings of code with interpolated arguments of the form @@{name}, or optionally just @@ if there is a single argument. The @@ marker can be redefined with re2c:api:sigil configuration.

Improved reuse mode and start conditions. The restriction on not using normal (non-conditional) blocks with -c, --start-conditions option has been removed: now it is possible to have both conditional and normal blocks. The switch on conditions is now generated for every block rather than once per file, which allows to have multiple lexers in one file. The combination of -r, --reuse option with -c, --start-conditions and/or -f, --storable-state options now uses correct block-local options instead of global options in condition and state switch, and re2c now generates unique labels in each reuse block.

Fixes in EOF rule. A bug that caused missing fallback states with EOF rule has been fixed. Re2c no longer generates overlapping labels when EOF rule is used with multiple blocks.

Developer changes. The way re2c bootstraps itself has been simplified: instead of always trying to regenerate lexer files if the re2c binary exists in the build directory, re2c now does this only when configured explicitly with --enable-lexers for Autotools or -DRE2C_REBUILD_LEXERS=yes for CMake. This configuration requires passing an explicit RE2C_FOR_BUILD variable that contains a path to the re2c executable that will be used to regenerate lexers. The new bootstrapping scheme allows to avoid build problems caused by dynamically changing dependency graph, as well as hard-to-recover errors when a bad re2c build was used to rebuild itself in a vicious loop. Another developer-visible change is a rework of the testing framework, which now uses in-line comments instead of hard-coding options in test filenames.

Backwards incompatible changes. In a few rare/undocumented cases re2c behavior has been changed in a way that may break existing code:

  • Generic APIs YYSHIFT/YYSHIFTSTAG/YYSHIFTMTAG have been added. Their meaning is to shift the cursor/tag position by a specified offset. The new APIs enable fixed-tags optimization with generic API, which is becoming more important as the generic API is getting used more often. Programs that use both generic API and tags or POSIX capturing groups may need to define the new APIs.

  • Generic APIs YYSTAGPD/YYMTAGPD have been removed. They were used only with the recently added experimental --stadfa option, and the removal is unlikely to break any existing code.

  • In reuse mode configurations like re2c:define:YYCTYPE are no longer reset in each use block. The old behavior was kept only for backwards compatibility. It has caused confusion and inconvenience, as the configurations had to be repeatedly defined in each use block.

  • The re2c:flags:type-header option now treats the header filename as relative to the output directory. The undocumented re2c:flags:output option has been removed.

  • Some internal labels have been renamed, in particular the yyFillLabel prefix has been replaced with yyfill. This should not break any well-formed code, as these labels are not part of the user interface. Programs that relied on the old label names may restore them with the re2c:label:yyFillLabel configuration.

As always, a warm thank you to all the people who contributed to this release, including ligfx, meme, ryandesign, fanf2, samebchase and trofi. :)

See the changelog 2.0 (2020-07-20) for details.

Release 1.3

The bulk of the work in this release was concentrated on implementing an experimental submatch extraction algorithm. The new algorithm is based on staDFA: another type of automaton described in the paper “staDFA: An Efficient Subexpression Matching Method” by Mohammad Imran Chowdhury (2018). Like TDFA, staDFA performs operations on variables that hold submatch positions during the matching process. The main difference between the two automata types is that TDFA associates variable operations with transitions, while staDFA puts them in states. Now that we have an experimental implementation of staDFA in re2c, it is possible to compare its performance against TDFA. The initial experiments have shown that on average staDFA is slower and larger than TDFA(1): it has more variable operations and more states. It still remains to run proper benchmarks and formalize the modifications to the original algorithm that we had to make for correctness and performance. The new staDFA algorithm is enabled with --stadfa option.

This release adds a new warning -Wsentinel-in-midrule that warns the user in the case when the sentinel symbol occurs in the middle of the rule – this may cause reads past the end of buffer, crashes or memory corruption in the generated lexer. Complimentary to the warning there is a new configuration re2c:sentinel that allows to specify which symbol is used as a sentinel (by default re2c assumes NULL). It this configuration is used, then -Wsentinel-in-midrule is an error.

Some work has been done on reproducible builds.

As usual, thanks to all the people who contributed to this release (including Chris Lamb, jcfp and Ondřej Čertík)! See the changelog 1.3 (2019-12-14) for details.

Release 1.2.1

This is a minor release that improves some of the features introduced in 1.2. First, it is now possible to reset re2c:eof configuration to the default value by setting it to -1. Second, re2c now installs standard include files to $(datadir)/re2c/stdlib, where $(datadir) is /usr/share by default (but can be configured otherwise). When looking for include files, re2c automatically tries this directory — there is no need to add it manually with -I. Both issues have been reported by Wesley W. Terpstra. Thank you!

See the changelog 1.2.1 (2019-08-11) for details.

Release 1.2

This release adds a handful of useful features and finalizes two-year research effort on POSIX disambiguation.

First and foremost, end of input rule has been added to re2c. It is a new method of checking for the end of input, which aims at being both generic and simple to use. Historically re2c has a number of ways to handle EOF situation. The simplest and the most efficient method is using a sentinel symbol, but it has limited applicability. The most widely used and generic method is bounds checking with padding, but it is rather complex and requires input buffering. Finally, it is possible to define custom EOF handling methods by using the generic API. However, such methods are not necessarily correct and are likely to be slower than other methods. The new method, EOF rule, tries to combine the simplicity and efficiency of the sentinel symbol and the generality of bounds checking. Essentially, all the user is required to do is to specify an arbitrary “fake” sentinel using re2c:eof configuration (zero is usually a good choice), to define a special EOF rule $, and to make sure that the fake sentinel is always the last symbol in the input buffer. The lexer will then run without end of input checks until it hits the branch corresponding to the fake sentinel, at which point it will perform additional bounds checking. If indeed it is the end, the lexer will try to get more input. If this attempt fails, the lexer will either rollback to the last matched rule (or the default rule, if it has consumed some input but did not match), or else it will go to EOF rule. Depending on each particular case and the choice of sentinel, the performance of EOF rule may be slightly better or worse than bounds checking with padding. See the user manual for more details and examples.

Another important new feature is the possibility to include other files. Re2c provides a directive /*!include:re2c "" */, where is the name of the included file. Re2c looks for included files in the directory of the including file and in the include locations, which can be specified with -I option. Include files may have further includes of their own. Re2c provides some predefined include files that can be found in the include/ sub-directory of the project. These files contain useful definitions and form something like a standard library for re2c. Currently there is one available include,

Re2c now allows to generate header file from the input file using option -t --type-header or re2c:flags:t and re2c:flags:type-header configurations and the newly added directives /*!header:re2c:on*/ and /*!header:re2c:off*/. Header files may be needed in cases when re2c is used to generate definitions of constants, variables and structs that must be visible from other translation units.

Re2c can now understand UTF8-encoded string literals and character classes in regular expressions. By default, re2c parses regular expressions like "∀x ∃y" as a sequence of 1-byte ASCII code points e2 88 80 78 20 e2 88 83 79 (hexadecimal), and the users have to escape Unicode symbols manually: "\u2200x \u2203y". This is not what most users would expect, as demonstrated by multiple issues on the bugtracker. The new option --input-encoding <ascii | utf8> allows to change the default behavior and parse "∀x ∃y" as a sequence of 4-byte Unicode code points 2200 78 20 2203 79 (hexadecimal). Note that the new option affects only parsing, and not the code generation.

Re2c now allows to mix reusable blocks with normal blocks when -r --reuse option is used. This is very useful in cases when the input file contains many re2c blocks, and only some of them need to be reused (a good example is re2c own lexer, which has to use certain rules twice in order to generate ASCII and UTF8 lexer variants for --input-encoding <ascii | utf8> option).

It is now possible to specify location format in re2c-generated messages with --location-format <gnu | msvc> option. As one might guess, possible arguments are GNU location format filename:line:column: (the default) and MSVC format filename(line,column). This option might help IDE users. Another new option is --verbose — it prints a short “success” message in case re2c exits successfully.

Flex compatibility mode (enabled with -F --flex-support option) has been improved: parsing errors and incorrect operator precedence in some rare cases have been fixed. Historically re2c allows to mix flex-style code with re2c-style code, which creates certain difficulties for the parser. Such difficulties can be resolved by passing parts of the parser context to the lexer.

The difference operator / is now applied to its operands before encoding expansion. Re2c supports difference only for character classes, and encoding expansion in case of a variable-length encoding may transform even a simple character class into a complex graph (see the graph for [^] in UTF8 for example). Applying expansion after difference makes / applicable in more cases.

Re2c now generates the output file atomically: it first creates a temporary file and writes the generated output into it, and then renames the temporary file to the output file.

Documentation has been rearranged and updated, and the website has been fixed to look somewhat better on mobile devices (hopefully).

From developers standpoint, re2c now has better debugging capabilities. Debug code has been moved in a separate subsystem and is no longer built in release binaries (debug builds are enabled with --enable-debug configure option). New debug options and passes have been added.

This release has taken almost a year, mostly because of the time needed to formalize the extended algorithm for POSIX disambiguation and write a formal paper Efficient POSIX Submatch Extraction on NFA. The algorithms described in the paper are implemented in an experimental library libre2c. The library is intended as a playground for experiments and research, not a replacement for existing libraries like RE2. Having re2c as a library is convenient for benchmarking, testing and creating bindings to other languages. To build the library and benchmarks, configure re2c with --enable-libs option.

Many people have contributed to this release. Angelo Borsotti co-authored the paper; his hard work has revealed multiple shortcomings and deficiencies in the algorithms used in re2c. Henri Salo helped with fuzz-testing re2c using the American Fuzzy Lop. Denis Naumov helped with Windows portability issues. Sergei Trofimovich fixed a number of bugs; he and Serghei Iakovlev helped with the Travis CI infrastructure. Wesley Terpstra used re2c in the open-source wake build tool. Many thanks to them and all the others who contributed to this release!

A lot of bugs have been fixed. See the changelog 1.2 (2019-08-02) for details.

Release 1.1.1

This is a minor bug-fix release. It fixes a crash with -V --vernum option. The crash was reported and fixed by Mike Gilbert, and later rediscovered by Daniel J. Luke and Perry E. Metzger. Sergei Trofimovich made a unit test, as the -V --vernum option has already caused some troubles in the past. See the changelog 1.1.1 (2018-08-30) for details. Thanks everyone!

Release 1.1

This release doesn’t bring in any serious user-visible changes; most of the work concentrated on the internal POSIX disambiguation algorithm used for submatch extraction with --posix-captures option. The old algorithm, introduced in release 1.0, was based on the work of Chris Kuklewicz (outlined on this haskell wiki page and formalized in the paper Tagged Deterministic Finite Automata with Lookahead). Kuklewicz algorithm was replaced with a totally different algorithm based on the work of Satoshi Okui and Taro Suzuki (described in the paper “Disambiguation in Regular Expression Matching via Position Automata with Augmented Transitions”, year 2013). The main effort in this release was in developing proper theoretical foundations for the new algorithm. As usual, there has been a couple of bug fixes.

See the changelog 1.1 (2018-08-27) for details.

Release 1.0.3

Yet another minor bug-fix release. It fixes a build failure on Mac OS X (thanks to Ryan Shmidt for the bug report). See the changelog 1.0.3 (2017-11-08) for details.

Release 1.0.2

This is another minor bug-fix release. It addresses some documentation issues (thanks to Ziris85 for the bug report). See the changelog 1.0.2 (2017-08-26) for details.

Release 1.0.1

This is a minor bug-fix release. It fixes a build failure on Mac OS X (thanks to ilovezf for the bug report). See the changelog 1.0.1 (2017-08-11) for details.

Release 1.0

This release is an important milestone in the history of re2c: it adds submatch extraction. It is a long-awaited feature; even as Peter Bumbulis released the first version of re2c back in 1993, he mentioned submatch extraction in his paper together with Donald D. Cowan “RE2C: a more versatile scanner generator” as a useful feature that soon should be added to re2c. It seemed back then that submatch extraction is relatively easy to implement; like many other useful extensions, it was deemed to be no more than syntactic sugar on top of canonical regular expressions.

As it turned out, this is not so: the problem of parsing with regular expressions is inherently more complex than recognition, and submatch extraction cannot be implemented using traditional DFA-based approach. Many authors studied this subject and developed algorithms suitable for their particular settings and problem domains. Some of them used NFA-based approach, others suggested algorithms based on multiple DFA or multiple passes over a single DFA. For re2c most of these algorithms are too heavyweight: they require complication of the underlying computational model and incur overhead even in simple cases.

However, one algorithm is outstanding in this respect: it is the algorithm invented by Ville Laurikari in 2000 and described in his paper “NFAs with Tagged Transitions, their Conversion to Deterministic Automata and Application to Regular Expressions”. Laurikari algorithm is based on a single DFA extended with a fixed number of registers; it makes one pass over the input string, works in linear time and the required space does not depend on the input length. But what is most important, the added complexity depends on submatch detalization: on submatch-free regular expressions Laurikari automaton reduces to a simple DFA.

When the work on submatch extraction feature started in early 2016, at first it seemed that Laurikari algorithm was still overly complicated. Re2c needed an algorithm that would support only partial submatch extraction, but the supported cases would be handled very efficiently. A lot of effort was spent on studying papers, and finally this lead to the development of a new algorithm. It soon became clear that the self-invented algorithm was in fact a special case of Laurikari algorithm, with one significant performance improvement: the use of lookahead symbol. Surprisingly, the original Laurikari algorithm has a strange deficiency: when recording submatches, the automata totally ignore the lookahead symbol and perform a lot of redundant operations that could be easily avoided. It was only logical to fix Laurikari algorithm. The new improved algorithm is described in the paper “Tagged Deterministic Finite Automata with Lookahead”

Submatch extraction in re2c comes in two flavors: POSIX-compliant capturing groups (with true POSIX semantics) and standalone tags that use leftmost greedy disambiguation. Both features require a number of new API primitives and configurations; and they are barely documented yet (though I managed to add a section in the manual). The paper cited above has some interesting benchmark results; articles and examples will follow in the nearest future. As usually, you may expect a number of small bug-fix releases in the 1.0.x series.

The following people contributed to this release:

  • Petr Skocik tidied up re2c documentation: fixed ubiquitous spelling errors and established synchronization system between the online documentation, manpage and help message displayed by re2c with --help option.

  • Sergei Trofimovich wrote a fuzzer that generates random regular expressions and checks them on input data generated with re2c --skeleton option. This fuzzer is most helpful; it has already found tens of bugs in the early implementation of submatch extraction. Sergei also helped me with testing re2c on numerous platforms and computer architectures.

  • Abs62, asmwarrior, Ben Smith, CRCinAU, Derick Rethans, Dimitri John Ledkov, Eldar Zakirov, jcfp, Jean-Claude Wippler, Jeff Trull, Jérôme Dumesnil, Jesse Buesking, Julian Andres Klode, Paulo Custodio, Perry E. Metzger, philippschaefer, Ross Burton, Rui Maciel, Ryan Mast, Samuel006, sirzooro, Tim Kelly — all these people contributed bug reports, patches and feedback. Thank you!

  • Many thanks to the early user of the new features João Paredes for his patience and encouraging comments on the subject.

See the changelog 1.0 (2017-08-11) for details.

Release 0.16

This release adds a very important step in the process of code generation: minimization of the underlying DFA (deterministic finite automaton). Simply speaking, this means that re2c now generates less code (while the generated code behaves exactly the same).

DFA minimization is a very well-known technique and one might expect that any self-respecting lexer generator would definitely use it. So how could it be that re2c didn’t? In fact, re2c did use a couple of self-invented tricks to compress the generated code (one interesting technique is the construction of a tunnel automaton). Some of these tricks were quite buggy (see this bug report for example). Now that re2c does canonical DFA minimization, all this stuff is obsolete and has been dropped.

A lot of attention has been paid to the correctness of the DFA minimization. Usually, re2c uses a very simple criterion to validate changes: the generated code for all the tests in testsuite must remain the same. However, in the case of DFA minimization the generated code changes dramatically. It is impossible to verify the changes manually.

One possible verification tool is the skeleton feature. Because skeleton is constructed prior to DFA minimization, it cannot be affected by any errors in its implementation.

Another way to verify DFA minimization is by implementing two different algorithms and comparing the results. The minimization procedure has a very useful property: the miminal DFA is unique (with respect to state relabeling). We used the Moore and the so-called table filling algorithms: The Moore algorithm is fast, while table filling is very simple to implement. The --dfa-minimization <moore | table> option that allows choosing the particular algorithm (defaults to moore), but it’s only useful for debugging DFA minimization.

A good side effect of messing with re2c internals is a significant speedup of code generation (see this issue for example). The test suite now runs twice as fast (at least).

See the changelog 0.16 (2016-01-21) for details.

Release 0.15.3

This release fixes multiple build-time and run-time failures on OS X, FreeBSD, and Windows. Most of the problems were reported and fixed by Oleksii Taran (on OS X) and Sergei Trofimovich (on FreeBSD and Windows). Thank you for your help!

Skeleton coverage has been improved.

See the changelog 0.15.3 (2015-12-02) for details.

Release 0.15.2

This release fixes a bug in the build system: it adds a missing dependency of the lexer on the bison-generated header (reported on Gentoo bugtracker). See the changelog 0.15.2 (2015-11-23) for details.

Release 0.15.1

This release fixes an error in the testing script (thanks to Sergei Trofimovich who reported the bug and suggested a quick fix). See the changelog 0.15.1 (2015-11-22) for details.

Release 0.15

This release started out in the spring of 2015 as a relatively simple code cleanup.

It focused on the following problem: re2c used to repeat the whole generation process multiple times. Some parts of the generated program depend on the overall input statistics; they cannot be generated until the whole input has been processed. The usual strategy is to make stubs for all such things and fix them later. Instead, re2c used to process the whole input, gather statistics, discard the generated output, and regenerate it from scratch. Moreover, each generation pass further duplicated certain calculations (for similar reasons). As a result, the code generation phase was repeated four times.

The problem here was not inefficiency: re2c is fast enough to allow the 4x overhead. The real problem was the complexity of the code: the developer has to think of multiple execution layers all the time. Some parts of code are only executed on certain layers and affect each other. Simple reasoning gets really hard.

So the main purpose of this release was to simplify the code and make it easier to fix bugs and add new features. Very soon it became clear that some of the changes in code generation are hard to verify by hand. For example, even a minor rebalancing of if and switch statements may change the generated code significantly. It was clear that re2c needed an automatic verification tool, which lead to the idea of generating skeleton programs.

Meanwhile some work was done on adding warnings, updating the build system and fixing various bugs. A heart-warming consequence of the code simplification is that re2c now uses re2c more extensively: not only the main program, but also the command-line options, inplace configurations, and all kinds of escaped strings are parsed with re2c. Website has also been updated (feel free to suggest improvements).

See the changelog 0.15 (2015-11-22) for details.