Benchmarks

Benchmarks are not built by default, as some of them contain large automata that take considerable time to compile with GCC or Clang. To enable them, configure with --enable-benchmarks (Autotools) or -DRE2C_BUILD_BENCHMARKS=yes (CMake). There are a few different groups of benchmarks.

Submatch extraction in lexer generators:

$ cd ${BUILD_DIR}/benchmarks/submatch_dfa_aot
$ ./run.py --repetitions ${REP_COUNT} --output=results.json

Submatch extraction in library algorithms based on deterministic automata:

$ cd ${BUILD_DIR}/benchmarks/submatch_dfa_jit
$ ./bench_submatch_dfa_jit --benchmark_out_format=json --benchmark_out=results.json

Submatch extraction in library algorithms based on non-deterministic automata:

$ cd ${BUILD_DIR}/benchmarks/submatch_nfa
$ ./bench_submatch_nfa --benchmark_out_format=json --benchmark_out=results.json

To generate a TeX bar chart (PGF plot) from the JSON output, use the json2pgfplot.py script. It has a few options, such as --relative-to <algo> (scale the timings relative to the specified algorithm) and --font <name> (specify the font to use). The generated TeX file can be compiled to PDF, which can be further converted to SVG, etc.

$ ${SOURCE_DIR}/benchmarks/json2pgfplot.py results.json results.tex
$ pdflatex results.tex </dev/null >results.log
$ pdf2svg results.pdf results.svg

Submatch (lexer generators)

These benchmarks contain regular expressions with submatch markers that are compiled to C code by a lexer generator, and then to native code by a C compiler (GCC and Clang). Compilation happens ahead of time, so it is not included in the run time. The currently supported generators are ragel, kleenex and re2c (with its TDFA(0), TDFA(1) and sta-DFA algorithms); flex is not included, as it does not support submatch extraction. The regular expressions used in the benchmarks can be divided into two categories:

  • real-world regular expressions (parsers for HTTP message headers, URI, email addresses, etc.) ranging from large to small, mostly unambiguous

  • artificial regular expressions (a few different groups with emphasis on alternative, concatenation or repetition) in series of increasing complexity and ambiguity

The generated programs do string rewriting: they read text from stdin and write it to stdout with some characters inserted at the points of submatch extraction (this makes it possible to verify that the generated program is correct). The string-rewriting form is dictated by kleenex, which is well suited to this task but cannot generate free-form programs the way ragel and re2c can. Both ragel and re2c programs are written with performance in mind and use buffered input. As for kleenex, there is no control over what it generates (user input consists of regular expressions only), but presumably it also optimizes the generated code.

Ragel programs are fast, but not always correct. This is because ragel cannot handle non-determinism in the general case. It is a fundamental limitation which cannot be solved with disambiguation operators, except for simple cases. Ragel has no notion of tags and tag variables, and its submatch extraction is based on actions — arbitrary blocks of code embedded in the regular expression. When ragel generates a DFA, it puts actions either on transitions or in states (depending on the action type). Non-determinism means that there are multiple parallel NFA paths with different actions that reach a given DFA state. Actions on different paths conflict with each other (in particular, an action can conflict with itself) and change the program state in unforeseen ways (e.g. overwrite variables set by another action). Disambiguation operators can remove some of the conflicting actions, but they cannot fork the program state.

Kleenex generates very large automata in some benchmarks, which cannot be compiled in reasonable time. On the whole, kleenex programs have some constant overhead compared to the other programs. This can be explained by a different underlying automaton type (DSST), which is better suited to full parsing than to submatch extraction.

Of the algorithms in re2c, TDFA(1) is the fastest and the most robust one. In unambiguous cases its performance is similar to ragel, sometimes slightly better or worse (it varies more with the C compiler than with the benchmark). The generated binary size is also close. In highly ambiguous cases TDFA(1) is slower than ragel, which happens because TDFA(1) handles non-determinism by keeping track of multiple possible tag values, while ragel doesn’t handle it in any special way. TDFA(0) is generally less efficient than TDFA(1), see the paper Tagged Deterministic Finite Automata with Lookahead for a detailed comparison. Sta-DFA performs well on small benchmarks, but it degrades quickly on large or ambiguous regular expressions, both in speed and in the automaton size.

The measurements have been taken on an Intel(R) Core(TM) i7-8750H CPU with 32 KiB L1 Data cache (x6 cores), 32 KiB L1 Instruction cache (x6 cores), 256 KiB L2 Unified cache (x6 cores), 9216 KiB L3 Unified cache, 32 GiB RAM. Software versions: ragel-7.0.4, kleenex built from Git at commit d474c60, re2c built from Git at commit 1c3bead, GCC-10.2.0, Clang-11.0.0. The results may be out of date!

Run time (milliseconds on 100MB text)

../_images/results_1.svg

Binary size (stripped, kilobytes)

../_images/results_2.svg

Submatch (libraries, DFA)

This group contains library algorithms for submatch extraction that are based on deterministic automata. The regular expressions undergo just-in-time determinization before matching, and determinization is included in the run time. Currently the group contains only two algorithms: TDFA(1) and regless-TDFA(1), a modification of TDFA(1) that replaces register operations with a record of tag history, followed by a second pass that unfolds the history and reconstructs submatch results. The benchmark also includes variations for POSIX and leftmost greedy disambiguation policies, since disambiguation is also included in the run time.

Regless-TDFA(1) bypasses a few expensive computation steps in regcomp(), which results in much faster determinization (a few orders of magnitude faster on large regular expressions). For regexec() the results are mixed: on real-world regular expressions TDFA(1) usually outperforms regless-TDFA(1), but in some artificially constructed cases it can be arbitrarily slower. These are pathological inputs for TDFA(1): the tags have an arbitrarily high degree of non-determinism (increased with the repetition counter), so TDFA(1) has to track arbitrarily many tag variables. Regless-TDFA(1) does not have tag variables.

The measurements have been taken on an Intel(R) Core(TM) i7-8750H CPU with 32 KiB L1 Data cache (x6 cores), 32 KiB L1 Instruction cache (x6 cores), 256 KiB L2 Unified cache (x6 cores), 9216 KiB L3 Unified cache, 32 GiB RAM. Software versions: re2c built from Git at commit b55ab37. The results may be out of date!

Determinization / run time (relative to first)

../_images/results_11.svg

Submatch (libraries, NFA)

This group contains library algorithms for submatch extraction based on non-deterministic automata. They are described in depth in the paper Efficient POSIX submatch extraction on NFA. The goal is to compare algorithms that support POSIX longest-match semantics. A few leftmost greedy algorithms are provided as a baseline, including the Google RE2 library (which does not support POSIX semantics). There are four different variations of the Okui-Suzuki algorithm, an algorithm proposed by Kuklewicz, and a backward-matching algorithm proposed by Cox (which is generally incorrect).

As the benchmarks show, the basic Okui-Suzuki algorithm is the most robust of the POSIX algorithms. It works in bounded memory that depends only on the regular expression and does not grow with the input size, and it has reasonable performance compared to the leftmost greedy algorithms (although it, too, has pathological cases on regular expressions with a high level of ambiguity). The lazy variation of the Okui-Suzuki algorithm is often faster, but its memory requirement is not bounded (it grows with the size of the input). Both the Kuklewicz and backward algorithms are much slower on large real-world regular expressions.

The measurements have been taken on an Intel(R) Core(TM) i7-8750H CPU with 32 KiB L1 Data cache (x6 cores), 32 KiB L1 Instruction cache (x6 cores), 256 KiB L2 Unified cache (x6 cores), 9216 KiB L3 Unified cache, 32 GiB RAM. Software versions: re2c built from Git at commit b55ab37, re2-2020-11-01. The results may be out of date!

Simulation time (relative to first)

../_images/results_12.svg