Encodings

re2c supports the following encodings: ASCII (default), EBCDIC (-e), UCS-2 (-w), UTF-16 (-x), UTF-32 (-u) and UTF-8 (-8). See also inplace configuration re2c:flags.

The following concepts should be clarified when talking about encodings. A code point is an abstract number that represents a single symbol. A code unit is the smallest unit of memory, which is used in the encoded text (it corresponds to one character in the input stream). One or more code units may be needed to represent a single code point, depending on the encoding. In a fixed-length encoding, each code point is represented with an equal number of code units. In variable-length encodings, different code points can be represented with different number of code units.

  • ASCII is a fixed-length encoding. Its code space includes 0x100 code points, from 0 to 0xFF. A code point is represented with exactly one 1-byte code unit, which has the same value as the code point. The size of YYCTYPE must be 1 byte.
  • EBCDIC is a fixed-length encoding. Its code space includes 0x100 code points, from 0 to 0xFF. A code point is represented with exactly one 1-byte code unit, which has the same value as the code point. The size of YYCTYPE must be 1 byte.
  • UCS-2 is a fixed-length encoding. Its code space includes 0x10000 code points, from 0 to 0xFFFF. One code point is represented with exactly one 2-byte code unit, which has the same value as the code point. The size of YYCTYPE must be 2 bytes.
  • UTF-16 is a variable-length encoding. Its code space includes all Unicode code points, from 0 to 0xD7FF and from 0xE000 to 0x10FFFF. One code point is represented with one or two 2-byte code units. The size of YYCTYPE must be 2 bytes.
  • UTF-32 is a fixed-length encoding. Its code space includes all Unicode code points, from 0 to 0xD7FF and from 0xE000 to 0x10FFFF. One code point is represented with exactly one 4-byte code unit. The size of YYCTYPE must be 4 bytes.
  • UTF-8 is a variable-length encoding. Its code space includes all Unicode code points, from 0 to 0xD7FF and from 0xE000 to 0x10FFFF. One code point is represented with a sequence of one, two, three, or four 1-byte code units. The size of YYCTYPE must be 1 byte.

In Unicode, values from range 0xD800 to 0xDFFF (surrogates) are not valid Unicode code points. Any encoded sequence of code units that would map to Unicode code points in the range 0xD800-0xDFFF, is ill-formed. The user can control how re2c treats such ill-formed sequences with the --encoding-policy <policy> switch.

For some encodings, there are code units that never occur in a valid encoded stream (e.g., 0xFF byte in UTF-8). If the generated scanner must check for invalid input, the only correct way to do so is to use the default rule (*). Note that the full range rule ([^]) won’t catch invalid code units when a variable-length encoding is used ([^] means “any valid code point”, whereas the default rule (*) means “any possible code unit”).