[-Wuseless-escape]

A simple example

/*!re2c
    *                        {}
    "\a\A\"\'\[\]\-\x5d\377" {}
    '\a\A\"\'\[\]\-\x5d\377' {}
    [\a\A\"\'\[\]\-\x5d\377] {}
*/

Given this code, `re2c -Wuseless-escape` reports a bunch of warnings:

re2c: warning: line 3: column 11: escape has no effect: '\A' [-Wuseless-escape]
re2c: warning: line 3: column 15: escape has no effect: '\'' [-Wuseless-escape]
re2c: warning: line 3: column 17: escape has no effect: '\[' [-Wuseless-escape]
re2c: warning: line 3: column 19: escape has no effect: '\]' [-Wuseless-escape]
re2c: warning: line 3: column 21: escape has no effect: '\-' [-Wuseless-escape]
re2c: warning: line 4: column 11: escape has no effect: '\A' [-Wuseless-escape]
re2c: warning: line 4: column 13: escape has no effect: '\"' [-Wuseless-escape]
re2c: warning: line 4: column 17: escape has no effect: '\[' [-Wuseless-escape]
re2c: warning: line 4: column 19: escape has no effect: '\]' [-Wuseless-escape]
re2c: warning: line 4: column 21: escape has no effect: '\-' [-Wuseless-escape]
re2c: warning: line 5: column 11: escape has no effect: '\A' [-Wuseless-escape]
re2c: warning: line 5: column 13: escape has no effect: '\"' [-Wuseless-escape]
re2c: warning: line 5: column 15: escape has no effect: '\'' [-Wuseless-escape]
re2c: warning: line 5: column 17: escape has no effect: '\[' [-Wuseless-escape]

The warnings say that the \A and \[ escapes are meaningless in all three rules, that \- makes sense only in a character class, and that each closing delimiter (", ' and ]) needs to be escaped only inside the lexeme it would otherwise terminate. Useless escapes are ignored: the escaped symbol is treated as if it were not escaped (\A becomes A, etc.). The above example should be fixed as follows:

/*!re2c
    *                    {}
    "\aA\"'[]-\x5d\377"  {}
    '\aA"\'[]-\x5d\377'  {}
    [\aA"'[\]\-\x5d\377] {}
*/
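The rewrite above can be reproduced mechanically. The following is a minimal sketch, not re2c's actual implementation: the function name `drop_useless_escapes` and the `kind` parameter are made up for illustration, the set of meaningful escapes is taken from the list in "How it works", and octal and hexadecimal escapes are not validated (any backslash followed by an octal digit or by x, X, u, U is kept as written).

```python
# Escapes that are meaningful in every lexeme: \a \b \f \n \r \t \v \\,
# plus the first character of octal (\ooo) and hex (\x \X \u \U) escapes.
# Assumption: numeric escapes are well-formed; this sketch does not check.
ALWAYS_MEANINGFUL = set("abfnrtv\\") | set("01234567") | set("xXuU")

def drop_useless_escapes(body, kind):
    """Return `body` with useless backslashes removed.

    `body` is the text between the delimiters; `kind` is the opening
    delimiter of the lexeme: '"', "'" or '['.
    """
    closing = {'"': '"', "'": "'", "[": "]"}[kind]
    meaningful = ALWAYS_MEANINGFUL | {closing}
    if kind == "[":
        meaningful.add("-")          # a dash is special only in a class
    out = []
    i = 0
    while i < len(body):
        ch = body[i]
        if ch == "\\" and i + 1 < len(body):
            nxt = body[i + 1]
            if nxt in meaningful:
                out.append(ch + nxt)  # keep the escape as written
            else:
                out.append(nxt)       # useless: keep only the symbol
            i += 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)

# The lexeme body used in all three rules of the example.
EXAMPLE = r'\a\A\"\'\[\]\-\x5d\377'
for kind in ('"', "'", "["):
    print(kind, drop_useless_escapes(EXAMPLE, kind))
```

Under these assumptions, the three printed bodies match the bodies of the fixed rules shown above.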

How it works

re2c recognizes escapes in the following lexemes:

  • double-quoted strings " ... "
  • single-quoted strings ' ... '
  • character classes [ ... ] and [^ ... ]

The following escapes are recognized:

  • Closing quotes (\" for double-quoted strings, \' for single-quoted strings and \] for character classes).
  • Dash \- in character classes.
  • Octal escapes: \ooo, where o is in range [0 - 7] (the largest octal escape is \377, which equals 0xFF).
  • Hexadecimal escapes: \xhh, \Xhhhh, \uhhhh and \Uhhhhhhhh, where h is in range [0 - 9], [a - f] or [A - F].
  • Miscellaneous escapes: \a, \b, \f, \n, \r, \t, \v, \\.

Ill-formed octal and hexadecimal escapes are treated as errors. An escape followed by a newline is also an error: multiline strings and classes are not allowed (this is very inconvenient; hopefully it will be fixed in the future). All other ill-formed escapes are ignored. If [-Wuseless-escape] is enabled, re2c warns about ignored escapes.
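The error cases can be illustrated with a small well-formedness check. This is a sketch under stated assumptions rather than a description of re2c's lexer: the name `check_escape_form` and its return values are hypothetical, a backslash followed by an octal digit is assumed to always start an octal escape that needs exactly three digits and a value of at most 0xFF, and the digit counts for hexadecimal escapes are read off the list above.

```python
import re

# Number of hex digits required after each prefix: \xhh, \Xhhhh,
# \uhhhh, \Uhhhhhhhh.
HEX_DIGITS = {"x": 2, "X": 4, "u": 4, "U": 8}

def check_escape_form(text):
    """Classify the escape at the start of `text`.

    Returns "ok" for a well-formed octal or hexadecimal escape,
    "error" for an ill-formed one or for a backslash before a newline,
    and "other" for escapes this sketch does not cover.
    """
    if not text.startswith("\\") or len(text) < 2:
        raise ValueError("expected a backslash followed by a character")
    lead = text[1]
    if lead == "\n":
        return "error"                # lexemes cannot span lines
    if lead in HEX_DIGITS:
        n = HEX_DIGITS[lead]
        digits = text[2:2 + n]
        ok = len(digits) == n and re.fullmatch(r"[0-9a-fA-F]+", digits)
        return "ok" if ok else "error"
    if lead in "01234567":
        digits = text[1:4]
        # Assumption: exactly three octal digits, value at most \377.
        ok = re.fullmatch(r"[0-7]{3}", digits) and int(digits, 8) <= 0xFF
        return "ok" if ok else "error"
    return "other"

for sample in (r"\377", r"\400", r"\x5d", r"\x5", r"\A"):
    print(sample, check_escape_form(sample))
```

Escapes reported as "other" are either meaningful (such as \n or a closing quote) or useless; only the latter are what [-Wuseless-escape] warns about.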

Real-world examples

I found many useless escapes in real-world programs:

  • A very strange escape \* in a regular expression like "*\*": either someone wanted to write "*\\*" (with a backslash in the middle), or I have no explanation at all (considering that the first * is not escaped). As far as I know, re2c has always treated "*\*" as "**".
  • \h in character classes (e.g. [ \h\t\v\f\r]): perhaps someone confused \h with horizontal tab (or even hostname :)).
  • \[ in character classes; this one is very common.
  • \/ in character classes (e.g. [^\/\000]) and strings (e.g. "\/*"). However, there is one interesting case: "/**** State @@ ***\/": here an unescaped slash would end the multiline comment. Perhaps [-Wuseless-escape] should be fixed to recognize such cases.
  • \. in character classes (e.g. [\.]).