SciTE Regular Expressions

Regular Expressions in SciTE

Purpose

Regular expressions can be used for searching for patterns rather than literals. For example, it is possible to search for variables in SciTE property files, which look like $(name.subname) with the regular expression:
\$([a-z.]+) (or \$\([a-z.]+\) in posix mode).

Replacement with regular expressions allows complex transformations with the use of tagged expressions. For example, pairs of numbers separated by a ',' could be reordered by replacing the regular expression:
\([0-9]+\),\([0-9]+\) (or ([0-9]+),([0-9]+) in posix mode, or even (\d+),(\d+))
with:
\2,\1
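
The two examples above can be tried outside SciTE. The following is a minimal sketch using Python's re module as a stand-in; this is an assumption, since SciTE uses its own engine. Python groups with plain ( and ), so the patterns correspond to the posix mode forms. The sample strings are made up for illustration.

```python
import re

# Reorder pairs of numbers separated by ',' (posix mode form of the pattern).
text = "10,20 and 3,4"
swapped = re.sub(r"([0-9]+),([0-9]+)", r"\2,\1", text)
print(swapped)  # 20,10 and 4,3

# Find variables that look like $(name.subname) in a property line.
props = "command.go=$(file.name) in $(dir)"
variables = re.findall(r"\$\([a-z.]+\)", props)
print(variables)  # ['$(file.name)', '$(dir)']
```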

Syntax

Regular expression syntax depends on a parameter: find.replace.regexp.posix
If set to 0, syntax uses the old Unix style where \( and \) mark capturing sections while ( and ) are themselves.
If set to 1, syntax uses the more common style where ( and ) mark capturing sections while \( and \) are plain parentheses.
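
The difference between the two styles is only which form of the round bracket is the literal one. As a rough illustration, the hypothetical helper below (swap_paren_style is not part of SciTE) converts a pattern from one style to the other by swapping escaped and unescaped round brackets. It is a sketch only: it does not treat brackets inside a [set] specially.

```python
def swap_paren_style(pattern: str) -> str:
    """Swap \\( \\) with ( ) so a pattern written for one mode fits the other.

    Limitation: round brackets inside a [set] are swapped too.
    """
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            nxt = pattern[i + 1]
            # \( or \) loses its backslash; any other escape is kept as-is.
            out.append(nxt if nxt in "()" else ch + nxt)
            i += 2
        elif ch in "()":
            # A bare round bracket gains a backslash.
            out.append("\\" + ch)
            i += 1
        else:
            out.append(ch)
            i += 1
    return "".join(out)


print(swap_paren_style(r"\([0-9]+\),\([0-9]+\)"))  # ([0-9]+),([0-9]+)
print(swap_paren_style(r"\$([a-z.]+)"))            # \$\([a-z.]+\)
```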

[1] char

matches itself, unless it is a special character (metachar): . \ [ ] * + ? ^ $ and ( ) in posix mode.

[2] .

matches any character.

[3] \

matches the character following it, except:

\a, \b, \f, \n, \r, \t, \v match the corresponding C escape char, respectively BEL, BS, FF, LF, CR, TAB and VT.
Note that \r and \n are never matched because in Scintilla, regular expression searches are made line by line (stripped of end-of-line chars);
if not in posix mode, when followed by a left or right round bracket (see [8]);
when followed by a digit 1 to 9 (see [9]);
when followed by a left or right angle bracket (see [10]);
when followed by d, D, s, S, w or W (see [11]);
when followed by x and two hexadecimal digits (see [12]).

Backslash is used as an escape character for all other meta-characters, and itself.

[4] [set]

matches one of the characters in the set. If the first character in the set is ^, it matches the characters NOT in the set, i.e. complements the set. A shorthand S-E (start dash end) is used to specify a set of characters S up to E, inclusive. The special characters ] and - have no special meaning if they appear as the first chars in the set. To include both, put - first: [-]A-Z] (or just backslash them).

example	match
`[-]|]`	matches these 3 chars
`[]-|]`	matches from ] to | chars
`[a-z]`	any lowercase alpha
`[^-]]`	any char except - and ]
`[^A-Z]`	any char except uppercase alpha
`[a-zA-Z]`	any alpha
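
Some of the sets in the table can be checked with Python's re module, used here only as an approximation of SciTE's engine. Python treats an unescaped ] differently unless it comes first in the set, so this sketch uses the backslashed forms mentioned above. The test strings are made up.

```python
import re

# The three characters -, ] and |, with ] backslashed.
three_chars = re.findall(r"[-\]|]", "a-b]c|d")
print(three_chars)  # ['-', ']', '|']

# Any char except - and ].
others = re.findall(r"[^-\]]", "a-]b")
print(others)  # ['a', 'b']

# Ranges and a complemented range.
lower = re.findall(r"[a-z]", "aB3c")
not_upper = re.findall(r"[^A-Z]", "aB3c")
print(lower, not_upper)  # ['a', 'c'] ['a', '3', 'c']
```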

[5] *

any regular expression form [1] to [4] (except [8], [9] and [10] forms of [3]), followed by closure char (*) matches zero or more matches of that form.

[6] +

same as [5], except it matches one or more.

[5-6]

Both [5] and [6] are greedy (they match as much as possible) unless they are followed by the 'lazy' quantifier (?) in which case both [5] and [6] try to match as little as possible.
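
The greedy and lazy behaviours can be compared side by side. This sketch assumes Python's re module, whose * + and ? quantifiers follow the same greedy and lazy convention; the sample line is made up.

```python
import re

line = "<b>bold</b>"

# Greedy: .+ takes as much as possible, up to the last > on the line.
greedy = re.search(r"<.+>", line).group()

# Lazy: .+? takes as little as possible, up to the first >.
lazy = re.search(r"<.+?>", line).group()

print(greedy)  # <b>bold</b>
print(lazy)    # <b>
```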

[7] ?

same as [5], except it matches zero or one.

[8]

a regular expression in the form [1] to [13], enclosed as \(form\) (or (form) with posix flag) matches what form matches. The enclosure creates a set of tags, used for [9] and for pattern substitution. The tagged forms are numbered starting from 1.

[9]

a \ followed by a digit 1 to 9 matches whatever a previously tagged regular expression ([8]) matched.
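
A common use of [8] together with [9] is finding a doubled word. The sketch below uses Python's re module as a stand-in (posix style grouping); the sample line is made up.

```python
import re

line = "it was was fine"

# ([a-z]+) tags a word; \1 then requires the same text again after a space.
match = re.search(r"([a-z]+) \1", line)
print(match.group())   # was was
print(match.group(1))  # was

# The same tag can be used in the replacement to drop the repeat.
fixed = re.sub(r"([a-z]+) \1", r"\1", line)
print(fixed)  # it was fine
```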

[10] \< \>

a regular expression starting with a \< construct and/or ending with a \> construct, restricts the pattern matching to the beginning of a word, and/or the end of a word. A word is defined to be a character string beginning and/or ending with the characters A-Z a-z 0-9 and _. Scintilla extends this definition by user setting. The word must also be preceded and/or followed by any character outside those mentioned.
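
Python's re module has no \< or \> constructs, so the sketch below emulates them with lookarounds in order to show the effect. WORD_START and WORD_END are names invented for this illustration, and re.ASCII limits word characters to A-Z a-z 0-9 and _, ignoring any user setting.

```python
import re

WORD_START = r"(?<!\w)(?=\w)"  # stands in for \<
WORD_END = r"(?<=\w)(?!\w)"    # stands in for \>

line = "cat concat cats cat_1 cat"

def starts(pattern):
    """Return the start offset of every match of pattern in line."""
    return [m.start() for m in re.finditer(pattern, line, flags=re.ASCII)]

at_word_start = starts(WORD_START + "cat")          # like \<cat
at_word_end = starts("cat" + WORD_END)              # like cat\>
whole_word = starts(WORD_START + "cat" + WORD_END)  # like \<cat\>

print(at_word_start)  # [0, 11, 16, 22]
print(at_word_end)    # [0, 7, 22]
print(whole_word)     # [0, 22]
```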

[11] \l

a backslash followed by d, D, s, S, w or W, becomes a character class (both inside and outside sets []).

d: decimal digits
D: any char except decimal digits
s: whitespace (space, \t \n \r \f \v)
S: any char except whitespace (see above)
w: alphanumeric & underscore (changed by user setting)
W: any char except alphanumeric & underscore (see above)
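
These classes exist under the same names in Python's re module, which makes a quick check possible. This is a sketch only: re.ASCII is passed so that d and w cover the ASCII ranges listed above, and the user setting that can change w in Scintilla is not modelled.

```python
import re

line = "id_7 = 42\tok"

digits = re.findall(r"\d", line, flags=re.ASCII)
spaces = re.findall(r"\s", line, flags=re.ASCII)
words = re.findall(r"\w+", line, flags=re.ASCII)

# A class can also be used inside a set.
digits_or_underscore = re.findall(r"[\d_]+", line, flags=re.ASCII)

print(digits)                # ['7', '4', '2']
print(spaces)                # [' ', ' ', '\t']
print(words)                 # ['id_7', '42', 'ok']
print(digits_or_underscore)  # ['_7', '42']
```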

[12] \xHH

a backslash followed by x and two hexadecimal digits, becomes the character whose ASCII code is equal to these digits. If not followed by two hexadecimal digits, it is the 'x' char itself.
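
The mapping from the two digits to a character is a plain hexadecimal conversion. A small sketch, assuming Python's re module, which accepts the same \xHH form. Unlike the behaviour described above, Python reports an error for an incomplete \x escape instead of matching 'x'.

```python
import re

# 41 hexadecimal is 65 decimal, the code of 'A'.
code_char = chr(int("41", 16))
print(code_char)  # A

# \x41 in a pattern matches that character.
found = re.search(r"\x41", "xAx").group()
print(found)  # A

# \x09 is TAB.
tabs = re.findall(r"\x09", "a\tb")
print(tabs == ["\t"])  # True
```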

[13]

a composite regular expression xy where x and y are in the form [1] to [12] matches the longest match of x followed by a match for y.

[14] ^ $

a regular expression starting with a ^ character and/or ending with a $ character, restricts the pattern matching to the beginning of the line and/or the end of the line (these constructs are called anchors). Elsewhere in the pattern, ^ and $ are treated as ordinary characters.
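
The anchors combine with the line by line search described under [3]. The sketch below imitates that behaviour with Python's re module: find_in_lines is a hypothetical helper, not a SciTE or Scintilla function. It searches each line separately with end-of-line chars stripped, which also shows why \n is never matched. Python treats ^ and $ as special everywhere in a pattern, so only the leading and trailing uses are shown.

```python
import re

def find_in_lines(pattern, text):
    """Search each line on its own; return (line number, matched text) pairs."""
    regex = re.compile(pattern)
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        match = regex.search(line)
        if match:
            hits.append((number, match.group()))
    return hits

text = "key=value\n# key=comment\nother=key\n"

at_line_start = find_in_lines(r"^key", text)
at_line_end = find_in_lines(r"key$", text)
newlines = find_in_lines(r"\n", text)

print(at_line_start)  # [(1, 'key')]
print(at_line_end)    # [(3, 'key')]
print(newlines)       # []
```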

Acknowledgments

Most of this documentation was originally written by Ozan S. Yigit.
Additions by Neil Hodgson and Philippe Lhoste.
All of this document is in the public domain.