mirror of
https://github.com/garabik/grc.git
synced 2024-11-10 21:26:52 +01:00
Regular Expression Syntax (taken from python library reference)

A regular expression (or RE) specifies a set of strings that matches
it; the functions in this module let you check if a particular string
matches a given regular expression (or if a given regular expression
matches a particular string, which comes down to the same thing).

Regular expressions can be concatenated to form new regular
expressions; if A and B are both regular expressions, then AB is also
a regular expression. If a string p matches A and another string q
matches B, the string pq will match AB. Thus, complex expressions can
easily be constructed from simpler primitive expressions like the
ones described here. For details of the theory and implementation of
regular expressions, consult Friedl's book on regular expressions, or
almost any textbook about compiler construction.

A brief explanation of the format of regular expressions follows. For
further information and a gentler presentation, consult the Regular
Expression HOWTO, accessible from http://www.python.org/doc/howto/.

Regular expressions can contain both special and ordinary characters.
Most ordinary characters, like "A", "a", or "0", are the simplest
regular expressions; they simply match themselves. You can
concatenate ordinary characters, so last matches the string 'last'.
(In the rest of this section, we'll write REs in this special style,
usually without quotes, and strings to be matched 'in single
quotes'.)

Some characters, like "|" or "(", are special. Special characters
either stand for classes of ordinary characters, or affect how the
regular expressions around them are interpreted.

The special characters are:

"."
    (Dot.) In the default mode, this matches any character except a newline. If the
    DOTALL flag has been specified, this matches any character including a newline.
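A minimal sketch of the difference, using Python's re module; the sample string and variable names are only illustrative.

```python
import re

text = "ab\ncd"

# Default mode: "." stops at the newline.
default_match = re.match(r".*", text).group()

# With DOTALL, "." also matches the newline.
dotall_match = re.match(r".*", text, re.DOTALL).group()

print(repr(default_match))  # 'ab'
print(repr(dotall_match))   # 'ab\ncd'
```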
"^"
    (Caret.) Matches the start of the string, and in MULTILINE mode also matches
    immediately after each newline.
"$"
    Matches the end of the string or just before the newline at the end of the
    string, and in MULTILINE mode also matches before a newline. foo matches both
    'foo' and 'foobar', while the regular expression foo$ matches only 'foo'.
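The effect of MULTILINE on the "^" and "$" anchors can be sketched as follows. The three-line sample text is an assumption chosen for illustration.

```python
import re

text = "foo\nbar foo\nfoo"

# Without MULTILINE the anchors only apply to the string as a whole.
starts_default = re.findall(r"^foo", text)
ends_default = re.findall(r"foo$", text)

# With MULTILINE they also apply at every line boundary.
starts_multiline = re.findall(r"^foo", text, re.MULTILINE)
ends_multiline = re.findall(r"foo$", text, re.MULTILINE)

print(len(starts_default), len(starts_multiline))  # 1 2
print(len(ends_default), len(ends_multiline))      # 1 3
```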
"*"
    Causes the resulting RE to match 0 or more repetitions of the preceding RE, as many
    repetitions as are possible. ab* will match 'a', 'ab', or 'a' followed by any
    number of 'b's.
"+"
    Causes the resulting RE to match 1 or more repetitions of the preceding RE. ab+
    will match 'a' followed by any non-zero number of 'b's; it will not match just 'a'.
"?"
    Causes the resulting RE to match 0 or 1 repetitions of the preceding RE. ab? will
    match either 'a' or 'ab'.
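As a rough comparison of "*", "+", and "?", the sketch below tests the same candidate strings against ab*, ab+, and ab?. It uses re.fullmatch so that each whole string must match; the candidate list is illustrative.

```python
import re

candidates = ["a", "ab", "abbb"]

# ab*: 'a' followed by zero or more 'b's.
star = [s for s in candidates if re.fullmatch(r"ab*", s)]
# ab+: 'a' followed by at least one 'b'.
plus = [s for s in candidates if re.fullmatch(r"ab+", s)]
# ab?: 'a' followed by at most one 'b'.
optional = [s for s in candidates if re.fullmatch(r"ab?", s)]

print(star)      # ['a', 'ab', 'abbb']
print(plus)      # ['ab', 'abbb']
print(optional)  # ['a', 'ab']
```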
*?, +?, ??
    The "*", "+", and "?" quantifiers are all greedy; they match as much text as
    possible. Sometimes this behaviour isn't desired; if the RE <.*> is matched against
    '<H1>title</H1>', it will match the entire string, and not just '<H1>'. Adding "?"
    after the quantifier makes it perform the match in non-greedy or minimal fashion; as
    few characters as possible will be matched. Using .*? in the previous expression
    will match only '<H1>'.
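The '<H1>title</H1>' example can be reproduced directly; the variable names here are only illustrative.

```python
import re

heading = "<H1>title</H1>"

# Greedy: ".*" runs to the last ">" in the string.
greedy = re.match(r"<.*>", heading).group()

# Non-greedy: ".*?" stops at the first ">".
lazy = re.match(r"<.*?>", heading).group()

print(greedy)  # <H1>title</H1>
print(lazy)    # <H1>
```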
{m,n}
    Causes the resulting RE to match from m to n repetitions of the preceding RE,
    attempting to match as many repetitions as possible. For example, a{3,5} will match
    from 3 to 5 "a" characters. Omitting m specifies a lower bound of zero, and
    omitting n specifies an infinite upper bound.
{m,n}?
    Causes the resulting RE to match from m to n repetitions of the preceding RE,
    attempting to match as few repetitions as possible. This is the non-greedy version
    of the previous quantifier. For example, on the 6-character string 'aaaaaa', a{3,5}
    will match 5 "a" characters, while a{3,5}? will only match 3 characters.
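A short sketch of the greedy and non-greedy bounded repetition on the 'aaaaaa' example; the extra two-character input is an assumption added to show the lower bound.

```python
import re

text = "aaaaaa"

# Greedy: takes as many repetitions as allowed, up to 5.
greedy = re.match(r"a{3,5}", text).group()

# Non-greedy: stops as soon as the lower bound of 3 is met.
lazy = re.match(r"a{3,5}?", text).group()

# Fewer than m repetitions is not a match at all.
too_short = re.match(r"a{3,5}", "aa")

print(greedy)     # aaaaa
print(lazy)       # aaa
print(too_short)  # None
```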
"\"
    Either escapes special characters (permitting you to match characters like "*",
    "?", and so forth), or signals a special sequence; special sequences are discussed
    below.
    If you're not using a raw string to express the pattern, remember that Python also
    uses the backslash as an escape sequence in string literals; if the escape sequence
    isn't recognized by Python's parser, the backslash and subsequent character are
    included in the resulting string. However, if Python would recognize the resulting
    sequence, the backslash should be repeated twice. This is complicated and hard to
    understand, so it's highly recommended that you use raw strings for all but the
    simplest expressions.
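To make the raw-string advice concrete, here is a small sketch. The pattern, which matches a literal "*" followed by a literal backslash, is an arbitrary illustration.

```python
import re

# The same pattern written twice: in an ordinary literal every backslash
# meant for the RE has to be doubled, in a raw literal it does not.
plain = "\\*\\\\"
raw = r"\*\\"
same_pattern = plain == raw

# "\b" in an ordinary literal is a single backspace character, while
# r"\b" is the two characters backslash and "b".
backspace_len = len("\b")
raw_len = len(r"\b")

# The subject string below is: a * \ b
found = re.search(raw, "a*\\b").group()

print(same_pattern)            # True
print(backspace_len, raw_len)  # 1 2
```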
[]
    Used to indicate a set of characters. Characters can be listed individually, or a
    range of characters can be indicated by giving two characters and separating them
    by a "-". Special characters are not active inside sets. For example, [akm$] will
    match any of the characters "a", "k", "m", or "$"; [a-z] will match any lowercase
    letter, and [a-zA-Z0-9] matches any letter or digit. Character classes such as \w
    or \S (defined below) are also acceptable inside a set. If you want to include a
    "]" or a "-" inside a set, precede it with a backslash, or place it as the first
    character. The pattern []] will match ']', for example.
    You can match the characters not within a range by complementing the set. This is
    indicated by including a "^" as the first character of the set; "^" elsewhere will
    simply match the "^" character. For example, [^5] will match any character except
    "5".
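The set rules above can be checked with a few findall calls. The sample strings are assumptions picked so that each rule shows up.

```python
import re

text = "a-k$m]5^"

# "$" is an ordinary character inside a set.
listed = re.findall(r"[akm$]", text)

# "]" placed first in the set is taken literally.
bracket = re.findall(r"[]]", text)

# "^" anywhere but the first position is taken literally.
caret = re.findall(r"[5^]", text)

# "^" in the first position complements the set.
not_five = re.findall(r"[^5]", "1525")

# Ranges can be combined in one set.
alnum = re.findall(r"[a-zA-Z0-9]", "x-9_Q")

print(listed)    # ['a', 'k', '$', 'm']
print(bracket)   # [']']
print(caret)     # ['5', '^']
print(not_five)  # ['1', '2']
print(alnum)     # ['x', '9', 'Q']
```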
"|"
    A|B, where A and B can be arbitrary REs, creates a regular expression that will
    match either A or B. This can be used inside groups (see below) as well. To match a
    literal "|", use \|, or enclose it inside a character class, as in [|].
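A brief sketch of alternation, both at the top level and inside a group, plus the two ways of matching a literal "|". The words used are illustrative.

```python
import re

# Top-level alternation: either whole alternative may match.
animals = re.findall(r"cat|dog", "cat, dog, cow")

# Alternation limited to part of the pattern by a group.
grey = re.fullmatch(r"gr(a|e)y", "grey") is not None
groy = re.fullmatch(r"gr(a|e)y", "groy") is not None

# A literal "|": escaped, or inside a character class.
escaped = re.findall(r"\|", "a|b|c")
in_class = re.findall(r"[|]", "a|b|c")

print(animals)      # ['cat', 'dog']
print(grey, groy)   # True False
print(escaped)      # ['|', '|']
```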
(...)
    Matches whatever regular expression is inside the parentheses, and indicates the
    start and end of a group; the contents of a group can be retrieved after a match
    has been performed, and can be matched later in the string with the \number special
    sequence, described below. To match the literals "(" or ")", use \( or \), or
    enclose them inside a character class: [(] [)].
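As an illustration of retrieving group contents after a match, the sketch below uses a made-up 'user@example' string; the pattern is not taken from grc itself.

```python
import re

m = re.match(r"([a-z]+)@([a-z]+)", "user@example")

whole = m.group(0)   # the entire match
first = m.group(1)   # contents of the first group
second = m.group(2)  # contents of the second group

# Literal parentheses: escaped, or inside a character class.
parens = re.findall(r"\(|[)]", "f(x)")

print(whole, first, second)  # user@example user example
print(parens)                # ['(', ')']
```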
(?...)
    This is an extension notation (a "?" following a "(" is not meaningful otherwise).
    The first character after the "?" determines what the meaning and further syntax of
    the construct is. Extensions usually do not create a new group; (?P<name>...) is
    the only exception to this rule. Following are the currently supported extensions.
(?iLmsx)
    (One or more letters from the set "i", "L", "m", "s", "x".) The group matches the
    empty string; the letters set the corresponding flags (re.I, re.L, re.M, re.S,
    re.X) for the entire regular expression. This is useful if you wish to include the
    flags as part of the regular expression, instead of passing a flag argument to the
    compile() function.
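A minimal sketch showing that an inline flag and a flag argument behave the same way. The flag group is placed at the very start of the pattern here, which is the safest position; the 'hello' pattern is illustrative.

```python
import re

# Flag embedded in the pattern text.
inline = re.match(r"(?i)hello", "HELLO")

# Equivalent flag passed to compile().
explicit = re.compile(r"hello", re.I).match("HELLO")

# Without either, the match is case-sensitive.
sensitive = re.match(r"hello", "HELLO")

print(inline.group())    # HELLO
print(explicit.group())  # HELLO
print(sensitive)         # None
```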
(?:...)
    A non-capturing version of regular parentheses. Matches whatever regular expression
    is inside the parentheses, but the substring matched by the group cannot be
    retrieved after performing a match or referenced later in the pattern.
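The difference from ordinary parentheses shows up in what groups() returns. A small sketch with an illustrative 'ababc' input:

```python
import re

# Ordinary parentheses: both groups are recorded.
capturing = re.match(r"(ab)+(c)", "ababc").groups()

# (?:...) still groups "ab" for the "+", but records nothing.
non_capturing = re.match(r"(?:ab)+(c)", "ababc").groups()

print(capturing)      # ('ab', 'c')
print(non_capturing)  # ('c',)
```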
(?P<name>...)
    Similar to regular parentheses, but the substring matched by the group is
    accessible via the symbolic group name name. Group names must be valid Python
    identifiers. A symbolic group is also a numbered group, just as if the group were
    not named. So the group named 'id' in the example below can also be referenced as
    the numbered group 1.
    For example, if the pattern is (?P<id>[a-zA-Z_]\w*), the group can be referenced by
    its name in arguments to methods of match objects, such as m.group('id') or
    m.end('id'), and also by name in pattern text (e.g. (?P=id)) and replacement text
    (e.g. \g<id>).
(?P=name)
    Matches whatever text was matched by the earlier group named name.
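The named-group forms can be sketched together: access by name and by number, a (?P=id) back-reference in the pattern, and \g<id> in replacement text. The input strings are assumptions for illustration.

```python
import re

m = re.match(r"(?P<id>[a-zA-Z_]\w*)", "spam_1 = 2")
by_name = m.group("id")
by_number = m.group(1)   # the same group, addressed by position
end_pos = m.end("id")

# (?P=id) must repeat exactly what the group named "id" matched.
repeated = re.search(r"(?P<id>[a-zA-Z_]\w*) (?P=id)", "say the the word").group()

# \g<id> refers to the named group in replacement text.
wrapped = re.sub(r"(?P<id>[a-zA-Z_]\w*)", r"<\g<id>>", "a b")

print(by_name, by_number, end_pos)  # spam_1 spam_1 6
print(repeated)                     # the the
print(wrapped)                      # <a> <b>
```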
(?#...)
    A comment; the contents of the parentheses are simply ignored.
(?=...)
    Matches if ... matches next, but doesn't consume any of the string. This is called
    a lookahead assertion. For example, Isaac (?=Asimov) will match 'Isaac ' only if
    it's followed by 'Asimov'.
(?!...)
    Matches if ... doesn't match next. This is a negative lookahead assertion. For
    example, Isaac (?!Asimov) will match 'Isaac ' only if it's not followed by
    'Asimov'.
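Both lookahead forms, (?=...) and (?!...), can be tried on the Isaac example. Note that the matched text is 'Isaac ' in either case, since the assertion itself consumes nothing. 'Isaac Newton' is an illustrative second input.

```python
import re

positive = re.compile(r"Isaac (?=Asimov)")
negative = re.compile(r"Isaac (?!Asimov)")

pos_hit = positive.match("Isaac Asimov")
pos_miss = positive.match("Isaac Newton")
neg_hit = negative.match("Isaac Newton")
neg_miss = negative.match("Isaac Asimov")

print(repr(pos_hit.group()))  # 'Isaac '
print(pos_miss)               # None
print(repr(neg_hit.group()))  # 'Isaac '
print(neg_miss)               # None
```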

The special sequences consist of "\" and a character from the list below. If the ordinary
character is not on the list, then the resulting RE will match the second character. For
example, \$ matches the character "$".

\number
    Matches the contents of the group of the same number. Groups are numbered starting
    from 1. For example, (.+) \1 matches 'the the' or '55 55', but not 'the end' (note
    the space after the group). This special sequence can only be used to match one of
    the first 99 groups. If the first digit of number is 0, or number is 3 octal digits
    long, it will not be interpreted as a group match, but as the character with octal
    value number. Inside the "[" and "]" of a character class, all numeric escapes are
    treated as characters.
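The (.+) \1 example can be checked directly; the compiled pattern name is only illustrative.

```python
import re

doubled = re.compile(r"(.+) \1")

# \1 must repeat exactly what group 1 matched.
the_the = doubled.match("the the") is not None
digits = doubled.match("55 55") is not None
the_end = doubled.match("the end") is not None

print(the_the, digits, the_end)  # True True False
```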
\A
    Matches only at the start of the string.
\b
    Matches the empty string, but only at the beginning or end of a word. A word is
    defined as a sequence of alphanumeric or underscore characters, so the end of a
    word is indicated by whitespace or any other character that is neither
    alphanumeric nor an underscore. Inside a character range, \b represents the
    backspace character, for compatibility with Python's string literals.
\B
    Matches the empty string, but only when it is not at the beginning or end of a
    word.
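A short sketch contrasting \b and \B, plus the backspace meaning of \b inside a character class. The sample sentences are assumptions chosen so that 'cat' appears as a whole word, at a word edge, and inside a word.

```python
import re

# \b on both sides: only stand-alone occurrences of "cat".
whole_words = re.findall(r"\bcat\b", "cat concat cats cat.")

# \B on both sides: only "cat" buried inside a longer word.
inside = re.findall(r"\Bcat\B", "cat concat scatter")

# Inside a set, \b is the backspace character.
backspace = re.fullmatch(r"[\b]", "\b") is not None

print(whole_words)  # ['cat', 'cat']
print(inside)       # ['cat']
print(backspace)    # True
```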
\d
    Matches any decimal digit; this is equivalent to the set [0-9].
\D
    Matches any non-digit character; this is equivalent to the set [^0-9].
\s
    Matches any whitespace character; this is equivalent to the set [ \t\n\r\f\v].
\S
    Matches any non-whitespace character; this is equivalent to the set [^ \t\n\r\f\v].
\w
    When the LOCALE flag is not specified, matches any alphanumeric character; this is
    equivalent to the set [a-zA-Z0-9_]. With LOCALE, it will match the set [0-9_] plus
    whatever characters are defined as letters for the current locale.
\W
    When the LOCALE flag is not specified, matches any non-alphanumeric character; this
    is equivalent to the set [^a-zA-Z0-9_]. With LOCALE, it will match any character
    not in the set [0-9_], and not defined as a letter for the current locale.
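The character-class sequences \d, \D, \s, \S, \w, and \W can be compared on one ASCII-only sample line. The input is deliberately kept to ASCII, because Python 3 string patterns treat these classes as Unicode-aware by default, which is broader than the sets listed above.

```python
import re

text = "id_7 = 42\n"

digits = re.findall(r"\d+", text)      # runs of decimal digits
non_digits = re.findall(r"\D+", text)  # runs of anything else
spaces = re.findall(r"\s+", text)      # runs of whitespace
non_spaces = re.findall(r"\S+", text)  # runs of non-whitespace
words = re.findall(r"\w+", text)       # letters, digits, underscore
non_words = re.findall(r"\W+", text)   # everything else

print(digits)      # ['7', '42']
print(non_spaces)  # ['id_7', '=', '42']
print(words)       # ['id_7', '42']
```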
\Z
    Matches only at the end of the string.
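As a sketch of how \A and \Z differ from "^" and "$": the string-level anchors ignore MULTILINE and a trailing newline. The inputs are illustrative.

```python
import re

# "$" tolerates a trailing newline; \Z does not.
dollar = re.search(r"foo$", "foo\n") is not None
strict_end = re.search(r"foo\Z", "foo\n") is not None

# "^" follows MULTILINE; \A stays at the start of the string.
caret_hits = re.findall(r"^foo", "foo\nfoo", re.MULTILINE)
strict_start_hits = re.findall(r"\Afoo", "foo\nfoo", re.MULTILINE)

print(dollar, strict_end)                       # True False
print(len(caret_hits), len(strict_start_hits))  # 2 1
```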
\\
    Matches a literal backslash.