Tuesday, June 10, 2008

Notes: Mastering Regular Expressions

[abc] versus (a|b|c)
See: Mastering Regular Expressions, 1.4. Egrep Metacharacters
Don't confuse alternation with a character class. The class [abc] and the alternation (a|b|c) effectively mean the same thing, but the similarity in this example does not extend to the general case. A character class can match exactly one character, and that's true no matter how long or short the specified list of acceptable characters might be.
Alternation, on the other hand, can have arbitrarily long alternatives, each textually unrelated to the other: \<(1,000,000|million|thousand•thou)\>. However, alternation can't be negated like a character class.
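A character class matches exactly one character from its set on each use, and a leading ^ negates it so that it matches the set's complement. Here the class is [abc].
Alternation matches any one alternative from its list, and each alternative can be a whole string; the list cannot simply be negated. Here the alternation is (a|b|c).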
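The difference can be checked with a small sketch in Python's re module. It is only an illustration: the sample strings are made up, and because re has no \< and \> word anchors, \b is used in their place.

```python
import re

# A character class matches exactly one character from the set.
char_class = re.compile(r"[abc]")
# For single characters, the alternation behaves the same way.
alternation = re.compile(r"(a|b|c)")

same = all(
    bool(char_class.fullmatch(ch)) == bool(alternation.fullmatch(ch))
    for ch in "abcdxyz"
)

# A class can be negated: [^abc] matches any one character except a, b, c.
negated_hits = re.findall(r"[^abc]", "abcde")

# Alternatives can be whole strings; \b stands in for the book's \< and \>.
amounts = re.compile(r"\b(1,000,000|million|thousand thou)\b")
amount_hits = amounts.findall("a million here, 1,000,000 there")

print(same, negated_hits, amount_hits)
```

There is no alternation counterpart to [^abc]; the nearest tool is a negative lookahead placed in front of whatever should be consumed.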
Lookahead and lookbehind
See: Mastering Regular Expressions, 2.3.5.2. A few more lookahead examples
Table 2-1. Approaches to the "Jeffs" Problem
Solution Comments
s/\bJeffs\b/Jeff's/g The simplest, most straightforward, and efficient solution; the one I'd use if I weren't trying to show other interesting ways to approach the same problem. Without lookaround, the regex "consumes" the entire 'Jeffs'.
s/\b(Jeff)(s)\b/$1'$2/g Complex without benefit. Still consumes entire 'Jeffs'.
s/\bJeff(?=s\b)/Jeff'/g Doesn't actually consume the 's', but this is not of much practical value here except to illustrate lookahead.
s/(?<=\bJeff)(?=s\b)/'/g This regex doesn't actually "consume" any text. It uses both lookahead and lookbehind to match positions of interest, at which an apostrophe is inserted. Very useful to illustrate lookaround.
s/(?=s\b)(?<=\bJeff)/'/g This is exactly the same as the one above, but the two lookaround tests are reversed. Because the tests don't consume text, the order in which they're applied makes no difference to whether there's a match.
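Lookahead tests the text after the current position and lookbehind tests the text before it; neither consumes any text. Note that (?=Jeffery)Jeff is equivalent to Jeff(?=ery): both match only a 'Jeff' that begins 'Jeffery'.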
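The "Jeffs" substitutions translate fairly directly to Python's re.sub. The sketch below is a rough equivalent and not the book's Perl; the sample text is invented for the illustration.

```python
import re

text = "Jeffs book, by Jeffs"

# Consumes the whole word and rewrites it.
plain = re.sub(r"\bJeffs\b", "Jeff's", text)
# Consumes only 'Jeff'; the 's' is checked by lookahead but left in place.
lookahead = re.sub(r"\bJeff(?=s\b)", "Jeff'", text)
# Consumes nothing: matches the position between 'Jeff' and 's'.
both = re.sub(r"(?<=\bJeff)(?=s\b)", "'", text)
# Same two tests in the opposite order.
swapped = re.sub(r"(?=s\b)(?<=\bJeff)", "'", text)

# (?=Jeffery)Jeff and Jeff(?=ery) select the same four characters.
sample = "Jeff and Jeffery"
span_a = re.search(r"(?=Jeffery)Jeff", sample).span()
span_b = re.search(r"Jeff(?=ery)", sample).span()

print(plain, lookahead, both, swapped, span_a, span_b)
```

All four substitutions give the same string, and both lookahead patterns skip the standalone 'Jeff' and match only inside 'Jeffery'.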
\G, lookbehind, \Q...\E
Source: Section 3.5. Common Metacharacters and Features
\G matches where the previous match ended; see 3.5.3.3. Start of match (or end of previous match): \G
Lookbehind can only match fixed-length text: The most restrictive rule exists in Perl and Python, where the lookbehind can match only fixed-length strings. For example, (?<!\w) and (?<!this|that) are allowed, but (?<!books?) and (?<!^\w+:) are not, as they can match a variable amount of text.
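Python's re module enforces the fixed-length rule when the pattern is compiled, so it is easy to test. In this small sketch, compiles is a hypothetical helper name introduced only for the illustration.

```python
import re

def compiles(pattern):
    """Return True if Python's re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

lookbehinds = [
    r"(?<!\w)",         # one character: fixed length
    r"(?<!this|that)",  # both alternatives are four characters long
    r"(?<!books?)",     # four or five characters: variable length
    r"(?<!^\w+:)",      # \w+ is unbounded: variable length
]
results = {p: compiles(p) for p in lookbehinds}

for pattern, ok in results.items():
    print(pattern, "accepted" if ok else "rejected")
```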
Automatically escaping the metacharacters held in a variable; see 3.5.4.4. Literal-text span: \Q⋯\E
First introduced with Perl, the special sequence \Q⋯\E turns off all regex metacharacters between them, except for \E itself. (If the \E is omitted, they are turned off until the end of the regex.) It allows what would otherwise be taken as normal metacharacters to be treated as literal text. This is especially useful when including the contents of a variable while building a regular expression.
For example, to respond to a web search, you might accept what the user types as $query, and search for it with m/$query/i. As it is, this would certainly have unexpected results if $query were to contain, say, 'C:\WINDOWS\', which results in a run-time error because the search term contains something that isn't a valid regular expression (the trailing lone backslash).
\Q⋯\E avoids the problem. With the Perl code m/\Q$query\E/i, a $query of 'C:\WINDOWS\' becomes 'C\:\\WINDOWS\\', resulting in a search that finds the original 'C:\WINDOWS\' as the user expects.
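Python has no \Q⋯\E, but re.escape plays a similar role when a variable is dropped into a pattern. The sketch below is a rough Python counterpart of the Perl example; query and log_line are made-up names and values.

```python
import re

query = "C:\\WINDOWS\\"  # the literal text C:\WINDOWS\
log_line = "Found in c:\\windows\\system32"

# Used as-is, the trailing lone backslash makes the pattern invalid.
try:
    re.compile(query, re.IGNORECASE)
    raw_ok = True
except re.error:
    raw_ok = False

# re.escape backslash-escapes the metacharacters so the text is literal.
escaped = re.escape(query)
match = re.search(escaped, log_line, re.IGNORECASE)

print(raw_ok, escaped, match.group(0))
```

Exactly which characters re.escape chooses to escape differs between Python versions, so the escaped string is best treated as opaque.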

When using .*, add some conditions to constrain it
See: 4.5.3. Using Lazy Quantifiers
<B>              # Match the opening <B>
(                # Now, as many of the following as possible ...
  (?! </?B> )    #   If not <B>, and not </B> ...
  .              #   ... any character is okay
)*               # (now greedy)
</B>             # ... until the closing delimiter can match.
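This works better than <B>.*?</B>, because the lazy version can still run past a nested <B> when that is what it takes to reach a </B>.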
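A sketch of the same pattern in Python, using re.VERBOSE for the free-spacing comments. The two sample strings are chosen for the illustration, and the group is written as non-capturing, which does not change what is matched.

```python
import re

bold = re.compile(r"""
    <B>                # match the opening <B>
    (?:                # now, as many of the following as possible ...
        (?! </?B> )    #   if not <B>, and not </B> ...
        .              #   ... any character is okay
    )*                 # (greedy)
    </B>               # ... until the closing delimiter can match
""", re.VERBOSE)

simple = "<B>Billions</B> and <B>Zillions</B> of suns"
nested = "<B>Billions and <B>Zillions</B> of suns"

# Plain greedy .* swallows everything up to the last </B>.
greedy_hit = re.search(r"<B>.*</B>", simple).group(0)
# Lazy .*? stops at the first </B>, but still crosses the second <B>.
lazy_hit = re.search(r"<B>.*?</B>", nested).group(0)
# The guarded version can never cross a <B> or </B>.
guarded_hit = bold.search(nested).group(0)
guarded_all = bold.findall(simple)

print(greedy_hit, lazy_hit, guarded_hit, guarded_all, sep="\n")
```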
Greedy matching
4.2.4.2. Being too greedy
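For example, apply ^.*([0-9][0-9]) to the string 'about•24•characters•long'.
The .* first runs all the way through the final character g, then finds that two digits still have to follow, so it backs off one character at a time until it has given back everything after the • in front of 24.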
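The outcome of that backtracking can be seen in Python, with ordinary spaces in place of the • markers. The second pattern is an extra illustration of the same effect: with [0-9]+ the greedy .* gives back only as little as it must, so the group ends up with a single digit.

```python
import re

m = re.match(r"^.*([0-9][0-9])", "about 24 characters long")
whole, digits = m.group(0), m.group(1)

# .* is forced to give back only one digit, so ([0-9]+) captures just '3'.
year = re.match(r"^.*([0-9]+)", "Copyright 2003.").group(1)

print(repr(whole), digits, year)
```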
NFA versus DFA
4.3. Regex-Directed Versus Text-Directed
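An NFA is regex-directed: it tries each subexpression of the regex in turn.
A DFA is text-directed: it tracks every subexpression that could still match at the same time, which means it keeps all possible intermediate states until the match completes.
For example, matching 'to(nite|knight|night)' against the string tonight:
With an NFA: it tries nite first, backtracks on failure and tries knight, backtracks again and tries night, and then confirms the match.
With a DFA: it matches the ni of both nite and night at once, finds that only night matches nig, continues through the rest of night, and then confirms the match.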
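The two strategies can be mimicked with a toy model in Python. This is only a sketch of the idea for this one pattern, not how a real engine is built: regex_directed and text_directed are hypothetical helper names, and the text-directed version stops at the first alternative that completes, where a real DFA would go on to report the longest match.

```python
ALTERNATIVES = ["nite", "knight", "night"]

def regex_directed(text, prefix="to"):
    """NFA-style: try each alternative in turn, starting over on failure."""
    tried = []
    if not text.startswith(prefix):
        return None, tried
    rest = text[len(prefix):]
    for alt in ALTERNATIVES:
        tried.append(alt)
        if rest.startswith(alt):
            return prefix + alt, tried
    return None, tried

def text_directed(text, prefix="to"):
    """DFA-style: scan the text once, keeping every alternative still alive."""
    history = []
    if not text.startswith(prefix):
        return None, history
    alive = list(ALTERNATIVES)
    for i, ch in enumerate(text[len(prefix):]):
        alive = [alt for alt in alive if i < len(alt) and alt[i] == ch]
        history.append((ch, list(alive)))
        finished = [alt for alt in alive if len(alt) == i + 1]
        if finished:
            return prefix + finished[0], history
        if not alive:
            break
    return None, history

nfa_match, nfa_tried = regex_directed("tonight")
dfa_match, dfa_history = text_directed("tonight")
print(nfa_match, nfa_tried)
print(dfa_match, dfa_history)
```

The regex-directed run lists three attempts, while the text-directed run reads each character once and narrows the live alternatives to night when it reaches the g.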
