python正则表达式

常用模式

模式	含义
^	Matches the beginning of a line
$	Matches the end of the line
.	Matches any character
\d	Match one digit
\D	Matches any non-digit character
\w	Match one number or one digit
\s	Matches whitespace
\S	Matches any non-whitespace character
?	Repeats a character zero or one times
*	Repeats a character zero or more times
+	Repeats a character one or more times
[aeiou]	Matches a single character in the listed set
[^XYZ]	Matches a single character not in the listed set
[a-z0-9]	The set of characters can include a range
()	Indicates where string extraction is to start and to end

Lazy means match shortest possible string.
Greedy means match longest possible string.(default)
For example, the greedy h.+l matches ‘hell’ in ‘hello’ but the lazy h.+?l matches ‘hel’.

python调用方式

import re引入模块直接开始使用正则匹配各种功能

re.search() 返回True\False
re.match() 匹配起始位置成功返回起始位置，否则返回non
re.findall() 返回所有匹配的list
另外一种使用方法

先将正则表达式的字符串形式编译为Pattern实例
使用Pattern实例处理文本并获得匹配结果（一个Match实例）
使用Match实例获得信息
两种方法是等价的，只不过第二种支持pattern的复用

正则匹配复杂度

Python正则匹配使用基于回溯的一种NFA实现。通过数据比较，在最坏的情况下用Thompson NFA实现的awk表现比匹配回溯的NFA要好很多倍。最坏情况下的复杂度不一样，回溯NFA是O(2^N)，而Thompson的复杂度是O(N^2)。参见正则表达式匹配和NFA/DFA

practice

print re.findAll([0-9]+,'My favorite 2 number are 19 and 42')
[‘2’,’19’,’42’]

x='From stephen.marq@uct.ac.za to sansa@uci.edu Sat Jan 5 09:04:15 2008'
y=re.findall('\S+@\S+',x)
[‘stephen.marq@uct.ac.za’,’sansa@uci.edu’]
y=re.findall('From (\S+@\S+)',x)
[‘stephen.marq@uct.ac.za’]

Reference

课程链接
 regular expression验证工具RegExr

2016-05-26