A Brief Look at Python Regular Expressions, Part 2: Making Good Use of the Question Mark

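The basic usage

The most basic use of "?" is to make the preceding pattern optional: it matches zero or one occurrence of it. This needs little explanation.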
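As a quick illustration of the optional match, here is a minimal sketch. The pattern and the sample words are made up for this example and do not come from the original text.

```python
import re

# "u?" makes the "u" optional, so both spellings match.
pattern = re.compile(r'colou?r')

# fullmatch() requires the whole word to match the pattern.
matches = [bool(pattern.fullmatch(word)) for word in ('color', 'colour', 'colouur')]
print(matches)  # [True, True, False]
```

The third word fails because "?" allows at most one "u".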

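Escaping the question mark

Because the question mark has a special meaning, matching a literal '?' in a string requires escaping it with a backslash, i.e. '\?'.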
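The sketch below shows three ways to match a literal question mark. The sample text is invented for illustration; re.escape() and the character class are alternatives to the manual backslash, not something the original text uses.

```python
import re

text = 'Really? Yes.'

# An escaped question mark matches the literal character.
escaped = re.findall(r'\?', text)

# re.escape() adds the backslash for us.
auto_escaped = re.findall(re.escape('?'), text)

# Inside a character class the question mark is already literal.
in_class = re.findall(r'[?]', text)

print(escaped, auto_escaped, in_class)  # ['?'] ['?'] ['?']
```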

Greedy mode

The help text of the re module contains the following:

"*"      Matches 0 or more (greedy) repetitions of the preceding RE.
     	 Greedy means that it will match as many repetitions as possible.
"+"      Matches 1 or more (greedy) repetitions of the preceding RE.
"?"      Matches 0 or 1 (greedy) of the preceding RE.
*?,+?,?? Non-greedy versions of the previous three special characters.
{m,n}?   Non-greedy version of the above.

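So *, +, ?, and {m,n} are all greedy by default, meaning they match as much text as possible. In some cases this causes problems and produces unexpected results, for example: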

In [11]: re.sub(r'a(.+)u','love','I abu you!')
Out[11]: 'I love!'

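That is clearly not what we wanted! The greedy .+ ran all the way to the last 'u', the one in "you".

So how do we switch to non-greedy mode? Append a question mark after *, +, ?, or {m,n}: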

In [12]: re.sub(r'a(.+?)u','love','I abu you!')
Out[12]: 'I love you!'

There, no more loving the exclamation mark.
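To see the effect on each quantifier from the help text, here is a small sketch that compares the greedy and non-greedy forms side by side. The input string 'aaaa' and the variable names are chosen only for this illustration.

```python
import re

text = 'aaaa'

# Each pair: greedy form first, then the non-greedy form with a trailing "?".
greedy_star = re.match(r'a*', text).group()        # 'aaaa'
lazy_star = re.match(r'a*?', text).group()         # ''
greedy_plus = re.match(r'a+', text).group()        # 'aaaa'
lazy_plus = re.match(r'a+?', text).group()         # 'a'
greedy_opt = re.match(r'a?', text).group()         # 'a'
lazy_opt = re.match(r'a??', text).group()          # ''
greedy_range = re.match(r'a{2,3}', text).group()   # 'aaa'
lazy_range = re.match(r'a{2,3}?', text).group()    # 'aa'
```

The non-greedy forms take the fewest repetitions that still let the overall pattern succeed, which is why a*? and a?? match the empty string here.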

Non-capturing groups

I could not think of a good name for this usage. The help text describes it like this:

(?:...)  Non-grouping version of regular parentheses.

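A HackerRank problem used this construct, but my solution still passed after I removed it, which left me quite confused about its purpose. I found a good explanation on Stack Overflow: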

http://stackoverflow.com/questions/3512471/what-is-a-non-capturing-group

Suppose we have two URLs:

http://stackoverflow.com/
http://stackoverflow.com/questions/tagged/regex

and the pattern is:

(http|ftp)://([^/\r\n]+)(/[^\r\n]*)?

The results are:

group() http://stackoverflow.com/
group(1) http
group(2) stackoverflow.com
group(3) /

group() http://stackoverflow.com/questions/tagged/regex
group(1) http
group(2) stackoverflow.com
group(3) /questions/tagged/regex

But what if we do not care which protocol was used? This is where the construct comes in. Change the pattern to:

(?:http|ftp)://([^/\r\n]+)(/[^\r\n]*)?

and the results become:

group() http://stackoverflow.com/
group(1) stackoverflow.com
group(2) /

group() http://stackoverflow.com/questions/tagged/regex
group(1) stackoverflow.com
group(2) /questions/tagged/regex
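The comparison above can be reproduced with a short script. This is a sketch that reuses the two URLs and the two patterns from the text; the variable names are my own.

```python
import re

urls = [
    'http://stackoverflow.com/',
    'http://stackoverflow.com/questions/tagged/regex',
]

# Same pattern twice; only the first pair of parentheses differs.
capturing = re.compile(r'(http|ftp)://([^/\r\n]+)(/[^\r\n]*)?')
non_capturing = re.compile(r'(?:http|ftp)://([^/\r\n]+)(/[^\r\n]*)?')

# groups() returns every numbered group of a match as a tuple.
with_protocol = [capturing.match(u).groups() for u in urls]
without_protocol = [non_capturing.match(u).groups() for u in urls]

for row in with_protocol + without_protocol:
    print(row)
```

With (?:...) the alternation still applies, but the protocol no longer takes up group number 1, so the remaining groups shift down by one.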

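Lookahead and lookbehind assertions

This usage is extremely handy. It asserts that certain sub-patterns must (or must not) appear immediately before or after the pattern, and those sub-patterns are not part of the match and get no group number. The help text describes them as follows: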

(?=...)  Matches if ... matches next, but doesn't consume the string.
(?!...)  Matches if ... doesn't match next.
(?<=...) Matches if preceded by ... (must be fixed length).
(?<!...) Matches if not preceded by ... (must be fixed length).

Example 1:

In [24]: print(re.search(r'(?<!^)#[abc]{2,}(?=z)','#abcz'))
None

In [27]: print(re.search(r'(?<!^)#[abc]{2,}(?=z)','a#ababcz').group())
#ababc

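Here (?<!^) means the match must not start at the beginning of the string. Note that the 'z' required by (?=z) is checked but not included in the result.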
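For a side-by-side view of the four assertions, here is a minimal sketch. The sample text and the patterns are invented for this illustration and are not from the original post.

```python
import re

text = 'price: $30, cost: 45 USD, id: #77'

# (?<=...) positive lookbehind: digits directly preceded by "$".
after_dollar = re.findall(r'(?<=\$)\d+', text)           # ['30']

# (?=...) positive lookahead: digits directly followed by " USD".
before_usd = re.findall(r'\d+(?= USD)', text)            # ['45']

# (?<!...) negative lookbehind: whole numbers not preceded by "$".
not_after_dollar = re.findall(r'(?<!\$)\b\d+', text)     # ['45', '77']

# (?!...) negative lookahead: whole numbers not followed by " USD".
not_before_usd = re.findall(r'\b\d+\b(?! USD)', text)    # ['30', '77']
```

The \b anchors in the negative cases keep the engine from backtracking into the middle of a number and reporting a partial match such as '4'.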

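Example 2:

Given a string and a substring, find the (start, end) index pair of every occurrence of the substring in the string. This example comes from HackerRank.
For example, given the input string and substring: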

aaadaa
aa

the output is the start and end index of each occurrence of the substring:

(0, 1)
(1, 2)
(4, 5)

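Method 1: iteration

This approach is easy to understand, but the code is verbose and does not show off Python's elegance.

import re

s = input().strip()
k = input().strip()
result = []
i = 0
# Search again from every start offset so overlapping matches are found.
while i <= len(s) - len(k):
    m = re.search(k, s[i:])
    if m:
        # Shift the indices back to positions in the full string.
        result.append((m.start() + i, m.end() + i - 1))
    i += 1

# The same match is found from several offsets, so deduplicate before printing.
if not result:
    print((-1, -1))
else:
    for x in sorted(set(result)):
        print(x)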

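Method 2: regex lookahead

import re

S = input()
k = input()
anymatch = False
# A lookahead consumes no characters, so finditer() tries every position
# and reports overlapping matches; group 1 holds the matched text.
for m in re.finditer(r'(?=(' + k + '))', S):
    anymatch = True
    print((m.start(1), m.end(1) - 1))
if not anymatch:
    print((-1, -1))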

Sample run:

aaadaa
aa
(0, 1)
(1, 2)
(4, 5)
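The lookahead approach can also be wrapped in a reusable function. The sketch below is one possible variant: the function name find_overlapping is hypothetical, and the re.escape() call is my addition, assuming the substring should be matched literally instead of being interpreted as a regex.

```python
import re

def find_overlapping(text, sub):
    """Return (start, end) index pairs of every occurrence of sub, overlaps included."""
    # re.escape() is an addition here so that sub is treated as literal text.
    pattern = re.compile('(?=(' + re.escape(sub) + '))')
    # The lookahead itself is zero-width, so group 1 carries the real span.
    return [(m.start(1), m.end(1) - 1) for m in pattern.finditer(text)]

print(find_overlapping('aaadaa', 'aa'))  # [(0, 1), (1, 2), (4, 5)]
```

Without the escaping, a substring such as '.' would match every character; with it, find_overlapping('a.b', '.') reports only the literal dot.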

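Example 3: finding characters that repeat with one character in between

The technique from Method 2 of Example 2 applies here: search with the lookahead (?=(\w)\w\1).

In [29]: re.findall(r'(?=(\w)\w\1)', 'aaaabbbb')
Out[29]: ['a', 'a', 'b', 'b']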

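Example 4: a lookbehind must have a fixed width

That is, a lookbehind cannot contain variable-width constructs such as *, + or {m,}. As the help text above notes, the restriction applies to (?<=...) and (?<!...) only, not to lookaheads.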

In [46]: re.search(r'(?<=a.+)jj','asdjj')
error: look-behind requires fixed-width pattern
In [47]: re.search(r'(?<=a.*)jj','asdjj')
error: look-behind requires fixed-width pattern
In [48]: re.search(r'(?<=a[a-z]{1,})jj','asdjj')
error: look-behind requires fixed-width pattern
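The restriction, and one way around it, can be sketched as follows for the standard re module. The workaround with a capturing group is my suggestion and is not part of the original post; the 'jjxyz' input is made up.

```python
import re

# A variable-width lookbehind is rejected when the pattern is compiled.
try:
    re.compile(r'(?<=a.+)jj')
    compiled = True
except re.error:
    compiled = False

# A fixed-width lookbehind is accepted.
fixed = re.search(r'(?<=a..)jj', 'asdjj')

# One workaround: match the prefix normally and capture only the part we want.
captured = re.search(r'a.+(jj)', 'asdjj')

# Lookaheads, by contrast, may be variable-width.
ahead = re.search(r'jj(?=.+z)', 'jjxyz')

print(compiled, fixed.group(), captured.group(1), captured.start(1), ahead.group())
```

The capturing-group version consumes the prefix as part of the overall match, so read the result from group(1) and start(1) instead of group() and start().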

The question mark in named groups

First, the description from the help text:

(?P<name>...) The substring matched by the group is accessible by name.

Here is an example:

In [31]:  m = re.match(r'(?P<user>\w+)@(?P<website>\w+)\.(?P<extension>\w+)','myname@hackerrank.com')

In [32]:  m.groupdict()
Out[32]: {'extension': 'com', 'user': 'myname', 'website': 'hackerrank'}
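Besides groupdict(), a named group can be read individually or referenced in a template. The following sketch reuses the pattern and address from the example; the template string is made up for illustration.

```python
import re

m = re.match(r'(?P<user>\w+)@(?P<website>\w+)\.(?P<extension>\w+)',
             'myname@hackerrank.com')

# A named group is still numbered, so both lookups return the same text.
by_name = m.group('user')     # 'myname'
by_number = m.group(1)        # 'myname'

# \g<name> refers to a named group inside a replacement template.
rewritten = m.expand(r'\g<website>.\g<extension>/\g<user>')
print(by_name, by_number, rewritten)  # myname myname hackerrank.com/myname
```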

Summary

These are roughly the common uses of the question mark in Python regular expressions. The help text mentions a few more:

(?P=name)  Matches the text matched earlier by the group named name.
(?iLmsux)  Set the I, L, M, S, U, or X flag for the RE (see below).
(?(id/name)yes|no)  Matches yes pattern if the group with id/name matched,
           the (optional) no pattern otherwise.
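For reference, here is a tentative sketch of how these three constructs behave. The patterns and inputs are invented for illustration and are not drawn from the original post.

```python
import re

# (?P=word) must repeat exactly what the group named "word" matched.
doubled = re.search(r'(?P<word>\w+) (?P=word)', 'it is is fine').group()   # 'is is'

# (?i) at the start of the pattern turns on case-insensitive matching.
any_case = re.findall(r'(?i)python', 'Python PYTHON python')

# (?(1)>) requires a closing ">" only if group 1, the opening "<", matched.
bracket = re.compile(r'(<)?\w+(?(1)>)')
balanced = [bool(bracket.fullmatch(s)) for s in ('<user>', 'user', '<user', 'user>')]

print(doubled, any_case, balanced)
```

The last list comes out as [True, True, False, False]: the brackets must either both be present or both be absent.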

I have not run into these in practice yet; I will add more on them when I get the chance.