Everyone Loves Python: Python and Regular Expressions (re) 【Gtalent】

635 05-17


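re.match() and re.search()

Python offers two different primitive operations: re.match() checks for a match only at the beginning of the string, while re.search() checks for a match anywhere in the string (which is what Perl does by default).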

>>> import re

>>> re.match("c", "abcdef") # no match, returns None

>>> re.search("c","abcdef")

<re.Match object; span=(2, 3), match='c'>

span=(2, 3) denotes the half-open interval [2, 3): in the original string "abcdef", the match starts at index 2 and ends just before index 3.

>>> re.search("cd","abcdef")

<re.Match object; span=(2, 4), match='cd'>

Likewise, span=(2, 4) denotes the interval [2, 4).

With search(), a pattern that starts with '^' restricts the match to the beginning of the string:

>>> re.match("c", "abcdef") # No match

>>> re.search("^c", "abcdef") # No match

>>> re.search("^a", "abcdef") # Match

<re.Match object; span=(0, 1), match='a'>

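Note that in MULTILINE mode match() still matches only at the beginning of the string, whereas search() with a pattern starting with '^' matches at the beginning of every line.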

>>> re.match('X', 'A\nB\nX', re.MULTILINE) # No match

>>> re.search('^X', 'A\nB\nX', re.MULTILINE) # Match

<re.Match object; span=(4, 5), match='X'>
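The difference between match() and an anchored search() can also be checked in a small script. The following is a minimal sketch; the variable names are only illustrative and do not come from the original article.

```python
import re

text = "A\nB\nX"

# match() is anchored at index 0 of the whole string, even with MULTILINE.
no_match = re.match("X", text, re.MULTILINE)
print(no_match)                      # None

# search() with '^' and MULTILINE can match at the start of any line.
found = re.search("^X", text, re.MULTILINE)
print(found.span())                  # (4, 5)

# Without MULTILINE, '^' only matches at the very start of the string.
plain = re.search("^X", text)
print(plain)                         # None

# With MULTILINE, '^' matches once at the start of every line.
line_starts = [hit.start() for hit in re.finditer("^", text, re.MULTILINE)]
print(line_starts)                   # [0, 2, 4]
```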

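The group() and groups() methods

When working with regular expressions there is, besides the regular expression object, one more object type: the match object.

These are the objects returned by a successful call to match() or search(). A match object has two main methods: group() and groups().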

>>> m = re.search("cd","abcdef")

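>>> if m is not None:

...     m.group()

...

'cd' # the matched text

Note: the if block is there to avoid an AttributeError. When nothing matches, the function returns None, and None has no group() method.

The form shown next looks more concise because it drops the if check, but it raises an exception whenever the match fails. In real code it is better to avoid this style and keep the if check. For example: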

>>> re.match('foo', 'food on the table').group()

'foo'

# Written this way, a failed match returns None, and calling group() on it raises an exception

>>> re.match('fqoo', 'food on the table').group()

Traceback (most recent call last):

File "<stdin>", line 1, in <module>

AttributeError: 'NoneType' object has no attribute 'group'
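One way to keep the None check without repeating it everywhere is to wrap it in a small helper. This is only a sketch: the function name first_match and its default parameter are hypothetical and are not part of the re module.

```python
import re

def first_match(pattern, text, default=None):
    """Return the text matched at the start of `text`, or `default` on failure.

    Hypothetical helper for illustration only; it is not part of `re`.
    """
    m = re.match(pattern, text)
    return m.group() if m is not None else default

hit = first_match("foo", "food on the table")
miss = first_match("fqoo", "food on the table")
fallback = first_match("fqoo", "food on the table", default="")

print(hit)             # foo
print(miss)            # None
print(repr(fallback))  # ''
```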

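group() returns either the entire match or, on request, a specific subgroup. groups() returns a tuple containing only the subgroups, whether there is one or several. If the pattern has no subgroups, group() still returns the entire match, while groups() returns an empty tuple.

Note that a pair of parentheses in a pattern creates a group. If both sub-patterns are parenthesized, for example (\w+)-(\d+), each matched subgroup can be accessed separately.

Example: the regular expression below extracts an alphanumeric string and a number. group() gives access to each individual subgroup, and groups() returns a tuple containing all matched subgroups.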

>>> m = re.match(r'(\w\w\w)-(\d\d\d)', 'abc-123')

>>> m.group() # entire match

'abc-123'

>>> m.group(1) # subgroup 1

'abc'

>>> m.group(2) # subgroup 2

'123'

>>> m.groups() # all subgroups

('abc', '123')

As shown above, group() is normally used to return the entire match, but it can also fetch individual subgroups. groups() returns a tuple of all the matched substrings.

>>> m = re.match('ab', 'ab') # no subgroups

>>> m.group() # entire match

'ab'

>>> m.groups() # all subgroups

()

>>>

>>> m = re.match('(ab)', 'ab') # one subgroup

>>> m.group() # entire match

'ab'

>>> m.group(1) # subgroup 1

'ab'

>>> m.groups() # all subgroups

('ab',)

>>>

>>> m = re.match('(a)(b)', 'ab') # two subgroups

>>> m.group() # entire match

'ab'

>>> m.group(1) # subgroup 1

'a'

>>> m.group(2) # subgroup 2

'b'

>>> m.groups() # all subgroups

('a', 'b')

>>>

>>> m = re.match('(a(b))', 'ab') # two subgroups, one nested in the other

>>> m.group() # entire match

'ab'

>>> m.group(1) # subgroup 1

'ab'

>>> m.group(2) # subgroup 2

'b'

>>> m.groups() # all subgroups

('ab', 'b')
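The (\w+)-(\d+) pattern mentioned earlier can be combined with tuple unpacking, so that both subgroups are read in a single step. The sketch below also uses group() with several indices and span() with a group number, both of which are standard match-object methods; the variable names are illustrative.

```python
import re

m = re.match(r"(\w+)-(\d+)", "abc-123")
if m is not None:
    word, number = m.groups()   # unpack both subgroups at once
    pair = m.group(1, 2)        # several indices return a tuple
    number_span = m.span(2)     # position of subgroup 2 in the input
    print(word, number)         # abc 123
    print(pair)                 # ('abc', '123')
    print(number_span)          # (4, 7)
```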

findall() and finditer()

re.findall(pattern, string, flags=0)

re.finditer(pattern, string, flags=0)

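findall() finds all non-overlapping occurrences of a regular expression pattern in a string.

It resembles search() in that it scans the whole string, but unlike match() and search(), findall() always returns a list. If nothing matches, the list is empty; otherwise it contains every match, in left-to-right order of occurrence.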

>>> re.findall('car', 'car')

['car']

>>> re.findall('car', 'scary')

['car']

>>> re.findall('car', 'carry the barcardi to the car')

['car', 'car', 'car']

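If the pattern contains exactly one subgroup, each element of the list returned by findall() is the text matched by that subgroup.

If the pattern contains several subgroups, each element of the list is a tuple with one item per subgroup, and each tuple corresponds to one successful match.

finditer() is a more memory-friendly variant of findall(). Instead of a list, it returns an iterator that yields match objects.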

>>> s = 'This and that.'

>>> re.findall(r'(th\w+) and (th\w+)', s, re.I) # re.I ignores case; see the table at the end of the article for more flags

[('This', 'that')]

>>> re.finditer(r'(th\w+) and (th\w+)', s,

...     re.I).__next__().groups()

('This', 'that')

>>> re.finditer(r'(th\w+) and (th\w+)', s,

... re.I).__next__().group(1)

'This'

>>> re.finditer(r'(th\w+) and (th\w+)', s,

... re.I).__next__().group(2)

'that'

>>> [g.groups() for g in re.finditer(r'(th\w+) and (th\w+)',

... s, re.I)]

[('This', 'that')]

Multiple matches of a single group within one string:

>>> re.findall(r'(th\w+)', s, re.I)

['This', 'that']

>>> it = re.finditer(r'(th\w+)', s, re.I)

>>> g = it.__next__()

>>> g.groups()

('This',)

>>> g.group(1)

'This'

>>> g = it.__next__()

>>> g.groups()

('that',)

>>> g.group(1)

'that'

>>> [g.group(1) for g in re.finditer(r'(th\w+)', s, re.I)]

['This', 'that']
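Because finditer() yields match objects rather than plain strings, each result also carries its position in the input, and a for loop or the built-in next() can replace the explicit __next__() calls. A minimal sketch, using the same sample string as above:

```python
import re

s = "This and that."

# Each item produced by finditer() is a match object, so the span is available.
hits = [(m.group(), m.span()) for m in re.finditer(r"th\w+", s, re.I)]
print(hits)             # [('This', (0, 4)), ('that', (9, 13))]

# The built-in next() fetches the first match without calling __next__() directly.
first = next(re.finditer(r"th\w+", s, re.I))
print(first.group())    # This
```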

sub() and subn()

re.sub(pattern, repl, string, count=0, flags=0)

re.subn(pattern, repl, string, count=0, flags=0)

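# Behaves like sub(), but returns a tuple (new_string, number_of_substitutions).

sub() and subn() both replace every part of the string that matches the regular expression. The replacement is usually a string, but it can also be a function that returns the replacement string.

subn() works like sub() but also reports the total number of substitutions: the new string and the count are returned together as a two-element tuple.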

>>> re.sub('X', 'Mr. Smith', 'attn: X\n\nDear X,\n')

'attn: Mr. Smith\n\nDear Mr. Smith,\n'

>>>

>>> re.subn('X', 'Mr. Smith', 'attn: X\n\nDear X,\n')

('attn: Mr. Smith\n\nDear Mr. Smith,\n', 2)

>>>

>>> print(

... re.sub('X', 'Mr. Smith', 'attn: X\n\nDear X,\n')

... )

attn: Mr. Smith

Dear Mr. Smith,

>>> re.sub('[ae]', 'X', 'abcdef')

'XbcdXf'

>>> re.subn('[ae]', 'X', 'abcdef')

('XbcdXf', 2)
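The count parameter in the signatures above limits how many substitutions are made; the default of 0 means "replace all". A short sketch of that behaviour:

```python
import re

# count=1 stops after the first substitution.
once = re.sub("[ae]", "X", "abcdef", count=1)
print(once)          # Xbcdef

# subn() reports how many substitutions were actually made.
once_n = re.subn("[ae]", "X", "abcdef", count=1)
print(once_n)        # ('Xbcdef', 1)
```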

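Besides retrieving groups by number with the match object's group() method, you can write \N in the replacement string, where N is the number of the group to insert. The code below simply converts dates from the American MM/DD/YY{,YY} format to the DD/MM/YY{,YY} format common in other countries.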

>>> re.sub(r'(\d{1,2})/(\d{1,2})/(\d{2}|\d{4})',

... r'\2/\1/\3', '2/20/91') # Yes, Python is...

'20/2/91'

>>> re.sub(r'(\d{1,2})/(\d{1,2})/(\d{2}|\d{4})',

... r'\2/\1/\3', '2/20/1991') # ... 20+ years old!

'20/2/1991'
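As noted earlier, the replacement may also be a function that receives the match object and returns the replacement string. The sketch below applies this to the same date conversion. The function name swap_day_month and the zero-padding are illustrative choices, not part of the original example. It also lists \d{4} before \d{2}, so that a four-digit year is captured in full; with the original order, group 3 captures only the first two digits and the result is correct only because the remaining digits are left untouched.

```python
import re

def swap_day_month(m):
    """Build the replacement text from the match object (hypothetical helper)."""
    month, day, year = m.groups()
    return f"{int(day):02d}/{int(month):02d}/{year}"

# \d{4} comes first so that a four-digit year is captured as a whole.
pattern = r"(\d{1,2})/(\d{1,2})/(\d{4}|\d{2})"

single = re.sub(pattern, swap_day_month, "2/20/1991")
print(single)        # 20/02/1991

several = re.subn(pattern, swap_day_month, "2/20/91 and 12/25/1991")
print(several)       # ('20/02/91 and 25/12/1991', 2)
```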

split()

re.split(pattern, string, maxsplit=0, flags=0)

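Split string by the occurrences of pattern. If pattern contains capturing parentheses, the text of every group is also included in the resulting list. If maxsplit is nonzero, at most maxsplit splits are made, and the remainder of the string is returned as the final element of the list.

# Split 'str1:str2:str3' on ':'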

>>> re.split(':', 'str1:str2:str3')

['str1', 'str2', 'str3']

# \W matches any non-word character (anything other than a letter, digit, or underscore); it is equivalent to [^\w]

>>> re.split(r'\W+', 'Words, words, words.')

['Words', 'words', 'words', '']

>>>

# If the pattern contains parentheses, the text matched inside them, i.e. the separators, also appears in the result

>>> re.split(r'(\W+)', 'Words, words, words.')

['Words', ', ', 'words', ', ', 'words', '.', '']

>>>

>>> re.split(r'\W+', 'Words, words, words.', 1)

['Words', 'words, words.']

>>>

>>> re.split('[a-f]+', '0a3B9', flags=re.IGNORECASE)

['0', '3', '9']

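If the separator contains a capturing group and it matches at the start of the string, the result begins with an empty string. The same holds for the end of the string.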

>>> re.split(r'(\W+)', '...words, words...')

['', '...', 'words', ', ', 'words', '...', '']

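That way, the separator components are always found at the same relative indices within the result list.

Empty matches for the pattern split the string only when they are not adjacent to a previous empty match.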

>>> re.split(r'\b', 'Words, words, words.')

['', 'Words', ', ', 'words', ', ', 'words', '.']

>>>

>>> re.split(r'\W*', '...words...')

['', '', 'w', 'o', 'r', 'd', 's', '', '']

>>>

>>> re.split(r'(\W*)', '...words...')

['', '...', '', '', 'w', '', 'o', '', 'r', '', 'd', '', 's', '...', '', '', '']
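When the empty strings produced at the edges are not wanted, one common approach is to filter them out after splitting. This is a sketch of that idea, not something the original article prescribes:

```python
import re

parts = re.split(r"\W+", "Words, words, words.")
print(parts)         # ['Words', 'words', 'words', '']

# Drop the empty strings left by separators at the start or end of the input.
words = [p for p in parts if p]
print(words)         # ['Words', 'words', 'words']
```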

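Example: how would you implement a simple parser for a web site such as Google or Yahoo! Maps? A user might enter a city and a state, a city plus a ZIP code, or all three. That calls for something more powerful than plain string splitting, as shown below.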

>>> import re

>>> DATA = (

... 'Mountain View, CA 94040',

... 'Sunnyvale, CA',

... 'Los Altos, 94023',

... 'Cupertino 95014',

... 'Palo Alto CA',

... )

>>> for datum in DATA:

...     print(re.split(r', |(?= (?:\d{5}|[A-Z]{2})) ', datum))

...

['Mountain View', 'CA', '94040']

['Sunnyvale', 'CA']

['Los Altos', '94023']

['Cupertino', '95014']

['Palo Alto', 'CA']
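The same pattern can be compiled once and wrapped in a function for reuse. The names LOCATION_SEP and parse_location below are hypothetical, chosen only for this sketch:

```python
import re

# Split on ", " or on a single space that is followed by a
# five-digit ZIP code or a two-letter state code.
LOCATION_SEP = re.compile(r", |(?= (?:\d{5}|[A-Z]{2})) ")

def parse_location(line):
    """Return the city / state / ZIP fields found in `line` as a list."""
    return LOCATION_SEP.split(line)

full = parse_location("Mountain View, CA 94040")
short = parse_location("Palo Alto CA")
print(full)          # ['Mountain View', 'CA', '94040']
print(short)         # ['Palo Alto', 'CA']
```

The lookahead keeps the space inside "Mountain View" or "Palo Alto" from acting as a separator, because neither is followed by two capital letters or five digits.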
