Python regular expressions

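re is part of the Python standard library; its source is at Lib/re.py (Lib/re/ since Python 3.11).
regex is short for regular expression. A regular expression is compiled into bytecode, which Python then executes with a matching engine written in C. Writing the pattern carefully can make that bytecode run faster.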

quick reference

RE = r'pattern'
re_obj = re.compile(RE)

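# methods of a compiled regex object, or module-level functions re.<name>(RE, ...)
match
search
findall   # returns a list of strings (tuples if the pattern has several groups), not match objects
finditer

split
sub       # replace
subn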
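
The functions listed above can be compared side by side on one input. This is a minimal sketch; the sample string and the variable names are made up for illustration.

```python
import re

# Hypothetical sample data, chosen only for illustration.
text = "a1b22c333"
digits = re.compile(r"\d+")

print(digits.match(text))            # None: match only tries at the start of the string
print(digits.search(text).group())   # 1
print(digits.findall(text))          # ['1', '22', '333']
print([m.span() for m in digits.finditer(text)])  # [(1, 2), (3, 5), (6, 9)]
print(digits.split(text))            # ['a', 'b', 'c', '']
print(digits.sub("#", text))         # a#b#c#
print(digits.subn("#", text))        # ('a#b#c#', 3)
```

The same calls are available as module-level functions, for example re.findall(r"\d+", text).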

# match object
group    # group(0) returns the entire match, group(1) returns the first group
groupdict
start
end

m.span([group])
- group defaults to 0, i.e. the entire match
- returns (m.start(group), m.end(group))
- if group did not contribute to the match, this is (-1, -1)
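
The match object methods above can be tried on a small input. This is a sketch only; the input strings and the group names year and month are hypothetical.

```python
import re

# Hypothetical input, chosen only for illustration.
m = re.search(r"(?P<year>\d{4})-(?P<month>\d{2})", "date: 2024-05")
print(m.group(0))          # 2024-05
print(m.group(1))          # 2024
print(m.groupdict())       # {'year': '2024', 'month': '05'}
print(m.start(), m.end())  # 6 13
print(m.span("month"))     # (11, 13)

# A group that took no part in the match reports (-1, -1).
alt = re.match(r"(a)|(b)", "a")
print(alt.span(1))         # (0, 1)
print(alt.span(2))         # (-1, -1)
```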

example

match returns a match object, or None if the pattern does not match at the start of the string

import re

pattern = '^a...s$'
test_string = 'abyss'
result = re.match(pattern, test_string)

if result:
  print("Search successful.")
else:
  print("Search unsuccessful.")

findall returns a list of strings

# Program to extract numbers from a string

import re

string = 'hello 12 hi 89. Howdy 34'
pattern = r'\d+'  # raw string, so \d is not treated as a string escape

result = re.findall(pattern, string) 
print(result)

# Output: ['12', '89', '34']

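group

(...) matches the enclosed pattern and creates a group. The group can be referenced later in the pattern with \number.

(?...) is the extension notation:

(?:...): non-capturing; the parentheses still group the pattern, but no group is created for the match
(?P<key>...): creates a named group, accessible through group, e.g. m.group('key')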
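
The three group forms above can be sketched as follows. The input strings and the group name surname are assumptions made for illustration.

```python
import re

# (?:...) groups the alternation without creating a group,
# so the named group is the only group in the match.
m = re.match(r"(?:Mr|Ms)\. (?P<surname>\w+)", "Ms. Lovelace")
print(m.groups())            # ('Lovelace',)
print(m.group("surname"))    # Lovelace
print(m.group(1))            # Lovelace (named groups are numbered too)

# \1 refers back to whatever group 1 matched: here, a doubled word.
doubled = re.search(r"\b(\w+) \1\b", "it was the the best")
print(doubled.group())       # the the
```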

reference group

A group can be referenced inside the pattern that defines it
A group can be read from the resulting match object
A group can be referenced in the replacement string passed to sub

Context of reference to group “quote”	Ways to reference it
in the same pattern itself	- `(?P=quote)` (as shown) - `\1`
when processing match object m	- `m.group('quote')` - `m.end('quote')` (etc.)
in a string passed to the repl argument of `re.sub()`	- `\g<quote>` - `\g<1>` - `\1`
example

Match a string enclosed in quotes: (?P<quote>['"]).*?(?P=quote)
Replace the matched text with a new value that references a group

import re

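# Sample string
text = "Name: John, Age: 30, City: New York"

# Match Name and Age with a regular expression and replace them with new values.
# \g<3> is required here: \325 would be read as an octal escape, not as group 3 followed by "25".
result = re.sub(r"(Name: )(\w+), (Age: )(\d+)", r"\1Alice, \g<3>25", text)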

print(result)
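
The quoted-string pattern from this section can be exercised with each kind of group reference. A minimal sketch; the sample text and the replacement value X are hypothetical.

```python
import re

# Pattern from the notes: the closing quote must equal the opening one.
quoted = re.compile(r"""(?P<quote>['"]).*?(?P=quote)""")

# Hypothetical sample text: the apostrophe inside does not end the match.
text = """say "it's fine" now"""

m = quoted.search(text)
print(m.group())           # "it's fine"
print(m.group("quote"))    # "
print(m.end("quote"))      # 5

# In the replacement string, \g<quote> reuses the quote character that was matched.
print(quoted.sub(r"\g<quote>X\g<quote>", text))   # say "X" now
```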

lookahead assertion

(?=...) # continue only if ... matches next
(?!...) # continue only if ... does not match next

(?=...) Matches if ... matches next, but doesn’t consume any of the string. This is called a lookahead assertion. For example, Isaac (?=Asimov) will match 'Isaac ' only if it’s followed by 'Asimov'.

(?!...) Matches if ... doesn’t match next. This is a negative lookahead assertion. For example, Isaac (?!Asimov) will match 'Isaac ' only if it’s not followed by 'Asimov'.

example

r'.*[.](?!bat$)[^.]*$' # matches every filename whose extension is not bat
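
Both lookahead forms can be checked with a short script. This is a sketch under assumed inputs; the filename list is made up for illustration.

```python
import re

# Positive lookahead: 'Isaac ' matches only when 'Asimov' follows, and 'Asimov' is not consumed.
pos = re.search(r"Isaac (?=Asimov)", "Isaac Asimov")
print(repr(pos.group()))   # 'Isaac '

# Negative lookahead: the same text is rejected because 'Asimov' does follow.
neg = re.search(r"Isaac (?!Asimov)", "Isaac Asimov")
print(neg)                 # None

# Filter out *.bat files with the pattern from the notes.
not_bat = re.compile(r".*[.](?!bat$)[^.]*$")
names = ["autoexec.bat", "sendmail.cf", "foo.bar", "printers.conf", "noextension"]
kept = [n for n in names if not_bat.match(n)]
print(kept)                # ['sendmail.cf', 'foo.bar', 'printers.conf']
```

Note that a name without any dot is also rejected, because the pattern requires a literal dot.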

Other

VERBOSE

pat = re.compile(r"""
 \s*                 # Skip leading whitespace
 (?P<header>[^:]+)   # Header name
 \s* :               # Whitespace, and a colon
 (?P<value>.*?)      # The header's value -- *? used to
                     # lose the following trailing whitespace
 \s*$                # Trailing whitespace to end-of-line
""", re.VERBOSE)

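A usage sketch for the verbose pattern above, repeated here so the snippet runs on its own. The header line is a hypothetical input.

```python
import re

pat = re.compile(r"""
 \s*                 # Skip leading whitespace
 (?P<header>[^:]+)   # Header name
 \s* :               # Whitespace, and a colon
 (?P<value>.*?)      # The header's value (non-greedy)
 \s*$                # Trailing whitespace to end-of-line
""", re.VERBOSE)

# Hypothetical header line with trailing spaces.
m = pat.match("Content-Type: text/html  ")
print(repr(m.group("header")))   # 'Content-Type'
print(repr(m.group("value")))    # ' text/html'
```

The value keeps the space that follows the colon, because the group starts immediately after ':'; only the trailing whitespace is dropped.
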
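Greedy vs Non-Greedy vs Possessive

greedy: by default *, +, and ? match as many characters as possible
non-greedy: add a ? suffix, i.e. *?, +?, ??, to match as few characters as possible
possessive (Python 3.11+): matches as many characters as possible and keeps no backtracking points, which can make the overall match fail.
- For example, with the string aaaa and the pattern a*a, a* first consumes all four a's, which leaves nothing for the final a, so the engine backtracks. In the end a* matches three a's and the final a matches the last one.
- With a*+a there is no backtracking, so the whole expression fails to match.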
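
The aaaa case described above can be checked directly. A minimal sketch: the possessive quantifier is only compiled on Python 3.11 or later, because older versions reject the syntax, and the variable names are chosen for illustration.

```python
import re
import sys

s = "aaaa"

# Greedy: a* takes all four characters, then gives one back so the final a can match.
greedy = re.fullmatch(r"a*a", s)
print(greedy is not None)      # True

# Possessive: a*+ keeps all four characters, so the final a has nothing left to match.
possessive = None
if sys.version_info >= (3, 11):
    possessive = re.fullmatch(r"a*+a", s)
print(possessive)              # None
```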

>>> s = '<html><head><title>Title</title>'
>>> print(re.match('<.*>', s).group())
<html><head><title>Title</title>

>>> print(re.match('<.*?>', s).group())
<html>