A Summary of Regular Expression Matching with re – 源码巴士

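I: Introduction

Python's built-in re module implements regular-expression matching, and it is a staple tool for web scraping and data processing.

II: Main usage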

1: re.match

import re

# re.match(pattern, string, flags)
print('re.match(pattern, string, flags)-----------------------------------------')
string = 'Cats are smarter than dogs'
r = re.match(r'(c.*) are (.*?) than (.*s)', string, re.M | re.I)
if r is None:
    # re.match returns None on failure; calling methods on None would raise AttributeError.
    print('no match')
else:
    print(r)
    print(r.span())             # (0, 26)
    print(r.group().__class__)  # <class 'str'>
    print(r.group(1))           # Cats
    print(r.group(2))           # smarter
    print(r.group(3))           # dogs
    # print(r.group(4))         # would raise IndexError: no such group

2: re.search

# re.search(pattern, string, flags)
print('re.search(pattern, string, flags)----------------------------------------')
string = 'Cats are smarter than dogs'
r = re.search(r'(Ar.*) (.*?) than (.*s)', string, re.M | re.I)
if r is None:
    # re.search returns None on failure; calling methods on None would raise AttributeError.
    print('no match')
else:
    print(r)
    print(r.span())             # (5, 26)
    print(r.group().__class__)  # <class 'str'>
    print(r.group(1))           # are
    print(r.group(2))           # smarter
    print(r.group(3))           # dogs
    # print(r.group(4))         # would raise IndexError: no such group

3: re.sub

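# re.sub(pattern, repl, string, count, flags)
print('re.sub(pattern, repl, string, count, flags)------------------------------')
# The trailing Chinese text means "this is a foreign phone number".
s = '2004-959-559 # # # 这是一个国外电话号码'
# The lazy .*? matches as little as possible, so ' #.*?' matches only ' #';
# count=2 removes the first two occurrences.
r = re.sub(r' #.*?', '', s, count=2)
print(r)  # 2004-959-559 # 这是一个国外电话号码


# repl can also be a function: it is called with each match object
# and returns the replacement string.
def double(matched):
    """Return the matched number doubled, as a string."""
    value = int(matched.group('value'))
    print(value)
    return str(value * 2)


def upper(matched):
    """Return the matched letters in upper case."""
    value = matched.group('value')
    print(value)
    return value.upper()


s = 'A2004c959D559e # # # 这是一个国外电话号码'
r = re.sub(r'(?P<value>\d+)', double, s)
print(r)  # A4008c1918D1118e # # # 这是一个国外电话号码
r = re.sub(r'(?P<value>[a-zA-Z]+)', upper, s)
print(r)  # A2004C959D559E # # # 这是一个国外电话号码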

4: re.compile

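# re.compile(pattern, flags)
print('re.compile(pattern, flags)-----------------------------------------------')
pattern = re.compile(r'([a-z]+)[0-9]+([a-z]+)', re.I)
s = 'one12twothree34fourR'
# Pattern.match(string, pos, endpos): matching starts at index 1;
# an endpos past the end of the string is harmless here.
m = pattern.match(s, 1, 30)
print(m)
print(m.span())    # (1, 13)
print(m.groups())  # ('ne', 'twothree')
print(m.group(1))  # ne
print(m.group(2))  # twothree
print(m.start(1))  # 1
print(m.end(1))    # 3
print(m.start(2))  # 5
print(m.end(2))    # 13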

5: re.findall

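# re.findall(pattern, string, flags)
print('re.findall(pattern, string, flags)---------------------------------------')
s = 'one12twothree34fourR\noneD133PO2twFothreTe3UI4fourR'
# findall returns every non-overlapping match as a list of strings.
r = re.findall(r'[a-z]+', s, re.I)
print(r)  # ['one', 'twothree', 'fourR', 'oneD', 'PO', 'twFothreTe', 'UI', 'fourR']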

6: re.finditer

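# re.finditer(pattern, string, flags)
print('re.finditer(pattern, string, flags)--------------------------------------')
# re.finditer returns an iterator that yields one match object per match.
# s is the string defined in the re.findall example.
r = re.finditer(r'[a-z]+', s, re.I)
for it in r:
    print(it.group())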
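
The difference between re.findall and re.finditer can be sketched as follows. This is a minimal illustration; the sample string and variable names are made up for the example.

```python
import re

s = 'one12two'

# findall returns the matched substrings as a list of strings.
words = re.findall(r'[a-z]+', s)
print(words)

# finditer yields match objects lazily, so positions are available too.
spans = [(m.group(), m.span()) for m in re.finditer(r'[a-z]+', s)]
print(spans)
```

finditer is the better fit when the position of each match matters, or when the input is large and a full list is not needed.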

7: re.split

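# re.split(pattern, string, maxsplit, flags)
print('re.split(pattern, string, maxsplit, flags)-------------------------------')
# maxsplit is the maximum number of splits; the default 0 splits at every match.
# The resulting list has (number of splits + 1) items.
s = 'oNe12twothree34fourRneD133PO2twFothr555eTe3UI4fourR'
# maxsplit and flags are passed by keyword; passing them positionally is deprecated since Python 3.13.
r = re.split(r'[a-z]+', s, maxsplit=5, flags=re.I)
print(r)       # ['', '12', '34', '133', '2', '555eTe3UI4fourR']
print(len(r))  # 6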
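
The maxsplit rule (list length = number of splits + 1) is easier to see on a short string. A small sketch with a made-up input:

```python
import re

s = 'a1b2c3d'

# maxsplit=0 (the default) splits at every match: 3 splits, 4 items.
all_parts = re.split(r'\d', s)
print(all_parts)

# maxsplit=2 stops after two splits, so the list has 2 + 1 = 3 items
# and the rest of the string is returned unsplit as the last item.
two_splits = re.split(r'\d', s, maxsplit=2)
print(two_splits)
```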

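III: Additional notes

flags:
	Flag	Description
	re.I	Case-insensitive matching.
	re.L	Locale-dependent matching for \w, \W, \b, \B and case-insensitive matching; in Python 3 it can only be used with bytes patterns.
	re.M	Multi-line matching: ^ and $ also match at the start and end of each line.
	re.S	Makes . match any character, including a newline.
	re.U	Unicode matching, which affects \w, \W, \b, \B; in Python 3 this is already the default for str patterns, so the flag is redundant.
	re.X	Verbose mode: whitespace in the pattern is ignored and # comments are allowed, which makes complex patterns easier to read.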
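
A minimal sketch of how re.M, re.S and re.X change matching. The sample text and the year_month pattern are made up for illustration.

```python
import re

text = 'first line\nsecond line'

# Without re.M, ^ matches only at the very start of the string.
starts_default = re.findall(r'^\w+', text)
# With re.M, ^ also matches right after every newline.
starts_multiline = re.findall(r'^\w+', text, flags=re.M)
print(starts_default, starts_multiline)

# Without re.S, . stops at a newline; with re.S it matches across it.
dot_default = re.match(r'.+', text).group()
dot_all = re.match(r'.+', text, flags=re.S).group()
print(repr(dot_default), repr(dot_all))

# re.X ignores whitespace in the pattern and allows # comments.
year_month = re.compile(r'''
    (\d{4})   # year
    -         # separator
    (\d{2})   # month
''', re.X)
print(year_month.match('2019-09').groups())
```

Flags can be combined with |, as the re.match example does with re.M | re.I.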

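re.Pattern:
    re.compile() returns a re.Pattern object (called RegexObject in the Python 2 documentation).

re.Match (called MatchObject in the Python 2 documentation):
	group() returns the string matched by the RE.
	start() returns the position where the match starts.
	end() returns the position where the match ends (exclusive).
	span() returns a tuple (start, end) with the match positions.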

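IV: Details

Difference between match and search:
    re.match only tries to match at the beginning of the string; if the beginning does not fit the pattern, the match fails and the function returns None.
    re.search scans the whole string and returns the first match it finds, or None if there is no match anywhere.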
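
The difference can be checked with a pattern that occurs only later in the string. A minimal sketch that reuses the sample sentence from the examples:

```python
import re

text = 'Cats are smarter than dogs'

# re.match only tries at index 0, so a pattern that occurs later fails.
at_start = re.match(r'dogs', text)
print(at_start)

# re.search scans forward and returns the first match it finds.
found = re.search(r'dogs', text)
print(found.span())

# Anchoring the search with ^ makes it behave like re.match here.
anchored = re.search(r'^dogs', text)
print(anchored)
```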

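r = re.sub(r'(?P<value>\d+)', double, s):
    (?P<value>\d+):
        (?P<value>...): defines a named group called value.
        \d+: the pattern inside the group; it matches one or more digits.
        repl function: re.sub calls double once per match and passes it the match object, so the matched text is available as matched.group('value').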
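
A short sketch of other ways to reach a named group, on a shortened version of the sample string. The variable names are made up for illustration, and double is redefined so the snippet runs on its own.

```python
import re


def double(matched):
    # matched is a match object; the named group is read by its name.
    return str(int(matched.group('value')) * 2)


s = 'A2004c959D559e'
doubled = re.sub(r'(?P<value>\d+)', double, s)
print(doubled)

# A replacement string can refer to the same group with \g<value>.
bracketed = re.sub(r'(?P<value>\d+)', r'[\g<value>]', s)
print(bracketed)

# groupdict() returns all named groups of a match as a dictionary.
first = re.search(r'(?P<value>\d+)', s)
print(first.groupdict())
```

A replacement string is enough when the matched text is only rearranged; a function is needed when the replacement has to be computed, as in double.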

V: Source code

Source code (GitHub): https://github.com/Rainstyed/rainsty/blob/master/LearnPython/re_basis.py

Original article: https://blog.csdn.net/weixin_43933475/article/details/100544578