The Five Major Parsers for Python Web Scraping

Python web scraping relies on five main parsing tools.

I. Regular expressions, using the standard-library module re (import re)

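1. Matching rules

Pattern	Description
`\w`	Matches a letter, digit, or underscore
`\W`	Matches any character that is not a letter, digit, or underscore
`\s`	Matches any whitespace character; equivalent to `[ \t\n\r\f\v]`
`\S`	Matches any non-whitespace character
`\d`	Matches any digit; equivalent to `[0-9]` for ASCII text
`\D`	Matches any non-digit character
`\A`	Matches the start of the string
`\Z`	Matches the end of the string; in Perl-style flavors it also matches just before a final newline, while in Python's `re` it matches only at the very end
`\z`	Matches the very end of the string, after any final newline (Perl-style flavors; Python's `re` has traditionally used `\Z` for this)
`\G`	Matches the position where the previous match ended (Perl-style flavors; not supported by Python's `re`)
`\n`	Matches a newline character
`\t`	Matches a tab character
`^`	Matches the start of the string, or the start of each line when `re.MULTILINE` is set
`$`	Matches the end of the string, or the end of each line when `re.MULTILINE` is set
`.`	Matches any character except a newline; with the `re.DOTALL` flag it matches any character, including a newline
`[...]`	Matches any one character from the listed set; for example, `[amk]` matches `a`, `m`, or `k`
`[^...]`	Matches any one character not in the set; for example, `[^abc]` matches any character other than `a`, `b`, or `c`
`*`	Matches 0 or more repetitions of the preceding expression
`+`	Matches 1 or more repetitions of the preceding expression
`?`	Matches 0 or 1 occurrence of the preceding expression; placed after another quantifier (`*?`, `+?`) it makes that quantifier non-greedy
`{n}`	Matches exactly `n` repetitions of the preceding expression
`{n,m}`	Matches `n` to `m` repetitions of the preceding expression, greedily
`a|b`	Matches `a` or `b`
`( )`	Matches the expression inside the parentheses and also defines a group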
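A few of these rules are easier to see in action. The following is a minimal sketch using only the standard library; the sample strings are made up for illustration.

```python
import re

text = "Order 42 shipped to Alice_01 on 2018-10-28.\nOrder 7 pending."

# \d+ matches runs of digits.
numbers = re.findall(r"\d+", text)

# A character class followed by a literal underscore and digits.
tagged_names = re.findall(r"[A-Za-z]+_\d+", text)

# Greedy vs. non-greedy: .* grabs as much as possible, .*? as little as possible.
html = "<p>first</p><p>second</p>"
greedy = re.findall(r"<p>(.*)</p>", html)
lazy = re.findall(r"<p>(.*?)</p>", html)

# By default . does not match a newline; re.DOTALL (alias re.S) lifts that limit.
without_dotall = re.search(r"shipped.*pending", text)
with_dotall = re.search(r"shipped.*pending", text, re.DOTALL)

print(numbers)
print(tagged_names)
print(greedy)
print(lazy)
print(without_dotall is None, with_dotall is not None)
```

The greedy/non-greedy difference is the reason scraping patterns such as `<p>(.*?)</p>` almost always use `.*?` rather than `.*`.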

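2. Methods:
(1) match(): matches only from the start of the string: re.match('pattern', html, re.S)
(2) search(): scans the whole string, so the match need not be at the start, but returns only the first match: re.search('pattern', html, re.S)
(3) findall(): scans the whole string and returns every match: re.findall('pattern', html, re.S)
(4) sub(): replaces or removes matched text: re.sub('pattern', replacement, html)
(5) compile(): compiles a regular expression into a pattern object so it can be reused in later matches: re.compile('pattern')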
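The five methods can be compared side by side on one small string. This is only a sketch; the HTML snippet and variable names are invented for the example.

```python
import re

html = "<li>apple</li>\n<li>banana</li>"

# match() succeeds only if the pattern matches at the very start of the string.
m = re.match(r"<li>(.*?)</li>", html)
first_item = m.group(1) if m else None
no_match = re.match(r"banana", html)  # "banana" is not at the start

# search() returns the first match found anywhere in the string.
s = re.search(r"banana", html)
found_at = s.start() if s else None

# findall() returns every match (here, the contents of the capture group).
items = re.findall(r"<li>(.*?)</li>", html, re.S)

# sub() replaces every match; an empty replacement strips the tags.
text_only = re.sub(r"<.*?>", "", html)

# compile() builds a reusable pattern object that offers the same methods.
item_pattern = re.compile(r"<li>(.*?)</li>", re.S)
items_again = item_pattern.findall(html)

print(first_item, no_match, items, repr(text_only))
```

Compiling pays off mainly when the same pattern is applied to many pages in a loop.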

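Small example:

import requests
import re
url = 'https://example.com'  # placeholder URL; replace with the page you want to scrape
res = requests.get(url)
# re.S lets . match newlines, so paragraphs spanning several lines are captured too
paragraphs = re.findall('<p>(.*?)</p>', res.text, re.S)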

Hands-on project: https://mp.csdn.net/postedit/83478140

For more detail, see https://blog.csdn.net/huang1600301017/article/details/83418871

or https://cuiqingcai.com/5530.html

II. XPath, using the third-party library lxml (from lxml import etree)

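1. Common XPath rules

(1) The table below lists several commonly used XPath rules.

Expression	Description
`nodename`	Selects the child elements of the current node that are named `nodename`
`/`	Selects direct children of the current node
`//`	Selects descendants of the current node at any depth
`.`	Selects the current node
`..`	Selects the parent of the current node
`@`	Selects an attribute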
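To try these path rules without installing anything, here is a small sketch that uses the standard library's `xml.etree.ElementTree` as a stand-in for lxml. ElementTree implements only a subset of XPath: relative paths need a leading `.`, and attribute values are read with `.get()` rather than a trailing `/@href`. With lxml, `html.xpath()` accepts full XPath 1.0 expressions, including `/@href` and `text()`. The sample markup is made up.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    "<div>"
    "<ul>"
    '<li class="item-0"><a href="link1.html">first</a></li>'
    '<li class="item-1"><a href="link2.html">second</a></li>'
    "</ul>"
    "</div>"
)

# //  : search all descendants of the current node
all_links = doc.findall(".//a")

# /   : step to direct children only (div -> ul -> li)
direct_items = doc.findall("./ul/li")

# [@attr="value"] : keep only elements whose attribute has that value
item0 = doc.findall('.//li[@class="item-0"]')

# ..  : step up to the parent (the <li> elements that contain an <a>)
parents_of_links = doc.findall(".//a/..")

hrefs = [a.get("href") for a in all_links]
texts = [a.text for a in all_links]
print(hrefs, texts, len(direct_items), len(item0))
```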

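(2) Methods
etree.HTML(text): parses an HTML string
etree.tostring(html): serializes the parsed, auto-corrected document
etree.parse(file, etree.HTMLParser()): parses an HTML file
data = html.xpath('rule')  # recommended; Chrome DevTools can copy an element's XPath

(3) Selection is mainly based on structural position
Example: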

from lxml import etree

html = etree.parse('./test.html', etree.HTMLParser())

result = html.xpath('//li[@class="item-0"]')

print(result)

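Small example:
import requests
from lxml import etree
url = 'https://example.com'  # placeholder URL; replace with the page you want to scrape
res = requests.get(url)
html = etree.HTML(res.text)
# the leading // searches the whole document; adjust the path to the target page
titles = html.xpath('//div[1]/a[2]/h2/text()')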

Hands-on project: https://mp.csdn.net/postedit/83478140

For more detail, see: https://cuiqingcai.com/5545.html

III. BeautifulSoup (from bs4 import BeautifulSoup)
BeautifulSoup(html, 'lxml')

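(1) Parsers supported by Beautiful Soup

Parser	Usage	Advantages	Disadvantages
Python standard library	`BeautifulSoup(markup, "html.parser")`	Built into Python; moderate speed; tolerant of malformed documents	Poor tolerance of malformed documents in versions before Python 2.7.3 and Python 3.2.2
lxml HTML parser	`BeautifulSoup(markup, "lxml")`	Fast; tolerant of malformed documents	Requires a C library to be installed
lxml XML parser	`BeautifulSoup(markup, "xml")`	Fast; the only parser that supports XML	Requires a C library to be installed
html5lib	`BeautifulSoup(markup, "html5lib")`	Best tolerance of malformed documents; parses the document the way a browser does; produces HTML5-format documents	Slow; requires installing the external html5lib package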

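(2) Methods
find_all(name, attrs): finds all elements that match
find(name, attrs): finds the first element that matches
CSS selector select(): finds all elements that match  # recommended, because Chrome DevTools can copy the selector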
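Small example:
import requests
from bs4 import BeautifulSoup
url = 'https://example.com'  # placeholder URL; replace with the page you want to scrape
res = requests.get(url)
soup = BeautifulSoup(res.text, 'lxml')
# every <li> directly under a <ul> that is a direct child of the element with id="page_list"
items = soup.select('#page_list > ul > li')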

Hands-on project: https://mp.csdn.net/postedit/83478140

For more detail, see: https://cuiqingcai.com/5548.html

IV. pyquery (from pyquery import PyQuery)

Reference: https://cuiqingcai.com/5551.html

V. jsonpath (jsonpath.jsonpath()): essentially XPath-style queries applied to JSON

Small example (plain json, before bringing in jsonpath):

import json
json_string = '{"user_man":"xiangao"}'
json_data = json.loads(json_string)
print(json_data.get("user_man"))
# or: print(json_data["user_man"])

Full example:
import requests
import json
import jsonpath
import pygal

headers = {
        # note the trailing space: adjacent string literals are joined with nothing in between
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                      'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.92 Safari/537.36'
}
url = 'http://pg.qq.com/zlkdatasys/data_zlk_zlzx.json'
response = requests.get(url, headers=headers)
response.encoding = 'utf-8'
html = response.text  # page source as text
# print(html)

unicodest = json.loads(html)
two = jsonpath.jsonpath(unicodest, '$..yd_c6')  # gun characteristics
print(two)
three = jsonpath.jsonpath(unicodest, '$..ldtw_f2')  # gun performance stats
print(three)
four = jsonpath.jsonpath(unicodest, '$..mc_94')  # gun names
print(four)
print(four[1:8])
data  = []
num = 0
for a in three:
    if num<7:
        num+=1
        data.append([int(a[0]['wl_45']),int(a[0]['sc_54']),int(a[0]['ss_d0']),int(a[0]['wdx_a7']),int(a[0]['zds_62'])])
        print(data)
radar_chart = pygal.Radar()
radar_chart.title = 'Rifle performance'
radar_chart.x_labels = ['Power', 'Range', 'Rate of fire', 'Stability', 'Magazine size']
# "stats" avoids shadowing the built-in name "property"
for name, stats in zip(four[1:8], data):
    radar_chart.add(name, stats)
radar_chart.render_to_file('guns.svg')

This example comes from the post "Realizing the chicken-dinner (PUBG) dream with Python"; for details, see: https://blog.csdn.net/huang1600301017/article/details/83449330
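The `$..key` expressions in the example select every value stored under `key`, at any depth of the JSON document. As a rough illustration of that recursive descent, here is a hand-written pure-Python stand-in. It is not how the jsonpath library is implemented, and the helper `find_all_values` and the sample data are hypothetical, invented only for this sketch.

```python
import json


def find_all_values(node, key):
    """Collect every value stored under `key`, at any nesting depth."""
    found = []
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key:
                found.append(v)
            # keep descending: the key may appear again deeper down
            found.extend(find_all_values(v, key))
    elif isinstance(node, list):
        for item in node:
            found.extend(find_all_values(item, key))
    return found


# Invented sample data; the values are placeholders, not real game stats.
sample = json.loads('''
{"guns": [
    {"name": "rifle_a", "stats": {"power": "10", "range": "20"}},
    {"name": "rifle_b", "stats": {"power": "30", "range": "40"}}
]}
''')

names = find_all_values(sample, "name")
powers = [int(p) for p in find_all_values(sample, "power")]
missing = find_all_values(sample, "no_such_key")
print(names)
print(powers)
print(missing)
```

This helper returns an empty list when nothing matches, so the result can always be iterated safely.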

For more detail on jsonpath, see: https://blog.csdn.net/huang1600301017/article/details/83450479

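Those are the five main matching and parsing libraries for web scraping. The book 从零开始学Python网络爬虫 (Learning Python Web Scraping from Scratch) covers the first three in great detail.

If you have read this far and want to practice, try the hands-on project: https://mp.csdn.net/postedit/83478140

Original post: https://blog.csdn.net/huang1600301017/article/details/83474288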