I've been teaching myself web scraping for two months now. Here is a record of my learning so far, along with some working crawler techniques to share:
1. How a Crawler Works
Briefly, the core of a crawler consists of two steps:
- Fetch the web page
- Extract the information
Fetching the web page, in plain terms, means doing what a browser does when you type in a URL: retrieving everything the page at that address contains. In a program you can supply the URL directly in code and download the page. The Python libraries for this step are urllib and requests.
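To make this step concrete, here is a minimal sketch with requests (the URL is only a placeholder):

import requests

response = requests.get('https://example.com')  # download the page
print(response.status_code)  # 200 means the page was fetched successfully
html = response.text  # the raw HTML of the page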
Extracting the information means pulling out just the parts you need, because a page also contains a lot of irrelevant markup. Take the sentence "身为一个理智的消费者,为何要在意二13青年对你的看法呢?" ("As a rational consumer, why care what some clueless youngster thinks of you?"): the raw scraped form is:
</p><p>身为一个理智的消费者,为何要在意二13青年对你的看法呢?</p><blockquote>

To keep the useful text and strip the surrounding markup, you need a parsing library such as Beautiful Soup or pyquery.
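For example, a minimal pyquery sketch (using a cleaned-up version of the fragment above) pulls out just the text:

from pyquery import PyQuery as pq

html = '<p>身为一个理智的消费者,为何要在意二13青年对你的看法呢?</p>'
doc = pq(html)
print(doc('p').text())  # prints the sentence with all HTML tags stripped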
2. A Zhihu Crawler
Among widely used sites, Zhihu and Weibo are comparatively easy to crawl. First, their content is fully public, unlike WeChat, where you must be someone's friend to see their Moments. Second, Zhihu and Weibo can be browsed without logging in to an account, whereas WeChat requires a login. Finally, both can be opened directly in a browser, while WeChat content is only reachable through the app.
In this Zhihu crawler, fetching pages is done with requests, and extracting information is done with JSON parsing and pyquery.
Without further ado, here is the code. The imports:
import requests
from pyquery import PyQuery as pq
#import json
import csv, codecs  # the utf_8_sig encoding used below fixes garbled characters in the CSV
import os
import numpy as np
from hashlib import md5
from bs4 import BeautifulSoup

The crawl target is the Zhihu question "2021年有什么高性价比的轻薄笔记本推荐" ("what cost-effective thin-and-light laptops are recommended in 2021"). The URL and request headers:
url = 'https://www.zhihu.com/question/438588361/answer/1703903763'#'https://www.zhihu.com/question/437319323/answer/1785586165'
headers = {
'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
}

Zhihu loads answers through an Ajax API, so the pages are fetched from that API rather than from the HTML page itself:
base_url = 'https://www.zhihu.com/api/v4/questions/421463194/answers?'  # note: this question ID should match the question you are crawling
include = 'data%5B%2A%5D.is_normal%2Cadmin_closed_comment%2Creward_info%2Cis_collapsed%2Cannotation_action%2Cannotation_detail%2Ccollapse_reason%2Cis_sticky%2Ccollapsed_by%2Csuggest_edit%2Ccomment_count%2Ccan_comment%2Ccontent%2Ceditable_content%2Cattachment%2Cvoteup_count%2Creshipment_settings%2Ccomment_permission%2Ccreated_time%2Cupdated_time%2Creview_info%2Crelevant_info%2Cquestion%2Cexcerpt%2Cis_labeled%2Cpaid_info%2Cpaid_info_content%2Crelationship.is_authorized%2Cis_author%2Cvoting%2Cis_thanked%2Cis_nothelp%2Cis_recognized%3Bdata%5B%2A%5D.mark_infos%5B%2A%5D.url%3Bdata%5B%2A%5D.author.follower_count%2Cbadge%5B%2A%5D.topics%3Bdata%5B%2A%5D.settings.table_of_content.enabled'
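# The include string above is URL-encoded; decoding it (a quick sketch, purely for
# inspection) shows which answer fields the API is asked to return:
# from urllib.parse import unquote
# print(unquote(include))  # data[*].is_normal,admin_closed_comment,...,voteup_count,...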
def get_page(page):  # page=0 requests the first page (offset 0)
    url1 = 'include=' + include + '&limit=5&' + 'offset=' + str(page) + '&platform=desktop&sort_by=default'
    url = base_url + url1  # could also be built with urlencode(params)
    try:
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            return response.json()
    except requests.ConnectionError as e:
        print('Error', e.args)

Extracting the information:
def parse_page(json):  # 'json' here is the dict returned by get_page(), not the json module
    if json:
        items = json.get('data')
        for item in items:
            zhihu = {}
            zhihu['作者'] = item.get('author').get('name')   # author name
            zhihu['回答'] = pq(item.get('content')).text()   # answer body, HTML stripped by pyquery
            zhihu['赞'] = item.get('voteup_count')           # upvote count
            yield zhihu  # a generator: yields one dict per answer

The main routine:
if __name__ == '__main__':
    i = 0
    f = codecs.open('对于笔记本的选择,轻薄本真的被看不起吗?.csv', 'w+', 'utf_8_sig')
    ftxt = open('对于笔记本的选择,轻薄本真的被看不起吗?.txt', 'w+', encoding='utf_8')
    fieldnames = ['作者', '回答', '赞']
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    while True:
        js = get_page(5 * i)  # limit=5, so the offset advances in steps of 5
        if not js:  # get_page() returns None on a failed request
            break
        results = parse_page(js)
        for res in results:
            writer.writerow(res)
            for detail in res.values():
                ftxt.write(str(detail) + '\n')
            ftxt.write('\n' + '=' * 50 + '\n')
        if js.get('paging').get('is_end'):  # the API flags the last page
            print('finish!')
            break
        i += 1
    f.close()
    ftxt.close()
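One design note: the CSV is opened through codecs.open with the utf_8_sig encoding, which writes a UTF-8 byte-order mark at the start of the file. Without the BOM, Excel tends to misread the Chinese text as mojibake, which is what the "fixes garbled characters" comment in the imports refers to.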