1. Obtain the User-Agent string that the crawler will present to the site.
2. Then wrap it in the request headers:
python
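from urllib import request

def DouBanSpide(i):
    # Douban serves 25 films per page; this crawler reads only the first 9
    # from each response, so consecutive requests start 9 films apart.
    url = "https://movie.douban.com/top250?start=" + str(i * 9)
    # Send a browser-like User-Agent instead of urllib's default one.
    user_agent = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"}
    req = request.Request(url=url, headers=user_agent)
    html = request.urlopen(req)
    # Hand the decoded HTML to the parsing function.
    Douban_data_wash(html.read().decode())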
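To check the header wrapping without touching the network, the request object can be built and inspected before it is sent. The sketch below is only an illustration and is not part of the original crawler: build_request and USER_AGENT are hypothetical names, and the offset step of 9 simply mirrors the function above.

```python
from urllib import request

# Hypothetical constant and helper (not in the original post): the helper
# builds the request only and never sends it.
USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"


def build_request(i, step=9):
    """Return a Request for the i-th slice of the Top 250 list."""
    url = "https://movie.douban.com/top250?start=" + str(i * step)
    return request.Request(url=url, headers={"User-Agent": USER_AGENT})


req = build_request(2)
print(req.full_url)  # https://movie.douban.com/top250?start=18
# urllib stores header names capitalized, so the key becomes "User-agent".
print(req.get_header("User-agent"))
```

Passing the resulting object to request.urlopen would then send it with the custom header.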
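3. Slice the Douban page with split to cut out the fields you want: rank, film title, rating, number of ratings, and the recommendation quote. Cutting the page at each <em class=""> rank marker first gives one chunk per film, so every field is read from its own entry:
python
# chunk is the HTML of one film: the text that follows its <em class=""> marker.
rank = chunk.split("</em>")[0]
# The first </span> in an entry closes the film's primary title.
title = chunk.split('</span>')[0].split('>')[-1].strip()
rate = chunk.split('v:average">')[1].split('</span>')[0]
# The rating count is the first attribute-less <span> after the rating.
number = chunk.split('v:average">')[1].split('<span>')[1].split('</span>')[0]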
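The slicing can be tried on a small hand-written fragment before running it against the live page. The sketch below is an illustration under stated assumptions: SAMPLE imitates only the markers the split calls rely on, its two film entries are made up, and parse_chunk is a hypothetical helper. It cuts the text into one chunk per film before slicing out the fields.

```python
# Hand-written sample imitating the markup assumed by the split markers.
# Both entries are invented; the second one has no quote line.
SAMPLE = (
    '<em class="">1</em><span class="title">Film A</span>'
    '<span class="rating_num" property="v:average">9.0</span>'
    '<span>100人评价</span><span class="inq">A quote.</span>'
    '<em class="">2</em><span class="title">Film B</span>'
    '<span class="rating_num" property="v:average">8.5</span>'
    '<span>50人评价</span>'
)


def parse_chunk(chunk):
    """Slice one film's HTML into (rank, title, rate, number, quote)."""
    rank = chunk.split("</em>")[0]
    title = chunk.split("</span>")[0].split(">")[-1].strip()
    after_rate = chunk.split('v:average">')[1]
    rate = after_rate.split("</span>")[0]
    number = after_rate.split("<span>")[1].split("</span>")[0]
    try:
        quote = chunk.split("inq")[1].split(">")[1].split("<")[0]
    except IndexError:
        quote = ""  # this film has no quote
    return rank, title, rate, number, quote


# The piece before the first marker is dropped; each remaining piece is one film.
films = [parse_chunk(c) for c in SAMPLE.split('<em class="">')[1:]]
for film in films:
    print(film)
```

Because each film is parsed from its own chunk, a missing quote leaves that one field empty instead of shifting the quotes of the films after it.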
4. The complete code:
python
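import random
import time
from urllib import request


def Douban_data_wash(text1):
    # Discard everything before the list heading.
    text1 = text1.split('<h1>豆瓣电影 Top 250</h1>')[1]
    # Every entry starts with its rank inside <em class="">, so splitting there
    # yields one chunk per film. Keep the first 9, matching the offset step below.
    chunks = text1.split('<em class="">')[1:10]
    for chunk in chunks:
        rank = chunk.split("</em>")[0]
        # The first </span> in an entry closes the film's primary title.
        title = chunk.split('</span>')[0].split('>')[-1].strip()
        rate = chunk.split('v:average">')[1].split('</span>')[0]
        # The rating count is the first attribute-less <span> after the rating.
        number = chunk.split('v:average">')[1].split('<span>')[1].split('</span>')[0]
        try:
            quote = chunk.split('inq')[1].split('>')[1].split('<')[0]
        except IndexError:
            # Some films have no quote line.
            print(rank + ": no quote for this entry")
            quote = " "
        # Append one line per film; UTF-8 keeps the Chinese text intact.
        with open("豆瓣数据", "a", encoding="utf-8") as file:
            file.write(
                "Rank: " + rank + ", 《" + title + "》, Douban rating: " + rate
                + ", ratings: " + number + ". Quote: " + quote + "\n")
        print("Rank {}, 《{}》, Douban rating {}, {}. Quote: {}".format(rank, title, rate, number, quote))


def DouBanSpide(i):
    # Douban serves 25 films per page; only the first 9 of each response are
    # parsed, so consecutive requests start 9 films apart.
    url = "https://movie.douban.com/top250?start=" + str(i * 9)
    user_agent = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"}
    req = request.Request(url=url, headers=user_agent)
    html = request.urlopen(req)
    Douban_data_wash(html.read().decode())


if __name__ == '__main__':
    # Start from an empty output file.
    with open("豆瓣数据", "w", encoding="utf-8") as file:
        file.write("")
    for i in range(0, 10):
        DouBanSpide(i)
        # Pause a random 2 to 10 seconds between requests.
        time.sleep(random.randint(2, 10))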
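A failed request in the loop above raises an exception and stops the whole run. One possible way to keep going is sketched below; it is not from the original post. crawl_pages is a hypothetical helper that takes the fetch function and the sleep function as parameters, so the pacing and error handling can be exercised without network access.

```python
import random
import time
from urllib import error


def crawl_pages(fetch, pages, min_delay=2, max_delay=10, sleep=time.sleep):
    """Call fetch(page) for each page, pausing randomly between requests.

    Pages whose fetch raises URLError (HTTPError is a subclass) are recorded
    as None instead of aborting the run.
    """
    pages = list(pages)
    results = {}
    for n, page in enumerate(pages):
        try:
            results[page] = fetch(page)
        except error.URLError as exc:
            print("page {} failed: {}".format(page, exc))
            results[page] = None
        if n < len(pages) - 1:  # no need to wait after the last page
            sleep(random.randint(min_delay, max_delay))
    return results


# Demo with a stand-in fetch function; no request is sent.
def fake_fetch(page):
    if page == 1:
        raise error.URLError("simulated failure")
    return "html for page {}".format(page)


delays = []
results = crawl_pages(fake_fetch, range(3), sleep=delays.append)
print(results)
print(delays)
```

In the demo the second page fails, is stored as None, and the remaining page is still fetched; the delays list receives the two pauses that would otherwise be slept.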
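5. The final results:
Copyright notice: This is an original article by Ustiniany, licensed under CC 4.0 BY-SA. When reposting, include a link to the original source and this notice.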