Douban Crawler

1. Get the User-Agent string that your crawler will send.
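For context on why the header gets overridden: when no User-Agent is supplied, urllib identifies itself as Python-urllib, which a site may treat differently from a browser. A minimal sketch that shows the default value, using only the standard library:

```python
from urllib import request

# build_opener() returns an OpenerDirector whose addheaders list holds
# the headers urllib sends when the caller supplies none.
opener = request.build_opener()
default_headers = dict(opener.addheaders)

# The key is spelled "User-agent" in this list.
default_ua = default_headers["User-agent"]
print(default_ua)  # something like Python-urllib/3.x, depending on the interpreter
```
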
2. Then wrap it in the request headers:

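python
from urllib import request

def DouBanSpide(i):
    # Only the first 9 entries of each response are parsed, so the offset advances by 9 per call.
    url = "https://movie.douban.com/top250?start=" + str(i * 9)
    # Send a browser-like User-Agent instead of urllib's default.
    user_agent = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "}
    req = request.Request(url=url, headers=user_agent)
    html = request.urlopen(req)
    # Decode the response body (UTF-8 by default) and hand it to the parser.
    Douban_data_wash(html.read().decode())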
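
As a quick check that the header really is attached, the Request object can be inspected before anything is sent. This is only a sketch; `build_request` is a hypothetical helper name introduced here, not part of the original script.

```python
from urllib import request

def build_request(start):
    """Build (but do not send) a request for the Top 250 list at a given offset."""
    url = "https://movie.douban.com/top250?start=" + str(start)
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"}
    return request.Request(url=url, headers=headers)

req = build_request(9)
print(req.full_url)  # https://movie.douban.com/top250?start=9
# urllib stores header names capitalized, so the lookup key is "User-agent".
print(req.get_header("User-agent"))
```

Nothing touches the network here, so the offset and header can be verified without making a request to the site.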

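3. Slice the returned Douban HTML with split to pull out the fields you want: rank, movie title, rating, number of ratings, and the recommendation quote.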

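python
# Isolate entry i by splitting on the rank marker, then slice each field out of it.
item = text1.split('<em class="">')[i + 1]
rank = item.split('</em>')[0]
# Assumes the first <span> in the entry holds the title.
title = item.split('</span>')[0].split('>')[-1].strip()
rate = item.split('v:average">')[1].split('</span>')[0]
number = item.split('star')[1].split('<span>')[1].split('</span>')[0]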
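
To make the split chains easier to follow, here is a small sketch run against a hand-written fragment. The HTML in `sample` is an assumption modeled on the markers used above, not a copy of Douban's page, so real markup may differ.

```python
# Hypothetical fragment containing a single list entry.
sample = (
    '<em class="">1</em>'
    '<span class="title">Example Movie</span>'
    '<div class="star">'
    '<span class="rating_num" property="v:average">9.7</span>'
    '<span>12345人评价</span>'
    '</div>'
    '<span class="inq">A short quote.</span>'
)

# Everything after the rank marker belongs to this entry.
item = sample.split('<em class="">')[1]

rank = item.split('</em>')[0]                             # text before </em>
title = item.split('</span>')[0].split('>')[-1].strip()   # text of the first <span>
rate = item.split('v:average">')[1].split('</span>')[0]   # text after the rating marker
number = item.split('star')[1].split('<span>')[1].split('</span>')[0]
quote = item.split('inq')[1].split('>')[1].split('<')[0]

print(rank, title, rate, number, quote)
```

Each chain works the same way: split on a marker that sits just before the value, take the piece after it, then split again on whatever closes the value.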

4. The complete code:

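python
import random
import time
from urllib import request


def Douban_data_wash(text1):
    # Drop everything before the list heading.
    text1 = text1.split('<h1>豆瓣电影 Top 250</h1>')[1]
    # Each entry starts at its rank marker; element 0 is the text before the first entry.
    items = text1.split('<em class="">')
    for i in range(0, 9):
        item = items[i + 1]
        rank = item.split('</em>')[0]
        # Assumes the first <span> in the entry holds the title.
        title = item.split('</span>')[0].split('>')[-1].strip()
        rate = item.split('v:average">')[1].split('</span>')[0]
        number = item.split('star')[1].split('<span>')[1].split('</span>')[0]

        try:
            # Look for the quote inside this entry only; some entries have none.
            quote = item.split('inq')[1].split('>')[1].split('<')[0]
        except IndexError:
            print(rank + ": no quote for this entry")
            quote = " "
        # Append as UTF-8 regardless of the platform's default encoding.
        with open("豆瓣数据", "a", encoding="utf-8") as file:
            file.write(
                "Rank: " + rank + ", title: " + title + ", Douban rating: " + rate
                + ", ratings: " + number + ". Quote: " + quote + "\n")
        print("Rank {}, \"{}\", Douban rating {}, {}. Quote: {}".format(rank, title, rate, number, quote))


def DouBanSpide(i):
    # Only 9 entries are parsed per response, so the offset advances by 9 per call.
    url = "https://movie.douban.com/top250?start=" + str(i * 9)
    user_agent = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "}
    req = request.Request(url=url, headers=user_agent)
    html = request.urlopen(req)
    Douban_data_wash(html.read().decode())


if __name__ == '__main__':
    # Truncate the output file before the first append.
    with open("豆瓣数据", "w", encoding="utf-8") as file:
        file.write("")
    for i in range(0, 10):
        DouBanSpide(i)
        # Pause 2 to 10 seconds between requests.
        time.sleep(random.randint(2, 10))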
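
Network requests can fail or be refused, and the script above stops at the first error. One possible way to make fetching more tolerant is sketched below. `fetch_with_retry`, its parameters, and the injectable `opener` argument are hypothetical names introduced for illustration; they are not part of the original code.

```python
import time
from urllib import error, request

def fetch_with_retry(req, opener=request.urlopen, retries=3, delay=1.0):
    """Return the decoded response body, retrying when a URLError is raised.

    `opener` is any callable that takes the request and returns an object
    with a read() method; it defaults to urllib.request.urlopen.
    """
    if retries < 1:
        raise ValueError("retries must be at least 1")
    last_error = None
    for attempt in range(retries):
        try:
            return opener(req).read().decode()
        except error.URLError as exc:  # HTTPError is a subclass of URLError
            last_error = exc
            # Wait before the next attempt, but not after the final one.
            if attempt < retries - 1:
                time.sleep(delay)
    raise last_error
```

Under these assumptions, the call `request.urlopen(req)` followed by `read().decode()` in DouBanSpide could be swapped for `fetch_with_retry(req)`. Passing the opener as a parameter keeps the function easy to exercise without touching the network.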

5. The final result:
[Image: final results]


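Copyright notice: This is an original article by Ustiniany, licensed under CC 4.0 BY-SA. When reposting, please include a link to the original source and this notice.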