Python Web Crawler – Hands-on Project – Embedding Selenium in Scrapy: Crawling Comments from Xinpianchang (6)

1. Goal

Crawl the comments on a Xinpianchang film page.

2. Analysis

2.1 Page analysis

Inspection shows that the comments on this page are loaded dynamically with JavaScript, so the HTML returned by a plain Scrapy request does not contain them. We therefore embed Selenium to render the page before parsing. This installment only extracts the data; storage is not covered.

3. Complete code

xpc.py

import scrapy


class XpcSpider(scrapy.Spider):
    name = 'xpc'
    allowed_domains = ['www.xinpianchang.com']
    start_urls = ['https://www.xinpianchang.com/a10975710?from=ArticleList']

    def parse(self, response):
        # The comment nodes only exist in the Selenium-rendered HTML
        # handed back by the downloader middleware below
        results = response.xpath("//ul[contains(@class, 'comment-list')]/li/div/div/i[@class='text']/text()").extract()
        print(results)

middlewares.py

Only the process_request method in this file needs to be changed.
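Note that Scrapy only calls the middleware once it is enabled in settings.py. A minimal sketch, assuming the project is named scrapyadvanced (inferred from the middleware class name):

```python
# settings.py -- enable the custom downloader middleware
# (the project name "scrapyadvanced" is an assumption based on
# the middleware class name)
DOWNLOADER_MIDDLEWARES = {
    "scrapyadvanced.middlewares.ScrapyadvancedDownloaderMiddleware": 543,
}
```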

from time import sleep

from scrapy import signals
from scrapy.http import HtmlResponse
from selenium.webdriver import Chrome as WebDriver

from .spiders.xpc import XpcSpider


class ScrapyadvancedDownloaderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.

        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called

        if isinstance(spider, XpcSpider):
            # This hook is also a convenient place to inject a random
            # User-Agent, cookies, or a proxy
            print("Intercepted request:", request.url)

            # Drive Chrome to fetch and render the page
            driver = WebDriver()
            driver.get(request.url)
            sleep(2)  # crude wait for the comments to finish loading
            content = driver.page_source
            driver.quit()  # release the browser instead of leaking it

            # Wrap the rendered HTML in a Response and return it,
            # short-circuiting the normal download
            return HtmlResponse(request.url, body=content, encoding="utf-8", request=request)

        # For every other spider, return None so the request is
        # downloaded normally
        return None

Copyright notice: this is an original article by u010671028, licensed under CC 4.0 BY-SA. Please include a link to the original source and this notice when republishing.