1. Goal
Scrape the comments on a video detail page of Xinpianchang (xinpianchang.com).
2. Analysis
2.1 Page analysis
Inspection shows that the comments on this page are loaded dynamically by JavaScript, so a plain Scrapy request does not contain them. We therefore use Selenium (driving Chrome) inside a downloader middleware to render the page. In this walkthrough we only fetch the data; nothing is stored.
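As a quick optional check (not part of the Scrapy project itself), you can fetch the raw HTML without a browser and apply the same XPath the spider uses below; the empty result confirms the comments are injected by JavaScript. This sketch assumes the requests and parsel packages are installed; the URL and selector are the same ones used in the spider.

import requests
from parsel import Selector

url = "https://www.xinpianchang.com/a10975710?from=ArticleList"
html = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}).text

# Apply the same XPath the spider uses. On the raw (un-rendered) HTML this
# is expected to return an empty list, because the comments are added by JS.
comments = Selector(text=html).xpath(
    "//ul[contains(@class, 'comment-list')]/li/div/div/i[@class='text']/text()"
).getall()
print(comments)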
3. Complete code
xpc.py
import scrapy


class XpcSpider(scrapy.Spider):
    name = 'xpc'
    allowed_domains = ['www.xinpianchang.com']
    start_urls = ['https://www.xinpianchang.com/a10975710?from=ArticleList']

    def parse(self, response):
        # The response here is the Selenium-rendered HTML returned by the
        # downloader middleware, so the comment nodes are present.
        results = response.xpath("//ul[contains(@class, 'comment-list')]/li/div/div/i[@class='text']/text()").extract()
        print(results)
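On its own, this spider would still see the un-rendered HTML; the downloader middleware below replaces the response with one produced by Selenium. Once both pieces are in place (and the middleware is enabled in settings.py, see the note at the end), run scrapy crawl xpc from the project root and the comment texts are printed to the console.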
middlewares.py
In this file, only the process_request method needs to be changed; the rest of the generated template can stay as it is.
from time import sleep

from scrapy import signals
from scrapy.http import HtmlResponse
from selenium.webdriver.chrome.webdriver import WebDriver

from .spiders.xpc import XpcSpider


class ScrapyadvancedDownloaderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.
        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        if isinstance(spider, XpcSpider):
            # This is also a convenient place to add a random User-Agent,
            # cookies or a proxy.
            print("Hit the interception point:", request.url)

            # Drive Chrome to fetch and render the page.
            driver = WebDriver()
            driver.get(request.url)
            sleep(2)  # crude wait for the comments to finish loading

            # Grab the rendered HTML, then close the browser.
            content = driver.page_source
            driver.quit()

            # Build a Response from the rendered HTML and hand it straight
            # to the spider, bypassing Scrapy's own downloader.
            response = HtmlResponse(request.url, body=content.encode("utf-8"))
            return response

        # For every other spider, let Scrapy download the request as usual.
        return None

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)
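One piece the listing above does not show: a downloader middleware only takes effect if it is enabled in settings.py. A minimal sketch is below; the dotted module path assumes the project is named scrapyadvanced (inferred from the middleware class name), so adjust it to your own project name.

settings.py

# Enable the Selenium downloader middleware. The dotted path assumes a
# project named "scrapyadvanced"; change it to match your project.
DOWNLOADER_MIDDLEWARES = {
    'scrapyadvanced.middlewares.ScrapyadvancedDownloaderMiddleware': 543,
}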