Web Crawler Notes (3): The Scrapy Framework

Table of Contents

I. Introduction to the Scrapy framework

II. Basic usage of the Scrapy framework

1) Installing the environment
2) Basic commands
3) Project layout
4) Creating a spider file
5) Scrapy architecture
6) The five core components
7) How Scrapy works
8) yield
9) scrapy shell
10) Using pymysql
11) CrawlSpider
12) Log messages and log levels
13) Garbled (non-UTF-8) output
14) Persistent storage in Scrapy
- Via command-line feed export
- Via pipelines
15) Scraping images with ImagesPipeline
Case study: image scraping with Scrapy
16) Middleware (handling dynamically loaded data)
Case study: scraping NetEase News
17) POST requests in Scrapy
18) Proxies

III. Applying the Scrapy framework: the BaiduSpider project, step by step

1) Create the project
2) Modify the items script
3) Create the spider script
4) Modify the settings script
5) Run the crawler
6) Modify the pipelines script
7) Customize the middleware

IV. Hands-on project: scraping course information from the China University MOOC site

V. Summary

VI. Follow-up: a case of analyzing the scraped data


I. Introduction to the Scrapy framework

- What is a framework?
    - A project template that bundles many ready-made features and is highly reusable.

- How should you learn a framework?
    - Focus on learning how to use the specific features the framework encapsulates.

- What is Scrapy?
    - A popular, full-featured crawling framework. It provides high-performance persistent storage, asynchronous downloading, fast data parsing, and support for distributed crawling.

Scrapy is an application framework written for crawling websites and extracting structured data. It can be used in a wide range of programs, from data mining and information processing to archiving historical data.

II. Basic usage of the Scrapy framework

1) Installing the environment

    - macOS or Linux: pip install scrapy
    - Windows:
        - pip install wheel
        - Download a Twisted wheel from http://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted
        - Install Twisted: pip install Twisted-17.1.0-cp36-cp36m-win_amd64.whl
        - pip install pywin32
        - pip install scrapy
    - Test: run the scrapy command in a terminal; if no error is reported, the installation succeeded.

2) Basic commands

- Create a project: scrapy startproject xxx
- cd xxx
- Create a spider file under the spiders subdirectory:
    - scrapy genspider spiderName www.xxx.com
- Run the project:
    - scrapy crawl spiderName

Note: run these commands from inside the project directory (scrapy crawl works anywhere within the project, including the spiders folder).

3) Project layout

    spiders/
        __init__.py
        your_spider.py      ---> created by you; implements the core crawling logic
    __init__.py
    items.py                ---> where the data structure is defined: a class inheriting from scrapy.Item
    middlewares.py          ---> middleware (e.g. proxies)
    pipelines.py            ---> the pipeline file; its class handles post-processing of the downloaded data
                                 (default priority 300; valid range 1-1000, and the smaller the number, the higher the priority)
    settings.py             ---> configuration, e.g. whether to obey robots.txt, the User-Agent, etc.

4) Creating a spider file

    (1) Change into the spiders folder: cd project_name/project_name/spiders
    (2) scrapy genspider spider_name domain

    Basic structure of a spider file (see the skeleton below):

        The spider class inherits from scrapy.Spider
        name = 'baidu'           ---> the name used when running the spider
        allowed_domains          ---> the domains the spider may crawl; URLs outside these domains are filtered out
        start_urls               ---> the spider's start URL(s); a list, usually with a single URL
        parse(self, response)    ---> the callback that parses the data
            response.text        ---> the response as a string
            response.body        ---> the response as bytes
            response.xpath()     ---> returns a list of Selector objects
            extract()            ---> extracts the data strings held by the Selector objects
            extract_first()      ---> extracts the first element of the Selector list
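The points above can be seen in a minimal spider skeleton. This is only a sketch: the domain, URL and the extracted field are placeholders, not part of the original project.

import scrapy

class BaiduSpider(scrapy.Spider):
    name = 'baidu'                          # used with: scrapy crawl baidu
    allowed_domains = ['www.baidu.com']     # requests to other domains are filtered out
    start_urls = ['http://www.baidu.com/']  # initial request(s)

    def parse(self, response):
        # response.text -> str, response.body -> bytes
        # xpath() returns a list of Selector objects; extract_first() pulls out the first data string
        title = response.xpath('//title/text()').extract_first()
        yield {'title': title}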

5) Scrapy architecture

    (1) Engine         ---> runs automatically; you rarely touch it. It organizes all request objects and dispatches them to the downloader.
    (2) Downloader     ---> receives request objects from the engine and fetches the data.
    (3) Spiders        ---> a Spider class defines how a site (or group of sites) is crawled: the crawling actions (e.g. whether to follow links) and how structured data (items) is extracted from page content. In other words, the Spider is where you define what to crawl and how to parse each page.
    (4) Scheduler      ---> has its own scheduling rules; you rarely need to touch it.
    (5) Item Pipeline  ---> the final stage that processes the data, with hooks left for your own code. Once an item is collected in a Spider, it is passed to the Item Pipeline, where components process it in a defined order. Each item pipeline component is a plain Python class implementing a few simple methods; it receives an item, performs some action on it, and decides whether the item continues through the pipeline or is dropped.

    Typical uses of an item pipeline:

    1. Cleaning HTML data
    2. Validating scraped data (checking that items contain certain fields)
    3. Checking for (and dropping) duplicates
    4. Storing the scraped results in a database

6) The five core components

    Engine
        Handles the data flow of the whole system and triggers events (the core of the framework).
    Scheduler
        Accepts requests from the engine, pushes them onto a queue, and hands them back when the engine asks again. Think of it as a priority queue of URLs: it decides which URL to crawl next and removes duplicate URLs.
    Downloader
        Downloads page content and returns it to the spiders (the downloader is built on Twisted, an efficient asynchronous networking framework).
    Spiders
        Do the main work: extract the information you need (items) from specific pages. They can also extract links for Scrapy to continue crawling.
    Item Pipeline
        Processes the items extracted by the spiders; its main jobs are persisting items, validating them, and removing unwanted data. After a page is parsed by a spider, its items are sent through the pipeline components in a fixed order.

7) How Scrapy works

The engine takes the start URLs from the spider and hands them to the scheduler; scheduled requests go through the engine to the downloader, and the downloaded responses come back through the engine to the spider's callbacks, which yield new requests (sent back to the scheduler) and items (sent to the item pipeline). (Architecture diagram omitted.)

8) yield

1. A function containing yield is no longer an ordinary function but a generator, which can be iterated over.

2. yield works like return: on each iteration, execution stops at the yield and the value to its right is returned. The key point is that the next iteration resumes from the line right after the yield it stopped at.

3. In short: yield returns a value like return does, but it also remembers where it stopped, and the next iteration continues from the following line.
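A minimal illustration in plain Python (not Scrapy-specific):

def gen():
    print('step 1')
    yield 1          # the first next() returns 1 and pauses here
    print('step 2')
    yield 2          # the second next() resumes after the first yield and returns 2

g = gen()            # calling the function builds a generator; nothing runs yet
print(next(g))       # prints "step 1" and then 1
print(next(g))       # prints "step 2" and then 2

In the same way, a Scrapy parse() method yields items and Requests one at a time instead of building a whole list in memory.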

9) scrapy shell

Running scrapy shell <url> downloads the page and drops you into an interactive Python shell with the response object preloaded, which is handy for testing XPath/CSS selectors before writing them into a spider.
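A typical session looks roughly like this (the URL is only an example):

# In a terminal:  scrapy shell "http://www.baidu.com"
# Scrapy fetches the page and opens a Python prompt with `response` already defined:
response.status                                   # e.g. 200
response.xpath('//title/text()').extract_first()  # try selectors interactively
view(response)                                    # open the downloaded page in a browser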

10) Using pymysql

1. pip install pymysql
2. pymysql.connect(host, port, user, password, db, charset)
3. conn.cursor()
4. cursor.execute() (a full round trip is sketched below)
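Putting the four steps together, a sketch of storing a scraped record in MySQL. The connection parameters, database and table are made up for illustration.

import pymysql

# step 2: connect (parameters below are assumptions, adjust to your own database)
conn = pymysql.connect(host='localhost', port=3306, user='root',
                       password='123456', db='spider', charset='utf8mb4')
cursor = conn.cursor()          # step 3: get a cursor
try:
    # step 4: execute a parameterized SQL statement
    cursor.execute('INSERT INTO news(title, link) VALUES (%s, %s)',
                   ('some title', 'http://example.com'))
    conn.commit()               # write queries must be committed
except Exception:
    conn.rollback()             # undo the transaction on failure
finally:
    cursor.close()
    conn.close()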

11) CrawlSpider

- CrawlSpider: a class, a subclass of Spider
    - Ways to crawl a whole site:
        - With Spider: issue follow-up requests manually
        - With CrawlSpider
    - Using CrawlSpider:
        - Create a project
        - cd XXX
        - Create the spider file from the crawl template:
            - scrapy genspider -t crawl xxx www.xxxx.com
            - Link extractor:
                - Extracts the links that match the given rule (the allow pattern)
            - Rule parser:
                - Sends the responses of the extracted links to the specified callback for parsing

How it works: each Rule uses its LinkExtractor to pull matching links out of every response, requests them, and passes the responses to the Rule's callback (optionally following links from those pages as well), as in the sketch below.
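A sketch of what the crawl template produces once the LinkExtractor/Rule pieces are filled in; the domain, allow pattern and extracted field are placeholders, not from the original projects.

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class PageSpider(CrawlSpider):
    name = 'page'
    allowed_domains = ['www.xxx.com']
    start_urls = ['http://www.xxx.com/']

    rules = (
        # LinkExtractor: pull every link whose URL matches `allow` out of each response
        # Rule: send each extracted link's response to `callback`; follow=True keeps
        # extracting links from those pages as well
        Rule(LinkExtractor(allow=r'list_\d+\.html'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        item = {}
        item['title'] = response.xpath('//title/text()').extract_first()
        yield item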

12) Log messages and log levels

(1) Log levels:

    CRITICAL: severe errors
    ERROR:    ordinary errors
    WARNING:  warnings
    INFO:     general information
    DEBUG:    debugging information

    The default log level is DEBUG, so every message at DEBUG level or above is printed.

(2) Settings in settings.py:

    LOG_FILE:  write everything that would appear on screen to a file instead (the file name should end in .log)
    LOG_LEVEL: set the minimum level to display, i.e. which messages are shown and which are hidden

For example, to show only error messages, add this to settings.py:

# show only the specified type of log messages
LOG_LEVEL = 'ERROR'
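Similarly, to redirect the log to a file instead of the screen (the file name here is arbitrary):

# settings.py: write log output to a .log file instead of the console
LOG_FILE = 'spider.log'
LOG_LEVEL = 'WARNING'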

13) Garbled (non-UTF-8) output

To keep exported Chinese text from being garbled, add this to settings.py:

FEED_EXPORT_ENCODING = 'utf-8'

14) Persistent storage in Scrapy

- Via command-line feed export:

    - Restriction: only the return value of the parse method can be stored, and only to a local file
    - Note: the supported file types are 'json', 'jsonlines', 'jl', 'csv', 'xml', 'marshal', 'pickle'
    - Command: scrapy crawl xxx -o filePath
    - Pros: simple, efficient and convenient
    - Cons: quite limited (data can only go to files with the supported extensions)

- Via pipelines (see the sketch after this list):

    - Workflow:
        - Parse the data
        - Define the corresponding fields in the item class (items.py)
        - Pack the parsed data into an item object
        - Submit the item object to the pipeline for persistence (yield it from the spider)
        - In the pipeline class's process_item, persist the data carried by the received item
        - Enable the pipeline in settings.py
    - Pros:
        - Very general: the data can be stored anywhere you like.
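A minimal sketch of the pipeline flow just described; the project name, file name and item field are assumptions for illustration.

# pipelines.py -- minimal file-writing pipeline
class SavePipeline:
    def open_spider(self, spider):          # called once when the spider starts
        self.fp = open('./data.txt', 'w', encoding='utf-8')

    def process_item(self, item, spider):   # called for every item the spider yields
        self.fp.write(item['title'] + '\n')
        return item                         # pass the item on to the next pipeline, if any

    def close_spider(self, spider):         # called once when the spider closes
        self.fp.close()

# settings.py -- enable it (smaller number = higher priority, range 1-1000)
# ITEM_PIPELINES = {'myproject.pipelines.SavePipeline': 300}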

15) Scraping images with ImagesPipeline

- How does scraping string data differ from scraping image data in Scrapy?
    - Strings: just parse them with XPath and submit them to a pipeline for storage.
    - Images: XPath only gives you the src attribute; you then have to request each image URL separately to get the binary data.

- ImagesPipeline:
    - You only need to parse the img src values and submit them to the pipeline; the pipeline requests each src, downloads the binary image data, and persists it for you.
- Goal: scrape the high-resolution images from the sc.chinaz.com image section.
- Workflow:
    - Parse the data (the image URLs)
    - Submit the item holding the image URL to the designated pipeline class
    - In the pipelines file, define a custom pipeline class based on ImagesPipeline and override:
        - get_media_requests
        - file_path
        - item_completed
    - In settings.py:
        - Set the image storage directory: IMAGES_STORE = './imgs_bobo'
        - Enable the custom pipeline class

Case study: image scraping with Scrapy

img.py:

# -*- coding: utf-8 -*-
import scrapy
from imgsPro.items import ImgsproItem

class ImgSpider(scrapy.Spider):
    name = 'img'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['http://sc.chinaz.com/tupian/']

    def parse(self, response):
        div_list = response.xpath('//div[@id="container"]/div')
        for div in div_list:
            # note: the page lazy-loads images, so the real URL is in the pseudo attribute src2
            src = div.xpath('./div/a/img/@src2').extract_first()

            item = ImgsproItem()
            item['src'] = src

            yield item

items.py:

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class ImgsproItem(scrapy.Item):
    # define the fields for your item here like:
    src = scrapy.Field()
    # pass

pipelines.py:

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html


# class ImgsproPipeline(object):
#     def process_item(self, item, spider):
#         return item
from scrapy.pipelines.images import ImagesPipeline
import scrapy
class imgsPileLine(ImagesPipeline):

    # issue a request for each image URL carried by the item
    def get_media_requests(self, item, info):

        yield scrapy.Request(item['src'])

    # decide the file name/path the image is stored under
    def file_path(self, request, response=None, info=None):
        imgName = request.url.split('/')[-1]
        return imgName

    def item_completed(self, results, item, info):
        return item  # hand the item on to the next pipeline class, if any
middlewares.py is left as the generated default template in this project (no middleware is enabled in settings.py), so it is omitted here.

settings.py:

# -*- coding: utf-8 -*-

# Scrapy settings for imgsPro project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://doc.scrapy.org/en/latest/topics/settings.html
#     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://doc.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'imgsPro'

SPIDER_MODULES = ['imgsPro.spiders']
NEWSPIDER_MODULE = 'imgsPro.spiders'

LOG_LEVEL = 'ERROR'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# (other default, commented-out settings omitted)

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
   'imgsPro.pipelines.imgsPileLine': 300,
}


# directory where downloaded images are stored
IMAGES_STORE = './imgs_bobo'

16) Middleware (handling dynamically loaded data)

- Middleware
    - Downloader middleware
        - Position: between the engine and the downloader
        - Role: intercept, in one place, every request and response in the project
        - Intercepting requests (see the sketch below):
            - UA spoofing: process_request
            - Proxy IPs: process_exception, returning the request
        - Intercepting responses:
            - Tamper with the response data / response object
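Before the full case study below, a sketch of the two request-interception points mentioned above. The class name is made up, the UA strings are examples, and the proxy address simply reuses the one shown later in section 18.

import random

class InterceptDownloaderMiddleware:
    user_agents = [
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0 Safari/537.36',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0 Safari/537.36',
    ]
    proxies = ['https://113.68.202.10:9999']

    def process_request(self, request, spider):
        # UA spoofing: give every outgoing request a random User-Agent
        request.headers['User-Agent'] = random.choice(self.user_agents)
        return None

    def process_exception(self, request, exception, spider):
        # if a request fails (e.g. the IP is blocked), attach a proxy and re-schedule it
        request.meta['proxy'] = random.choice(self.proxies)
        return request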

Case study: scraping NetEase News

items.py:

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class WangyiproItem(scrapy.Item):
    # define the fields for your item here like:
    title = scrapy.Field()
    content = scrapy.Field()

wangyi.py:

# -*- coding: utf-8 -*-
import scrapy
from selenium import webdriver
from wangyiPro.items import WangyiproItem
class WangyiSpider(scrapy.Spider):
    name = 'wangyi'
    # allowed_domains = ['www.cccom']
    start_urls = ['https://news.163.com/']
    models_urls = []  # stores the URLs of the five news-section pages
    # (parse() below extracts those five section URLs)

    # create a browser instance (used when the middleware re-renders the section pages)
    def __init__(self):
        self.bro = webdriver.Chrome(executable_path='/Users/bobo/Desktop/小猿圈爬虫课程/chromedriver')

    def parse(self, response):
        li_list = response.xpath('//*[@id="index2016_wrap"]/div[1]/div[2]/div[2]/div[2]/div[2]/div/ul/li')
        alist = [3,4,6,7,8]
        for index in alist:
            model_url = li_list[index].xpath('./a/@href').extract_first()
            self.models_urls.append(model_url)

        # request each of the five section pages in turn
        for url in self.models_urls:  # send a request for every section URL
            yield scrapy.Request(url,callback=self.parse_model)

    # the news titles in each section are loaded dynamically
    def parse_model(self,response):  # parse each section page for the news titles and detail-page URLs
        # response.xpath()
        div_list = response.xpath('/html/body/div/div[3]/div[4]/div[1]/div/div/ul/li/div/div')
        for div in div_list:
            title = div.xpath('./div/div[1]/h3/a/text()').extract_first()
            new_detail_url = div.xpath('./div/div[1]/h3/a/@href').extract_first()


            item = WangyiproItem()
            item['title'] = title

            # request the news detail page
            yield scrapy.Request(url=new_detail_url,callback=self.parse_detail,meta={'item':item})
    def parse_detail(self,response):  # parse the news body text
        content = response.xpath('//*[@id="endText"]//text()').extract()
        content = ''.join(content)
        item = response.meta['item']
        item['content'] = content

        yield item


    def closed(self,spider):
        self.bro.quit()


settings.py:

# -*- coding: utf-8 -*-

# Scrapy settings for wangyiPro project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://doc.scrapy.org/en/latest/topics/settings.html
#     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://doc.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'wangyiPro'

SPIDER_MODULES = ['wangyiPro.spiders']
NEWSPIDER_MODULE = 'wangyiPro.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'wangyiPro (+http://www.yourdomain.com)'
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# (other default, commented-out settings omitted)
# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
   'wangyiPro.middlewares.WangyiproDownloaderMiddleware': 543,
}

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
   'wangyiPro.pipelines.WangyiproPipeline': 300,
}
LOG_LEVEL = 'ERROR'

pipelines.py:

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html


class WangyiproPipeline(object):
    def process_item(self, item, spider):
        print(item)
        return item

middlewares.py:

# -*- coding: utf-8 -*-

# Define here the models for your spider middleware
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/spider-middleware.html

from scrapy import signals


from scrapy.http import HtmlResponse
from time import sleep
class WangyiproDownloaderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.



    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.

        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        return None
    # this method intercepts the responses of the five section pages and replaces them
    def process_response(self, request, response, spider):  # spider: the spider instance
        bro = spider.bro  # reuse the browser object defined on the spider class

        # pick out the responses that need to be tampered with:
        # the request identifies the response, and the request URL tells us whether it is a section page
        if request.url in spider.models_urls:
            bro.get(request.url)  # load the section page in the browser
            sleep(3)
            page_text = bro.page_source  # now contains the dynamically loaded news data

            # build a new response object that contains the dynamically loaded news data
            # (fetched conveniently via selenium) and return it in place of the original response
            new_response = HtmlResponse(url=request.url,body=page_text,encoding='utf-8',request=request)

            return new_response
        else:
            # responses of all other requests are returned unchanged
            return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.

        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

17) POST requests in Scrapy

(1) Override the start_requests method:

        def start_requests(self)

(2) In start_requests, yield/return:

        scrapy.FormRequest(url=url, headers=headers, callback=self.parse_item, formdata=data)

            url:      the address to POST to
            headers:  custom request headers
            callback: the callback method
            formdata: the data carried by the POST request, as a dict
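Putting it together, a sketch of a start_requests() that sends a POST request with form data; the spider name, URL and form fields are placeholders, not from the original projects.

import scrapy

class PostSpider(scrapy.Spider):
    name = 'post_demo'

    def start_requests(self):
        url = 'https://fanyi.baidu.com/sug'      # example endpoint, replace with your own
        data = {'kw': 'dog'}                     # the form data: a dict of strings
        yield scrapy.FormRequest(url=url,
                                 formdata=data,
                                 callback=self.parse_item)

    def parse_item(self, response):
        print(response.text)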

18) Proxies

(1) In settings.py, enable the downloader middleware:

        DOWNLOADER_MIDDLEWARES = {
           'postproject.middlewares.Proxy': 543,
        }

(2) In middlewares.py, set the proxy on each request:

        def process_request(self, request, spider):
            request.meta['proxy'] = 'https://113.68.202.10:9999'
            return None

III. Applying the Scrapy framework: the BaiduSpider project, step by step

1) Create the project

Syntax: scrapy startproject project_name [path_to_store_the_project]

If no path is given, the project is generated in the directory where the command is run. For example, running the command in PyCharm's terminal generates the project skeleton under the current directory (screenshots omitted).

2) Modify the items script

Scrapy provides the Item class for turning scraped data into structured data. You do this by defining an Item subclass (inheriting from scrapy.Item) and declaring each piece of data in it as a scrapy.Field.

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class BaiduspiderItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass

class NewsItem(scrapy.Item):
    index = scrapy.Field()    # rank
    title = scrapy.Field()    # title
    link = scrapy.Field()     # link
    newsNum = scrapy.Field()  # number of search results

3) Create the spider script

Syntax: scrapy genspider [template] <name> <domain>

template: the template to create from; if omitted, the default template is used
name: the name of the spider script; after creation, a .py file with that name appears under the spiders directory

In the BaiduSpider project, create a spider that scrapes the hot-search news on the Baidu home page; name it news and give it the domain www.baidu.com.

Notes on parsing:
- response.xpath() returns a list, and every element of the list is a Selector object
- extract() pulls the string stored in a Selector object's data attribute out of it

(Table omitted: commonly used parameters of scrapy.Request and their descriptions.)

In the BaiduSpider project, the news spider works as follows:

1) In the parse() method, extract the rank, title and link of each hot-search news entry on the Baidu home page and write them into a NewsItem object; then call scrapy.Request() to request the news link, use the callback parameter to set the callback to parse_newsnum, and use the meta parameter to pass the NewsItem object between the two parsing methods.

2) Define parse_newsnum() to parse the response of that new request, extract the number of search results for each news entry, and write it into the NewsItem object.

news.py:

import scrapy                                   # import the scrapy module
# import the NewsItem class from the items module
from BaiduSpider.items import NewsItem
from copy import deepcopy                       # import deepcopy
class NewsSpider(scrapy.Spider):                # define the NewsSpider class
    name = 'news'                               # initialize name
    # initialize allowed_domains
    allowed_domains = ['www.baidu.com']
    start_urls = ['http://www.baidu.com/']      # initialize start_urls
    def parse(self, response):                  # define the parse method
        # find all li nodes under the hot-search ul node
        news_list = response.xpath(
            '//ul[@class="s-hotsearch-content"]/li'
        )
        for news in news_list:                  # iterate over the li nodes
            item = NewsItem()                   # create an item object
            # find the rank node, get its text, and assign it to item['index']
            item['index'] = news.xpath(
                'a/span[1]/text()').extract_first()
            # find the title node, get its text, and assign it to item['title']
            item['title'] = news.xpath(
                'a/span[2]/text()').extract_first()
            # find the link node, get its value, and assign it to item['link']
            item['link'] = news.xpath('a/@href').extract_first()
            # send the request and set the callback to parse_newsnum
            yield scrapy.Request(
                item['link'],
                callback=self.parse_newsnum,
                meta={'item': deepcopy(item)}
                # when meta passes the item between two parsing methods, use a deep copy
                # (and import deepcopy) so the item data does not get mixed up
            )

    def parse_newsnum(self, response):  # define the parse_newsnum method
        item = response.meta['item']  # receive the item
        # find the node holding the number of search results, get its text, and assign it
        item['newsNum'] = response.xpath(
            '//span[@class="nums_text"]/text()').extract_first()
        yield item  # return the item


Tip:

When scrapy.Request() uses the meta parameter to pass an Item between two parsing methods, pass a deep copy, e.g. meta={'item': deepcopy(item)} (importing deepcopy), so the Item data does not get corrupted.

Note:

If the import of NewsItem from the items module cannot be resolved, set the source path: right-click the project name and choose Mark Directory as -> Sources Root.

Going further:

Scrapy also provides FormRequest() for sending requests that submit form data, such as POST requests. Its common parameters are url, callback, method, formdata, meta and dont_filter, where formdata is a dict holding the form data and dont_filter is a bool: if you need to submit the same URL several times, set it to True so the requests are not dropped as duplicates.

4) Modify the settings script

The settings script provides a global namespace of key-value mappings whose values can be read from code; the default settings file contains about 25 entries.

Two settings are worth adding.

One shows only error-level log messages:

# show only the specified type of log messages
LOG_LEVEL = 'ERROR'

The other keeps exported Chinese text from being garbled:

FEED_EXPORT_ENCODING = 'utf-8'

settings.py:

# Scrapy settings for BaiduSpider project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'BaiduSpider'

SPIDER_MODULES = ['BaiduSpider.spiders']
NEWSPIDER_MODULE = 'BaiduSpider.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36'    # set the USER_AGENT

# Obey robots.txt rules
ROBOTSTXT_OBEY = False    # do not obey the robots protocol

# (other default, commented-out settings omitted)

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    # enable RandomUserAgentMiddleware and set its order
    'BaiduSpider.middlewares.RandomUserAgentMiddleware': 350,
}


# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    # enable TextPipeline and set its order
    'BaiduSpider.pipelines.TextPipeline': 300,
    # enable MongoPipeline and set its order
    'BaiduSpider.pipelines.MongoPipeline': 400,
}
MONGO_URI = 'localhost'    # database connection address
MONGO_DB = 'baidu'         # database name



FEED_EXPORT_ENCODING = 'utf-8'

5) Run the crawler

With the steps above, the BaiduSpider project is basically complete and can be run.

Syntax: scrapy crawl spider_name

Note: run it inside the BaiduSpider project directory.

You can also save the scraped content to a file from the command line; for example, scrapy crawl news -o news.json saves the Item contents to a JSON file.

Tip:

When exporting JSON with -o, Unicode escapes are used by default, which makes Chinese content hard to read. Change the default encoding in settings.py by adding FEED_EXPORT_ENCODING = 'utf-8'.

6) Modify the pipelines script

For more complex processing, such as filtering out useful data or saving data to a database, define Item Pipelines in the pipelines script.

Defining an Item Pipeline only requires defining a class that implements process_item(). process_item has two parameters: item, which receives every Item the Spider generates, and spider, the Spider instance (if you create several Spiders you can tell them apart via spider.name). The method must either return a dict or Item object containing the data, or raise a DropItem exception to discard the Item.

In this project, define a TextPipeline class that drops the Item whose index is 3, and a MongoPipeline class that stores the processed Items in a MongoDB database.

pipelines.py:

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


# useful for handling different item types with a single interface
from itemadapter import ItemAdapter


class BaiduspiderPipeline:
    def process_item(self, item, spider):
        return item

#导入DropItem模块
from scrapy.exceptions import DropItem
class TextPipeline:						#定义TextPipeline类
#定义process_item方法
    def process_item(self, item, spider):
        if item['index']=='3':			#如果item中“index”为“3”
            raise DropItem()				#删除item
        else:								#如果item中“index”不为“3”
            return item					#返回item

import csv
class CsvPipeline:
        def __init__(self):
            # csv文件的位置,无需事先创建
            store_file = 'news.csv'
            # 打开(创建)文件
            self.file = open(store_file, 'w', newline='')
            # csv写法
            self.writer = csv.writer(self.file)#, dialect="excel"
        def process_item(self, item, spider):
            # 判断字段值不为空再写入文件
            if item['title']:
                #写入csv文件
                self.writer.writerow([item['index'], item['title'], item['link']])
            return item

        def close_spider(self, spider):
            # 关闭爬虫时顺便将文件保存退出
            self.file.close()


import pymongo							#导入pymongo模块
class MongoPipeline:					#定义MongoPipeline类
    #定义__init__方法
    def __init__(self, mongo_uri, mongo_db):
         self.mongo_uri = mongo_uri	#初始化类中的mongo_uri
         self.mongo_db = mongo_db		#初始化类中的mongo_db
    @classmethod							#使用classmethod标识
    def from_crawler(cls, crawler):	#定义from_crawler方法
        #获取settings.py文件中的数据库的连接地址和数据库名
        return cls(
            mongo_uri = crawler.settings.get('MONGO_URI'),
            mongo_db = crawler.settings.get('MONGO_DB')
        )
    def open_spider(self, spider):	#定义open_spider方法
        #连接MongoDB数据库
        self.client = pymongo.MongoClient(self.mongo_uri)
        #创建数据库
        self.db = self.client[self.mongo_db]
    def close_spider(self, spider):	#定义close_spider方法
        self.client.close()				#关闭数据库连接
    #定义process_item方法
    def process_item(self, item, spider):
        data = {
            'index': item['index'],
            'title': item['title'],
            'link': item['link'],
            'newsNum': item['newsNum'],
        }									#初始化data
        table = self.db['news']		#新建集合
        table.insert_one(data)			#向数据库中插入数据
        return item						#返回item

After defining the TextPipeline and MongoPipeline classes, you still need to enable the two pipelines in ITEM_PIPELINES in settings.py, set their order, and define the database connection address and database name (see the modified settings.py above).

The value for each key in ITEM_PIPELINES is a number indicating the calling priority: the smaller the number, the higher the priority.

In the BaiduSpider directory, run scrapy crawl news and the content is saved into the database.

 

7) Customize the middleware

In this project, we customize a Downloader Middleware that sets a random User-Agent header on each request, by defining a RandomUserAgentMiddleware class in middlewares.py:

middlewares.py:

# Define here the models for your spider middleware
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html

from scrapy import signals

# useful for handling different item types with a single interface
from itemadapter import is_item, ItemAdapter



class BaiduspiderSpiderMiddleware:
    @classmethod
    def from_crawler(cls, crawler):
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        return None

    def process_spider_output(self, response, result, spider):
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        pass

    def process_start_requests(self, start_requests, spider):
        for r in start_requests:
            yield r

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)

class BaiduspiderDownloaderMiddleware:
    @classmethod
    def from_crawler(cls, crawler):
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        return None

    def process_response(self, request, response, spider):
        return response

    def process_exception(self, request, exception, spider):
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)

import random                               # import the random module
# define the RandomUserAgentMiddleware class
class RandomUserAgentMiddleware:
    def __init__(self):                     # define __init__
        self.user_agent_list = [
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.18363',
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36',
            'Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)',
            'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)',
            'Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)',
            'Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)',
            'Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6',
            'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1',
            'Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0',
            'Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5',
            'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.8) Gecko Fedora/1.9.0.8-1.fc10 Kazehakase/0.5.6',
            'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11',
            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.20 (KHTML, like Gecko) Chrome/19.0.1036.7 Safari/535.20'
        ]                                   # the user_agent_list
    # define process_request
    def process_request(self, request, spider):
        # pick an entry from user_agent_list at random
        useragent = random.choice(self.user_agent_list)
        # set the User-Agent header of the request
        request.headers.setdefault('User-Agent', useragent)
        return None                         # return None


class RandomProxyMiddleware():
    def __init__(self):
        self.proxy_list = [
            'http://121.232.148.167:9000',
            'http://39.105.28.28:8118',
            'http://113.195.18.133:9999'
        ]

    def process_request(self, request, spider):
        proxy = random.choice(self.proxy_list)
        request.meta.setdefault('proxy', proxy)
        print(request.meta['proxy'])


After the customization, enable the Downloader Middleware in DOWNLOADER_MIDDLEWARES in settings.py, set its order, and remove (or comment out) the fixed USER_AGENT setting.

The Spider Middleware that Scrapy provides and enables by default is enough for most needs and normally does not need to be modified by hand.

IV. Hands-on project: scraping course information from the China University MOOC site

Goal: use the Scrapy framework to scrape the course information returned by a search (e.g. for "python") on the China University MOOC site, including course name, offering university, course category, enrollment count, course overview, teaching objectives and prerequisites, and store the courses with more than 10,000 participants in MongoDB.

1) scrapy startproject MOOCSpider        # create the project

2) items.py:

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class MoocspiderItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass

class CourseItem(scrapy.Item):
    courseName = scrapy.Field()             # course name
    university = scrapy.Field()             # offering university
    category = scrapy.Field()               # course category
    enrollCount = scrapy.Field()            # enrollment count
    overview = scrapy.Field()               # course overview
    objective = scrapy.Field()              # teaching objectives
    preliminaries = scrapy.Field()          # prerequisites


class SchoolItem(scrapy.Item):
    university = scrapy.Field()             # university
    courseName = scrapy.Field()             # course name
    enrollCount = scrapy.Field()            # enrollment count
    teacher = scrapy.Field()                # teacher

3) Create the spider scripts

cd MOOCSpider

scrapy genspider course www.icourse163.org

scrapy genspider school www.icourse163.org

4) Modify course.py. Since the search is triggered by a POST request, delete the default start_urls attribute and override start_requests(), using scrapy.FormRequest() to send the POST request with the form data and setting the callback to parse.

In parse(), extract the course name, offering university, course category and enrollment count, and also get the university abbreviation and course ID to build a new URL (e.g. https://www.icourse163.org/course/HENANNU-1003544138, where HENANNU is the university abbreviation and 1003544138 is the course ID). Request that URL with scrapy.Request(), setting the callback to parse_section(), which extracts the course overview, teaching objectives and prerequisites. Because the data comes from different pages, pass the item with the meta parameter of scrapy.Request() and use a deep copy.

course.py:

import scrapy									#导入scrapy模块
#导入items模块中的CourseItem类
from MOOCSpider.items import CourseItem
import json										#导入json模块
from copy import deepcopy						#导入deepcopy模块
#定义CourseSpider类
class CourseSpider(scrapy.Spider):
    name = 'course'								#初始化name
    #初始化allowed_domains
    allowed_domains = ['www.icourse163.org']

    # 由于启动程序时发送的是post请求,所以删除默认的start_urls属性,重写start_requests()方法,其中
    # 使用scrapy.FormRequest()方法发送POSt请求,表单提交数据,指定回调方法为parse
    # start_urls = ['http://www.icourse163.org/']
    # 重写start_requests方法
    #重写start_requests方法
    def start_requests(self):
        #定义url
        url = 'https://www.icourse163.org/web/j/' \
              'mocSearchBean.searchCourse.rpc?csrfKey=' \
              '6d38afce3bd84a39b368f9175f995f2b'
        for i in range(7):						#循环7次
            #定义data_str
            data_dict = {
                'keyword': 'python',
                'pageIndex': str(i+1),
                'highlight': 'true',
                'orderBy': 0,
                'stats': 30,
                'pageSize': 20
            }
            data_str = json.dumps(data_dict)
            data = {
                'mocCourseQueryVo': data_str
            }										#定义data
            #发送POST请求,指定回调方法为parse
            yield scrapy.FormRequest(
                method='POST',
                url=url,
                formdata=data,
                callback=self.parse,
                dont_filter=True
            )
    def parse(self, response):					#定义parse方法
        data = response.body.decode('utf-8')	#响应解码
        #获取课程列表
        course_list = json.loads(data)['result']['list']
        item = CourseItem()						#初始化对象item
        for course in course_list:				#遍历
            #获取mocCourseCardDto键值
            CourseCard=course['mocCourseCard']['mocCourseCardDto']
               #提取课程名称,并写入Item
            item['courseName'] = CourseCard['name']
            #提取开课学校,并写入Item
            item['university']=CourseCard['schoolPanel']['name']
            if CourseCard['mocTagDtos']:#如果mocTagDtos键在字典中
                #提取课程类型,并写入Item
                item['category']=CourseCard['mocTagDtos'][0]['name']
            else:						#如果mocTagDtos键不在字典中
                item['category'] = 'NULL'#课程类型赋值为NULL
            #提取参与人数,并写入Item
            item['enrollCount']=CourseCard['termPanel']['enrollCount']
            #提取学校缩写
            shortName = CourseCard['schoolPanel']['shortName']
            #提取课程ID
            course_id = course['courseId']
            #拼接URL
            url = 'https://www.icourse163.org/course/' + \
                  shortName + '-' + str(course_id)
            #指定回调方法为parse_section方法
            yield scrapy.Request(url,meta={'item':deepcopy(item)},
                                 callback=self.parse_section)
    def parse_section(self, response):	#定义parse_section方法
        item = response.meta['item']		#传递item
        #初始化item的“overview”为NULL
        item['overview'] = 'NULL'
        #初始化item的“objective”为NULL
        item['objective'] = 'NULL'
        #初始化item的“preliminaries”为NULL
        item['preliminaries'] = 'NULL'
        #获取节点列表
        course_section = response.xpath(
            '//div[@id="content-section"]')[0]
        for i in range(3, 10, 2):			#循环,间隔为2
            #定义节点路径,提取节点文本
            path_str = 'div[' + str(i) + ']/span[2]/text()'
            text = course_section.xpath(path_str).extract()
            #定义节点路径
            path = 'div[' + str(i + 1) + ']/div//p//text()'
            if '课程概述' in text:		#如果节点文本包含“课程概述”
                #提取课程概述列表
                overview = course_section.xpath(path).extract()
                overview = ''.join(overview)	#连接列表中元素
                item['overview'] = overview	#写入item
            elif '授课目标' in text:		#如果节点文本包含“授课目标”
                #提取授课目标列表
                objective = course_section.xpath(path).extract()
                objective = ''.join(objective)	#连接列表中元素
                item['objective'] = objective		#写入item
            elif '预备知识' in text:		#如果节点文本包含“预备知识”
                #提取预备知识列表
                preliminaries=course_section.xpath(path).extract()
                #连接列表中元素
                preliminaries = ''.join(preliminaries)
                #写入item
                item['preliminaries'] = preliminaries
        yield item						#返回item



school.py:

import scrapy
import re
from MOOCSpider.items import SchoolItem  # import the SchoolItem class from the items module
import json
class SchoolSpider(scrapy.Spider):
    name = 'school'
    allowed_domains = ['www.icourse163.org']
    start_urls = ['https://www.icourse163.org/university/PKU#/c']

    '''def parse(self, response):
        university_list = response.xpath('//div[@class="u-usitys f-cb"]/a')
        #for university in university_list:
        university = university_list[0]
        university_url = 'https://www.icourse163.org' + university.xpath('@href').extract_first()
        yield scrapy.Request(university_url, callback=self.parse_schoolID)'''

    def parse(self, response):
        text = re.search('window.schoolId = "(.*?)"', response.text, re.S)
        school_Id = text.group(1)
        url = 'https://www.icourse163.org/web/j/courseBean.getCourseListBySchoolId.rpc?csrfKey=6d38afce3bd84a39b368f9175f995f2b'
        for num in range(6):
            data = {
                'schoolId': school_Id,
                'p': str(num+1),
                'psize': '20',
                'type': '1',
                'courseStatus': '30'
            }
            yield scrapy.FormRequest(
                method='POST',
                url=url,
                formdata=data,
                callback=self.parse_course,
                dont_filter=True
            )

    def parse_course(self, response):
        data = response.body.decode('utf-8')
        course_list = json.loads(data)['result']['list']
        item = SchoolItem()
        for course in course_list:
            item['university'] = course['schoolName']
            item['courseName'] = course['name']
            item['enrollCount'] = course['enrollCount']
            item['teacher'] = course['teacherName']
            yield item
           #print(university, courseName, enrollCount, teacher)

5) Modify pipelines.py: define a TextPipeline class that keeps only courses with more than 10,000 participants, and a MongoPipeline class that stores the data in MongoDB.

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


# useful for handling different item types with a single interface
from itemadapter import ItemAdapter


class MoocspiderPipeline:
    def process_item(self, item, spider):
        return item

from scrapy.exceptions import DropItem	#导入DropItem模块
class TextPipeline():						#定义TextPipeline类
    def process_item(self, item, spider):#定义process_item方法
        #如果item中的“enrollCount”大于于10000
        if item['enrollCount'] > 10000:
            return item						#返回item
        else:#如果item中的“enrollCount”小于等于10000
            raise DropItem('Missing item')#删除item
import pymongo								#导入pymongo模块
class MongoPipeline():						#定义MongoPipeline类
    def __init__(self, mongo_uri, mongo_db):#定义__init__方法
         self.mongo_uri = mongo_uri		#初始化类中的mongo_uri
         self.mongo_db = mongo_db			#初始化类中的mongo_db
    @classmethod								#使用classmethod标识
    def from_crawler(cls, crawler):		#定义from_crawler方法
        #获取settings.py文件中数据库的URI和数据库名称
        return cls(
            mongo_uri = crawler.settings.get('MONGO_URI'),
            mongo_db = crawler.settings.get('MONGO_DB')
        )
    def open_spider(self, spider):		#定义open_spider方法
        #连接MongoDB数据库
        self.client = pymongo.MongoClient(self.mongo_uri)
        #创建数据库
        self.db = self.client[self.mongo_db]
    def close_spider(self, spider):		#定义close_spider方法
        self.client.close()					#关闭数据库连接
    def process_item(self, item, spider):#定义process_item方法
        data={
            '课程名称': item['courseName'],
            '开课学校': item['university'],
            '课程类型': item['category'],
            '参与人数': item['enrollCount'],
            '课程概述': item['overview'],
            '授课目标': item['objective'],
            '预备知识': item['preliminaries'],
        }										#初始化data
        table = self.db['course']			#新建集合
        table.insert_one(data)				#向数据库中插入数据
        return item							#返回item

6) Modify middlewares.py and define a class that sets a random User-Agent header on each request.

# Define here the models for your spider middleware
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html

from scrapy import signals

# useful for handling different item types with a single interface
from itemadapter import is_item, ItemAdapter


# (The generated MoocspiderSpiderMiddleware and MoocspiderDownloaderMiddleware template
# classes are left unchanged and omitted here.)

# RandomUserAgentMiddleware here is the same class (same user_agent_list and
# process_request) as the one defined in the BaiduSpider project above.

7) Modify settings.py: set ROBOTSTXT_OBEY, DOWNLOADER_MIDDLEWARES and ITEM_PIPELINES, and define the MongoDB connection address and database name.

# Scrapy settings for MOOCSpider project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'MOOCSpider'

SPIDER_MODULES = ['MOOCSpider.spiders']
NEWSPIDER_MODULE = 'MOOCSpider.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# (other default, commented-out settings omitted)

# Disable cookies (enabled by default)
COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
    'cookie': 'EDUWEBDEVICE=ad366d00d2a448df9e8b89e8ddb2abb8; hb_MA-A976-948FFA05E931_source=www.baidu.com; __yadk_uid=BJEprvWuVabEiTPe8yD4cxTxKeAOpusu; NTESSTUDYSI=6d38afce3bd84a39b368f9175f995f2b; Hm_lvt_77dc9a9d49448cf5e629e5bebaa5500b=1601255430,1601272424,1601272688,1601273453; WM_NI=edPVgwr6D7b1I0MgK58PF%2FAm%2FIyhZPldCt5b8sM%2FhscIGdXgkmsyDgzHAmRiUa7FH5TC8pZjD4KIBeRgKqNGbQSw0HaOZchEIuwNDn4YwcBaF2UrBM7WArc6W1IvlSUJZ2M%3D; WM_NIKE=9ca17ae2e6ffcda170e2e6ee89ee69838da495d33de98e8fb3c15b829b8f85f552a69684aaf95caef5fdadd22af0fea7c3b92aa7bca0b7e67db2ea8cd8e13b8bf08388ca3ffc908ad0c467ed97b789d95cb0bc8d95b86afcad83d0eb79a1978985db6da9b3bd9ac76dba988f8ed16397bff9a7cb3f989df891d96288ec85aac16f92b98592cd4da28f9d98b344a3919684eb4f8babb9afc766f887b984c16b86ee9b93c147f5898f93e23e95ef8797ef59979696d3d037e2a3; WM_TID=gSj%2BsvyvzttFRAEVVQI7MbZOMjPj6zKS; Hm_lpvt_77dc9a9d49448cf5e629e5bebaa5500b=1601276871',
}


# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    'MOOCSpider.middlewares.RandomUserAgentMiddleware': 350,
}


# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'MOOCSpider.pipelines.TextPipeline': 400,
    'MOOCSpider.pipelines.MongoPipeline': 500,
}
MONGO_URI = 'localhost'
MONGO_DB = 'MOOC'


8) Run the spider scripts and inspect the data:

scrapy crawl course

V. Summary

(1) The Scrapy framework consists of the Engine, Scheduler, Downloader, Spiders, Item Pipeline, Downloader Middleware and Spider Middleware.

(2) The usual workflow with Scrapy is: first create a new project; next modify the items script to define the structure of the Item data; then create the spider script to parse responses and extract data and new URLs; after that modify settings.py to configure the Scrapy components and define global variables; and finally run the crawler.

VI. Follow-up: a case of analyzing the scraped data

PaddlePaddle AI Studio - an AI learning and training community (baidu.com)

 


Copyright notice: this is an original article by lclchong, licensed under CC 4.0 BY-SA. Please include a link to the original source and this notice when reposting.