Web Crawler Notes (3): The Scrapy Framework

Table of Contents

I. Introduction to the Scrapy framework

II. Basic usage of the Scrapy framework

1) Installing the environment
2) Basic commands
3) Project layout
4) Creating a spider file
5) Scrapy architecture
6) The five core components
7) How Scrapy works
8) yield
9) scrapy shell
10) Using pymysql
11) CrawlSpider
12) Log messages and log levels
13) Garbled (non-UTF-8) output
14) Persistent storage in Scrapy
- Via command-line feed export
- Via pipelines
15) Scraping images with ImagesPipeline
Case study: image scraping with Scrapy
16) Middleware (handling dynamically loaded data)
Case study: scraping NetEase News
17) POST requests in Scrapy
18) Proxies

III. Applying the Scrapy framework: the BaiduSpider project, step by step

1) Create the project
2) Modify the items script
3) Create the spider script
4) Modify the settings script
5) Run the crawler
6) Modify the pipelines script
7) Customize the middleware

IV. Hands-on project: scraping course information from the China University MOOC site

V. Summary

VI. Follow-up: a case of analyzing the scraped data


I. Introduction to the Scrapy framework

- What is a framework?
    - A project template that bundles many ready-made features and is highly reusable.

- How should you learn a framework?
    - Focus on learning how to use the specific features the framework encapsulates.

- What is Scrapy?
    - A popular, full-featured crawling framework. It provides high-performance persistent storage, asynchronous downloading, fast data parsing, and support for distributed crawling.

Scrapy is an application framework written for crawling websites and extracting structured data. It can be used in a wide range of programs, from data mining and information processing to archiving historical data.

II. Basic usage of the Scrapy framework

1) Installing the environment

    - macOS or Linux: pip install scrapy
    - Windows:
        - pip install wheel
        - Download a Twisted wheel from http://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted
        - Install Twisted: pip install Twisted-17.1.0-cp36-cp36m-win_amd64.whl
        - pip install pywin32
        - pip install scrapy
    - Test: run the scrapy command in a terminal; if no error is reported, the installation succeeded.

2) Basic commands

- Create a project: scrapy startproject xxx
- cd xxx
- Create a spider file under the spiders subdirectory:
    - scrapy genspider spiderName www.xxx.com
- Run the project:
    - scrapy crawl spiderName

Note: run these commands from inside the project directory (scrapy crawl works anywhere within the project, including the spiders folder).

3) Project layout

    spiders/
        __init__.py
        your_spider.py      ---> created by you; implements the core crawling logic
    __init__.py
    items.py                ---> where the data structure is defined: a class inheriting from scrapy.Item
    middlewares.py          ---> middleware (e.g. proxies)
    pipelines.py            ---> the pipeline file; its class handles post-processing of the downloaded data
                                 (default priority 300; valid range 1-1000, and the smaller the number, the higher the priority)
    settings.py             ---> configuration, e.g. whether to obey robots.txt, the User-Agent, etc.

4) Creating a spider file

    (1) Change into the spiders folder: cd project_name/project_name/spiders
    (2) scrapy genspider spider_name domain

    Basic structure of a spider file (see the skeleton below):

        The spider class inherits from scrapy.Spider
        name = 'baidu'           ---> the name used when running the spider
        allowed_domains          ---> the domains the spider may crawl; URLs outside these domains are filtered out
        start_urls               ---> the spider's start URL(s); a list, usually with a single URL
        parse(self, response)    ---> the callback that parses the data
            response.text        ---> the response as a string
            response.body        ---> the response as bytes
            response.xpath()     ---> returns a list of Selector objects
            extract()            ---> extracts the data strings held by the Selector objects
            extract_first()      ---> extracts the first element of the Selector list
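The points above can be seen in a minimal spider skeleton. This is only a sketch: the domain, URL and the extracted field are placeholders, not part of the original project.

import scrapy

class BaiduSpider(scrapy.Spider):
    name = 'baidu'                          # used with: scrapy crawl baidu
    allowed_domains = ['www.baidu.com']     # requests to other domains are filtered out
    start_urls = ['http://www.baidu.com/']  # initial request(s)

    def parse(self, response):
        # response.text -> str, response.body -> bytes
        # xpath() returns a list of Selector objects; extract_first() pulls out the first data string
        title = response.xpath('//title/text()').extract_first()
        yield {'title': title}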

5) Scrapy architecture

    (1) Engine         ---> runs automatically; you rarely touch it. It organizes all request objects and dispatches them to the downloader.
    (2) Downloader     ---> receives request objects from the engine and fetches the data.
    (3) Spiders        ---> a Spider class defines how a site (or group of sites) is crawled: the crawling actions (e.g. whether to follow links) and how structured data (items) is extracted from page content. In other words, the Spider is where you define what to crawl and how to parse each page.
    (4) Scheduler      ---> has its own scheduling rules; you rarely need to touch it.
    (5) Item Pipeline  ---> the final stage that processes the data, with hooks left for your own code. Once an item is collected in a Spider, it is passed to the Item Pipeline, where components process it in a defined order. Each item pipeline component is a plain Python class implementing a few simple methods; it receives an item, performs some action on it, and decides whether the item continues through the pipeline or is dropped.

    Typical uses of an item pipeline:

    1. Cleaning HTML data
    2. Validating scraped data (checking that items contain certain fields)
    3. Checking for (and dropping) duplicates
    4. Storing the scraped results in a database

6) The five core components

    Engine
        Handles the data flow of the whole system and triggers events (the core of the framework).
    Scheduler
        Accepts requests from the engine, pushes them onto a queue, and hands them back when the engine asks again. Think of it as a priority queue of URLs: it decides which URL to crawl next and removes duplicate URLs.
    Downloader
        Downloads page content and returns it to the spiders (the downloader is built on Twisted, an efficient asynchronous networking framework).
    Spiders
        Do the main work: extract the information you need (items) from specific pages. They can also extract links for Scrapy to continue crawling.
    Item Pipeline
        Processes the items extracted by the spiders; its main jobs are persisting items, validating them, and removing unwanted data. After a page is parsed by a spider, its items are sent through the pipeline components in a fixed order.

7) How Scrapy works

The engine takes the start URLs from the spider and hands them to the scheduler; scheduled requests go through the engine to the downloader, and the downloaded responses come back through the engine to the spider's callbacks, which yield new requests (sent back to the scheduler) and items (sent to the item pipeline). (Architecture diagram omitted.)

8) yield

1. A function containing yield is no longer an ordinary function but a generator, which can be iterated over.

2. yield works like return: on each iteration, execution stops at the yield and the value to its right is returned. The key point is that the next iteration resumes from the line right after the yield it stopped at.

3. In short: yield returns a value like return does, but it also remembers where it stopped, and the next iteration continues from the following line.
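A minimal illustration in plain Python (not Scrapy-specific):

def gen():
    print('step 1')
    yield 1          # the first next() returns 1 and pauses here
    print('step 2')
    yield 2          # the second next() resumes after the first yield and returns 2

g = gen()            # calling the function builds a generator; nothing runs yet
print(next(g))       # prints "step 1" and then 1
print(next(g))       # prints "step 2" and then 2

In the same way, a Scrapy parse() method yields items and Requests one at a time instead of building a whole list in memory.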

9) scrapy shell

Running scrapy shell <url> downloads the page and drops you into an interactive Python shell with the response object preloaded, which is handy for testing XPath/CSS selectors before writing them into a spider.
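A typical session looks roughly like this (the URL is only an example):

# In a terminal:  scrapy shell "http://www.baidu.com"
# Scrapy fetches the page and opens a Python prompt with `response` already defined:
response.status                                   # e.g. 200
response.xpath('//title/text()').extract_first()  # try selectors interactively
view(response)                                    # open the downloaded page in a browser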

10) Using pymysql

1. pip install pymysql
2. pymysql.connect(host, port, user, password, db, charset)
3. conn.cursor()
4. cursor.execute() (a full round trip is sketched below)
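Putting the four steps together, a sketch of storing a scraped record in MySQL. The connection parameters, database and table are made up for illustration.

import pymysql

# step 2: connect (parameters below are assumptions, adjust to your own database)
conn = pymysql.connect(host='localhost', port=3306, user='root',
                       password='123456', db='spider', charset='utf8mb4')
cursor = conn.cursor()          # step 3: get a cursor
try:
    # step 4: execute a parameterized SQL statement
    cursor.execute('INSERT INTO news(title, link) VALUES (%s, %s)',
                   ('some title', 'http://example.com'))
    conn.commit()               # write queries must be committed
except Exception:
    conn.rollback()             # undo the transaction on failure
finally:
    cursor.close()
    conn.close()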

11) CrawlSpider

- CrawlSpider: a class, a subclass of Spider
    - Ways to crawl a whole site:
        - With Spider: issue follow-up requests manually
        - With CrawlSpider
    - Using CrawlSpider:
        - Create a project
        - cd XXX
        - Create the spider file from the crawl template:
            - scrapy genspider -t crawl xxx www.xxxx.com
            - Link extractor:
                - Extracts the links that match the given rule (the allow pattern)
            - Rule parser:
                - Sends the responses of the extracted links to the specified callback for parsing

How it works: each Rule uses its LinkExtractor to pull matching links out of every response, requests them, and passes the responses to the Rule's callback (optionally following links from those pages as well), as in the sketch below.
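A sketch of what the crawl template produces once the LinkExtractor/Rule pieces are filled in; the domain, allow pattern and extracted field are placeholders, not from the original projects.

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class PageSpider(CrawlSpider):
    name = 'page'
    allowed_domains = ['www.xxx.com']
    start_urls = ['http://www.xxx.com/']

    rules = (
        # LinkExtractor: pull every link whose URL matches `allow` out of each response
        # Rule: send each extracted link's response to `callback`; follow=True keeps
        # extracting links from those pages as well
        Rule(LinkExtractor(allow=r'list_\d+\.html'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        item = {}
        item['title'] = response.xpath('//title/text()').extract_first()
        yield item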

12) Log messages and log levels

(1) Log levels:

    CRITICAL: severe errors
    ERROR:    ordinary errors
    WARNING:  warnings
    INFO:     general information
    DEBUG:    debugging information

    The default log level is DEBUG, so every message at DEBUG level or above is printed.

(2) Settings in settings.py:

    LOG_FILE:  write everything that would appear on screen to a file instead (the file name should end in .log)
    LOG_LEVEL: set the minimum level to display, i.e. which messages are shown and which are hidden

For example, to show only error messages, add this to settings.py:

# show only the specified type of log messages
LOG_LEVEL = 'ERROR'
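Similarly, to redirect the log to a file instead of the screen (the file name here is arbitrary):

# settings.py: write log output to a .log file instead of the console
LOG_FILE = 'spider.log'
LOG_LEVEL = 'WARNING'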

13) Garbled (non-UTF-8) output

To keep exported Chinese text from being garbled, add this to settings.py:

FEED_EXPORT_ENCODING = 'utf-8'

14) Persistent storage in Scrapy

- Via command-line feed export:

    - Restriction: only the return value of the parse method can be stored, and only to a local file
    - Note: the supported file types are 'json', 'jsonlines', 'jl', 'csv', 'xml', 'marshal', 'pickle'
    - Command: scrapy crawl xxx -o filePath
    - Pros: simple, efficient and convenient
    - Cons: quite limited (data can only go to files with the supported extensions)

- Via pipelines (see the sketch after this list):

    - Workflow:
        - Parse the data
        - Define the corresponding fields in the item class (items.py)
        - Pack the parsed data into an item object
        - Submit the item object to the pipeline for persistence (yield it from the spider)
        - In the pipeline class's process_item, persist the data carried by the received item
        - Enable the pipeline in settings.py
    - Pros:
        - Very general: the data can be stored anywhere you like.
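A minimal sketch of the pipeline flow just described; the project name, file name and item field are assumptions for illustration.

# pipelines.py -- minimal file-writing pipeline
class SavePipeline:
    def open_spider(self, spider):          # called once when the spider starts
        self.fp = open('./data.txt', 'w', encoding='utf-8')

    def process_item(self, item, spider):   # called for every item the spider yields
        self.fp.write(item['title'] + '\n')
        return item                         # pass the item on to the next pipeline, if any

    def close_spider(self, spider):         # called once when the spider closes
        self.fp.close()

# settings.py -- enable it (smaller number = higher priority, range 1-1000)
# ITEM_PIPELINES = {'myproject.pipelines.SavePipeline': 300}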

15) Scraping images with ImagesPipeline

- How does scraping string data differ from scraping image data in Scrapy?
    - Strings: just parse them with XPath and submit them to a pipeline for storage.
    - Images: XPath only gives you the src attribute; you then have to request each image URL separately to get the binary data.

- ImagesPipeline:
    - You only need to parse the img src values and submit them to the pipeline; the pipeline requests each src, downloads the binary image data, and persists it for you.
- Goal: scrape the high-resolution images from the sc.chinaz.com image section.
- Workflow:
    - Parse the data (the image URLs)
    - Submit the item holding the image URL to the designated pipeline class
    - In the pipelines file, define a custom pipeline class based on ImagesPipeline and override:
        - get_media_requests
        - file_path
        - item_completed
    - In settings.py:
        - Set the image storage directory: IMAGES_STORE = './imgs_bobo'
        - Enable the custom pipeline class

Case study: image scraping with Scrapy

img.py:

# -*- coding: utf-8 -*-
import scrapy
from imgsPro.items import ImgsproItem

class ImgSpider(scrapy.Spider):
    name = 'img'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['http://sc.chinaz.com/tupian/']

    def parse(self, response):
        div_list = response.xpath('//div[@id="container"]/div')
        for div in div_list:
            # note: the page lazy-loads images, so the real URL is in the pseudo attribute src2
            src = div.xpath('./div/a/img/@src2').extract_first()

            item = ImgsproItem()
            item['src'] = src

            yield item

items.py:

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class ImgsproItem(scrapy.Item):
    # define the fields for your item here like:
    src = scrapy.Field()
    # pass

pipelines.py:

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html


# class ImgsproPipeline(object):
#     def process_item(self, item, spider):
#         return item
from scrapy.pipelines.images import ImagesPipeline
import scrapy
class imgsPileLine(ImagesPipeline):

    # issue a request for each image URL carried by the item
    def get_media_requests(self, item, info):

        yield scrapy.Request(item['src'])

    # decide the file name/path the image is stored under
    def file_path(self, request, response=None, info=None):
        imgName = request.url.split('/')[-1]
        return imgName

    def item_completed(self, results, item, info):
        return item  # hand the item on to the next pipeline class, if any
middlewares.py is left as the generated default template in this project (no middleware is enabled in settings.py), so it is omitted here.

settings.py:

# -*- coding: utf-8 -*-

# Scrapy settings for imgsPro project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://doc.scrapy.org/en/latest/topics/settings.html
#     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://doc.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'imgsPro'

SPIDER_MODULES = ['imgsPro.spiders']
NEWSPIDER_MODULE = 'imgsPro.spiders'

LOG_LEVEL = 'ERROR'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# (other default, commented-out settings omitted)

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
   'imgsPro.pipelines.imgsPileLine': 300,
}


# directory where downloaded images are stored
IMAGES_STORE = './imgs_bobo'

16) Middleware (handling dynamically loaded data)

- Middleware
    - Downloader middleware
        - Position: between the engine and the downloader
        - Role: intercept, in one place, every request and response in the project
        - Intercepting requests (see the sketch below):
            - UA spoofing: process_request
            - Proxy IPs: process_exception, returning the request
        - Intercepting responses:
            - Tamper with the response data / response object
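Before the full case study below, a sketch of the two request-interception points mentioned above. The class name is made up, the UA strings are examples, and the proxy address simply reuses the one shown later in section 18.

import random

class InterceptDownloaderMiddleware:
    user_agents = [
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0 Safari/537.36',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0 Safari/537.36',
    ]
    proxies = ['https://113.68.202.10:9999']

    def process_request(self, request, spider):
        # UA spoofing: give every outgoing request a random User-Agent
        request.headers['User-Agent'] = random.choice(self.user_agents)
        return None

    def process_exception(self, request, exception, spider):
        # if a request fails (e.g. the IP is blocked), attach a proxy and re-schedule it
        request.meta['proxy'] = random.choice(self.proxies)
        return request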

Case study: scraping NetEase News

items.py:

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class WangyiproItem(scrapy.Item):
    # define the fields for your item here like:
    title = scrapy.Field()
    content = scrapy.Field()

wangyi.py:

# -*- coding: utf-8 -*-
import scrapy
from selenium import webdriver
from wangyiPro.items import WangyiproItem
class WangyiSpider(scrapy.Spider):
    name = 'wangyi'
    # allowed_domains = ['www.cccom']
    start_urls = ['https://news.163.com/']
    models_urls = []  # stores the URLs of the five news-section pages
    # (parse() below extracts those five section URLs)

    # create a browser instance (used when the middleware re-renders the section pages)
    def __init__(self):
        self.bro = webdriver.Chrome(executable_path='/Users/bobo/Desktop/小猿圈爬虫课程/chromedriver')

    def parse(self, response):
        li_list = response.xpath('//*[@id="index2016_wrap"]/div[1]/div[2]/div[2]/div[2]/div[2]/div/ul/li')
        alist = [3,4,6,7,8]
        for index in alist:
            model_url = li_list[index].xpath('./a/@href').extract_first()
            self.models_urls.append(model_url)

        # request each of the five section pages in turn
        for url in self.models_urls:  # send a request for every section URL
            yield scrapy.Request(url,callback=self.parse_model)

    # the news titles in each section are loaded dynamically
    def parse_model(self,response):  # parse each section page for the news titles and detail-page URLs
        # response.xpath()
        div_list = response.xpath('/html/body/div/div[3]/div[4]/div[1]/div/div/ul/li/div/div')
        for div in div_list:
            title = div.xpath('./div/div[1]/h3/a/text()').extract_first()
            new_detail_url = div.xpath('./div/div[1]/h3/a/@href').extract_first()


            item = WangyiproItem()
            item['title'] = title

            # request the news detail page
            yield scrapy.Request(url=new_detail_url,callback=self.parse_detail,meta={'item':item})
    def parse_detail(self,response):  # parse the news body text
        content = response.xpath('//*[@id="endText"]//text()').extract()
        content = ''.join(content)
        item = response.meta['item']
        item['content'] = content

        yield item


    def closed(self,spider):
        self.bro.quit()


settings.py:

# -*- coding: utf-8 -*-

# Scrapy settings for wangyiPro project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://doc.scrapy.org/en/latest/topics/settings.html
#     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://doc.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'wangyiPro'

SPIDER_MODULES = ['wangyiPro.spiders']
NEWSPIDER_MODULE = 'wangyiPro.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'wangyiPro (+http://www.yourdomain.com)'
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# (other default, commented-out settings omitted)
# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
   'wangyiPro.middlewares.WangyiproDownloaderMiddleware': 543,
}

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
   'wangyiPro.pipelines.WangyiproPipeline': 300,
}
LOG_LEVEL = 'ERROR'

pipelines.py:

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html


class WangyiproPipeline(object):
    def process_item(self, item, spider):
        print(item)
        return item

middlewares.py:

# -*- coding: utf-8 -*-

# Define here the models for your spider middleware
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/spider-middleware.html

from scrapy import signals


from scrapy.http import HtmlResponse
from time import sleep
class WangyiproDownloaderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.



    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.

        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        return None
    # this method intercepts the responses of the five section pages and replaces them
    def process_response(self, request, response, spider):  # spider: the spider instance
        bro = spider.bro  # reuse the browser object defined on the spider class

        # pick out the responses that need to be tampered with:
        # the request identifies the response, and the request URL tells us whether it is a section page
        if request.url in spider.models_urls:
            bro.get(request.url)  # load the section page in the browser
            sleep(3)
            page_text = bro.page_source  # now contains the dynamically loaded news data

            # build a new response object that contains the dynamically loaded news data
            # (fetched conveniently via selenium) and return it in place of the original response
            new_response = HtmlResponse(url=request.url,body=page_text,encoding='utf-8',request=request)

            return new_response
        else:
            # responses of all other requests are returned unchanged
            return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.

        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

17) POST requests in Scrapy

(1) Override the start_requests method:

        def start_requests(self)

(2) In start_requests, yield/return:

        scrapy.FormRequest(url=url, headers=headers, callback=self.parse_item, formdata=data)

            url:      the address to POST to
            headers:  custom request headers
            callback: the callback method
            formdata: the data carried by the POST request, as a dict
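Putting it together, a sketch of a start_requests() that sends a POST request with form data; the spider name, URL and form fields are placeholders, not from the original projects.

import scrapy

class PostSpider(scrapy.Spider):
    name = 'post_demo'

    def start_requests(self):
        url = 'https://fanyi.baidu.com/sug'      # example endpoint, replace with your own
        data = {'kw': 'dog'}                     # the form data: a dict of strings
        yield scrapy.FormRequest(url=url,
                                 formdata=data,
                                 callback=self.parse_item)

    def parse_item(self, response):
        print(response.text)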

18) Proxies

(1) In settings.py, enable the downloader middleware:

        DOWNLOADER_MIDDLEWARES = {
           'postproject.middlewares.Proxy': 543,
        }

(2) In middlewares.py, set the proxy on each request:

        def process_request(self, request, spider):
            request.meta['proxy'] = 'https://113.68.202.10:9999'
            return None

III. Applying the Scrapy framework: the BaiduSpider project, step by step

1) Create the project

Syntax: scrapy startproject project_name [path_to_store_the_project]

If no path is given, the project is generated in the directory where the command is run. For example, running the command in PyCharm's terminal generates the project skeleton under the current directory (screenshots omitted).

2) Modify the items script

Scrapy provides the Item class for turning scraped data into structured data. You do this by defining an Item subclass (inheriting from scrapy.Item) and declaring each piece of data in it as a scrapy.Field.

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class BaiduspiderItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass

class NewsItem(scrapy.Item):
    index = scrapy.Field()    # rank
    title = scrapy.Field()    # title
    link = scrapy.Field()     # link
    newsNum = scrapy.Field()  # number of search results

3) Create the spider script

Syntax: scrapy genspider [template] <name> <domain>

template: the template to create from; if omitted, the default template is used
name: the name of the spider script; after creation, a .py file with that name appears under the spiders directory

In the BaiduSpider project, create a spider that scrapes the hot-search news on the Baidu home page; name it news and give it the domain www.baidu.com.

Notes on parsing:
- response.xpath() returns a list, and every element of the list is a Selector object
- extract() pulls the string stored in a Selector object's data attribute out of it

(Table omitted: commonly used parameters of scrapy.Request and their descriptions.)

In the BaiduSpider project, the news spider works as follows:

1) In the parse() method, extract the rank, title and link of each hot-search news entry on the Baidu home page and write them into a NewsItem object; then call scrapy.Request() to request the news link, use the callback parameter to set the callback to parse_newsnum, and use the meta parameter to pass the NewsItem object between the two parsing methods.

2) Define parse_newsnum() to parse the response of that new request, extract the number of search results for each news entry, and write it into the NewsItem object.

news.py:

import scrapy                                   # import the scrapy module
# import the NewsItem class from the items module
from BaiduSpider.items import NewsItem
from copy import deepcopy                       # import deepcopy
class NewsSpider(scrapy.Spider):                # define the NewsSpider class
    name = 'news'                               # initialize name
    # initialize allowed_domains
    allowed_domains = ['www.baidu.com']
    start_urls = ['http://www.baidu.com/']      # initialize start_urls
    def parse(self, response):                  # define the parse method
        # find all li nodes under the hot-search ul node
        news_list = response.xpath(
            '//ul[@class="s-hotsearch-content"]/li'
        )
        for news in news_list:                  # iterate over the li nodes
            item = NewsItem()                   # create an item object
            # find the rank node, get its text, and assign it to item['index']
            item['index'] = news.xpath(
                'a/span[1]/text()').extract_first()
            # find the title node, get its text, and assign it to item['title']
            item['title'] = news.xpath(
                'a/span[2]/text()').extract_first()
            # find the link node, get its value, and assign it to item['link']
            item['link'] = news.xpath('a/@href').extract_first()
            # send the request and set the callback to parse_newsnum
            yield scrapy.Request(
                item['link'],
                callback=self.parse_newsnum,
                meta={'item': deepcopy(item)}
                # when meta passes the item between two parsing methods, use a deep copy
                # (and import deepcopy) so the item data does not get mixed up
            )

    def parse_newsnum(self, response):  # define the parse_newsnum method
        item = response.meta['item']  # receive the item
        # find the node holding the number of search results, get its text, and assign it
        item['newsNum'] = response.xpath(
            '//span[@class="nums_text"]/text()').extract_first()
        yield item  # return the item


Tip:

When scrapy.Request() uses the meta parameter to pass an Item between two parsing methods, pass a deep copy, e.g. meta={'item': deepcopy(item)} (importing deepcopy), so the Item data does not get corrupted.

Note:

If the import of NewsItem from the items module cannot be resolved, set the source path: right-click the project name and choose Mark Directory as -> Sources Root.

Going further:

Scrapy also provides FormRequest() for sending requests that submit form data, such as POST requests. Its common parameters are url, callback, method, formdata, meta and dont_filter, where formdata is a dict holding the form data and dont_filter is a bool: if you need to submit the same URL several times, set it to True so the requests are not dropped as duplicates.

4) Modify the settings script

The settings script provides a global namespace of key-value mappings whose values can be read from code; the default settings file contains about 25 entries.

Two settings are worth adding.

One shows only error-level log messages:

# show only the specified type of log messages
LOG_LEVEL = 'ERROR'

The other keeps exported Chinese text from being garbled:

FEED_EXPORT_ENCODING = 'utf-8'

settings.py:

# Scrapy settings for BaiduSpider project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'BaiduSpider'

SPIDER_MODULES = ['BaiduSpider.spiders']
NEWSPIDER_MODULE = 'BaiduSpider.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36'    # set the USER_AGENT

# Obey robots.txt rules
ROBOTSTXT_OBEY = False    # do not obey the robots protocol

# (other default, commented-out settings omitted)

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    # enable RandomUserAgentMiddleware and set its order
    'BaiduSpider.middlewares.RandomUserAgentMiddleware': 350,
}


# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    # enable TextPipeline and set its order
    'BaiduSpider.pipelines.TextPipeline': 300,
    # enable MongoPipeline and set its order
    'BaiduSpider.pipelines.MongoPipeline': 400,
}
MONGO_URI = 'localhost'    # database connection address
MONGO_DB = 'baidu'         # database name



FEED_EXPORT_ENCODING = 'utf-8'

5) Run the crawler

With the steps above, the BaiduSpider project is basically complete and can be run.

Syntax: scrapy crawl spider_name

Note: run it inside the BaiduSpider project directory.

You can also save the scraped content to a file from the command line; for example, scrapy crawl news -o news.json saves the Item contents to a JSON file.

Tip:

When exporting JSON with -o, Unicode escapes are used by default, which makes Chinese content hard to read. Change the default encoding in settings.py by adding FEED_EXPORT_ENCODING = 'utf-8'.

6) Modify the pipelines script

For more complex processing, such as filtering out useful data or saving data to a database, define Item Pipelines in the pipelines script.

Defining an Item Pipeline only requires defining a class that implements process_item(). process_item has two parameters: item, which receives every Item the Spider generates, and spider, the Spider instance (if you create several Spiders you can tell them apart via spider.name). The method must either return a dict or Item object containing the data, or raise a DropItem exception to discard the Item.

In this project, define a TextPipeline class that drops the Item whose index is 3, and a MongoPipeline class that stores the processed Items in a MongoDB database.

pipelines.py:

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


# useful for handling different item types with a single interface
from itemadapter import ItemAdapter


class BaiduspiderPipeline:
    def process_item(self, item, spider):
        return item

#导入DropItem模块
from scrapy.exceptions import DropItem
class TextPipeline:						#定义TextPipeline类
#定义process_item方法
    def process_item(self, item, spider):
        if item['index']=='3':			#如果item中“index”为“3”
            raise DropItem()				#删除item
        else:								#如果item中“index”不为“3”
            return item					#返回item

import csv
class CsvPipeline:
        def __init__(self):
            # csv文件的位置,无需事先创建
            store_file = 'news.csv'
            # 打开(创建)文件
            self.file = open(store_file, 'w', newline='')
            # csv写法
            self.writer = csv.writer(self.file)#, dialect="excel"
        def process_item(self, item, spider):
            # 判断字段值不为空再写入文件
            if item['title']:
                #写入csv文件
                self.writer.writerow([item['index'], item['title'], item['link']])
            return item

        def close_spider(self, spider):
            # 关闭爬虫时顺便将文件保存退出
            self.file.close()


import pymongo							#导入pymongo模块
class MongoPipeline:					#定义MongoPipeline类
    #定义__init__方法
    def __init__(self, mongo_uri, mongo_db):
         self.mongo_uri = mongo_uri	#初始化类中的mongo_uri
         self.mongo_db = mongo_db		#初始化类中的mongo_db
    @classmethod							#使用classmethod标识
    def from_crawler(cls, crawler):	#定义from_crawler方法
        #获取settings.py文件中的数据库的连接地址和数据库名
        return cls(
            mongo_uri = crawler.settings.get('MONGO_URI'),
            mongo_db = crawler.settings.get('MONGO_DB')
        )
    def open_spider(self, spider):	#定义open_spider方法
        #连接MongoDB数据库
        self.client = pymongo.MongoClient(self.mongo_uri)
        #创建数据库
        self.db = self.client[self.mongo_db]
    def close_spider(self, spider):	#定义close_spider方法
        self.client.close()				#关闭数据库连接
    #定义process_item方法
    def process_item(self, item, spider):
        data = {
            'index': item['index'],
            'title': item['title'],
            'link': item['link'],
            'newsNum': item['newsNum'],
        }									#初始化data
        table = self.db['news']		#新建集合
        table.insert_one(data)			#向数据库中插入数据
        return item						#返回item

After defining the TextPipeline and MongoPipeline classes, you still need to enable the two pipelines in ITEM_PIPELINES in settings.py, set their order, and define the database connection address and database name (see the modified settings.py above).

The value for each key in ITEM_PIPELINES is a number indicating the calling priority: the smaller the number, the higher the priority.

In the BaiduSpider directory, run scrapy crawl news and the content is saved into the database.

 

7) Customize the middleware

In this project, we customize a Downloader Middleware that sets a random User-Agent header on each request, by defining a RandomUserAgentMiddleware class in middlewares.py:

middlewares.py:

# Define here the models for your spider middleware
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html

from scrapy import signals

# useful for handling different item types with a single interface
from itemadapter import is_item, ItemAdapter



class BaiduspiderSpiderMiddleware:
    @classmethod
    def from_crawler(cls, crawler):
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        return None

    def process_spider_output(self, response, result, spider):
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        pass

    def process_start_requests(self, start_requests, spider):
        for r in start_requests:
            yield r

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)

class BaiduspiderDownloaderMiddleware:
    @classmethod
    def from_crawler(cls, crawler):
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        return None

    def process_response(self, request, response, spider):
        return response

    def process_exception(self, request, exception, spider):
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)

import random                               # import the random module
# define the RandomUserAgentMiddleware class
class RandomUserAgentMiddleware:
    def __init__(self):                     # define __init__
        self.user_agent_list = [
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.18363',
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36',
            'Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US)',
            'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)',
            'Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)',
            'Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)',
            'Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6',
            'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2pre) Gecko/20070215 K-Ninja/2.1.1',
            'Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.9) Gecko/20080705 Firefox/3.0 Kapiko/3.0',
            'Mozilla/5.0 (X11; Linux i686; U;) Gecko/20070322 Kazehakase/0.4.5',
            'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.8) Gecko Fedora/1.9.0.8-1.fc10 Kazehakase/0.5.6',
            'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11',
            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.20 (KHTML, like Gecko) Chrome/19.0.1036.7 Safari/535.20'
        ]                                   # the user_agent_list
    # define process_request
    def process_request(self, request, spider):
        # pick an entry from user_agent_list at random
        useragent = random.choice(self.user_agent_list)
        # set the User-Agent header of the request
        request.headers.setdefault('User-Agent', useragent)
        return None                         # return None


class RandomProxyMiddleware():
    def __init__(self):
        self.proxy_list = [
            'http://121.232.148.167:9000',
            'http://39.105.28.28:8118',
            'http://113.195.18.133:9999'
        ]

    def process_request(self, request, spider):
        proxy = random.choice(self.proxy_list)
        request.meta.setdefault('proxy', proxy)
        print(request.meta['proxy'])


After the customization, enable the Downloader Middleware in DOWNLOADER_MIDDLEWARES in settings.py, set its order, and remove (or comment out) the fixed USER_AGENT setting.

The Spider Middleware that Scrapy provides and enables by default is enough for most needs and normally does not need to be modified by hand.

IV. Hands-on project: scraping course information from the China University MOOC site

Goal: use the Scrapy framework to scrape the course information returned by a search (e.g. for "python") on the China University MOOC site, including course name, offering university, course category, enrollment count, course overview, teaching objectives and prerequisites, and store the courses with more than 10,000 participants in MongoDB.

1) scrapy startproject MOOCSpider        # create the project

2) items.py:

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class MoocspiderItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass

class CourseItem(scrapy.Item):
    courseName = scrapy.Field()             # course name
    university = scrapy.Field()             # offering university
    category = scrapy.Field()               # course category
    enrollCount = scrapy.Field()            # enrollment count
    overview = scrapy.Field()               # course overview
    objective = scrapy.Field()              # teaching objectives
    preliminaries = scrapy.Field()          # prerequisites


class SchoolItem(scrapy.Item):
    university = scrapy.Field()             # university
    courseName = scrapy.Field()             # course name
    enrollCount = scrapy.Field()            # enrollment count
    teacher = scrapy.Field()                # teacher

3) Create the spider scripts

cd MOOCSpider

scrapy genspider course www.icourse163.org

scrapy genspider school www.icourse163.org

4) Modify course.py. Since the search is triggered by a POST request, delete the default start_urls attribute and override start_requests(), using scrapy.FormRequest() to send the POST request with the form data and setting the callback to parse.

In parse(), extract the course name, offering university, course category and enrollment count, and also get the university abbreviation and course ID to build a new URL (e.g. https://www.icourse163.org/course/HENANNU-1003544138, where HENANNU is the university abbreviation and 1003544138 is the course ID). Request that URL with scrapy.Request(), setting the callback to parse_section(), which extracts the course overview, teaching objectives and prerequisites. Because the data comes from different pages, pass the item with the meta parameter of scrapy.Request() and use a deep copy.

course.py:

import scrapy									#导入scrapy模块
#导入items模块中的CourseItem类
from MOOCSpider.items import CourseItem
import json										#导入json模块
from copy import deepcopy						#导入deepcopy模块
#定义CourseSpider类
class CourseSpider(scrapy.Spider):
    name = 'course'								#初始化name
    #初始化allowed_domains
    allowed_domains = ['www.icourse163.org']

    # 由于启动程序时发送的是post请求,所以删除默认的start_urls属性,重写start_requests()方法,其中
    # 使用scrapy.FormRequest()方法发送POSt请求,表单提交数据,指定回调方法为parse
    # start_urls = ['http://www.icourse163.org/']
    # 重写start_requests方法
    #重写start_requests方法
    def start_requests(self):
        #定义url
        url = 'https://www.icourse163.org/web/j/' \
              'mocSearchBean.searchCourse.rpc?csrfKey=' \
              '6d38afce3bd84a39b368f9175f995f2b'
        for i in range(7):						#循环7次
            #定义data_str
            data_dict = {
                'keyword': 'python',
                'pageIndex': str(i+1),
                'highlight': 'true',
                'orderBy': 0,
                'stats': 30,
                'pageSize': 20
            }
            data_str = json.dumps(data_dict)
            data = {
                'mocCourseQueryVo': data_str
            }										#定义data
            #发送POST请求,指定回调方法为parse
            yield scrapy.FormRequest(
                method='POST',
                url=url,
                formdata=data,
                callback=self.parse,
                dont_filter=True
            )
    def parse(self, response):					#定义parse方法
        data = response.body.decode('utf-8')	#响应解码
        #获取课程列表
        course_list = json.loads(data)['result']['list']
        item = CourseItem()						#初始化对象item
        for course in course_list:				#遍历
            #获取mocCourseCardDto键值
            CourseCard=course['mocCourseCard']['mocCourseCardDto']
               #提取课程名称,并写入Item
            item['courseName'] = CourseCard['name']
            #提取开课学校,并写入Item
            item['university']=CourseCard['schoolPanel']['name']
            if CourseCard['mocTagDtos']:#如果mocTagDtos键在字典中
                #提取课程类型,并写入Item
                item['category']=CourseCard['mocTagDtos'][0]['name']
            else:						#如果mocTagDtos键不在字典中
                item['category'] = 'NULL'#课程类型赋值为NULL
            #提取参与人数,并写入Item
            item['enrollCount']=CourseCard['termPanel']['enrollCount']
            #提取学校缩写
            shortName = CourseCard['schoolPanel']['shortName']
            #提取课程ID
            course_id = course['courseId']
            #拼接URL
            url = 'https://www.icourse163.org/course/' + \
                  shortName + '-' + str(course_id)
            #指定回调方法为parse_section方法
            yield scrapy.Request(url,meta={'item':deepcopy(item)},
                                 callback=self.parse_section)
    def parse_section(self, response):	#定义parse_section方法
        item = response.meta['item']		#传递item
        #初始化item的“overview”为NULL
        item['overview'] = 'NULL'
        #初始化item的“objective”为NULL
        item['objective'] = 'NULL'
        #初始化item的“preliminaries”为NULL
        item['preliminaries'] = 'NULL'
        #获取节点列表
        course_section = response.xpath(
            '//div[@id="content-section"]')[0]
        for i in range(3, 10, 2):			#循环,间隔为2
            #定义节点路径,提取节点文本
            path_str = 'div[' + str(i) + ']/span[2]/text()'
            text = course_section.xpath(path_str).extract()
            #定义节点路径
            path = 'div[' + str(i + 1) + ']/div//p//text()'
            if '课程概述' in text:		#如果节点文本包含“课程概述”
                #提取课程概述列表
                overview = course_section.xpath(path).extract()
                overview = ''.join(overview)	#连接列表中元素
                item['overview'] = overview	#写入item
            elif '授课目标' in text:		#如果节点文本包含“授课目标”
                #提取授课目标列表
                objective = course_section.xpath(path).extract()
                objective = ''.join(objective)	#连接列表中元素
                item['objective'] = objective		#写入item
            elif '预备知识' in text:		#如果节点文本包含“预备知识”
                #提取预备知识列表
                preliminaries=course_section.xpath(path).extract()
                #连接列表中元素
                preliminaries = ''.join(preliminaries)
                #写入item
                item['preliminaries'] = preliminaries
        yield item						#返回item



school.py:

import scrapy
import re
from MOOCSpider.items import SchoolItem  # import the SchoolItem class from the items module
import json
class SchoolSpider(scrapy.Spider):
    name = 'school'
    allowed_domains = ['www.icourse163.org']
    start_urls = ['https://www.icourse163.org/university/PKU#/c']

    '''def parse(self, response):
        university_list = response.xpath('//div[@class="u-usitys f-cb"]/a')
        #for university in university_list:
        university = university_list[0]
        university_url = 'https://www.icourse163.org' + university.xpath('@href').extract_first()
        yield scrapy.Request(university_url, callback=self.parse_schoolID)'''

    def parse(self, response):
        text = re.search('window.schoolId = "(.*?)"', response.text, re.S)
        school_Id = text.group(1)
        url = 'https://www.icourse163.org/web/j/courseBean.getCourseListBySchoolId.rpc?csrfKey=6d38afce3bd84a39b368f9175f995f2b'
        for num in range(6):
            data = {
                'schoolId': school_Id,
                'p': str(num+1),
                'psize': '20',
                'type': '1',
                'courseStatus': '30'
            }
            yield scrapy.FormRequest(
                method='POST',
                url=url,
                formdata=data,
                callback=self.parse_course,
                dont_filter=True
            )

    def parse_course(self, response):
        data = response.body.decode('utf-8')
        course_list = json.loads(data)['result']['list']
        item = SchoolItem()
        for course in course_list:
            item['university'] = course['schoolName']
            item['courseName'] = course['name']
            item['enrollCount'] = course['enrollCount']
            item['teacher'] = course['teacherName']
            yield item
           #print(university, courseName, enrollCount, teacher)

5) Modify pipelines.py: define a TextPipeline class that keeps only courses with more than 10,000 participants, and a MongoPipeline class that stores the data in MongoDB.

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


# useful for handling different item types with a single interface
from itemadapter import ItemAdapter


class MoocspiderPipeline:
    def process_item(self, item, spider):
        return item

from scrapy.exceptions import DropItem	#导入DropItem模块
class TextPipeline():						#定义TextPipeline类
    def process_item(self, item, spider):#定义process_item方法
        #如果item中的“enrollCount”大于于10000
        if item['enrollCount'] > 10000:
            return item						#返回item
        else:#如果item中的“enrollCount”小于等于10000
            raise DropItem('Missing item')#删除item
import pymongo								#导入pymongo模块
class MongoPipeline():						#定义MongoPipeline类
    def __init__(self, mongo_uri, mongo_db):#定义__init__方法
         self.mongo_uri = mongo_uri		#初始化类中的mongo_uri
         self.mongo_db = mongo_db			#初始化类中的mongo_db
    @classmethod								#使用classmethod标识
    def from_crawler(cls, crawler):		#定义from_crawler方法
        #获取settings.py文件中数据库的URI和数据库名称
        return cls(
            mongo_uri = crawler.settings.get('MONGO_URI'),
            mongo_db = crawler.settings.get('MONGO_DB')
        )
    def open_spider(self, spider):		#定义open_spider方法
        #连接MongoDB数据库
        self.client = pymongo.MongoClient(self.mongo_uri)
        #创建数据库
        self.db = self.client[self.mongo_db]
    def close_spider(self, spider):		#定义close_spider方法
        self.client.close()					#关闭数据库连接
    def process_item(self, item, spider):#定义process_item方法
        data={
            '课程名称': item['courseName'],
            '开课学校': item['university'],
            '课程类型': item['category'],
            '参与人数': item['enrollCount'],
            '课程概述': item['overview'],
            '授课目标': item['objective'],
            '预备知识': item['preliminaries'],
        }										#初始化data
        table = self.db['course']			#新建集合
        table.insert_one(data)				#向数据库中插入数据
        return item							#返回item

6) Modify middlewares.py and define a class that sets a random User-Agent header on each request.

# Define here the models for your spider middleware
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html

from scrapy import signals

# useful for handling different item types with a single interface
from itemadapter import is_item, ItemAdapter


# (The generated MoocspiderSpiderMiddleware and MoocspiderDownloaderMiddleware template
# classes are left unchanged and omitted here.)

# RandomUserAgentMiddleware here is the same class (same user_agent_list and
# process_request) as the one defined in the BaiduSpider project above.

7) Modify settings.py: set ROBOTSTXT_OBEY, DOWNLOADER_MIDDLEWARES and ITEM_PIPELINES, and define the MongoDB connection address and database name.

# Scrapy settings for MOOCSpider project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://docs.scrapy.org/en/latest/topics/settings.html
#     https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://docs.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'MOOCSpider'

SPIDER_MODULES = ['MOOCSpider.spiders']
NEWSPIDER_MODULE = 'MOOCSpider.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# (other default, commented-out settings omitted)

# Disable cookies (enabled by default)
COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
    'cookie': 'EDUWEBDEVICE=ad366d00d2a448df9e8b89e8ddb2abb8; hb_MA-A976-948FFA05E931_source=www.baidu.com; __yadk_uid=BJEprvWuVabEiTPe8yD4cxTxKeAOpusu; NTESSTUDYSI=6d38afce3bd84a39b368f9175f995f2b; Hm_lvt_77dc9a9d49448cf5e629e5bebaa5500b=1601255430,1601272424,1601272688,1601273453; WM_NI=edPVgwr6D7b1I0MgK58PF%2FAm%2FIyhZPldCt5b8sM%2FhscIGdXgkmsyDgzHAmRiUa7FH5TC8pZjD4KIBeRgKqNGbQSw0HaOZchEIuwNDn4YwcBaF2UrBM7WArc6W1IvlSUJZ2M%3D; WM_NIKE=9ca17ae2e6ffcda170e2e6ee89ee69838da495d33de98e8fb3c15b829b8f85f552a69684aaf95caef5fdadd22af0fea7c3b92aa7bca0b7e67db2ea8cd8e13b8bf08388ca3ffc908ad0c467ed97b789d95cb0bc8d95b86afcad83d0eb79a1978985db6da9b3bd9ac76dba988f8ed16397bff9a7cb3f989df891d96288ec85aac16f92b98592cd4da28f9d98b344a3919684eb4f8babb9afc766f887b984c16b86ee9b93c147f5898f93e23e95ef8797ef59979696d3d037e2a3; WM_TID=gSj%2BsvyvzttFRAEVVQI7MbZOMjPj6zKS; Hm_lpvt_77dc9a9d49448cf5e629e5bebaa5500b=1601276871',
}


# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    'MOOCSpider.middlewares.RandomUserAgentMiddleware': 350,
}


# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'MOOCSpider.pipelines.TextPipeline': 400,
    'MOOCSpider.pipelines.MongoPipeline': 500,
}
MONGO_URI = 'localhost'
MONGO_DB = 'MOOC'


8) Run the spider scripts and inspect the data:

scrapy crawl course

V. Summary

(1) The Scrapy framework consists of the Engine, Scheduler, Downloader, Spiders, Item Pipeline, Downloader Middleware and Spider Middleware.

(2) The usual workflow with Scrapy is: first create a new project; next modify the items script to define the structure of the Item data; then create the spider script to parse responses and extract data and new URLs; after that modify settings.py to configure the Scrapy components and define global variables; and finally run the crawler.

VI. Follow-up: a case of analyzing the scraped data

PaddlePaddle AI Studio - an AI learning and training community (baidu.com)

 


Copyright notice: this is an original article by lclchong, licensed under CC 4.0 BY-SA. Please include a link to the original source and this notice when reposting.