Scrapy: custom pipelines for saving images, JSON, CSV, and MySQL data

In the previous post I briefly covered Scrapy's built-in ways of saving data. This post shows how to save files and images with custom pipelines.

1. Saving images:

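The initial steps stay the same. Once the spider is returning data, open pipelines.py, add from scrapy.pipelines.images import ImagesPipeline, and make your pipeline class inherit from ImagesPipeline, as shown below: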

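import scrapy
# ImagesPipeline is Scrapy's built-in pipeline for downloading images
from scrapy.pipelines.images import ImagesPipeline

# By inheriting from the built-in pipeline, our pipeline gains its image-download behaviour
class ZhanzhangPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        print('get_media_requests called')
        # print(item['title'])
        # print(item['img'])
        # This method runs once per item: each item the spider yields is handed
        # to the engine, which passes it on to the pipeline.
        # The pipeline's built-in methods then run one after another, in order.
        # item['img'] is a list of URLs; only the first one is requested here.
        yield scrapy.Request(url=item['img'][0], meta={'item': item})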

Next, set the path and file name under which each image is saved:

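    # The keyword-only item=None keeps the signature compatible with newer
    # Scrapy releases, which pass the item as a keyword argument
    def file_path(self, request, response=None, info=None, *, item=None):
        print('====================')
        # The item was attached to the request in get_media_requests
        item = request.meta['item']
        print(item['title'])
        print(item['img'])
        # Store each image as <title>/<file name taken from the URL>
        image_name = item['img'][0].split('/')[-1]
        # Join the path with '/', not '\'
        path = '%s/%s' % (item['title'], image_name)
        return path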
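The path logic can be tried outside Scrapy. Below is a minimal sketch, not part of the original pipeline: build_image_path is a hypothetical helper, and the extra step of replacing characters that are invalid in folder names is an assumption added for illustration.

```python
import posixpath
import re
from urllib.parse import urlparse


def build_image_path(title, image_url):
    """Return 'title/filename' for an image URL, always joined with '/'."""
    # Take the last segment of the URL path, ignoring any query string
    image_name = posixpath.basename(urlparse(image_url).path)
    # Replace characters that are not allowed in folder names on common file systems
    safe_title = re.sub(r'[\\/:*?"<>|]', '_', title).strip()
    return '%s/%s' % (safe_title, image_name)


print(build_image_path('cats/dogs', 'http://example.com/img/a.jpg?x=1'))
```

Sanitising the title matters when it comes straight from a web page: a title that contains a slash would otherwise create an unintended nested folder.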

Finally, open settings.py, uncomment ITEM_PIPELINES, and set the folder where images are saved (the IMAGES_STORE setting).

Note: if you rename the class in pipelines.py, update the matching entry in settings.py as well.
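As a rough illustration, the relevant part of settings.py might look like the fragment below. The package name zhanzhang and the folder name images are assumptions; substitute your own project name and storage location.

```python
# settings.py (fragment)
# 'zhanzhang' is a hypothetical project package name; use your own.
ITEM_PIPELINES = {
    # The number sets the run order among pipelines (lower runs first)
    'zhanzhang.pipelines.ZhanzhangPipeline': 300,
}

# Folder where ImagesPipeline stores downloads; the value returned by
# file_path() is interpreted relative to this folder
IMAGES_STORE = 'images'
```

ImagesPipeline also needs the Pillow library installed, otherwise Scrapy cannot enable the pipeline.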

After the crawl runs, the images appear under the configured folder, in one subfolder per title.

2. Saving a JSON file

The built-in command is: scrapy crawl <spider name> -o mei.json -s FEED_EXPORT_ENCODING=UTF-8

A custom pipeline that does the same job looks like this:

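import codecs
import json
import os

class XiaoshuoPipeline(object):
    def __init__(self):
        # 'w+' opens for reading and writing and creates the file if it does not exist;
        # 'r+' also reads and writes, but raises an exception if the file is missing
        self.file = codecs.open(filename='book.json', mode='w+', encoding='utf-8')
        # Open a JSON object holding one "list" array so the finished file is valid JSON
        self.file.write('{"list":[')
        self.count = 0

    def open_spider(self, spider):
        print('Spider started')

    # Keep this method whenever items are written to a local file or a database
    def process_item(self, item, spider):
        # Convert the item to a plain dict
        res = dict(item)
        # A dict cannot be written to a file directly, so serialise it to a string first;
        # ensure_ascii=False keeps non-ASCII characters instead of \u escapes
        line = json.dumps(res, ensure_ascii=False)
        self.file.write(line)
        self.file.write(',\n')
        self.count += 1
        # Return the item so that any later pipeline still receives it
        return item

    def close_spider(self, spider):
        print('Spider finished')
        if self.count:
            # Drop the trailing ',\n' after the last item: seek to 2 bytes before
            # the end of the file (os.SEEK_END), then cut off everything after that
            self.file.seek(-2, os.SEEK_END)
            self.file.truncate()
        self.file.write(']}')
        self.file.close()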
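The trailing-separator trick is easy to get wrong, so it helps to check that the result actually parses. The sketch below reproduces the technique outside Scrapy with a temporary file and made-up sample items, assuming the file is wrapped as {"list": [...]} so that json.load accepts it.

```python
import json
import os
import tempfile

items = [{'name': 'a'}, {'name': 'b'}]  # made-up sample data

path = os.path.join(tempfile.mkdtemp(), 'book.json')
# Binary mode, so seek() offsets are plain byte counts
with open(path, 'w+b') as f:
    f.write(b'{"list":[')
    for item in items:
        f.write(json.dumps(item, ensure_ascii=False).encode('utf-8'))
        f.write(b',\n')
    if items:
        # Step back over the trailing ',\n' (2 bytes) and cut it off
        f.seek(-2, os.SEEK_END)
        f.truncate()
    f.write(b']}')

# Reading the file back confirms that it is valid JSON
with open(path, encoding='utf-8') as f:
    data = json.load(f)
print(data)
```

The if items guard matters: with no items there is no trailing separator, and seeking back would delete part of the opening text instead.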

When that is done, open settings.py and uncomment ITEM_PIPELINES.

3. Saving as CSV

Built-in command: scrapy crawl <spider name> -o mei.csv

The custom pipeline looks like this:

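import csv

class TaobaospiderPipeline(object):
    def __init__(self):
        # Keep a handle on the file so it can be closed when the spider finishes
        self.file = open('taobao.csv', 'w+', newline='', encoding='utf-8')
        self.writer = csv.writer(self.file)
        # Header row: one column per field written below
        self.writer.writerow(['name', 'price', 'shopper', 'img'])

    def process_item(self, item, spider):
        # Each field holds a list; zip pairs them up into one row per product
        rows = zip(item['name'], item['price'], item['shopper'], item['img'])
        for row in rows:
            self.writer.writerow(row)
        return item

    def close_spider(self, spider):
        self.file.close()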
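One way to keep the header and the row values from drifting apart is to drive both from a single field list. The sketch below uses csv.DictWriter with an in-memory buffer and made-up sample data; write_rows and FIELDS are hypothetical names, and this is an alternative to the pipeline above rather than the original method.

```python
import csv
import io

# Single source of truth for both the header and the row values
FIELDS = ['name', 'price', 'shopper', 'img']


def write_rows(stream, item):
    """Write a header plus one CSV row per product held in the item's parallel lists."""
    writer = csv.DictWriter(stream, fieldnames=FIELDS)
    writer.writeheader()
    for values in zip(*(item[field] for field in FIELDS)):
        writer.writerow(dict(zip(FIELDS, values)))


# Made-up sample item: every field is a list of equal length
item = {
    'name': ['phone A', 'phone B'],
    'price': ['1999', '2999'],
    'shopper': ['shop 1', 'shop 2'],
    'img': ['a.jpg', 'b.jpg'],
}

buffer = io.StringIO()
write_rows(buffer, item)
print(buffer.getvalue())
```

In a real pipeline the header would be written once when the spider opens, not once per item as in this one-shot demonstration.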

To save the data as an Excel spreadsheet instead:

import xlwt

class TaobaoPipeline(object):
    def __init__(self):
        self.workbook = xlwt.Workbook(encoding='utf-8')
        # Sheet name ("OnePlus phones")
        self.sheet = self.workbook.add_sheet('一加手机')
        self.info_list = ['info', 'price', 'shop', 'img_src']
        # Row 0 holds the header, so data starts at row 1
        self.row = 1

    def open_spider(self, spider):
        # Write the header row
        for index, info in enumerate(self.info_list):
            self.sheet.write(0, index, info)

    def close_spider(self, spider):
        # xlwt writes the legacy .xls format, so save with that extension
        self.workbook.save("Taobao.xls")

    def process_item(self, item, spider):
        data_list = [item["info"], item["price"], item["shop"], item["img_src"]]
        # Write one cell per field, then move on to the next row
        for index, data in enumerate(data_list):
            self.sheet.write(self.row, index, data)
        self.row += 1
        return item

4. Saving data to a MySQL database

First, open MySQL and create the database, setting its character set to 'utf8'.

Then open the pipelines file and add the following:

import pymysql

class DianyingspiderPipeline(object):
    def __init__(self):
        self.connect = pymysql.connect(host='localhost', user='***', password='******',
                                       database='movie', port=3306, charset='utf8')
        self.cursor = self.connect.cursor()

    def process_item(self, item, spider):
        # Pass the values as parameters so the driver escapes them,
        # rather than formatting them into the SQL string
        self.cursor.execute('insert into moreTable (name, href) VALUES (%s, %s)',
                            (item['name'], item['href']))
        self.connect.commit()
        return item

    def close_spider(self, spider):
        self.cursor.close()
        self.connect.close()
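Building the INSERT statement with string formatting breaks as soon as a value contains a quote character, and it leaves the query open to SQL injection. Passing the values as query parameters avoids both problems. The sketch below shows the idea with the standard-library sqlite3 module as a stand-in for MySQL, so it runs without a database server. The sample item is made up, and sqlite3 uses ? as its placeholder where PyMySQL uses %s.

```python
import sqlite3

# In-memory database used only as a stand-in for the MySQL 'movie' database
conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute('CREATE TABLE moreTable (name TEXT, href TEXT)')

# Made-up item whose name contains double quotes
item = {'name': 'He said "hi"', 'href': 'http://example.com/1'}

# Values travel separately from the SQL text, so quotes need no manual escaping
cur.execute('INSERT INTO moreTable (name, href) VALUES (?, ?)',
            (item['name'], item['href']))
conn.commit()

rows = cur.execute('SELECT name, href FROM moreTable').fetchall()
print(rows)
conn.close()
```

With a string-formatted query, the quotes inside this name would end the SQL string literal early and the statement would fail.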

After that, open the settings file and uncomment the corresponding lines.

Original article: https://blog.csdn.net/weixin_42657103/article/details/81635514