Really, I just wanted to try scraping some images.
Looking at the page, there are two things to scrape: the cover image and the download links. Pretty simple.

Item definition:
```python
import scrapy


class TiantianmeijuItem(scrapy.Item):
    name = scrapy.Field()
    image_urls = scrapy.Field()
    images = scrapy.Field()
    image_paths = scrapy.Field()
    episode = scrapy.Field()
    episode_url = scrapy.Field()
```
name holds the show's name.
image_urls and images are used by the image-downloading pipeline: one holds the image URLs, the other the metadata about the stored images.
image_paths has no real purpose here; it just records the paths of successfully downloaded images.
episode and episode_url hold the episode numbers and their corresponding download links.
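As a rough sketch (plain dicts standing in for the Scrapy item, with made-up example data), episode and episode_url are parallel lists that pair up one-to-one:

```python
# Hypothetical example data standing in for a scraped TiantianmeijuItem.
item = {
    'name': ['Some Show'],
    'episode': ['S01E01', 'S01E02'],
    'episode_url': ['http://example.com/ep1', 'http://example.com/ep2'],
}

# Pair each episode label with its download link, the same way the
# file-writing pipeline later does with zip().
pairs = list(zip(item['episode'], item['episode_url']))
```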
Spider:
```python
import scrapy
from tiantianmeiju.items import TiantianmeijuItem

import sys
reload(sys)  # Python 2 removes sys.setdefaultencoding after startup; reload sys to restore it
sys.setdefaultencoding('utf-8')


class CacthUrlSpider(scrapy.Spider):
    name = 'meiju'
    allowed_domains = ['cn163.net']
    start_urls = ["http://cn163.net/archives/{id}/".format(id=id)
                  for id in ['16355', '13470', '18766', '18805']]

    def parse(self, response):
        item = TiantianmeijuItem()
        item['name'] = response.xpath('//*[@id="content"]/div[2]/div[2]/h2/text()').extract()
        item['image_urls'] = response.xpath('//*[@id="entry"]/div[2]/img/@src').extract()
        item['episode'] = response.xpath('//*[@id="entry"]/p[last()]/a/text()').extract()
        item['episode_url'] = response.xpath('//*[@id="entry"]/p[last()]/a/@href').extract()
        yield item
```
The page structure is fairly simple.
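The start_urls expression is just a list comprehension over the archive ids; pulled out on its own, it expands like this:

```python
# Expand the URL template for each archive id, exactly as in the spider.
ids = ['16355', '13470', '18766', '18805']
start_urls = ["http://cn163.net/archives/{id}/".format(id=i) for i in ids]
```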
Pipelines: I wrote two pipelines here — one writes the download links to a file, the other downloads the images.
```python
import os

from scrapy.pipelines.images import ImagesPipeline
from scrapy.exceptions import DropItem
from scrapy.http import Request

from settings import IMAGES_STORE


class TiantianmeijuPipeline(object):
    def process_item(self, item, spider):
        return item


class WriteToFilePipeline(object):
    def process_item(self, item, spider):
        item = dict(item)
        FolderName = item['name'][0].replace('/', '')
        downloadFile = 'download_urls.txt'
        with open(os.path.join(IMAGES_STORE, FolderName, downloadFile), 'w') as file:
            for name, url in zip(item['episode'], item['episode_url']):
                file.write('{name}: {url}\n'.format(name=name, url=url))
        return item


class MyImagesPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        for image_url in item['image_urls']:
            yield Request(image_url, meta={'item': item})

    def item_completed(self, results, item, info):
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem("Item contains no images")
        item['image_paths'] = image_paths
        return item

    def file_path(self, request, response=None, info=None):
        item = request.meta['item']
        FolderName = item['name'][0].replace('/', '')
        image_guid = request.url.split('/')[-1]
        filename = u'{}/{}'.format(FolderName, image_guid)
        return filename
```
MyImagesPipeline overrides get_media_requests and item_completed. Because the default image storage path is
<IMAGES_STORE>/full/3afec3b4765f8f0a07b78f98c07b83f013567a0a.jpg,
and I want to replace "full" with a directory named after the show, I also overrode file_path.
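Extracted as a standalone function (with hypothetical inputs), the overridden file_path logic amounts to:

```python
def file_path(url, show_name):
    # Strip '/' so the show name is safe to use as a directory name.
    folder = show_name.replace('/', '')
    # Keep the original image filename from the end of the URL.
    image_guid = url.split('/')[-1]
    return u'{}/{}'.format(folder, image_guid)

path = file_path('http://cn163.net/wp-content/cover.jpg', 'Some/Show')
```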
Enable the pipelines in settings.py:
```python
import os

ITEM_PIPELINES = {
    'tiantianmeiju.pipelines.WriteToFilePipeline': 2,
    'tiantianmeiju.pipelines.MyImagesPipeline': 1,
}
IMAGES_STORE = os.path.join(os.getcwd(), 'image')  # image storage path
IMAGES_EXPIRES = 90
IMAGES_MIN_HEIGHT = 110
IMAGES_MIN_WIDTH = 110
```
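Note that the numbers in ITEM_PIPELINES are priorities: Scrapy runs pipelines in ascending order of their value, so MyImagesPipeline (1) sees each item before WriteToFilePipeline (2). A quick sketch of that ordering:

```python
ITEM_PIPELINES = {
    'tiantianmeiju.pipelines.WriteToFilePipeline': 2,
    'tiantianmeiju.pipelines.MyImagesPipeline': 1,
}

# Scrapy sorts pipelines by their value, lowest first.
run_order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)
```

This ordering matters here: the image pipeline must run first so the images are downloaded before the link file is written into the show's folder.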
After the crawl finishes, this is the result:


This article is reposted from the 运维笔记 blog on 51CTO. Original link: http://blog.51cto.com/lihuipeng/1713531 — please contact the original author before reprinting.
lihuipeng