爬虫爬取今日头条街拍美图

一抓取分析

1 在抓取之前，首先要分析抓取的逻辑，打开今日头条的首页http://www.toutiao.com/

2 右上角有一个搜索入口，这里尝试抓取街拍美图，所以输入“街拍”二字搜索一下。

3 分析数据

4 我们的目的是要抓取其中的美图，这里一组图就对应前面data字段中的一条数据。每条数据还有一个image_listl字段，它是列表形式，这其中就包含了组图的所有图片列表。

5 我们只需要将列表中的url字段提取出来并下载下来就好了。每一组图都建立一个文件夹，文件夹的名称就为组图的标题。

6 接下来，就可以直接用Python来模拟这个Ajax请求，然后提取出相关美图链接并下载。但是在这之前，我们还需要分析一下URL的规律。

7 切换回Headers选项卡，观察一下它的请求URL和Headers信息，如下图所示。

可以看到，这是一个GET请求，请求URL的参数有offset、format、keyword、autoload、count、cur_tab、from和pd。我们需要找出这些参数的规律，因为这样才可以方便地用程序构造出来。

滑动窗口，观察一下后续链接的参数，发现变化的参数只有offset，其他参数都没有变化，而且第二次请求的offset值为20，第三次为40，第四次为60，所以可以发现规律，这个offset值就是偏移量，进而可以推断出count参数就是一次性获取的数据条数。因此，我们可以用offset参数来控制数据分页。这样一来，我们就可以通过接口批量获取数据了，然后将数据解析，将图片下载下来即可。

二爬取代码

import requests
from urllib.parse import urlencode
from requests import codes
import os
from hashlib import md5
from multiprocessing.pool import Pool

# 实现方法get_page()来加载单个Ajax请求的结果。
# 其中唯一变化的参数就是offset，所以我们将它当作参数传递
def get_page(offset):
    params = {
        'offset': offset,
        'format': 'json',
        'keyword': '街拍',
        'autoload': 'true',
        'count': '20',
        'cur_tab': '1',
        'from': 'search_tab',
        'pd':'synthesis'
    }
    base_url = 'https://www.toutiao.com/search_content/?'
    # 这里我们用urlencode()方法构造请求的GET参数
    url = base_url + urlencode(params)
    try:
        # 然后用requests请求这个链接
        resp = requests.get(url)
        # 如果返回状态码为codes.ok(200),将结果转为JSON格式
        # 然后返回
        if codes.ok == resp.status_code:
            return resp.json()
    except requests.ConnectionError:
        return None

# 接下来，再实现一个解析方法：提取每条数据的image_list字段中的每一张图片链接，
# 将图片链接和图片所属的标题一并返回，此时可以构造一个生成器。
def get_images(json):
    if json.get('data'):
        data = json.get('data')
        for item in data:
            if item.get('cell_type') is not None:
                continue
            title = item.get('title')
            images = item.get('image_list')
            for image in images:
                yield {
                    'image': 'https:' + image.get('url'),
                    'title': title
                }

# 接下来，实现一个保存图片的方法save_image()，
# 其中item就是前面get_images()方法返回的一个字典。
def save_image(item):
    img_path = 'img' + os.path.sep + item.get('title')
    if not os.path.exists(img_path):
        # 首先根据item的title来创建文件夹
        os.makedirs(img_path)
    try:
        # 然后请求这个图片链接，获取图片的二进制数据
        resp = requests.get(item.get('image'))
        if codes.ok == resp.status_code:
            # 构造文件存储路径
            file_path = img_path + os.path.sep + '{file_name}.{file_suffix}'.format(
                file_name=md5(resp.content).hexdigest(),
                file_suffix='jpg')
            if not os.path.exists(file_path):
                with open(file_path, 'wb') as f:
                    # 以二进制的形式写入文件。图片的名称可以使用其内容的MD5值，这样可以去除重复。
                    f.write(resp.content)
                print('Downloaded image path is %s' % file_path)
            else:
                print('Already Downloaded', file_path)
    except requests.ConnectionError:
        print('Failed to Save Image，item %s' % item)

# 每一页
def main(offset):
    json = get_page(offset)
    for item in get_images(json):
        print(item)
        save_image(item)


GROUP_START = 0
GROUP_END = 3

if __name__ == '__main__':
    # 分页的起始页数和终止页数，分别为GROUP_START和GROUP_END，还利用了多线程的线程池，调用其map()方法实现多线程下载
    pool = Pool()
    groups = ([x * 20 for x in range(GROUP_START, GROUP_END + 1)])
    pool.map(main, groups)
    pool.close()
    pool.join()

三结果

原文链接：https://blog.csdn.net/chengqiuming/article/details/86549390