City Air Quality (AQI) Data Crawler

A crawler for the air quality data of every city listed on the site, including the city name, AQI, and related fields, saved in .csv format.

Site home page: https://www.aqistudy.cn/historydata/index.php

The first module fetches the city names; in essence it extracts the list of every searchable "city" string from the page:

import requests
from lxml import etree
import time
from urllib import parse          # used later to URL-encode the Chinese city names
import pandas as pd
from selenium import webdriver

headers = {
    'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.204 Safari/537.36'
}

url = "https://www.aqistudy.cn/historydata/"
response = requests.get(url, headers=headers)
text = response.content.decode('utf-8')
html = etree.HTML(text)
city_set = list()
citys = html.xpath("//div[@class='all']/div/ul")
for city in citys:
    messages = city.xpath(".//li")
    for message in messages:
        city_name = message.xpath(".//a/text()")
        city_name = "".join(city_name)
        # print(city_name)
        city_set.append(city_name)
print(len(city_set))  # number of cities that can be crawled
print(city_set)       # list of all city names found
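
If the page markup ever yields blank entries or names padded with whitespace, a short cleanup pass can be appended right after the loop (this is my own addition, not part of the original script):

# Optional cleanup (assumption: some <li> entries may be empty or padded with whitespace).
city_set = [name.strip() for name in city_set if name.strip()]
city_set = list(dict.fromkeys(city_set))  # drop duplicates while preserving page order (Python 3.7+)
print(len(city_set))  # number of cities after cleanup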

Next, decide which months of which years to crawl for every city:

def get_month_set():
    month_set = list()
    for i in range(1, 10):
        month_set.append('2018-0%s' % i)   # months 2018-01 to 2018-09 (zero-padded)
    for i in range(10, 13):
        month_set.append('2018-%s' % i)    # months 2018-10 to 2018-12
    return month_set                       # each page shows one month of data, so we loop by month
month_set = get_month_set()
month_set.reverse()
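
To crawl a different year, or several years at once, the month generator can be parameterized. The helper below (get_month_range is a hypothetical name, not used in the original code) is a minimal sketch of that generalization:

def get_month_range(start_year, end_year):
    """Return 'YYYY-MM' strings for every month from start_year through end_year."""
    months = []
    for year in range(start_year, end_year + 1):
        for month in range(1, 13):
            months.append('%d-%02d' % (year, month))  # zero-pad the month to match the site's URLs
    return months

# Example: produces the same 12 months of 2018 as the original get_month_set().
month_set = get_month_range(2018, 2018)
month_set.reverse()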

Finally comes the scraping module, which pulls the data for the chosen cities and months:

driver = webdriver.PhantomJS(r'E:\phantomjs-2.1.1-windows\bin\phantomjs.exe')  # path to your local phantomjs.exe
base_url = 'https://www.aqistudy.cn/historydata/daydata.php?city='

file_name = 'AQI_2018.csv'
fp = open(file_name, 'w', encoding='utf-8-sig')
fp.write('%s,%s,%s,%s,%s,%s,%s,%s,%s,%s\n' % ('city','date','AQI','grade','PM25','PM10','SO2','CO','NO2','O3_8h'))  # header row

for ct in range(0, len(city_set)):
    for i in range(len(month_set)):
        str_month = month_set[i]
        weburl = ('%s%s&month=%s' % (base_url, parse.quote(city_set[ct]), str_month))
        driver.get(weburl)
        time.sleep(1)  # wait for the page's JavaScript to render the data table
        dfs = pd.read_html(driver.page_source, header=0)[0]
        time.sleep(1)  # brief pause between requests so pages are not skimmed over too quickly
        if len(dfs) != 0:
            for j in range(0, len(dfs)):
                date = dfs.iloc[j, 0]
                aqi = dfs.iloc[j, 1]
                grade = dfs.iloc[j, 2]
                pm25 = dfs.iloc[j, 3]
                pm10 = dfs.iloc[j, 4]
                so2 = dfs.iloc[j, 5]
                co = dfs.iloc[j, 6]
                no2 = dfs.iloc[j, 7]
                o3 = dfs.iloc[j, 8]
                fp.write('%s,%s,%s,%s,%s,%s,%s,%s,%s,%s\n' % (city_set[ct], date, aqi, grade, pm25, pm10, so2, co, no2, o3))
            print('yes---%s,%s---DONE' % (city_set[ct], str_month))
            localtime = time.asctime(time.localtime(time.time()))
            print("time :", localtime)
        else:
            print('%s,%s--error' % (city_set[ct], str_month))
            localtime = time.asctime(time.localtime(time.time()))
            print("time :", localtime)
fp.close()
driver.quit()
print("已完成,谢谢!")

Joining the three modules together gives the complete code for crawling air quality data for cities nationwide.

Note that phantomjs.exe (search Baidu for it) and the required Python packages have to be downloaded separately.
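
PhantomJS is no longer maintained, and newer Selenium releases (4.x) have dropped support for it, so the script as written needs an older Selenium 3.x. If you would rather avoid PhantomJS, the driver line can be swapped for headless Chrome; the sketch below is my own substitution, assuming Chrome and a matching chromedriver are installed, with the chromedriver path only as an example:

from selenium import webdriver

# Headless Chrome as a drop-in replacement for the PhantomJS driver (Selenium 3.x call style).
options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('--disable-gpu')
driver = webdriver.Chrome(executable_path=r'E:\chromedriver\chromedriver.exe', options=options)  # adjust the path to your chromedriver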

Most of the code was adapted from other people's work; on top of it I made some optimizations and added features, so the status output is more complete.

If you run into what looks like anti-crawling measures (the run stalls partway through), try switching to a different network.
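
Besides switching networks, spacing the requests out and retrying months that come back empty can also help. The fetch_month helper below is a sketch of that idea under my own naming, not something from the original code:

import random
import time
import pandas as pd

def fetch_month(driver, weburl, retries=3):
    """Load one month's page, retrying with longer pauses if no table can be parsed."""
    for attempt in range(retries):
        driver.get(weburl)
        time.sleep(2 + random.uniform(0, 3))  # randomized delay so requests look less mechanical
        try:
            tables = pd.read_html(driver.page_source, header=0)
            if len(tables) > 0 and len(tables[0]) > 0:
                return tables[0]
        except ValueError:  # read_html raises ValueError when the page contains no table
            pass
        time.sleep(5 * (attempt + 1))  # back off a little longer before the next attempt
    return None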

References:

https://zhuanlan.zhihu.com/p/132496133 (fetching the city list from the page)

https://blog.csdn.net/jancydc/article/details/107511400 (looping over the cities to fetch their AQI data)

https://blog.csdn.net/weixin_40651515/article/details/84592530 (web crawler and the main AQI-fetching code)

 


Copyright notice: This is an original article by Along1617188, released under the CC 4.0 BY-SA license. Please include a link to the original source and this notice when republishing.