Takeaways:
1. Use Chrome's Network panel: page through the listing and watch the requests to quickly pinpoint the URL that actually returns the data.
2. Postman can quickly generate crawler code from a captured request.
Pitfalls:
1. Chinese text comes out garbled when exporting to CSV.
2. Scraped timestamps need format conversion.
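Both pitfalls have short fixes, sketched below with made-up sample values (the filename `demo.csv` and the date string are just for illustration): normalize dates through `datetime`, and write CSV as "utf-8-sig" so Excel sees the BOM and decodes Chinese correctly.

```python
# -*- coding: utf-8 -*-
import csv
from datetime import datetime

# Pitfall 2: parse the scraped time string once, then re-emit it in a
# uniform format.
raw = "2018-03-27"
parsed = datetime.strptime(raw, "%Y-%m-%d")
normalized = parsed.strftime("%Y-%m-%d")

# Pitfall 1: "utf-8-sig" prepends a BOM, which is what makes Excel open
# Chinese text without mojibake (plain "utf-8" is not enough for Excel).
with open("demo.csv", "w", newline="", encoding="utf-8-sig") as f:
    csv.writer(f).writerow(["标题", normalized])
```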
Code:
1. API type
If the site fetches its data as JSON from an API, there is no need to parse the page DOM; this is the simple case.
result1 = []
url = "https://ecp.sgcc.com.cn/ecp2.0/ecpwcmcore//index/noteList"
headers = {'Content-Type': "application/json", 'cache-control': "no-cache"}
for page in range(1, 11):
    payload = json.dumps({"firstPageMenuId": "2018032700291334", "index": page, "key": "", "orgId": "", "purOrgCode": "", "purOrgStatus": "", "purType": "", "size": 20})
    response = requests.request("POST", url, data=payload, headers=headers)
    mydicts = json.loads(response.text)
    for mydict in mydicts["resultValue"]["noteList"]:
        publishTime = datetime.strptime(mydict["noticePublishTime"], '%Y-%m-%d')
        link = "https://ecp.sgcc.com.cn/ecp2.0/portal/#/doc/doc-spec/" + str(mydict["firstPageDocId"]) + "_2018032700291334"
        obj = {'title': mydict["title"], 'publishTime': publishTime.strftime("%Y-%m-%d"), 'link': link}
        result1.append(obj)
print(result1)
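The imports section below pulls in `time` so the crawl can pause between pages and look less like a bot, but the loop above never actually sleeps. A minimal sketch of the pacing helper (the function name and the delay values are my own, chosen arbitrarily):

```python
import random
import time

def polite_pause(base=2.0, jitter=1.0):
    """Sleep for `base` plus up to `jitter` extra seconds, so requests
    are spaced out irregularly instead of fired in a tight loop."""
    time.sleep(base + random.uniform(0, jitter))

for page in range(1, 4):
    # response = requests.request("POST", url, data=payload, headers=headers)
    polite_pause(base=0.2, jitter=0.2)  # short values just for the demo
```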
2. URL-returns-HTML type
Some sites embed the content in an iframe or load the data page asynchronously, so you have to find the real request URL; the address-bar URL alone is not enough.
result2 = []
for page in range(1, 11):
    url = "http://bulletin.sntba.com/xxfbcmses/search/bulletin.html?dates=300&categoryId=88&page=" + str(page) + "&showStatus=1"
    html = urllib.request.urlopen(url)
    bsObj = BeautifulSoup(html, "html.parser")
    for tr in bsObj.find("table", {"class": "table_text"}).findAll("tr"):
        if len(tr.findAll('td')) > 0:  # skip the header row, which has no <td> cells
            title = tr.findAll('td')[0].get_text().strip()
            publishTime = datetime.strptime(tr.findAll('td')[4].get_text().strip(), '%Y-%m-%d')
            link = tr.findAll('td')[0].find('a')['href'][20:-2]
            obj = {'title': title, 'publishTime': publishTime.strftime("%Y-%m-%d"), 'link': link}
            result2.append(obj)
print(result2)
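When the listing lives inside an iframe, the outer page only carries a pointer to it. A minimal sketch of pulling the iframe's real URL out with BeautifulSoup so it can be fetched directly (the HTML snippet here is made up for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical outer page: the data table sits inside the iframe, so the
# crawler must request the iframe's src, not the outer URL.
outer_html = '<html><body><iframe src="/xxfbcmses/search/bulletin.html?page=1"></iframe></body></html>'
soup = BeautifulSoup(outer_html, "html.parser")
iframe = soup.find("iframe")
print(iframe["src"])  # this is the URL to actually fetch
```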
3. Exporting CSV
csvFile = open('result.csv', 'wb')  # unicodecsv expects a binary-mode file
# 'r': read-only (default; raises an error if the file does not exist)
# 'w': write-only (creates the file if it does not exist)
# 'a': append to the end of the file (creates the file if it does not exist)
# 'r+': read and write (raises an error if the file does not exist)
csvWriter = unicodecsv.writer(csvFile, encoding="utf-8-sig")  # the BOM keeps Excel from garbling Chinese
result = []
# crawler code omitted here; it fills `result`
for o in result:
    title = o['title']
    publishTime = o['publishTime']
    link = o['link']
    csvWriter.writerow([title, publishTime, link])
csvFile.close()
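On Python 3 the same export works without the unicodecsv dependency: open the file in text mode with encoding="utf-8-sig" (the BOM is still what keeps Excel from garbling Chinese). A sketch with a made-up row:

```python
import csv

rows = [{'title': '测试公告', 'publishTime': '2018-03-27', 'link': 'https://example.com/doc/1'}]
# newline="" is required by the csv module so it controls line endings itself.
with open('result.csv', 'w', newline='', encoding='utf-8-sig') as f:
    writer = csv.writer(f)
    for o in rows:
        writer.writerow([o['title'], o['publishTime'], o['link']])
```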
4. Imports
#coding=utf-8
import urllib.request
import requests
import json
import time  # used to sleep a few seconds between requests so crawling looks more human
import sys
import unicodecsv
from datetime import datetime
from bs4 import BeautifulSoup  # parses the fetched HTML
Other:
Sites that require a CAPTCHA are still under investigation~~~