Notes from Python Crawler Experiments: A Python Web-Scraping Summary

Tips:

1. Use the Network tab in Chrome DevTools: page through the site and watch the requests to quickly locate the URL that actually returns the data.

2. Use Postman to quickly generate the request code for the crawler.

Things to watch out for:

1. When exporting to CSV, Chinese text can come out garbled (an encoding issue).

2. When scraping timestamps, the date format usually needs converting.
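The date-format conversion mentioned above usually boils down to a strptime/strftime round trip, as in this minimal sketch (the input string and formats are just examples):

```python
from datetime import datetime

# Parse the raw string scraped from the page (the format is site-specific),
# then re-emit it in the format we want to store in the CSV.
raw = "2019-03-08 10:30:00"
parsed = datetime.strptime(raw, "%Y-%m-%d %H:%M:%S")
print(parsed.strftime("%Y-%m-%d"))  # → 2019-03-08
```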

Code:

1. API-style sites

If the site fetches its data as JSON straight from an API, there is no need to parse the page DOM at all, which makes this the easy case.

result1 = []
url = "https://ecp.sgcc.com.cn/ecp2.0/ecpwcmcore//index/noteList"
headers = {'Content-Type': "application/json", 'cache-control': "no-cache"}

for page in range(1, 11):
    payload = "{\"firstPageMenuId\": \"2018032700291334\", \"index\": " + str(page) + ", \"key\": \"\", \"orgId\": \"\", \"purOrgCode\": \"\", \"purOrgStatus\": \"\", \"purType\": \"\", \"size\": 20}"
    response = requests.request("POST", url, data=payload, headers=headers)
    mydicts = json.loads(response.text)
    for mydict in mydicts["resultValue"]["noteList"]:
        publishTime = datetime.strptime(mydict["noticePublishTime"], '%Y-%m-%d')
        link = "https://ecp.sgcc.com.cn/ecp2.0/portal/#/doc/doc-spec/" + str(mydict["firstPageDocId"]) + "_2018032700291334"
        obj = {'title': mydict["title"], 'publishTime': publishTime.strftime("%Y-%m-%d"), 'link': link}
        result1.append(obj)

print(result1)
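Hand-building the JSON payload by string concatenation, as above, is fragile; a plain dict plus json.dumps produces the same request body and is easier to maintain. A sketch of the same payload:

```python
import json

def build_payload(page):
    # Same fields as the hand-concatenated string above, built from a dict.
    return json.dumps({
        "firstPageMenuId": "2018032700291334",
        "index": page,
        "key": "", "orgId": "", "purOrgCode": "",
        "purOrgStatus": "", "purType": "",
        "size": 20,
    })

print(build_payload(3))
```

(requests can also take the dict directly via its json= parameter, which sets the Content-Type header for you.)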

2. Sites that return an HTML page

Some sites nest an iframe inside the page, or load the data page asynchronously, so you have to find the real URL behind it; simply requesting the address in the location bar is not enough.

result2 = []

for page in range(1, 11):
    url = "http://bulletin.sntba.com/xxfbcmses/search/bulletin.html?dates=300&categoryId=88&page=" + str(page) + "&showStatus=1"
    html = urllib.request.urlopen(url)  # urllib.urlopen in Python 2
    bsObj = BeautifulSoup(html, "html.parser")
    for tr in bsObj.find("table", {"class": "table_text"}).findAll("tr"):
        tds = tr.findAll('td')
        if len(tds) > 0:  # skip the header row, which has no <td> cells
            title = tds[0].get_text().strip()
            publishTime = datetime.strptime(tds[4].get_text().strip(), '%Y-%m-%d')
            link = tds[0].find('a')['href'][20:-2]  # slice off the wrapper around the real URL
            obj = {'title': title, 'publishTime': publishTime.strftime("%Y-%m-%d"), 'link': link}
            result2.append(obj)

print(result2)
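For the iframe case mentioned above, the real URL can usually be read straight out of the iframe's src attribute before fetching it. A minimal sketch on a literal page (the tag layout and URL are made up for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical outer page whose content is loaded through an iframe.
page = '<html><body><iframe src="http://example.com/real-list.html"></iframe></body></html>'
soup = BeautifulSoup(page, "html.parser")
real_url = soup.find("iframe")["src"]  # fetch this URL instead of the outer page
print(real_url)  # → http://example.com/real-list.html
```

For pages loaded asynchronously, the same idea applies, but the real URL is found in the DevTools Network tab rather than in the HTML.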

3. Exporting CSV

# Python 2 only: force the default encoding to UTF-8
# (unnecessary, and unavailable, in Python 3)
reload(sys)
sys.setdefaultencoding('utf-8')

csvFile = open('result.csv', 'wb')
# 'r':  read-only (the default; raises an error if the file does not exist)
# 'w':  write-only (creates the file if it does not exist)
# 'a':  append to the end of the file (creates the file if it does not exist)
# 'r+': read/write (raises an error if the file does not exist)

# utf-8-sig writes a BOM, so Excel opens the Chinese text correctly
csvWriter = unicodecsv.writer(csvFile, encoding="utf-8-sig")

result = []

# ... crawler code omitted here; it fills `result` with the dicts built above ...

for o in result:
    title = o['title']
    publishTime = o['publishTime']
    link = o['link']
    csvWriter.writerow([title, publishTime, link])

csvFile.close()
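On Python 3 the third-party unicodecsv package is no longer needed: the stdlib csv module with encoding='utf-8-sig' solves the same Excel garbling problem. A sketch (the file name and sample row are just examples), with a read-back to check the round trip:

```python
import csv

rows = [{'title': '招标公告', 'publishTime': '2019-03-08', 'link': 'http://example.com/1'}]

# newline='' is required by the csv module; utf-8-sig adds the BOM for Excel.
with open('result.csv', 'w', newline='', encoding='utf-8-sig') as f:
    writer = csv.writer(f)
    for o in rows:
        writer.writerow([o['title'], o['publishTime'], o['link']])

# Read it back to confirm the Chinese text survived the round trip.
with open('result.csv', newline='', encoding='utf-8-sig') as f:
    print(next(csv.reader(f)))  # → ['招标公告', '2019-03-08', 'http://example.com/1']
```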

4. Imports

#coding=utf-8
import urllib.request  # plain `import urllib` in Python 2
import requests
import json
import time  # sleep a few seconds between requests so the crawler looks more like a human browsing
import sys
import unicodecsv
from datetime import datetime
from bs4 import BeautifulSoup  # parses the fetched HTML pages

Other:

Sites that require a CAPTCHA are still a work in progress ~~~