Takeaways:
1. Use Chrome's Network panel: page through the listing and watch the requests to quickly pinpoint the URL that actually returns the data.
2. Postman can quickly generate crawler code from a captured request.
Pitfalls:
1. Chinese text comes out garbled when exporting to CSV.
2. Scraped timestamps need format conversion.
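Both pitfalls have short fixes, sketched below with made-up sample values (the filename `demo.csv` and the date string are just for illustration): normalize dates through `datetime`, and write CSV as "utf-8-sig" so Excel sees the BOM and decodes Chinese correctly.

```python
# -*- coding: utf-8 -*-
import csv
from datetime import datetime

# Pitfall 2: parse the scraped time string once, then re-emit it in a
# uniform format.
raw = "2018-03-27"
parsed = datetime.strptime(raw, "%Y-%m-%d")
normalized = parsed.strftime("%Y-%m-%d")

# Pitfall 1: "utf-8-sig" prepends a BOM, which is what makes Excel open
# Chinese text without mojibake (plain "utf-8" is not enough for Excel).
with open("demo.csv", "w", newline="", encoding="utf-8-sig") as f:
    csv.writer(f).writerow(["标题", normalized])
```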
Code:
1. API type
If the site fetches its data as JSON from an API, there is no need to parse the page DOM; this is the simple case.
result1 = []
url = "https://ecp.sgcc.com.cn/ecp2.0/ecpwcmcore//index/noteList"
headers = {'Content-Type': "application/json", 'cache-control': "no-cache"}
for page in range(1, 11):
    payload = json.dumps({"firstPageMenuId": "2018032700291334", "index": page, "key": "", "orgId": "", "purOrgCode": "", "purOrgStatus": "", "purType": "", "size": 20})
    response = requests.request("POST", url, data=payload, headers=headers)
    mydicts = json.loads(response.text)
    for mydict in mydicts["resultValue"]["noteList"]:
        publishTime = datetime.strptime(mydict["noticePublishTime"], '%Y-%m-%d')
        link = "https://ecp.sgcc.com.cn/ecp2.0/portal/#/doc/doc-spec/" + str(mydict["firstPageDocId"]) + "_2018032700291334"
        obj = {'title': mydict["title"], 'publishTime': publishTime.strftime("%Y-%m-%d"), 'link': link}
        result1.append(obj)
print(result1)
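The imports section below pulls in `time` so the crawl can pause between pages and look less like a bot, but the loop above never actually sleeps. A minimal sketch of the pacing helper (the function name and the delay values are my own, chosen arbitrarily):

```python
import random
import time

def polite_pause(base=2.0, jitter=1.0):
    """Sleep for `base` plus up to `jitter` extra seconds, so requests
    are spaced out irregularly instead of fired in a tight loop."""
    time.sleep(base + random.uniform(0, jitter))

for page in range(1, 4):
    # response = requests.request("POST", url, data=payload, headers=headers)
    polite_pause(base=0.2, jitter=0.2)  # short values just for the demo
```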
2. URL-returns-HTML type
Some sites embed the content in an iframe or load the data page asynchronously, so you have to find the real request URL; the address-bar URL alone is not enough.
result2 = []
for page in range(1, 11):
    url = "http://bulletin.sntba.com/xxfbcmses/search/bulletin.html?dates=300&categoryId=88&page=" + str(page) + "&showStatus=1"
    html = urllib.request.urlopen(url)
    bsObj = BeautifulSoup(html, "html.parser")
    for tr in bsObj.find("table", {"class": "table_text"}).findAll("tr"):
        if len(tr.findAll('td')) > 0:  # skip the header row, which has no <td> cells
            title = tr.findAll('td')[0].get_text().strip()
            publishTime = datetime.strptime(tr.findAll('td')[4].get_text().strip(), '%Y-%m-%d')
            link = tr.findAll('td')[0].find('a')['href'][20:-2]
            obj = {'title': title, 'publishTime': publishTime.strftime("%Y-%m-%d"), 'link': link}
            result2.append(obj)
print(result2)
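When the listing lives inside an iframe, the outer page only carries a pointer to it. A minimal sketch of pulling the iframe's real URL out with BeautifulSoup so it can be fetched directly (the HTML snippet here is made up for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical outer page: the data table sits inside the iframe, so the
# crawler must request the iframe's src, not the outer URL.
outer_html = '<html><body><iframe src="/xxfbcmses/search/bulletin.html?page=1"></iframe></body></html>'
soup = BeautifulSoup(outer_html, "html.parser")
iframe = soup.find("iframe")
print(iframe["src"])  # this is the URL to actually fetch
```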
3. Exporting CSV
csvFile = open('result.csv', 'wb')  # unicodecsv expects a binary-mode file
# 'r': read-only (default; raises an error if the file does not exist)
# 'w': write-only (creates the file if it does not exist)
# 'a': append to the end of the file (creates the file if it does not exist)
# 'r+': read and write (raises an error if the file does not exist)
csvWriter = unicodecsv.writer(csvFile, encoding="utf-8-sig")  # the BOM keeps Excel from garbling Chinese
result = []
# crawler code omitted here; it fills `result`
for o in result:
    title = o['title']
    publishTime = o['publishTime']
    link = o['link']
    csvWriter.writerow([title, publishTime, link])
csvFile.close()
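On Python 3 the same export works without the unicodecsv dependency: open the file in text mode with encoding="utf-8-sig" (the BOM is still what keeps Excel from garbling Chinese). A sketch with a made-up row:

```python
import csv

rows = [{'title': '测试公告', 'publishTime': '2018-03-27', 'link': 'https://example.com/doc/1'}]
# newline="" is required by the csv module so it controls line endings itself.
with open('result.csv', 'w', newline='', encoding='utf-8-sig') as f:
    writer = csv.writer(f)
    for o in rows:
        writer.writerow([o['title'], o['publishTime'], o['link']])
```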
4. Imports
#coding=utf-8
import urllib.request
import requests
import json
import time  # used to sleep a few seconds between requests so crawling looks more human
import sys
import unicodecsv
from datetime import datetime
from bs4 import BeautifulSoup  # parses the fetched HTML
Other:
Sites that require a CAPTCHA are still under investigation~~~