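I have been learning Python recently, and since it ties in with my previous work, I used some spare time to try scraping tracking events from 17track.
Approach: scrape a static (saved) page, which requires some knowledge of front-end HTML.
Third-party package: pyquery
Without further ado, here is the code: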
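#!/usr/bin/env python3
# coding=utf-8
from pyquery import PyQuery as pq


def get_time(d1):
    """Return the date part (first 10 characters) of every <time> element."""
    dates = []
    for data in d1('time'):
        msg = d1(data).text()
        # print(msg[0:11], len(msg))
        dates.append(msg[0:10])
    return dates


def get_message(d1):
    """Return the text of every <p> element."""
    messages = []
    for data in d1('p'):
        messages.append(d1(data).text())
    return messages


def main():
    d = pq(filename="18.html")
    d1 = d(".ori-block")  # select the HTML block whose class is ori-block
    d2 = d('.text-uppercase').text()  # text of the element whose class is text-uppercase
    # d2 is a str (checked with print(type(d2)))
    times = get_time(d1)  # parse once instead of on every loop iteration
    messages = get_message(d1)
    i = 0
    while i < len(times):
        print(d2 + "/" + times[i] + "/" + messages[i])
        i += 1


if __name__ == "__main__":
    main()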
The scraped result is as follows:
1Z3Y18900337899118/2018-07-05/LAS VEGAS, NV, US, DELIVERED
1Z3Y18900337899118/2018-07-05/Las Vegas, NV, United States, Destination Scan
1Z3Y18900337899118/2018-07-04/Las Vegas, NV, United States, Arrival Scan
1Z3Y18900337899118/2018-07-04/Departure Scan
1Z3Y18900337899118/2018-07-04/Arrival Scan
1Z3Y18900337899118/2018-07-04/Ontario, CA, United States, Departure Scan
1Z3Y18900337899118/2018-07-04/Origin Scan
1Z3Y18900337899118/2018-06-30/United States, Order Processed: Ready for UPS
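To make the selectors in the script above concrete, here is a minimal sketch that runs without a saved page. It swaps pyquery for the standard-library `html.parser` so that it needs no third-party package, which means it is not the method used in the script. The HTML snippet is an assumed, heavily reduced stand-in for the result page: only the `ori-block` and `text-uppercase` class names and the `time`/`p` tags come from the script, while the surrounding tags and the clock times are invented for illustration. `TrackParser` and `parse_tracking` are hypothetical names.

```python
from html.parser import HTMLParser

# Assumed, simplified markup; the real 17track page is more complex.
SAMPLE_HTML = """
<div class="text-uppercase">1Z3Y18900337899118</div>
<div class="ori-block">
  <dl><dd><time>2018-07-05 14:02</time><p>LAS VEGAS, NV, US, DELIVERED</p></dd>
      <dd><time>2018-07-04 09:30</time><p>Origin Scan</p></dd></dl>
</div>
"""


class TrackParser(HTMLParser):
    """Collect the tracking number plus <time>/<p> text inside .ori-block."""

    def __init__(self):
        super().__init__()
        self.number = ""
        self.dates = []
        self.messages = []
        self._block_depth = 0  # > 0 while inside the ori-block element
        self._capture = None   # "number", "time", "p" or None

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self._block_depth:
            # Naive depth count: assumes every tag in the block is closed
            # (void tags such as <br> would throw it off).
            self._block_depth += 1
            if tag in ("time", "p"):
                self._capture = tag
        elif "ori-block" in classes:
            self._block_depth = 1
        elif "text-uppercase" in classes:
            self._capture = "number"

    def handle_endtag(self, tag):
        if self._block_depth:
            self._block_depth -= 1
        self._capture = None

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self._capture == "number":
            self.number = text
        elif self._capture == "time":
            self.dates.append(text[:10])  # keep only YYYY-MM-DD
        elif self._capture == "p":
            self.messages.append(text)


def parse_tracking(html):
    """Return 'number/date/message' strings, pairing dates with messages in order."""
    parser = TrackParser()
    parser.feed(html)
    return [f"{parser.number}/{d}/{m}"
            for d, m in zip(parser.dates, parser.messages)]


for line in parse_tracking(SAMPLE_HTML):
    print(line)
```

Like the script, this pairs the n-th date with the n-th message purely by position, so it relies on every event having both a `time` and a `p` element.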
ps:
The 17track tracking request URL:
POST request URL: https://t.17track.net/restapi/track
Request parameters:
{"guid":"7a0d6ce750964b20b7ab6207a1639e16",# equals g
"data":[{"num":"LY372939201CN"},{"num":"LY372947242CN"},{"num":"LY373619583CN"}]}
The hard part is working out how the guid value is produced.
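Assuming a valid guid were somehow in hand, the request described above could be assembled roughly as follows. This is only a sketch using the standard library: it builds the request but does not send it, the `Content-Type` header is an assumption (the post does not say how the body is encoded on the wire), and `build_track_request` is a hypothetical helper name. The guid is the sample value from the parameters above and is unlikely to still be accepted.

```python
import json
import urllib.request

TRACK_URL = "https://t.17track.net/restapi/track"


def build_track_request(guid, numbers):
    """Build (but do not send) a POST request carrying the guid and tracking numbers."""
    payload = {"guid": guid, "data": [{"num": n} for n in numbers]}
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        TRACK_URL,
        data=body,
        headers={"Content-Type": "application/json"},  # assumed, not confirmed
        method="POST",
    )


req = build_track_request(
    "7a0d6ce750964b20b7ab6207a1639e16",  # sample guid from the post
    ["LY372939201CN", "LY372947242CN", "LY373619583CN"],
)
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
# Sending would be urllib.request.urlopen(req); it is left out on purpose.
```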
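In the page's JS code there is a spot related to guid generation:
this.defaults.nowNums = a (this dictionary contains a guid)
Readers who are comfortable with JS can try to work it out from there.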