Predicting Bus Arrival Times with Python Data Processing (Part 1)

1. Data format

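id int record ID

type int record type: 41 = inter-station data, 42 = intermediate-station entry/exit data, 43 = terminal-station entry/exit data

route_id int route ID: 10454, 10069, 120881

bus_id varchar vehicle number

station_id varchar station number

lon decimal longitude

lat decimal latitude

speed decimal speed

direction decimal heading

gpsflag int GPS status: 0 = valid, 1 = invalid

updownflag int direction of travel: 0 = upstream, 1 = downstream

inoutflag int station entry/exit: 0 = entering, 1 = leaving

runningflag int operating status: 0 = in normal operation, 1 = out of operation

onlineflag int online status: 0 = online, 1 = offline

create_time timestamp GPS time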

There are fifteen fields in total, as shown in the screenshot below:
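As a minimal sketch of how such a file might be loaded, assuming it is tab-separated with no header row (the source code in section 4 reads it with `pd.read_table(path, header=None)` and refers to columns by position), the fifteen fields can be given names up front. The `COLUMNS` list, the `load_records` helper, and the two sample rows are illustrative assumptions, not part of the original code or data.

```python
import io
import pandas as pd

# Field names in the order listed above (fifteen in total).
COLUMNS = [
    "id", "type", "route_id", "bus_id", "station_id",
    "lon", "lat", "speed", "direction", "gpsflag",
    "updownflag", "inoutflag", "runningflag", "onlineflag", "create_time",
]

def load_records(source):
    """Read a headerless, tab-separated dump into a DataFrame with named columns."""
    return pd.read_csv(
        source, sep="\t", header=None, names=COLUMNS,
        dtype={"bus_id": str, "station_id": str},  # varchar fields stay strings
    )

# Two made-up rows, only to show the shape of the data.
sample = io.StringIO(
    "1\t42\t10454\tB001\tS01\t120.1\t30.2\t25.0\t90.0\t0\t0\t0\t0\t0\t2015-12-01 08:00:00\n"
    "2\t42\t10454\tB001\tS02\t120.2\t30.3\t22.0\t90.0\t0\t0\t1\t0\t0\t2015-12-01 08:03:10\n"
)
records = load_records(sample)
```

Named columns make later filters such as `records["gpsflag"] == 0` easier to read than positional ones such as `data[9] == 0`.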

2. Simple data cleaning

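First, drop the route ID column, since we are only processing a single route here. Then remove invalid records based on the operating status, the online status, and whether the GPS fix is valid.

Using the up/down flag, split the cleaned data into two parts, an upstream part and a downstream part.

Then group the upstream and downstream data by bus, which produces two lists. Each list holds the buses travelling upstream or downstream, and each element of a list contains the records of one bus at every station it reached during the data collection period.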
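The cleaning and grouping steps above can be sketched as follows. This is a hedged illustration that assumes the columns have already been given the field names from section 1. The `clean_and_split` helper and the sample rows are hypothetical.

```python
import pandas as pd

def clean_and_split(records):
    """Filter invalid rows, then return (up_list, down_list) of (bus_id, DataFrame) tuples."""
    valid = records[
        (records["onlineflag"] == 0)      # online
        & (records["runningflag"] == 0)   # in normal operation
        & (records["gpsflag"] == 0)       # valid GPS fix
    ]
    up = valid[valid["updownflag"] == 0]
    down = valid[valid["updownflag"] == 1]
    # Iterating a groupby yields (key, sub-frame) pairs, so list() collects them.
    return list(up.groupby("bus_id")), list(down.groupby("bus_id"))

# Made-up rows: one has an invalid GPS fix and one is offline.
records = pd.DataFrame({
    "bus_id": ["B001", "B001", "B002", "B002", "B003"],
    "station_id": ["S01", "S02", "S01", "S05", "S02"],
    "gpsflag": [0, 0, 0, 1, 0],
    "updownflag": [0, 0, 1, 1, 0],
    "runningflag": [0, 0, 0, 0, 0],
    "onlineflag": [0, 0, 0, 0, 1],
})
up_list, down_list = clean_and_split(records)
```

With this sample, only bus `B001` remains in the upstream list and only `B002` in the downstream list, because the other two rows are filtered out as invalid.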

3. Computing the time intervals

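Suppose we now have the records of a single bus. Computing the travel time between adjacent stations only requires type and inoutflag: keep the records where type is 42 (an intermediate station) and inoutflag is 0 (the bus is entering the station). From the records that satisfy both conditions we can compute the travel interval between every pair of consecutive stations. Finally, we strip the data down to just the station and the arrival time.

We want time intervals, but so far we only have arrival times. Using Python's time module, we convert each time string into a timestamp, compute the gap (time difference) between consecutive stations with a list, then store the gaps as a Series and insert it into the DataFrame.

Finally, because the data contains errors and GPS transmissions are prone to interference, we need to drop some obviously anomalous values. The source code in section 4 keeps only gaps strictly between 60 and 600 seconds, and sets a station's average to 0 when fewer than half of its records pass that filter.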
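The gap computation can also be written without an explicit list. The sketch below is an alternative to the `time.strptime` / `time.mktime` loop used in the source code in section 4, not the author's method: it parses the strings with `pd.to_datetime` and subtracts each row from the next one. The `add_gap` helper, the column names, and the sample timestamps are assumptions for illustration.

```python
import pandas as pd

def add_gap(arrivals):
    """Return a copy with a 'gap' column: seconds until the next record, 0 for the last row."""
    out = arrivals.copy()
    ts = pd.to_datetime(out["create_time"], format="%Y-%m-%d %H:%M:%S")
    # shift(-1) lines each row up with the timestamp of the following row.
    out["gap"] = (ts.shift(-1) - ts).dt.total_seconds().fillna(0)
    return out

# Made-up arrival records for one bus, already filtered to type 42 and inoutflag 0.
arrivals = pd.DataFrame({
    "station_id": ["S01", "S02", "S03"],
    "create_time": ["2015-12-01 08:00:00", "2015-12-01 08:03:10", "2015-12-01 08:07:40"],
})
with_gap = add_gap(arrivals)
```

As in the original code, the last row has no following record, so its gap is padded with 0 to keep the column the same length as the frame.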
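A minimal sketch of this filtering step, assuming a frame that already has a `gap` column in seconds. The 60 and 600 second bounds and the "at least half the records must survive" rule are taken from the source code in section 4. The `station_averages` helper, its parameter names, and the sample values are hypothetical.

```python
import pandas as pd

def station_averages(with_gap, low=60, high=600):
    """Average the plausible gaps per station; 0 when fewer than half the records are plausible."""
    averages = {}
    for station, group in with_gap.groupby("station_id"):
        plausible = group[(group["gap"] > low) & (group["gap"] < high)]
        if len(plausible) * 2 < len(group):
            averages[station] = 0  # too few trustworthy records for this station
        else:
            averages[station] = plausible["gap"].mean()
    return averages

# Made-up gaps: S01 has one outlier, S02 has mostly implausible values.
with_gap = pd.DataFrame({
    "station_id": ["S01", "S01", "S01", "S02", "S02", "S02"],
    "gap": [180.0, 200.0, 5000.0, 10.0, 0.0, 240.0],
})
averages = station_averages(with_gap)
```

Here `S01` keeps two of its three records and averages 190 seconds, while `S02` keeps only one of three and is therefore reported as 0.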

4. Source code

# -*- coding: utf-8 -*-
"""
Created on Tue Dec 15 19:51:52 2015

@author: Luyixiao
"""
import pandas as pd
import numpy as np
import time

def disData(path):
    # Read the tab-separated text file; columns are numbered 0..14 in the field order of section 1.
    data = pd.read_table(path, header=None)
    daVal = data.drop(2, axis=1)  # drop the unused route_id column
    daVal = daVal[daVal[13] == 0]  # onlineflag: keep online records
    daVal = daVal[daVal[12] == 0]  # runningflag: keep buses in normal operation
    daVal = daVal[daVal[9] == 0]  # gpsflag: keep valid GPS fixes
    # Drop lon, lat, speed, direction and the three flags already used for filtering.
    daVal = daVal.drop([5, 6, 7, 8, 9, 12, 13], axis=1)
    upRoad = daVal[daVal[10] == 0]  # updownflag 0: upstream records
    downRoad = daVal[daVal[10] == 1]  # updownflag 1: downstream records
    # Iterating a groupby yields (bus_id, DataFrame) tuples; list() collects them.
    upList = list(upRoad.groupby(3))  # group by bus_id
    downList = list(downRoad.groupby(3))
    return upList, downList

# Each list element is a (bus_id, DataFrame) tuple; pass the DataFrame part, e.g. upList[0][1], to timeGet.

def timeGet(da):
    inrec = da[da[11] == 0]  # inoutflag 0: ensure the bus is entering a station
    inrec = inrec[inrec[1] == 42]  # type 42: intermediate-station records
    # Keep only station_id (column 4) and create_time (column 14); copy so a column can be added safely.
    clr = inrec.drop([0, 1, 3, 10, 11], axis=1).copy()
    timl = []  # arrival times as Unix timestamps
    for cnt in range(0, len(clr)):
        timl.append(time.mktime(time.strptime(clr.iat[cnt, 1], '%Y-%m-%d %H:%M:%S')))
    gap = []
    for cnt in range(0, len(timl) - 1):
        gap.append(timl[cnt + 1] - timl[cnt])
    if timl:
        gap.append(0)  # the last record has no successor; pad with zero to match the length
    clr['gap'] = pd.Series(gap, index=clr.index)  # add the gap column to the data frame
    gd = clr.groupby(4)  # group by station
    ll = list(gd)  # (station_id, DataFrame) tuples; each station of this bus is averaged separately
    kk = {}  # dict storing "station": average_time_for_this_bus
    for station, grp in ll:
        temp = grp[grp['gap'] < 600]
        temp = temp[temp['gap'] > 60]
        if len(temp) * 2 < len(grp):
            ave = 0  # fewer than half of the records are plausible
        else:
            ave = temp['gap'].sum() / len(temp)
        kk[station] = ave
    return kk, gap

In short, iterating over the result of groupby and converting it to a list is a very handy technique.
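As a small illustration of that technique, with made-up data: iterating a pandas groupby yields `(key, DataFrame)` pairs, so `list()` turns it into a list of tuples and `dict()` turns it into a lookup by key. The frame and variable names below are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    "bus_id": ["B001", "B002", "B001", "B002"],
    "station_id": ["S01", "S01", "S02", "S02"],
})

# [(key, DataFrame), ...], sorted by key because groupby sorts by default.
groups = list(df.groupby("bus_id"))
first_key, first_frame = groups[0]

# A dict gives lookup by key instead of by position.
by_bus = dict(groups)
```

The list form suits positional access such as `groups[0][1]`, which is how the source code consumes it. The dict form may be more convenient when a specific bus has to be looked up by its ID.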

This article is also published on the blog “钱塘小甲子” (CSDN).


Original link: https://blog.csdn.net/weixin_42202078/article/details/112889578