Data extraction in Python

First, create a DataFrame (using random data):

import pandas as pd

import numpy as np

from datetime import datetime, timedelta

ab = pd.DataFrame()

ab["subjectID"] = np.random.randint(5, size=200)#random list of "subjects" from 0 to 4

ab["day_number"] = np.random.randint(50, size=200)  # random list of "dates" from 0 to 49

ab['real_date'] = ab.day_number.apply(lambda d: datetime(2018, 1, 1) + timedelta(days=d)) #to simulate real dates

ab["score1"] = np.random.randint(200, size=200)#meant to simulate one measurement from one subject

ab["score2"] = np.random.randint(400, size=200)#meant to simulate a second measurement

min_day = ab.real_date.min()

ab = ab.groupby(['subjectID', 'real_date']).sum()  # some subjects have more than one score per day; note that day_number is summed as well

print(ab.head(10))

                      day_number  score1  score2
subjectID real_date
0         2018-01-01           0     306     273
          2018-01-04           3      32      60
          2018-01-05           4      61     135
          2018-01-08          21     477     393
          2018-01-09           8      22     341
          2018-01-10           9     137      30
          2018-01-11          30     281     674
          2018-01-14          13     183     396
          2018-01-15          14      41     337
          2018-01-16          15      83      50
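In the output above, day_number is 21 for 2018-01-08 (which is day 7) because sum() adds up every column, including day_number, when a subject has several rows on the same day. If that is not wanted, one possible alternative is named aggregation. The following is only a sketch under assumptions: the names rng, raw and ab_alt are made up for illustration, and the seed is there only to make the run repeatable.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded only so the sketch is repeatable
raw = pd.DataFrame({
    "subjectID": rng.integers(5, size=200),
    "day_number": rng.integers(50, size=200),
    "score1": rng.integers(200, size=200),
    "score2": rng.integers(400, size=200),
})
raw["real_date"] = pd.Timestamp("2018-01-01") + pd.to_timedelta(raw["day_number"], unit="D")

# Sum the scores per subject and day, but keep day_number unchanged.
# All rows of one group share the same day_number, so "first" is enough.
ab_alt = raw.groupby(["subjectID", "real_date"]).agg(
    day_number=("day_number", "first"),
    score1=("score1", "sum"),
    score2=("score2", "sum"),
)
print(ab_alt.head(10))
```

With this variant, day_number always matches the offset of real_date from 2018-01-01, which matters later when the date is rebuilt from day_number.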

Then fill the days that have no data with the data from the next day that has some:

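(The code for this step is missing here; it produces the daily, backfilled DataFrame df that the next step uses.)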
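A minimal sketch of what this filling step could look like, not necessarily the author's code. It assumes ab is indexed by (subjectID, real_date) as above; the names rng, raw and pieces are hypothetical, and the loop with asfreq and bfill is just one way to get a row for every calendar day.

```python
import numpy as np
import pandas as pd

# Rebuild a frame shaped like `ab` so the sketch runs on its own.
rng = np.random.default_rng(0)
raw = pd.DataFrame({
    "subjectID": rng.integers(5, size=200),
    "day_number": rng.integers(50, size=200),
    "score1": rng.integers(200, size=200),
    "score2": rng.integers(400, size=200),
})
raw["real_date"] = pd.Timestamp("2018-01-01") + pd.to_timedelta(raw["day_number"], unit="D")
ab = raw.groupby(["subjectID", "real_date"]).sum()

pieces = {}
for subject, g in ab.groupby(level="subjectID"):
    # One row per calendar day between the subject's first and last date;
    # days without data become NaN.
    daily = g.droplevel("subjectID").asfreq("D")
    # Backfill: each empty day takes the values of the next day that has data.
    pieces[subject] = daily.bfill()

# Put the subjects back together with subjectID as the outer index level.
df = pd.concat(pieces, names=["subjectID"])
print(df.head(10))
```

The columns come out as floats because asfreq introduces NaN before the backfill. Each subject is filled separately so that values never leak from one subject into another.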

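Next, resample into 4-day periods (grouped by subject):

res = df.reset_index(level='subjectID').groupby('subjectID').resample('4D').first()  # group into 4-day periods and keep only the first value of each

res = res.drop(columns='subjectID', errors='ignore')  # the grouping column may or may not be present in the result, depending on the pandas version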

print(res.head(10))

                      day_number  score1  score2
subjectID real_date
0         2018-01-01           0     306     273
          2018-01-05           4      61     135
          2018-01-09           8      22     341
          2018-01-13          13     183     396
          2018-01-17          18      91      46
          2018-01-21          20      76     333
          2018-01-25          48     131     212
          2018-01-29          29      92      81
          2018-02-02          32     172      55
          2018-02-06          72      98     246
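To see how the 4-day bins are laid out, here is a toy example on ten consecutive days. The names days, toy and first_per_bin are made up for illustration.

```python
import pandas as pd

days = pd.date_range("2018-01-01", periods=10, freq="D")
toy = pd.DataFrame({"day_number": range(10)}, index=days)

# Bins are 4 days wide and start at the first date:
# [Jan 1, Jan 5), [Jan 5, Jan 9), [Jan 9, Jan 13)
first_per_bin = toy.resample("4D").first()
print(first_per_bin)
# day_number is 0, 4 and 8 for the bins starting on Jan 1, Jan 5 and Jan 9
```

The index of the result holds the start of each bin, not the date of the row that was kept. That is why the real_date values in the table above advance in steps of exactly four days even when day_number does not.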

Finally, reset the index and handle the case where a period of more than 4 days has no data:

res = res.reset_index('real_date', drop=True)  # real_date (the bin start) no longer carries meaning

res['real_date'] = res.day_number.apply(lambda d: min_day + timedelta(days=d))  # rebuild the actual date from day_number

res = res.drop(columns='day_number')

res = res.set_index('real_date', append=True)

res = res.groupby(level=['subjectID', 'real_date']).first() #regroups periods with no data for more than 4 days

print(res.head(10))

                      score1  score2
subjectID real_date
0         2018-01-01     306     273
          2018-01-05      61     135
          2018-01-09      22     341
          2018-01-14     183     396
          2018-01-19      91      46
          2018-01-21      76     333
          2018-01-30      92      81
          2018-02-02     172      55
          2018-02-10      40     218
          2018-02-15     110     112
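The last groupby matters when a gap is longer than 4 days: several consecutive bins are then backfilled from the same observation, so they end up with the same day_number and the same rebuilt date. A toy sketch with hand-written values for a single subject (the numbers are made up for illustration, and pd.to_timedelta is used here in place of the apply call):

```python
import pandas as pd

min_day = pd.Timestamp("2018-01-01")

# Toy resampling result: the bins starting on Jan 5 and Jan 9 were empty and
# were backfilled from the observation on day 12, which falls in the Jan 13 bin.
bin_starts = pd.to_datetime(["2018-01-01", "2018-01-05", "2018-01-09", "2018-01-13"])
res = pd.DataFrame(
    {"day_number": [0, 12, 12, 12], "score1": [306, 183, 183, 183]},
    index=pd.MultiIndex.from_product([[0], bin_starts], names=["subjectID", "real_date"]),
)

res = res.reset_index("real_date", drop=True)
res["real_date"] = min_day + pd.to_timedelta(res["day_number"], unit="D")
res = res.drop(columns="day_number").set_index("real_date", append=True)

# Three rows now share the index (0, 2018-01-13); first() keeps one of them.
res = res.groupby(level=["subjectID", "real_date"]).first()
print(res)
```

Only two rows remain, one for 2018-01-01 and one for 2018-01-13, so the long gap produces no repeated entries.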

This is a bit convoluted, but I think it is the best approach. I have not measured how efficient it is, but it does not seem too bad.


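Copyright notice: this is an original article by weixin_34543510, released under the CC 4.0 BY-SA license. When reposting, please include a link to the original source and this notice.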