First, create a DataFrame (using random data):
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
ab = pd.DataFrame()
ab["subjectID"] = np.random.randint(5, size=200)  # random "subjects" from 0 to 4
ab["day_number"] = np.random.randint(50, size=200)  # random day offsets from 0 to 49
ab['real_date'] = ab.day_number.apply(lambda d: datetime(2018, 1, 1) + timedelta(days=d))  # to simulate real dates
ab["score1"] = np.random.randint(200, size=200)  # simulates one measurement per subject
ab["score2"] = np.random.randint(400, size=200)  # simulates a second measurement
min_day = ab.real_date.min()
ab = ab.groupby(['subjectID', 'real_date']).sum()  # some subjects have more than one row per day, so add them up (this also sums day_number for those rows)
print(ab.head(10))
                      day_number  score1  score2
subjectID real_date
0         2018-01-01           0     306     273
          2018-01-04           3      32      60
          2018-01-05           4      61     135
          2018-01-08          21     477     393
          2018-01-09           8      22     341
          2018-01-10           9     137      30
          2018-01-11          30     281     674
          2018-01-14          13     183     396
          2018-01-15          14      41     337
          2018-01-16          15      83      50
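In the output above, 2018-01-08 shows day_number 21 rather than 7, because sum() adds up day_number along with the scores when a subject has several rows on the same day. If that is not wanted, one possible variant (a sketch, not part of the original answer) is to aggregate the columns separately. The seeded generator and the names `raw` and `daily` are assumptions made only for this illustration.

```python
import numpy as np
import pandas as pd

# Fixed seed so the run is repeatable (an addition for this sketch).
rng = np.random.default_rng(0)
raw = pd.DataFrame({
    "subjectID": rng.integers(5, size=200),
    "day_number": rng.integers(50, size=200),
    "score1": rng.integers(200, size=200),
    "score2": rng.integers(400, size=200),
})
raw["real_date"] = pd.Timestamp("2018-01-01") + pd.to_timedelta(raw["day_number"], unit="D")

# Sum the scores per subject and day, but keep day_number as it is
# (all rows in one group share the same day_number, so "first" is enough).
daily = raw.groupby(["subjectID", "real_date"]).agg(
    day_number=("day_number", "first"),
    score1=("score1", "sum"),
    score2=("score2", "sum"),
)
print(daily.head(10))
```

With this variant, day_number always equals the offset of real_date from 2018-01-01, which the last step of the answer relies on.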
Then fill the days that have no data with the next day's data:
[code block missing from the source]
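The code for this fill step, which defines the `df` used afterwards, is not preserved here. The following is only a guess at what such a step could look like, under the assumption that each subject is reindexed to a daily calendar and the gaps are back-filled from the next available day. The helper `fill_missing_days` and the small `ab_demo` frame are hypothetical names and values introduced for illustration.

```python
import pandas as pd

# Small hand-made stand-in for `ab` (hypothetical values), indexed the same way.
ab_demo = pd.DataFrame(
    {
        "subjectID": [0, 0, 1],
        "real_date": pd.to_datetime(["2018-01-01", "2018-01-04", "2018-01-02"]),
        "day_number": [0, 3, 1],
        "score1": [306, 32, 10],
        "score2": [273, 60, 20],
    }
).set_index(["subjectID", "real_date"])


def fill_missing_days(frame):
    """Give every subject one row per calendar day, back-filling the gaps."""
    pieces = {}
    for subject, part in frame.groupby(level="subjectID"):
        # One row per day between the subject's first and last date; missing days are NaN.
        per_day = part.droplevel("subjectID").resample("D").first()
        # Fill each missing day with the next day that has data.
        pieces[subject] = per_day.bfill()
    return pd.concat(pieces, names=["subjectID", "real_date"])


df = fill_missing_days(ab_demo)
print(df)
```

Filling subject by subject keeps one subject's values from leaking into another's gaps.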
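Next, resample into 4-day periods (grouped by subject):
res = df.reset_index(level='subjectID').groupby('subjectID').resample('4D').first()  # group into 4-day periods and keep only the first value
res = res.drop(columns='subjectID', errors='ignore')  # subjectID is already in the index; some pandas versions do not keep it as a column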
print(res.head(10))
                      day_number  score1  score2
subjectID real_date
0         2018-01-01           0     306     273
          2018-01-05           4      61     135
          2018-01-09           8      22     341
          2018-01-13          13     183     396
          2018-01-17          18      91      46
          2018-01-21          20      76     333
          2018-01-25          48     131     212
          2018-01-29          29      92      81
          2018-02-02          32     172      55
          2018-02-06          72      98     246
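To see why the period labels no longer match the actual measurement dates, here is a small single-subject sketch. The values are made up for illustration, and the back-fill shown is an assumption about how the daily gaps were filled.

```python
import pandas as pd

# Three observations for one subject, with gaps of 5 and 8 days (hypothetical values).
obs = pd.DataFrame(
    {"day_number": [0, 5, 13], "score1": [306, 61, 183]},
    index=pd.to_datetime(["2018-01-01", "2018-01-06", "2018-01-14"]),
)

# One row per day; each missing day takes the values of the next day that has data.
daily = obs.resample("D").first().bfill()

# 4-day bins anchored at the first date; keep the first row of each bin.
periods = daily.resample("4D").first()
print(periods)
```

The bin labelled 2018-01-05 carries the row measured on 2018-01-06, and the bins labelled 2018-01-09 and 2018-01-13 both carry the row measured on 2018-01-14, since that gap is longer than 4 days.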
Finally, reset the index and handle periods of more than 4 days with no data:
res = res.reset_index('real_date', drop=True)  # the resampled real_date labels are no longer meaningful
res['real_date'] = res.day_number.apply(lambda d: min_day + timedelta(days=d))  # rebuild the actual date from day_number
res = res.drop(columns='day_number')
res = res.set_index('real_date', append=True)
res = res.groupby(level=['subjectID', 'real_date']).first()  # merge the duplicate rows produced by gaps longer than 4 days
print(res.head(10))
                      score1  score2
subjectID real_date
0         2018-01-01     306     273
          2018-01-05      61     135
          2018-01-09      22     341
          2018-01-14     183     396
          2018-01-19      91      46
          2018-01-21      76     333
          2018-01-30      92      81
          2018-02-02     172      55
          2018-02-10      40     218
          2018-02-15     110     112
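The effect of this last step can be checked on a tiny example. This is a sketch with made-up values: `periods` stands in for `res` before the step, and its last two rows are the same back-filled measurement repeated in two consecutive 4-day bins.

```python
import pandas as pd

min_day = pd.Timestamp("2018-01-01")

# Hypothetical 4-day periods for one subject; the last two rows are duplicates.
periods = pd.DataFrame(
    {"day_number": [0.0, 5.0, 13.0, 13.0], "score1": [306.0, 61.0, 183.0, 183.0]},
    index=pd.MultiIndex.from_product(
        [[0], pd.to_datetime(["2018-01-01", "2018-01-05", "2018-01-09", "2018-01-13"])],
        names=["subjectID", "real_date"],
    ),
)

out = periods.reset_index("real_date", drop=True)  # drop the bin labels
# Rebuild the actual measurement date from day_number.
out["real_date"] = min_day + pd.to_timedelta(out["day_number"].to_numpy(), unit="D")
out = out.drop(columns="day_number").set_index("real_date", append=True)
# Identical (subjectID, real_date) pairs collapse into a single row.
out = out.groupby(level=["subjectID", "real_date"]).first()
print(out)
```

The four periods shrink to three rows, dated 2018-01-01, 2018-01-06 and 2018-01-14. Note that the date rebuild assumes `min_day` corresponds to day_number 0.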
It is a bit complicated, but I think this is the best way to do it. I have no idea about the efficiency, though it does not seem too bad.