Groupby

DataFrame.groupby(by=None, axis=0, level=None, as_index=True, sort=True, group_keys=True, squeeze=<object object>, observed=False, dropna=True)

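A groupby operation involves some combination of splitting the object, applying a function, and combining the results. It can be used to group large amounts of data and run computations on those groups.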
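The split, apply, and combine steps can be spelled out by hand to see what groupby does in one call. This is a minimal sketch on hypothetical toy data (the column names and values are made up for illustration), not how pandas implements it internally.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration.
df = pd.DataFrame({
    "Country": ["China", "India", "China", "India"],
    "Income": [10000, 5000, 8000, 5002],
})

# Split: iterating over a GroupBy object yields (key, sub-DataFrame) pairs.
# Apply: compute one value for each piece.
manual = {}
for country, piece in df.groupby("Country"):
    manual[country] = piece["Income"].mean()

# Combine: collect the per-group values into a single Series.
manual = pd.Series(manual, name="Income")

# The same computation written as a single groupby expression.
direct = df.groupby("Country")["Income"].mean()

print(manual)
print(direct)
```

Both Series hold the mean income per country; the second form is the one normally used.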

Parameters

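by: the key(s) to group by, such as a column label, a list of labels, a mapping, or a function. groupby returns a DataFrameGroupBy object, which does not show the grouped data when printed.
axis: {0 or 'index', 1 or 'columns'}, default 0. Whether to split along rows (0) or columns (1).
level: if the axis is a MultiIndex, group by the given level or levels (an int, a level name, or a sequence of these). Default None.
as_index: for aggregated output, return an object with the group labels as the index. Only relevant for DataFrame input. as_index=False is effectively "SQL-style" grouped output.
sort: whether to sort the group keys, default True. Setting it to False can improve performance; it does not change the order of rows within each group.
group_keys: when calling apply, add the group keys to the index to identify the pieces.
squeeze: reduce the dimensionality of the return type if possible. Deprecated since pandas 1.1.
observed: only applies when a grouper is a Categorical. If True, show only the observed category values; if False, show all category values.
dropna: True or False, default True. If True, rows whose group key is NA are dropped from the result; if False, NA is treated as a group key of its own.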
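The effect of as_index, sort, and dropna is easiest to see side by side. The sketch below uses hypothetical data with one missing group key; it assumes pandas 1.1 or later, where the dropna parameter is available.

```python
import pandas as pd

# Hypothetical data; the third row has a missing group key.
df = pd.DataFrame({
    "Country": ["India", "China", None, "China"],
    "Income": [5000, 10000, 7000, 8000],
})

# Defaults: group labels become the (sorted) index, and the row with a
# missing key is left out.
default = df.groupby("Country")["Income"].sum()

# as_index=False: the key stays an ordinary column ("SQL-style" output).
flat = df.groupby("Country", as_index=False)["Income"].sum()

# sort=False: groups appear in order of first occurrence.
unsorted = df.groupby("Country", sort=False)["Income"].sum()

# dropna=False: the missing key forms a group of its own.
with_na = df.groupby("Country", dropna=False)["Income"].sum()

for name, result in [("default", default), ("flat", flat),
                     ("unsorted", unsorted), ("with_na", with_na)]:
    print(name)
    print(result)
```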

1.Transform

Age	Country	Income
5000	China	10000
4321	China	10000
1234	India	5000
4010	India	5002
250	America	40000
250	Japan	50000
4500	China	8000
4321	India	5000
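The examples in this article operate on a DataFrame named df that holds the table above. The article does not show how df was created, so the following is one possible construction. The column order here follows the table, whereas the printed outputs in the article list Income before Age, so the original df may have been built with a different column order.

```python
import pandas as pd

# One possible way to build the sample table; the construction itself is an
# assumption, only the values come from the table.
df = pd.DataFrame({
    "Age": [5000, 4321, 1234, 4010, 250, 250, 4500, 4321],
    "Country": ["China", "China", "India", "India",
                "America", "Japan", "China", "India"],
    "Income": [10000, 10000, 5000, 5002, 40000, 50000, 8000, 5000],
})

print(df)
```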

Example code:

df_transform = df.groupby('Country').transform('min')
print(df_transform)
    Output:
   Income   Age
0    8000  4321
1    8000  4321
2    5000  1234
3    5000  1234
4   40000   250
5   50000   250
6    8000  4321
7    5000  1234

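The result keeps the same number of rows and the same index as df (the grouping column Country is dropped); each value is replaced by the minimum of its group.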
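Because the result of transform lines up row by row with the input, a common use is to attach a per-group statistic back onto the original rows. A minimal sketch, assuming the sample table is rebuilt as df; the new column names MinIncomeInCountry and IncomeAboveMin are made up for this example.

```python
import pandas as pd

# Sample table from this article (construction is an assumption).
df = pd.DataFrame({
    "Age": [5000, 4321, 1234, 4010, 250, 250, 4500, 4321],
    "Country": ["China", "China", "India", "India",
                "America", "Japan", "China", "India"],
    "Income": [10000, 10000, 5000, 5002, 40000, 50000, 8000, 5000],
})

# transform returns one value per original row, aligned on df's index,
# so it can be assigned straight back as a new column.
df["MinIncomeInCountry"] = df.groupby("Country")["Income"].transform("min")

# Row-level comparison against the group statistic.
df["IncomeAboveMin"] = df["Income"] - df["MinIncomeInCountry"]

print(df)
```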

2.Agg (aggregation)

Example code:

    df_agg = df.groupby('Country').agg(['min', 'mean', 'max'])
    print(df_agg)
    Output:
       Age                    Income                     
              min         mean   max    min          mean    max
    Country                                                     
    America   250   250.000000   250  40000  40000.000000  40000
    China    4321  4607.000000  5000   8000   9333.333333  10000
    India    1234  3188.333333  4321   5000   5000.666667   5002
    Japan     250   250.000000   250  50000  50000.000000  50000

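To choose which function is applied to each column, pass a dict of the form {column name: function}. To rename the output columns, use named aggregation of the form new_name=(column name, function).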
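Both forms can be sketched on the sample table as follows. The output names max_age and mean_income are hypothetical, chosen only for this example; named aggregation assumes pandas 0.25 or later.

```python
import pandas as pd

# Sample table from this article (construction is an assumption).
df = pd.DataFrame({
    "Age": [5000, 4321, 1234, 4010, 250, 250, 4500, 4321],
    "Country": ["China", "China", "India", "India",
                "America", "Japan", "China", "India"],
    "Income": [10000, 10000, 5000, 5002, 40000, 50000, 8000, 5000],
})

# Dict form: one function per column; the column names are kept as they are.
by_col = df.groupby("Country").agg({"Age": "max", "Income": "mean"})

# Named aggregation: each keyword becomes an output column name.
named = df.groupby("Country").agg(
    max_age=("Age", "max"),
    mean_income=("Income", "mean"),
)

print(by_col)
print(named)
```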

3.Apply

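Example code:

# apply passes each group's sub-DataFrame to the function; here it takes the column-wise minimum.
df_apply = df.groupby('Country').apply(lambda g: g.min())
print(df_apply)
    Output: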
         Country  Income   Age
Country                       
America  America   40000   250
China      China    8000  4321
India      India    5000  1234
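Compared with agg, the result has one extra column: the grouping key Country appears as an ordinary column as well as in the index.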
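Since apply hands the whole sub-DataFrame to the function, it can combine several columns in one computation, which a per-column aggregation cannot do. The sketch below is an illustration only: the income-to-age ratio is a made-up statistic, and the columns are selected explicitly so the function never sees the grouping column.

```python
import pandas as pd

# Sample table from this article (construction is an assumption).
df = pd.DataFrame({
    "Age": [5000, 4321, 1234, 4010, 250, 250, 4500, 4321],
    "Country": ["China", "China", "India", "India",
                "America", "Japan", "China", "India"],
    "Income": [10000, 10000, 5000, 5002, 40000, 50000, 8000, 5000],
})

# Each call receives a sub-DataFrame with the Age and Income columns of one
# group and returns a single number, so the combined result is a Series
# indexed by Country.
ratio = df.groupby("Country")[["Age", "Income"]].apply(
    lambda g: g["Income"].sum() / g["Age"].sum()
)

print(ratio)
```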

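Some issues I ran into later:

train.groupby("User_id")["Coupon_id"]

and

train.groupby(["User_id","Coupon_id"])

gave the same result in my case, and the latter seemed to run somewhat slower. In general the two are not equivalent: the first groups by User_id only and then selects the Coupon_id column, while the second forms one group per (User_id, Coupon_id) pair.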
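How the two calls group rows can be checked on a small frame. The train data below is hypothetical (the real dataset is not shown in this article), and the Amount column is an invented extra column.

```python
import pandas as pd

# Hypothetical stand-in for the train DataFrame.
train = pd.DataFrame({
    "User_id": [1, 1, 1, 2],
    "Coupon_id": [10, 10, 20, 10],
    "Amount": [5, 6, 7, 8],
})

# Groups by user only, then looks at the Coupon_id column within each group.
per_user = train.groupby("User_id")["Coupon_id"]

# Groups by every distinct (user, coupon) combination.
per_pair = train.groupby(["User_id", "Coupon_id"])

print(per_user.ngroups, per_pair.ngroups)

# Number of coupon records per user.
coupons_per_user = per_user.count()

# Number of rows per (user, coupon) pair.
rows_per_pair = per_pair.size()

print(coupons_per_user)
print(rows_per_pair)
```

Whether the two give the same final answer depends on the operation applied afterwards, so it is worth checking on a small sample before relying on either form.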

Original link: https://blog.csdn.net/weixin_44261347/article/details/109146088