An introduction to the groupby function in Python's pandas library

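groupby is one of the most powerful and frequently used features in pandas. It splits data into groups and applies an operation to each group. Common usage patterns and examples follow.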

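1. Basic usage: grouping by a single column

groupby is usually combined with an aggregation function such as sum, mean, or count. The example below groups by one column and sums two others: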

import pandas as pd

# Create sample data
data = {
    "Date": ["2021-01-01", "2021-01-01", "2021-01-02", "2021-01-02"],
    "ProductID": [1, 2, 1, 2],
    "Platform": ["A", "A", "B", "B"],
    "Sales": [100, 150, 200, 250],
    "Quantity": [10, 15, 20, 25],
}

df = pd.DataFrame(data)

# Group by Date and compute the total sales and quantity for each group
grouped = df.groupby("Date").agg({
    "Sales": "sum",
    "Quantity": "sum",
})

print(grouped)

Output:

            Sales  Quantity
Date                       
2021-01-01    250        25
2021-01-02    450        45
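
As a complement to the dictionary form of agg shown above, the sketch below shows two other common ways to aggregate: selecting a column from the grouped object and calling the aggregation method directly, and named aggregation, where each keyword names an output column. The output column names total_sales and mean_quantity are illustrative choices, not part of the original example.

```python
import pandas as pd

df = pd.DataFrame({
    "Date": ["2021-01-01", "2021-01-01", "2021-01-02", "2021-01-02"],
    "Sales": [100, 150, 200, 250],
    "Quantity": [10, 15, 20, 25],
})

# Select one column from the grouped object, then aggregate it directly.
sales_total = df.groupby("Date")["Sales"].sum()

# size() counts the rows in each group.
rows_per_day = df.groupby("Date").size()

# Named aggregation: output_name=(source_column, function).
# The output names here are illustrative.
summary = df.groupby("Date").agg(
    total_sales=("Sales", "sum"),
    mean_quantity=("Quantity", "mean"),
)

print(sales_total)
print(rows_per_day)
print(summary)
```

Named aggregation is handy when the same column needs several statistics, since each result gets an explicit, flat column name.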

2. Grouping by multiple columns

Pass a list of column names to group by more than one column, for example by Date and ProductID:

# Group by Date and ProductID and compute the total sales and quantity for each group
grouped = df.groupby(["Date", "ProductID"]).agg({
    "Sales": "sum",
    "Quantity": "sum",
})

print(grouped)

Output:

                      Sales  Quantity
Date       ProductID                 
2021-01-01 1            100        10
           2            150        15
2021-01-02 1            200        20
           2            250        25
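
Grouping by two columns produces a result with a two-level index (a MultiIndex). The sketch below, which rebuilds the same sample data so it runs on its own, shows one way to look up a single group with a tuple key and one way to pivot the inner level into columns with unstack. The variable names one_group and wide are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Date": ["2021-01-01", "2021-01-01", "2021-01-02", "2021-01-02"],
    "ProductID": [1, 2, 1, 2],
    "Sales": [100, 150, 200, 250],
    "Quantity": [10, 15, 20, 25],
})

grouped = df.groupby(["Date", "ProductID"]).agg({
    "Sales": "sum",
    "Quantity": "sum",
})

# A tuple with one value per index level selects a single group.
one_group = grouped.loc[("2021-01-01", 2)]

# unstack moves the ProductID level from the index into the columns.
wide = grouped["Sales"].unstack("ProductID")

print(one_group)
print(wide)
```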

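3. Using other aggregation functions

agg is not limited to sum. It accepts the names of other built-in aggregations and also custom callables. For example, to compute the average sales per group: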

# Group by Date and compute the mean sales and quantity for each group
grouped = df.groupby("Date").agg({
    "Sales": "mean",
    "Quantity": "mean",
})

print(grouped)

Output:

            Sales  Quantity
Date                       
2021-01-01  125.0      12.5
2021-01-02  225.0      22.5
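
The example above uses a built-in aggregation. For a genuinely custom aggregation, agg also accepts any callable that takes a Series and returns a single value. The following is a minimal sketch; value_range is a hypothetical helper written for this illustration, not a pandas function.

```python
import pandas as pd

df = pd.DataFrame({
    "Date": ["2021-01-01", "2021-01-01", "2021-01-02", "2021-01-02"],
    "Sales": [100, 150, 200, 250],
    "Quantity": [10, 15, 20, 25],
})


def value_range(series):
    """Hypothetical helper: spread between the largest and smallest value."""
    return series.max() - series.min()


custom = df.groupby("Date").agg({
    "Sales": value_range,                      # a named function
    "Quantity": lambda s: s.max() - s.min(),   # or an inline lambda
})

print(custom)
```

Custom callables are generally slower than the built-in string names such as "sum" or "mean", so they are best kept for logic that the built-ins do not cover.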

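4. Using the `apply` method

When an operation needs access to several columns of each group at once, use apply, which passes each group to your function as a DataFrame: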

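# Group by Date and apply a custom function to each group.
# Selecting the columns first keeps the grouping key out of the frames passed to the function.
grouped = df.groupby("Date")[["Sales", "Quantity"]].apply(
    lambda x: x["Sales"].sum() - x["Quantity"].sum()
)

print(grouped)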

Output:

Date
2021-01-01    225
2021-01-02    405
dtype: int64
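
For this particular calculation, apply is not strictly required. As a sketch of an alternative, the same numbers can be obtained by aggregating first and then subtracting the resulting columns, which avoids calling a Python function once per group. The names sums and diff are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Date": ["2021-01-01", "2021-01-01", "2021-01-02", "2021-01-02"],
    "Sales": [100, 150, 200, 250],
    "Quantity": [10, 15, 20, 25],
})

# Aggregate both columns once, then do ordinary column arithmetic.
sums = df.groupby("Date")[["Sales", "Quantity"]].sum()
diff = sums["Sales"] - sums["Quantity"]

print(diff)
```

apply remains the more general tool when the per-group logic cannot be expressed as aggregations followed by column arithmetic.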

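5. Resetting the index

After a groupby aggregation, the grouping column becomes the index of the result. To turn it back into an ordinary column and get a default integer index, call reset_index: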

# Group by Date, compute the totals for each group, then reset the index
grouped = df.groupby("Date").agg({
    "Sales": "sum",
    "Quantity": "sum",
}).reset_index()

print(grouped)

Output:

         Date  Sales  Quantity
0  2021-01-01    250        25
1  2021-01-02    450        45

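6. Applying the result back to the original data

A groupby aggregation returns a new DataFrame and leaves the original unchanged. There is no inplace option.

To replace the original, assign the result back to the same variable.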

import pandas as pd

# Create sample data
data = {
    "Date": ["2021-01-01", "2021-01-01", "2021-01-02", "2021-01-02"],
    "ProductID": [1, 2, 1, 2],
    "Platform": ["A", "A", "B", "B"],
    "Sales": [100, 150, 200, 250],
    "Quantity": [10, 15, 20, 25],
}

df = pd.DataFrame(data)

print(df)

# Overwrite df with the aggregated result
df = df.groupby("Date").agg({
    "Sales": "sum",
    "Quantity": "sum",
}).reset_index()

print(df)

Output:

         Date  ProductID Platform  Sales  Quantity
0  2021-01-01          1        A    100        10
1  2021-01-01          2        A    150        15
2  2021-01-02          1        B    200        20
3  2021-01-02          2        B    250        25
         Date  Sales  Quantity
0  2021-01-01    250        25
1  2021-01-02    450        45
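
Overwriting the DataFrame discards the row-level detail. If the goal is instead to keep every original row and attach the group-level value to it, one option is transform, which returns a result aligned with the original rows. The sketch below assumes that use case; the column names DailySales and ShareOfDay are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Date": ["2021-01-01", "2021-01-01", "2021-01-02", "2021-01-02"],
    "ProductID": [1, 2, 1, 2],
    "Sales": [100, 150, 200, 250],
})

# transform("sum") returns one value per original row (the group's total),
# so it can be assigned directly as a new column.
df["DailySales"] = df.groupby("Date")["Sales"].transform("sum")

# With the group total on every row, per-row ratios are simple arithmetic.
df["ShareOfDay"] = df["Sales"] / df["DailySales"]

print(df)
```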

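7. The `as_index` parameter

The as_index parameter controls whether the grouping keys become the index of the result. The default is as_index=True, which places the grouping keys in the index. With as_index=False, the grouping keys are returned as ordinary columns.

With as_index=True, the result can be queried and sliced directly by grouping key.

With as_index=False, the result has a plain integer index, which avoids extra index handling when the DataFrame is processed further.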

# Group by Date and compute the totals, keeping Date as an ordinary column
grouped = df.groupby("Date", as_index=False).agg({
    "Sales": "sum",
    "Quantity": "sum",
})

print(grouped)

Output:

         Date  Sales  Quantity
0  2021-01-01    250        25
1  2021-01-02    450        45

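Which setting to choose depends on what comes next. If you need to work with the grouping keys as columns, as_index=False is recommended. If you want to look rows up by grouping key, keep the default as_index=True.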
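
To make the trade-off concrete, here is a small sketch of how the same lookup reads under each setting. The variable names indexed and flat are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Date": ["2021-01-01", "2021-01-01", "2021-01-02", "2021-01-02"],
    "Sales": [100, 150, 200, 250],
    "Quantity": [10, 15, 20, 25],
})

# Default as_index=True: Date is the index, so .loc can look a group up by key.
indexed = df.groupby("Date").agg({"Sales": "sum", "Quantity": "sum"})
sales_by_label = indexed.loc["2021-01-02", "Sales"]

# as_index=False: Date is an ordinary column, so filter with a boolean mask.
flat = df.groupby("Date", as_index=False).agg({"Sales": "sum", "Quantity": "sum"})
sales_by_mask = flat.loc[flat["Date"] == "2021-01-02", "Sales"].iloc[0]

print(sales_by_label, sales_by_mask)
```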

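8. `as_index=False` versus `reset_index()`

as_index=False and reset_index() can produce the same result in many cases, but they are used at different points and differ in detail.

Passing as_index=False to groupby prevents the grouping keys from becoming the index in the first place and keeps them as ordinary columns in the result.

If the groupby has already been run without as_index=False, calling reset_index() on the result converts the index into columns.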

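Differences:

Timing: as_index=False is specified when groupby is called, so the keys never become the index, whereas reset_index() converts the index into columns after the groupby operation has finished.
Conciseness: if you already know you do not need the grouping keys as the index, as_index=False is shorter because it does the job in one step, while reset_index() requires an additional call.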
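
As a quick check of the comparison above, the sketch below builds the same aggregation both ways on the sample data and compares the results with DataFrame.equals. The variable names via_param and via_reset are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Date": ["2021-01-01", "2021-01-01", "2021-01-02", "2021-01-02"],
    "Sales": [100, 150, 200, 250],
    "Quantity": [10, 15, 20, 25],
})

# One step: the grouping key never becomes the index.
via_param = df.groupby("Date", as_index=False).agg({"Sales": "sum", "Quantity": "sum"})

# Two steps: aggregate with Date as the index, then move it back to a column.
via_reset = df.groupby("Date").agg({"Sales": "sum", "Quantity": "sum"}).reset_index()

# For this simple aggregation the two results hold the same data.
print(via_param.equals(via_reset))
```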