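Introduction

When analyzing data, it is important to know the exact type of each column; otherwise you can run into puzzling errors or wrong results partway through the analysis.
When working with pandas you will sooner or later need to convert between data types. This article covers the basic pandas data types (dtype) and how to convert between them.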

Pandas Data Type

Pandas dtype	Python type	NumPy type	Usage
object	str or mixed	string, unicode, mixed types	Text or mixed numeric and non-numeric values
int64	int	int_, int8, int16, int32, int64, uint8, uint16, uint32, uint64	Integer numbers
float64	float	float_, float16, float32, float64	Floating point numbers
bool	bool	bool_	True/False values
datetime64[ns]	NA	datetime64[ns]	Date and time values
timedelta64[ns]	NA	timedelta64[ns]	Differences between two datetimes
category	NA	NA	Finite list of text values
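To make the table concrete, here is a minimal sketch that builds one small Series per dtype and prints what pandas reports. The sample values and the `examples` dictionary are made up for illustration; the exact datetime and timedelta resolution that gets printed may differ between pandas versions.

```python
import pandas as pd

# Hypothetical sample values, one Series per row of the dtype table.
examples = {
    "object": pd.Series(["a", 1, 2.5], dtype="object"),
    "int64": pd.Series([1, 2, 3], dtype="int64"),
    "float64": pd.Series([1.0, 2.5], dtype="float64"),
    "bool": pd.Series([True, False], dtype="bool"),
    "datetime64": pd.to_datetime(pd.Series(["2019-02-02", "2019-03-03"])),
    "timedelta64": pd.to_timedelta(pd.Series(["1 days", "2 hours"])),
    "category": pd.Series(["Y", "N", "Y"], dtype="category"),
}

# Print the dtype pandas actually assigned to each Series.
for label, series in examples.items():
    print(f"{label:>12} -> {series.dtype}")
```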

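Why dtype matters

Check the data types before doing any further analysis with pandas.
A wrong data type can cause errors or, worse, silently wrong results.

This article uses the following CSV file as its example. Note the last row, where Views and Sold contain the non-numeric placeholder 无 ("none"):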

# data_type.csv
Sku,Views,Month,Day,Year,Sold,Reviews,Active
212039,20,2,2,2019,10,2,Y
212038,21,2,2,2018,10,2,Y
212037,22,2,2,2019,10,2,Y
212036,23,2,2,2019,10,2,Y
212035,24,2,2,2019,10,2,Y
212034,25,2,2,2019,10,2,Y
212033,26,2,2,2019,10,2,Y
212032,27,2,2,2019,10,2,Y
212031,28,2,2,2019,10,2,N
212030,29,2,2,2019,10,2,N
212039,20,3,3,2019,100,50,Y
212038,21,3,3,2019,90,48,Y
212037,22,3,3,2019,80,46,Y
212036,23,3,3,2019,70,44,Y
212035,无,3,3,2019,无,0,Y

import pandas as pd
import numpy as np
df = pd.read_csv("../datas/data_type.csv")

df

	Sku	Views	Month	Day	Year	Sold	Reviews	Active
0	212039	20	2	2	2019	10	2	Y
1	212038	21	2	2	2018	10	2	Y
2	212037	22	2	2	2019	10	2	Y
3	212036	23	2	2	2019	10	2	Y
4	212035	24	2	2	2019	10	2	Y
5	212034	25	2	2	2019	10	2	Y
6	212033	26	2	2	2019	10	2	Y
7	212032	27	2	2	2019	10	2	Y
8	212031	28	2	2	2019	10	2	N
9	212030	29	2	2	2019	10	2	N
10	212039	20	3	3	2019	100	50	Y
11	212038	21	3	3	2019	90	48	Y
12	212037	22	3	3	2019	80	46	Y
13	212036	23	3	3	2019	70	44	Y
14	212035	无	3	3	2019	无	0	Y

df.dtypes

Sku         int64
Views      object
Month       int64
Day         int64
Year        int64
Sold       object
Reviews     int64
Active     object
dtype: object
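The dtypes above can be reproduced without the file on disk. The sketch below reads an abbreviated, hypothetical copy of the CSV from a string, and also shows one possible alternative that the article itself does not use: telling read_csv up front, through its na_values parameter, that 无 means "missing", so the affected column is parsed as numeric.

```python
import io

import pandas as pd

# Abbreviated, hypothetical copy of data_type.csv (only a few rows and columns).
csv_text = """Sku,Views,Sold,Active
212039,20,10,Y
212031,28,10,N
212035,无,无,Y
"""

# Default parsing: one non-numeric value turns the whole column into text.
sample = pd.read_csv(io.StringIO(csv_text))
print(sample.dtypes)

# Alternative: treat the placeholder as a missing value while reading.
clean = pd.read_csv(io.StringIO(csv_text), na_values=["无"])
print(clean.dtypes)
```

With na_values the placeholder becomes NaN, which forces the column to a floating point dtype because NaN cannot be stored in a plain int64 column.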

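I. astype and apply

This section introduces two functions, astype and apply. For details, run help(df['Active'].astype).

astype: converts the values to the specified pandas data type
apply: calls a function on each value and returns the results as a new Series

First, convert the Active column to bool and see what happens: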
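The difference between the two can be seen on a tiny example. This is only a sketch: the `views` Series and its values are hypothetical and are not taken from the article's CSV.

```python
import pandas as pd

# Hypothetical column of numeric strings.
views = pd.Series(["20", "21", "22"])

# astype: convert the whole column to another dtype in one step.
as_int = views.astype("int64")

# apply: run an arbitrary Python function on each value.
doubled = views.apply(lambda v: int(v) * 2)

print(as_int.dtype, as_int.tolist())
print(doubled.tolist())
```

astype is the natural choice when every value converts cleanly; apply is useful when the conversion needs custom logic per value.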

df['Active'].astype('bool')

0     True
1     True
2     True
3     True
4     True
5     True
6     True
7     True
8     True
9     True
10    True
11    True
12    True
13    True
14    True
Name: Active, dtype: bool

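Every value is True, including rows 8 and 9, where we expected False. This is because astype('bool') uses Python truthiness, and any non-empty string, 'N' included, is truthy.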
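The all-True result follows from how Python converts strings to bool. A minimal sketch with a hypothetical three-value Series (the empty string is added only to show the one case that comes out False):

```python
import pandas as pd

# Python truthiness: only the empty string is falsy.
print(bool("Y"), bool("N"), bool(""))  # True True False

# The same rule applies element by element to an object column.
active = pd.Series(["Y", "N", ""], dtype="object")
as_bool = active.astype("bool")
print(as_bool.tolist())
```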

So how do we get the expected result?

Option 1: write a conversion function
Option 2: use a lambda
Option 3: use np.where

Option 1

# Option 1
def convert_bool(val):
    """
    Convert the string value to bool
     - if Y, then return True
     - if N, then return False
    """

    if val == 'Y':
        return True
    return False

df['Active'].apply(convert_bool)

0      True
1      True
2      True
3      True
4      True
5      True
6      True
7      True
8     False
9     False
10     True
11     True
12     True
13     True
14     True
Name: Active, dtype: bool

Option 2

# Option 2
df["Active"].apply(lambda item: item == 'Y')

0      True
1      True
2      True
3      True
4      True
5      True
6      True
7      True
8     False
9     False
10     True
11     True
12     True
13     True
14     True
Name: Active, dtype: bool

Option 3

# Option 3
np.where(df["Active"] == "Y", True, False)
# To write the result back to the DataFrame:
# df['Active'] = np.where(df["Active"] == "Y", True, False)

array([ True,  True,  True,  True,  True,  True,  True,  True, False,
       False,  True,  True,  True,  True,  True])
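Two further variants, not part of the original article, are sketched below under the assumption that the column may contain values other than Y and N. The `active` Series and the "?" value are hypothetical.

```python
import pandas as pd

# Hypothetical column; "?" stands for an unexpected value.
active = pd.Series(["Y", "N", "Y", "?"])

# A plain comparison already yields a bool Series; anything that is not "Y" becomes False.
by_comparison = active == "Y"

# map with a dict leaves unmapped values as NaN, so bad input stays visible.
by_map = active.map({"Y": True, "N": False})

print(by_comparison.tolist())
print(by_map.tolist())
```

The comparison is the shortest form and returns a Series rather than a NumPy array. The map variant trades a clean bool dtype for the ability to spot values that were neither Y nor N.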

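II. Which SKU sold the most units in 2019

Step 1: total the units sold in 2019 for each SKU
Step 2: pick the SKU with the largest total

1. Using pivot_table

Use the pivot_table function, with Sku and Year as the index and Sold as the values to aggregate.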

# Keep only the 2019 rows; .copy() returns an independent DataFrame that is safe to modify later
newdf = df[df["Year"] == 2019].copy()
pd.pivot_table(newdf, index=['Sku','Year'], values=['Sold'], aggfunc=np.sum)

		Sold
Sku	Year
212030	2019	10
212031	2019	10
212032	2019	10
212033	2019	10
212034	2019	10
212035	2019	10无
212036	2019	1070
212037	2019	1080
212038	2019	90
212039	2019	10100

The Sold totals are not what we expected: they look like string concatenation. Let's find out what happened.

The first thing to check is the data types:

newdf.dtypes

Sku         int64
Views      object
Month       int64
Day         int64
Year        int64
Sold       object
Reviews     int64
Active     object
dtype: object

Sold has dtype object rather than an integer type, so np.sum concatenates the strings instead of adding numbers.
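The concatenation is easy to reproduce in isolation. In this sketch the two Series are hypothetical and hold the same two values, once as text and once as numbers:

```python
import pandas as pd

# The same two values, stored as strings and as integers.
sold_text = pd.Series(["10", "100"], dtype="object")
sold_num = pd.Series([10, 100])

print(sold_text.sum())  # '10100' (strings are joined)
print(sold_num.sum())   # 110 (numbers are added)
```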

So can we simply convert it to int?

newdf['Sold'].astype(int)
# This raises:
# ValueError: invalid literal for int() with base 10: '无'

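Unsurprisingly, it fails. We need to clean the data first and get rid of the invalid values.

A handy function for this is pd.to_numeric:

pd.to_numeric(arg, errors='coerce', downcast=None); see help(pd.to_numeric) for details.
With errors='coerce', any value that cannot be parsed is set to NaN (the default, errors='raise', raises an exception instead).

# Coerce unparseable values to NaN, then replace NaN with 0
pd.to_numeric(newdf['Sold'], errors='coerce').fillna(0)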

0      10.0
2      10.0
3      10.0
4      10.0
5      10.0
6      10.0
7      10.0
8      10.0
9      10.0
10    100.0
11     90.0
12     80.0
13     70.0
14      0.0
Name: Sold, dtype: float64
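The result above is float64 because NaN forces a floating point dtype. If integers are preferred, two possible follow-ups are sketched here on a hypothetical three-value Series; neither is used in the original article.

```python
import pandas as pd

# Hypothetical Sold values, including the non-numeric placeholder.
sold = pd.Series(["10", "100", "无"])

# Step 1: parse, turning anything unparseable into NaN (float64).
coerced = pd.to_numeric(sold, errors="coerce")

# Variant A: fill the gaps with 0, then convert to a plain integer dtype.
as_int = coerced.fillna(0).astype("int64")

# Variant B: keep the gap as <NA> by using the nullable integer dtype.
nullable = coerced.astype("Int64")

print(as_int.tolist())
print(nullable.isna().tolist())
```

Filling with 0 is fine for totals, as in this article, but it also hides the fact that a value was missing. The nullable Int64 dtype keeps that information while still behaving like integers in sums.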

# Overwrite newdf['Sold'] with the cleaned values
# The '无' in the Sold column of row 14 (Sku 212035) becomes 0.0
newdf['Sold'] = pd.to_numeric(newdf['Sold'], errors='coerce').fillna(0)
newdf

	Sku	Views	Month	Day	Year	Sold	Reviews	Active
0	212039	20	2	2	2019	10.0	2	Y
2	212037	22	2	2	2019	10.0	2	Y
3	212036	23	2	2	2019	10.0	2	Y
4	212035	24	2	2	2019	10.0	2	Y
5	212034	25	2	2	2019	10.0	2	Y
6	212033	26	2	2	2019	10.0	2	Y
7	212032	27	2	2	2019	10.0	2	Y
8	212031	28	2	2	2019	10.0	2	N
9	212030	29	2	2	2019	10.0	2	N
10	212039	20	3	3	2019	100.0	50	Y
11	212038	21	3	3	2019	90.0	48	Y
12	212037	22	3	3	2019	80.0	46	Y
13	212036	23	3	3	2019	70.0	44	Y
14	212035	无	3	3	2019	0.0	0	Y

Run pivot_table again:

frame = pd.pivot_table(newdf, index=['Sku'], values=['Sold'], aggfunc=[np.sum])
frame

	sum
	Sold
Sku
212030	10.0
212031	10.0
212032	10.0
212033	10.0
212034	10.0
212035	10.0
212036	80.0
212037	90.0
212038	90.0
212039	110.0

Get the maximum

# Option 1
max_sold_nums = frame[('sum','Sold')].max()
# Index label (Sku) of the maximum
max_sold_idx = frame[('sum','Sold')].idxmax()
# Select that row
max_sold_infos = frame.loc[max_sold_idx]
print('Max sold numbers: \n', max_sold_nums)
print('Max sold sku details: \n', max_sold_infos)

Max sold numbers: 
 110.0
Max sold sku details: 
 sum  Sold    110.0
Name: 212039, dtype: float64

# Option 2
# Flatten the column MultiIndex with stack
frame.columns

MultiIndex([('sum', 'Sold')],
           )

frame.stack().reset_index()

	Sku	level_1	sum
0	212030	Sold	10.0
1	212031	Sold	10.0
2	212032	Sold	10.0
3	212033	Sold	10.0
4	212034	Sold	10.0
5	212035	Sold	10.0
6	212036	Sold	80.0
7	212037	Sold	90.0
8	212038	Sold	90.0
9	212039	Sold	110.0

single_frame = frame.stack().reset_index()
max_sold_nums = single_frame['sum'].max()
max_sold_idx = single_frame['sum'].idxmax()
max_sold_infos = single_frame.loc[max_sold_idx]
print('Max sold numbers: \n', max_sold_nums)
print('Max sold sku details: \n', max_sold_infos)

Max sold numbers: 
 110.0
Max sold sku details: 
 Sku        212039
level_1      Sold
sum           110
Name: 9, dtype: object
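The column MultiIndex appears because aggfunc was passed as a list. As a possible simplification, passing a single aggregation name and a single values column gives flat columns, so no stack is needed. The sketch below uses a small hypothetical `sales` frame rather than the article's newdf.

```python
import pandas as pd

# Hypothetical cleaned sales data: two SKUs, two rows each.
sales = pd.DataFrame({
    "Sku": [212039, 212038, 212039, 212038],
    "Sold": [10.0, 10.0, 100.0, 90.0],
})

# A single aggfunc (not a list) and a single values column keep the columns flat.
flat = pd.pivot_table(sales, index="Sku", values="Sold", aggfunc="sum")

best_sku = flat["Sold"].idxmax()   # index label of the largest total
best_total = flat["Sold"].max()    # the largest total itself

print(list(flat.columns))
print(best_sku, best_total)
```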

2. Using groupby

# Total units sold per SKU, computed once and reused
sold_by_sku = newdf.groupby(['Sku'])['Sold'].sum()
max_sold_nums = sold_by_sku.max()
max_sold_idx = sold_by_sku.idxmax()

print('Max sold numbers: \n', max_sold_nums)
print('Max sold sku: \n', max_sold_idx)

Max sold numbers: 
 110.0
Max sold sku: 
 212039
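idxmax returns only the first label when several SKUs share the maximum. When a ranking is more useful than a single winner, Series.nlargest is one option. A minimal sketch on a hypothetical `sales` frame:

```python
import pandas as pd

# Hypothetical cleaned sales data.
sales = pd.DataFrame({
    "Sku": [212039, 212038, 212037, 212039, 212038, 212037],
    "Sold": [10.0, 10.0, 10.0, 100.0, 90.0, 80.0],
})

# Total units sold per SKU.
sold_by_sku = sales.groupby("Sku")["Sold"].sum()

# The two best-selling SKUs, largest first.
top_two = sold_by_sku.nlargest(2)
print(top_two)
```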

Summary

The pandas data types and how to convert between them
How to use max, idxmax, and loc
Basic use of pivot_table
Basic use of groupby

Original article: https://blog.csdn.net/keithsoul/article/details/110941335