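1. Creation
Creating from a dictionary
The dictionary keys become the column names, and the length of the values (not of the dictionary itself) sets the number of rows. Scalar values are broadcast to that length, so a key may map either to a full-length sequence or to a single scalar. List-like values whose lengths disagree raise an error; a list of length 1 is not broadcast.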
In [9]: df2 = pd.DataFrame({'A': 1.,
...: 'B': pd.Timestamp('20130102'),
...: 'C': pd.Series(1, index=list(range(4)), dtype='float32'),
...: 'D': np.array([3] * 4, dtype='int32'),
...: 'E': pd.Categorical(["test", "train", "test", "train"]),
...: 'F': 'foo'})
...:
In [10]: df2
Out[10]:
A B C D E F
0 1.0 2013-01-02 1.0 3 test foo
1 1.0 2013-01-02 1.0 3 train foo
2 1.0 2013-01-02 1.0 3 test foo
3 1.0 2013-01-02 1.0 3 train foo
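The broadcasting rule described above can be checked with a minimal sketch. The column names `x` and `y` are made up for illustration, and the sketch assumes a reasonably recent pandas in which mismatched list lengths raise `ValueError`.

```python
import pandas as pd

# A scalar is broadcast to the length of the other columns.
ok = pd.DataFrame({"x": 1.0, "y": [10, 20, 30]})
print(ok.shape)

# A list of length 1 is NOT broadcast: list-like values must share one length.
try:
    pd.DataFrame({"x": [1.0], "y": [10, 20, 30]})
    mismatch_error = None
except ValueError as ex:
    mismatch_error = ex
    print(repr(ex))
```

Only true scalars (numbers, strings, a single Timestamp) are stretched; wrapping the value in a list turns it into a column of length 1.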
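Creating from a list of dicts
The length of the list is the number of records. Each element of the list (here a dictionary) is one record, and the dictionary keys are the column names. Values that share a key form one column, regardless of the order of the keys within each dictionary.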
from pandas import DataFrame
d = [{"f1": 1, "f2": 2, "f3": 3},
{"f2": 12, "f1": 14, "f3": 16},
{"f3": 25, "f2": 24, "f1": 26},
{"f1": 35, "f2": 34, "f3": 36}]
df = DataFrame(d)
print(df)
Output
f1 f2 f3
0 1 2 3
1 14 12 16
2 26 24 25
3 35 34 36
Creating from a NumPy array
import numpy as np
df = DataFrame(np.random.randn(6, 4), columns=list('ABCD'))
print(df)
Output
A B C D
0 0.039461 0.774348 -0.007067 0.738565
1 -0.142427 -0.287318 0.743472 -0.609328
2 -0.731498 0.095589 -0.664986 0.078787
3 0.583649 0.036846 0.050926 0.911483
4 -0.822104 0.260254 1.887518 -1.561972
5 0.391799 -0.392966 0.681349 -1.643266
Creating from a native Python list
d = [[1, 2, 3, 4],
[11, 12, 13, 14],
[21, 22, 23, 24]]
df = DataFrame(d, columns=list('ABCD'))
print(df)
Output:
A B C D
0 1 2 3 4
1 11 12 13 14
2 21 22 23 24
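2. Selecting by row and by column
- Selecting multiple rows or columns (with a slice or a bracketed list; in this mode even a single row or column behaves like a multi-row or multi-column selection)
  - The result is always a DataFrame.
  - A slice cannot be wrapped in another pair of brackets; inside brackets the items must be listed one by one. For example, origin_df.iloc[:, [1,2,3]] works, but origin_df.iloc[:, [1:3]] is a syntax error. The same holds for row selection.
  - Slicing by column name or index label (loc) is closed on both ends: both endpoints are included.
  - Slicing by position (iloc) is half-open: the right endpoint is excluded.
  - The values attribute returns a two-dimensional numpy.ndarray.
- Selecting a single row or column: the result type depends on how the selection is written
  - With an extra pair of brackets around the row or column (for example origin_df[['A']]), the result is a DataFrame and values returns a two-dimensional array.
  - Otherwise (for example origin_df['A']) the result is a Series and .values returns a one-dimensional numpy.ndarray.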
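The slicing and return-type rules in the list above can be condensed into a short sketch. It reuses the same 6 x 4 frame as the examples in this section; the variable names `by_label` and `by_position` are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(24).reshape(6, 4),
                  columns=list("ABCD"), index=list("abcdef"))

by_label = df.loc["b":"d", "B":"C"]   # labels: both endpoints are included
by_position = df.iloc[1:3, 1:2]       # positions: the right endpoint is excluded

print(by_label.shape)      # 3 rows (b, c, d) x 2 columns (B, C)
print(by_position.shape)   # 2 rows (b, c) x 1 column (B)

# A bare label gives a Series; the same label inside a list gives a DataFrame.
print(type(df["A"]).__name__)
print(type(df[["A"]]).__name__)
```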
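Does the selection return a copy?
- In the experiments in this article, every selection made with a bracketed list returned a copy (a slice cannot be wrapped in brackets), so modifying the result did not affect the source.
- Selections made without the extra brackets (a single label or a slice) did not return a copy, and modifying the result changed the source. Enumerating several labels always requires the brackets.
- This behaviour depends on the pandas version: older releases may emit SettingWithCopyWarning here, and with Copy-on-Write enabled every selection behaves like a copy. Call .copy() when an independent object is needed.
drop does not modify the source unless inplace=True is passed.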
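Because the view-versus-copy outcome varies between pandas versions, a version-independent way to get the behaviour described above is to copy explicitly and to write to the source only through the source itself. The following is a minimal sketch under that assumption, with a small made-up frame.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(3, 4),
                  columns=list("ABCD"), index=list("xyz"))

# An explicit copy is independent of the source in every pandas version.
sub = df.loc[:, "A":"B"].copy()
sub.iloc[0, 0] = -1
print(df.loc["x", "A"])    # still 0

# drop returns a new object by default; the source keeps all its columns.
dropped = df.drop(columns=["D"])
print(list(dropped.columns), list(df.columns))

# To change the source on purpose, assign through df itself.
df.loc["x", "A"] = 100
```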
Return types
Selecting a single column
import numpy as np
from pandas import DataFrame

origin_df = DataFrame(np.arange(24).reshape(6, 4),
                      columns=list('ABCD'), index=list('abcdef'))
print("single column: {}\nsingle column in brackets: {}\nloc single column: {}\nloc single column in brackets: {}\n\
iloc single column: {}\niloc single column in brackets: {}".format(
    type(origin_df['B']), type(origin_df[['B']]),
    type(origin_df.loc[:, 'B']), type(origin_df.loc[:, ['B']]),
    type(origin_df.iloc[:, 1]), type(origin_df.iloc[:, [1]])
))
print("=" * 20)
try:
    print(origin_df.loc['B'])  # KeyError: with a single argument, loc looks up a row label
except KeyError as ex:
    print(repr(ex))
Output
single column: <class 'pandas.core.series.Series'>
single column in brackets: <class 'pandas.core.frame.DataFrame'>
loc single column: <class 'pandas.core.series.Series'>
loc single column in brackets: <class 'pandas.core.frame.DataFrame'>
iloc single column: <class 'pandas.core.series.Series'>
iloc single column in brackets: <class 'pandas.core.frame.DataFrame'>
====================
KeyError('B')
Selecting a single row
origin_df = DataFrame(np.arange(24).reshape(6, 4),
                      columns=list('ABCD'), index=list('abcdef'))
print("loc single row: {}\nloc single row in brackets: {}\niloc single row: {}\niloc single row in brackets: {}".format(
    type(origin_df.loc['a']), type(origin_df.loc[['d'], :]),
    type(origin_df.iloc[0, :]), type(origin_df.iloc[[1], :])
))
print(origin_df.loc['b'])   # short form
print(origin_df.iloc[0])    # short form
print("=" * 20)
try:
    print(origin_df['b'])  # KeyError: plain [] with a single label looks up a column
except KeyError as ex:
    print(repr(ex))
Output
loc single row: <class 'pandas.core.series.Series'>
loc single row in brackets: <class 'pandas.core.frame.DataFrame'>
iloc single row: <class 'pandas.core.series.Series'>
iloc single row in brackets: <class 'pandas.core.frame.DataFrame'>
A    4
B    5
C    6
D    7
Name: b, dtype: int64
A    0
B    1
C    2
D    3
Name: a, dtype: int64
====================
KeyError('b')
Selecting multiple columns
origin_df = DataFrame(np.arange(24).reshape(6, 4),
                      columns=list('ABCD'), index=list('abcdef'))
# Note: origin_df['B':'C'] slices rows by label, not columns, so it is an empty DataFrame here.
print("plain [] label slice: {}\nmultiple columns in brackets: {}\nloc multiple columns: {}\nloc multiple columns in brackets: {}\n\
iloc multiple columns: {}\niloc multiple columns in brackets: {}".format(
    type(origin_df['B':'C']), type(origin_df[['B', "D"]]),
    type(origin_df.loc[:, 'B':"D"]), type(origin_df.loc[:, ['B', "D"]]),
    type(origin_df.iloc[:, 1:3]), type(origin_df.iloc[:, [1, 3]])
))
print(origin_df.loc[:, 'B':"D"])
print("=" * 20)
try:
    print(origin_df.loc['B':"C"])  # no error: this slices row labels and matches nothing
except KeyError as ex:
    print(repr(ex))
Output
plain [] label slice: <class 'pandas.core.frame.DataFrame'>
multiple columns in brackets: <class 'pandas.core.frame.DataFrame'>
loc multiple columns: <class 'pandas.core.frame.DataFrame'>
loc multiple columns in brackets: <class 'pandas.core.frame.DataFrame'>
iloc multiple columns: <class 'pandas.core.frame.DataFrame'>
iloc multiple columns in brackets: <class 'pandas.core.frame.DataFrame'>
    B   C   D
a   1   2   3
b   5   6   7
c   9  10  11
d  13  14  15
e  17  18  19
f  21  22  23
====================
Empty DataFrame
Columns: [A, B, C, D]
Index: []
Selecting multiple rows
origin_df = DataFrame(np.arange(24).reshape(6, 4),
                      columns=list('ABCD'), index=list('abcdef'))
print("loc multiple rows: {}\nloc multiple rows in brackets: {}\niloc multiple rows: {}\niloc multiple rows in brackets: {}".format(
    type(origin_df.loc['a':'b', :]), type(origin_df.loc[['d', 'e'], :]),
    type(origin_df.iloc[0:1, :]), type(origin_df.iloc[[1, 3], :])
))
print(origin_df.loc['b':'d'])   # short form
print(origin_df.iloc[0:2])      # short form
print("=" * 20)
print(origin_df['b': 'd'])  # works: a slice in plain [] selects rows, unlike a single label
Output
loc multiple rows: <class 'pandas.core.frame.DataFrame'>
loc multiple rows in brackets: <class 'pandas.core.frame.DataFrame'>
iloc multiple rows: <class 'pandas.core.frame.DataFrame'>
iloc multiple rows in brackets: <class 'pandas.core.frame.DataFrame'>
    A   B   C   D
b   4   5   6   7
c   8   9  10  11
d  12  13  14  15
   A  B  C  D
a  0  1  2  3
b  4  5  6  7
====================
    A   B   C   D
b   4   5   6   7
c   8   9  10  11
d  12  13  14  15
Do row and column selections return a copy?
origin_df = DataFrame([[c + str(i) for c in "ABCDEF"]
                       for i in range(6)],
                      columns=list("ABCDEF"), index=list("uvwxyz"))
print(origin_df)
Output
    A   B   C   D   E   F
u  A0  B0  C0  D0  E0  F0
v  A1  B1  C1  D1  E1  F1
w  A2  B2  C2  D2  E2  F2
x  A3  B3  C3  D3  E3  F3
y  A4  B4  C4  D4  E4  F4
z  A5  B5  C5  D5  E5  F5
Selecting a single column
origin_df = DataFrame([[c + str(i) for c in "ABCDEF"]
                       for i in range(6)],
                      columns=list("ABCDEF"), index=list("uvwxyz"))
col_1 = origin_df['A']
col_1.iloc[0] = "single col"
col_2 = origin_df[['B']]
col_2.iloc[0, 0] = "single col []"
col_3 = origin_df.loc[:, "C"]
col_3.iloc[0] = "loc col"
col_4 = origin_df.loc[:, ["D"]]
col_4.iloc[0, 0] = "loc col []"
col_5 = origin_df.iloc[:, 4]
col_5.iloc[0] = "iloc col"
col_6 = origin_df.iloc[:, [5]]
col_6.iloc[0, 0] = "iloc col []"
print(origin_df)
Output
            A   B        C   D         E   F
u  single col  B0  loc col  D0  iloc col  F0
v          A1  B1       C1  D1        E1  F1
w          A2  B2       C2  D2        E2  F2
x          A3  B3       C3  D3        E3  F3
y          A4  B4       C4  D4        E4  F4
z          A5  B5       C5  D5        E5  F5
Selecting a single row
origin_df = DataFrame([[c + str(i) for c in "ABCDEF"]
                       for i in range(6)],
                      columns=list("ABCDEF"), index=list("uvwxyz"))
row_3 = origin_df.loc['w', :]
row_3.iloc[0] = "loc row"
row_4 = origin_df.loc[['x'], :]
row_4.iloc[0, 0] = "loc row []"
row_5 = origin_df.iloc[4, :]
row_5.iloc[0] = "iloc row"
row_6 = origin_df.iloc[[5], :]
row_6.iloc[0, 0] = "iloc row []"
print(origin_df)
Output
          A   B   C   D   E   F
u        A0  B0  C0  D0  E0  F0
v        A1  B1  C1  D1  E1  F1
w   loc row  B2  C2  D2  E2  F2
x        A3  B3  C3  D3  E3  F3
y  iloc row  B4  C4  D4  E4  F4
z        A5  B5  C5  D5  E5  F5
Selecting multiple columns
origin_df = DataFrame([[c + str(i) for c in "ABCDEF"]
                       for i in range(6)],
                      columns=list("ABCDEF"), index=list("uvwxyz"))
# origin_df['A':'B'] slices rows, so it selects nothing here:
# Empty DataFrame, Columns: [A, B, C, D, E, F], Index: []
# col_1 = origin_df['A':'B']
col_2 = origin_df[['A', "B"]]
col_2.iloc[1, 0] = "cols []"
col_3 = origin_df.loc[:, "C":"D"]
col_3.iloc[2, 0] = "loc cols"
print("label slice col_3\n{}".format(col_3))
col_4 = origin_df.loc[:, ["C", "D"]]
col_4.iloc[3, 0] = "loc cols []"
col_5 = origin_df.iloc[:, 4:5]
col_5.iloc[4, 0] = "iloc cols"
print("=" * 20)
print("position slice col_5\n{}".format(col_5))
col_6 = origin_df.iloc[:, [4, 5]]
col_6.iloc[5, 0] = "iloc cols []"
print("=" * 20)
print(origin_df)
Output
label slice col_3
          C   D
u        C0  D0
v        C1  D1
w  loc cols  D2
x        C3  D3
y        C4  D4
z        C5  D5
====================
position slice col_5
           E
u         E0
v         E1
w         E2
x         E3
y  iloc cols
z         E5
====================
    A   B         C   D          E   F
u  A0  B0        C0  D0         E0  F0
v  A1  B1        C1  D1         E1  F1
w  A2  B2  loc cols  D2         E2  F2
x  A3  B3        C3  D3         E3  F3
y  A4  B4        C4  D4  iloc cols  F4
z  A5  B5        C5  D5         E5  F5
Selecting multiple rows
origin_df = DataFrame([[c + str(i) for c in "ABCDEF"]
                       for i in range(6)],
                      columns=list("ABCDEF"), index=list("uvwxyz"))
row_3 = origin_df.loc['w':'x', :]
row_3.iloc[1, 2] = "loc rows"
print("label slice row_3\n{}".format(row_3))
row_4 = origin_df.loc[['w', 'x'], :]
row_4.iloc[1, 3] = "loc rows []"
row_5 = origin_df.iloc[4:5, :]
row_5.iloc[0, 4] = "iloc rows"
print("=" * 20)
print("position slice row_5\n{}".format(row_5))
row_6 = origin_df.iloc[[4, 5], :]
row_6.iloc[1, 5] = "iloc rows []"
print("=" * 20)
print(origin_df)
Output
label slice row_3
    A   B         C   D   E   F
w  A2  B2        C2  D2  E2  F2
x  A3  B3  loc rows  D3  E3  F3
====================
position slice row_5
    A   B   C   D          E   F
y  A4  B4  C4  D4  iloc rows  F4
====================
    A   B         C   D          E   F
u  A0  B0        C0  D0         E0  F0
v  A1  B1        C1  D1         E1  F1
w  A2  B2        C2  D2         E2  F2
x  A3  B3  loc rows  D3         E3  F3
y  A4  B4        C4  D4  iloc rows  F4
z  A5  B5        C5  D5         E5  F5
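3. The values attribute
- Selecting multiple rows or columns (with a slice or a bracketed list; in this mode even a single row or column behaves like a multi-row or multi-column selection)
  - values returns a two-dimensional numpy.ndarray.
- Selecting a single row or column: the result depends on how the selection is written
  - With an extra pair of brackets around the row or column (for example origin_df[['A']]), values returns a two-dimensional array.
  - Otherwise values returns a one-dimensional numpy.ndarray.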
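The dimensionality rules above can be verified directly from the array shapes. This is a small sketch on a numeric frame; `to_numpy()` is included only as the method form of the same conversion, and the variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(24).reshape(6, 4),
                  columns=list("ABCD"), index=list("abcdef"))

one_dim = df["A"].values             # bare label -> Series -> 1-D array
two_dim = df[["A"]].values           # list of labels -> DataFrame -> 2-D array
sliced = df.loc[:, "A":"A"].values   # a slice keeps the DataFrame, so 2-D as well

print(one_dim.shape, two_dim.shape, sliced.shape)

# to_numpy() returns the same data as .values for this frame.
print(np.array_equal(df.to_numpy(), df.values))
```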
origin_df = DataFrame([[c + str(i) for c in "ABCDEF"]
                       for i in range(6)],
                      columns=list("ABCDEF"), index=list("uvwxyz"))
col_1 = origin_df['A']
print("col_1\n{}".format(col_1.values))
col_2 = origin_df[['B']]
print("col_2\n{}".format(col_2.values))
row_3 = origin_df.loc['w', :]
print("row_3\n{}".format(row_3.values))
row_4 = origin_df.loc[['x'], :]
print("row_4\n{}".format(row_4.values))
col_2 = origin_df[['A', "B"]]
print("col_2\n{}".format(col_2.values))
col_3 = origin_df.loc[:, "C":"C"]
print("col_3: even a slice that selects one column behaves like a multi-column selection\n{}".format(col_3.values))
col_3 = origin_df.loc[:, "C":"D"]  # reconstructed from the recorded output; the original line is missing
print(col_3.values)
row_5 = origin_df.iloc[4:5, :]
print("row_5\n{}".format(row_5.values))
row_6 = origin_df.iloc[[4, 5], :]
print("row_6\n{}".format(row_6.values))
Output
col_1
['A0' 'A1' 'A2' 'A3' 'A4' 'A5']
col_2
[['B0']
 ['B1']
 ['B2']
 ['B3']
 ['B4']
 ['B5']]
row_3
['A2' 'B2' 'C2' 'D2' 'E2' 'F2']
row_4
[['A3' 'B3' 'C3' 'D3' 'E3' 'F3']]
col_2
[['A0' 'B0']
 ['A1' 'B1']
 ['A2' 'B2']
 ['A3' 'B3']
 ['A4' 'B4']
 ['A5' 'B5']]
col_3: even a slice that selects one column behaves like a multi-column selection
[['C0']
 ['C1']
 ['C2']
 ['C3']
 ['C4']
 ['C5']]
[['C0' 'D0']
 ['C1' 'D1']
 ['C2' 'D2']
 ['C3' 'D3']
 ['C4' 'D4']
 ['C5' 'D5']]
row_5
[['A4' 'B4' 'C4' 'D4' 'E4' 'F4']]
row_6
[['A4' 'B4' 'C4' 'D4' 'E4' 'F4']
 ['A5' 'B5' 'C5' 'D5' 'E5' 'F5']]
4. Boolean indexing
dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))
print(df)
A B C D
2013-01-01 0.161101 0.364128 1.735634 -0.835255
2013-01-02 1.164170 0.384188 0.302318 -0.293224
2013-01-03 1.116850 1.469352 0.867080 -0.420124
2013-01-04 0.952359 1.056309 -2.857191 0.668887
2013-01-05 -0.097658 -0.794298 1.387195 -0.897870
2013-01-06 -0.270472 -1.841921 2.008927 1.140431
Sorting: returns a copy
df.sort_values(by='B')
                   A         B         C         D
2013-01-06 -0.270472 -1.841921  2.008927  1.140431
2013-01-05 -0.097658 -0.794298  1.387195 -0.897870
2013-01-01  0.161101  0.364128  1.735634 -0.835255
2013-01-02  1.164170  0.384188  0.302318 -0.293224
2013-01-04  0.952359  1.056309 -2.857191  0.668887
2013-01-03  1.116850  1.469352  0.867080 -0.420124
print(df[df['A'] > 0])
A B C D
2013-01-01 0.161101 0.364128 1.735634 -0.835255
2013-01-02 1.164170 0.384188 0.302318 -0.293224
2013-01-03 1.116850 1.469352 0.867080 -0.420124
2013-01-04 0.952359 1.056309 -2.857191 0.668887
print(df[df > 0])
A B C D
2013-01-01 0.161101 0.364128 1.735634 NaN
2013-01-02 1.164170 0.384188 0.302318 NaN
2013-01-03 1.116850 1.469352 0.867080 NaN
2013-01-04 0.952359 1.056309 NaN 0.668887
2013-01-05 NaN NaN 1.387195 NaN
2013-01-06 NaN NaN 2.008927 1.140431
df2 = df.copy()
df2['E'] = ['one', 'one', 'two', 'three', 'four', 'three']
print(df2)
print(df2[df2['E'].isin(['two', 'four'])])
A B C D E
2013-01-01 0.161101 0.364128 1.735634 -0.835255 one
2013-01-02 1.164170 0.384188 0.302318 -0.293224 one
2013-01-03 1.116850 1.469352 0.867080 -0.420124 two
2013-01-04 0.952359 1.056309 -2.857191 0.668887 three
2013-01-05 -0.097658 -0.794298 1.387195 -0.897870 four
2013-01-06 -0.270472 -1.841921 2.008927 1.140431 three
A B C D E
2013-01-03 1.116850 1.469352 0.867080 -0.420124 two
2013-01-05 -0.097658 -0.794298 1.387195 -0.897870 four
print(df2[~df2['E'].isin(['two', 'four'])])
A B C D E
2013-01-01 0.161101 0.364128 1.735634 -0.835255 one
2013-01-02 1.164170 0.384188 0.302318 -0.293224 one
2013-01-04 0.952359 1.056309 -2.857191 0.668887 three
2013-01-06 -0.270472 -1.841921 2.008927 1.140431 three
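The masks used above can also be combined. The following sketch uses a small made-up frame (the values are not from the random data above) so that the result is deterministic; it shows element-wise `&`, `|` and `~` on boolean Series.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, -2, 3, -4],
                   "B": [10, 20, -30, -40],
                   "E": ["one", "two", "three", "two"]})

# Each condition needs its own parentheses; use & / | / ~ rather than and / or / not.
both_positive = df[(df["A"] > 0) & (df["B"] > 0)]
either_positive = df[(df["A"] > 0) | (df["B"] > 0)]
not_two = df[~df["E"].isin(["two"])]

print(both_positive)
print(either_positive)
print(not_two)
```

The parentheses matter because `&` and `|` bind more tightly than comparison operators in Python.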
5. Index
reindex
Create a new index and reindex the dataframe. By default values in the new index that do not have corresponding records in the dataframe are assigned ``NaN``.
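reindex gives df a new index, and the rows follow the order in which the new index is defined, so records may change position. If a new label already exists in the original df, its row is the original record; otherwise the row is filled with the default value NaN. The original df is not modified; reindex returns a new DataFrame.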
df = pd.DataFrame(np.random.randn(10, 4), columns=list("ABCD"))
print(df)
A B C D
0 0.903525 -0.664247 -0.645762 -0.762519
1 0.981854 -1.070156 -1.164206 -0.908125
2 0.309620 -0.786684 -0.960699 1.606932
3 -1.488677 0.281483 0.856681 0.613150
4 0.772205 0.601886 0.344716 -1.800654
5 0.769349 0.875296 0.074671 -0.333205
6 0.721913 -0.148773 -0.825000 -0.903127
7 -0.886161 0.625793 0.102159 0.264182
8 -0.225532 -0.221453 1.164743 1.037622
9 -0.046355 -1.238612 0.042434 -0.473256
df1 = df.reindex(list(range(15, 5, -1)))
print(df1)
A B C D
15 NaN NaN NaN NaN
14 NaN NaN NaN NaN
13 NaN NaN NaN NaN
12 NaN NaN NaN NaN
11 NaN NaN NaN NaN
10 NaN NaN NaN NaN
9 -0.046355 -1.238612 0.042434 -0.473256
8 -0.225532 -0.221453 1.164743 1.037622
7 -0.886161 0.625793 0.102159 0.264182
6 0.721913 -0.148773 -0.825000 -0.903127
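The NaN default shown above can be replaced through the `fill_value` parameter of `reindex`. Here is a minimal sketch on a tiny made-up frame, chosen so that the result is deterministic.

```python
import pandas as pd

df = pd.DataFrame({"A": [1.0, 2.0, 3.0]}, index=[0, 1, 2])

# Labels 2 and 1 exist and keep their rows; label 5 is new.
default_fill = df.reindex([2, 5, 1])
zero_fill = df.reindex([2, 5, 1], fill_value=0.0)

print(default_fill)
print(zero_fill)
```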
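Assigning to index
Assigning to df.index only relabels the rows: the records stay in their original positions. The new index must have the same length as the DataFrame.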
df1 = df.copy()
df1.index = range(15, 5, -1)
print(df1)
A B C D
15 0.903525 -0.664247 -0.645762 -0.762519
14 0.981854 -1.070156 -1.164206 -0.908125
13 0.309620 -0.786684 -0.960699 1.606932
12 -1.488677 0.281483 0.856681 0.613150
11 0.772205 0.601886 0.344716 -1.800654
10 0.769349 0.875296 0.074671 -0.333205
9 0.721913 -0.148773 -0.825000 -0.903127
8 -0.886161 0.625793 0.102159 0.264182
7 -0.225532 -0.221453 1.164743 1.037622
6 -0.046355 -1.238612 0.042434 -0.473256
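The difference between assigning to `index` and calling `reindex` is easiest to see side by side. This is a minimal sketch on a three-row made-up frame; the names `relabeled` and `reindexed` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"A": [10, 20, 30]})      # default index 0, 1, 2

relabeled = df.copy()
relabeled.index = [2, 1, 0]                 # same rows in the same order, new labels
reindexed = df.reindex([2, 1, 0])           # rows looked up by label and reordered

print(relabeled["A"].tolist())   # [10, 20, 30]
print(reindexed["A"].tolist())   # [30, 20, 10]

# Assigning an index of the wrong length is rejected.
try:
    relabeled.index = [0, 1]
    length_error = None
except ValueError as ex:
    length_error = ex
    print(repr(ex))
```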
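Copyright notice: this is an original article by qq_xuanshuang, released under the CC 4.0 BY-SA license. When reposting, please include a link to the original source and this notice.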