选择 DataFrame 列的方法大全

因 457079928 python草堂群,网友 @星河·璀动 问到pandas选择数据列的问题,故把自己的学习总结的笔记整理出来。

完整的ipynb文件,和文中示例数据请移步python草堂群下载。

注意:这些方法返回的都是视图。

  1. 如果你确信原始数据的一些列是无用的,可以在删除这些列后,创建 DataFrame 的一个 .copy(deep=True),然后删除原 DataFrame, 以节省内存;
  2. 如果列不是很多,在导入数据时就排除这些列。

选择列的两种思路

  1. 需要的列较少,直接选择需要的列;
  2. 不需要的列较少,删除或排除不要的列。
import numpy as np
import pandas as pd
import pprint as pp

baseball_df = pd.read_csv('baseball.csv',index_col='No.')
pp.pprint(baseball_df)
              id  year  stint team   lg    g   ab   r    h  X2b  ...   rbi  \
No.                                                              ...         
4      ansonca01  1871      1  RC1  NaN   25  120  29   39   11  ...  16.0   
44     forceda01  1871      1  WS3  NaN   32  162  45   45    9  ...  29.0   
68     mathebo01  1871      1  FW1  NaN   19   89  15   24    3  ...  10.0   
99     startjo01  1871      1  NY2  NaN   33  161  35   58    5  ...  34.0   
102    suttoez01  1871      1  CL1  NaN   29  128  35   45    3  ...  23.0   
...          ...   ...    ...  ...  ...  ...  ...  ..  ...  ...  ...   ...   
89525  benitar01  2007      2  FLO   NL   34    0   0    0    0  ...   0.0   
89526  benitar01  2007      1  SFN   NL   19    0   0    0    0  ...   0.0   
89530  ausmubr01  2007      1  HOU   NL  117  349  38   82   16  ...  25.0   
89533   aloumo01  2007      1  NYN   NL   87  328  51  112   19  ...  49.0   
89534  alomasa02  2007      1  NYN   NL    8   22   1    3    1  ...   0.0   

        sb   cs  bb    so  ibb  hbp   sh   sf  gidp  
No.                                                  
4      6.0  2.0   2   1.0  NaN  NaN  NaN  NaN   NaN  
44     8.0  0.0   4   0.0  NaN  NaN  NaN  NaN   NaN  
68     2.0  1.0   2   0.0  NaN  NaN  NaN  NaN   NaN  
99     4.0  2.0   3   0.0  NaN  NaN  NaN  NaN   NaN  
102    3.0  1.0   1   0.0  NaN  NaN  NaN  NaN   NaN  
...    ...  ...  ..   ...  ...  ...  ...  ...   ...  
89525  0.0  0.0   0   0.0  0.0  0.0  0.0  0.0   0.0  
89526  0.0  0.0   0   0.0  0.0  0.0  0.0  0.0   0.0  
89530  6.0  1.0  37  74.0  3.0  6.0  4.0  1.0  11.0  
89533  3.0  0.0  27  30.0  5.0  2.0  0.0  3.0  13.0  
89534  0.0  0.0   0   3.0  0.0  0.0  0.0  0.0   0.0  

[21699 rows x 22 columns]

方法一:直接指定需要的列

pp.pprint(baseball_df.query('g>50')[['year','g','ab','r','h']])
       year    g   ab   r    h
No.                           
249    1872   55  282  62   76
268    1872   51  248  66   66
276    1873   52  254  53  101
293    1873   55  245  56   62
344    1873   52  223  40   43
...     ...  ...  ...  ..  ...
89499  2007   69  189  23   48
89521  2007  126  340  75   94
89523  2007  141  517  68  130
89530  2007  117  349  38   82
89533  2007   87  328  51  112

[11704 rows x 5 columns]

方法二:获取需要的列名列表

使用列表推导排除不需要的列

cols = [i for i in baseball_df.columns if i not in ['id', 'ibb','gidp','X2b','X3b','lg']]
baseball_df[cols]
yearstintteamgabrhhrrbisbcsbbsohbpshsf
No.
418711RC1251202939016.06.02.021.0NaNNaNNaN
4418711WS3321624545029.08.00.040.0NaNNaNNaN
6818711FW119891524010.02.01.020.0NaNNaNNaN
9918711NY2331613558134.04.02.030.0NaNNaNNaN
10218711CL1291283545323.03.01.010.0NaNNaNNaN
...................................................
8952520072FLO3400000.00.00.000.00.00.00.0
8952620071SFN1900000.00.00.000.00.00.00.0
8953020071HOU1173493882325.06.01.03774.06.04.01.0
8953320071NYN87328511121349.03.00.02730.02.00.03.0
8953420071NYN8221300.00.00.003.00.00.00.0

21699 rows × 16 columns

baseball_df.query('g>100.0')[cols]
yearstintteamgabrhhrrbisbcsbbsohbpshsf
No.
198818841LS2103447101150447.00.0NaN13NaN2.0NaNNaN
198918841CL510745193123532.00.0NaN24NaN4.0NaNNaN
205318841LS21064043989040.00.0NaN13NaN5.0NaNNaN
208218841CN2112472117148771.00.0NaN37NaN10.0NaNNaN
210218841SL411047411513010.00.0NaN18NaN6.0NaNNaN
...................................................
8948120071SLN11736539921253.00.02.04175.00.02.03.0
8948920071NYN139538711392487.04.00.052118.011.00.06.0
8952120071SFN12634075942866.05.00.013254.03.00.02.0
8952320071HOU141517681301050.04.03.023112.03.07.05.0
8953020071HOU1173493882325.06.01.03774.06.04.01.0

7554 rows × 16 columns

#也可以这样:

baseball_df[cols].query('g>100')
yearstintteamlggabrhhrrbisbcsbbsohbpshsf
No.
198818841LS2AA103447101150447.00.0NaN13NaN2.0NaNNaN
198918841CL5AA10745193123532.00.0NaN24NaN4.0NaNNaN
205318841LS2AA1064043989040.00.0NaN13NaN5.0NaNNaN
208218841CN2AA112472117148771.00.0NaN37NaN10.0NaNNaN
210218841SL4AA11047411513010.00.0NaN18NaN6.0NaNNaN
......................................................
8948120071SLNNL11736539921253.00.02.04175.00.02.03.0
8948920071NYNNL139538711392487.04.00.052118.011.00.06.0
8952120071SFNNL12634075942866.05.00.013254.03.00.02.0
8952320071HOUNL141517681301050.04.03.023112.03.07.05.0
8953020071HOUNL1173493882325.06.01.03774.06.04.01.0

7554 rows × 17 columns

使用difference方法返回索引的差集

baseball_df.columns
Index(['id', 'year', 'stint', 'team', 'lg', 'g', 'ab', 'r', 'h', 'X2b', 'X3b',
       'hr', 'rbi', 'sb', 'cs', 'bb', 'so', 'ibb', 'hbp', 'sh', 'sf', 'gidp'],
      dtype='object')
cols = baseball_df.columns.difference(['ibb','gidp','X2b','X3b','id'],sort=False) 
#sort=False,否则返回的df的columns会重新排序
cols
Index(['year', 'stint', 'team', 'lg', 'g', 'ab', 'r', 'h', 'hr', 'rbi', 'sb',
       'cs', 'bb', 'so', 'hbp', 'sh', 'sf'],
      dtype='object')
baseball_df.query('g>100.0')[cols]
yearstintteamlggabrhhrrbisbcsbbsohbpshsf
No.
198818841LS2AA103447101150447.00.0NaN13NaN2.0NaNNaN
198918841CL5AA10745193123532.00.0NaN24NaN4.0NaNNaN
205318841LS2AA1064043989040.00.0NaN13NaN5.0NaNNaN
208218841CN2AA112472117148771.00.0NaN37NaN10.0NaNNaN
210218841SL4AA11047411513010.00.0NaN18NaN6.0NaNNaN
......................................................
8948120071SLNNL11736539921253.00.02.04175.00.02.03.0
8948920071NYNNL139538711392487.04.00.052118.011.00.06.0
8952120071SFNNL12634075942866.05.00.013254.03.00.02.0
8952320071HOUNL141517681301050.04.03.023112.03.07.05.0
8953020071HOUNL1173493882325.06.01.03774.06.04.01.0

7554 rows × 17 columns

pp.pprint(baseball_df)
              id  year  stint team   lg    g   ab   r    h  X2b  ...   rbi  \
No.                                                              ...         
4      ansonca01  1871      1  RC1  NaN   25  120  29   39   11  ...  16.0   
44     forceda01  1871      1  WS3  NaN   32  162  45   45    9  ...  29.0   
68     mathebo01  1871      1  FW1  NaN   19   89  15   24    3  ...  10.0   
99     startjo01  1871      1  NY2  NaN   33  161  35   58    5  ...  34.0   
102    suttoez01  1871      1  CL1  NaN   29  128  35   45    3  ...  23.0   
...          ...   ...    ...  ...  ...  ...  ...  ..  ...  ...  ...   ...   
89525  benitar01  2007      2  FLO   NL   34    0   0    0    0  ...   0.0   
89526  benitar01  2007      1  SFN   NL   19    0   0    0    0  ...   0.0   
89530  ausmubr01  2007      1  HOU   NL  117  349  38   82   16  ...  25.0   
89533   aloumo01  2007      1  NYN   NL   87  328  51  112   19  ...  49.0   
89534  alomasa02  2007      1  NYN   NL    8   22   1    3    1  ...   0.0   

        sb   cs  bb    so  ibb  hbp   sh   sf  gidp  
No.                                                  
4      6.0  2.0   2   1.0  NaN  NaN  NaN  NaN   NaN  
44     8.0  0.0   4   0.0  NaN  NaN  NaN  NaN   NaN  
68     2.0  1.0   2   0.0  NaN  NaN  NaN  NaN   NaN  
99     4.0  2.0   3   0.0  NaN  NaN  NaN  NaN   NaN  
102    3.0  1.0   1   0.0  NaN  NaN  NaN  NaN   NaN  
...    ...  ...  ..   ...  ...  ...  ...  ...   ...  
89525  0.0  0.0   0   0.0  0.0  0.0  0.0  0.0   0.0  
89526  0.0  0.0   0   0.0  0.0  0.0  0.0  0.0   0.0  
89530  6.0  1.0  37  74.0  3.0  6.0  4.0  1.0  11.0  
89533  3.0  0.0  27  30.0  5.0  2.0  0.0  3.0  13.0  
89534  0.0  0.0   0   3.0  0.0  0.0  0.0  0.0   0.0  

[21699 rows x 22 columns]

删除不需要的列

pp.pprint(baseball_df.drop(['ibb','gidp','X2b','X3b','id','lg'],axis=1))
       year  stint team    g   ab   r    h  hr   rbi   sb   cs  bb    so  hbp  \
No.                                                                             
4      1871      1  RC1   25  120  29   39   0  16.0  6.0  2.0   2   1.0  NaN   
44     1871      1  WS3   32  162  45   45   0  29.0  8.0  0.0   4   0.0  NaN   
68     1871      1  FW1   19   89  15   24   0  10.0  2.0  1.0   2   0.0  NaN   
99     1871      1  NY2   33  161  35   58   1  34.0  4.0  2.0   3   0.0  NaN   
102    1871      1  CL1   29  128  35   45   3  23.0  3.0  1.0   1   0.0  NaN   
...     ...    ...  ...  ...  ...  ..  ...  ..   ...  ...  ...  ..   ...  ...   
89525  2007      2  FLO   34    0   0    0   0   0.0  0.0  0.0   0   0.0  0.0   
89526  2007      1  SFN   19    0   0    0   0   0.0  0.0  0.0   0   0.0  0.0   
89530  2007      1  HOU  117  349  38   82   3  25.0  6.0  1.0  37  74.0  6.0   
89533  2007      1  NYN   87  328  51  112  13  49.0  3.0  0.0  27  30.0  2.0   
89534  2007      1  NYN    8   22   1    3   0   0.0  0.0  0.0   0   3.0  0.0   

        sh   sf  
No.              
4      NaN  NaN  
44     NaN  NaN  
68     NaN  NaN  
99     NaN  NaN  
102    NaN  NaN  
...    ...  ...  
89525  0.0  0.0  
89526  0.0  0.0  
89530  4.0  1.0  
89533  0.0  3.0  
89534  0.0  0.0  

[21699 rows x 16 columns]

使用 filter 方法

baseball_df.filter(items=['year','g'])
yearg
No.
4187125
44187132
68187119
99187133
102187129
.........
89525200734
89526200719
895302007117
89533200787
8953420078

21699 rows × 2 columns

baseball_df.filter(regex='b$') #选择列名中包含b 的列
abX2bX3bsbbbibb
No.
41201136.02NaN
44162948.04NaN
6889312.02NaN
99161514.03NaN
102128373.01NaN
.....................
895250000.000.0
895260000.000.0
895303491636.0373.0
895333281913.0275.0
8953422100.000.0

21699 rows × 6 columns

baseball_df.filter(like='ea')
yearteam
No.
41871RC1
441871WS3
681871FW1
991871NY2
1021871CL1
.........
895252007FLO
895262007SFN
895302007HOU
895332007NYN
895342007NYN

21699 rows × 2 columns



版权声明:本文为sinat_32570141原创文章,遵循CC 4.0 BY-SA版权协议,转载请附上原文出处链接和本声明。