EduCoder Pandas Merging Datasets, Level 1: Concat and Append Operations

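Contents
Task Description
Background
Handling Indexes When Concatenating
The join and join_axes Parameters
The append() Method
Programming Requirements
Task Description
Task for this level: use read_csv() to read the data in two CSV files, merge the two datasets, set the index to the Ladder column, and fill missing values with 0.
Background
In the NumPy material, we saw that np.concatenate, np.stack, np.vstack, and np.hstack can combine arrays. Pandas has a pd.concat() function whose syntax resembles np.concatenate but which offers more options and is more powerful. Its main parameters are as follows.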
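Parameter	Description
objs	The objects to concatenate; required
axis	The axis to concatenate along; defaults to 0
join	inner or outer, defaults to outer; how to combine the indexes on the other axis: inner takes the intersection, outer takes the union
join_axes	The specific indexes to use for the other n-1 axes, instead of performing union/intersection logic (deprecated in pandas 0.25 and removed in 1.0)
keys	Values associated with the objects being concatenated, used to build a hierarchical index along the concatenation axis; can be a list or array of arbitrary values
levels	The specific index values to use for each level of the hierarchical index
names	Names for the levels of the hierarchical index, used when keys and/or levels are set
verify_integrity	Check the new axis of the result for duplicates and raise an exception if any are found; the default False allows duplicates
ignore_index	Do not keep the indexes along the concatenation axis; generate a new index instead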
pd.concat() can be used for simple concatenation of one-dimensional Series objects or of DataFrame objects.
Concatenating Series
import pandas as pd
ser1 = pd.Series(['A', 'B', 'C'], index=[1, 2, 3])
ser2 = pd.Series(['D', 'E', 'F'], index=[4, 5, 6])
pd.concat([ser1,ser2])
Out:
1 A 
2 B 
3 C 
4 D 
5 E 
6 F 
dtype: object
Concatenating DataFrames works the same way. Rows are stacked by default (axis=0); setting concat's axis parameter to 1 concatenates horizontally instead.
df1 = pd.DataFrame([["A1","B1"],["A2","B2"]],index=[1,2],columns=["A","B"])
df2 = pd.DataFrame([["A3","B3"],["A4","B4"]],index=[3,4],columns=["A","B"])
pd.concat([df1,df2])
Out:
    A   B
1  A1  B1
2  A2  B2
3  A3  B3
4  A4  B4
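The example above stacks rows because axis is left at its default of 0. As a minimal sketch of the horizontal case, the following passes axis=1. The frames `left` and `right` and their values are made up for illustration and are not part of the exercise data.

```python
import pandas as pd

# Hypothetical frames that share the same row index but hold different columns.
left = pd.DataFrame([["A1", "B1"], ["A2", "B2"]], index=[1, 2], columns=["A", "B"])
right = pd.DataFrame([["C1", "D1"], ["C2", "D2"]], index=[1, 2], columns=["C", "D"])

# axis=1 places the frames side by side, aligning rows on the index labels.
wide = pd.concat([left, right], axis=1)
print(wide)
```

Because rows are aligned on index labels, frames whose indexes differ would produce NaN in the cells that have no matching row.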
Handling Indexes When Concatenating
One of the main differences between np.concatenate and pd.concat is that Pandas preserves the indexes when concatenating, even when the index labels are duplicated!
df3 = pd.DataFrame([["A1","B1"],["A2","B2"]],index=[1,2],columns=["A","B"])
df4 = pd.DataFrame([["A3","B3"],["A4","B4"]],index=[1,2],columns=["A","B"])
pd.concat([df3,df4])
Out:
    A   B
1  A1  B1
2  A2  B2
1  A3  B3
2  A4  B4
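To make the contrast with NumPy concrete, here is a small sketch. The Series `s1` and `s2` are made up for illustration; the point is only that np.concatenate sees the raw values, while pd.concat carries the labels along.

```python
import numpy as np
import pandas as pd

s1 = pd.Series(["A", "B"], index=[1, 2])
s2 = pd.Series(["C", "D"], index=[1, 2])

# np.concatenate works on the underlying values only, so the labels are lost.
raw = np.concatenate([s1.to_numpy(), s2.to_numpy()])

# pd.concat keeps each input's labels, duplicates included.
labeled = pd.concat([s1, s2])

print(raw.tolist())             # plain values, no index
print(labeled.index.tolist())   # [1, 2, 1, 2]
print(labeled.index.is_unique)  # False
```

Checking `index.is_unique` after concatenating is a cheap way to find out whether duplicated labels slipped in.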
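If you want to check whether the result of pd.concat() contains duplicate index labels, set the verify_integrity parameter. With verify_integrity=True, concatenation raises an exception when index labels are repeated.
try:
    pd.concat([df3, df4], verify_integrity=True)
except ValueError as e:
    print("ValueError:", e)
Out (the exact formatting of the labels depends on the pandas version):
ValueError: Indexes have overlapping values: [1, 2]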
Sometimes the index itself does not matter, in which case it can be discarded when concatenating by setting the ignore_index parameter to True.
pd.concat([df3,df4],ignore_index=True)
Out:
    A   B
0  A1  B1
1  A2  B2
2  A3  B3
3  A4  B4
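Another way to handle duplicate index labels is to use the keys parameter to give each data source a label, so that the result carries a multi-level index.
pd.concat([df3, df4], keys=['x', 'y'])
Out:
      A   B
x 1  A1  B1
  2  A2  B2
y 1  A3  B3
  2  A4  B4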
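One practical benefit of keys is that the rows from each source can be pulled back out of the combined result. The sketch below also uses the names parameter from the table; the level names "source" and "row" and the frames `df_x` and `df_y` are illustrative choices, not anything pandas requires.

```python
import pandas as pd

df_x = pd.DataFrame([["A1", "B1"], ["A2", "B2"]], index=[1, 2], columns=["A", "B"])
df_y = pd.DataFrame([["A3", "B3"], ["A4", "B4"]], index=[1, 2], columns=["A", "B"])

# keys labels each source; names gives the two index levels readable names.
combined = pd.concat([df_x, df_y], keys=["x", "y"], names=["source", "row"])

# Selecting on the outer level recovers the rows that came from one source.
from_y = combined.loc["y"]
print(combined)
print(from_y)
```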
The join and join_axes Parameters
The simple examples so far share one trait: the DataFrames being concatenated all have the same column names. In practice, the data to be combined often has different column names, and pd.concat provides parameters for handling that case.
df5 = pd.DataFrame([["A1","B1","C1"],["A2","B2","C2"]],index=[1,2],columns=["A","B","C"])
df6 = pd.DataFrame([["B3","C3","D3"],["B4","C4","D4"]],index=[3,4],columns=["B","C","D"])
pd.concat([df5,df6])
Out:
     A   B   C    D
1   A1  B1  C1  NaN
2   A2  B2  C2  NaN
3  NaN  B3  C3   D3
4  NaN  B4  C4   D4
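As you can see, the result contains missing values. To avoid them, you can use the join parameter or, in pandas versions before 1.0, the join_axes parameter.
pd.concat([df5,df6],join="inner") # inner join: keep only the columns common to both
Out:
    B   C
1  B1  C1
2  B2  C2
3  B3  C3
4  B4  C4

# join_axes must be a list of Index objects (pandas < 1.0 only; the parameter was removed in pandas 1.0)
pd.concat([df5,df6],join_axes=[pd.Index(["B","C"])])
Out:
    B   C
1  B1  C1
2  B2  C2
3  B3  C3
4  B4  C4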
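On pandas 1.0 and later the join_axes call above raises a TypeError because the parameter no longer exists. One way to get the same result, sketched here under the assumption that selecting columns after the fact is acceptable, is to concatenate with the default outer join and then call reindex.

```python
import pandas as pd

df5 = pd.DataFrame([["A1", "B1", "C1"], ["A2", "B2", "C2"]], index=[1, 2], columns=["A", "B", "C"])
df6 = pd.DataFrame([["B3", "C3", "D3"], ["B4", "C4", "D4"]], index=[3, 4], columns=["B", "C", "D"])

# Concatenate with the default outer join, then keep only the wanted columns.
picked = pd.concat([df5, df6]).reindex(columns=["B", "C"])

# Keeping exactly df5's columns mirrors the old join_axes=[df5.columns] usage;
# column A is NaN for the rows that came from df6.
like_df5 = pd.concat([df5, df6]).reindex(columns=df5.columns)

print(picked)
print(like_df5)
```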
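The append() Method
Because direct concatenation is such a common need, Series and DataFrame objects have historically offered an append method that achieves it with less code. For example, df1.append(df2) has the same effect as pd.concat([df1,df2]). Unlike the append method of Python lists, however, Pandas' append() does not modify the original object: every call creates a new object with its own index and data buffer, so repeated appends are inefficient. Note that append() was deprecated in pandas 1.4 and removed in 2.0; on current versions, use pd.concat() instead.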
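Since every concatenation builds a new object, a common pattern is to gather the pieces in a Python list and concatenate once at the end. This works on any pandas version, including releases where DataFrame.append is no longer available. The `chunks` list below is hypothetical data standing in for pieces that arrive one at a time.

```python
import pandas as pd

# Hypothetical chunks standing in for data that arrives piece by piece.
chunks = [
    pd.DataFrame({"A": [f"A{i}"], "B": [f"B{i}"]}, index=[i])
    for i in range(1, 5)
]

# Concatenate once instead of growing a frame inside a loop,
# which would copy all of the accumulated data on every step.
combined = pd.concat(chunks)
print(combined)
```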
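Programming Requirements
data.csv and data1.csv are two datasets related to the happiness index ranking of countries. To make the ranking details easier to view, the two datasets need to be concatenated horizontally. The column names have the following meanings:
Column	Description
Country (region)	Country
Ladder	Rank
SD of Ladder	Deviation of the rank
Positive affect	Positive affect
Negative affect	Negative affect
Social support	Social support
Freedom	Freedom
Corruption	Level of corruption
Generosity	Generosity
Log of GDP per capita	Logarithm of GDP per capita
Healthy life expectancy	Healthy life expectancy
Read the two datasets step1/data.csv and step1/data1.csv;
First concatenate the two datasets horizontally;
Set the index to the rank (Ladder) column;
Fill missing values with 0;
See the test samples below for the exact requirements.
Please read the code skeleton in the code editor at the upper right carefully before you start programming!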
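The four steps can be tried out without the CSV files. The sketch below uses two tiny in-memory frames with invented values in place of step1/data.csv and step1/data1.csv, so it illustrates the sequence of operations rather than reproducing the graded answer. It removes repeated columns by name with columns.duplicated(), which is a different technique from comparing column values, and it assumes the two inputs list the countries in the same row order.

```python
import pandas as pd

# Tiny stand-ins for the two CSV files; the values are invented for illustration.
part_a = pd.DataFrame({
    "Country (region)": ["Finland", "Denmark", "Qatar"],
    "Ladder": [1, 2, 29],
    "Freedom": [5.0, 6.0, None],
})
part_b = pd.DataFrame({
    "Country (region)": ["Finland", "Denmark", "Qatar"],
    "Ladder": [1, 2, 29],
    "Social support": [2.0, 4.0, None],
})

# 1. Horizontal concatenation; rows are aligned on the default 0..n-1 index.
merged = pd.concat([part_a, part_b], axis=1)

# 2. Columns present in both inputs now appear twice; keep the first of each name.
merged = merged.loc[:, ~merged.columns.duplicated()]

# 3. Use the Ladder column as the index, then 4. fill missing values with 0.
result = merged.set_index("Ladder").fillna(0)
print(result)
```

By default set_index removes Ladder from the columns; pass drop=False if the column should also remain among the regular columns.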
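Test Instructions
The platform tests the code you write by comparing the values you output with the correct values; you can move on to the next level only when all of the data is correct.
Test input:
No test input
Expected output: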
Country (region) Freedom … Negative affect Social support
Ladder …
1 Finland 5.0 … 10.0 2.0
2 Denmark 6.0 … 26.0 4.0
3 Norway 3.0 … 29.0 3.0
4 Iceland 7.0 … 3.0 1.0
5 Netherlands 19.0 … 25.0 15.0
6 Switzerland 11.0 … 21.0 13.0
7 Sweden 10.0 … 8.0 25.0
8 New Zealand 8.0 … 12.0 5.0
9 Canada 9.0 … 49.0 20.0
10 Austria 26.0 … 24.0 31.0
11 Australia 17.0 … 37.0 7.0
12 Costa Rica 16.0 … 87.0 42.0
13 Israel 93.0 … 69.0 38.0
14 Luxembourg 28.0 … 19.0 27.0
15 United Kingdom 63.0 … 42.0 9.0
16 Ireland 33.0 … 32.0 6.0
17 Germany 44.0 … 30.0 39.0
18 Belgium 53.0 … 53.0 22.0
19 United States 62.0 … 70.0 37.0
20 Czech Republic 58.0 … 22.0 24.0
21 United Arab Emirates 4.0 … 56.0 72.0
22 Malta 12.0 … 103.0 16.0
23 Mexico 71.0 … 40.0 67.0
24 France 69.0 … 66.0 32.0
25 Taiwan 102.0 … 1.0 48.0
26 Chile 98.0 … 78.0 58.0
27 Guatemala 25.0 … 85.0 78.0
28 Saudi Arabia 68.0 … 82.0 62.0
29 Qatar 0.0 … 0.0 0.0
30 Spain 95.0 … 107.0 26.0
… … … … … …
127 Congo (Kinshasa) 125.0 … 95.0 107.0
128 Mali 110.0 … 122.0 112.0
129 Sierra Leone 116.0 … 149.0 135.0
130 Sri Lanka 55.0 … 81.0 80.0
131 Myanmar 29.0 … 86.0 96.0
132 Chad 142.0 … 151.0 141.0
133 Ukraine 141.0 … 44.0 56.0
134 Ethiopia 106.0 … 74.0 119.0
135 Swaziland 113.0 … 57.0 103.0
136 Uganda 99.0 … 139.0 114.0
137 Egypt 129.0 … 124.0 118.0
138 Zambia 73.0 … 128.0 115.0
139 Togo 120.0 … 147.0 149.0
140 India 41.0 … 115.0 142.0
141 Liberia 94.0 … 146.0 127.0
142 Comoros 148.0 … 114.0 143.0
143 Madagascar 146.0 … 96.0 128.0
144 Lesotho 97.0 … 64.0 98.0
145 Burundi 135.0 … 126.0 152.0
146 Zimbabwe 96.0 … 34.0 110.0
147 Haiti 152.0 … 119.0 146.0
148 Botswana 60.0 … 65.0 105.0
149 Syria 153.0 … 155.0 154.0
150 Malawi 65.0 … 110.0 150.0
151 Yemen 147.0 … 75.0 100.0
152 Rwanda 21.0 … 102.0 144.0
153 Tanzania 78.0 … 50.0 131.0
154 Afghanistan 155.0 … 133.0 151.0
155 Central African Republic 133.0 … 153.0 155.0
156 South Sudan 154.0 … 152.0 148.0
[156 rows x 11 columns]


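import pandas as pd

def task1():
    #********** Begin **********#
    data = pd.read_csv('step1/data.csv')       # read data.csv
    data1 = pd.read_csv('step1/data1.csv')     # read data1.csv
    result = pd.concat([data, data1], axis=1)  # concatenate horizontally
    result = result.T.drop_duplicates().T      # drop duplicated columns by transposing twice
    result.index.name = 'Ladder'               # name the index 'Ladder' (this only labels the index; it does not move the Ladder column into it)
    result = result.fillna(0)                  # fill missing values with 0

    #********** End **********#
    return result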

Original link: https://blog.csdn.net/weixin_43608722/article/details/106349015