Example: removing outliers directly
import numpy as np
data_list = [1,2,3,4,5,5,4,3,2,1,1,2,3,4,5,5,4,3,2,1,10000,-10000]
data_array = np.asarray(data_list)
mean = np.mean(data_array)
std = np.std(data_array)
# Keep only the values strictly within three standard deviations of the mean
mask = (data_array > mean - 3 * std) & (data_array < mean + 3 * std)
# tolist() converts NumPy scalars to plain Python ints so the list prints cleanly
preprocessed_data_array = data_array[mask].tolist()
print(preprocessed_data_array)
- Output:
[1, 2, 3, 4, 5, 5, 4, 3, 2, 1, 1, 2, 3, 4, 5, 5, 4, 3, 2, 1]
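The same three-standard-deviation filter can be wrapped in a reusable function. The following is only a minimal sketch and is not part of the original example: the function name `remove_outliers` and its parameter `k` are hypothetical, chosen for illustration. It assumes the population standard deviation (the `np.std` default), as in the example above, and also returns the two thresholds so they can be inspected.

```python
import numpy as np

def remove_outliers(values, k=3.0):
    """Hypothetical helper: keep values strictly within k standard deviations of the mean.

    Returns the kept values and the (lower, upper) thresholds that were used.
    """
    arr = np.asarray(values, dtype=float)
    mean = arr.mean()
    std = arr.std()  # population standard deviation (ddof=0)
    lower, upper = mean - k * std, mean + k * std
    mask = (arr > lower) & (arr < upper)
    return arr[mask], (lower, upper)

data_list = [1,2,3,4,5,5,4,3,2,1,1,2,3,4,5,5,4,3,2,1,10000,-10000]
kept, (lower, upper) = remove_outliers(data_list)
print(f"lower={lower:.1f}, upper={upper:.1f}")
print(kept.tolist())
```

For this data the thresholds come out at roughly -9042.6 and 9048.1, so only 10000 and -10000 fall outside them. Note that the two extreme values themselves inflate the standard deviation to about 3015, which is why the accepted range is so wide.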
Example: replacing outliers with the mean
import numpy as np
data_list = [1,2,3,4,5,5,4,3,2,1,1,2,3,4,5,5,4,3,2,1,10000,-10000]
# dtype=float is needed here: in an integer array, the mean written back below would be truncated to an integer
data_array = np.asarray(data_list, dtype=float)
mean = np.mean(data_array)
std = np.std(data_array)
print(mean)
floor = mean - 3 * std
upper = mean + 3 * std
# Overwrite, in place, every value below floor or above upper with the mean
data_array[(data_array < floor) | (data_array > upper)] = mean
print(data_array)
- Output:
2.727272727272727
[1.         2.         3.         4.         5.         5.
 4.         3.         2.         1.         1.         2.
 3.         4.         5.         5.         4.         3.
 2.         1.         2.72727273 2.72727273]
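- Note: if dtype=float is omitted from data_array = np.asarray(data_list, dtype=float), the array has an integer dtype, so the mean is truncated to an integer (2.727... becomes 2) when it is written into the array. The output is then: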
2.727272727272727
[1 2 3 4 5 5 4 3 2 1 1 2 3 4 5 5 4 3 2 1 2 2]
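One way to avoid relying on the caller to remember `dtype=float` is to have the replacement step make its own floating-point copy of the input. The following is a minimal sketch under that assumption; `replace_outliers_with_mean` and `k` are hypothetical names that do not appear in the original example.

```python
import numpy as np

def replace_outliers_with_mean(values, k=3.0):
    """Hypothetical helper: return a float copy with points beyond mean +/- k*std set to the mean."""
    # np.array(..., dtype=float) copies by default, so the caller's data is left untouched
    arr = np.array(values, dtype=float)
    mean = arr.mean()
    std = arr.std()  # population standard deviation (ddof=0)
    outliers = np.abs(arr - mean) > k * std
    arr[outliers] = mean
    return arr

data_list = [1,2,3,4,5,5,4,3,2,1,1,2,3,4,5,5,4,3,2,1,10000,-10000]
int_array = np.asarray(data_list)  # integer dtype, as in the note above
result = replace_outliers_with_mean(int_array)
print(result.dtype)
print(result[-2:])
```

Because the function works on a float copy, the two outliers are replaced by the untruncated mean even though the input array is integer-typed, and `int_array` itself is not modified.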
References
- https://www.kdnuggets.com/2017/02/removing-outliers-standard-deviation-python.html
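Copyright notice: this is an original article by sdnuwjw, licensed under CC 4.0 BY-SA. When reposting, please include a link to the original source and this notice.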