Data preprocessing exercises: equal-frequency binning, one-hot encoding, data normalization #python

Task 1: basic data exercises

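1. Input: a column of numeric data. Output: a column of the same length in which each item is the rank of the corresponding input value within the whole column. For example, input: 0.1 0.8 0.25, output: 1 3 2
2. Compute the mean and variance of a given set of numeric data
3. Given two columns of numeric data, compute their sum, product, difference, and quotient. Each output column has the same length as the inputs, and each item is the sum, product, difference, or quotient of the input values at the corresponding position
4. Given a set of numeric data, normalize it (two methods: min-max normalization and Z-score standardization)
5. Given a set of numeric data and a predetermined integer, bin the data
6. Given a set of nominal data, use one-hot encoding to convert it into several columns of 0/1 values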
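
To make the expected output of item 1 concrete, here is a minimal, dependency-free sketch of the ranking step. The function name `rank_positions` is hypothetical, and the choice that tied values share the lowest rank is an assumption, since the task statement does not say how ties should be handled.

```python
def rank_positions(values):
    """Return the 1-based rank of each item; tied values share the lowest rank."""
    ordered = sorted(values)
    first_rank = {}
    for position, value in enumerate(ordered, start=1):
        # setdefault keeps the first (lowest) position seen for a repeated value
        first_rank.setdefault(value, position)
    return [first_rank[value] for value in values]

print(rank_positions([0.1, 0.8, 0.25]))  # [1, 3, 2]
```

Sorting once and looking ranks up in a dictionary keeps the work at one sort plus one pass over the data.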

The exercises use third-party modules such as numpy, sklearn, and scipy; send me a private message if you need them.
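
For readers who do not have those modules installed, the two normalizations from item 4 can also be sketched with only the standard library. The names `min_max` and `z_score` are hypothetical, and the handling of a constant column (all values equal) is an assumption made here to avoid dividing by zero. The Z-score uses the population standard deviation (ddof=0, which is also the NumPy default for `std`).

```python
from statistics import mean, pstdev

def min_max(values, low=0.0, high=1.0):
    """Rescale linearly so the minimum maps to low and the maximum to high."""
    smallest, largest = min(values), max(values)
    if largest == smallest:
        # Assumption: a constant column is mapped to the lower bound
        return [low for _ in values]
    span = largest - smallest
    return [low + (v - smallest) * (high - low) / span for v in values]

def z_score(values):
    """Subtract the mean and divide by the population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # Assumption: a constant column is mapped to zeros
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

print(min_max([1, 2, 3, 4, 5]))  # [0.0, 0.25, 0.5, 0.75, 1.0]
print(z_score([1, 2, 3]))
```

After Z-score standardization the values have mean 0 and standard deviation 1, while min-max scaling only guarantees that they fall inside the chosen range.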

For convenience, the exercises operate on the following objects:

listOne=[1,2,3,4,5,6,7,8,9,10,11,12,13]
name   =['steven','robin','obama']
action =['playing','eating','sleeping']
local  =['home','school','playground']
mixList=[name,action,local]
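#taskOne: 1-based position of each item in the sorted data (ties share the lowest rank)
def indexList(values):
    ordered=sorted(values) #sort once instead of once per item
    return [ordered.index(x)+1 for x in values] #a list, so it prints the same under Python 2 and 3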

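#taskTwo: mean and variance
from numpy import average,var
def aveStd(values):
    #despite its name, this returns (mean, variance), as the task asks
    #numpy's var defaults to the population variance (ddof=0)
    return average(values),var(values)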

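#taskThree: element-wise sum, difference, product or quotient of two equal-length lists
def opreList(list_1,list_2,operation='add'):
    if(operation=='add'):
        func=lambda a,b:a+b
    elif(operation=='sub'):
        func=lambda a,b:a-b
    elif(operation=='multi'):
        func=lambda a,b:a*b
    elif(operation=='div'):
        func=lambda a,b:a*1.0/b #force float division under Python 2 as well
    else:
        raise ValueError('unknown operation: %s'%operation)
    return [func(a,b) for a,b in zip(list_1,list_2)]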

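#taskFour: normalization by Z-score or min-max scaling
from numpy import array
from sklearn import preprocessing
def normalization(values,method='z_score',feature_range=(0,1)):
    x=array(values)*1.0 #convert to a float array
    if(method=='z_score'):
        return preprocessing.scale(x)
    if(method=='MaxMin'):
        #MinMaxScaler expects a 2-D array (samples x features), so reshape to one column
        column=x.reshape(-1,1)
        scaler=preprocessing.MinMaxScaler(feature_range=feature_range).fit(column) #learns scale_ and min_
        return scaler.transform(column).ravel()
    return x #unknown method: the data is returned unscaled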

#taskFive: equal-frequency (equal-depth) binning, smoothed by box means
from numpy import average
def binning_depth(values,depth):
    #depth is the number of items per box; the data is assumed to be sorted already
    if(depth<1):
        raise ValueError('depth must be a positive integer')
    data=[x for x in values] #work on a copy so the input list is left untouched
    n=len(data)
    tail=0 #start index of the current box
    while(tail<n):
        head=min(tail+depth,n) #end index (exclusive) of the current box
        #if fewer than half a box would be left over, merge the leftover into this box
        if(n-head<0.5*depth):
            head=n
        ave=average(data[tail:head])
        for i in range(tail,head):
            data[i]=ave
        tail=head
    return data

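#taskSix: one-hot encode the values of one sample against the reference categories
import numpy as np
def one_hot(referenceList,transformList):
    #flatten the per-feature category lists into one reference list
    newReference=[]
    for categories in referenceList:
        newReference+=categories
    code=np.zeros(len(newReference))
    for elemt in transformList:
        #index finds the first match, so category values are assumed unique across features
        code[newReference.index(elemt)]=1
    return code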
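
The one_hot function above looks every value up in a single flattened reference list, so a value that occurred in two different features would always mark the first occurrence. As a hedged alternative, the sketch below encodes the sample feature by feature. The name `one_hot_per_feature` is hypothetical, and it assumes the sample lists exactly one value per feature, in the same order as the reference lists.

```python
def one_hot_per_feature(categories_per_feature, sample):
    """Encode one sample as a flat 0/1 list, one block of positions per feature."""
    code = []
    for categories, value in zip(categories_per_feature, sample):
        block = [0] * len(categories)
        # list.index raises ValueError for a value missing from the reference list
        block[categories.index(value)] = 1
        code.extend(block)
    return code

features = [
    ['steven', 'robin', 'obama'],
    ['playing', 'eating', 'sleeping'],
    ['home', 'school', 'playground'],
]
print(one_hot_per_feature(features, ['robin', 'sleeping', 'home']))
# [0, 1, 0, 0, 0, 1, 1, 0, 0]
```

Each feature contributes exactly one 1 to the result, so the number of ones always equals the number of features.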

Testing the code:

if __name__=='__main__':
    #print(...) with a single argument behaves the same under Python 2 and Python 3
    print('taskOne:')
    print(indexList(listOne))
    print('taskTwo:')
    print(aveStd(listOne))
    print('taskThree:')
    print(opreList(listOne,listOne,'add'))
    print('taskFour:MaxMin:')
    print(normalization(listOne,method='MaxMin',feature_range=(0,1)))
    print('taskFour:z_score:')
    print(normalization(listOne,method='z_score'))
    print('taskFive:')
    print(binning_depth(listOne,3))
    print('taskSix:')
    print(one_hot(mixList,['steven','playing','home']))
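
The test calls binning_depth(listOne,3), which reads the integer from item 5 as the number of items per box. If the integer is instead read as the number of boxes, equal-frequency binning could be sketched as follows. This is one possible interpretation, not the method used above; `equal_frequency_labels` and `n_bins` are hypothetical names, and the function returns a box number for each value rather than smoothed values.

```python
def equal_frequency_labels(values, n_bins):
    """Assign each value a box number 0..n_bins-1 so box sizes differ by at most one."""
    if n_bins < 1:
        raise ValueError("n_bins must be a positive integer")
    # Indices of the values, ordered by ascending value
    order = sorted(range(len(values)), key=lambda i: values[i])
    labels = [0] * len(values)
    for position, index in enumerate(order):
        # position * n_bins // len(values) grows from 0 up to n_bins - 1
        labels[index] = position * n_bins // len(values)
    return labels

print(equal_frequency_labels([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13], 4))
# [0, 0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3]
```

Because the boxes are cut purely by position in the sorted order, equal values can end up in neighbouring boxes; whether that is acceptable depends on the data.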

Test results:
[Figure: test results]


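Copyright notice: This is an original article by qq_34263279, licensed under CC 4.0 BY-SA. When reposting, please include a link to the original source and this notice.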