Machine Learning Kaggle Case: Walmart Recruiting - Store Sales Forecasting

Kaggle link: https://www.kaggle.com/c/walmart-recruiting-store-sales-forecasting
ipynb file: https://github.com/824024445/KaggleCases

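I. Introduction

1.1 Competition description

One challenge of modeling retail data is the need to make decisions based on limited history. If Christmas comes only once a year, so does the chance to see how strategic decisions affected the bottom line.

In this recruiting competition, job seekers are given historical sales data for 45 Walmart stores located in different regions. Each store contains many departments, and participants must predict the sales of every department in every store. To add to the challenge, selected holiday markdown events are included in the dataset. These markdowns are known to affect sales, but it is hard to predict which departments are affected and by how much.

Want to work in a great environment with some of the world's largest datasets? This is your chance to show your modeling mettle to the Walmart hiring team.

This competition counts towards rankings and achievements. If you wish to be considered for an interview at Walmart, check the "Allow host to contact me" box when you make your first entry.

You must compete as an individual in recruiting competitions. You may only use the provided data to make your predictions.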

1.2 Evaluation

This competition is evaluated on the weighted mean absolute error (WMAE):

$$\mathrm{WMAE} = \frac{1}{\sum_{i=1}^{n} w_i} \sum_{i=1}^{n} w_i \left| y_i - \hat{y}_i \right|$$

n is the number of rows
ŷi is the predicted sales
yi is the actual sales
wi is the weight: wi = 5 if the week is a holiday week, otherwise 1
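The metric defined above can be sketched in a few lines of NumPy. This is a minimal illustration, not code from the competition: the function name `wmae` and the toy numbers are made up here.

```python
import numpy as np

def wmae(y_true, y_pred, is_holiday):
    """Weighted mean absolute error: weight 5 for holiday weeks, 1 otherwise."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    weights = np.where(np.asarray(is_holiday, dtype=bool), 5.0, 1.0)
    return float(np.sum(weights * np.abs(y_true - y_pred)) / np.sum(weights))

# Toy example: a holiday week with error 2 and a normal week with error 0
score = wmae([10.0, 20.0], [12.0, 20.0], [True, False])
print(round(score, 4))  # (5*2 + 1*0) / (5 + 1) = 1.6667
```

Because holiday weeks count five times, an error made in a holiday week moves the score five times as much as the same error in an ordinary week.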

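Submission file: the Id column is formed by joining Store, Dept and Date with underscores (e.g. Store_Dept_2012-11-02).

For each row in the test set (a store + department + date triplet), you should predict that department's weekly sales.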
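As a small illustration of the Id format, the snippet below builds the Id strings for two made-up rows. The column names follow the data files; the variable names `rows` and `submission_id` are only for this sketch.

```python
import pandas as pd

# Two made-up test rows
rows = pd.DataFrame({
    "Store": [1, 1],
    "Dept": [1, 2],
    "Date": ["2012-11-02", "2012-11-09"],
})

# Join Store, Dept and Date with underscores
submission_id = (rows["Store"].astype(str) + "_"
                 + rows["Dept"].astype(str) + "_"
                 + rows["Date"].astype(str))
print(submission_id.tolist())  # ['1_1_2012-11-02', '1_2_2012-11-09']
```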

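1.3 Data description

You are given historical sales data for 45 Walmart stores located in different regions. Each store contains many departments, and your task is to predict the department-level sales for each store.

In addition, Walmart runs several promotional markdown events throughout the year. These markdowns precede prominent holidays, the four largest of which are the Super Bowl, Labor Day, Thanksgiving, and Christmas. The weeks containing these holidays are weighted five times higher in the evaluation than non-holiday weeks. Part of the challenge of this competition is modeling the effect of markdowns on these holiday weeks without complete/ideal historical data.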

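stores.csv:
This file contains anonymized information about the 45 stores, indicating the type and size of each store.

train.csv:
This is the historical sales data, covering 2010-02-05 to 2012-11-01. It contains the following fields:
Store - the store number
Dept - the department number
Date - the week
Weekly_Sales - sales of the given department in the given store (the target)
IsHoliday - whether the week is a special holiday week

test.csv:
This file is identical to train.csv, except that the weekly sales are withheld. You must predict the sales for every store, department and date triplet in this file.

features.csv:
This file contains additional data related to the store, department, and regional activity for the given dates. It contains the following fields:
Store - the store number
Date - the week
Temperature - average temperature in the region
Fuel_Price - cost of fuel in the region
MarkDown1-5 - anonymized data related to promotional markdowns that Walmart is running. MarkDown data is only available after November 2011 and is not available for all stores at all times. Any missing value is marked NA.
CPI - the consumer price index
Unemployment - the unemployment rate
IsHoliday - whether the week is a special holiday week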

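For convenience, the four holidays fall within the following weeks in the dataset (not all holidays are in the data):

Super Bowl: 12-Feb-10, 11-Feb-11, 10-Feb-12, 8-Feb-13
Labor Day: 10-Sep-10, 9-Sep-11, 7-Sep-12, 6-Sep-13
Thanksgiving: 26-Nov-10, 25-Nov-11, 23-Nov-12, 29-Nov-13
Christmas: 31-Dec-10, 30-Dec-11, 28-Dec-12, 27-Dec-13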
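The IsHoliday flag only says that a week is a holiday week, not which holiday it is. A possible way to recover that, assuming the week dates listed above are read as day-month-year, is a simple lookup table. The names `HOLIDAY_WEEKS` and `holiday_name` are hypothetical and not part of the original notebook.

```python
import pandas as pd

# Holiday weeks, written in the same YYYY-MM-DD format as the Date column
HOLIDAY_WEEKS = {
    "Super Bowl": ["2010-02-12", "2011-02-11", "2012-02-10", "2013-02-08"],
    "Labor Day": ["2010-09-10", "2011-09-09", "2012-09-07", "2013-09-06"],
    "Thanksgiving": ["2010-11-26", "2011-11-25", "2012-11-23", "2013-11-29"],
    "Christmas": ["2010-12-31", "2011-12-30", "2012-12-28", "2013-12-27"],
}
# Invert the table: date string -> holiday name
date_to_holiday = {d: name for name, dates in HOLIDAY_WEEKS.items() for d in dates}

def holiday_name(date_str):
    """Return the holiday for a week, or "None" for an ordinary week."""
    return date_to_holiday.get(date_str, "None")

weeks = pd.Series(["2010-02-05", "2010-02-12", "2011-11-25"])
labels = weeks.map(holiday_name)
print(labels.tolist())  # ['None', 'Super Bowl', 'Thanksgiving']
```

Whether such a feature helps is untested here; it could be one-hot encoded in the same way as the store Type.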

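II. Code

2.1 Getting the data

2.1.1 Downloading the data

I wrote a small function to download the data. All of it is the original data from the official site, which I saved to my GitHub (https://github.com/824024445/KaggleCases).

Everything is downloaded into the datasets folder under your current directory, and the data for each case goes into a folder named after that case.

All of my Kaggle case blog posts use this function to download data; only the first two constants need to change.
> Note: this function is only used to download the data and runs right in this code cell. Neither it nor its constants are used anywhere else in the code.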

import os
import zipfile
import urllib.request

FILE_NAME = "walmart-recruiting-store-sales-forecasting.zip"  # archive file name
DATA_PATH = "datasets/walmart-recruiting-store-sales-forecasting"  # target folder, named after the archive so cases are easy to tell apart
DATA_URL = "https://github.com/824024445/KaggleCases/blob/master/datasets/" + FILE_NAME + "?raw=true"


def fetch_data(data_url=DATA_URL, data_path=DATA_PATH, file_name=FILE_NAME):
    # Create the target folder (DATA_PATH) if it does not exist yet
    os.makedirs(data_path, exist_ok=True)
    zip_path = os.path.join(data_path, file_name)  # local path of the downloaded archive
    # urlretrieve() downloads the remote file straight to the local path zip_path
    urllib.request.urlretrieve(data_url, zip_path)
    # extractall() defaults to the current directory; pass path= so everything stays in this case's folder
    with zipfile.ZipFile(zip_path) as data_zip:
        data_zip.extractall(path=data_path)

fetch_data()

2.1.2 Reading the data

import pandas as pd
import numpy as np

train_df = pd.read_csv("datasets/walmart-recruiting-store-sales-forecasting/train.csv")
test_df = pd.read_csv("datasets/walmart-recruiting-store-sales-forecasting/test.csv")
features = pd.read_csv("datasets/walmart-recruiting-store-sales-forecasting/features.csv")
stores = pd.read_csv("datasets/walmart-recruiting-store-sales-forecasting/stores.csv")

train_df = train_df.merge(features, on=["Store", "Date"], how="left").merge(stores, on="Store", how="left")
test_df = test_df.merge(features, on=["Store", "Date"], how="left").merge(stores, on="Store", how="left")
combine = [train_df, test_df]
train_df.head()

	Store	Dept	Date	Weekly_Sales	IsHoliday_x	Temperature	Fuel_Price	MarkDown1	MarkDown2	MarkDown3	MarkDown4	MarkDown5	CPI	Unemployment	IsHoliday_y	Type	Size
0	1	1	2010-02-05	24924.50	False	42.31	2.572	NaN	NaN	NaN	NaN	NaN	211.096358	8.106	False	A	151315
1	1	1	2010-02-12	46039.49	True	38.51	2.548	NaN	NaN	NaN	NaN	NaN	211.242170	8.106	True	A	151315
2	1	1	2010-02-19	41595.55	False	39.93	2.514	NaN	NaN	NaN	NaN	NaN	211.289143	8.106	False	A	151315
3	1	1	2010-02-26	19403.54	False	46.63	2.561	NaN	NaN	NaN	NaN	NaN	211.319643	8.106	False	A	151315
4	1	1	2010-03-05	21827.90	False	46.50	2.625	NaN	NaN	NaN	NaN	NaN	211.350143	8.106	False	A	151315

2.2 A first look at the data

2.2.1 info()

train_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 421570 entries, 0 to 421569
Data columns (total 17 columns):
Store           421570 non-null int64
Dept            421570 non-null int64
Date            421570 non-null object
Weekly_Sales    421570 non-null float64
IsHoliday_x     421570 non-null bool
Temperature     421570 non-null float64
Fuel_Price      421570 non-null float64
MarkDown1       150681 non-null float64
MarkDown2       111248 non-null float64
MarkDown3       137091 non-null float64
MarkDown4       134967 non-null float64
MarkDown5       151432 non-null float64
CPI             421570 non-null float64
Unemployment    421570 non-null float64
IsHoliday_y     421570 non-null bool
Type            421570 non-null object
Size            421570 non-null int64
dtypes: bool(2), float64(10), int64(3), object(2)
memory usage: 52.3+ MB

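Observations:

MarkDown has many missing values in the training set (only about 26% to 36% of rows have a value). However, the test set inspected below has this feature almost complete, and the correlation check below shows the MarkDown columns are among the features most correlated with Weekly_Sales, although the values are still small (below 0.1).
The remaining features have no missing values, so they need no imputation.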

test_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 115064 entries, 0 to 115063
Data columns (total 16 columns):
Store           115064 non-null int64
Dept            115064 non-null int64
Date            115064 non-null object
IsHoliday_x     115064 non-null bool
Temperature     115064 non-null float64
Fuel_Price      115064 non-null float64
MarkDown1       114915 non-null float64
MarkDown2       86437 non-null float64
MarkDown3       105235 non-null float64
MarkDown4       102176 non-null float64
MarkDown5       115064 non-null float64
CPI             76902 non-null float64
Unemployment    76902 non-null float64
IsHoliday_y     115064 non-null bool
Type            115064 non-null object
Size            115064 non-null int64
dtypes: bool(2), float64(9), int64(3), object(2)
memory usage: 13.4+ MB

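Looking at the test set:

The MarkDown data in the test set is fairly complete, so this feature may still be useful. Start with a version that leaves it out, and consider how to use this data in the improvement stage.
CPI and Unemployment have missing values in the test set.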
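Reading missing-value counts off the info() output works, but a short summary table is easier to scan. The sketch below uses a tiny made-up frame in place of train_df / test_df; the variable names are only for illustration.

```python
import numpy as np
import pandas as pd

# Tiny stand-in for the merged data
df = pd.DataFrame({
    "MarkDown1": [np.nan, 10.0, np.nan, 5.0],
    "CPI": [211.1, 211.2, np.nan, 211.3],
    "Size": [151315, 151315, 151315, 151315],
})

missing = df.isna().sum()                      # missing values per column
missing_share = (missing / len(df)).round(2)   # share of rows that are missing
report = pd.DataFrame({"missing": missing, "share": missing_share})
# Keep only the columns that actually have gaps, worst first
report = report[report["missing"] > 0].sort_values("missing", ascending=False)
print(report)
```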

2.2.2 describe()

train_df.describe()

	Store	Dept	Weekly_Sales	Temperature	Fuel_Price	MarkDown1	MarkDown2	MarkDown3	MarkDown4	MarkDown5	CPI	Unemployment	Size
count	421570.000000	421570.000000	421570.000000	421570.000000	421570.000000	150681.000000	111248.000000	137091.000000	134967.000000	151432.000000	421570.000000	421570.000000	421570.000000
mean	22.200546	44.260317	15981.258123	60.090059	3.361027	7246.420196	3334.628621	1439.421384	3383.168256	4628.975079	171.201947	7.960289	136727.915739
std	12.785297	30.492054	22711.183519	18.447931	0.458515	8291.221345	9475.357325	9623.078290	6292.384031	5962.887455	39.159276	1.863296	60980.583328
min	1.000000	1.000000	-4988.940000	-2.060000	2.472000	0.270000	-265.760000	-29.100000	0.220000	135.160000	126.064000	3.879000	34875.000000
25%	11.000000	18.000000	2079.650000	46.680000	2.933000	2240.270000	41.600000	5.080000	504.220000	1878.440000	132.022667	6.891000	93638.000000
50%	22.000000	37.000000	7612.030000	62.090000	3.452000	5347.450000	192.000000	24.600000	1481.310000	3359.450000	182.318780	7.866000	140167.000000
75%	33.000000	74.000000	20205.852500	74.280000	3.738000	9210.900000	1926.940000	103.990000	3595.040000	5563.800000	212.416993	8.572000	202505.000000
max	45.000000	99.000000	693099.360000	100.140000	4.468000	88646.760000	104519.540000	141630.610000	67474.850000	108519.280000	227.232807	14.313000	219622.000000

train_df.describe(include="O")

	Date	Type
count	421570	421570
unique	143	3
top	2011-12-23	A
freq	3027	215478

CPI and Unemployment have missing values in the test set, so take a look at how they are distributed.

test_df[["CPI","Unemployment"]].describe()

	CPI	Unemployment
count	76902.000000	76902.000000
mean	176.961347	6.868733
std	41.239967	1.583427
min	131.236226	3.684000
25%	138.402033	5.771000
50%	192.304445	6.806000
75%	223.244532	8.036000
max	228.976456	10.199000

2.2.3 corr()

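# Restrict to numeric/bool columns: recent pandas raises on corr() when object columns (Date, Type) are present
corr_matrix = train_df.select_dtypes(include=["number", "bool"]).corr()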
corr_matrix.Weekly_Sales.sort_values(ascending=False)

Weekly_Sales    1.000000
Size            0.243828
Dept            0.148032
MarkDown5       0.090362
MarkDown1       0.085251
MarkDown3       0.060385
MarkDown4       0.045414
MarkDown2       0.024130
IsHoliday_y     0.012774
IsHoliday_x     0.012774
Fuel_Price     -0.000120
Temperature    -0.002312
CPI            -0.020921
Unemployment   -0.025864
Store          -0.085195
Name: Weekly_Sales, dtype: float64

corr_matrix[["MarkDown1","MarkDown2","MarkDown3","MarkDown4","MarkDown5"]].sort_values(by="MarkDown5", ascending=False)

	MarkDown1	MarkDown2	MarkDown3	MarkDown4	MarkDown5
MarkDown5	0.160257	-0.007440	-0.026467	0.107792	1.000000
Size	0.345673	0.108827	0.048913	0.168196	0.304575
MarkDown1	1.000000	0.024486	-0.108115	0.819238	0.160257
MarkDown4	0.819238	-0.007768	-0.071095	1.000000	0.107792
Weekly_Sales	0.085251	0.024130	0.060385	0.045414	0.090362
CPI	-0.055558	-0.039534	-0.023590	-0.049628	0.060630
Dept	-0.002426	0.000290	0.001784	0.004257	0.000109
Unemployment	0.050285	0.020940	0.012818	0.024963	-0.003843
MarkDown2	0.024486	1.000000	-0.050108	-0.007768	-0.007440
Temperature	-0.040594	-0.323927	-0.096880	-0.063947	-0.017544
MarkDown3	-0.108115	-0.050108	1.000000	-0.071095	-0.026467
Store	-0.119588	-0.035173	-0.031556	-0.009941	-0.026634
IsHoliday_x	-0.035586	0.334818	0.427960	-0.000562	-0.053719
IsHoliday_y	-0.035586	0.334818	0.427960	-0.000562	-0.053719
Fuel_Price	0.061371	-0.220895	-0.102092	-0.044986	-0.128065

MarkDown1 and MarkDown4 are strongly correlated (0.82), so keeping one of them is enough; MarkDown4 is dropped.
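Instead of reading the correlation table by eye, strongly correlated feature pairs can be listed programmatically. This is a minimal sketch on synthetic data; the helper `correlated_pairs` and the 0.8 threshold are assumptions made for this example, not part of the original notebook.

```python
import numpy as np
import pandas as pd

def correlated_pairs(df, threshold=0.8):
    """Return (col_a, col_b, |corr|) for every column pair above the threshold."""
    corr = df.corr().abs()
    cols = corr.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):  # upper triangle only, skips the diagonal
            if corr.iloc[i, j] > threshold:
                pairs.append((cols[i], cols[j], round(float(corr.iloc[i, j]), 3)))
    return pairs

# Synthetic data: "b" is almost a rescaled copy of "a", "c" is independent noise
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": a * 2 + rng.normal(scale=0.1, size=200),
    "c": rng.normal(size=200),
})
pairs = correlated_pairs(toy)
print(pairs)  # only the (a, b) pair is expected to appear
```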

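2.3 Data cleaning

2.3.1 Handling missing values

# MarkDown: the missing MarkDown values in the training set are left alone for now; later the training set is split into two versions, one with the gaps filled and one with the incomplete rows dropped
test_df[['MarkDown1','MarkDown2','MarkDown3','MarkDown5']] = test_df[['MarkDown1','MarkDown2','MarkDown3','MarkDown5']].fillna(0)
# Forward fill; .ffill() replaces fillna(method="ffill"), which is deprecated in recent pandas
test_df[["CPI","Unemployment"]] = test_df[["CPI","Unemployment"]].ffill()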
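A plain forward fill copies the last seen value down the column, which can carry a value across a store boundary if a store's first rows are missing. A hedged alternative is to forward fill within each store. The toy frame below is made up to show the difference and does not claim that this situation actually occurs in test.csv.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Store": [1, 1, 2, 2],
    "CPI": [211.0, np.nan, np.nan, 130.0],
})

# Plain forward fill: store 2's first row inherits store 1's CPI
global_fill = df["CPI"].ffill()
# Forward fill inside each store: nothing leaks across stores
per_store_fill = df.groupby("Store")["CPI"].ffill()

print(global_fill.tolist())     # [211.0, 211.0, 211.0, 130.0]
print(per_store_fill.tolist())  # [211.0, 211.0, nan, 130.0]
```

With the per-store variant, gaps at the start of a store stay empty and need a second rule (for example a backward fill) if the model cannot handle NaN.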

2.3.2 Creating new features

Convert Type to one-hot encoding

train_df = pd.get_dummies(train_df, columns=["Type"])
test_df = pd.get_dummies(test_df, columns=["Type"])
train_df.head()

	Store	Dept	Date	Weekly_Sales	IsHoliday_x	Temperature	Fuel_Price	MarkDown1	MarkDown2	MarkDown3	MarkDown4	MarkDown5	CPI	Unemployment	IsHoliday_y	Size	Type_A
0	1	1	2010-02-05	24924.50	False	42.31	2.572	NaN	NaN	NaN	NaN	NaN	211.096358	8.106	False	151315	1
1	1	1	2010-02-12	46039.49	True	38.51	2.548	NaN	NaN	NaN	NaN	NaN	211.242170	8.106	True	151315	1
2	1	1	2010-02-19	41595.55	False	39.93	2.514	NaN	NaN	NaN	NaN	NaN	211.289143	8.106	False	151315	1
3	1	1	2010-02-26	19403.54	False	46.63	2.561	NaN	NaN	NaN	NaN	NaN	211.319643	8.106	False	151315	1
4	1	1	2010-03-05	21827.90	False	46.50	2.625	NaN	NaN	NaN	NaN	NaN	211.350143	8.106	False	151315	1

Replace the date with the month

train_df['Month'] = pd.to_datetime(train_df['Date']).dt.month
test_df["Month"] = pd.to_datetime(test_df['Date']).dt.month
# Remember to drop Date later; keep it in test for now because it is needed further down
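The month is a coarse view of the calendar. Since the data is weekly, the ISO week number is another feature that could be derived in the same way. This is only a sketch of an optional extension, not something the original notebook does; it assumes pandas 1.1 or later for `dt.isocalendar()`.

```python
import pandas as pd

dates = pd.to_datetime(pd.Series(["2010-02-05", "2010-11-26", "2010-12-31"]))

month = dates.dt.month
week = dates.dt.isocalendar().week.astype(int)  # ISO week number of the year

print(month.tolist())  # [2, 11, 12]
print(week.tolist())   # [5, 47, 52]
```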

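Temperature
Presumably people go out less in extreme weather. So the rows are split into two groups: extreme (below 22.01 or above 91.03) and normal. The thresholds come from the temperature distribution (read off a histogram, which I have since removed).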

train_df.loc[(train_df["Temperature"]<22.01)|(train_df["Temperature"]>91.03), "Is_temp_extr"]=1
train_df.loc[(train_df["Temperature"]>=22.01)& (train_df["Temperature"]<=91.03), "Is_temp_extr"]=0

test_df.loc[(test_df["Temperature"]<22.01)|(test_df["Temperature"]>91.03), "Is_temp_extr"]=1
test_df.loc[(test_df["Temperature"]>=22.01)& (test_df["Temperature"]<=91.03), "Is_temp_extr"]=0

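# Numeric/bool columns only, so corr() also runs on recent pandas
train_df.select_dtypes(include=["number", "bool"]).corr().Weekly_Sales.sort_values(ascending=False)[["Temperature", "Is_temp_extr"]]
# The new feature's correlation is more than ten times larger in magnitude (though still weak). Remember to drop the original Temperature later.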

Temperature    -0.002312
Is_temp_extr   -0.030016
Name: Weekly_Sales, dtype: float64
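The extreme-temperature flag can also be written as a single expression. The sketch below uses the same thresholds as the text on a made-up series; `Series.between` is inclusive on both ends by default, which matches the `>=` / `<=` conditions used for the normal range.

```python
import pandas as pd

LOW, HIGH = 22.01, 91.03  # thresholds taken from the text above

temps = pd.Series([10.0, 22.01, 60.0, 91.03, 95.5])
# 1 if the temperature lies outside [LOW, HIGH], else 0
is_extreme = (~temps.between(LOW, HIGH)).astype(int)
print(is_extreme.tolist())  # [1, 0, 0, 0, 1]
```

Putting the thresholds in named constants also guarantees that the training and test sets are flagged with the same rule.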

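Fuel price
Do people stay home when fuel is too expensive?

train_df.loc[train_df["Fuel_Price"]>3.47, "Is_fuel_expen"]=1
train_df.loc[train_df["Fuel_Price"]<=3.47, "Is_fuel_expen"]=0
# However the threshold is chosen, this correlation stays very low, so the feature is dropped later
train_df.select_dtypes(include=["number", "bool"]).corr().Weekly_Sales.sort_values(ascending=False)[["Fuel_Price", "Is_fuel_expen"]]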

Fuel_Price      -0.000120
Is_fuel_expen   -0.006626
Name: Weekly_Sales, dtype: float64

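IsHoliday
Both train.csv and features.csv carry an IsHoliday column, so the merge above produced two copies (IsHoliday_x and IsHoliday_y); keeping one is enough.
Also, the boolean values are mapped to 0 and 5 (5 is the holiday weight used later; note that the metric itself gives non-holiday weeks a weight of 1, not 0).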

train_df["IsHoliday"] = train_df["IsHoliday_x"].replace(True, 5).replace(False,0)
test_df["IsHoliday"] = test_df["IsHoliday_x"].replace(True, 5).replace(False,0)

train_df.select_dtypes(include=["number", "bool"]).corr().Weekly_Sales.sort_values(ascending=False)[["IsHoliday_x", "IsHoliday"]]

IsHoliday_x    0.012774
IsHoliday      0.012774
Name: Weekly_Sales, dtype: float64
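The duplicate IsHoliday_x / IsHoliday_y columns appear because IsHoliday exists in both files but is not part of the merge key. A possible alternative, assuming the two flags always agree (their identical correlations above suggest they do), is to add IsHoliday to the key. The two small frames below are stand-ins for train.csv and features.csv, using values from the first rows shown earlier.

```python
import pandas as pd

sales = pd.DataFrame({
    "Store": [1, 1], "Dept": [1, 1],
    "Date": ["2010-02-05", "2010-02-12"],
    "Weekly_Sales": [24924.50, 46039.49],
    "IsHoliday": [False, True],
})
feats = pd.DataFrame({
    "Store": [1, 1],
    "Date": ["2010-02-05", "2010-02-12"],
    "Temperature": [42.31, 38.51],
    "IsHoliday": [False, True],
})

# With IsHoliday in the key, pandas keeps a single IsHoliday column
merged = sales.merge(feats, on=["Store", "Date", "IsHoliday"], how="left")
print(merged.columns.tolist())
# ['Store', 'Dept', 'Date', 'Weekly_Sales', 'IsHoliday', 'Temperature']
```

If the flags ever disagreed, the left join would leave the feature columns empty for those rows, so it is worth checking for new NaN values after such a merge.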

2.3.3 Dropping features

train_df = train_df.drop(["IsHoliday_x", "IsHoliday_y",'MarkDown4',"Date", "Temperature", "Fuel_Price","Is_fuel_expen"], axis=1)

# The submission file needs this variable, and it uses the test set's Date column, so build id here before Date is dropped
id = test_df["Store"].astype(str)+"_"+test_df["Dept"].astype(str)+"_"+test_df["Date"].astype(str)
test_df = test_df.drop(["IsHoliday_x", "IsHoliday_y", "MarkDown4", "Date","Temperature", "Fuel_Price"], axis=1)

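2.3.4 Final check

Before the data goes into a model it must contain no missing values, so check once more.

First make two versions of the training set: one where the missing MarkDown values are filled with 0, and one where the rows with missing values are dropped.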

train_df_one = train_df.copy()
train_df_two = train_df.copy()
train_df_one[['MarkDown1','MarkDown2','MarkDown3','MarkDown5']] = train_df_one[['MarkDown1','MarkDown2','MarkDown3','MarkDown5']].fillna(0)
train_df_two.dropna(inplace=True)

train_df_one.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 421570 entries, 0 to 421569
Data columns (total 16 columns):
Store           421570 non-null int64
Dept            421570 non-null int64
Weekly_Sales    421570 non-null float64
MarkDown1       421570 non-null float64
MarkDown2       421570 non-null float64
MarkDown3       421570 non-null float64
MarkDown5       421570 non-null float64
CPI             421570 non-null float64
Unemployment    421570 non-null float64
Size            421570 non-null int64
Type_A          421570 non-null uint8
Type_B          421570 non-null uint8
Type_C          421570 non-null uint8
Month           421570 non-null int64
Is_temp_extr    421570 non-null float64
IsHoliday       421570 non-null float64
dtypes: float64(9), int64(4), uint8(3)
memory usage: 46.2 MB

train_df_two.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 101480 entries, 92 to 421569
Data columns (total 16 columns):
Store           101480 non-null int64
Dept            101480 non-null int64
Weekly_Sales    101480 non-null float64
MarkDown1       101480 non-null float64
MarkDown2       101480 non-null float64
MarkDown3       101480 non-null float64
MarkDown5       101480 non-null float64
CPI             101480 non-null float64
Unemployment    101480 non-null float64
Size            101480 non-null int64
Type_A          101480 non-null uint8
Type_B          101480 non-null uint8
Type_C          101480 non-null uint8
Month           101480 non-null int64
Is_temp_extr    101480 non-null float64
IsHoliday       101480 non-null float64
dtypes: float64(9), int64(4), uint8(3)
memory usage: 11.1 MB

test_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 115064 entries, 0 to 115063
Data columns (total 15 columns):
Store           115064 non-null int64
Dept            115064 non-null int64
MarkDown1       115064 non-null float64
MarkDown2       115064 non-null float64
MarkDown3       115064 non-null float64
MarkDown5       115064 non-null float64
CPI             115064 non-null float64
Unemployment    115064 non-null float64
Size            115064 non-null int64
Type_A          115064 non-null uint8
Type_B          115064 non-null uint8
Type_C          115064 non-null uint8
Month           115064 non-null int64
Is_temp_extr    115064 non-null float64
IsHoliday       115064 non-null float64
dtypes: float64(8), int64(4), uint8(3)
memory usage: 11.7 MB

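2.4 Models and prediction

For quick testing I wrote a class. Most of my case write-ups reuse it, with small changes each time because the evaluation metric differs.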

import time
import os
from sklearn.metrics import mean_absolute_error
from sklearn.base import clone

class Tester():
    def __init__(self, target):
        self.target = target
        self.datasets = {}
        self.models = {}
        self.scores = {}
        self.cache = {} # a simple cache to speed things up

    def addDataset(self, name, df):
        self.datasets[name] = df.copy()

    def addModel(self, name, model):
        self.models[name] = model
        
    def clearModels(self):
        self.models = {}

    def clearCache(self):
        self.cache = {}
    
    def testModelWithDataset(self, m_name, df_name, sample_len, cv):
        if (m_name, df_name, sample_len, cv) in self.cache:
            return self.cache[(m_name, df_name, sample_len, cv)]

        clf = clone(self.models[m_name])
        
        if not sample_len: 
            sample = self.datasets[df_name]
        else: sample = self.datasets[df_name].sample(sample_len)
            
        X = sample.drop([self.target], axis=1)
        Y = sample[self.target]

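        # Change this part if the scoring metric differs.
        # WMAE weights are 5 for holiday weeks and 1 otherwise; the IsHoliday feature stores 5/0, so map 0 to 1
        weights = X["IsHoliday"].replace(0, 1)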
        clf.fit(X, Y)
        Y_pred = clf.predict(X)
        s = mean_absolute_error(Y, Y_pred, sample_weight=weights)
        self.cache[(m_name, df_name, sample_len, cv)] = s

        return s

    def runTests(self, sample_len=97056, cv=3):
        # Test every added model on every added dataset
        for m_name in self.models:
            for df_name in self.datasets:
                # print('Testing %s' % str((m_name, df_name)), end='')
                start = time.time()

                score = self.testModelWithDataset(m_name, df_name, sample_len, cv)
                self.scores[(m_name, df_name)] = score
                
                end = time.time()
                
                # print(' -- %0.2fs ' % (end - start))

        print('--- Top 10 Results ---')
        # If the scoring metric changes, this part must change too
        for score in sorted(self.scores.items(), key=lambda x: x[1])[:10]:
            # score = int(score[1])
            print(score)

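    def obtian_result(self, X_test):
        # Score keys are (model name, dataset name) tuples; take the lowest score
        m_name, df_name = sorted(self.scores.items(), key=lambda x: x[1])[0][0]
        train = self.datasets[df_name]
        # runTests only fits clones, so refit the chosen model before predicting
        clf = clone(self.models[m_name]).fit(train.drop([self.target], axis=1), train[self.target])
        return clf.predict(X_test)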

from sklearn.ensemble import ExtraTreesRegressor, RandomForestRegressor, GradientBoostingRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.feature_selection import RFE
from sklearn.neural_network import MLPRegressor

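# The same tester object is used for all models
tester = Tester('Weekly_Sales')

# Add the datasets
tester.addDataset('all_markdown', train_df_one)
tester.addDataset('wipe_markdown', train_df_two)

# Add the models
knn_reg = KNeighborsRegressor(n_neighbors=10)
# max_features=1.0 uses all features; 'auto' meant the same for regressors and was removed in scikit-learn 1.3
tree_reg = ExtraTreesRegressor(n_estimators=100,max_features=1.0, verbose=1, n_jobs=1)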
rf_reg = RandomForestRegressor(n_estimators=100,max_features='log2', verbose=1)
svr_reg = SVR(kernel='rbf', gamma='auto')
mlp_reg = MLPRegressor(hidden_layer_sizes=(10,),  activation='relu', verbose=3)
gbrt_reg = GradientBoostingRegressor(max_depth=8, warm_start=True)
tester.addModel('KNeighborsRegressor', knn_reg)
tester.addModel('ExtraTreesRegressor', tree_reg)
tester.addModel('RandomForestRegressor', rf_reg)
tester.addModel('SVR', svr_reg)
tester.addModel('MLPRegressor', mlp_reg)
tester.addModel('GradientBoostingRegressor', gbrt_reg)

# Run the tests
tester.runTests()

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed:   26.4s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed:    3.8s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed:   28.6s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed:    3.8s finished
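The Tester class above fits each model and scores it on the same rows, so the numbers it prints are training errors and its cv argument is never used. A hedged sketch of a hold-out evaluation is shown below on synthetic data (not the Walmart files). The random split, the model settings and the helper name `wmae` are all assumptions for illustration; for a forecasting task, a split by date would be closer to the real test setting.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 3 numeric features plus a holiday flag in column 3
rng = np.random.default_rng(42)
n = 500
is_holiday = (rng.random(n) < 0.1).astype(int)
X = np.column_stack([rng.normal(size=(n, 3)), is_holiday])
y = 1000 + 300 * X[:, 0] + 500 * is_holiday + rng.normal(scale=50, size=n)

# Hold out 25% of the rows that the model never sees during fitting
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.25, random_state=0)

model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(X_tr, y_tr)

def wmae(model, X, y):
    """Weighted MAE with weight 5 on holiday rows and 1 elsewhere."""
    weights = np.where(X[:, 3] == 1, 5, 1)
    return mean_absolute_error(y, model.predict(X), sample_weight=weights)

train_wmae = wmae(model, X_tr, y_tr)
valid_wmae = wmae(model, X_va, y_va)
print("train WMAE:", round(train_wmae, 1))
print("validation WMAE:", round(valid_wmae, 1))
```

The validation score is the one to compare across models; tree ensembles in particular tend to look far better on their own training rows than on unseen data.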

X = train_df_one.drop(["Weekly_Sales"], axis=1)
Y = train_df_one["Weekly_Sales"]

gbrt_reg.fit(X, Y)
Y_pred = gbrt_reg.predict(test_df)
submission = pd.DataFrame({
        "Id": id,
        "Weekly_Sales": pd.DataFrame(Y_pred)[0]
    })
id

submission.to_csv('submission.csv', index=False)
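Before uploading, a few cheap checks on the submission frame can catch format mistakes. The helper below is a hypothetical sketch run on a made-up two-row frame; the rules it applies (column names, row count, unique Ids of the form Store_Dept_Date, no missing predictions) follow the submission format described in section 1.2.

```python
import pandas as pd

def check_submission(df, expected_rows):
    """Return a list of problems found in a submission frame (empty if none)."""
    if list(df.columns) != ["Id", "Weekly_Sales"]:
        return ["unexpected columns"]
    problems = []
    if len(df) != expected_rows:
        problems.append("wrong number of rows")
    if df["Id"].duplicated().any():
        problems.append("duplicate Ids")
    if df["Weekly_Sales"].isna().any():
        problems.append("missing predictions")
    # Store_Dept_Date contains exactly two underscores
    if not df["Id"].str.count("_").eq(2).all():
        problems.append("malformed Ids")
    return problems

toy_submission = pd.DataFrame({
    "Id": ["1_1_2012-11-02", "1_1_2012-11-09"],
    "Weekly_Sales": [20000.0, 21000.5],
})
print(check_submission(toy_submission, expected_rows=2))  # []
```

For the real file, expected_rows would be the number of rows in the test set (115064 according to the info() output above).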

Original article: https://blog.csdn.net/weixin_42662126/article/details/100005632