Tuning a CatBoostClassifier with GridSearchCV

Hands-on:

import catboost as cbt
from sklearn.model_selection import GridSearchCV


def print_best_score(gsearch, param_test):
    # Print the best cross-validated score
    print("Best score: %0.3f" % gsearch.best_score_)
    print("Best parameters set:")
    # Print the value the best estimator used for each tuned parameter
    best_parameters = gsearch.best_estimator_.get_params()
    for param_name in sorted(param_test.keys()):
        print("\t%s: %r" % (param_name, best_parameters[param_name]))

params = {
    'depth': [4, 6, 10],
    'learning_rate': [0.05, 0.1, 0.15],
    # 'l2_leaf_reg': [1, 4, 9],
    # 'iterations': [1200],
    # 'early_stopping_rounds': [1000],
    # 'task_type': ['GPU'],
    # 'loss_function': ['MultiClass'],
}
# cb = cbt.CatBoostClassifier()
estimator = cbt.CatBoostClassifier(iterations=2000, verbose=400, early_stopping_rounds=200,
                                   task_type='GPU', loss_function='MultiClass')

cbt_model = GridSearchCV(estimator, param_grid=params, scoring="accuracy", cv=3)

# cbt_model = cbt.CatBoostClassifier(iterations=1200,learning_rate=0.05,verbose=300,
# early_stopping_rounds=1000,task_type='GPU',
# loss_function='MultiClass')

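# train_x, train_y and test_x are assumed to be loaded already.
# Caution: eval_set is the training data itself, so early stopping is judged
# on data the model has already seen; a held-out set would be more meaningful.
cbt_model.fit(train_x, train_y, eval_set=(train_x, train_y))
# cbt_model.cv_results_, cbt_model.best_params_, cbt_model.best_score_
print_best_score(cbt_model, params)
# Class probabilities for the test set
oof = cbt_model.predict_proba(test_x)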
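
To gauge how expensive the search above is, it helps to enumerate the grid by hand. The following is a minimal sketch using only the standard library. `expand_grid` is a hypothetical helper name introduced here for illustration; scikit-learn offers `sklearn.model_selection.ParameterGrid` for the same job. The fit count assumes the default `refit=True`.

```python
from itertools import product


def expand_grid(param_grid):
    """Return every parameter combination in the grid as a list of dicts."""
    names = sorted(param_grid)
    return [dict(zip(names, values))
            for values in product(*(param_grid[name] for name in names))]


params = {'depth': [4, 6, 10], 'learning_rate': [0.05, 0.1, 0.15]}
candidates = expand_grid(params)
cv_folds = 3

# Each candidate is trained once per fold; with refit=True the best
# candidate is then trained one more time on the full training data.
total_fits = len(candidates) * cv_folds + 1

print(len(candidates), total_fits)
```

With up to 2000 iterations per fit, each extra value added to the grid multiplies the training cost, so it can be worth starting with a coarse grid and refining around the best candidate.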

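Concepts used above:
Parameters and methods of GridSearchCV (grid search)
class sklearn.model_selection.GridSearchCV(estimator, param_grid, scoring=None, fit_params=None, n_jobs=1, iid=True, refit=True, cv=None, verbose=0, pre_dispatch='2*n_jobs', error_score='raise', return_train_score='warn')
This is the signature from older scikit-learn releases; newer releases have removed fit_params and iid and changed several defaults.
(1) estimator
The estimator to tune, constructed with every parameter that is not being searched. The estimator must either provide a score method or be paired with an explicit scoring argument, for example: estimator=RandomForestClassifier(min_samples_split=100, min_samples_leaf=20, max_depth=8, max_features='sqrt', random_state=10)
(2) param_grid
The values to search over, given as a dict or a list of dicts, for example: param_grid=param_test1, with param_test1 = {'n_estimators': range(10, 71, 10)}.
(3) scoring=None
The evaluation metric. With the default None, the estimator's own score method is used. Otherwise pass a string naming a metric, such as scoring='roc_auc' (the appropriate metric depends on the model and the task), or a callable with the signature scorer(estimator, X, y).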
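
To make the callable form of scoring concrete, here is a minimal sketch of a scorer with the scorer(estimator, X, y) signature. It is plain Python and assumes that `predict` returns a flat sequence of labels; `accuracy_scorer` and `MajorityClassifier` are hypothetical names used only for this illustration, and the toy class merely stands in for a fitted estimator.

```python
def accuracy_scorer(estimator, X, y):
    """Scorer with the (estimator, X, y) signature; higher is better."""
    predictions = estimator.predict(X)
    correct = sum(1 for pred, truth in zip(predictions, y) if pred == truth)
    return correct / len(y)


class MajorityClassifier:
    """Toy stand-in for a fitted estimator: always predicts one label."""

    def __init__(self, label):
        self.label = label

    def predict(self, X):
        return [self.label for _ in X]


X = [[0.1], [0.4], [0.7], [0.9]]
y = [1, 1, 0, 1]
score = accuracy_scorer(MajorityClassifier(label=1), X, y)
print(score)  # 0.75
```

A function of this shape could be passed as scoring=accuracy_scorer; grid search treats larger return values as better, so an error measure would need to be negated first.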
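CatBoostClassifier/CatBoostRegressor
Common parameters
learning_rate (alias: eta): chosen automatically by default
depth (alias: max_depth) = 6: depth of the trees
l2_leaf_reg (alias: reg_lambda) = 3: L2 regularization coefficient
iterations (aliases: n_estimators, num_boost_round, num_trees) = 1000: maximum number of trees that can be built
one_hot_max_size = 2: categorical features with at most this many distinct values are one-hot encoded
loss_function = 'Logloss': the loss function to optimize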
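
As a rough illustration of the one_hot_max_size rule in the list above, the sketch below picks out the categorical columns whose number of distinct values does not exceed the threshold. It is plain Python and does not call CatBoost; `one_hot_columns` is a hypothetical helper, and the column names and values are made up for the example.

```python
def one_hot_columns(columns, one_hot_max_size=2):
    """Return names of columns with at most one_hot_max_size distinct values."""
    return [name for name, values in columns.items()
            if len(set(values)) <= one_hot_max_size]


columns = {
    'sex': ['male', 'female', 'female', 'male'],
    'embarked': ['S', 'C', 'Q', 'S'],
}

selected_default = one_hot_columns(columns)
selected_wider = one_hot_columns(columns, one_hot_max_size=3)
print(selected_default)  # ['sex']
print(selected_wider)    # ['sex', 'embarked']
```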
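CatBoost has two main advantages. First, it handles categorical features during training rather than in a separate preprocessing step. Second, the algorithm it uses to compute leaf values when selecting the tree structure helps reduce overfitting.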

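Note:
When tuning CatBoost, it is awkward to pass the indices of the categorical features through the search. For that reason, the parameters were tuned without passing categorical features, and two models were then evaluated: one with categorical features and one without. I tuned one_hot_max_size separately, because it does not affect the other parameters.
If nothing is passed in the cat_features argument, CatBoost treats every column as numeric. Note that if a column contains string values, CatBoost will then raise an error. Columns with an integer dtype are likewise treated as numeric by default. In CatBoost, a column must be declared explicitly before the algorithm will treat it as categorical.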

cat_features_index = [2, 3, 4, 5, 6, 7]
# With categorical features: declare the column indices explicitly
clf = cbt.CatBoostClassifier(iterations=2000, learning_rate=0.05, verbose=300, early_stopping_rounds=1000,
                             task_type='GPU', loss_function='MultiClass', depth=4)
clf.fit(train_x, train_y, cat_features=cat_features_index)
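
The index list above is written by hand. As a minimal sketch of how the string-valued columns could be found automatically, the helper below scans rows of raw data and returns the indices of columns that contain at least one string. `string_column_indices` is a hypothetical helper and the sample rows are made up; it assumes the data is a list of equal-length rows.

```python
def string_column_indices(rows):
    """Return indices of columns that contain at least one str value."""
    n_columns = len(rows[0])
    return [i for i in range(n_columns)
            if any(isinstance(row[i], str) for row in rows)]


rows = [
    [22.0, 7.25, 'male', 'S', 3],
    [38.0, 71.28, 'female', 'C', 1],
]
detected = string_column_indices(rows)
print(detected)  # [2, 3]
```

Integer-coded categories, such as the last column here, are not detected this way and would still have to be added to the list manually.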

References:
https://www.kaggle.com/manrunning/catboost-for-titanic-top-7
https://blog.csdn.net/weixin_41988628/article/details/83098130
https://blog.csdn.net/linxid/article/details/80723811
http://www.atyun.com/4650.html
https://www.cnblogs.com/nxf-rabbit75/p/10923549.html

Original link: https://blog.csdn.net/cy_believ/article/details/101062776