Use the trained ridge regression to predict on the train_test split, and measure model quality with the mean squared error.
ridge = Ridge(alpha=5)
ridge.fit(df_train_train, df_train_train_y)
# Use mean squared error to judge the model; smaller is better
(((df_train_test_y - ridge.predict(df_train_test)) ** 2).sum()) / len(df_train_test_y)

Out: 1983899445.438339
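The manual sum-of-squares expression above is exactly the mean squared error, so `sklearn.metrics.mean_squared_error` gives the same number. A minimal sketch with synthetic stand-ins for the post's `df_train_train`/`df_train_test` frames (the random data here is an assumption, not the competition data):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

# Synthetic stand-ins for the train/holdout split used in the post
rng = np.random.default_rng(0)
coef = np.array([3.0, -1.0, 2.0, 0.5, 1.5])
X_train = rng.normal(size=(80, 5))
y_train = X_train @ coef + rng.normal(scale=0.1, size=80)
X_test = rng.normal(size=(20, 5))
y_test = X_test @ coef + rng.normal(scale=0.1, size=20)

ridge = Ridge(alpha=5)
ridge.fit(X_train, y_train)

pred = ridge.predict(X_test)
manual_mse = ((y_test - pred) ** 2).sum() / len(y_test)  # the post's expression
library_mse = mean_squared_error(y_test, pred)           # equivalent library call
```

Both forms agree to floating-point precision, so either can be used for the holdout check.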
Random forests can also be used for regression; they handle high-dimensional data well and require no feature selection.
# Tune the random forest's max_features, again using cross-validation
from sklearn.ensemble import RandomForestRegressor

max_features = [.1, .2, .3, .4, .5, .6, .7, .8, .9]
test_score = []
for max_feature in max_features:
    clf = RandomForestRegressor(max_features=max_feature, n_estimators=100)
    score = np.sqrt(cross_val_score(clf, df_train_train, df_train_train_y, cv=5))
    test_score.append(1 - np.mean(score))
plt.plot(max_features, test_score)  # error-score plot, Figure 3
The plot shows the error is smallest when max_features is 0.5, so we fix max_features=0.5.
Use the trained random forest to predict on the train_test split, again measuring model quality with the mean squared error.
rf = RandomForestRegressor(max_features=0.5, n_estimators=100)
rf.fit(df_train_train, df_train_train_y)
# Use mean squared error to judge the model; smaller is better
(((df_train_test_y - rf.predict(df_train_test)) ** 2).sum()) / len(df_train_test_y)

Out: 1108361750.5652797
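The manual loop over `max_features` can also be automated with `GridSearchCV`, which runs the same cross-validation per candidate value and picks the best. A sketch on synthetic data (the random data, grid values, and `random_state` are assumptions for illustration; a smaller `n_estimators` keeps it fast):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic regression data standing in for the housing features
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 8))
y = 4 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=120)

# Negative MSE as the scoring, so GridSearchCV's "higher is better"
# convention corresponds to lower error
grid = GridSearchCV(
    RandomForestRegressor(n_estimators=30, random_state=0),
    param_grid={"max_features": [0.1, 0.3, 0.5, 0.7, 0.9]},
    scoring="neg_mean_squared_error",
    cv=5,
)
grid.fit(X, y)
best = grid.best_params_["max_features"]
```

`grid.cv_results_` also exposes the per-candidate mean scores, which can be plotted just like the manual error-score curves.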
Use the Bagging (bootstrap aggregating) ensemble framework to combine ridge regressors.
Tuning step 1: find a suitable number of sub-models
# Load the library
from sklearn.ensemble import BaggingRegressor

# Tune the number of sub-models
ridge = Ridge(5)
params = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
test_scores = []
for param in params:
    clf = BaggingRegressor(n_estimators=param, base_estimator=ridge)
    score = np.sqrt(cross_val_score(clf, df_train_train, df_train_train_y, cv=5))
    test_scores.append(1 - np.mean(score))
plt.plot(params, test_scores)  # Figure 4
The error is smallest when 70 sub-models are trained.
Tuning step 2: find a suitable maximum number of features
max_features = [.1, .2, .3, .4, .5, .6, .7, .8, .9]
test_scores = []
for max_feature in max_features:
    clf = BaggingRegressor(n_estimators=70, base_estimator=ridge, max_features=max_feature)
    score = np.sqrt(cross_val_score(clf, df_train_train, df_train_train_y, cv=5))
    test_scores.append(1 - np.mean(score))
plt.plot(max_features, test_scores)  # Figure 5
The error is smallest when max_features is 0.6.
Tuning done; now validate the model.
Bagging = BaggingRegressor(n_estimators=70, base_estimator=ridge, max_features=0.6)
Bagging.fit(df_train_train, df_train_train_y)
# Use mean squared error to judge the model; smaller is better
(((df_train_test_y - Bagging.predict(df_train_test)) ** 2).sum()) / len(df_train_test_y)

Out: 1960180964.6378567
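One design note on bagging: since each sub-model trains on a bootstrap sample, passing `oob_score=True` gives a "free" out-of-bag estimate of generalization without a separate holdout split. A sketch with synthetic data (the data and `random_state` are assumptions; the base estimator is passed positionally to sidestep the `base_estimator`/`estimator` rename across scikit-learn versions):

```python
import numpy as np
from sklearn.ensemble import BaggingRegressor
from sklearn.linear_model import Ridge

# Synthetic regression data standing in for the housing features
rng = np.random.default_rng(0)
coef = np.array([2.0, -1.0, 0.5, 3.0, 0.0, 1.0])
X = rng.normal(size=(200, 6))
y = X @ coef + rng.normal(scale=0.1, size=200)

bag = BaggingRegressor(
    Ridge(alpha=5),       # base estimator, passed positionally
    n_estimators=70,
    max_features=0.6,
    oob_score=True,       # score each sample on the sub-models that did not see it
    random_state=0,
)
bag.fit(X, y)
oob_r2 = bag.oob_score_   # R^2 estimated from out-of-bag predictions
```

This is cheaper than running an extra cross-validation pass, though it measures R^2 rather than the MSE used above.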
After submitting, the error score was 0.1485.
That ranks around the top 50% of 4,000-plus entries. There is still plenty of room to optimize; I'll keep improving it over time~