Kaggle (1): House Price Prediction (2)

Use the trained ridge regression to predict on train_test, and evaluate the model with the mean squared error:

```python
ridge = Ridge(alpha=5)
ridge.fit(df_train_train, df_train_train_y)
# Evaluate with mean squared error: the smaller, the better
(((df_train_test_y - ridge.predict(df_train_test)) ** 2).sum()) / len(df_train_test_y)
```

Out[ ]:
1983899445.438339
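The same mean-squared-error expression is reused verbatim for every model in this post. As a side note, it can be factored into a small helper; this is a sketch using only NumPy, and the name `mse` is my own, not from the original code:

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error: the smaller, the better."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return ((y_true - y_pred) ** 2).sum() / len(y_true)

# Toy check: errors are 0, 0 and 2, so the MSE is (0 + 0 + 4) / 3
print(mse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))  # → 1.3333333333333333
```

With this helper, each evaluation cell reduces to `mse(df_train_test_y, model.predict(df_train_test))`.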

2. Random Forest

Random forests can also be used for regression. They handle high-dimensional data well and do not require prior feature selection.

```python
# Tune the random forest's max_features, again using cross-validation
from sklearn.ensemble import RandomForestRegressor

max_features = [.1, .2, .3, .4, .5, .6, .7, .8, .9]
test_score = []
for max_feature in max_features:
    clf = RandomForestRegressor(max_features=max_feature, n_estimators=100)
    score = np.sqrt(cross_val_score(clf, df_train_train, df_train_train_y, cv=5))
    test_score.append(1 - np.mean(score))
plt.plot(max_features, test_score)  # plot the error scores
```

Figure 3

The plot shows that the error is smallest when max_features is 0.5, so we use max_features=0.5 below.
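As an alternative to the manual loop above, scikit-learn's `GridSearchCV` can run the same cross-validated sweep over `max_features` and pick the best value automatically. A minimal sketch on synthetic data (the real preprocessed training frames from part 1 are not reproduced here):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the preprocessed training data
X, y = make_regression(n_samples=200, n_features=20, noise=10.0, random_state=0)

# Cross-validated grid search over max_features, scored by (negated) MSE
grid = GridSearchCV(
    RandomForestRegressor(n_estimators=100, random_state=0),
    param_grid={"max_features": [0.3, 0.5, 0.7]},
    scoring="neg_mean_squared_error",
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_)
```

`grid.best_estimator_` is then already refit on the full training data with the winning parameter, so the separate refit step below becomes optional.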

Use the trained random forest to predict on train_test, and evaluate the model with the mean squared error:

```python
rf = RandomForestRegressor(max_features=0.5, n_estimators=100)
rf.fit(df_train_train, df_train_train_y)
# Evaluate with mean squared error: the smaller, the better
(((df_train_test_y - rf.predict(df_train_test)) ** 2).sum()) / len(df_train_test_y)
```

Out[ ]:
1108361750.5652797

Ensemble Learning

Use the Bagging (bootstrap aggregating) ensemble framework to combine ridge regression models.

Tuning step 1: find a suitable number of base estimators

```python
# Load the relevant library
from sklearn.ensemble import BaggingRegressor

# Tune the number of base estimators
ridge = Ridge(alpha=5)
params = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
test_scores = []
for param in params:
    clf = BaggingRegressor(n_estimators=param, base_estimator=ridge)
    score = np.sqrt(cross_val_score(clf, df_train_train, df_train_train_y, cv=5))
    test_scores.append(1 - np.mean(score))
plt.plot(params, test_scores)
```

Figure 4

The error is smallest when 70 base models are trained.

Tuning step 2: find a suitable maximum number of features

```python
max_features = [.1, .2, .3, .4, .5, .6, .7, .8, .9]
test_scores = []
for max_feature in max_features:
    clf = BaggingRegressor(n_estimators=70, base_estimator=ridge, max_features=max_feature)
    score = np.sqrt(cross_val_score(clf, df_train_train, df_train_train_y, cv=5))
    test_scores.append(1 - np.mean(score))
plt.plot(max_features, test_scores)
```

Figure 5

The error is smallest when max_features is 0.6.

With tuning done, validate the model:

```python
bagging = BaggingRegressor(n_estimators=70, base_estimator=ridge, max_features=0.6)
bagging.fit(df_train_train, df_train_train_y)
# Evaluate with mean squared error: the smaller, the better
(((df_train_test_y - bagging.predict(df_train_test)) ** 2).sum()) / len(df_train_test_y)
```

Out[ ]:
1960180964.6378567

Results: of the three models above, the random forest has the smallest mean squared error, so it is chosen as the final model.
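For completeness, the submission step might look roughly like this. It is a sketch with synthetic stand-ins, since the fully preprocessed train/test frames and the Id column come from part 1 and are not shown here; the `Id`/`SalePrice` column names follow the Kaggle house-prices submission format:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
# Stand-ins for the preprocessed frames from part 1 (names are illustrative)
df_train = pd.DataFrame(rng.normal(size=(100, 5)))
df_train_y = rng.normal(size=100)
df_test = pd.DataFrame(rng.normal(size=(20, 5)))
test_ids = np.arange(1461, 1481)  # Kaggle house-prices test Ids start at 1461

# Refit the winning model on all training data, then predict the test set
rf = RandomForestRegressor(max_features=0.5, n_estimators=100, random_state=0)
rf.fit(df_train, df_train_y)

submission = pd.DataFrame({"Id": test_ids, "SalePrice": rf.predict(df_test)})
submission.to_csv("submission.csv", index=False)
```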

After submission, the error was 0.1485, ranking around the 50th percentile of more than four thousand entries. There is still a lot of room for optimization, which I'll come back to later.
