The results are as follows:
liftSer
Out[13]:
colName
area error                 1.593838
mean radius                1.593838
mean concavity             1.593838
perimeter error            1.593838
radius error               1.593838
mean concave points        1.593838
mean area                  1.593838
worst area                 1.593838
concavity error            1.593838
worst concave points       1.593838
worst perimeter            1.593838
worst radius               1.593838
mean perimeter             1.593838
worst concavity            1.565875
mean compactness           1.565376
concave points error       1.536915
compactness error          1.536915
mean texture               1.536915
worst compactness          1.536915
mean smoothness            1.508453
worst texture              1.508453
worst smoothness           1.479992
mean symmetry              1.454027
worst fractal dimension    1.398103
worst symmetry             1.366146
fractal dimension error    1.280762
mean fractal dimension     1.258293
smoothness error           1.230331
texture error              1.223840
symmetry error             1.209118
Name: lift, dtype: float64

(4) Information Value (IV)

## Information value (IV) calculation
import math

def iv(df, cont, disc, tag, bins):
    binDf = binStatistic(df, cont, disc, tag, bins)
    # Adjusted positive-sample coverage per bin (a zero count is replaced by 1 so the log is defined)
    binDf['binPosCovAdj'] = binDf['binPosCnt'].replace(0, 1) / binDf['posCnt']
    # Adjusted negative-sample coverage per bin (a zero count is replaced by 1 so the log is defined)
    binDf['binNegCovAdj'] = binDf['binNegCnt'].replace(0, 1) / binDf['negCnt']
    # Weight of evidence: natural log of the positive-to-negative coverage ratio
    binDf['woe'] = binDf['binPosCovAdj'].apply(math.log) - binDf['binNegCovAdj'].apply(math.log)
    binDf['iv'] = binDf['woe'] * (binDf['binPosCovAdj'] - binDf['binNegCovAdj'])
    # Sum the per-bin contributions to get one IV per feature
    tmpSer = binDf.groupby('colName')['iv'].sum()
    tmpSer.name = 'iv'
    resSer = tmpSer.sort_values(ascending=False)
    return resSer

ivSer = iv(dataset, continuousColList, discreteColList, targetCol, 10)

The results are as follows:
ivSer
Out[14]:
colName
worst perimeter            5.663336
worst area                 5.407202
worst radius               5.391269
worst concave points       5.276160
mean concave points        5.117567
mean perimeter             4.643066
mean area                  4.507951
mean radius                4.460431
area error                 4.170720
mean concavity             3.999623
worst concavity            3.646313
perimeter error            2.777306
radius error               2.694609
worst compactness          2.320652
mean compactness           2.223346
concavity error            1.508040
concave points error       1.368055
mean texture               1.263312
worst texture              1.212502
worst smoothness           0.972226
worst symmetry             0.971215
compactness error          0.916664
mean smoothness            0.772058
mean symmetry              0.617936
worst fractal dimension    0.596305
fractal dimension error    0.253930
mean fractal dimension     0.244841
texture error              0.099411
smoothness error           0.087521
symmetry error             0.083463
Name: iv, dtype: float64

Here, we compare the feature-importance rankings produced by the four metrics above.
gainRankSer = gainSer.rank(ascending=False, method='min')  # tied values share the smallest rank
giniRankSer = giniSer.rank(ascending=False, method='min')  # tied values share the smallest rank
liftRankSer = liftSer.rank(ascending=False, method='min')  # tied values share the smallest rank
ivRankSer = ivSer.rank(ascending=False, method='min')      # tied values share the smallest rank
resDf = pd.concat([gainRankSer, giniRankSer, liftRankSer, ivRankSer], axis=1)
resDf2 = resDf.sort_values(by='gain', ascending=True)      # reorder rows by the gain ranking

The results are as follows:
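The color shading makes the pattern easy to see:
1) Information gain and the Gini index reduction give almost identical rankings; both differ from information value (IV) only slightly, and only on a handful of features;
2) Lift is broadly consistent with the other three metrics, but it assigns identical scores to some variables, so unlike the other three it cannot separate the predictive power of those variables (for example, the top 10 variables in the table).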
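Both observations can be checked numerically. The sketch below is only an illustration, not the author's procedure: it copies the lift and IV scores of five features from the outputs shown earlier, counts the distinct scores per metric (fewer distinct values means more ties), and computes the Spearman rank correlation between the two metrics. The choice of these five features and the names `scoreDf`, `rankDf`, `distinctCnt`, and `spearman` are assumptions made for this example.

```python
import pandas as pd

# Scores copied from the liftSer and ivSer outputs for five features
# (the subset is an arbitrary choice for illustration).
scoreDf = pd.DataFrame({
    'lift': {'worst perimeter': 1.593838, 'worst area': 1.593838,
             'mean texture': 1.536915, 'texture error': 1.223840,
             'symmetry error': 1.209118},
    'iv':   {'worst perimeter': 5.663336, 'worst area': 5.407202,
             'mean texture': 1.263312, 'texture error': 0.099411,
             'symmetry error': 0.083463},
})

# Same ranking rule as in the post: descending, ties share the smallest rank.
rankDf = scoreDf.rank(ascending=False, method='min')

# Distinct scores per metric: a count below the number of features means ties.
distinctCnt = scoreDf.nunique()

# Spearman rank correlation between the two metrics (1.0 = identical ordering).
spearman = scoreDf.corr(method='spearman').loc['lift', 'iv']

print(rankDf)
print(distinctCnt)
print(round(spearman, 3))
```

On this subset, lift cannot separate `worst perimeter` from `worst area` (both get rank 1), while IV orders all five features, even though the two metrics agree closely overall.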
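In summary, in practical engineering work it is better to consider several metrics together when deciding the final set of significant variables, rather than judging and filtering on a single metric.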
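One possible way to combine several metrics is sketched below. It is an assumption about how the combination could be done, not a method taken from the post: each feature gets a mean rank across metrics plus a vote count (the number of metrics that place it in the top `topK`), and it is kept when it collects at least `minVotes` votes. The helper name `combineRanks` and the parameters `topK` and `minVotes` are hypothetical; the scores are copied from the lift and IV outputs for five features.

```python
import pandas as pd

def combineRanks(scoreDf, topK, minVotes):
    """Hypothetical helper: combine per-metric scores into one selection table."""
    # Rank each metric column separately: descending, ties share the smallest rank.
    rankDf = scoreDf.rank(ascending=False, method='min')
    resDf = pd.DataFrame({
        'meanRank': rankDf.mean(axis=1),       # average rank across metrics
        'votes': (rankDf <= topK).sum(axis=1)  # metrics placing the feature in the top K
    })
    resDf['selected'] = resDf['votes'] >= minVotes
    return resDf.sort_values('meanRank')

# Scores copied from the liftSer and ivSer outputs (illustrative subset).
scoreDf = pd.DataFrame({
    'lift': {'worst perimeter': 1.593838, 'worst area': 1.593838,
             'mean texture': 1.536915, 'texture error': 1.223840,
             'symmetry error': 1.209118},
    'iv':   {'worst perimeter': 5.663336, 'worst area': 5.407202,
             'mean texture': 1.263312, 'texture error': 0.099411,
             'symmetry error': 0.083463},
})

combinedDf = combineRanks(scoreDf, topK=3, minVotes=2)
print(combinedDf)
```

Voting is less sensitive than a mean rank to a single metric that ties many features, which is why both columns are reported; the thresholds would need tuning for the full 30-feature table.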