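In Scikit-learn's SVM classes, the hyperparameter C controls this balance: the smaller C is, the wider the street, but the more margin violations there are. If an SVM model is overfitting, try lowering C to regularize it.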
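The effect of C on the street width can be checked numerically. The following is a minimal sketch, not part of the original walkthrough: it assumes a two-class subset of iris (versicolor vs. virginica, petal features only), uses SVC(kernel="linear") as the classifier, and measures the street width as 2 / ||w||. The helper name street_width is hypothetical.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Assumption: versicolor vs. virginica on the two petal features.
# These two classes overlap, so some margin violations are unavoidable.
iris = load_iris()
mask = iris.target != 0
X = StandardScaler().fit_transform(iris.data[mask][:, 2:])
y = iris.target[mask]


def street_width(C):
    """Fit a linear SVM and return (street width, number of support vectors)."""
    clf = SVC(kernel="linear", C=C).fit(X, y)
    width = 2.0 / np.linalg.norm(clf.coef_)  # distance between the two margin lines
    return width, int(clf.n_support_.sum())


for C in (0.01, 1.0, 100.0):
    width, n_sv = street_width(C)
    print(f"C={C:<6} street width={width:.3f} support vectors={n_sv}")
```

A smaller C should report a wider street and at least as many support vectors, since more samples end up on or inside the margin.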
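4.1 Linearly separable data: the LinearSVC class
4.1.1 Important LinearSVC parameters
penalty: string, 'l1' or 'l2', default='l2'
loss: string, 'hinge' or 'squared_hinge', default='squared_hinge'; 'hinge' is the standard SVM loss function
dual: bool, default=True; prefer dual=False when n_samples > n_features. The primal and dual SVM problems yield the same solution
tol: float, default=1e-4; tolerance for the stopping criterion
C: float, default=1.0; penalty coefficient for the slack variables
multi_class: defaults to 'ovr'; this parameter normally does not need to be changed
See the source code for further details.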
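As an illustration of how these parameters fit together, here is a small sketch that spells them out explicitly on the iris data. The variable names primal and dual and the max_iter value are assumptions made for this example. It also shows that loss='hinge' cannot be combined with dual=False, which Scikit-learn rejects with a ValueError.

```python
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)
X = StandardScaler().fit_transform(X)

# Iris has n_samples (150) > n_features (4), so solving the primal
# problem (dual=False) is the recommended setting here.
primal = LinearSVC(penalty="l2", loss="squared_hinge", dual=False, tol=1e-4, C=1.0)
primal.fit(X, y)

# The standard hinge loss is only available with the dual formulation.
# max_iter is raised (an assumption) to give the solver room to converge.
dual = LinearSVC(penalty="l2", loss="hinge", dual=True, tol=1e-4, C=1.0, max_iter=10000)
dual.fit(X, y)

# One-vs-rest training yields one weight vector per class.
print("coef_ shape:", primal.coef_.shape)
print("training accuracy:", primal.score(X, y), dual.score(X, y))

# penalty='l2' with loss='hinge' is not supported when dual=False.
try:
    LinearSVC(loss="hinge", dual=False).fit(X, y)
except ValueError as err:
    print("unsupported combination:", err)
```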
4.1.2 Hinge loss function
The hinge loss is the function max(0, 1 - t). When t >= 1 the function equals 0; when t < 1 its derivative is -1.

    import numpy as np
    import matplotlib.pyplot as plt

    def hinge(x):
        """Hinge loss max(0, 1 - x)."""
        if x >= 1:
            return 0
        else:
            return 1 - x

    x = np.linspace(-2, 4, 20)
    y = [hinge(i) for i in x]
    ax = plt.subplot(111)
    plt.ylim([-1, 2])
    ax.plot(x, y, 'r-')
    plt.text(0.5, 1.5, r'f(t) = max(0,1-t)', fontsize=20)
    plt.show()

[Figure: plot of the hinge loss f(t) = max(0, 1 - t)]

4.1.3 Training a LinearSVC model on the iris dataset

    from sklearn import datasets
    import pandas as pd

    iris = datasets.load_iris()
    print(iris.keys())
    print('labels:', iris['target_names'])
    features, labels = iris['data'], iris['target']
    print(features.shape, labels.shape)

    # Inspect the dataset
    print('-------feature_names:', iris['feature_names'])
    iris_df = pd.DataFrame(features)
    # info() prints its report directly and returns None
    print('-------info:', iris_df.info())
    print('--------describe:', iris_df.describe())

Output:

    dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names', 'filename'])
    labels: ['setosa' 'versicolor' 'virginica']
    (150, 4) (150,)
    -------feature_names: ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 150 entries, 0 to 149
    Data columns (total 4 columns):
    0    150 non-null float64
    1    150 non-null float64
    2    150 non-null float64
    3    150 non-null float64
    dtypes: float64(4)
    memory usage: 4.8 KB
    -------info: None
    --------describe:
                    0           1           2           3
    count  150.000000  150.000000  150.000000  150.000000
    mean     5.843333    3.057333    3.758000    1.199333
    std      0.828066    0.435866    1.765298    0.762238
    min      4.300000    2.000000    1.000000    0.100000
    25%      5.100000    2.800000    1.600000    0.300000
    50%      5.800000    3.000000    4.350000    1.300000
    75%      6.400000    3.300000    5.100000    1.800000
    max      7.900000    4.400000    6.900000    2.500000

    # Preprocess the data
    from sklearn.preprocessing import StandardScaler, LabelEncoder
    from sklearn.model_selection import RandomizedSearchCV
    from sklearn.svm import LinearSVC
    from scipy.stats import uniform

    # Standardize the features
    scaler = StandardScaler()
    X = scaler.fit_transform(features)
    print(X.mean(axis=0))
    print(X.std(axis=0))

    # Encode the labels
    encoder = LabelEncoder()
    Y = encoder.fit_transform(labels)

    # Tune the hyperparameter C with a randomized search
    svc = LinearSVC(loss='hinge', dual=True)
    param_distributions = {'C': uniform(0, 10)}
    rscv_clf = RandomizedSearchCV(estimator=svc,
                                  param_distributions=param_distributions,
                                  cv=3, n_iter=20, verbose=2)
    rscv_clf.fit(X, Y)
    print(rscv_clf.best_params_)

Output:

    [-1.69031455e-15 -1.84297022e-15 -1.69864123e-15 -1.40924309e-15]
    [1. 1. 1. 1.]
    Fitting 3 folds for each of 20 candidates, totalling 60 fits
    [CV] C=8.266733168092582 .............................................
    [CV] .............................. C=8.266733168092582, total= 0.0s
    [CV] C=8.266733168092582 .............................................
    [CV] .............................. C=8.266733168092582, total= 0.0s
    [CV] C=8.266733168092582 .............................................
    [CV] .............................. C=8.266733168092582, total= 0.0s
    [CV] C=8.140498369662586 .............................................
    [CV] .............................. C=8.140498369662586, total= 0.0s
    ... ... ...
    [CV] .............................. C=9.445168322251103, total= 0.0s
    [CV] C=9.445168322251103 .............................................
    [CV] .............................. C=9.445168322251103, total= 0.0s
    [CV] C=2.100443613273717 .............................................
    [CV] .............................. C=2.100443613273717, total= 0.0s
    [CV] C=2.100443613273717 .............................................
    [CV] .............................. C=2.100443613273717, total= 0.0s
    [CV] C=2.100443613273717 .............................................
    [CV] .............................. C=2.100443613273717, total= 0.0s
    {'C': 3.2357870215300046}

    # Evaluate the model (on the training data)
    y_pred = rscv_clf.predict(X)
    result = np.equal(y_pred, Y).astype(np.float32)
    print('accuracy:', np.sum(result) / len(result))

Output:

    accuracy: 0.9466666666666667

    from sklearn.metrics import accuracy_score, precision_score, recall_score

    # The metric functions take (y_true, y_pred) in that order
    print('accuracy_score:', accuracy_score(Y, y_pred))
    print('precision_score:', precision_score(Y, y_pred, average='micro'))

Output:

    accuracy_score: 0.9466666666666667
    precision_score: 0.9466666666666667

5 Appendix
5.1 Nonlinear SVM classification with SVC
Depending on the kernel parameter, the SVC class can perform either linear or nonlinear classification. Its parameters and attributes are described below.
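5.1.1 SVC parameters
C: float, default=1.0; penalty coefficient
kernel: string, default='rbf'; the kernel function, which must be one of 'linear', 'poly', 'rbf', 'sigmoid', 'precomputed', or a callable
degree: the degree of the polynomial kernel; only meaningful when kernel='poly'
gamma: float, default='auto'; the kernel coefficient (newer Scikit-learn releases default to 'scale')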
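To see what the kernel, degree, and gamma parameters change in practice, the sketch below compares three kernels on a dataset that is not linearly separable. The make_moons data, the train/test split, and the coef0 setting are assumptions chosen for illustration and do not come from the walkthrough above.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Assumption: two interleaving half-moons as a simple nonlinear problem.
X, y = make_moons(n_samples=400, noise=0.15, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

candidates = {
    "linear": SVC(kernel="linear", C=1.0),
    # degree only matters for the polynomial kernel
    "poly (degree=3)": SVC(kernel="poly", degree=3, gamma="scale", coef0=1.0, C=1.0),
    "rbf": SVC(kernel="rbf", gamma="scale", C=1.0),
}

scores = {}
for name, svc in candidates.items():
    # Scale inside the pipeline so the scaler is fit on training data only
    model = make_pipeline(StandardScaler(), svc)
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)
    print(f"{name:<16} test accuracy = {scores[name]:.3f}")
```

On data like this the nonlinear kernels would be expected to score higher than the linear one, because a straight line cannot follow the curved class boundary.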