A Summary of Python Data Analysis Basics

I. Reading Data

1. Reading and writing database data

Read functions:

pandas.read_sql_table(table_name, con, schema=None, index_col=None, coerce_float=True, columns=None)

pandas.read_sql_query(sql, con, index_col=None, coerce_float=True)

pandas.read_sql(sql, con, index_col=None, coerce_float=True, columns=None)

sqlalchemy.create_engine('dialect+driver://username:password@host:port/database?charset=encoding')

Write functions:

DataFrame.to_sql(name, con, schema=None, if_exists='fail', index=True, index_label=None, dtype=None)
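
A minimal read/write sketch, assuming a hypothetical local MySQL database named testdb reachable through the pymysql driver; the connection string, table names and credentials are placeholders, not part of the original summary.

import pandas as pd
from sqlalchemy import create_engine

# Placeholder connection string: dialect+driver://user:password@host:port/db?charset=...
engine = create_engine('mysql+pymysql://user:password@127.0.0.1:3306/testdb?charset=utf8')

orders = pd.read_sql_table('orders', con=engine)                         # read a whole table
recent = pd.read_sql_query('SELECT * FROM orders LIMIT 10', con=engine)  # read a query result

# Write a DataFrame back; if_exists='replace' drops and recreates the target table
orders.to_sql('orders_copy', con=engine, index=False, if_exists='replace')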

 

2. Reading and writing text/CSV files

Read functions:

pandas.read_table(filepath_or_buffer, sep='\t', header='infer', names=None, index_col=None, dtype=None, engine=None, nrows=None)

pandas.read_csv(filepath_or_buffer, sep=',', header='infer', names=None, index_col=None, dtype=None, engine=None, nrows=None)

Write functions:

DataFrame.to_csv(path_or_buf=None, sep=',', na_rep='', columns=None, header=True, index=True, index_label=None, mode='w', encoding=None)
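
A minimal sketch, assuming hypothetical files data.csv and data.tsv in the working directory.

import pandas as pd

df = pd.read_csv('data.csv', sep=',', encoding='utf-8', nrows=1000)   # read only the first 1000 rows
tsv = pd.read_table('data.tsv', sep='\t')                             # tab-separated variant

df.to_csv('data_out.csv', index=False, encoding='utf-8')              # write without the row index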

 

3. Reading and writing Excel (xls/xlsx) data

Read functions:

pandas.read_excel(io, sheetname=0, header=0, index_col=None, names=None, dtype=None)

Write functions:

DataFrame.to_excel(excel_writer, sheet_name='Sheet1', na_rep='', header=True, index=True, index_label=None, encoding=None)
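
A minimal sketch, assuming a hypothetical report.xlsx and an installed Excel engine such as openpyxl; note that newer pandas versions spell the read argument sheet_name rather than sheetname.

import pandas as pd

df = pd.read_excel('report.xlsx', sheet_name=0, header=0)   # first sheet, first row as header

with pd.ExcelWriter('report_out.xlsx') as writer:           # write one or more sheets
    df.to_excel(writer, sheet_name='summary', index=False)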

 

4. Reading clipboard data:

pandas.read_clipboard()

 

II. Data Preprocessing

1. Data cleaning

Handling duplicate data

Duplicate records (rows):

pandas.DataFrame(Series).drop_duplicates(self, subset=None, keep='first', inplace=False)
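
A minimal sketch on a toy DataFrame (hypothetical data):

import pandas as pd

df = pd.DataFrame({'id': [1, 1, 2, 3], 'city': ['BJ', 'BJ', 'SH', 'SZ']})

dedup_all = df.drop_duplicates()                               # drop fully duplicated rows
dedup_sub = df.drop_duplicates(subset=['city'], keep='last')   # judge duplicates by chosen columns only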

Duplicate features (columns):

General approach (any dtype):

import pandas as pd

def FeatureEquals(df):
    # Pairwise-compare every column with every other column; the result is a
    # Boolean matrix where True marks columns with identical contents.
    dfEquals = pd.DataFrame([], columns=df.columns, index=df.columns)
    for i in df.columns:
        for j in df.columns:
            dfEquals.loc[i, j] = df.loc[:, i].equals(df.loc[:, j])
    return dfEquals

Numeric features:

def drop_features(data, way='pearson', assoRate=1.0):
    '''
    For every pair of numeric columns whose correlation is >= assoRate,
    flag the second column of the pair for removal; used to drop
    duplicated numeric features.
    data: DataFrame, no default
    way: correlation method passed to DataFrame.corr, default 'pearson'
    assoRate: correlation threshold, default 1.0
    '''
    assoMat = data.corr(method=way)
    delCol = []
    length = len(assoMat)
    for i in range(length):
        for j in range(i + 1, length):
            if assoMat.iloc[i, j] >= assoRate:
                delCol.append(assoMat.columns[j])
    return delCol
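
A usage sketch for the two helpers above, on a toy DataFrame where column b duplicates a exactly and column c is perfectly correlated with a (hypothetical data):

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4],
                   'b': [1, 2, 3, 4],
                   'c': [2, 4, 6, 8]})

print(FeatureEquals(df))   # Boolean matrix: True on the diagonal and for the (a, b) pair
print(drop_features(df))   # ['b', 'c', 'c']; wrap in set() before dropping to remove repeats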

 

Handling missing values

Detecting missing values:

DataFrame.isnull()

DataFrame.notnull()

DataFrame.isna()

DataFrame.notna()

Treating missing values (a combined sketch follows the three functions below):

Drop: DataFrame.dropna(self, axis=0, how='any', thresh=None, subset=None, inplace=False)

Fill with a value: DataFrame.fillna(value=None, method=None, axis=None, inplace=False, limit=None)

Interpolate: DataFrame.interpolate(method='linear', axis=0, limit=None, inplace=False, limit_direction='forward', limit_area=None, downcast=None, **kwargs)
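
A minimal sketch covering detection, dropping, filling and interpolation on toy data (hypothetical values):

import numpy as np
import pandas as pd

df = pd.DataFrame({'x': [1.0, np.nan, 3.0, np.nan], 'y': [np.nan, 'b', 'c', 'd']})

print(df.isnull().sum())                                    # missing-value count per column
dropped = df.dropna(how='any')                              # drop rows containing any NaN
filled = df.fillna({'x': df['x'].mean(), 'y': 'unknown'})   # column-wise fill values
interp = df['x'].interpolate(method='linear')               # linear interpolation of numeric gaps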

 

Handling outliers

The 3σ rule

import numpy as np

def outRange(Ser1):
    # Flag values lying more than 3 standard deviations from the mean
    # (the original used the variance on the upper bound, which was a bug).
    boolInd = (Ser1 < Ser1.mean() - 3 * Ser1.std()) | (Ser1 > Ser1.mean() + 3 * Ser1.std())
    index = np.arange(Ser1.shape[0])[boolInd]
    outrange = Ser1.iloc[index]
    return outrange

Note: this method is only appropriate for (approximately) normally distributed data.
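
A usage sketch for outRange above, on a synthetic normally distributed Series with one injected outlier (hypothetical data):

import numpy as np
import pandas as pd

np.random.seed(0)
ser = pd.Series(np.random.normal(0, 1, 1000))
ser.iloc[0] = 10.0            # inject an obvious outlier

print(outRange(ser))          # values lying outside mean ± 3 * std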

Box-plot (IQR) analysis

def boxOutRange(Ser):
    '''
    Ser: the DataFrame column (a Series) to check for outliers.
    '''
    QL = Ser.quantile(0.25)
    QU = Ser.quantile(0.75)
    IQR = QU - QL
    Low = QL - 1.5 * IQR          # lower whisker
    Up = QU + 1.5 * IQR           # upper whisker
    index = (Ser < Low) | (Ser > Up)
    Outlier = Ser.loc[index]
    return Outlier

2. Merging data

Stacking: pandas.concat(objs, axis=0, join='outer', join_axes=None, ignore_index=False, keys=None, levels=None, names=None, verify_integrity=False, copy=True)

Key-based merge: pandas.merge(left, right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'), copy=True, indicator=False)

Overlap combine: pandas.DataFrame.combine_first(self, other)
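
A minimal sketch of the three merge styles on toy DataFrames (hypothetical data):

import pandas as pd

orders = pd.DataFrame({'user_id': [10, 11, 10], 'amount': [5.0, 7.5, 2.0]})
users = pd.DataFrame({'user_id': [10, 11], 'name': ['Ann', 'Bob']})

stacked = pd.concat([orders, orders], axis=0, ignore_index=True)   # stack rows vertically
joined = pd.merge(orders, users, how='left', on='user_id')         # key-based merge on user_id

a = pd.DataFrame({'x': [1.0, None]})
b = pd.DataFrame({'x': [9.0, 2.0]})
patched = a.combine_first(b)                                       # NaN cells in a filled from b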

3. Data transformation

Dummy variables (one-hot encoding): pandas.get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False, columns=None, sparse=False, drop_first=False)

Discretization (binning): pandas.cut(x, bins, right=True, labels=None, retbins=False, precision=3, include_lowest=False)
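
A minimal sketch of one-hot encoding and binning on toy data (hypothetical values):

import pandas as pd

df = pd.DataFrame({'color': ['red', 'green', 'red'], 'price': [3, 12, 25]})

dummies = pd.get_dummies(df['color'], prefix='color')      # one 0/1 column per category
bins = pd.cut(df['price'], bins=[0, 10, 20, 30],
              labels=['low', 'mid', 'high'])               # discretize with explicit bin edges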

4. Data standardization

Z-score standardization: sklearn.preprocessing.StandardScaler

Min-max (range) scaling: sklearn.preprocessing.MinMaxScaler
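
A minimal sketch of both scalers on a small numeric array (hypothetical values):

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])

X_std = StandardScaler().fit_transform(X)   # zero mean, unit variance per column
X_mm = MinMaxScaler().fit_transform(X)      # each column rescaled to [0, 1]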

III. Model Building

1. Train/test split

sklearn.model_selection.train_test_split(*arrays, **options)
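
A minimal sketch using scikit-learn's bundled iris dataset (the dataset choice is an assumption, not part of the original summary):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)   # 80/20 split, stratified by class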

2. Dimensionality reduction

class sklearn.decomposition.PCA(n_components=None, copy=True, whiten=False, svd_solver='auto', tol=0.0, iterated_power='auto', random_state=None)
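
A minimal sketch reducing the iris features to two principal components (the dataset choice is an assumption):

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)
print(pca.explained_variance_ratio_)   # variance explained by each retained component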

3. Cross-validation

sklearn.model_selection.cross_validate(estimator, X, y=None, groups=None, scoring=None, cv=None, n_jobs=1, verbose=0, fit_params=None, pre_dispatch='2*n_jobs', return_train_score='warn')
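
A minimal sketch: 5-fold cross-validation of a logistic regression on iris (both the estimator and the dataset are assumptions chosen for illustration):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

X, y = load_iris(return_X_y=True)
scores = cross_validate(LogisticRegression(max_iter=1000), X, y,
                        cv=5, scoring='accuracy', return_train_score=True)
print(scores['test_score'].mean())     # mean validation accuracy over the 5 folds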

4. Model training and prediction

Supervised models:

clf = lr.fit(X_train, y_train)    # lr: any supervised estimator, e.g. LogisticRegression()
clf.predict(X_test)

5. Clustering

Common algorithms (a usage sketch follows the list):

K-means: class sklearn.cluster.KMeans(n_clusters=8, init='k-means++', n_init=10, max_iter=300, tol=0.0001, precompute_distances='auto', verbose=0, random_state=None, copy_x=True, n_jobs=1, algorithm='auto')

DBSCAN density-based clustering: class sklearn.cluster.DBSCAN(eps=0.5, min_samples=5, metric='euclidean', metric_params=None, algorithm='auto', leaf_size=30, p=None, n_jobs=1)

BIRCH hierarchical clustering: class sklearn.cluster.Birch(threshold=0.5, branching_factor=50, n_clusters=3, compute_labels=True, copy=True)
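
A minimal clustering sketch with KMeans on synthetic blob data (make_blobs is an assumption used to keep the example self-contained):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)
km = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = km.fit_predict(X)      # cluster index for every sample
print(km.cluster_centers_)      # coordinates of the three cluster centres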

Evaluation (a usage sketch follows the metric list below):

Silhouette coefficient: sklearn.metrics.silhouette_score(X, labels, metric='euclidean', sample_size=None, random_state=None, **kwds)

calinski_harabaz_score:sklearn.metrics.calinski_harabaz_score(X, labels)

completeness_score:sklearn.metrics.completeness_score(labels_true, labels_pred)

fowlkes_mallows_score:sklearn.metrics.fowlkes_mallows_score(labels_true, labels_pred, sparse=False)

homogeneity_completeness_v_measure:sklearn.metrics.homogeneity_completeness_v_measure(labels_true, labels_pred)

adjusted_rand_score:sklearn.metrics.adjusted_rand_score(labels_true, labels_pred)

homogeneity_score:sklearn.metrics.homogeneity_score(labels_true, labels_pred)

mutual_info_score:sklearn.metrics.mutual_info_score(labels_true, labels_pred, contingency=None)

normalized_mutual_info_score:sklearn.metrics.normalized_mutual_info_score(labels_true, labels_pred)

v_measure_score:sklearn.metrics.v_measure_score(labels_true, labels_pred)

Note: every metric above that takes a labels_true parameter requires ground-truth labels.
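
A usage sketch for the metrics above: the silhouette coefficient needs only the data and the predicted labels, while the labels_true-based metrics also need ground truth (iris and KMeans are assumptions chosen for illustration):

from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import adjusted_rand_score, silhouette_score

X, y_true = load_iris(return_X_y=True)
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

print(silhouette_score(X, labels))            # internal metric: no ground truth needed
print(adjusted_rand_score(y_true, labels))    # external metric: compares against true labels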

6. Classification

Common algorithms:

AdaBoost classification: class sklearn.ensemble.AdaBoostClassifier(base_estimator=None, n_estimators=50, learning_rate=1.0, algorithm='SAMME.R', random_state=None)
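
A minimal sketch: train and score an AdaBoost classifier on iris (the dataset and split parameters are assumptions):

from sklearn.datasets import load_iris
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

clf = AdaBoostClassifier(n_estimators=50, learning_rate=1.0, random_state=42)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))    # mean accuracy on the held-out split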
