Data Analysis with pandas: Common Data Processing (Part 4) (2)

The value that map() produces for each element is a single concrete (scalar) value; it cannot be an iterable.

import numpy as np
from pandas import DataFrame

df3 = DataFrame({'color': ['red', 'green', 'blue'], 'project': ['math', 'english', 'chemistry']})
price = {'red': 5.56, 'green': 3.14, 'chemistry': 2.79}
df3['price'] = df3['color'].map(price)
display(df3)
# Output:
#    color    project  price
# 0    red       math   5.56
# 1  green    english   3.14
# 2   blue  chemistry    NaN

df3 = DataFrame({'zs': [129, 130, 34], 'ls': [136, 98, 8]}, index=['张三', '李四', '倩倩'])
display(df3)
display(df3['zs'].map({129: '你好', 130: '非常好', 34: '不错'}))
display(df3['zs'].map({129: 120}))

def mapscore(score):
    if score < 90:
        return 'failed'
    elif score > 120:
        return 'excellent'
    else:
        return 'pass'

df3['status'] = df3['zs'].map(mapscore)
df3

# Output:
#      zs   ls
# 张三  129  136
# 李四  130   98
# 倩倩   34    8
#
# 张三     你好
# 李四    非常好
# 倩倩     不错
# Name: zs, dtype: object
#
# 张三    120.0
# 李四      NaN
# 倩倩      NaN
# Name: zs, dtype: float64
#
# Out[96]:
#      ls   zs     status
# 张三  136  129  excellent
# 李四   98  130  excellent
# 倩倩    8   34     failed
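As a small supplementary sketch (not part of the original example; the Series s below is made up for illustration), the mapping passed to map() can be either a dict or a function, and in both cases each element is turned into exactly one scalar; values missing from a dict mapping come back as NaN:

import pandas as pd

s = pd.Series(['red', 'green', 'blue'])

# dict mapping: 'blue' has no key in the dict, so it maps to NaN
print(s.map({'red': 5.56, 'green': 3.14}))

# function mapping: one scalar is returned per element
print(s.map(lambda color: len(color)))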

3 The rename() function: replace index labels with rename({old_label: new_label})

df4 = DataFrame({'color': ['white', 'gray', 'purple', 'blue', 'green'],
                 'value': np.random.randint(10, size=5)})
new_index = {0: 'first', 1: 'two', 2: 'three', 3: 'four', 4: 'five'}
display(df4, df4.rename(new_index))
# Output:
#     color  value
# 0   white      2
# 1    gray      0
# 2  purple      9
# 3    blue      2
# 4   green      0
#
#         color  value
# first   white      2
# two      gray      0
# three  purple      9
# four     blue      2
# five    green      0
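rename() also accepts explicit index= and columns= mappings, and inplace=True to modify the DataFrame in place. A minimal sketch reusing df4 from above (the new labels 'colour' and 'score' are made up for illustration):

# Rename row labels and column names at the same time; inplace=True mutates df4 directly
df4.rename(index={0: 'first', 1: 'second'},
           columns={'color': 'colour', 'value': 'score'},
           inplace=True)
display(df4)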

(3) Outlier detection and filtering

1 Use the describe() function to view descriptive statistics for each column

df = DataFrame(np.random.randint(10, size=10))
display(df.describe())
# Output:
#                0
# count  10.000000
# mean    5.900000
# std     2.685351
# min     1.000000
# 25%     6.000000
# 50%     7.000000
# 75%     7.750000
# max     8.000000
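describe() can also be tuned. As a sketch (reusing df and df3 from the examples above; the quantile values are chosen arbitrarily), percentiles= controls which quantiles are reported, and include='all' adds non-numeric columns:

# Report the 10% and 90% quantiles in addition to the default statistics
display(df.describe(percentiles=[0.1, 0.9]))

# Include non-numeric columns (count, unique values, top value, frequency)
display(df3.describe(include='all'))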

2 Use the std() function to obtain the standard deviation of each column of the DataFrame

df.std()
# Output:
# 0    3.306559
# dtype: float64
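Note that std() computes the sample standard deviation (ddof=1) by default. A brief sketch, reusing df from above:

display(df.std())                      # sample standard deviation (ddof=1, the default)
display(df.std(ddof=0))                # population standard deviation
display((df - df.mean()) / df.std())   # z-score for every element, column by column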

3 Filter the DataFrame based on each column's standard deviation.
With the help of the any() function, the condition is checked column by column: any(axis=1) flags every row in which at least one column satisfies the condition.

display(df[(df > df.std()*3).any(axis=1)])
df.drop(df[(np.abs(df) > (3*df.std())).any(axis=1)].index, inplace=True)
display(df, df.shape)
# Output:
#    0  1
# 2  7  9
# 6  8  8
# 9  8  1
#
#    0  1
# 0  5  0
# 1  3  3
# 3  3  5
# 4  2  4
# 5  7  6
# 7  1  6
# 8  7  7
#
# (7, 2)
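The same filter can equally be written as a "keep" condition instead of a drop; a sketch under the same setup:

# Boolean mask: True for rows where every column stays within 3 standard deviations
within_limit = (np.abs(df) <= 3 * df.std()).all(axis=1)
df_filtered = df[within_limit]
display(df_filtered)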

(4) Sorting

Sort with the take() function.
np.random.permutation() can be used to generate a random row order.

df5 = DataFrame(np.arange(25).reshape(5, 5))
new_order = np.random.permutation(5)
display(new_order)
display(df5, df5.take(new_order))
# Output:
# array([4, 2, 3, 1, 0])
#
#     0   1   2   3   4
# 0   0   1   2   3   4
# 1   5   6   7   8   9
# 2  10  11  12  13  14
# 3  15  16  17  18  19
# 4  20  21  22  23  24
#
#     0   1   2   3   4
# 4  20  21  22  23  24
# 2  10  11  12  13  14
# 3  15  16  17  18  19
# 1   5   6   7   8   9
# 0   0   1   2   3   4
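take() can also be combined with a slice of a random permutation to draw a random sample of rows. A sketch reusing df5 (the sample size 3 is chosen arbitrarily):

# Take 3 rows at random positions (sampling without replacement)
display(df5.take(np.random.permutation(len(df5))[:3]))

# The built-in sample() method achieves the same thing more directly
display(df5.sample(n=3))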

(5) Data classification and grouping

The groupby() function

import pandas as pd

df = pd.DataFrame([{'col1': 'a', 'col2': 1, 'col3': 'aa'},
                   {'col1': 'b', 'col2': 2, 'col3': 'bb'},
                   {'col1': 'c', 'col2': 3, 'col3': 'cc'},
                   {'col1': 'a', 'col2': 44, 'col3': 'aa'}])
display(df)
# Group by col1 and sum col2
display(df.groupby(by='col1').agg({'col2': sum}).reset_index())
# Group by col1 and take the max and min of col2
display(df.groupby(by='col1').agg({'col2': ['max', 'min']}).reset_index())
# Group by col1 and col3 and sum col2
display(df.groupby(by=['col1', 'col3']).agg({'col2': sum}).reset_index())

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from datetime import datetime

'''
Grouping with groupby
'''
df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one'],
                   'data1': np.arange(5),
                   'data2': np.arange(5)})
print(df)
#   key1 key2  data1  data2
# 0    a  one      0      0
# 1    a  two      1      1
# 2    b  one      2      2
# 3    b  two      3      3
# 4    a  one      4      4

'''
Computing on groups
'''
# Group by key1 and compute the mean of data1
grouped = df['data1'].groupby(df['key1'])
print(grouped.mean())
# a    1.666667
# b    2.500000

# Group by key1 and key2 and compute the mean of data1
groupedmean = df['data1'].groupby([df['key1'], df['key2']]).mean()
print(groupedmean)
# key1  key2
# a     one     2
#       two     1
# b     one     2
#       two     3

# unstack(): move the inner index level into the columns
print(groupedmean.unstack())
# key2  one  two
# key1
# a       2    1
# b       2    3

df['key1']  # selecting a single column returns a Series

# The groupby key can also be a Series or an array
states = np.array(['Oh', 'Ca', 'Ca', 'Oh', 'Oh'])
years = np.array([2005, 2005, 2006, 2005, 2006])
print(df['data1'].groupby([states, years]).mean())
# Ca  2005    1.0
#     2006    2.0
# Oh  2005    1.5
#     2006    4.0

# Group directly by a column name; non-numeric columns are automatically
# excluded from the aggregation
print(df.groupby('key1').mean())
#          data1     data2
# key1
# a     1.666667  1.666667
# b     2.500000  2.500000

# Add key2 to the grouping
print(df.groupby(['key1', 'key2']).mean())
#            data1  data2
# key1 key2
# a    one       2      2
#      two       1      1
# b    one       2      2
#      two       3      3

# size() returns a Series with the size of each group
print(df.groupby(['key1', 'key2']).size())
# key1  key2
# a     one     2
#       two     1
# b     one     1
#       two     1

'''
Iterating over groups
'''
# Iterate over the 'a' and 'b' groups
for name, group in df.groupby('key1'):
    print(name)
    print(group)
# a
#   key1 key2  data1  data2
# 0    a  one      0      0
# 1    a  two      1      1
# 4    a  one      4      4
# b
#   key1 key2  data1  data2
# 2    b  one      2      2
# 3    b  two      3      3

# Group by multiple keys
for (k1, k2), group in df.groupby(['key1', 'key2']):
    print((k1, k2))
    print(group)
# ('a', 'one')
#   key1 key2  data1  data2
# 0    a  one      0      0
# 4    a  one      4      4
# ('a', 'two')
#   key1 key2  data1  data2
# 1    a  two      1      1
# ('b', 'one')
#   key1 key2  data1  data2
# 2    b  one      2      2
# ('b', 'two')
#   key1 key2  data1  data2
# 3    b  two      3      3

'''
Selecting one column or a group of columns returns a grouped object
'''
# Indexing a GroupBy object with one or more column names selects those columns for aggregation
print(df.groupby(df['key1'])['data1'])                   # data1 grouped by key1 (a SeriesGroupBy object)
print(df.groupby(['key1'])[['data1', 'data2']].mean())   # data1 and data2 grouped by key1
#          data1     data2
# key1
# a     1.666667  1.666667
# b     2.500000  2.500000

print(df.groupby(['key1', 'key2'])['data1'].mean())
# key1  key2
# a     one     2
#       two     1
# b     one     2
#       two     3

'''
Grouping with functions
'''
# 'people' is assumed to have been defined earlier: a DataFrame with index a-e
# and columns whose names are two or three characters long.
# To group by the length of the column names, just pass the len function directly
print(people.groupby(len, axis=1).sum())  # e.g. the column name '杭州3' is three characters long
#       2     3
# a  30.0  20.0
# b  23.0  21.0
# c  26.0  22.0
# d  42.0  23.0
# e  46.0  24.0

# Functions can also be mixed with arrays, dicts, lists and Series
key_list = ['one', 'one', 'one', 'two', 'two']
print(people.groupby([len, key_list], axis=1).min())
#      2           3
#    one   two   two
# a  0.0  15.0  20.0
# b  1.0  16.0  21.0
# c  2.0  17.0  22.0
# d  3.0  18.0  23.0
# e  4.0  19.0  24.0

'''
Grouping by index level
'''
columns = pd.MultiIndex.from_arrays([['US', 'US', 'US', 'JP', 'JP'],
                                     [1, 3, 5, 1, 3]],
                                    names=['cty', 'tenor'])
hier_df = pd.DataFrame(np.random.randn(4, 5), columns=columns)
print(hier_df)
# cty          US                            JP
# tenor         1         3         5         1         3
# 0     -1.507729  2.112678  0.841736 -0.158109 -0.645219
# 1      0.355262  0.765209 -0.287648  1.134998 -0.440188
# 2      1.049813  0.763482 -0.362013 -0.428725 -0.355601
# 3     -0.868420 -1.213398 -0.386798  0.137273  0.678293

# Group by a level of the column index
print(hier_df.groupby(level='cty', axis=1).count())
# cty  JP  US
# 0     2   3
# 1     2   3
# 2     2   3
# 3     2   3
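As a small supplementary sketch (reusing df with key1/key2 from above, not part of the original post), agg() can apply several functions at once, and transform() broadcasts the group-level result back onto the original index:

# Several aggregations per group in one call
print(df.groupby('key1')['data1'].agg(['mean', 'sum', 'count']))

# transform() returns a result aligned with the original rows,
# here each row's group mean
print(df.groupby('key1')['data1'].transform('mean'))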
(6) Advanced data aggregation
