Introduction to Data Analysis: Pandas Library Basics

Date: 2021-05-03 Category: Programmer's Life

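When doing data analysis in Python, we often use the Pandas library to process data and convert it into the format we need. Pandas has two core data structures for working with data: Series and DataFrame.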

Series

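A Series is a one-dimensional, array-like object with two main attributes: values and index. As with an array, you look up a value through its index. A Series resembles both an array and a Python dict: an array can only be indexed by integer position, whereas the index of a Series can hold numbers or strings.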
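As a minimal sketch of those two attributes (assuming pandas is installed; the variable name `s` and the sample data are made up for illustration), the following builds a small Series and inspects it in both the array-like and the dict-like way:

```python
import pandas as pd

# Hypothetical sample data: string labels, integer values.
s = pd.Series([10, 20, 30], index=['x', 'y', 'z'])

values = s.values.tolist()   # the underlying data as a plain list
labels = s.index.tolist()    # the index labels as a plain list

print(values)
print(labels)
print(s['y'])      # look up a value by its label, like a dict
print('y' in s)    # membership tests the index, also like a dict
```

The `in` test checks the index labels rather than the values, which is the dict-like side of a Series.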

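Creating a Series object
The simplest way to create a Series is from a list:

    from pandas import Series, DataFrame

    s1 = Series(['a','b','c','d'])
    s1
    Out[16]:
    0    a
    1    b
    2    c
    3    d
    dtype: object

When no index is specified, an integer index running from 0 to N-1 is generated by default.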

A Series infers its data type from the elements of the list passed in. If every element is an integer, the Series is an integer Series. If any element is a float, the Series is a float Series. If any element is a string, the Series has the object dtype.

    s1 = Series([1,2,3,4])
    s1
    Out[23]:
    0    1
    1    2
    2    3
    3    4
    dtype: int64

    s2 = Series([1,2,3,4.0])
    s2
    Out[25]:
    0    1.0
    1    2.0
    2    3.0
    3    4.0
    dtype: float64

    s3 = Series([1,2,3,'4'])
    s3
    Out[27]:
    0    1
    1    2
    2    3
    3    4
    dtype: object

Besides a list, a Series can also be created from a dict.

    s1 = Series({'a':1,'b':2,'c':3,'d':4})
    s1
    Out[37]:
    a    1
    b    2
    c    3
    d    4
    dtype: int64

When a Series is created from a dict, the dict's keys become the index of the Series and the dict's values become its values.

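A Series can also be created from a dict together with a list that is used as the index:

    dict1 = {'a':1,'b':2,'c':3,'d':4}
    index1 = ['a','b','e']
    s1 = Series(dict1,index=index1)
    s1
    Out[51]:
    a    1.0
    b    2.0
    e    NaN
    dtype: float64

In the code above, the index of s1 is set to index1, that is 'a', 'b' and 'e'. Each value of s1 is taken from the entry of dict1 whose key matches the index label, and a label with no match gets NaN. Index 'e' matches no key in dict1, so its value is NaN. Indexes 'a' and 'b' both match, so their values are 1 and 2 (shown as 1.0 and 2.0 because the NaN forces the dtype to float64).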
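The sketch below repeats that construction and then asks which labels ended up without a match. It uses `Series.isna()`, which the original text does not mention, and the variable name `missing` is only illustrative:

```python
import pandas as pd

dict1 = {'a': 1, 'b': 2, 'c': 3, 'd': 4}
index1 = ['a', 'b', 'e']
s1 = pd.Series(dict1, index=index1)

# isna() marks the entries whose label had no matching key in dict1.
missing = s1[s1.isna()].index.tolist()

print(missing)
print(s1.dtype)   # float64, because NaN is a floating-point value
```

Keys of dict1 that are absent from index1 ('c' and 'd' here) are simply dropped from the result.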

Accessing Series values by index:

    s1 = Series({'a':1,'b':2,'c':3,'d':4})
    s1
    Out[39]:
    a    1
    b    2
    c    3
    d    4
    dtype: int64

    s1['b']
    Out[40]: 2

In the code above, s1['b'] returns the value stored under index 'b'.
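Beyond a single label in square brackets, pandas also offers the explicit `.loc` (by label) and `.iloc` (by position) accessors, and accepts a list of labels. These are not covered in the original text; the following is a small sketch with illustrative variable names:

```python
import pandas as pd

s1 = pd.Series({'a': 1, 'b': 2, 'c': 3, 'd': 4})

by_label = s1.loc['b']       # explicit label-based lookup
by_position = s1.iloc[1]     # explicit position-based lookup (second element)
several = s1[['a', 'c']]     # a list of labels returns a smaller Series

print(by_label, by_position)
print(several.tolist())
```

Using `.loc` and `.iloc` makes it unambiguous whether a number refers to a label or to a position, which matters once the index itself contains integers.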

A Series supports logical and mathematical operations:

    s1 = Series([2,5,-10,200])
    s1 * 2
    Out[53]:
    0      4
    1     10
    2    -20
    3    400
    dtype: int64

    s1[s1>0]
    Out[54]:
    0      2
    1      5
    3    200
    dtype: int64

A mathematical operation on a Series is applied to every element of the Series.

    s1 = Series([2,5,-10,200])
    s1[s1>0]
    Out[7]:
    0      2
    1      5
    3    200
    dtype: int64
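To illustrate the element-wise arithmetic specifically, here is a short sketch on the same data; the names `doubled`, `shifted` and `squared` are only illustrative:

```python
import pandas as pd

s1 = pd.Series([2, 5, -10, 200])

doubled = s1 * 2     # every element multiplied by 2
shifted = s1 + 1     # every element increased by 1
squared = s1 ** 2    # every element squared

print(doubled.tolist())
print(shifted.tolist())
print(squared.tolist())
print(s1.tolist())   # the original Series is not modified
```

Each operation returns a new Series with the same index, so s1 itself stays as it was.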

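Applying a comparison to a Series returns a new Series of bool values, one per element. The original Series is left unchanged.

    s1 = Series([2,5,-10,200])
    s1
    Out[10]:
    0      2
    1      5
    2    -10
    3    200
    dtype: int64

    s1 > 0
    Out[11]:
    0     True
    1     True
    2    False
    3     True
    dtype: bool

Using such a boolean Series as a mask filters out the data that does not meet a condition, for example dropping the negative element from the example above:

    s1 = Series([2,5,-10,200])
    s1[s1 >0]
    Out[23]:
    0      2
    1      5
    3    200
    dtype: int64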
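Several conditions can be combined into one mask. A minimal sketch, assuming the same sample data; note that pandas uses the `&` and `|` operators here rather than the Python keywords `and` / `or`, and each condition needs its own parentheses:

```python
import pandas as pd

s1 = pd.Series([2, 5, -10, 200])

# Keep elements that are positive AND below 100.
in_range = s1[(s1 > 0) & (s1 < 100)]

# Keep elements that are negative OR above 100.
outside = s1[(s1 < 0) | (s1 > 100)]

print(in_range.tolist())
print(outside.tolist())
```

The filtered results keep the original index labels of the elements that survive.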

Both a Series object and its index have a name attribute, which can be set as follows:

    fruit = {0:'apple',1:'orange',2:'banana'}
    fruitSeries = Series(fruit)
    fruitSeries.name='Fruit'
    fruitSeries
    Out[27]:
    0     apple
    1    orange
    2    banana
    Name: Fruit, dtype: object

    fruitSeries.index.name='Fruit Index'
    fruitSeries
    Out[29]:
    Fruit Index
    0     apple
    1    orange
    2    banana
    Name: Fruit, dtype: object

The index of a Series can be changed directly by assigning to its index attribute:

    fruitSeries.index=['a','b','c']
    fruitSeries
    Out[31]:
    a     apple
    b    orange
    c    banana
    Name: Fruit, dtype: object

DataFrame

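A DataFrame is a tabular data structure, much like a table in a relational database: it is made up of rows and columns, and has column names, an index and other attributes.

We can think of each column of a DataFrame as one of the Series described above: there are as many Series as there are columns, and they all share the same index.

Creating a DataFrame from a dict:

    data = {'fruit':['Apple','Apple','Orange','Orange','Banana'],
            'year':[2010,2011,2012,2011,2012],
            'sale':[15000,17000,36000,24000,29000]}
    frame = DataFrame(data)
    frame
    Out[12]:
        fruit  year   sale
    0   Apple  2010  15000
    1   Apple  2011  17000
    2  Orange  2012  36000
    3  Orange  2011  24000
    4  Banana  2012  29000

When a DataFrame is created this way, each value in the dict should be a list (or another array-like sequence), and all of them must have the same length; otherwise pandas raises an error. Here, the lists under the keys fruit, year and sale must all be the same length.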
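The length requirement can be seen with a deliberately broken dict. This is a minimal sketch; the name `bad` and its contents are made up for illustration:

```python
import pandas as pd

# 'year' is one element shorter than 'fruit', so construction fails.
bad = {'fruit': ['Apple', 'Orange', 'Banana'],
       'year': [2010, 2011]}

error = None
try:
    pd.DataFrame(bad)
except ValueError as exc:
    error = exc
    print('ValueError:', exc)
```

The exact wording of the message differs between pandas versions, so the sketch checks only the exception type.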

As with a Series, a DataFrame is given a default index automatically when it is created.
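Both the default index and the earlier point that each column is a Series sharing that index can be checked directly. A small sketch, reusing the article's sample data; the variable name `col` is illustrative:

```python
import pandas as pd

data = {'fruit': ['Apple', 'Apple', 'Orange', 'Orange', 'Banana'],
        'year': [2010, 2011, 2012, 2011, 2012],
        'sale': [15000, 17000, 36000, 24000, 29000]}
frame = pd.DataFrame(data)

col = frame['sale']                    # selecting one column yields a Series

print(type(col).__name__)
print(col.name)                        # the column name becomes the Series name
print(frame.index.tolist())            # default integer index, 0 to N-1
print(col.index.equals(frame.index))   # the column shares the frame's index
```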

The order of the columns can be set by passing the columns argument:

    data = {'fruit':['Apple','Apple','Orange','Orange','Banana'],
            'year':[2010,2011,2012,2011,2012],
            'sale':[15000,17000,36000,24000,29000]}
    frame = DataFrame(data,columns=['sale','fruit','year','price'])
    frame
    Out[25]:
        sale   fruit  year price
    0  15000   Apple  2010   NaN
    1  17000   Apple  2011   NaN
    2  36000  Orange  2012   NaN
    3  24000  Orange  2011   NaN
    4  29000  Banana  2012   NaN

If a requested column is not found in the data, it is filled with NaN values.
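An all-NaN column like this can be filled in afterwards by assignment. The sketch below is one way to do it; the price figures are invented purely for illustration and do not come from the original data:

```python
import pandas as pd

data = {'fruit': ['Apple', 'Apple', 'Orange', 'Orange', 'Banana'],
        'year': [2010, 2011, 2012, 2011, 2012],
        'sale': [15000, 17000, 36000, 24000, 29000]}
frame = pd.DataFrame(data, columns=['sale', 'fruit', 'year', 'price'])

# Before assignment, every entry in the unmatched column is missing.
all_missing = frame['price'].isna().all()

# Hypothetical prices, one per row; the list length must match the row count.
frame['price'] = [3.0, 3.2, 5.5, 5.0, 4.1]

print(all_missing)
print(frame['price'].tolist())
```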

Please credit the source when reposting: https://www.heiqu.com/wsxxpj.html
