Derrick's Blog

[Pandas] The DataFrame Data Structure

I. Creating a DataFrame

1. Creating a DataFrame directly from a dictionary

import pandas as pd

df = pd.DataFrame({"id":[5001,5002,5003],"name":["zhangsan","lisi","wangwu"],"age":[21,23,25]})
print(df)

2. Specifying column order and row index when creating from a dictionary

import pandas as pd

df = pd.DataFrame(data={"age": [20, 30, 40], "name": ["张三", "李四", "王五"]}, columns=["name", "age"], index=[101, 102, 103])
print(df)
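The `columns` argument does more than reorder. The sketch below, which assumes dictionary input as in the example above, shows that only the listed keys are kept and that a listed name with no matching key (the hypothetical `"city"` here) becomes a column of missing values rather than an error.

```python
import pandas as pd

data = {"age": [20, 30, 40], "name": ["张三", "李四", "王五"]}

# "city" is a hypothetical column name that is not a key in data
df = pd.DataFrame(data=data, columns=["name", "city"], index=[101, 102, 103])

print(df.columns.tolist())      # "age" is dropped because it is not listed
print(df["city"].isna().all())  # the unmatched column holds only missing values
```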

II. Common DataFrame attributes

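Attribute  Description
index      Row index of the DataFrame
columns    Column labels of the DataFrame
values     Values of the DataFrame, as a NumPy array
ndim       Number of dimensions of the DataFrame
shape      Shape of the DataFrame
size       Number of elements in the DataFrame
dtypes     Data type of each column
T          Transpose (rows and columns swapped)
loc[]      Explicit indexing: index or slice by row and column labels
iloc[]     Implicit indexing: index or slice by row and column positions
at[]       Access a single element by row and column label
iat[]      Access a single element by row and column position
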
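As a quick illustration of the descriptive attributes in the table, the following sketch prints them for the small DataFrame used earlier in this post. The shapes mentioned in the comments assume this exact data.

```python
import pandas as pd

df = pd.DataFrame(
    data={"age": [20, 30, 40], "name": ["张三", "李四", "王五"]},
    columns=["name", "age"],
    index=[101, 102, 103],
)

print(df.index.tolist())    # row labels
print(df.columns.tolist())  # column labels
print(df.ndim)              # number of dimensions, 2 for a DataFrame
print(df.shape)             # (rows, columns)
print(df.size)              # rows * columns
print(df.dtypes)            # one dtype per column
print(df.T.shape)           # transposing swaps rows and columns
```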
  • loc[]: explicit indexing, index or slice by row and column labels

    import pandas as pd
    
    df = pd.DataFrame(data={"age": [20, 30, 40], "name": ["张三", "李四", "王五"],"sex":["boy","girl","boy"]}, columns=["name", "age","sex"], index=[101, 102, 103])
    print(df)
    print("--"*20)
    
    print(df.loc[101:102]) # slice rows by label only; both end labels are included
    print("--"*20)
    print(df.loc[101:103,["sex","age"]]) # slice rows by label and select columns by label


  • iloc[]: implicit indexing, index or slice by row and column positions

    import pandas as pd
    
    df = pd.DataFrame(data={"age": [20, 30, 40], "name": ["张三", "李四", "王五"],"sex":["boy","girl","boy"]}, columns=["name", "age","sex"], index=[101, 102, 103])
    print(df)
    print("--"*20)
    
    print(df.iloc[0:2]) # slice rows by position only; the end position is excluded
    print("--"*20)
    print(df.iloc[0:3,1:3]) # slice rows and columns by position


 

  • at[]: access a single element by row and column label

    print(df.at[101,"name"])

  • iat[]: access a single element by row and column position

    print(df.iat[0,0])
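One difference between the two slicing styles is easy to miss: a label slice with loc[] includes its end label, while a position slice with iloc[] excludes its end position. A minimal sketch on the same kind of data as above:

```python
import pandas as pd

df = pd.DataFrame(
    data={"age": [20, 30, 40], "name": ["张三", "李四", "王五"]},
    columns=["name", "age"],
    index=[101, 102, 103],
)

by_label = df.loc[101:102]   # label slice: the end label 102 is included
by_position = df.iloc[0:1]   # position slice: the end position 1 is excluded

print(len(by_label), len(by_position))

# at[] and iat[] reach the same cell, once by label and once by position
print(df.at[102, "age"], df.iat[1, 1])
```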

III. Common DataFrame methods

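Method             Description
head()             First n rows (default 5)
tail()             Last n rows (default 5)
isin()             Whether each element is contained in the given collection
isna()             Whether each element is a missing value
sum()              Sum
mean()             Mean
min()              Minimum
max()              Maximum
var()              Variance
std()              Standard deviation
median()           Median
mode()             Mode
quantile()         Quantile at the given position, e.g. quantile(0.5)
describe()         Common summary statistics
info()             Basic information
value_counts()     Number of occurrences of each distinct row
count()            Number of non-missing elements
drop_duplicates()  Remove duplicate rows
                   drop_duplicates(subset=[column1, column2])
sample()           Random sampling
replace()          Replace existing values with specified values
equals()           Whether two DataFrames are identical
cummax()           Cumulative maximum
cummin()           Cumulative minimum
cumsum()           Cumulative sum
cumprod()          Cumulative product
diff()             First-order difference: each element minus the previous element. Parameters:
                   periods: integer, default 1. How many positions to shift before subtracting. A positive value compares each element with the one that many rows earlier; a negative value compares it with the one that many rows later.
                   axis: direction of the calculation. 0 or 'index' takes differences down the rows of each column; 1 or 'columns' takes them across the columns of each row. Default 0.
sort_index()       Sort by row index
sort_values()      Sort by the values of a column; pass a list to sort by several columns, and set ascending or descending order with the ascending parameter
                   sort_values([column1, column2, ...], ascending=[True, False, ...])
nlargest()         The n rows with the largest values in a column
                   nlargest(n, [column1, column2, ...])
nsmallest()        The n rows with the smallest values in a column
                   nsmallest(n, [column1, column2, ...])
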
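The periods and axis parameters of diff() are easier to see on a small example. The sketch below uses a made-up two-column DataFrame; the comments describe which element is subtracted from which.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 6, 7, 2], "B": [2, 3, 1, 6]})

down = df.diff()             # row i minus row i-1; the first row is NaN
ahead = df.diff(periods=-1)  # row i minus row i+1; the last row is NaN
across = df.diff(axis=1)     # column B minus column A; column A is NaN

print(down)
print(ahead)
print(across)
```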
  • Mean of a specific column: mean()

    import numpy as np
    import pandas as pd
    
    df = pd.DataFrame(data={"姓名":["Alice","Bob","Charlie","LiHua","LiHua"],"年龄":[25,30,35,20,22],"score":[95,90,85,75,75],"性别":["男",np.nan,"男","男","男"]})
    print(df)
    print(f"Mean of score: {df['score'].mean()}")
    print(f"Mean of 年龄: {df['年龄'].mean()}")

  • Removing duplicates: drop_duplicates()

    import pandas as pd
    
    df = pd.DataFrame(data={"姓名":["Alice","Bob","Charlie","LiHua","LiHua"],"年龄":[25,30,35,20,22],"score":[95,90,85,75,75],"性别":["男","女","女","男","男"]})
    
    print(df)
    print("--"*20)
    print(df.drop_duplicates(subset="姓名"))  # deduplicate on the 姓名 column only; the first LiHua row is kept
    print("--"*20)
    print(df.drop_duplicates()) # deduplicate on all columns; nothing is removed here because the two LiHua rows differ in 年龄


  • Replacing existing values with specified values: replace()

    import numpy as np
    import pandas as pd
    
    df = pd.DataFrame(data={"姓名":["Alice","Bob","Charlie","LiHua","LiHua"],"年龄":[25,30,35,20,22],"score":[95,90,85,75,75],"性别":["男",np.nan,"男","男","男"]})
    print(df)
    print("--"*20)
    print(df.replace("LiHua","ZhangMing"))


 

  • Checking whether two DataFrames are identical: equals()

    import pandas as pd
    
    df1 = pd.DataFrame(data={"姓名":["Alice","Bob"],"年龄":[25,30],"score":[95,90],"性别":["男","男"]})
    df2 = pd.DataFrame(data={"姓名":["Alice","ZhangMing"],"年龄":[25,30],"score":[95,90],"性别":["男","男"]}) 
    df3 = pd.DataFrame(data={"姓名":["Alice","Bob"],"年龄":[25,30],"score":[95,90],"性别":["男","男"]}) 
    
    print(df1.equals(df2))
    print(df1.equals(df3))

    [Output]
    False
    True

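  • Cumulative maximum: cummax()

    • axis=0 works down each column; axis=1 works across each row

    • The first value is kept as is; from the second value on, each position takes the larger of the current value and the running maximum so far

    import pandas as pd
    
    df = pd.DataFrame(data={"A":[1,6,7,2,3],"B":[2,3,1,6,8]})
    print(df)
    print("--"*20)
    print(df.cummax(axis=0))
    print("--"*20)
    print(df.cummax(axis=1))
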
  • Sorting by a given column: sort_values()

    import pandas as pd
    
    df = pd.DataFrame(data={"姓名":["Alice","Bob","Charlie","LiHua","LiHua"],"年龄":[25,30,35,20,22],"score":[95,90,85,75,75],"性别":["男","女","男","男","男"]},index=[105,103,101,107,102])
    print(df)
    print(df.sort_values(by="score"))


  • The rows with the largest values in a column: nlargest()

    import pandas as pd
    
    df = pd.DataFrame(data={"姓名":["Alice","Bob","Charlie","LiHua","LiHua"],"年龄":[25,30,35,20,22],"score":[95,90,85,75,75],"性别":["男","女","男","男","男"]},index=[105,103,101,107,102])
    print(df)
    print(df.nlargest(n=3,columns="年龄"))

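The examples above sort by a single column. As a sketch of the multi-column form listed in the methods table, the following sorts by score in ascending order and breaks ties by 年龄 in descending order, then takes the two youngest rows with nsmallest(). The data is the same as in the sort_values() example, minus the 性别 column.

```python
import pandas as pd

df = pd.DataFrame(
    data={
        "姓名": ["Alice", "Bob", "Charlie", "LiHua", "LiHua"],
        "年龄": [25, 30, 35, 20, 22],
        "score": [95, 90, 85, 75, 75],
    },
    index=[105, 103, 101, 107, 102],
)

# score ascending first; ties on score are broken by 年龄 descending
ranked = df.sort_values(by=["score", "年龄"], ascending=[True, False])
print(ranked)

# the two rows with the smallest 年龄
youngest = df.nsmallest(n=2, columns="年龄")
print(youngest)
```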

IV. Boolean indexing on a DataFrame

import pandas as pd

df = pd.DataFrame(
 data={"age": [20, 30, 40, 10], "name": ["张三", "李四", "王五", "赵六"]},
 columns=["name", "age"],
 index=[101, 104, 103, 102],
)

print(df)
print("--"*20)

print(df["age"] > 25)  # boolean Series: True where age is greater than 25
print("--"*20)

print(df[df["age"] > 25])  # keep only the rows where the mask is True; returns a new DataFrame
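Boolean masks can also be combined. The sketch below, on the same data, uses `&` to require two conditions at once and isin() to build a mask from a list of allowed values. Each condition needs its own parentheses because `&` and `|` bind more tightly than comparisons.

```python
import pandas as pd

df = pd.DataFrame(
    data={"age": [20, 30, 40, 10], "name": ["张三", "李四", "王五", "赵六"]},
    columns=["name", "age"],
    index=[101, 104, 103, 102],
)

# & is "and", | is "or", ~ is "not"
between = df[(df["age"] > 15) & (df["age"] < 35)]
print(between)

# isin() builds the mask from a collection of allowed values
picked = df[df["name"].isin(["张三", "赵六"])]
print(picked)
```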

 

V. DataFrame arithmetic

1. DataFrame with a scalar

import pandas as pd

df = pd.DataFrame(
 data={"age": [20, 30, 40, 10], "name": ["张三", "李四", "王五", "赵六"]},
 columns=["name", "age"],
 index=[101, 104, 103, 102],
)

print(df)
print("--"*20)

print(df*2)
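In the example above the scalar is applied to every column, so the strings in name are repeated as well, because `*` on a string means repetition. If only the numeric columns should change, one option (a sketch, not the only way) is to pick them with select_dtypes() first.

```python
import pandas as pd

df = pd.DataFrame(
    data={"age": [20, 30, 40, 10], "name": ["张三", "李四", "王五", "赵六"]},
    columns=["name", "age"],
    index=[101, 104, 103, 102],
)

doubled = df * 2
print(doubled["name"].tolist())  # strings are repeated, not multiplied

# apply the scalar to the numeric columns only
numeric = df.select_dtypes(include="number") * 2
print(numeric)
```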

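2. DataFrame with a DataFrame

       Values are aligned by index label before the operation; labels that have no match in the other DataFrame are filled with NaN.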

import pandas as pd

df1 = pd.DataFrame(
 data={"age": [10, 20, 30, 40], "name": ["张三", "李四", "王五", "赵六"]},
 columns=["name", "age"],
 index=[101, 102, 103, 104],
)
df2 = pd.DataFrame(
 data={"age": [10, 20, 30, 40], "name": ["张三", "李四", "王五", "田七"]},
 columns=["name", "age"],
 index=[102, 103, 104, 105],
)
print(df1 + df2)
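When NaN for unmatched labels is not wanted, the add() method takes a fill_value that stands in for the missing partner. The sketch below uses two small numeric-only DataFrames (named `s1` and `s2` here purely for illustration) with the same index layout as the example above.

```python
import pandas as pd

s1 = pd.DataFrame({"age": [10, 20, 30, 40]}, index=[101, 102, 103, 104])
s2 = pd.DataFrame({"age": [10, 20, 30, 40]}, index=[102, 103, 104, 105])

plain = s1 + s2                    # labels 101 and 105 have no partner: NaN
filled = s1.add(s2, fill_value=0)  # a missing partner is treated as 0

print(plain)
print(filled)
```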

 

VI. Modifying a DataFrame

1. Setting the row index

(1) Setting the row index with set_index()

import pandas as pd

df = pd.DataFrame({"age": [20, 30, 40, 10], "name": ["张三", "李四", "王五", "赵六"], "id": [101, 102, 103, 104]})
print(df)
print("--"*20)

df.set_index("id", inplace=True)
print(df)

(2) Resetting the row index with reset_index()

import pandas as pd

df = pd.DataFrame(data={"age": [20, 30, 40, 10], "name": ["张三", "李四", "王五", "赵六"]},index=[101, 102, 103, 104])
print(df)
print("--"*20)

print(df.reset_index())
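Both methods return a new DataFrame unless inplace=True is passed, and reset_index() can either keep the old index as a column or discard it. A minimal sketch on a two-row DataFrame:

```python
import pandas as pd

df = pd.DataFrame({"age": [20, 30], "name": ["张三", "李四"], "id": [101, 102]})

indexed = df.set_index("id")  # returns a new DataFrame; df is unchanged
print(df.columns.tolist())
print(indexed.index.tolist())

restored = indexed.reset_index()          # the old index comes back as column "id"
dropped = indexed.reset_index(drop=True)  # the old index is discarded
print(restored.columns.tolist())
print(dropped.columns.tolist())
```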

2. Renaming row index labels and column names

(1) Renaming row index labels and column names with rename()

import pandas as pd

df = pd.DataFrame(data={"age": [20, 30, 40, 10], "name": ["张三", "李四", "王五", "赵六"]},index=[101, 102, 103, 104])
print(df)
print("--"*20)

df.rename(index={101:"一",102:"二",103:"三",104:"四"},columns={"age":"年龄","name":"姓名"},inplace=True)
print(df)

(2) Reassigning index and columns

import pandas as pd

df = pd.DataFrame(data={"age": [20, 30, 40, 10], "name": ["张三", "李四", "王五", "赵六"]},index=[101, 102, 103, 104])
print(df)
print("--"*20)
df.index = [108,109,110,111]
df.columns = ["年龄","姓名"]
print(df)
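The two approaches differ in how strict they are. rename() changes only the labels it matches and also accepts a function, while direct assignment must supply exactly one label per column. A short sketch of both behaviours:

```python
import pandas as pd

df = pd.DataFrame({"age": [20, 30], "name": ["张三", "李四"]}, index=[101, 102])

# rename() accepts a function as well as a dict
upper = df.rename(columns=str.upper)
print(upper.columns.tolist())

# direct assignment with the wrong number of labels raises ValueError
try:
    df.columns = ["年龄"]
except ValueError as exc:
    print("ValueError:", exc)
```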

3. Adding columns

(1) Adding a column with df["column name"]

import pandas as pd

df = pd.DataFrame(data={"age": [20, 30, 40, 10], "name": ["张三", "李四", "王五", "赵六"]},index=[101, 102, 103, 104])
print(df)
print("--"*20)

df["性别"]=["男","女","男","女"]
print(df)

(2) Inserting with insert(loc, column, value). This method has no inplace parameter; it modifies the original DataFrame directly.

import pandas as pd

df = pd.DataFrame(data={"age": [20, 30, 40, 10], "name": ["张三", "李四", "王五", "赵六"]},index=[101, 102, 103, 104])
print(df)
print("--"*20)

df.insert(loc=1,column="sex",value=["男","女","男","女"])
print(df)
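If the original should stay untouched, assign() is an alternative that returns a new DataFrame with the extra column. The sketch also shows that insert() rejects a column name that already exists (unless allow_duplicates=True is passed).

```python
import pandas as pd

df = pd.DataFrame({"age": [20, 30], "name": ["张三", "李四"]}, index=[101, 102])

# assign() returns a new DataFrame and leaves df untouched
with_sex = df.assign(sex=["男", "女"])
print(df.columns.tolist())
print(with_sex.columns.tolist())

# insert() refuses a column name that is already present
try:
    df.insert(loc=0, column="age", value=[1, 2])
except ValueError as exc:
    print("ValueError:", exc)
```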

 

4. Deleting columns/rows

(1) Deleting with df.drop("column name", axis=1); rows can be deleted the same way with axis=0

import pandas as pd

df = pd.DataFrame(data={"age": [20, 30, 40, 10], "name": ["张三", "李四", "王五", "赵六"]},index=[101, 102, 103, 104])
print(df)
print("--"*20)

df.drop("age",axis=1,inplace=True) # delete a column
print(df)
print("--"*20)

df.drop(103,axis=0,inplace=True)  # delete a row
print(df)
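drop() also accepts the columns= and index= keywords in place of axis, and both take lists, so several rows and columns can be removed in one call. A minimal sketch on the same data, without inplace=True so the original is preserved:

```python
import pandas as pd

df = pd.DataFrame(
    data={"age": [20, 30, 40, 10], "name": ["张三", "李四", "王五", "赵六"]},
    index=[101, 102, 103, 104],
)

# columns= and index= replace the axis argument and accept lists
trimmed = df.drop(columns=["age"], index=[101, 103])
print(trimmed)

# without inplace=True the original is left as it was
print(df.shape)
```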

 

VII. Importing and exporting DataFrame data

1. Exporting data

import os
import pandas as pd

# Create the output directory
os.makedirs("data", exist_ok=True)  

# Create the DataFrame
df = pd.DataFrame(data={"id":[5001,5002,5003,5004],"name":["张三","李四","王五","赵六"],"age":[21,23,25,20]})
df.set_index("id",inplace=True)

# Export df to a CSV file
df.to_csv("data/test.csv")

# Export df to a JSON file (orient="records" does not write the id index)
df.to_json("data/test.json",orient="records",force_ascii=False)

# Copy df to the clipboard
df.to_clipboard()

2. Importing data
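A minimal sketch of reading exported files back with read_csv() and read_json(), assuming the same CSV and JSON layout as in the export example. It writes to a temporary directory (an assumption made here so the sketch runs on its own) rather than to data/.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame(
    data={"id": [5001, 5002], "name": ["张三", "李四"], "age": [21, 23]}
).set_index("id")

with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "test.csv")
    json_path = os.path.join(tmp, "test.json")
    df.to_csv(csv_path)
    df.to_json(json_path, orient="records", force_ascii=False)

    # index_col restores "id" as the row index instead of a regular column
    from_csv = pd.read_csv(csv_path, index_col="id")
    # orient="records" does not store the index, so "id" is absent here
    from_json = pd.read_json(json_path, orient="records")

print(from_csv)
print(from_json)
```

With orient="records" the index is lost on export, so call reset_index() before to_json() if the id values need to survive the round trip.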

 
