Asked by: Dave Challis Asked: 8/3/2015 Updated: 11/25/2022 Views: 3886
Adding values for missing data combinations in Pandas
Q:
I have a pandas DataFrame containing the following:
person_id status year count
0 'pass' 1980 4
0 'fail' 1982 1
1 'pass' 1981 2
If I know that all possible values for each field are:
all_person_ids = [0, 1, 2]
all_statuses = ['pass', 'fail']
all_years = [1980, 1981, 1982]
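I'd like to fill in the original DataFrame with the missing combinations of data (person_id, status, and year), each with count=0, i.e. I would like the new DataFrame to contain: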
person_id status year count
0 'pass' 1980 4
0 'pass' 1981 0
0 'pass' 1982 0
0 'fail' 1980 0
0 'fail' 1981 0
0 'fail' 1982 1
1 'pass' 1980 0
1 'pass' 1981 2
1 'pass' 1982 0
1 'fail' 1980 0
1 'fail' 1981 0
1 'fail' 1982 0
2 'pass' 1980 0
2 'pass' 1981 0
2 'pass' 1982 0
2 'fail' 1980 0
2 'fail' 1981 0
2 'fail' 1982 0
Is there an efficient way to do this in pandas?
A:
14 votes
EdChum
8/3/2015
#1
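You can use itertools.product to generate all the combinations, construct a df from them, left-merge it with your original df, and call fillna to fill the missing count values with 0: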
In [77]:
import itertools
import pandas as pd
all_person_ids = [0, 1, 2]
all_statuses = ['pass', 'fail']
all_years = [1980, 1981, 1982]
combined = [all_person_ids, all_statuses, all_years]
df1 = pd.DataFrame(columns = ['person_id', 'status', 'year'], data=list(itertools.product(*combined)))
df1
Out[77]:
person_id status year
0 0 pass 1980
1 0 pass 1981
2 0 pass 1982
3 0 fail 1980
4 0 fail 1981
5 0 fail 1982
6 1 pass 1980
7 1 pass 1981
8 1 pass 1982
9 1 fail 1980
10 1 fail 1981
11 1 fail 1982
12 2 pass 1980
13 2 pass 1981
14 2 pass 1982
15 2 fail 1980
16 2 fail 1981
17 2 fail 1982
In [82]:
df1 = df1.merge(df, how='left').fillna(0)
df1
Out[82]:
person_id status year count
0 0 pass 1980 4
1 0 pass 1981 0
2 0 pass 1982 0
3 0 fail 1980 0
4 0 fail 1981 0
5 0 fail 1982 1
6 1 pass 1980 0
7 1 pass 1981 2
8 1 pass 1982 0
9 1 fail 1980 0
10 1 fail 1981 0
11 1 fail 1982 0
12 2 pass 1980 0
13 2 pass 1981 0
14 2 pass 1982 0
15 2 fail 1980 0
16 2 fail 1981 0
17 2 fail 1982 0
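For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same product-then-merge idea. The helper name `fill_missing_combinations` is hypothetical (it is not part of pandas), and the sample frame is rebuilt from the question's data. The cast back to the original dtype is an addition of this sketch, since a left merge that introduces NaN usually leaves `count` as float.

```python
import itertools

import pandas as pd


def fill_missing_combinations(df, levels, value_col, fill_value=0):
    """Hypothetical helper: return one row per combination of `levels`.

    `levels` maps each key column name to the list of all allowed values.
    """
    keys = list(levels)
    # Cartesian product of all allowed values, one row per combination.
    full = pd.DataFrame(list(itertools.product(*levels.values())), columns=keys)
    # The left merge keeps every combination; unmatched rows get NaN in value_col.
    out = full.merge(df, on=keys, how="left")
    # Fill the gaps, then restore the original dtype that NaN turned into float.
    out[value_col] = out[value_col].fillna(fill_value).astype(df[value_col].dtype)
    return out


# Sample data rebuilt from the question.
df = pd.DataFrame(
    {
        "person_id": [0, 0, 1],
        "status": ["pass", "fail", "pass"],
        "year": [1980, 1982, 1981],
        "count": [4, 1, 2],
    }
)
levels = {
    "person_id": [0, 1, 2],
    "status": ["pass", "fail"],
    "year": [1980, 1981, 1982],
}
result = fill_missing_combinations(df, levels, "count")
print(result)
```

Passing the allowed values as a dict keeps the column names and their value lists together, so adding a fourth key column only means adding one more entry.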
11 votes
HYRY
8/3/2015
#2
Create a MultiIndex with MultiIndex.from_product(), then apply set_index(), reindex(), and reset_index():
import pandas as pd
import io
all_person_ids = [0, 1, 2]
all_statuses = ['pass', 'fail']
all_years = [1980, 1981, 1982]
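# io.StringIO is needed for a str literal on Python 3; sep=r"\s+" replaces the deprecated delim_whitespace=True
df = pd.read_csv(io.StringIO("""person_id status year count
0 pass 1980 4
0 fail 1982 1
1 pass 1981 2"""), sep=r"\s+")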
names = ["person_id", "status", "year"]
mind = pd.MultiIndex.from_product(
[all_person_ids, all_statuses, all_years], names=names)
df.set_index(names).reindex(mind, fill_value=0).reset_index()
Comments
0 votes
Dave Challis
8/3/2015
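Works great - could you give a rough explanation of what each of the steps above is doing? (I haven't used reindex or reset_index before, but I'll read up on them shortly.)
1 vote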
HYRY
8/3/2015
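reindex() aligns the rows to the new index, and fill_value=0 fills the rows that would otherwise be NaN with 0. I think you can keep the MultiIndex, because you can use it to select elements quickly. You can convert the index back to columns with reset_index().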
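To make the steps described above concrete, here is a small sketch that keeps each intermediate result in its own variable. The variable names (`indexed`, `full_index`, `completed`, `flat`) are only for illustration, and the sample data is rebuilt from the question.

```python
import pandas as pd

# Sample data rebuilt from the question.
df = pd.DataFrame(
    {
        "person_id": [0, 0, 1],
        "status": ["pass", "fail", "pass"],
        "year": [1980, 1982, 1981],
        "count": [4, 1, 2],
    }
)
names = ["person_id", "status", "year"]

# Step 1: move the key columns into a MultiIndex so rows are addressed by key.
indexed = df.set_index(names)

# Step 2: build the index of every allowed combination (3 * 2 * 3 = 18 entries).
full_index = pd.MultiIndex.from_product(
    [[0, 1, 2], ["pass", "fail"], [1980, 1981, 1982]], names=names
)

# Step 3: reindex aligns the existing rows to full_index; keys that were
# absent receive fill_value instead of NaN.
completed = indexed.reindex(full_index, fill_value=0)

# Keeping the MultiIndex allows lookups by a full key tuple.
print(completed.loc[(0, "fail", 1982), "count"])

# Step 4: turn the index levels back into ordinary columns.
flat = completed.reset_index()
print(flat.shape)
```

Stopping after step 3 is a reasonable choice when the result is mainly used for lookups; step 4 is only needed when the keys should be ordinary columns again.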
0 votes
vamsi
11/16/2022
This is mostly a static approach; is there any way to do this dynamically by date?
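One possible reading of this question is that the allowed values should be derived from the data rather than typed in by hand. Under that assumption, a rough sketch could take the key values from `unique()` and build the date axis with `pd.date_range` between the earliest and latest date present. The `date` column, its sample values, and the daily frequency are assumptions made for illustration; none of them appear in the original question.

```python
import pandas as pd

# Hypothetical data with a date column instead of a year column.
df = pd.DataFrame(
    {
        "person_id": [0, 0, 1],
        "status": ["pass", "fail", "pass"],
        "date": pd.to_datetime(["2022-01-01", "2022-01-04", "2022-01-02"]),
        "count": [4, 1, 2],
    }
)
names = ["person_id", "status", "date"]

# Derive each level from the data itself; the date level covers every day
# from the earliest to the latest date that occurs in the frame.
all_dates = pd.date_range(df["date"].min(), df["date"].max(), freq="D")
full_index = pd.MultiIndex.from_product(
    [df["person_id"].unique(), df["status"].unique(), all_dates], names=names
)

# Same set_index / reindex / reset_index chain as in the answer above.
completed = df.set_index(names).reindex(full_index, fill_value=0).reset_index()
print(completed)
```

Note that values never observed in the data (such as person_id 2 in the question) cannot be recovered this way; they still have to be supplied explicitly.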
5 votes
mozway
4/5/2022
#3
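You can use complete from pyjanitor (imported as janitor). It accepts column names as input, as well as {name: values} dictionaries containing an exhaustive list of the desired values to complete: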
import janitor
df.complete({'person_id': [0,1,2]}, 'status', 'year').fillna(0, downcast='infer')
Output:
person_id status year count
0 0 'fail' 1980 0
1 0 'fail' 1981 0
2 0 'fail' 1982 1
3 0 'pass' 1980 4
4 0 'pass' 1981 0
5 0 'pass' 1982 0
6 1 'fail' 1980 0
7 1 'fail' 1981 0
8 1 'fail' 1982 0
9 1 'pass' 1980 0
10 1 'pass' 1981 2
11 1 'pass' 1982 0
12 2 'fail' 1980 0
13 2 'fail' 1981 0
14 2 'fail' 1982 0
15 2 'pass' 1980 0
16 2 'pass' 1981 0
17 2 'pass' 1982 0
1 vote
G.G
11/25/2022
#4
all_person_ids = [0, 1, 2]
all_statuses = ['pass', 'fail']
all_years = [1980, 1981, 1982]
pd.Series(all_person_ids).to_frame('person_id').merge(pd.Series(all_statuses).to_frame('status'), how='cross')\
.merge(pd.Series(all_years).to_frame('year'), how='cross')\
.merge(df1,on=['person_id','status','year'], how='left')\
.fillna(0)
person_id status year count
0 0 pass 1980 4.0
1 0 pass 1981 0.0
2 0 pass 1982 0.0
3 0 fail 1980 0.0
4 0 fail 1981 0.0
5 0 fail 1982 1.0
6 1 pass 1980 0.0
7 1 pass 1981 2.0
8 1 pass 1982 0.0
9 1 fail 1980 0.0
10 1 fail 1981 0.0
11 1 fail 1982 0.0
12 2 pass 1980 0.0
13 2 pass 1981 0.0
14 2 pass 1982 0.0
15 2 fail 1980 0.0
16 2 fail 1981 0.0
17 2 fail 1982 0.0
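The `count` column above is shown as float (`4.0`, `0.0`) because the left merge introduces NaN before `fillna` runs. If integer counts are wanted, one option is to cast after filling. The sketch below rebuilds the question's data under the name `df` (an assumption, so that the snippet runs on its own) and uses `how='cross'`, which needs pandas 1.2 or later.

```python
import pandas as pd

# Sample data rebuilt from the question (the name `df` is an assumption here).
df = pd.DataFrame(
    {
        "person_id": [0, 0, 1],
        "status": ["pass", "fail", "pass"],
        "year": [1980, 1982, 1981],
        "count": [4, 1, 2],
    }
)
ids = pd.DataFrame({"person_id": [0, 1, 2]})
statuses = pd.DataFrame({"status": ["pass", "fail"]})
years = pd.DataFrame({"year": [1980, 1981, 1982]})

# Two cross joins give every (person_id, status, year) combination.
grid = ids.merge(statuses, how="cross").merge(years, how="cross")

result = (
    grid.merge(df, on=["person_id", "status", "year"], how="left")
    .fillna({"count": 0})        # fill only the count column
    .astype({"count": "int64"})  # NaN forced float; cast back to integer
)
print(result.dtypes)
```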