Asked by: 3UqU57GnaX Asked: 6/14/2023 Updated: 6/14/2023 Views: 117
Boolean logic in filters when loading parquet file
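Q:
I want to remove people who were born in 1900 and have not died yet.
The code below works, but it needs two filters to remove that one kind of row. Is there a simpler way to remove these rows with a single filter?
Minimal code to reproduce: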
import pandas as pd
data = [
(1900, None,), # needs to be removed
(1900, 2000,),
(2000, None,),
(2000, 2020,),
]
df = pd.DataFrame(data, columns=['birth', 'death'])
df.to_parquet('test.parquet')
# Rows which do not match the filter predicate will be removed
filters= [
[
('birth', '!=', 1900),
],
[
('birth', '=', 1900),
('death', 'not in', [None]),
]
]
df2 = pd.read_parquet('test.parquet', filters=filters)
df2.head()
Answers:
1 upvote
mozway
6/14/2023
#1
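You don't actually need the ('birth', '=', 1900) condition: the second branch only matters when birth != 1900 is false, so birth is already known to be 1900 there. You can keep just:
(NOT BIRTH == 1900) OR (DEATH NOT IN NONE)
which is equivalent to:
NOT (BIRTH == 1900 AND DEATH IN NONE)
filters = [[('birth', '!=', 1900)], [('death', 'not in', [None])]]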
df2 = pd.read_parquet('test.parquet', filters=filters)
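One way to convince yourself of this equivalence is to evaluate the three predicates on the four sample rows in plain Python. This is only a sketch of the boolean logic, not how pyarrow evaluates filters; the function names `original`, `simplified` and `negated` are hypothetical and exist only for this illustration.

```python
# Sample rows from the question: (birth, death); None means "not dead yet".
rows = [(1900, None), (1900, 2000), (2000, None), (2000, 2020)]

def original(birth, death):
    # Question's two filters: (birth != 1900) OR (birth == 1900 AND death not null)
    return birth != 1900 or (birth == 1900 and death is not None)

def simplified(birth, death):
    # Simplified form: (birth != 1900) OR (death not null)
    return birth != 1900 or death is not None

def negated(birth, death):
    # De Morgan form: NOT (birth == 1900 AND death is null)
    return not (birth == 1900 and death is None)

# Rows that survive the simplified predicate.
kept = [r for r in rows if simplified(*r)]

# True only if all three predicates give the same answer on every row.
agree = all(original(*r) == simplified(*r) == negated(*r) for r in rows)

print(kept)
print(agree)
```

Only the (1900, None) row is dropped, and all three forms agree on every row.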
You can also use a pyarrow expression:
import pyarrow.compute as pc
filters = (pc.field('birth')!=1900) | ~pc.field('death').isin([None])
# or
filters = ~( (pc.field('birth')==1900) & pc.field('death').isin([None]) )
Output:
   birth   death
0   1900  2000.0
1   2000     NaN
2   2000  2020.0
Comments
1 upvote
3UqU57GnaX
6/14/2023
Great, thanks. pc.field is especially handy because it allows more freedom (i.e. there is no forced OR at the "outer" level).