Asked by LondonRob on 11/14/2013 · Last edited by cottontail, LondonRob · Updated 11/7/2023 · Viewed 1,189,545 times
How to filter Pandas dataframe using 'in' and 'not in' like in SQL
Q:
How can I achieve the equivalents of SQL's IN and NOT IN?
I have a list with the required values. Here's the scenario:
df = pd.DataFrame({'country': ['US', 'UK', 'Germany', 'China']})
countries_to_keep = ['UK', 'China']
# pseudo-code:
df[df['country'] not in countries_to_keep]
My current way of doing this is as follows:
df = pd.DataFrame({'country': ['US', 'UK', 'Germany', 'China']})
df2 = pd.DataFrame({'country': ['UK', 'China'], 'matched': True})
# IN
df.merge(df2, how='inner', on='country')
# NOT IN
not_in = df.merge(df2, how='left', on='country')
not_in = not_in[pd.isnull(not_in['matched'])]
But this seems like a horrible kludge. Can anyone improve on it?
A:
You can use pd.Series.isin.
For "IN" use: something.isin(somewhere)
Or for "NOT IN": ~something.isin(somewhere)
As a worked example:
>>> df
country
0 US
1 UK
2 Germany
3 China
>>> countries_to_keep
['UK', 'China']
>>> df.country.isin(countries_to_keep)
0 False
1 True
2 False
3 True
Name: country, dtype: bool
>>> df[df.country.isin(countries_to_keep)]
country
1 UK
3 China
>>> df[~df.country.isin(countries_to_keep)]
country
0 US
2 Germany
I'm usually doing generic filtering over rows like this:
criterion = lambda row: row['countries'] not in countries
not_in = df[df.apply(criterion, axis=1)]
Comments
I wanted to filter out dfbc rows that have a BUSINESS_ID that is also in the BUSINESS_ID of dfProfilesBusIds:
dfbc = dfbc[~dfbc['BUSINESS_ID'].isin(dfProfilesBusIds['BUSINESS_ID'])]
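As a minimal, self-contained sketch of that pattern (the frames below are made-up stand-ins; the real dfbc and dfProfilesBusIds are not shown in the comment):
import pandas as pd

# Hypothetical frames standing in for dfbc and dfProfilesBusIds
dfbc = pd.DataFrame({'BUSINESS_ID': [1, 2, 3, 4], 'amount': [10, 20, 30, 40]})
dfProfilesBusIds = pd.DataFrame({'BUSINESS_ID': [2, 4]})

# Keep only the dfbc rows whose BUSINESS_ID does NOT appear in dfProfilesBusIds
dfbc = dfbc[~dfbc['BUSINESS_ID'].isin(dfProfilesBusIds['BUSINESS_ID'])]
print(dfbc)  # leaves the rows with BUSINESS_ID 1 and 3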
An alternative solution that uses the .query() method:
In [5]: df.query("country in @countries_to_keep")
Out[5]:
country
1 UK
3 China
In [6]: df.query("country not in @countries_to_keep")
Out[6]:
country
0 US
2 Germany
df = pd.DataFrame({'countries':['US','UK','Germany','China']})
countries = ['UK','China']
To implement 'in':
df[df.countries.isin(countries)]
To implement 'not in' (as 'in' of the remaining countries):
df[df.countries.isin([x for x in np.unique(df.countries) if x not in countries])]
How to implement 'in' and 'not in' for a pandas DataFrame?
pandas offers two methods: Series.isin and DataFrame.isin, for Series and DataFrames respectively.
Filter DataFrame based on ONE column (this also applies to Series)
The most common scenario is applying an isin condition on a specific column to filter rows in a DataFrame.
import numpy as np
import pandas as pd

df = pd.DataFrame({'countries': ['US', 'UK', 'Germany', np.nan, 'China']})
df
  countries
0        US
1        UK
2   Germany
3       NaN
4     China
c1 = ['UK', 'China'] # list
c2 = {'Germany'} # set
c3 = pd.Series(['China', 'US']) # Series
c4 = np.array(['US', 'UK']) # array
Series.isin accepts various types as inputs. The following are all valid ways of getting what you want:
df['countries'].isin(c1)
0 False
1 True
2 False
3 False
4 True
Name: countries, dtype: bool
# `in` operation
df[df['countries'].isin(c1)]
countries
1 UK
4 China
# `not in` operation
df[~df['countries'].isin(c1)]
countries
0 US
2 Germany
3 NaN
# Filter with `set` (tuples work too)
df[df['countries'].isin(c2)]
countries
2 Germany
# Filter with another Series
df[df['countries'].isin(c3)]
countries
0 US
4 China
# Filter with array
df[df['countries'].isin(c4)]
countries
0 US
1 UK
Filter on MANY columns
Sometimes, you will want to apply an 'in' membership check with some search terms over multiple columns:
df2 = pd.DataFrame({
'A': ['x', 'y', 'z', 'q'], 'B': ['w', 'a', np.nan, 'x'], 'C': np.arange(4)})
df2
A B C
0 x w 0
1 y a 1
2 z NaN 2
3 q x 3
c1 = ['x', 'w', 'p']
To apply the isin condition to both columns 'A' and 'B', use DataFrame.isin:
df2[['A', 'B']].isin(c1)
A B
0 True True
1 False False
2 False False
3 False True
From this, to retain rows where at least one column is True, we can use any along the first axis:
df2[['A', 'B']].isin(c1).any(axis=1)
0 True
1 False
2 False
3 True
dtype: bool
df2[df2[['A', 'B']].isin(c1).any(axis=1)]
A B C
0 x w 0
3 q x 3
Note that if you want to search every column, you would just omit the column selection step and do:
df2.isin(c1).any(axis=1)
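With the frame above this happens to give the same mask, because column C (integers) contains no matches for c1:
0     True
1    False
2    False
3     True
dtype: bool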
Similarly, to retain rows where ALL the columns are True, use all in the same manner as before:
df2[df2[['A', 'B']].isin(c1).all(axis=1)]
A B C
0 x w 0
Notable mentions: numpy.isin, query, list comprehensions (string data)
In addition to the methods described above, you can also use the numpy equivalent: numpy.isin.
# `in` operation
df[np.isin(df['countries'], c1)]
countries
1 UK
4 China
# `not in` operation
df[np.isin(df['countries'], c1, invert=True)]
countries
0 US
2 Germany
3 NaN
Why is it worth considering? NumPy functions are usually a bit faster than their pandas equivalents because of lower overhead. Since this is an elementwise operation that does not depend on index alignment, there are very few situations where this method is not a suitable replacement for pandas' isin.
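If you want to check that overhead claim on your own data, here is a rough timing sketch (the frame and sizes below are assumptions, not taken from the answer; results will vary with data size and library versions):
import timeit
import numpy as np
import pandas as pd

# A made-up large frame, only for timing purposes
df_big = pd.DataFrame({'countries': np.random.choice(['US', 'UK', 'Germany', 'China'], size=10**6)})
terms = ['UK', 'China']

print(timeit.timeit(lambda: df_big['countries'].isin(terms), number=10))      # pandas
print(timeit.timeit(lambda: np.isin(df_big['countries'], terms), number=10))  # numpy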
pandas routines are usually iterative when working with strings, because string operations are hard to vectorise. There is a lot of evidence to suggest that list comprehensions will be faster here.
We now resort to an in check.
c1_set = set(c1) # Using `in` with `sets` is a constant time operation...
# This doesn't matter for pandas because the implementation differs.
# `in` operation
df[[x in c1_set for x in df['countries']]]
countries
1 UK
4 China
# `not in` operation
df[[x not in c1_set for x in df['countries']]]
countries
0 US
2 Germany
3 NaN
However, it is a lot clunkier to specify, so don't use it unless you know what you're doing.
Lastly, there's also DataFrame.query, which has been covered in this answer. numexpr FTW!
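For completeness, a minimal sketch of that query approach on the countries frame above, reusing the first section's search terms c1 = ['UK', 'China']:
# `in` / `not in` inside a query string; @ references the local variable
df.query("countries in @c1")      # rows 1 (UK) and 4 (China)
df.query("countries not in @c1")  # rows 0 (US), 2 (Germany) and 3 (NaN)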
Comments
Collating possible solutions from the answers:
For IN: df[df['A'].isin([3, 6])]
For NOT IN:
df[-df["A"].isin([3, 6])]
df[~df["A"].isin([3, 6])]
df[df["A"].isin([3, 6]) == False]
df[np.logical_not(df["A"].isin([3, 6]))]
Comments
Note that np.logical_not(...) is just the function form of the ~ operator.
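A tiny sketch illustrating that equivalence (the frame here is made up):
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 3, 5, 6]})
mask = df['A'].isin([3, 6])

# np.logical_not(mask) and ~mask produce the same inverted boolean mask
print((np.logical_not(mask) == ~mask).all())  # True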
A trick if you want to keep the order of the list:
df = pd.DataFrame({'country': ['US', 'UK', 'Germany', 'China']})
countries_to_keep = ['Germany', 'US']
ind=[df.index[df['country']==i].tolist() for i in countries_to_keep]
flat_ind=[item for sublist in ind for item in sublist]
df.reindex(flat_ind)
country
2 Germany
0 US
My 2 cents' worth: I needed a combination of 'in' and an if-else statement for a dataframe, and this worked for me:
sale_method = pd.DataFrame(model_data["Sale Method"].str.upper())
sale_method["sale_classification"] = np.where(
sale_method["Sale Method"].isin(["PRIVATE"]),
"private",
np.where(
sale_method["Sale Method"].str.contains("AUCTION"), "auction", "other"
),
)
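A self-contained way to try that snippet (model_data below is a made-up stand-in, since the original frame isn't shown):
import numpy as np
import pandas as pd

# Hypothetical input; only the "Sale Method" column is used
model_data = pd.DataFrame({"Sale Method": ["Private", "Auction - Onsite", "Tender"]})

sale_method = pd.DataFrame(model_data["Sale Method"].str.upper())
sale_method["sale_classification"] = np.where(
    sale_method["Sale Method"].isin(["PRIVATE"]),
    "private",
    np.where(
        sale_method["Sale Method"].str.contains("AUCTION"), "auction", "other"
    ),
)
print(sale_method)  # classifies the three rows as private, auction and other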
Why is nobody talking about the performance of the various filtering methods? In fact, this topic pops up here regularly (see this example). I did my own performance test on a large data set. It is very interesting and instructive.
import numpy as np
import pandas as pd

df = pd.DataFrame({'animals': np.random.choice(['cat', 'dog', 'mouse', 'birds'], size=10**7),
'number': np.random.randint(0,100, size=(10**7,))})
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000000 entries, 0 to 9999999
Data columns (total 2 columns):
# Column Dtype
--- ------ -----
0 animals object
1 number int64
dtypes: int64(1), object(1)
memory usage: 152.6+ MB
%%timeit
# .isin() by one column
conditions = ['cat', 'dog']
df[df.animals.isin(conditions)]
367 ms ± 2.34 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
# .query() by one column
conditions = ['cat', 'dog']
df.query('animals in @conditions')
395 ms ± 3.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
# .loc[]
df.loc[(df.animals=='cat')|(df.animals=='dog')]
987 ms ± 5.17 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
df[df.apply(lambda x: x['animals'] in ['cat', 'dog'], axis=1)]
41.9 s ± 490 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
new_df = df.set_index('animals')
new_df.loc[['cat', 'dog'], :]
3.64 s ± 62.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
new_df = df.set_index('animals')
new_df[new_df.index.isin(['cat', 'dog'])]
469 ms ± 8.98 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
s = pd.Series(['cat', 'dog'], name='animals')
df.merge(s, on='animals', how='inner')
796 ms ± 30.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
So the isin method turned out to be the fastest, and the method with apply() was the slowest, which is not surprising.
You can also call .isin() inside query():
df.query('country.isin(@countries_to_keep).values')
# Or alternatively:
df.query('country.isin(["UK", "China"]).values')
To negate the query, use ~:
df.query('~country.isin(@countries_to_keep).values')
Update:
Another way is to use comparison operators:
df.query('country == @countries_to_keep')
# Or alternatively:
df.query('country == ["UK", "China"]')
To negate the query, use !=:
df.query('country != @countries_to_keep')
groupby can be used to "split" a dataframe into two
If the objective is to split a dataframe into two dataframes, one with the countries to keep and one without, the boolean mask created by the isin call can be used in a groupby call to split the frame into two: have and have-not.
df = pd.DataFrame({'country': ['US', 'UK', 'Germany', 'China'], 'value': range(4)})
countries_to_keep = ['UK', 'China']
df1, df2 = [g for _, g in df.groupby(df['country'].isin(countries_to_keep))]
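For reference, with the frame above this gives (the False group, i.e. the countries not kept, comes first because groupby sorts the boolean keys):
df1          # countries not in countries_to_keep
   country  value
0       US      0
2  Germany      2
df2          # countries in countries_to_keep
  country  value
1      UK      1
3   China      3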
eval() can be used as well
query() was suggested elsewhere for evaluating boolean expressions. A related method is eval(). It can be used to create a boolean mask to filter the frame, and it can be combined with other masks, possibly created elsewhere, for more flexible filtering.
msk = df.eval('country in @countries_to_keep')
to_keep = df[msk] # in
not_keep = df[~msk] # not in
One specific case where this is useful is when you want to filter a single column using a condition. query is quite memory-inefficient there, because it creates a copy of the filtered frame, which then needs to be filtered again for the single column, whereas loc selects the column in one go using a boolean-mask-plus-column-label combination. eval() can do the same.¹
df = pd.DataFrame({'country': ['US', 'UK', 'Germany', 'China']*25000})
df[[f"col{i}" for i in range(50)]] = np.random.rand(100000, 50)
countries_to_keep = ['UK', 'China']
filtered = df.loc[df.eval('country==@countries_to_keep'), 'col1']
¹ Memory profiler tests:
import numpy as np
import pandas as pd
%load_ext memory_profiler
df = pd.DataFrame({'country': ['US', 'UK', 'Germany', 'China']*25000})
df[[f"col{i}" for i in range(50)]] = np.random.rand(100000, 50)
countries_to_keep = ['UK', 'China']
%memit x = df.loc[df.eval('country==@countries_to_keep'), 'col1']
# peak memory: 157.28 MiB, increment: 5.44 MiB
%memit y = df.query('country==@countries_to_keep')['col1']
# peak memory: 195.39 MiB, increment: 38.11 MiB
%memit z = df.loc[df['country'].isin(countries_to_keep), 'col1']
# peak memory: 176.93 MiB, increment: 0.76 MiB