In Polars, is there a better way to only return items within a string if they match items in a list using .is_in?

Q:

Is there a better way to return each pl.element() in a Polars list only if it matches an item contained in a Python list?

The code below works, but it raises a warning, which leads me to believe there may be a more concise or better way: The predicate 'col("").is_in([Series])' in 'when->then->otherwise' is not a valid aggregation and might produce a different number of rows than the group_by operation would. This behavior is experimental and may be subject to change

import polars as pl

terms = ['a', 'z']

(pl.LazyFrame({'a':['x y z']})
   .select(pl.col('a')
             .str.split(' ')
             .list.eval(pl.when(pl.element().is_in(terms))
                          .then(pl.element())
                          .otherwise(None))
             .list.drop_nulls()
             .list.join(' ')
           )
   .fetch()
 )

For posterity's sake, it replaces my previous attempt, which used .map_elements():

import polars as pl
import re

terms = ['a', 'z']

(pl.LazyFrame({'a':['x y z']})
   .select(pl.col('a')
             .str.split(' ')
             .map_elements(lambda x: ' '.join(list(set(re.findall('|'.join(terms), x)))),
                           return_dtype = pl.Utf8)
           )
   .fetch()
 )

Tags: arraylist, python-polars, .when, isin

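You can use .filter() inside list.eval(), e.g. .list.eval(pl.element().filter(pl.element().is_in(terms))).list.join(' '). There is also pl.col("a").str.split(" ").list.set_intersection(terms).list.join(" "), but that only works if a term is not repeated, and I'm not sure whether it preserves the order of the input. Is the goal to remove all "words" from a string column that are not contained in the terms list?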

A:

2 votes, Dean MacGregor, 10/19/2023, #1

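In addition to the tips listed by @jqurious in the comments, you can also do a regex extract. This started out simple but got a bit clunky as I tried different things. The good thing about the Rust regex engine is that it is very performant. The bad thing is that it has no look-arounds, and working around that is what makes this look clunky.

Without look-arounds, to make sure we don't take the z from zebra, I had to extract the space before and after each term. Of course, there is no space before the first letter or after the last letter, which is why I concatenate a space before and after the initial column. Additionally, to make sure it can capture two terms in a row, I had to replace every single space with a double space; these are replaced back with single spaces after the extract step.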

import polars as pl

terms = ['a', 'z', 'x']
termsre = "(" + "|".join([f" {x} " for x in terms]) + ")"
(pl.LazyFrame({'a':['x y z z zebra a', 'x y z', 'a b c']})
 .with_columns(
     b = (pl.lit(" ") + pl.col('a')
       .str.replace_all(" ", "  ") + pl.lit(" "))
       .str.extract_all(termsre)
       .list.join('')
       .str.replace_all("  "," ")
       .str.strip_chars()
 )
 .collect()
)
shape: (3, 2)
┌─────────────────┬─────────┐
│ a               ┆ b       │
│ ---             ┆ ---     │
│ str             ┆ str     │
╞═════════════════╪═════════╡
│ x y z z zebra a ┆ x z z a │
│ x y z           ┆ x z     │
│ a b c           ┆ a       │
└─────────────────┴─────────┘
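To see why the space doubling matters, the same padding trick can be reproduced with Python's standard re module. This is only an illustration of the idea, not the Polars code path: the helper name `keep_terms` and its `double_spaces` flag are hypothetical, and `re.escape` is added as a precaution that the original snippet does not use.

```python
import re

terms = ["a", "z", "x"]

# Each alternative requires a literal space on both sides of the term.
pattern = "(" + "|".join(f" {re.escape(t)} " for t in terms) + ")"


def keep_terms(text, double_spaces=True):
    """Keep only whole words found in `terms`, without using look-arounds."""
    body = text.replace(" ", "  ") if double_spaces else text
    padded = " " + body + " "            # so first and last words are space-delimited
    matches = re.findall(pattern, padded)
    return "".join(matches).replace("  ", " ").strip()


print(keep_terms("x y z z zebra a"))                       # with doubling
print(keep_terms("x y z z zebra a", double_spaces=False))  # without doubling
```

Without doubling, the space after the first z is consumed by that match, so the adjacent second z has no leading space left to match and is dropped. Doubling gives each word its own pair of delimiters.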

As a side note, fetch is meant for debugging with a limited number of rows. You usually want to use collect.
