How to strip HTML elements from strings in a nested list, Python

Question:

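I decided to use BeautifulSoup to extract integers stored as strings from a Pandas column. BeautifulSoup works well on a simple example, but it has no effect on a Pandas column whose cells are lists. I can't find the mistake. Can you help?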

Input:

df = pd.DataFrame({
    "col1":[["<span style='color: red;'>9</span>", "abcd"], ["a", "b, d"], ["a, b, z, x, y"], ["a, y","y, z, b"]], 
    "col2":[0, 1, 0, 1],
})

for list in df["col1"]:
    for item in list:
        if "span" in item:
            soup = BeautifulSoup(item, features = "lxml")
            item = soup.get_text()
        else:
            None  

print(df)

Expected output:

df = pd.DataFrame({
        "col1":[["9", "abcd"], ["a", "b, d"], ["a, b, z, x, y"], ["a, y","y, z, b"]], 
        "col2":[0, 1, 0, 1],
    })

Tags: python, html, pandas, beautifulsoup, xml-parsing

Answer:

Assigning to item inside the loop only rebinds the loop variable; the cleaned string is never written back into the list, so the DataFrame stays unchanged. Build a new list for each row and assign the result back to the column:

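import pandas as pd
from bs4 import BeautifulSoup

def extract_integer(item):
    # Only parse strings that contain a span tag; return everything else as-is.
    if "span" in item:
        soup = BeautifulSoup(item, features="lxml")  # requires the lxml package
        return soup.get_text()
    return item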

df = pd.DataFrame({
    "col1":[["<span style='color: red;'>9</span>", "abcd"], ["a", "b, d"], ["a, b, z, x, y"], ["a, y","y, z, b"]], 
    "col2":[0, 1, 0, 1],
})

df["col1"] = df["col1"].apply(lambda x: [extract_integer(item) for item in x])

print(df)

Output:

               col1  col2
0         [9, abcd]     0
1         [a, b, d]     1
2   [a, b, z, x, y]     0
3   [a, y, y, z, b]     1
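
To see why the loop in the question leaves the data unchanged, here is a minimal sketch on plain nested lists. It is an illustration, not the code from the question or the answer: strip_tags and _TextCollector are hypothetical helpers built on the standard library's html.parser, standing in for BeautifulSoup's get_text() so the example runs without third-party packages.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Hypothetical helper: collects only the text found between tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(text):
    """Stdlib stand-in (an assumption) for BeautifulSoup(...).get_text()."""
    collector = _TextCollector()
    collector.feed(text)
    collector.close()  # flush any text still buffered by the parser
    return "".join(collector.parts)


rows = [["<span style='color: red;'>9</span>", "abcd"], ["a", "b, d"]]

# Pattern from the question: `item` is rebound, the list is never touched.
for row in rows:
    for item in row:
        item = strip_tags(item)
after_rebinding = [list(row) for row in rows]

# Writing back by index changes the list in place.
for row in rows:
    for i, item in enumerate(row):
        row[i] = strip_tags(item)
after_index_assignment = [list(row) for row in rows]

print(after_rebinding)
print(after_index_assignment)
```

The first loop prints the original strings with the span markup still present; only the second loop, which assigns to row[i], stores the cleaned text. The list comprehension inside apply in the answer achieves the same result by building a fresh list per row instead of mutating the existing one.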
