How to strip HTML elements from strings in a nested list, Python

Asked by: Mr.Slow Asked: 12/20/2022 Updated: 12/20/2022 Views: 75

Q:

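I decided to use BeautifulSoup to extract string integers from a Pandas column. BeautifulSoup works well on a simple example, but it does not work on a Pandas column whose values are lists. I can't find any mistake. Can you help?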

Input:

import pandas as pd
from bs4 import BeautifulSoup

df = pd.DataFrame({
    "col1":[["<span style='color: red;'>9</span>", "abcd"], ["a", "b, d"], ["a, b, z, x, y"], ["a, y","y, z, b"]], 
    "col2":[0, 1, 0, 1],
})

for list in df["col1"]:
    for item in list:
        if "span" in item:
            soup = BeautifulSoup(item, features = "lxml")
            item = soup.get_text()
        else:
            None  

print(df)

This is what I get

Desired output:

df = pd.DataFrame({
        "col1":[["9", "abcd"], ["a", "b, d"], ["a, b, z, x, y"], ["a, y","y, z, b"]], 
        "col2":[0, 1, 0, 1],
    })
python html pandas beautifulsoup xml-parsing

A:

1 vote Xiddoc 12/20/2022 #1

You are trying to iterate over a Series with a for loop, but when working with Pandas it is preferable and simpler to use the apply function, like so:

from bs4 import BeautifulSoup

def extract_text(lst):
    new_lst = []
    for item in lst:
        if "span" in item:
            new_lst.append(BeautifulSoup(item, features="lxml").text)
        else:
            new_lst.append(item)
            
    return new_lst

df['col1'] = df['col1'].apply(extract_text)

Or you can write it as a one-liner with a list comprehension:

df['col1'] = df['col1'].apply(
    lambda lst: [BeautifulSoup(item, features = "lxml").text if "span" in item else item for item in lst]
)
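
Neither snippet above spells out why the loop in the question leaves df unchanged. The reason is that `item = soup.get_text()` only rebinds the loop variable and never writes back into the list. A minimal sketch of that behavior, assuming plain nested lists stand in for the values of df["col1"] (the names `rows`, `unchanged` and `changed` are illustrative, and the literal "9" stands in for the result of soup.get_text()):

```python
# Plain nested lists stand in for the values of df["col1"].
rows = [["<span>9</span>", "abcd"], ["a", "b, d"]]

# Rebinding the loop variable: the lists themselves are not touched.
for row in rows:
    for item in row:
        if "span" in item:
            item = "9"  # only the local name `item` now refers to "9"

unchanged = [list(row) for row in rows]  # snapshot after the first loop

# Assigning by index writes into the list object itself.
for row in rows:
    for i, item in enumerate(row):
        if "span" in item:
            row[i] = "9"  # stand-in for soup.get_text()

changed = rows
```

The apply-based versions sidestep this by building a new list for each row and assigning the result back to the column.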

1 vote Jamiu S. 12/20/2022 #2

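This applies the extract_integer function to every item of every list in col1. If an item contains "span", it is replaced with the extracted text (the integer, still as a string); otherwise the item is left unchanged.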

import pandas as pd
from bs4 import BeautifulSoup

def extract_integer(item):
    if "span" in item:
        soup = BeautifulSoup(item, features = "lxml")
        return soup.get_text()
    return item

df = pd.DataFrame({
    "col1":[["<span style='color: red;'>9</span>", "abcd"], ["a", "b, d"], ["a, b, z, x, y"], ["a, y","y, z, b"]], 
    "col2":[0, 1, 0, 1],
})

df["col1"] = df["col1"].apply(lambda x: [extract_integer(item) for item in x])

print(df)

Output:

               col1  col2
0         [9, abcd]     0
1         [a, b, d]     1
2   [a, b, z, x, y]     0
3   [a, y, y, z, b]     1
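
If installing lxml is not an option, a rough standard-library variant might look like the sketch below. It swaps BeautifulSoup for `html.parser.HTMLParser`, which is not what either answer uses, and it drops the "span" check so that any tag is stripped. `TextExtractor` and `strip_tags` are hypothetical names, and a plain list of lists stands in for df["col1"].

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the text nodes of a fragment and ignores the tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called for every run of text between tags.
        self.parts.append(data)


def strip_tags(item):
    """Return the text content of `item` with any HTML tags removed."""
    parser = TextExtractor()
    parser.feed(item)
    parser.close()  # flush text that is still buffered
    return "".join(parser.parts)


# A list of lists stands in for df["col1"] (assumption for this sketch).
col1 = [["<span style='color: red;'>9</span>", "abcd"], ["a", "b, d"]]
cleaned = [[strip_tags(item) for item in row] for row in col1]
```

With a real DataFrame the same helper could be used as `df["col1"].apply(lambda lst: [strip_tags(item) for item in lst])`.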