Asked by: Mr.Slow  Asked: 12/20/2022  Updated: 12/20/2022  Views: 75
How to strip HTML elements from strings in a nested list, Python
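Q:
I decided to use BeautifulSoup to extract string integers from a Pandas column. BeautifulSoup works well on a simple example, but it does not work on a Pandas column of lists. I can't find any error. Can you help?
Input: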
import pandas as pd
from bs4 import BeautifulSoup

df = pd.DataFrame({
    "col1": [["<span style='color: red;'>9</span>", "abcd"], ["a", "b, d"], ["a, b, z, x, y"], ["a, y", "y, z, b"]],
    "col2": [0, 1, 0, 1],
})

for list in df["col1"]:
    for item in list:
        if "span" in item:
            soup = BeautifulSoup(item, features = "lxml")
            item = soup.get_text()
        else:
            None
print(df)
Expected output:
df = pd.DataFrame({
    "col1": [["9", "abcd"], ["a", "b, d"], ["a, b, z, x, y"], ["a, y", "y, z, b"]],
    "col2": [0, 1, 0, 1],
})
A:
1 upvote
Xiddoc
12/20/2022
#1
You are trying to iterate over a Series with a for loop, but when working with Pandas it is preferable and simpler to use the apply function, like so:
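from bs4 import BeautifulSoup

def extract_text(lst):
    # Build a new list instead of trying to modify the items in place.
    new_lst = []
    for item in lst:
        if "span" in item:
            # Parse the markup and keep only its text content.
            new_lst.append(BeautifulSoup(item, features="lxml").text)
        else:
            new_lst.append(item)
    return new_lst

# apply calls extract_text once per row; the result replaces the column.
df['col1'] = df['col1'].apply(extract_text)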
Or you can use a list comprehension to make it a one-liner:
df['col1'] = df['col1'].apply(
    lambda lst: [BeautifulSoup(item, features="lxml").text if "span" in item else item for item in lst]
)
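As background on why the loop in the question leaves df unchanged: writing item = soup.get_text() only rebinds the loop variable to a new string, and the element stored in the list is never touched. The sketch below shows this with plain nested lists standing in for the column. Note that strip_tags is a hypothetical helper built on a simplistic regex so the example runs without extra packages; it is an assumption for illustration, not a replacement for BeautifulSoup.

```python
import re

def strip_tags(text):
    # Hypothetical, simplistic stand-in for BeautifulSoup(...).get_text().
    return re.sub(r"<[^>]+>", "", text)

rows = [["<span style='color: red;'>9</span>", "abcd"], ["a", "b, d"]]

# Rebinding the loop variable: the list element itself is not changed.
for row in rows:
    for item in row:
        if "span" in item:
            item = strip_tags(item)
unchanged = [list(row) for row in rows]  # snapshot after the first loop

# Assigning by index writes the new value back into the list.
for row in rows:
    for i, item in enumerate(row):
        if "span" in item:
            row[i] = strip_tags(item)

print(unchanged[0])  # still contains the <span> markup
print(rows[0])       # ['9', 'abcd']
```

Both versions above (apply with a function or with a list comprehension) sidestep the problem by building a new list and assigning it back to the column.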
1 upvote
Jamiu S.
12/20/2022
#2
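This applies the extract_integer function to each element of the col1 column. If an element contains "span", the original value is replaced with the extracted integer (as a string); if it does not, the value is left unchanged.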
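import pandas as pd
from bs4 import BeautifulSoup

def extract_integer(item):
    if "span" in item:
        soup = BeautifulSoup(item, features = "lxml")
        # get_text() returns the text content of the markup as a string.
        return soup.get_text()
    # Items without markup are returned unchanged.
    return item

df = pd.DataFrame({
    "col1": [["<span style='color: red;'>9</span>", "abcd"], ["a", "b, d"], ["a, b, z, x, y"], ["a, y", "y, z, b"]],
    "col2": [0, 1, 0, 1],
})

# Rebuild each row's list and assign the result back to the column.
df["col1"] = df["col1"].apply(lambda x: [extract_integer(item) for item in x])
print(df)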
Output:
              col1  col2
0        [9, abcd]     0
1        [a, b, d]     1
2  [a, b, z, x, y]     0
3  [a, y, y, z, b]     1
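Note that the 9 in the output is still the string "9", which matches the expected output in the question. If actual integers were wanted, one possible follow-up step is sketched below. The helper name to_int_if_digits is hypothetical, and plain lists stand in for the cleaned column.

```python
def to_int_if_digits(item):
    # Hypothetical helper: convert strings made only of ASCII digits to int,
    # and leave everything else unchanged.
    if item.isascii() and item.isdigit():
        return int(item)
    return item

cleaned = [["9", "abcd"], ["a", "b, d"]]
converted = [[to_int_if_digits(x) for x in row] for row in cleaned]
print(converted)  # [[9, 'abcd'], ['a', 'b, d']]
```

With the DataFrame from the answer, the same helper could be used inside apply in the same way as extract_integer.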