Asked by: Jeff R · Asked: 9/20/2023 · Last edited by: Jeff R · Updated: 9/20/2023 · Views: 35
Extracting specific tag from XML in python using BeautifulSoup
Q:
I have a metadata file that looks like this:
<?xml version='1.0' encoding='utf-8'?>
<package xmlns="http://www.idpf.org/2007/opf" unique-identifier="uuid_id" version="2.0">
<metadata xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:opf="http://www.idpf.org/2007/opf">
<dc:title>Princeton Review Digital SAT Premium Prep, 2024: 4 Practice Tests + Online Flashcards + Review &amp; Tools</dc:title>
<dc:creator opf:file-as="Princeton Review, The" opf:role="aut">The Princeton Review</dc:creator>
<dc:identifier opf:scheme="ISBN">9780593516874</dc:identifier>
<dc:identifier opf:scheme="AMAZON">0593516877</dc:identifier>
<dc:identifier opf:scheme="GOODREADS">63139948</dc:identifier>
<dc:identifier opf:scheme="GOOGLE">o6i4EAAAQBAJ</dc:identifier>
</metadata>
</package>
I know how to use BeautifulSoup to extract fields like <dc:title>. I'm struggling with how to extract only the ISBN field (<dc:identifier opf:scheme="ISBN">).
from bs4 import BeautifulSoup
with open('metadata.opf', 'r') as f:
    file = f.read()
metadata = BeautifulSoup(file, 'xml')
title = metadata.find('dc:title')
print(title.text)
author = metadata.find('dc:creator')
print(author.text)
# isbn = metadata.find_all('dc:identifier')  # This finds 4 fields, as expected.
How can I restrict it to the ISBN? I can't rely on the order of the fields, and the ISBN length may vary.
A:
0 votes
RQussous
9/20/2023
#1
According to the documentation, the find method accepts an attrs argument. Using it, you should be able to select the ISBN:
isbn = metadata.find('dc:identifier', attrs={"opf:scheme": "ISBN"})
So the code can be written like this:
from bs4 import BeautifulSoup
with open('metadata.opf', 'r') as f:
    file = f.read()
metadata = BeautifulSoup(file, 'xml')
title = metadata.find('dc:title')
print(title.text)
author = metadata.find('dc:creator')
print(author.text)
isbn = metadata.find('dc:identifier', attrs={"opf:scheme": "ISBN"})  # Matches only the identifier whose opf:scheme is "ISBN".
print(isbn.text)
This should print:
Princeton Review Digital SAT Premium Prep, 2024: 4 Practice Tests + Online Flashcards + Review & Tools
The Princeton Review
9780593516874
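Since the question says the field order cannot be relied on, here is a minimal sketch that runs the same attrs lookup against a sample where the ISBN is deliberately last. The inline SAMPLE_OPF string and the find_identifier helper are hypothetical names introduced only for illustration, and the sketch assumes the lxml package is installed so that the 'xml' parser is available.

```python
from bs4 import BeautifulSoup

# Hypothetical inline sample: same structure as the question's file
# (XML declaration omitted for brevity), with the ISBN placed last.
SAMPLE_OPF = """<package xmlns="http://www.idpf.org/2007/opf" unique-identifier="uuid_id" version="2.0">
<metadata xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:opf="http://www.idpf.org/2007/opf">
<dc:identifier opf:scheme="AMAZON">0593516877</dc:identifier>
<dc:identifier opf:scheme="GOODREADS">63139948</dc:identifier>
<dc:identifier opf:scheme="ISBN">9780593516874</dc:identifier>
</metadata>
</package>"""


def find_identifier(soup, scheme):
    """Return the text of the dc:identifier with the given opf:scheme, or None.

    Hypothetical helper: find() returns None when nothing matches, so the
    result is checked before reading .text.
    """
    tag = soup.find('dc:identifier', attrs={'opf:scheme': scheme})
    return tag.text if tag is not None else None


soup = BeautifulSoup(SAMPLE_OPF, 'xml')
isbn = find_identifier(soup, 'ISBN')       # matched by attribute, not position
missing = find_identifier(soup, 'GOOGLE')  # this scheme is absent from the sample
print(isbn)
print(missing)
```

The None check matters because calling .text directly on a failed find() raises AttributeError, which is easy to hit when a file lacks an ISBN entry.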
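Comments
metadata.find('dc:identifier') should find only the first one, which is the ISBN you are looking for, because you are calling find rather than findAll. What is the problem? metadata.find('dc:identifier').text returns '9780593516874'.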
[data for data in metadata.findAll('dc:identifier') if data['opf:scheme'] == 'ISBN']
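One caveat with the list comprehension above: data['opf:scheme'] raises KeyError if any dc:identifier lacks that attribute. The sketch below is one possible variation, not the commenter's code. It uses Tag.get, which returns None for a missing attribute, and find_all, the current spelling of findAll. The SAMPLE string, including its scheme-less identifier, is a hypothetical example, and lxml is again assumed to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: the first identifier has no opf:scheme attribute.
SAMPLE = """<package xmlns="http://www.idpf.org/2007/opf" version="2.0">
<metadata xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:opf="http://www.idpf.org/2007/opf">
<dc:identifier id="uuid_id">hypothetical-uuid-value</dc:identifier>
<dc:identifier opf:scheme="GOOGLE">o6i4EAAAQBAJ</dc:identifier>
<dc:identifier opf:scheme="ISBN">9780593516874</dc:identifier>
</metadata>
</package>"""

soup = BeautifulSoup(SAMPLE, 'xml')

# Tag.get returns None for a missing attribute instead of raising KeyError.
isbn_values = [
    tag.text
    for tag in soup.find_all('dc:identifier')
    if tag.get('opf:scheme') == 'ISBN'
]
print(isbn_values)

# Build a scheme -> value mapping so every identifier can be looked up by name.
by_scheme = {
    tag['opf:scheme']: tag.text
    for tag in soup.find_all('dc:identifier')
    if tag.has_attr('opf:scheme')
}
print(by_scheme.get('ISBN'))
```

The mapping is handy when several identifiers are needed from the same file, since the document is only searched once.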