Asked by: Elias S Asked: 10/29/2023 Last edited by: Elias S Updated: 11/2/2023 Views: 99
lxml iterparse eats up memory for a 4GB XML file, even when clear() is used
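Question:
The goal of this script is to count the number of articles/books published per year, reading this information from the year elements in the XML file dblp-2023-10-01.xml. The file can be found here: https://dblp.org/xml/release/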
from lxml import etree

xmlfile = 'dblp-2023-10-01.xml'

doc = etree.iterparse(xmlfile, tag='year', load_dtd=True)
_, root = next(doc)
counter_dict = {}
for event, element in doc:
    if element.text not in counter_dict:
        counter_dict[element.text] = 1
    else:
        counter_dict[element.text] += 1
    root.clear()
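When I run the code on a small file, it runs smoothly. What puzzles me is that when I run it on the dblp file, memory usage exceeds 4GB (the size of the file), which makes no sense to me.
I also tried an alternative version to make sure it clears what it has already parsed: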
for ancestor in elem.xpath('ancestor-or-self::*'):
    while ancestor.getprevious() is not None:
        del ancestor.getparent()[0]
This brought no improvement.
Answers:
0 votes
Martin Honnen
10/29/2023
#1
I can't say why lxml's iterparse needs all that memory, but I tried a simple SAX program:
import xml.sax

counter_dict = {}

class YearHandler(xml.sax.ContentHandler):
    def __init__(self):
        self.year = ''
        self.isYear = False

    def startElement(self, tag, attributes):
        # Start collecting text when a <year> element opens.
        if tag == 'year':
            self.isYear = True
            self.year = ''

    def endElement(self, tag):
        # On </year>, convert the collected text and count it.
        if self.isYear and tag == 'year':
            self.isYear = False
            yearInt = int(self.year)
            if yearInt in counter_dict:
                counter_dict[yearInt] += 1
            else:
                counter_dict[yearInt] = 1

    def characters(self, content):
        # Text may arrive in several chunks, so append rather than assign.
        if self.isYear:
            self.year += content

if __name__ == '__main__':
    parser = xml.sax.make_parser()
    parser.setFeature(xml.sax.handler.feature_namespaces, 0)
    parser.setContentHandler(YearHandler())
    parser.parse('dblp-2023-10-01.xml')
    print(counter_dict)
On my Windows machine it uses less than 10 MB of memory and prints:
{2014: 292279, 2005: 158268, 2011: 250783, 2012: 263294, 2018: 374688, 2008: 203544, 1997: 57099, 2010: 228955, 2016: 314744, 2017: 339456, 2013: 280709, 2002: 97758, 2004: 135735, 2009: 222653, 2007: 189562, 2006: 176458, 1999: 71138, 2015: 302656, 2022: 470484, 2023: 305500, 2019: 417602, 2020: 433127, 1992: 34900, 2021: 456839, 1988: 21633, 1998: 64297, 1986: 16475, 1989: 24001, 1987: 17549, 2001: 86798, 1994: 45290, 1990: 28166, 2003: 116385, 1995: 47712, 2000: 80955, 1993: 40695, 1991: 31084, 1996: 52809, 1954: 225, 1971: 3120, 2024: 536, 1985: 13890, 1984: 12334, 1982: 9939, 1975: 5246, 1983: 10860, 1980: 7787, 1981: 8662, 1964: 1108, 1977: 5961, 1976: 5695, 1972: 3751, 1974: 5007, 1979: 6913, 1973: 4414, 1978: 6786, 1967: 1763, 1965: 1291, 1969: 2113, 1968: 2182, 1970: 2227, 1966: 1503, 1959: 715, 1961: 903, 1953: 173, 1960: 625, 1957: 343, 1955: 213, 1958: 464, 1956: 355, 1951: 46, 1962: 1186, 1952: 114, 1963: 1032, 1946: 31, 1947: 10, 1945: 9, 1939: 18, 1948: 41, 1942: 13, 1949: 52, 1941: 13, 1937: 16, 1940: 10, 1936: 12, 1950: 29, 1943: 8, 1944: 5, 1938: 11}
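The counting in the handler above can be written a little more compactly with collections.Counter. The following is only a minimal sketch of that idea, not the answer's original code: the names YearCounter and count_years_sax are made up for illustration, and it parses a tiny in-memory sample instead of the DBLP file.

```python
import xml.sax
from collections import Counter


class YearCounter(xml.sax.ContentHandler):
    """Hypothetical handler: counts the integer values of <year> elements."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()
        self._chunks = None  # None while we are outside a <year> element

    def startElement(self, name, attrs):
        if name == "year":
            self._chunks = []

    def characters(self, content):
        # The parser may deliver the text in several pieces.
        if self._chunks is not None:
            self._chunks.append(content)

    def endElement(self, name):
        if name == "year" and self._chunks is not None:
            self.counts[int("".join(self._chunks))] += 1
            self._chunks = None


def count_years_sax(xml_bytes):
    """Parse an in-memory XML document and return a Counter of years."""
    handler = YearCounter()
    xml.sax.parseString(xml_bytes, handler)
    return handler.counts


sample = (b"<dblp><article><year>2001</year></article>"
          b"<article><year>2001</year></article>"
          b"<book><year>1999</year></book></dblp>")
print(count_years_sax(sample))
```

Keeping the counter on the handler instead of in a module-level dictionary makes the handler reusable for more than one file.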
Comments
0 votes
Elias S
10/29/2023
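Thanks Martin! What was the execution time? My code took about 5 minutes, and memory only increased at certain points; I still can't figure out what is going on.
0 votes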
Martin Honnen
10/29/2023
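It took a few minutes, yes; I think 4 to 5 minutes on my machine, but I think that is to be expected for a 4 GB input. During the run I did not notice any large change in memory usage. Of course the dictionary keeps being added to, but as I said, Task Manager never showed more than 10 MB of memory in use. Certainly nothing came close to the input size or kept growing while the input was processed.
1 vote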
Hermann12
11/2/2023
#2
Option 1: Locally, my machine finished in 2.5 minutes, so you also need to add the download time. You should not only call root.clear(), but rather:
elem.clear()
from lxml import etree
import gzip
import psutil
import time

time_start = time.time()
# Read the compressed release directly, without unpacking it first.
fd = gzip.open('dblp-2023-10-01.xml.gz', "r")
counter_dict = {}
for event, elem in etree.iterparse(fd, events=['end'], recover=True):
    if elem.tag == "year":
        if elem.text not in counter_dict:
            counter_dict[elem.text] = 1
        else:
            counter_dict[elem.text] += 1
    # Clear every element once its end tag has been processed.
    elem.clear()

#print(counter_dict)
print(dict(sorted(counter_dict.items())))
print("RAM:")
# Resident set size of this process in MiB.
print(psutil.Process().memory_info().rss / (1024 * 1024))
print("Time:")
print((time.time() - time_start))
Output:
{'1936': 12, '1937': 16, '1938': 11, '1939': 18, '1940': 10, '1941': 13, '1942': 13, '1943': 8, '1944': 5, '1945': 9, '1946': 31, '1947': 10, '1948': 41, '1949': 52, '1950': 29, '1951': 46, '1952': 114, '1953': 173, '1954': 225, '1955': 213, '1956': 355, '1957': 343, '1958': 464, '1959': 715, '1960': 625, '1961': 903, '1962': 1186, '1963': 1032, '1964': 1108, '1965': 1291, '1966': 1503, '1967': 1763, '1968': 2182, '1969': 2113, '1970': 2227, '1971': 3120, '1972': 3751, '1973': 4414, '1974': 5007, '1975': 5246, '1976': 5695, '1977': 5961, '1978': 6786, '1979': 6913, '1980': 7787, '1981': 8662, '1982': 9939, '1983': 10860, '1984': 12334, '1985': 13890, '1986': 16475, '1987': 17549, '1988': 21633, '1989': 24001, '1990': 28166, '1991': 31084, '1992': 34900, '1993': 40695, '1994': 45290, '1995': 47712, '1996': 52809, '1997': 57099, '1998': 64297, '1999': 71138, '2000': 80955, '2001': 86798, '2002': 97758, '2003': 116385, '2004': 135735, '2005': 158268, '2006': 176458, '2007': 189562, '2008': 203544, '2009': 222653, '2010': 228955, '2011': 250783, '2012': 263294, '2013': 280709, '2014': 292279, '2015': 302656, '2016': 314744, '2017': 339456, '2018': 374688, '2019': 417602, '2020': 433127, '2021': 456839, '2022': 470484, '2023': 305500, '2024': 536}
RAM:
1141.16796875
Time:
151.0215344429016
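The run above still reports about 1.1 GB of RSS. One plausible reason, stated here as an assumption rather than something measured, is that elem.clear() empties each element but leaves the emptied element attached to the root, so the root's child list keeps growing. A related point about the question's code: with tag='year' and the default 'end' events, the first event returned by next(doc) is the first year element, not the document root, so root.clear() there would only empty that one element. The sketch below requests 'start' events as well, takes the root from the first 'start' event, and clears it each time a top-level record has ended. The function name count_years and the sample data are made up for illustration, the standard-library fallback import is only there so the sketch runs without lxml, and none of this has been tested against the DBLP file.

```python
import io
from collections import Counter

try:
    from lxml import etree  # the library used in this thread
except ImportError:  # assumption: fall back to the stdlib so the sketch still runs
    import xml.etree.ElementTree as etree


def count_years(source):
    """Count <year> values, dropping each top-level record once it is done."""
    counts = Counter()
    root = None
    depth = 0
    for event, elem in etree.iterparse(source, events=("start", "end")):
        if event == "start":
            if root is None:
                root = elem  # the very first start event is the real root
            depth += 1
        else:
            depth -= 1
            if elem.tag == "year":
                counts[elem.text] += 1
            if depth == 1:
                # A direct child of the root just ended: detach all finished
                # records so the root's child list does not keep growing.
                root.clear()
    return counts


sample = (b"<dblp><article><year>2001</year></article>"
          b"<book><year>2001</year></book>"
          b"<article><year>1999</year></article></dblp>")
print(count_years(io.BytesIO(sample)))
```

For the real file, source could be the gzip file object from Option 1, and any parser options that file needs (such as recover=True or DTD loading) would have to be passed to iterparse as in the original code.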
Option 2 - download and parse the stream:
import gzip
from urllib.request import urlopen
from lxml import etree
import psutil
import time

time_start = time.time()
url = "https://dblp.org/xml/release/"
file = "dblp-2023-10-01.xml.gz"
fd = url+file
f = urlopen(fd)
# Decompress the HTTP response on the fly instead of saving it to disk.
fz = gzip.GzipFile(fileobj=f, mode="r")
counter_dict = {}
for event, elem in etree.iterparse(fz, events=['end'], recover=True):
    if elem.tag == "year":
        if elem.text not in counter_dict:
            counter_dict[elem.text] = 1
        else:
            counter_dict[elem.text] += 1
    # Clear every element once its end tag has been processed.
    elem.clear()

#print(counter_dict)
print(dict(sorted(counter_dict.items())))
print("RAM:")
# Resident set size of this process in MiB.
print(psutil.Process().memory_info().rss / (1024 * 1024))
print("Time:")
print((time.time() - time_start))
Output:
{'1936': 12, '1937': 16, '1938': 11, '1939': 18, '1940': 10, '1941': 13, '1942': 13, '1943': 8, '1944': 5, '1945': 9, '1946': 31, '1947': 10, '1948': 41, '1949': 52, '1950': 29, '1951': 46, '1952': 114, '1953': 173, '1954': 225, '1955': 213, '1956': 355, '1957': 343, '1958': 464, '1959': 715, '1960': 625, '1961': 903, '1962': 1186, '1963': 1032, '1964': 1108, '1965': 1291, '1966': 1503, '1967': 1763, '1968': 2182, '1969': 2113, '1970': 2227, '1971': 3120, '1972': 3751, '1973': 4414, '1974': 5007, '1975': 5246, '1976': 5695, '1977': 5961, '1978': 6786, '1979': 6913, '1980': 7787, '1981': 8662, '1982': 9939, '1983': 10860, '1984': 12334, '1985': 13890, '1986': 16475, '1987': 17549, '1988': 21633, '1989': 24001, '1990': 28166, '1991': 31084, '1992': 34900, '1993': 40695, '1994': 45290, '1995': 47712, '1996': 52809, '1997': 57099, '1998': 64297, '1999': 71138, '2000': 80955, '2001': 86798, '2002': 97758, '2003': 116385, '2004': 135735, '2005': 158268, '2006': 176458, '2007': 189562, '2008': 203544, '2009': 222653, '2010': 228955, '2011': 250783, '2012': 263294, '2013': 280709, '2014': 292279, '2015': 302656, '2016': 314744, '2017': 339456, '2018': 374688, '2019': 417602, '2020': 433127, '2021': 456839, '2022': 470484, '2023': 305500, '2024': 536}
RAM:
1084.80859375
Time:
148.59651041030884
Also read here about the memory efficiency of common data structures.
Comments
'https://dblp.org/xml/release/dblp-2023-10-01.xml.gz'