lxml iterparse eats up memory for a 4GB XML file, even when clear() is used

Asked by Elias S on 10/29/2023, last edited by Elias S, updated 11/2/2023, viewed 99 times

Question:

The purpose of this script is to count how many articles/books were published per year, taking that information from the year elements in the XML file dblp-2023-10-01.xml. The file can be found here: https://dblp.org/xml/release/

from lxml import etree

xmlfile = 'dblp-2023-10-01.xml'

doc = etree.iterparse(xmlfile, tag='year', load_dtd=True)
_, root = next(doc)
counter_dict = {}
for event, element in doc:
    if element.text not in counter_dict:
        counter_dict[element.text] = 1
    else:
        counter_dict[element.text] += 1
    root.clear() 

When I run the code on a small file, it runs smoothly. What puzzles me is that when I run it on the dblp file, memory use exceeds 4 GB (the size of the file), which makes no sense to me.

I also tried an alternative version, to make sure it clears what it has already parsed:

    for ancestor in element.xpath('ancestor-or-self::*'):
        while ancestor.getprevious() is not None:
            del ancestor.getparent()[0]

It did not improve anything.
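One way to narrow this down is to sample, every so often inside the loop, how many children are still attached to the actual document root and how much resident memory the process uses; if the clearing really detaches what has already been parsed, both numbers should stay roughly flat. A minimal sketch along these lines, mirroring the posted code and only adding the sampling, using psutil (the answers below use the same module) and an arbitrary sampling interval:

from lxml import etree
import psutil

xmlfile = 'dblp-2023-10-01.xml'

doc = etree.iterparse(xmlfile, tag='year', load_dtd=True)
# note: with tag='year', the first reported event is the first </year>,
# so this is a <year> element rather than the document root
_, root = next(doc)
proc = psutil.Process()

counter_dict = {}
for i, (event, element) in enumerate(doc):
    counter_dict[element.text] = counter_dict.get(element.text, 0) + 1
    root.clear()
    if i % 500_000 == 0:
        # how many records are still attached under the real document root,
        # and how much resident memory the process is using right now
        doc_root = element.getroottree().getroot()
        rss_mb = proc.memory_info().rss / (1024 * 1024)
        print(f'{i:,} <year> elements seen, {len(doc_root):,} children under '
              f'<{doc_root.tag}>, {rss_mb:.0f} MB RSS')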

python xml lxml ram large-files

Comments

0 votes Jim Garrison 10/29/2023
You need to provide more debugging information, including a stack trace and some output. Does it fail before processing any events? If during, after how many? Etc. Please read How to Ask.
0 votes jdweng 10/29/2023
Try PowerShell: $uri = 'https://dblp.org/xml/release/dblp-2023-10-01.xml.gz'; $response = Invoke-WebRequest -uri $uri; $numbers = $response.Content; Write-Host $numbers

Answers:

0 votes Martin Honnen 10/29/2023 #1

I can't say why lxml's iterparse needs all that memory, but I tried a simple SAX program:

import xml.sax

counter_dict = {}

class YearHandler(xml.sax.ContentHandler):

    def __init__(self):
        self.year = ''
        self.isYear = False

    def startElement(self, tag, attributes):
        # start collecting character data when a <year> element opens
        if tag == 'year':
            self.isYear = True
            self.year = ''

    def endElement(self, tag):
        # on </year>, count the collected year value
        if self.isYear and tag == 'year':
            self.isYear = False
            yearInt = int(self.year)
            if yearInt in counter_dict:
                counter_dict[yearInt] += 1
            else:
                counter_dict[yearInt] = 1

    def characters(self, content):
        if self.isYear:
            self.year += content

if __name__ == '__main__':

    parser = xml.sax.make_parser()

    parser.setFeature(xml.sax.handler.feature_namespaces, 0)

    parser.setContentHandler(YearHandler())

    parser.parse('dblp-2023-10-01.xml')

    print(counter_dict)

On my Windows machine it consumes less than 10 MB of memory and outputs:

{2014: 292279, 2005: 158268, 2011: 250783, 2012: 263294, 2018: 374688, 2008: 203544, 1997: 57099, 2010: 228955, 2016: 314744, 2017: 339456, 2013: 280709, 2002: 97758, 2004: 135735, 2009: 222653, 2007: 189562, 2006: 176458, 1999: 71138, 2015: 302656, 2022: 470484, 2023: 305500, 2019: 417602, 2020: 433127, 1992: 34900, 2021: 456839, 1988: 21633, 1998: 64297, 1986: 16475, 1989: 24001, 1987: 17549, 2001: 86798, 1994: 45290, 1990: 28166, 2003: 116385, 1995: 47712, 2000: 80955, 1993: 40695, 1991: 31084, 1996: 52809, 1954: 225, 1971: 3120, 2024: 536, 1985: 13890, 1984: 12334, 1982: 9939, 1975: 5246, 1983: 10860, 1980: 7787, 1981: 8662, 1964: 1108, 1977: 5961, 1976: 5695, 1972: 3751, 1974: 5007, 1979: 6913, 1973: 4414, 1978: 6786, 1967: 1763, 1965: 1291, 1969: 2113, 1968: 2182, 1970: 2227, 1966: 1503, 1959: 715, 1961: 903, 1953: 173, 1960: 625, 1957: 343, 1955: 213, 1958: 464, 1956: 355, 1951: 46, 1962: 1186, 1952: 114, 1963: 1032, 1946: 31, 1947: 10, 1945: 9, 1939: 18, 1948: 41, 1942: 13, 1949: 52, 1941: 13, 1937: 16, 1940: 10, 1936: 12, 1950: 29, 1943: 8, 1944: 5, 1938: 11}

Comments

0 votes Elias S 10/29/2023
Thanks Martin! What was the execution time? My code took about 5 minutes, and memory only increased at certain points; I still can't figure out what is going on.
0 votes Martin Honnen 10/29/2023
It took a few minutes, yes, I think 4 to 5 minutes on my machine, but for 4 GB of input I think that is to be expected. I didn't notice any big changes in memory use while it ran; the dictionary keeps growing, of course, but as I said, in Task Manager I never saw more than 10 MB of memory use. Certainly nothing anywhere near the size of the input, and nothing that kept growing while the input was processed.
1 vote Hermann12 11/2/2023 #2

Option 1: On my machine this finishes in 2.5 minutes with the file already available locally, so you would need to add download time on top. You should not only clear the root with root.clear(), but clear every element with elem.clear():

from lxml import etree
import gzip

import psutil
import time
time_start = time.time()

# parse straight from the gzipped download; no need to unpack the 4 GB XML first
fd = gzip.open('dblp-2023-10-01.xml.gz', "r")

counter_dict = {}
for event, elem in etree.iterparse(fd, events=['end'], recover=True):
    if elem.tag == "year":
        if elem.text not in counter_dict:
            counter_dict[elem.text] = 1
        else:
            counter_dict[elem.text] += 1
    # clear every element once it has been handled, not just the root
    elem.clear()

#print(counter_dict)
print(dict(sorted(counter_dict.items())))

print("RAM:")
print(psutil.Process().memory_info().rss / (1024 * 1024))
print("Time:")
print((time.time() - time_start))

Output:

{'1936': 12, '1937': 16, '1938': 11, '1939': 18, '1940': 10, '1941': 13, '1942': 13, '1943': 8, '1944': 5, '1945': 9, '1946': 31, '1947': 10, '1948': 41, '1949': 52, '1950': 29, '1951': 46, '1952': 114, '1953': 173, '1954': 225, '1955': 213, '1956': 355, '1957': 343, '1958': 464, '1959': 715, '1960': 625, '1961': 903, '1962': 1186, '1963': 1032, '1964': 1108, '1965': 1291, '1966': 1503, '1967': 1763, '1968': 2182, '1969': 2113, '1970': 2227, '1971': 3120, '1972': 3751, '1973': 4414, '1974': 5007, '1975': 5246, '1976': 5695, '1977': 5961, '1978': 6786, '1979': 6913, '1980': 7787, '1981': 8662, '1982': 9939, '1983': 10860, '1984': 12334, '1985': 13890, '1986': 16475, '1987': 17549, '1988': 21633, '1989': 24001, '1990': 28166, '1991': 31084, '1992': 34900, '1993': 40695, '1994': 45290, '1995': 47712, '1996': 52809, '1997': 57099, '1998': 64297, '1999': 71138, '2000': 80955, '2001': 86798, '2002': 97758, '2003': 116385, '2004': 135735, '2005': 158268, '2006': 176458, '2007': 189562, '2008': 203544, '2009': 222653, '2010': 228955, '2011': 250783, '2012': 263294, '2013': 280709, '2014': 292279, '2015': 302656, '2016': 314744, '2017': 339456, '2018': 374688, '2019': 417602, '2020': 433127, '2021': 456839, '2022': 470484, '2023': 305500, '2024': 536}

RAM:
1141.16796875
Time:
151.0215344429016

Option 2 - download and parse the stream:

import gzip
from urllib.request import urlopen
from lxml import etree

import psutil
import time
time_start = time.time()

url = "https://dblp.org/xml/release/"
file = "dblp-2023-10-01.xml.gz"
fd = url+file

# decompress the HTTP response stream on the fly instead of downloading the file first
f = urlopen(fd)
fz = gzip.GzipFile(fileobj=f, mode="r")

counter_dict = {}
for event, elem in etree.iterparse(fz, events=['end'], recover=True):
    if elem.tag == "year":
        if elem.text not in counter_dict:
            counter_dict[elem.text] = 1
        else:
            counter_dict[elem.text] +=1
    elem.clear()
    
#print(counter_dict)
print(dict(sorted(counter_dict.items())))

print("RAM:")
print(psutil.Process().memory_info().rss / (1024 * 1024))
print("Time:")
print((time.time() - time_start))

Output:

{'1936': 12, '1937': 16, '1938': 11, '1939': 18, '1940': 10, '1941': 13, '1942': 13, '1943': 8, '1944': 5, '1945': 9, '1946': 31, '1947': 10, '1948': 41, '1949': 52, '1950': 29, '1951': 46, '1952': 114, '1953': 173, '1954': 225, '1955': 213, '1956': 355, '1957': 343, '1958': 464, '1959': 715, '1960': 625, '1961': 903, '1962': 1186, '1963': 1032, '1964': 1108, '1965': 1291, '1966': 1503, '1967': 1763, '1968': 2182, '1969': 2113, '1970': 2227, '1971': 3120, '1972': 3751, '1973': 4414, '1974': 5007, '1975': 5246, '1976': 5695, '1977': 5961, '1978': 6786, '1979': 6913, '1980': 7787, '1981': 8662, '1982': 9939, '1983': 10860, '1984': 12334, '1985': 13890, '1986': 16475, '1987': 17549, '1988': 21633, '1989': 24001, '1990': 28166, '1991': 31084, '1992': 34900, '1993': 40695, '1994': 45290, '1995': 47712, '1996': 52809, '1997': 57099, '1998': 64297, '1999': 71138, '2000': 80955, '2001': 86798, '2002': 97758, '2003': 116385, '2004': 135735, '2005': 158268, '2006': 176458, '2007': 189562, '2008': 203544, '2009': 222653, '2010': 228955, '2011': 250783, '2012': 263294, '2013': 280709, '2014': 292279, '2015': 302656, '2016': 314744, '2017': 339456, '2018': 374688, '2019': 417602, '2020': 433127, '2021': 456839, '2022': 470484, '2023': 305500, '2024': 536}

RAM:
1084.80859375
Time:
148.59651041030884
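Both options still end up around 1.1 GB resident even though every element is cleared. One plausible reason: clear() empties an element but leaves the (now empty) node attached to its parent, so the millions of processed records keep accumulating under the dblp root until parsing finishes. A variant that also detaches records once they have been fully handled may keep the resident size lower; a sketch (not benchmarked on the full file), assuming the same local .gz file as Option 1:

from lxml import etree
import gzip

fd = gzip.open('dblp-2023-10-01.xml.gz', "r")

counter_dict = {}
for event, elem in etree.iterparse(fd, events=['end'], recover=True):
    if elem.tag == "year":
        counter_dict[elem.text] = counter_dict.get(elem.text, 0) + 1
    elem.clear()
    # drop siblings that were already fully parsed, so the emptied
    # record elements do not pile up under the root until the end;
    # for the root's own end event getprevious() is None and the loop is skipped
    while elem.getprevious() is not None:
        del elem.getparent()[0]

print(dict(sorted(counter_dict.items())))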

Also read here about the memory efficiency of common data structures.
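As a rough illustration of that point, sys.getsizeof reports the overhead of the container itself (not of the objects it references) for common CPython structures; a small sketch:

import sys

items = list(range(1000))

# container overhead only (CPython); the referenced int objects are not included
print('list :', sys.getsizeof(items))
print('tuple:', sys.getsizeof(tuple(items)))
print('set  :', sys.getsizeof(set(items)))
print('dict :', sys.getsizeof(dict.fromkeys(items)))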