lxml iterparse eats up memory for a 4GB XML file, even when clear() is used

Asked by Elias S on 10/29/2023, last edited by Elias S, updated 11/2/2023, viewed 99 times

Question:

The purpose of this script is to count how many articles/books were published per year, taking that information from the year elements in the XML file dblp-2023-10-01.xml. The file can be found here: https://dblp.org/xml/release/

from lxml import etree

xmlfile = 'dblp-2023-10-01.xml'

doc = etree.iterparse(xmlfile, tag='year', load_dtd=True)
_, root = next(doc)
counter_dict = {}
for event, element in doc:
    if element.text not in counter_dict:
        counter_dict[element.text] = 1
    else:
        counter_dict[element.text] += 1
    root.clear() 

When I run the code on a small file, it runs smoothly. What puzzles me is that when I run it on the dblp file, memory use exceeds 4 GB (the size of the file), which makes no sense to me.

I also tried an alternative version, to make sure it clears what it has already parsed:

    for ancestor in element.xpath('ancestor-or-self::*'):
        while ancestor.getprevious() is not None:
            del ancestor.getparent()[0]

It did not improve anything.
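One way to narrow this down is to sample, every so often inside the loop, how many children are still attached to the actual document root and how much resident memory the process uses; if the clearing really detaches what has already been parsed, both numbers should stay roughly flat. A minimal sketch along these lines, mirroring the posted code and only adding the sampling, using psutil (the answers below use the same module) and an arbitrary sampling interval:

from lxml import etree
import psutil

xmlfile = 'dblp-2023-10-01.xml'

doc = etree.iterparse(xmlfile, tag='year', load_dtd=True)
# note: with tag='year', the first reported event is the first </year>,
# so this is a <year> element rather than the document root
_, root = next(doc)
proc = psutil.Process()

counter_dict = {}
for i, (event, element) in enumerate(doc):
    counter_dict[element.text] = counter_dict.get(element.text, 0) + 1
    root.clear()
    if i % 500_000 == 0:
        # how many records are still attached under the real document root,
        # and how much resident memory the process is using right now
        doc_root = element.getroottree().getroot()
        rss_mb = proc.memory_info().rss / (1024 * 1024)
        print(f'{i:,} <year> elements seen, {len(doc_root):,} children under '
              f'<{doc_root.tag}>, {rss_mb:.0f} MB RSS')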

python xml lxml ram large-files

Comments

0 votes Jim Garrison 10/29/2023
You need to provide more debugging information, including a stack trace and some output. Does it fail before processing any events? If during, after how many? Etc. Please read How to Ask.
0 votes jdweng 10/29/2023
Try PowerShell: $uri = 'https://dblp.org/xml/release/dblp-2023-10-01.xml.gz'; $response = Invoke-WebRequest -uri $uri; $numbers = $response.Content; Write-Host $numbers

Answers:

0 votes Martin Honnen 10/29/2023 #1

I can't say why lxml's iterparse needs all that memory, but I tried a simple SAX program:

import xml.sax

counter_dict = {}

class YearHandler(xml.sax.ContentHandler):

    def __init__(self):
        self.year = ''
        self.isYear = False

    def startElement(self, tag, attributes):
        # start collecting character data when a <year> element opens
        if tag == 'year':
            self.isYear = True
            self.year = ''

    def endElement(self, tag):
        # on </year>, count the collected year value
        if self.isYear and tag == 'year':
            self.isYear = False
            yearInt = int(self.year)
            if yearInt in counter_dict:
                counter_dict[yearInt] += 1
            else:
                counter_dict[yearInt] = 1

    def characters(self, content):
        if self.isYear:
            self.year += content

if __name__ == '__main__':

    parser = xml.sax.make_parser()

    parser.setFeature(xml.sax.handler.feature_namespaces, 0)

    parser.setContentHandler(YearHandler())

    parser.parse('dblp-2023-10-01.xml')

    print(counter_dict)

On my Windows machine it consumes less than 10 MB of memory and outputs:

{2014: 292279, 2005: 158268, 2011: 250783, 2012: 263294, 2018: 374688, 2008: 203544, 1997: 57099, 2010: 228955, 2016: 314744, 2017: 339456, 2013: 280709, 2002: 97758, 2004: 135735, 2009: 222653, 2007: 189562, 2006: 176458, 1999: 71138, 2015: 302656, 2022: 470484, 2023: 305500, 2019: 417602, 2020: 433127, 1992: 34900, 2021: 456839, 1988: 21633, 1998: 64297, 1986: 16475, 1989: 24001, 1987: 17549, 2001: 86798, 1994: 45290, 1990: 28166, 2003: 116385, 1995: 47712, 2000: 80955, 1993: 40695, 1991: 31084, 1996: 52809, 1954: 225, 1971: 3120, 2024: 536, 1985: 13890, 1984: 12334, 1982: 9939, 1975: 5246, 1983: 10860, 1980: 7787, 1981: 8662, 1964: 1108, 1977: 5961, 1976: 5695, 1972: 3751, 1974: 5007, 1979: 6913, 1973: 4414, 1978: 6786, 1967: 1763, 1965: 1291, 1969: 2113, 1968: 2182, 1970: 2227, 1966: 1503, 1959: 715, 1961: 903, 1953: 173, 1960: 625, 1957: 343, 1955: 213, 1958: 464, 1956: 355, 1951: 46, 1962: 1186, 1952: 114, 1963: 1032, 1946: 31, 1947: 10, 1945: 9, 1939: 18, 1948: 41, 1942: 13, 1949: 52, 1941: 13, 1937: 16, 1940: 10, 1936: 12, 1950: 29, 1943: 8, 1944: 5, 1938: 11}

Comments

0 votes Elias S 10/29/2023
Thanks Martin! What was the execution time? My code took about 5 minutes, and memory only increased at certain points; I still can't figure out what is going on.
0 votes Martin Honnen 10/29/2023
It took a few minutes, yes, I think 4 to 5 minutes on my machine, but for 4 GB of input I think that is to be expected. I didn't notice any big changes in memory use while it ran; the dictionary keeps growing, of course, but as I said, in Task Manager I never saw more than 10 MB of memory use. Certainly nothing anywhere near the size of the input, and nothing that kept growing while the input was processed.
1 vote Hermann12 11/2/2023 #2

Option 1: On my machine this finishes in 2.5 minutes with the file already available locally, so you would need to add download time on top. You should not only clear the root with root.clear(), but clear every element with elem.clear():

from lxml import etree
import gzip

import psutil
import time
time_start = time.time()

# parse straight from the gzipped download; no need to unpack the 4 GB XML first
fd = gzip.open('dblp-2023-10-01.xml.gz', "r")

counter_dict = {}
for event, elem in etree.iterparse(fd, events=['end'], recover=True):
    if elem.tag == "year":
        if elem.text not in counter_dict:
            counter_dict[elem.text] = 1
        else:
            counter_dict[elem.text] += 1
    # clear every element once it has been handled, not just the root
    elem.clear()

#print(counter_dict)
print(dict(sorted(counter_dict.items())))

print("RAM:")
print(psutil.Process().memory_info().rss / (1024 * 1024))
print("Time:")
print((time.time() - time_start))

Output:

{'1936': 12, '1937': 16, '1938': 11, '1939': 18, '1940': 10, '1941': 13, '1942': 13, '1943': 8, '1944': 5, '1945': 9, '1946': 31, '1947': 10, '1948': 41, '1949': 52, '1950': 29, '1951': 46, '1952': 114, '1953': 173, '1954': 225, '1955': 213, '1956': 355, '1957': 343, '1958': 464, '1959': 715, '1960': 625, '1961': 903, '1962': 1186, '1963': 1032, '1964': 1108, '1965': 1291, '1966': 1503, '1967': 1763, '1968': 2182, '1969': 2113, '1970': 2227, '1971': 3120, '1972': 3751, '1973': 4414, '1974': 5007, '1975': 5246, '1976': 5695, '1977': 5961, '1978': 6786, '1979': 6913, '1980': 7787, '1981': 8662, '1982': 9939, '1983': 10860, '1984': 12334, '1985': 13890, '1986': 16475, '1987': 17549, '1988': 21633, '1989': 24001, '1990': 28166, '1991': 31084, '1992': 34900, '1993': 40695, '1994': 45290, '1995': 47712, '1996': 52809, '1997': 57099, '1998': 64297, '1999': 71138, '2000': 80955, '2001': 86798, '2002': 97758, '2003': 116385, '2004': 135735, '2005': 158268, '2006': 176458, '2007': 189562, '2008': 203544, '2009': 222653, '2010': 228955, '2011': 250783, '2012': 263294, '2013': 280709, '2014': 292279, '2015': 302656, '2016': 314744, '2017': 339456, '2018': 374688, '2019': 417602, '2020': 433127, '2021': 456839, '2022': 470484, '2023': 305500, '2024': 536}

RAM:
1141.16796875
Time:
151.0215344429016

Option 2 - download and parse the stream:

import gzip
from urllib.request import urlopen
from lxml import etree

import psutil
import time
time_start = time.time()

url = "https://dblp.org/xml/release/"
file = "dblp-2023-10-01.xml.gz"
fd = url+file

# decompress the HTTP response stream on the fly instead of downloading the file first
f = urlopen(fd)
fz = gzip.GzipFile(fileobj=f, mode="r")

counter_dict = {}
for event, elem in etree.iterparse(fz, events=['end'], recover=True):
    if elem.tag == "year":
        if elem.text not in counter_dict:
            counter_dict[elem.text] = 1
        else:
            counter_dict[elem.text] +=1
    elem.clear()
    
#print(counter_dict)
print(dict(sorted(counter_dict.items())))

print("RAM:")
print(psutil.Process().memory_info().rss / (1024 * 1024))
print("Time:")
print((time.time() - time_start))

Output:

{'1936': 12, '1937': 16, '1938': 11, '1939': 18, '1940': 10, '1941': 13, '1942': 13, '1943': 8, '1944': 5, '1945': 9, '1946': 31, '1947': 10, '1948': 41, '1949': 52, '1950': 29, '1951': 46, '1952': 114, '1953': 173, '1954': 225, '1955': 213, '1956': 355, '1957': 343, '1958': 464, '1959': 715, '1960': 625, '1961': 903, '1962': 1186, '1963': 1032, '1964': 1108, '1965': 1291, '1966': 1503, '1967': 1763, '1968': 2182, '1969': 2113, '1970': 2227, '1971': 3120, '1972': 3751, '1973': 4414, '1974': 5007, '1975': 5246, '1976': 5695, '1977': 5961, '1978': 6786, '1979': 6913, '1980': 7787, '1981': 8662, '1982': 9939, '1983': 10860, '1984': 12334, '1985': 13890, '1986': 16475, '1987': 17549, '1988': 21633, '1989': 24001, '1990': 28166, '1991': 31084, '1992': 34900, '1993': 40695, '1994': 45290, '1995': 47712, '1996': 52809, '1997': 57099, '1998': 64297, '1999': 71138, '2000': 80955, '2001': 86798, '2002': 97758, '2003': 116385, '2004': 135735, '2005': 158268, '2006': 176458, '2007': 189562, '2008': 203544, '2009': 222653, '2010': 228955, '2011': 250783, '2012': 263294, '2013': 280709, '2014': 292279, '2015': 302656, '2016': 314744, '2017': 339456, '2018': 374688, '2019': 417602, '2020': 433127, '2021': 456839, '2022': 470484, '2023': 305500, '2024': 536}

RAM:
1084.80859375
Time:
148.59651041030884
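Both options still end up around 1.1 GB resident even though every element is cleared. One plausible reason: clear() empties an element but leaves the (now empty) node attached to its parent, so the millions of processed records keep accumulating under the dblp root until parsing finishes. A variant that also detaches records once they have been fully handled may keep the resident size lower; a sketch (not benchmarked on the full file), assuming the same local .gz file as Option 1:

from lxml import etree
import gzip

fd = gzip.open('dblp-2023-10-01.xml.gz', "r")

counter_dict = {}
for event, elem in etree.iterparse(fd, events=['end'], recover=True):
    if elem.tag == "year":
        counter_dict[elem.text] = counter_dict.get(elem.text, 0) + 1
    elem.clear()
    # drop siblings that were already fully parsed, so the emptied
    # record elements do not pile up under the root until the end;
    # for the root's own end event getprevious() is None and the loop is skipped
    while elem.getprevious() is not None:
        del elem.getparent()[0]

print(dict(sorted(counter_dict.items())))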

Also read here about the memory efficiency of common data structures.
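As a rough illustration of that point, sys.getsizeof reports the overhead of the container itself (not of the objects it references) for common CPython structures; a small sketch:

import sys

items = list(range(1000))

# container overhead only (CPython); the referenced int objects are not included
print('list :', sys.getsizeof(items))
print('tuple:', sys.getsizeof(tuple(items)))
print('set  :', sys.getsizeof(set(items)))
print('dict :', sys.getsizeof(dict.fromkeys(items)))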