PyPDF2: why is one page fine for text extraction while another throws an end-of-stream error?

Asked by: Robert Gates. Asked: 10/23/2023. Last edited by: toyota Supra, Robert Gates. Updated: 10/29/2023. Views: 39

Q:

from PyPDF2 import PdfReader

# pdf_path is the path to the PDF file in question
reader = PdfReader(pdf_path)
for page in reader.pages:
    str_page = page.extract_text()

On some pages this raises a "Stream has ended unexpectedly" error.

Example of a good page:

{'/Rotate': 0, '/Type': '/Page', '/Parent': IndirectObject(51, 0, 2214350944112), '/MediaBox': [0, 0, 522, 1008], '/Contents': IndirectObject(5, 0, 2214350944112), '/Resources': {'/ProcSet': ['/PDF', '/Text', '/ImageB', '/ImageC'], '/XObject': {'/X6': IndirectObject(13, 0, 2214350944112), '/X5': IndirectObject(11, 0, 2214350944112), '/X4': IndirectObject(10, 0, 2214350944112), '/X3': IndirectObject(8, 0, 2214350944112), '/X2': IndirectObject(7, 0, 2214350944112), '/X1': IndirectObject(6, 0, 2214350944112), '/X17': IndirectObject(32, 0, 2214350944112), '/X16': IndirectObject(30, 0, 2214350944112), '/X15': IndirectObject(28, 0, 2214350944112), '/X14': IndirectObject(27, 0, 2214350944112), '/X13': IndirectObject(25, 0, 2214350944112), '/X9': IndirectObject(18, 0, 2214350944112), '/X12': IndirectObject(23, 0, 2214350944112), '/X11': IndirectObject(21, 0, 2214350944112), '/X8': IndirectObject(16, 0, 2214350944112), '/X7': IndirectObject(15, 0, 2214350944112), '/X10': IndirectObject(20, 0, 2214350944112)}, '/Font': {'/F11': IndirectObject(38, 0, 2214350944112), '/F10': IndirectObject(37, 0, 2214350944112), '/F9': IndirectObject(36, 0, 2214350944112), '/F8': IndirectObject(35, 0, 2214350944112), '/F7': IndirectObject(34, 0, 2214350944112), '/F6': IndirectObject(31, 0, 2214350944112), '/F5': IndirectObject(29, 0, 2214350944112), '/F4': IndirectObject(26, 0, 2214350944112), '/F3': IndirectObject(24, 0, 2214350944112), '/F2': IndirectObject(19, 0, 2214350944112), '/F1': IndirectObject(14, 0, 2214350944112), '/F12': IndirectObject(39, 0, 2214350944112)}}}

Example of a bad page:

{'/Rotate': 0, '/Type': '/Page', '/Parent': IndirectObject(51, 0, 1815281175584), '/MediaBox': [0, 0, 522, 1008], '/Contents': IndirectObject(40, 0, 1815281175584), '/Resources': {'/ProcSet': ['/PDF', '/Text', '/ImageB', '/ImageC'], '/XObject': {'/X34': IndirectObject(58, 0, 1815281175584), '/X33': IndirectObject(57, 0, 1815281175584), '/X32': IndirectObject(56, 0, 1815281175584), '/X28': IndirectObject(52, 0, 1815281175584), '/X30': IndirectObject(54, 0, 1815281175584), '/X26': IndirectObject(48, 0, 1815281175584), '/X24': IndirectObject(46, 0, 1815281175584), '/X23': IndirectObject(45, 0, 1815281175584), '/X22': IndirectObject(44, 0, 1815281175584), '/X19': IndirectObject(41, 0, 1815281175584), '/X21': IndirectObject(43, 0, 1815281175584)}, '/Font': {'/F2': IndirectObject(19, 0, 1815281175584), '/F1': IndirectObject(14, 0, 1815281175584), '/F8': IndirectObject(35, 0, 1815281175584), '/F7': IndirectObject(34, 0, 1815281175584)}}}

I have read the exceptions page of the PyPDF2 documentation, and I know it recommends using pdfminer to extract the remaining text.

Usually reading the traceback gets me somewhere, but I'm lost in this one.

python-3.x pypdf


A:

0 votes, Martin Thoma, 10/29/2023, #1

Because one page object is stored correctly, while the other page object has an unexpected end of stream.

The problem is that an inline image is "announced", but then nothing follows. The file just ends.
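To make "announced, but nothing follows" concrete: in a PDF content stream an inline image is opened with the BI operator, its data begins after ID, and it is closed with EI. The sketch below is only a rough heuristic on raw content-stream bytes (however you obtain them); the function name is hypothetical and not part of pypdf, and binary image data can contain a stray "EI" by chance, so treat the result as a hint rather than proof.

```python
import re

# Inline-image operators appear as whitespace-delimited tokens.
_BI = re.compile(rb"(?:^|\s)BI(?=\s)")
_EI = re.compile(rb"\sEI(?=\s|$)")


def has_unterminated_inline_image(content: bytes) -> bool:
    """Heuristic: True if the last BI operator has no EI operator after it."""
    starts = [match.end() for match in _BI.finditer(content)]
    if not starts:
        return False  # no inline image was opened at all
    # Look for a closing EI token anywhere after the last BI.
    return _EI.search(content, starts[-1]) is None
```

A stream that ends right after the image data, with no EI, is the situation described here.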

This is not a bug in pypdf; it is a broken PDF document.

If you can share the PDF, and if other PDF viewers display that page correctly, you can open an issue in the pypdf issue tracker. We call this a "robustness issue".
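In the meantime, one possible workaround is to catch the error per page so a single broken page does not abort the whole loop. This is a minimal sketch, not something pypdf provides: `extract_pages_tolerantly` is a hypothetical helper, and it works on any objects that have an `extract_text()` method.

```python
from typing import Iterable, List, Optional, Tuple, Type


def extract_pages_tolerantly(
    pages: Iterable,
    errors: Tuple[Type[BaseException], ...] = (Exception,),
) -> Tuple[List[Optional[str]], List[Tuple[int, str]]]:
    """Call extract_text() on each page and record failures instead of aborting.

    Returns (texts, failures): texts has one entry per page (None for a page
    that failed), failures lists (page_index, error_description) pairs.
    """
    texts: List[Optional[str]] = []
    failures: List[Tuple[int, str]] = []
    for index, page in enumerate(pages):
        try:
            texts.append(page.extract_text())
        except errors as exc:
            # Keep the slot so indices still line up with page numbers.
            texts.append(None)
            failures.append((index, f"{type(exc).__name__}: {exc}"))
    return texts, failures
```

With pypdf a call might look like `extract_pages_tolerantly(PdfReader(pdf_path).pages, errors=(PdfReadError,))`, with `PdfReadError` imported from `pypdf.errors` (or `PyPDF2.errors` on older installs); `PdfStreamError` is a subclass of it. The pages listed in `failures` are the ones worth checking in another viewer before filing a robustness issue.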