pypdf text extraction throws IndexError on some PDFs

Question:

I am using Python (v3.10.11) and pypdf (v3.17.0) to extract text from multiple PDFs.

Recently I ran into a particular kind of file from which I cannot extract text, because the library raises an exception:

    File "...\pypdf\_cmap.py", line 481, in type1_alternative
        if words[3] != b"put":
IndexError: list index out of range

(Full code is provided below.)

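These files are not encrypted. I can read their metadata (according to which they were created with TCPDF) and perform other operations on them, but the problem appears whenever I call extract_text on one of their pages.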
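For reference, the checks I mention look roughly like the sketch below. `describe_reader` and `describe_file` are hypothetical helper names of my own; they only read `is_encrypted`, `metadata` and `pages` from a `PdfReader`, and I am assuming the metadata may be absent on some files.

```python
def describe_reader(reader):
    """Summarize encryption status and basic metadata of an open reader."""
    meta = reader.metadata  # assumed to be None when the file has no Info dictionary
    return {
        "encrypted": reader.is_encrypted,
        "producer": meta.producer if meta is not None else None,
        "creator": meta.creator if meta is not None else None,
        "pages": len(reader.pages),
    }


def describe_file(path):
    """Open a PDF with pypdf and summarize it (hypothetical convenience wrapper)."""
    # Imported lazily so describe_reader can be tried without pypdf installed.
    from pypdf import PdfReader

    return describe_reader(PdfReader(path))
```

On the problematic files all of this works; only the text extraction step fails.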

An example of a PDF from which text cannot be extracted can be found here.

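I have searched for people or threads facing exactly the same problem and exception, but found none. However, I think my case may be related to "Python text extraction not working on some PDFs" or "PyPDF2 font reading issue".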

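Looking for other options, I found that PyPDF2 (which, as far as I know, is deprecated, with its maintainers having moved their efforts to pypdf) can extract the text.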

Code example:

# Swap which import is commented out to compare the two libraries.
from pypdf import PdfReader
# from PyPDF2 import PdfReader

reader = PdfReader("example.pdf")
number_of_pages = len(reader.pages)
page = reader.pages[0]
text = page.extract_text()
print(text)

If I run it with PyPDF2, I get the correct text:

CÁMARA DE COMERCIO...
...
...

If I try it with pypdf, I get:

Traceback (most recent call last):
    File "...\prueba_pdf\test.py", line 8, in <module>
        text = page.extract_text()
    File "...\prueba_pdf\venv\lib\site-packages\pypdf\_page.py", line 2284, in extract_text
        return self._extract_text(
    File "...\prueba_pdf\venv\lib\site-packages\pypdf\_page.py", line 1903, in _extract_text
        cmaps[f] = build_char_map(f, space_width, obj)
    File "...\prueba_pdf\venv\lib\site-packages\pypdf\_cmap.py", line 29, in build_char_map
        font_subtype, font_halfspace, font_encoding, font_map = build_char_map_from_dict(
    File "...\prueba_pdf\venv\lib\site-packages\pypdf\_cmap.py", line 54, in build_char_map_from_dict
        map_dict, space_code, int_entry = parse_to_unicode(ft, space_code)
    File "...\prueba_pdf\venv\lib\site-packages\pypdf\_cmap.py", line 224, in parse_to_unicode
        return type1_alternative(ft, map_dict, space_code, int_entry)
    File "...\prueba_pdf\venv\lib\site-packages\pypdf\_cmap.py", line 481, in type1_alternative
        if words[3] != b"put":
IndexError: list index out of range
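Reading the last frame, my guess is that a font program in these files contains a line starting with `dup` that has fewer than four whitespace-separated tokens, so `words[3]` does not exist. The sketch below is not pypdf's actual code; `parse_encoding_lines` is a hypothetical function that only illustrates that failure mode and the kind of length guard that would avoid it, assuming lines of the form `dup <code> /<name> put`.

```python
def parse_encoding_lines(lines):
    """Map character codes to glyph names from 'dup <code> /<name> put' lines."""
    mapping = {}
    for line in lines:
        if not line.startswith(b"dup"):
            continue
        words = line.split()
        # Guard: indexing words[3] on a shorter line would raise IndexError.
        if len(words) < 4 or words[3] != b"put":
            continue
        try:
            code = int(words[1])
        except ValueError:
            continue  # skip lines whose code is not a plain integer
        mapping[code] = words[2].decode("latin-1")
    return mapping


sample = [b"dup 65 /A put", b"dup 66 /B", b"dup", b"readonly def"]
print(parse_encoding_lines(sample))  # {65: '/A'}
```

Without the `len(words) < 4` check, the second and third sample lines would raise the same `IndexError: list index out of range` that I am seeing.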

Is there a way to make pypdf work in this case? Am I missing something?

P.S. I would rather keep using pypdf than add more dependencies to my project.
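As a stopgap while I look for a real fix, I could wrap the call so that one failing page does not abort the whole batch. This is only a minimal sketch: `safe_extract_text` and `extract_all` are hypothetical helper names of my own, and it catches only the `IndexError` shown in the traceback.

```python
from typing import List, Optional


def safe_extract_text(page) -> Optional[str]:
    """Return the page text, or None if extraction raises IndexError."""
    try:
        return page.extract_text()
    except IndexError as exc:
        # Same exception type as in the traceback above; other errors still propagate.
        print(f"text extraction failed: {exc!r}")
        return None


def extract_all(path: str) -> List[Optional[str]]:
    """Extract text page by page, recording None for pages that fail."""
    # Imported lazily so safe_extract_text can be tried without pypdf installed.
    from pypdf import PdfReader

    reader = PdfReader(path)
    return [safe_extract_text(page) for page in reader.pages]
```

This does not recover the text from the affected pages; it only lets me log which pages fail and carry on with the rest of the files.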

python pypdf text-extraction
