I want to replace placeholders in a PDF. Reading it works fine, but editing it raises an EOF error

Asked by: SailingHobo · Asked: 8/9/2023 · Last edited by: halfer, SailingHobo · Updated: 8/9/2023 · Views: 70

Q:

I have a PDF file with placeholders that need to be replaced with data from an Excel list. This shouldn't be hard.

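I can read the PDF and find the keywords, but whenever I try to replace them I end up with an EOF problem. I checked the file, and the EOF marker is at its end: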

0006053835 00000 n
0006055751 00000 n
0006054024 00000 n
0006055774 00000 n
trailer
<</Size 1091/Root 1090 0 R/Info 1014 0 R/ID[<937CFAF97C6B065B1AAADAC71CA971E2><937CFAF97C6B065B1AAADAC71CA971E2>]>>
startxref
6056146
%%EOF
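The manual check described above (confirming that the file ends with `startxref`, an offset, and `%%EOF`) can be sketched with the standard library alone. This is only an illustration: `find_startxref` and the `sample` bytes are hypothetical, modelled on the trailer quoted above rather than taken from the real file, and the sketch assumes nothing follows `%%EOF` except whitespace.

```python
import re

def find_startxref(pdf_bytes: bytes, tail_size: int = 1024):
    """Return the startxref offset if the tail ends with
    'startxref <n> %%EOF', otherwise None."""
    tail = pdf_bytes[-tail_size:]
    match = re.search(rb"startxref\s+(\d+)\s+%%EOF\s*$", tail)
    return int(match.group(1)) if match else None

# Synthetic bytes modelled on the quoted trailer (not the real handbook file).
sample = (
    b"%PDF-1.7\n...objects...\n"
    b"trailer\n<</Size 1091/Root 1090 0 R>>\n"
    b"startxref\n6056146\n%%EOF\n"
)

print(find_startxref(sample))         # 6056146
print(find_startxref(b"plain text"))  # None
```

If a check like this passes on the original file, the file's own EOF marker is probably not the problem, and the error is more likely raised for some other byte stream handed to the reader.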

I'm not sure what causes this error.


\Python coding\placeholder.py", line 37, in <module>
    edited_page = PdfReader(BytesIO(page_text.encode('utf-8'))).pages[0]
  File "\AppData\Local\Programs\Python\Python310\lib\site-packages\PyPDF2\_reader.py", line 319, in __init__
    self.read(stream)
  File "\AppData\Local\Programs\Python\Python310\lib\site-packages\PyPDF2\_reader.py", line 1415, in read
    self._find_eof_marker(stream)
  File "\AppData\Local\Programs\Python\Python310\lib\site-packages\PyPDF2\_reader.py", line 1471, in _find_eof_marker
    raise PdfReadError("EOF marker not found")
PyPDF2.errors.PdfReadError: EOF marker not found
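One likely reading of this traceback: `extract_text()` returns the page's text as a Python string, so `page_text.encode(...)` produces plain text bytes, not a PDF file, and `PdfReader` looks for PDF framing in those bytes and finds none. The sketch below illustrates the difference using only the standard library. `pdf_markers`, `page_text`, and `framed` are hypothetical stand-ins, and the check is a simplification of what a real parser does.

```python
def pdf_markers(data: bytes) -> dict:
    """Report whether bytes carry the two markers a PDF parser looks for."""
    return {
        "has_header": data.lstrip().startswith(b"%PDF-"),
        "has_eof": b"%%EOF" in data[-1024:],
    }

# Hypothetical stand-in for what extract_text() returns: plain page text.
page_text = "Process {name} is owned by {owner}."
encoded = page_text.encode("utf-8")

# Hypothetical byte string with PDF framing, for contrast only
# (it is not a valid PDF document).
framed = b"%PDF-1.7\n1 0 obj\n<<>>\nendobj\nstartxref\n9\n%%EOF\n"

print(pdf_markers(encoded))  # {'has_header': False, 'has_eof': False}
print(pdf_markers(framed))   # {'has_header': True, 'has_eof': True}
```

Under this reading, changing the encoding would not help, because no text encoding turns extracted text back into a PDF structure.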

Here is the code:

import pandas as pd
from PyPDF2 import PdfReader, PdfWriter
from io import BytesIO
import chardet

# Path to the Excel file containing placeholder data
excel_filename = "prozesse-220-238.xlsx"

# Read the Excel data
df_excel = pd.read_excel(excel_filename)

# Path to the PDF file
pdf_filename = "handbook-2-with-placeholder.pdf"
pdf_reader = PdfReader(pdf_filename)

# Initialize a BytesIO object for the PDF
output = BytesIO()

# Iterate through all rows in the Excel and search for keywords in the PDF
for _, row in df_excel.iterrows():
    found_keywords = []
    for column_name, value in row.items():
        if str(value) in pdf_reader.pages[0].extract_text():
            found_keywords.append(column_name)
    
    page_text = pdf_reader.pages[0].extract_text()
    for keyword in found_keywords:
        page_text = page_text.replace("{" + keyword + "}", str(row[keyword]))
    
    # Determine the encoding of the PDF
    result = chardet.detect(page_text.encode())
    encoding = result['encoding']
    
    # Create a PdfWriter object and add the edited page
    pdf_writer = PdfWriter()
    pdf_writer.add_page(pdf_reader.pages[0])
    edited_page = PdfReader(BytesIO(page_text.encode(encoding))).pages[0]
    pdf_writer.add_page(edited_page)
    
    # Write the edited page to the output file object
    pdf_writer.write(output)

# Save the created PDF
edited_pdf_filename = "edited_handbook.pdf"
with open(edited_pdf_filename, "wb") as edited_pdf_file:
    edited_pdf_file.write(output.getvalue())

print("Edited PDF has been saved:", edited_pdf_filename)
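Note that the loop above tests whether a cell's value already appears in the page text, not whether the `{column}` placeholder does. A minimal sketch of the substitution step on its own, working on an extracted string and a plain dict, might look like the following. `fill_placeholders` and the sample data are hypothetical; with pandas, a row could be turned into such a dict with `row.to_dict()`. This only transforms text and does not write anything back into a PDF.

```python
import re

def fill_placeholders(text: str, row: dict) -> str:
    """Replace each {key} in text with str(row[key]).
    Placeholders with no matching key are left unchanged."""
    def substitute(match):
        key = match.group(1)
        return str(row[key]) if key in row else match.group(0)
    return re.sub(r"\{([^{}]+)\}", substitute, text)

# Hypothetical sample data standing in for one Excel row and one page.
row = {"name": "Onboarding", "owner": "HR"}
text = "Process {name} is owned by {owner}; see {appendix}."

print(fill_placeholders(text, row))
# Process Onboarding is owned by HR; see {appendix}.
```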

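Since the error points to line 37, I tried both with a specific encoding (UTF-8) and without one. I don't know what causes this error or how to avoid it.

I expected it to work right away. I have tried several different libraries and code snippets I found online, but this is the version I have the most confidence in. Neither specifying the encoding (UTF-8) nor leaving it out made it work.

I used chardet to determine the encoding type, but the result was inconclusive.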

import chardet

pdf_filename = "handbook-2-with-placeholder.pdf"

with open(pdf_filename, 'rb') as pdf_file:
    raw_data = pdf_file.read()
    result = chardet.detect(raw_data)
    print("Detected encoding:", result['encoding'])

Detected encoding: None
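A result of `None` is consistent with a PDF being a binary container rather than a text file: its content is largely made up of binary streams, so there is no single text encoding to detect. The sketch below, using only the standard library, classifies bytes by their leading signature instead of guessing an encoding. `describe_bytes` and the sample inputs are hypothetical.

```python
def describe_bytes(data: bytes) -> str:
    """Classify bytes by signature first, then by UTF-8 decodability."""
    if data.startswith(b"%PDF-"):
        return "PDF container (binary; no single text encoding)"
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return "unknown binary data"
    return "UTF-8 decodable text"

# Hypothetical sample: a PDF header followed by a comment line of high bytes.
pdf_like = b"%PDF-1.7\n%\xe2\xe3\xcf\xd3\n1 0 obj\n"

print(describe_bytes(pdf_like))
print(describe_bytes("Detected encoding".encode("utf-8")))
print(describe_bytes(b"\xff\xfe\x00\x81"))
```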

python pdf error-handling eof pypdf


A: No answers yet