Asked by: SailingHobo Asked: 8/9/2023 Last edited by: halferSailingHobo Updated: 8/9/2023 Views: 70
I want to replace placeholders in a PDF. I can read it fine, but when I try to edit it I get an EOF error
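Q:
I have a PDF file with placeholders that need to be replaced with data from an Excel list. This shouldn't be hard.
I can read the PDF and find the keywords, but whenever I try to replace them I end up with an EOF problem. I checked the file, and the EOF marker is present at its end: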
0006053835 00000 n
0006055751 00000 n
0006054024 00000 n
0006055774 00000 n
trailer
<</Size 1091/Root 1090 0 R/Info 1014 0 R/ID[<937CFAF97C6B065B1AAADAC71CA971E2><937CFAF97C6B065B1AAADAC71CA971E2>]>>
startxref
6056146
%%EOF
I'm not sure what causes this error:
\Python coding\placeholder.py", line 37, in <module>
edited_page = PdfReader(BytesIO(page_text.encode('utf-8'))).pages[0]
File "\AppData\Local\Programs\Python\Python310\lib\site-packages\PyPDF2\_reader.py", line 319, in __init__
self.read(stream)
File "\AppData\Local\Programs\Python\Python310\lib\site-packages\PyPDF2\_reader.py", line 1415, in read
self._find_eof_marker(stream)
File "\AppData\Local\Programs\Python\Python310\lib\site-packages\PyPDF2\_reader.py", line 1471, in _find_eof_marker
raise PdfReadError("EOF marker not found")
PyPDF2.errors.PdfReadError: EOF marker not found
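A note on reading this traceback: the failing call is the PdfReader(BytesIO(page_text.encode(...))) constructor, not the one that opens the original file. That suggests the stream being searched for %%EOF holds the encoded text returned by extract_text(), which is plain text and not a PDF file. The sketch below uses only the standard library to illustrate the difference. The helper has_eof_marker and the sample bytes are hypothetical, and the check only approximates what PyPDF2 does internally.

```python
from io import BytesIO

def has_eof_marker(stream, tail_size=1024):
    """Return True if b'%%EOF' occurs in the last tail_size bytes of a binary stream."""
    stream.seek(0, 2)                      # jump to the end to learn the stream length
    size = stream.tell()
    stream.seek(max(0, size - tail_size))  # rewind to the start of the tail
    return b"%%EOF" in stream.read()

# Stand-in for the tail of a real PDF file (illustrative bytes only, not a valid PDF)
pdf_like = BytesIO(b"%PDF-1.7\n...\nstartxref\n6056146\n%%EOF\n")

# Stand-in for what extract_text() returns: ordinary text with placeholders
page_text = "Process {name} is described in chapter {chapter}."
text_stream = BytesIO(page_text.encode("utf-8"))

print(has_eof_marker(pdf_like))     # True
print(has_eof_marker(text_stream))  # False
```

If this reading is right, the choice of encoding would not matter, because no encoding turns extracted text back into a PDF file structure.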
Here is the code:
import pandas as pd
from PyPDF2 import PdfReader, PdfWriter
from io import BytesIO
import chardet

# Path to the Excel file containing placeholder data
excel_filename = "prozesse-220-238.xlsx"

# Read the Excel data
df_excel = pd.read_excel(excel_filename)

# Path to the PDF file
pdf_filename = "handbook-2-with-placeholder.pdf"
pdf_reader = PdfReader(pdf_filename)

# Initialize a BytesIO object for the PDF
output = BytesIO()

# Iterate through all rows in the Excel and search for keywords in the PDF
for _, row in df_excel.iterrows():
    found_keywords = []
    for column_name, value in row.items():
        if str(value) in pdf_reader.pages[0].extract_text():
            found_keywords.append(column_name)

    page_text = pdf_reader.pages[0].extract_text()
    for keyword in found_keywords:
        page_text = page_text.replace("{" + keyword + "}", str(row[keyword]))

    # Determine the encoding of the PDF
    result = chardet.detect(page_text.encode())
    encoding = result['encoding']

    # Create a PdfWriter object and add the edited page
    pdf_writer = PdfWriter()
    pdf_writer.add_page(pdf_reader.pages[0])
    edited_page = PdfReader(BytesIO(page_text.encode(encoding))).pages[0]
    pdf_writer.add_page(edited_page)

    # Write the edited page to the output file object
    pdf_writer.write(output)

# Save the created PDF
edited_pdf_filename = "edited_handbook.pdf"
with open(edited_pdf_filename, "wb") as edited_pdf_file:
    edited_pdf_file.write(output.getvalue())
print("Edited PDF has been saved:", edited_pdf_filename)
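Since the error points to line 37, I tried the call both with a specific encoding (UTF-8) and without one. I don't know what causes this error or how to avoid it.
I expected this to work right away. I tried several different libraries and code snippets I found online, but this is the version I am most confident in. Neither the explicit encoding (utf8) nor the default made a difference.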
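For the text-substitution step on its own, a minimal sketch could look like the following. It assumes placeholders are written as {column_name}, as in the code above. The helper fill_placeholders, the sample row, and the sample text are hypothetical, and a plain dict stands in for a pandas row. The result is an ordinary string, not PDF data, so this covers only the replacement logic and not writing the page back into a PDF.

```python
def fill_placeholders(text, values):
    """Replace every {key} in text with str(values[key]); unknown placeholders stay as-is."""
    for key, value in values.items():
        text = text.replace("{" + str(key) + "}", str(value))
    return text

# Hypothetical row (a dict standing in for one Excel row) and hypothetical page text
row = {"name": "Onboarding", "chapter": 3}
page_text = "Process {name} is described in chapter {chapter}. Owner: {owner}"

print(fill_placeholders(page_text, row))
# Process Onboarding is described in chapter 3. Owner: {owner}
```

Looping over the row's keys directly avoids the separate search for values in the page text, since str.replace simply does nothing when a placeholder is absent.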
I also used chardet to determine the file's encoding, but the result was inconclusive:
import chardet

pdf_filename = "handbook-2-with-placeholder.pdf"
with open(pdf_filename, 'rb') as pdf_file:
    raw_data = pdf_file.read()

result = chardet.detect(raw_data)
print("Detected encoding:", result['encoding'])
Detected encoding: None
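A result of None here probably reflects that a PDF is a binary container rather than text in a single character encoding, so an encoding detector has little to work with. A quick way to see this with only the standard library is to check the file header and try a strict decode. The helpers looks_like_pdf and is_valid_utf8 and the sample bytes are hypothetical; the sample imitates the binary comment line many PDFs carry after the header.

```python
def looks_like_pdf(raw_data):
    """Return True if the bytes start with the PDF header '%PDF-'."""
    return raw_data.startswith(b"%PDF-")

def is_valid_utf8(raw_data):
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        raw_data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# Illustrative bytes only: a PDF header followed by a typical binary marker line
sample = b"%PDF-1.7\n%\xe2\xe3\xcf\xd3\n1 0 obj\n"

print(looks_like_pdf(sample))  # True
print(is_valid_utf8(sample))   # False
```

To run the same checks on the real file, pass the raw_data read in binary mode above to both helpers.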
A: No answers yet