Update Python code: the PyPDF2 library has deprecated objects used in the Python code

Asked by: VicRam0001 Asked: 8/29/2023 Last edited by: VicRam0001 Updated: 8/29/2023 Views: 914

Q:

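I have been able to run this Python code on a Linux-based operating system, but when I try to run the same code on a Windows-based operating system I get deprecation errors.

My question is: how do I update the code to get past the deprecation errors?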

  1. The Python code used is:
import PyPDF2
import openpyxl

def pdf_to_text(pdf_file):
    text = ""
    with open(pdf_file, "rb") as file:
        pdf_reader = PyPDF2.PdfFileReader(file)
        for page_num in range(pdf_reader.getNumPages()):
            page = pdf_reader.getPage(page_num)
            text += page.extractText()
            return text

def save_text_to_excel(text, excel_file):
    workbook = openpyxl.Workbook()
    sheet = workbook.active
    lines = text.split("\n")
    for row_num, line in enumerate(lines, 1):
        sheet.cell(row=row_num, column=1, value=line)
        workbook.save(excel_file)

if __name__ == "__main__":
    pdf_file = "PDF_File_name.pdf"
    excel_file = "output.xlsx"

pdf_text = pdf_to_text(pdf_file)
save_text_to_excel(pdf_text, excel_file)

Output: PyPDF2.errors.DeprecationError: PdfFileReader is deprecated and was removed in PyPDF2 3.0.0. Use PdfReader instead.

  2. So I updated the Python code to this:
import PyPDF2
import openpyxl

def pdf_to_text(pdf_file):
    text = ""
    with open(pdf_file, "rb") as file:
        pdf_reader = PyPDF2.PdfReader(file)
        for page_num in range(pdf_reader.getNumPages()):
            page = pdf_reader.getPage(page_num)
            text += page.extractText()
            return text

def save_text_to_excel(text, excel_file):
    workbook = openpyxl.Workbook()
    sheet = workbook.active
    lines = text.split("\n")
    for row_num, line in enumerate(lines, 1):
        sheet.cell(row=row_num, column=1, value=line)
        workbook.save(excel_file)

if __name__ == "__main__":
    pdf_file = "PDF_File_name.pdf"
    excel_file = "output.xlsx"

pdf_text = pdf_to_text(pdf_file)
save_text_to_excel(pdf_text, excel_file)

Output: PyPDF2.errors.DeprecationError: reader.getNumPages is deprecated and was removed in PyPDF2 3.0.0. Use len(reader.pages) instead.

  3. Next, I updated the Python code following the advice at https://pypdf2.readthedocs.io/en/latest/user/migration-1-to-2.html, which lists what needs to be updated:

reader.getNumPages() / reader.numPages ➔ len(reader.pages)

import PyPDF2
import openpyxl

def pdf_to_text(pdf_file):
    text = ""
    with open(pdf_file, "rb") as file:
        pdf_reader = PyPDF2.PdfReader(file)
        for page_num in range(pdf_reader.len(reader.pages)):
            page = pdf_reader.getPage(page_num)
            text += page.extractText()
            return text

def save_text_to_excel(text, excel_file):
    workbook = openpyxl.Workbook()
    sheet = workbook.active
    lines = text.split("\n")
    for row_num, line in enumerate(lines, 1):
        sheet.cell(row=row_num, column=1, value=line)
        workbook.save(excel_file)

if __name__ == "__main__":
    pdf_file = "PDF_File_name.pdf"
    excel_file = "output.xlsx"

pdf_text = pdf_to_text(pdf_file)
save_text_to_excel(pdf_text, excel_file)

Output: AttributeError: 'PdfReader' object has no attribute 'len'

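  4. I updated the code based on the comment from "Abdul Aziz Barkat": Typo: pdf_reader.len(reader.pages). Compare that with what the deprecation message states, len(reader.pages)... you have to write len(pdf_reader.pages); len is a builtin function in Python.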
import PyPDF2
import openpyxl

def pdf_to_text(pdf_file):
    text = ""
    with open(pdf_file, "rb") as file:
        pdf_reader = PyPDF2.PdfReader(file)
        for page_num in range(len(pdf_reader.pages)):
            page = pdf_reader.getPage(page_num)
            text += page.extractText()
            return text

def save_text_to_excel(text, excel_file):
    workbook = openpyxl.Workbook()
    sheet = workbook.active
    lines = text.split("\n")
    for row_num, line in enumerate(lines, 1):
        sheet.cell(row=row_num, column=1, value=line)
        workbook.save(excel_file)

if __name__ == "__main__":
    pdf_file = "computers.pdf"
    excel_file = "output.xlsx"

pdf_text = pdf_to_text(pdf_file)
save_text_to_excel(pdf_text, excel_file)

Output: PyPDF2.errors.DeprecationError: reader.getPage(pageNumber) is deprecated and was removed in PyPDF2 3.0.0. Use reader.pages[page_number] instead.

python-3.x deprecation-warning python-pdfreader

Comments

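4 upvotes Abdul Aziz Barkat 8/29/2023
Typo: pdf_reader.len(reader.pages). Compare that with what the deprecation message states, len(reader.pages)... you have to write len(pdf_reader.pages); len is a builtin function in Python.
0 upvotes VicRam0001 8/29/2023
Thanks, I updated the code as you suggested; please see note (4) in the main post above.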
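
The typo discussed in this comment can be reproduced without a PDF at all. The sketch below uses a hypothetical stand-in class, FakeReader (not part of PyPDF2), whose pages attribute is a plain list. The only assumption is that pages supports len(), as the deprecation message implies.

```python
class FakeReader:
    """Hypothetical stand-in for a reader object; not a PyPDF2 class."""

    def __init__(self, pages):
        self.pages = list(pages)


reader = FakeReader(["page one", "page two", "page three"])

# len is a builtin function: call it on the pages collection.
page_count = len(reader.pages)

# Calling it as a method of the reader fails with AttributeError,
# the same kind of error reported in step 3 of the question.
try:
    reader.len(reader.pages)
    method_call_error = None
except AttributeError as exc:
    method_call_error = str(exc)

print(page_count)
print(method_call_error)
```

The same rule applies to any object: len(obj.attribute) asks the builtin for a length, while obj.len(...) looks for an attribute named len on the object itself.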

A:

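0 upvotes Musabbir Arrafi 8/29/2023 #1

The way you are trying to read the pdf with these methods is deprecated in the newer version. See the PdfFileReader class documentation for more information. Here is the corrected code: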

import openpyxl
# PdfFileReader, numPages, getPage and extractText are the pre-3.0.0 names;
# PyPDF2 3.0.0 and later raise DeprecationError for them.
from PyPDF2 import PdfFileReader

def pdf_to_text(pdf_file):
    text = ""
    with open(pdf_file, "rb") as file:
        pdf_reader = PdfFileReader(file)
        print(pdf_reader.numPages)
        for page_num in range(pdf_reader.numPages):
            page = pdf_reader.getPage(page_num)
            text += page.extractText()
        return text

def save_text_to_excel(text, excel_file):
    workbook = openpyxl.Workbook()
    sheet = workbook.active
    lines = text.split("\n")
    for row_num, line in enumerate(lines, 1):
        sheet.cell(row=row_num, column=1, value=line)
    # Save once, after every row has been written.
    workbook.save(excel_file)

if __name__ == "__main__":
    pdf_file = "test.pdf"
    excel_file = "output.xlsx"
    pdf_text = pdf_to_text(pdf_file)
    print(pdf_text)
    save_text_to_excel(pdf_text, excel_file)
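
The snippet above uses the pre-3.0.0 names, so whether it runs depends on which PyPDF2 version is installed; a version difference between two machines would explain code that works on one and fails on the other. A minimal sketch for checking, assuming the package is installed under the distribution name PyPDF2; the helpers major_version and uses_removed_names are hypothetical names made up for this illustration.

```python
from importlib.metadata import PackageNotFoundError, version


def major_version(version_string):
    """Return the leading integer of a version string such as '3.0.1'."""
    return int(version_string.split(".")[0])


def uses_removed_names(version_string):
    """True for 3.0.0 and later, where the old camelCase names were removed."""
    return major_version(version_string) >= 3


try:
    installed = version("PyPDF2")
except PackageNotFoundError:
    installed = None  # PyPDF2 is not installed in this environment

if installed is None:
    print("PyPDF2 is not installed")
elif uses_removed_names(installed):
    print(f"PyPDF2 {installed}: use PdfReader, reader.pages, extract_text()")
else:
    print(f"PyPDF2 {installed}: the old names have not been removed yet")
```

Running this on both machines shows whether they have the same PyPDF2 release before any code is changed.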

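Comments

0 upvotes Musabbir Arrafi 8/29/2023
I added some print statements to check whether any text was extracted from the pdf; please ignore them.
0 upvotes Musabbir Arrafi 8/29/2023
This should solve your problem; let me know if you run into any other issues.
0 upvotes VicRam0001 8/29/2023 #2

Thanks (Abdul and Musabbir) for the feedback. I have updated the code as suggested, and I also used the migration guide to update the deprecated elements: https://pypdf2.readthedocs.io/en/latest/user/migration-1-to-2.html

This code now runs on Python 3.x on Windows: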

import openpyxl
import PyPDF2

def pdf_to_text(pdf_file):
    text = ""
    with open(pdf_file, "rb") as file:
        # PdfReader replaces PdfFileReader.
        pdf_reader = PyPDF2.PdfReader(file)
        # len(reader.pages) replaces getNumPages().
        for page_num in range(len(pdf_reader.pages)):
            # reader.pages[n] replaces getPage(n).
            page = pdf_reader.pages[page_num]
            # extract_text() replaces extractText().
            text += page.extract_text()
        return text

def save_text_to_excel(text, excel_file):
    workbook = openpyxl.Workbook()
    sheet = workbook.active
    lines = text.split("\n")
    # Write one line of text per row in column A.
    for row_num, line in enumerate(lines, 1):
        sheet.cell(row=row_num, column=1, value=line)
    # Save once, after every row has been written.
    workbook.save(excel_file)

if __name__ == "__main__":
    pdf_file = "PDF-file-name.pdf"
    excel_file = "output.xlsx"
    pdf_text = pdf_to_text(pdf_file)
    save_text_to_excel(pdf_text, excel_file)
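
The renames that came up in this thread can be collected in one place. The sketch below is only an illustration, not a PyPDF2 tool: find_deprecated_calls and RENAMES are hypothetical names, and the table covers just the four renames seen here (three quoted in the error messages, plus extractText to extract_text as used in the working code).

```python
import re

# Old name -> replacement, limited to the renames seen in this thread.
RENAMES = {
    "PdfFileReader": "PdfReader",
    "getNumPages": "len(reader.pages)",
    "getPage": "reader.pages[page_number]",
    "extractText": "extract_text",
}


def find_deprecated_calls(source):
    """Return (line_number, old_name, replacement) for each old name found."""
    hits = []
    for line_number, line in enumerate(source.splitlines(), start=1):
        for old_name, replacement in RENAMES.items():
            # \b keeps an old name from matching inside a longer identifier.
            if re.search(rf"\b{re.escape(old_name)}\b", line):
                hits.append((line_number, old_name, replacement))
    return hits


sample = """pdf_reader = PyPDF2.PdfFileReader(file)
for page_num in range(pdf_reader.getNumPages()):
    page = pdf_reader.getPage(page_num)
    text += page.extractText()
"""

for line_number, old_name, replacement in find_deprecated_calls(sample):
    print(f"line {line_number}: {old_name} -> {replacement}")
```

Scanning a script this way lists every old name up front, instead of discovering them one DeprecationError at a time as happened in the question.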