Code not detecting numbers in a Word doc when written over a line

Question:

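I am a chemistry teacher trying to come up with code that scans an Excel file containing my students' numbers and then extracts those numbers from their reports (Word, Excel, or PDF format). The code then names folders according to the student numbers.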

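The code works well. I wrote it with ChatGPT because my own knowledge is very limited. The only problem is that the code cannot extract a number from a Word document when the number is written over a line. I don't mean underlined text; the number is literally written on top of a line (see the image provided). The code and the image are below: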

import os
import re
import shutil
import pandas as pd
from docx import Document
import fitz  # PyMuPDF library

# Function to extract numbers from text
def extract_numbers(text):
    return re.findall(r'\d+', text)

# Function to find valid numbers in a given text
def find_valid_numbers(text, valid_numbers):
    numbers = extract_numbers(text)
    return [number for number in numbers if number in valid_numbers]

# Input and output folders
input_folder = 'keeping this private :)'  # Change this to your input folder path
output_folder = 'also keeping this private :)'  # Change this to your output folder path
valid_numbers_file = 'liste_etudiant.xlsx'  # Excel file containing the list of valid numbers

# Read the entire Excel file into a DataFrame
valid_numbers_df = pd.read_excel(valid_numbers_file, header=None)

# Flatten the DataFrame into a list of all values
valid_numbers_list = valid_numbers_df.values.flatten().astype(str).tolist()

# Function to extract text from a PDF file using PyMuPDF
def extract_text_from_pdf(pdf_path):
    text = ""
    try:
        pdf_document = fitz.open(pdf_path)
        for page_num in range(pdf_document.page_count):
            page = pdf_document[page_num]
            text += page.get_text()
    except Exception as e:
        print(f"Error extracting text from {pdf_path}: {str(e)}")
    return text

# Iterate through the files in the input folder
for filename in os.listdir(input_folder):
    file_path = os.path.join(input_folder, filename)
    
    try:
        if filename.endswith('.docx'):
            # Read and process Word documents
            doc = Document(file_path)
            doc_text = '\n'.join([para.text for para in doc.paragraphs])
            valid_numbers_found = set(find_valid_numbers(doc_text, valid_numbers_list))
        elif filename.endswith('.xlsx'):
            # Read and process Excel documents
            df = pd.read_excel(file_path, header=None)
            excel_values = df.values.flatten().astype(str).tolist()
            valid_numbers_found = set(find_valid_numbers(' '.join(excel_values), valid_numbers_list))
        elif filename.endswith('.pdf'):
            # Read and process PDF documents
            pdf_text = extract_text_from_pdf(file_path)
            valid_numbers_found = set(find_valid_numbers(pdf_text, valid_numbers_list))
        else:
            # Skip unsupported file types
            print(f"Skipping: {filename} (Unsupported file type)")
            continue

        if valid_numbers_found:
            # Construct the new filename using the found numbers separated by a hyphen
            new_filename = '-'.join(sorted(valid_numbers_found)) + '_Rapport' + os.path.splitext(filename)[1]
            
            # Copy the file to the output folder with the new filename
            shutil.copy(file_path, os.path.join(output_folder, new_filename))
            
            print(f"Processed: {filename} -> {new_filename}")
        else:
            print(f"Skipping: {filename} (Could not find valid numbers in the document)")
    except Exception as e:
        print(f"Error processing {filename}: {str(e)}")

print("Processing complete.")
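
One fragile spot in the script above, separate from the Word problem: `astype(str)` turns a cell that pandas read as a float into text such as `12345.0`, and an empty cell into `nan`, so a sheet with blank cells may stop matching. The following is a minimal sketch of a cleanup step, assuming the student numbers are plain digit strings; `normalize_student_numbers` is a hypothetical helper name, not part of the original script.

```python
import math


def normalize_student_numbers(values):
    """Turn raw spreadsheet cell values into a set of clean digit strings.

    Hypothetical helper: assumes student numbers consist only of digits.
    """
    cleaned = set()
    for value in values:
        if value is None:
            continue  # empty cell
        if isinstance(value, float):
            if math.isnan(value):
                continue  # blank cell read by pandas as NaN
            if value.is_integer():
                value = int(value)  # 12345.0 -> 12345
        text = str(value).strip()
        if text.isdigit():
            cleaned.add(text)
    return cleaned
```

It could be applied as `valid_numbers_list = normalize_student_numbers(valid_numbers_df.values.flatten())`. Numbers with leading zeros that Excel stored as numeric cells cannot be recovered this way.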

(image provided)

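I tried troubleshooting with ChatGPT, but none of its suggested fixes worked. As I said, I can read and understand most simple code, but I don't have the skills to solve this problem myself.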

Thank you for your help!
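
One possible cause, offered as a guess because the image is not available here: `doc.paragraphs` in python-docx returns only the top-level body paragraphs, so a number typed into a table cell or a text box (both common ways to place text "over a line" in a form) would never reach the regex. The sketch below assumes that is the situation. It reads `word/document.xml` directly with the standard library and collects every paragraph, including nested ones. The function names are hypothetical, and headers and footers are not covered because they live in separate parts of the file.

```python
import re
import xml.etree.ElementTree as ET
import zipfile

# WordprocessingML namespace used by the main document part of a .docx file
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W_P = f"{{{W_NS}}}p"  # paragraph element
W_T = f"{{{W_NS}}}t"  # text element inside a run


def extract_all_docx_text(docx_path):
    """Return the text of every paragraph in the document body,
    including paragraphs nested inside tables and text boxes."""
    with zipfile.ZipFile(docx_path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    lines = []
    for paragraph in root.iter(W_P):
        # Join runs with no separator so a number split across runs stays whole.
        text = "".join(node.text or "" for node in paragraph.iter(W_T))
        if text:
            lines.append(text)
    return "\n".join(lines)


def find_valid_numbers_in_docx(docx_path, valid_numbers):
    """Return the sorted valid numbers that appear anywhere in the body text."""
    found = set(re.findall(r"\d+", extract_all_docx_text(docx_path)))
    return sorted(found & set(valid_numbers))
```

In the script from the question, the `.docx` branch could then call `find_valid_numbers_in_docx(file_path, valid_numbers_list)` in place of the `doc.paragraphs` join. Text inside a text box may be collected more than once, which is harmless here because the matches end up in a set. If the number is part of a picture rather than text, no text extraction will find it.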

python ms-word text-extraction underline

Answers: none yet
