Code not detecting numbers in a Word doc when written over a line

Asked by: Joanick  Asked: 11/1/2023  Last edited by: Joanick  Updated: 11/1/2023  Views: 31

Q:

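I am a chemistry teacher trying to come up with code that scans an Excel file containing my students' numbers and then extracts those numbers from their reports (Word, Excel, or PDF format). The code then names the files according to the student numbers.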

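The code works well. I wrote it with ChatGPT because my own knowledge is very limited. The only problem is that the code cannot extract the number from a Word document when the number is written on a line. I do not mean underlined text, but text that actually sits on top of a line (see the image provided). Here are the code and the image: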

import os
import re
import shutil
import pandas as pd
from docx import Document
import fitz  # PyMuPDF library

# Function to extract numbers from text
def extract_numbers(text):
    return re.findall(r'\d+', text)

# Function to find valid numbers in a given text
def find_valid_numbers(text, valid_numbers):
    numbers = extract_numbers(text)
    return [number for number in numbers if number in valid_numbers]

# Input and output folders
input_folder = 'keeping this private :)'  # Change this to your input folder path
output_folder = 'also keeping this private :)'  # Change this to your output folder path
valid_numbers_file = 'liste_etudiant.xlsx'  # Excel file containing the list of valid numbers

# Read the entire Excel file into a DataFrame
valid_numbers_df = pd.read_excel(valid_numbers_file, header=None)

# Flatten the DataFrame into a list of all values
valid_numbers_list = valid_numbers_df.values.flatten().astype(str).tolist()

# Function to extract text from a PDF file using PyMuPDF
def extract_text_from_pdf(pdf_path):
    text = ""
    try:
        # The context manager closes the PDF even if a page fails to parse
        with fitz.open(pdf_path) as pdf_document:
            for page in pdf_document:
                text += page.get_text()
    except Exception as e:
        print(f"Error extracting text from {pdf_path}: {e}")
    return text

# Iterate through the files in the input folder
for filename in os.listdir(input_folder):
    file_path = os.path.join(input_folder, filename)
    
    try:
        if filename.endswith('.docx'):
            # Read and process Word documents
            doc = Document(file_path)
            doc_text = '\n'.join([para.text for para in doc.paragraphs])
            valid_numbers_found = set(find_valid_numbers(doc_text, valid_numbers_list))
        elif filename.endswith('.xlsx'):
            # Read and process Excel documents
            df = pd.read_excel(file_path, header=None)
            excel_values = df.values.flatten().astype(str).tolist()
            valid_numbers_found = set(find_valid_numbers(' '.join(excel_values), valid_numbers_list))
        elif filename.endswith('.pdf'):
            # Read and process PDF documents
            pdf_text = extract_text_from_pdf(file_path)
            valid_numbers_found = set(find_valid_numbers(pdf_text, valid_numbers_list))
        else:
            # Skip unsupported file types
            print(f"Skipping: {filename} (Unsupported file type)")
            continue

        if valid_numbers_found:
            # Construct the new filename using the found numbers separated by a hyphen
            # Sort the set so the same report always gets the same filename
            new_filename = '-'.join(sorted(valid_numbers_found)) + '_Rapport' + os.path.splitext(filename)[1]
            
            # Copy the file to the output folder with the new filename
            shutil.copy(file_path, os.path.join(output_folder, new_filename))
            
            print(f"Processed: {filename} -> {new_filename}")
        else:
            print(f"Skipping: {filename} (Could not find valid numbers in the document)")
    except Exception as e:
        print(f"Error processing {filename}: {str(e)}")

print("Processing complete.")

[Image provided]

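I tried troubleshooting with ChatGPT, but none of the suggested solutions worked. As I said, I can read and understand most simple code, but I do not have the skills to solve this myself.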

Thanks for your help!
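
A possible cause, stated as an assumption because the image is not reproduced here: in Word templates, a number that sits "on a line" is often typed into a table cell with a bottom border, a text box, or a content control. python-docx's `doc.paragraphs` only returns paragraphs that are direct children of the document body, so text nested in those containers never reaches `find_valid_numbers`. The sketch below is one way to check that theory using only the standard library. It reads `word/document.xml` directly and collects every paragraph wherever it is nested. The function names `extract_all_docx_text` and `find_valid_numbers_in_docx` are hypothetical helpers, not part of the code above or of python-docx.

```python
import re
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def extract_all_docx_text(docx_path):
    """Return the text of every paragraph in the main document part,
    including paragraphs nested in tables, text boxes, and content controls.
    Headers and footers live in separate parts and are NOT read here."""
    with zipfile.ZipFile(docx_path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    lines = []
    for paragraph in root.iter(W_NS + "p"):
        # Join all runs of a paragraph so a number split across runs stays whole
        lines.append("".join(t.text or "" for t in paragraph.iter(W_NS + "t")))
    return "\n".join(lines)


def find_valid_numbers_in_docx(docx_path, valid_numbers):
    """Return the sorted valid numbers found anywhere in the document text."""
    valid = set(valid_numbers)
    found = re.findall(r"\d+", extract_all_docx_text(docx_path))
    return sorted({number for number in found if number in valid})
```

If this finds the number where the original code does not, the `.docx` branch could call `extract_all_docx_text(file_path)` instead of joining `doc.paragraphs`. Text inside a text box may appear twice because Word stores a fallback copy; that is harmless here since the results are de-duplicated.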

python ms-word text-extraction underline

A: No answers yet