Asked by: Joanick | Asked: 11/1/2023 | Last edited by: Joanick | Updated: 11/1/2023 | Views: 31
Code not detecting numbers in a word doc when written over a line
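Q:
I am a chemistry teacher trying to come up with code that scans an Excel file containing my students' numbers and then extracts those numbers from their reports (in Word, Excel or PDF format). The code then names the folders according to the student numbers.
The code works well. I used ChatGPT to write it because my own knowledge is very limited. The only problem is that the code fails to extract a number from a Word document when the number is written over a line. I don't mean underlined, I mean literally sitting on a line (see the picture provided). Here are the code and the picture: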
import os
import re
import shutil
import pandas as pd
from docx import Document
import fitz  # PyMuPDF library

# Function to extract numbers from text
def extract_numbers(text):
    return re.findall(r'\d+', text)

# Function to find valid numbers in a given text
def find_valid_numbers(text, valid_numbers):
    numbers = extract_numbers(text)
    return [number for number in numbers if number in valid_numbers]

# Input and output folders
input_folder = 'keeping this private :)'  # Change this to your input folder path
output_folder = 'also keeping this private :)'  # Change this to your output folder path
valid_numbers_file = 'liste_etudiant.xlsx'  # Excel file containing the list of valid numbers

# Read the entire Excel file into a DataFrame
valid_numbers_df = pd.read_excel(valid_numbers_file, header=None)

# Flatten the DataFrame into a list of all values
valid_numbers_list = valid_numbers_df.values.flatten().astype(str).tolist()

# Function to extract text from a PDF file using PyMuPDF
def extract_text_from_pdf(pdf_path):
    text = ""
    try:
        pdf_document = fitz.open(pdf_path)
        for page_num in range(pdf_document.page_count):
            page = pdf_document[page_num]
            text += page.get_text()
    except Exception as e:
        print(f"Error extracting text from {pdf_path}: {str(e)}")
    return text

# Iterate through the files in the input folder
for filename in os.listdir(input_folder):
    file_path = os.path.join(input_folder, filename)
    try:
        if filename.endswith('.docx'):
            # Read and process Word documents
            doc = Document(file_path)
            doc_text = '\n'.join([para.text for para in doc.paragraphs])
            valid_numbers_found = set(find_valid_numbers(doc_text, valid_numbers_list))
        elif filename.endswith('.xlsx'):
            # Read and process Excel documents
            df = pd.read_excel(file_path, header=None)
            excel_values = df.values.flatten().astype(str).tolist()
            valid_numbers_found = set(find_valid_numbers(' '.join(excel_values), valid_numbers_list))
        elif filename.endswith('.pdf'):
            # Read and process PDF documents
            pdf_text = extract_text_from_pdf(file_path)
            valid_numbers_found = set(find_valid_numbers(pdf_text, valid_numbers_list))
        else:
            # Skip unsupported file types
            print(f"Skipping: {filename} (Unsupported file type)")
            continue

        if valid_numbers_found:
            # Construct the new filename using the found numbers separated by a hyphen
            new_filename = '-'.join(valid_numbers_found) + '_Rapport' + os.path.splitext(filename)[1]
            # Copy the file to the output folder with the new filename
            shutil.copy(file_path, os.path.join(output_folder, new_filename))
            print(f"Processed: {filename} -> {new_filename}")
        else:
            print(f"Skipping: {filename} (Could not find valid numbers in the document)")
    except Exception as e:
        print(f"Error processing {filename}: {str(e)}")

print("Processing complete.")
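I tried some troubleshooting with ChatGPT, but none of the solutions it suggested worked. As I said, I can read and understand most simple code, but I don't have the skills to solve this problem.
Thanks for your help!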
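One possible cause, which cannot be confirmed without the picture: in python-docx, doc.paragraphs returns only the top-level paragraphs of the document body. Text placed in a table cell, a text box, or a header or footer is not included, and a number positioned over a drawn line is often stored in one of those containers. The sketch below rests on that assumption. It bypasses python-docx and reads the XML parts of the .docx archive directly with the standard library, collecting every text run wherever it sits. The function names extract_all_docx_text and _collect_text are hypothetical and do not come from the original script.

```python
import re
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the XML parts inside a .docx archive
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"
BREAK_TAGS = {W_NS + "tab", W_NS + "br", W_NS + "cr"}


def _collect_text(elem, out):
    """Depth-first walk that appends run text and paragraph breaks to out."""
    if elem.tag == W_NS + "t":
        out.append(elem.text or "")
    elif elem.tag in BREAK_TAGS:
        out.append(" ")
    is_paragraph = elem.tag == W_NS + "p"
    if is_paragraph:
        # Newlines keep a nested paragraph (e.g. inside a text box)
        # from merging with digits in the surrounding paragraph.
        out.append("\n")
    for child in elem:
        _collect_text(child, out)
    if is_paragraph:
        out.append("\n")


def extract_all_docx_text(docx_path):
    """Return text from the body, tables, text boxes, headers and footers.

    Hypothetical helper: assumes the missing number is stored somewhere
    that doc.paragraphs does not reach.
    """
    out = []
    with zipfile.ZipFile(docx_path) as archive:
        for name in archive.namelist():
            if re.fullmatch(r"word/(document|header\d*|footer\d*)\.xml", name):
                _collect_text(ET.fromstring(archive.read(name)), out)
    return "".join(out)
```

To try it, the line that builds doc_text in the .docx branch could be replaced with doc_text = extract_all_docx_text(file_path), leaving the rest of the loop unchanged. Text-box content may appear twice in the XML, which is harmless here because the matched numbers end up in a set. If the number is still not found, it may be part of an image rather than text, and this approach would not help.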
A: No answers yet