Python: Convert multiple .docx files from ANSI to UTF-8 in a particular folder

Asked by: Hellena Crainicu Asked: 3/6/2023 Updated: 3/8/2023 Views: 141

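Q:

I am not a very good programmer, but I want to write a Python script that converts multiple .docx files in a particular folder from ANSI to UTF-8.

I would start with the code below, but I don't know how to select the files from a folder. Maybe someone can help me a bit.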

from unidecode import unidecode

python2_text = docx_paragraph.text
unicode_text = python2_text.decode("utf-8", "replace") if isinstance(python2_text , str) else python2_text
unidecode(unicode_text)
python-3.x utf-8
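
For the "how do I select the files from a folder" part, here is a minimal sketch using only the standard library. The helper name `find_docx_files` is made up for illustration, and skipping the `~$` lock files that Word creates next to open documents is an assumption about what is wanted here.

```python
from pathlib import Path


def find_docx_files(folder):
    """Return the .docx files directly inside `folder`, sorted by path."""
    return sorted(
        p for p in Path(folder).iterdir()
        # Keep regular files with a .docx suffix (case-insensitive) and
        # skip the "~$name.docx" lock files Word creates for open documents.
        if p.is_file()
        and p.suffix.lower() == ".docx"
        and not p.name.startswith("~$")
    )


if __name__ == "__main__":
    # List the .docx files in the current working directory
    for path in find_docx_files("."):
        print(path.name)
```

Each returned `Path` can be passed straight to `zipfile.ZipFile` or to a .docx library, so the per-file conversion step can live in its own function.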

A:

1 upvote Andreas 3/6/2023 #1

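import os
import zipfile

import chardet  # third-party package: pip install chardet

# Set the folder path where the .docx files are located
folder_path = os.getcwd()

# Loop through all files in the folder
for filename in os.listdir(folder_path):
    # Skip the "_utf8" copies written by a previous run
    if filename.endswith(".docx") and not filename.endswith("_utf8.docx"):
        # Open the .docx file (it is a ZIP archive)
        file_path = os.path.join(folder_path, filename)
        try:
            with zipfile.ZipFile(file_path) as docx_file:
                # Read the contents of the document.xml file
                xml_content = docx_file.read('word/document.xml')
        except Exception as e:
            print(f"Error opening {file_path}: {e}")
            continue

        # Detect the current encoding of the file
        detected_encoding = chardet.detect(xml_content)['encoding']
        if detected_encoding is None:
            # chardet returns None when it cannot make a guess
            print(f"Could not detect the encoding of {file_path}, skipping")
            continue
        print(f"{file_path} is encoded in {detected_encoding}")
        # chardet may also report "ascii" or "UTF-8-SIG"; both are already valid UTF-8
        # If the detected encoding is not UTF-8, save a copy of the file in UTF-8 format
        if detected_encoding.lower() not in ("utf-8", "utf-8-sig", "ascii"):
            new_filename = os.path.splitext(filename)[0] + "_utf8.docx"
            new_file_path = os.path.join(folder_path, new_filename)
            new_xml = xml_content.decode(detected_encoding, errors="replace").encode('utf-8')
            # Copy every member of the archive: a .docx that contains only
            # word/document.xml is not a valid Word document
            with zipfile.ZipFile(file_path) as src, \
                    zipfile.ZipFile(new_file_path, "w", zipfile.ZIP_DEFLATED) as dst:
                for item in src.infolist():
                    # Swap in the re-encoded document.xml, copy the rest unchanged
                    data = new_xml if item.filename == 'word/document.xml' else src.read(item.filename)
                    dst.writestr(item, data)
            print(f"Converted {file_path} from {detected_encoding} to UTF-8 and saved as {new_file_path}")
        else:
            print(f"{file_path} is already in UTF-8 format")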
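
To make the decode-then-encode step concrete at the byte level, here is a small sketch. It assumes that "ANSI" means Windows code page 1252, which is only a guess: on a differently configured Windows system it could be another code page. The helper name `transcode_to_utf8` is hypothetical.

```python
def transcode_to_utf8(data, source_encoding="cp1252"):
    """Decode bytes from `source_encoding` and re-encode them as UTF-8."""
    # Decoding is strict by default, so bytes that are invalid in the
    # source encoding raise UnicodeDecodeError instead of being mangled.
    text = data.decode(source_encoding)
    return text.encode("utf-8")


if __name__ == "__main__":
    ansi_bytes = "café".encode("cp1252")
    utf8_bytes = transcode_to_utf8(ansi_bytes)
    print(ansi_bytes)  # b'caf\xe9'      (one byte for é)
    print(utf8_bytes)  # b'caf\xc3\xa9'  (two bytes for é)
```

Text that is pure ASCII comes out byte-for-byte identical, which is why a detector reporting "ascii" does not call for any conversion.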

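Comments

0 upvotes Hellena Crainicu 3/6/2023
AttributeError: 'Settings' object has no attribute 'original_encoding'. See the screenshot: snipboard.io/iN3fx6.jpg
1 upvote Andreas 3/7/2023
Which Python are you using? The question is tagged 3.x, but maybe you mean Python 2?
0 upvotes Just Me 3/7/2023
I get the same error. I am using Python version 3.1.0 and PyScripter version 4.2.1.0.
1 upvote Andreas 3/7/2023
You are right. It was an old version; it is working. Could you give me some files to test with? I only have UTF-8 ones right now.
1 upvote Andreas 3/7/2023
These files show up as UTF-8.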
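
One way to double-check a result like "these files show up as UTF-8" without guessing from byte statistics is to read the encoding that `word/document.xml` declares about itself. The sketch below uses only the standard library; the function names are made up for illustration, and the regular expression only handles ASCII-compatible encodings (a UTF-16 part would not match and yields `None`).

```python
import io
import re
import zipfile

# Matches the encoding="..." attribute of an XML declaration such as
# <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
_DECL_RE = re.compile(rb'^<\?xml[^>]*?encoding=["\']([A-Za-z0-9._\-]+)["\']')


def declared_xml_encoding(xml_bytes):
    """Return the encoding named in the XML declaration, or None if absent."""
    # Skip a UTF-8 byte order mark if one is present
    if xml_bytes.startswith(b"\xef\xbb\xbf"):
        xml_bytes = xml_bytes[3:]
    match = _DECL_RE.match(xml_bytes)
    return match.group(1).decode("ascii") if match else None


def docx_declared_encoding(docx_source):
    """Read word/document.xml from a .docx given as a path or file object."""
    with zipfile.ZipFile(docx_source) as archive:
        return declared_xml_encoding(archive.read("word/document.xml"))


if __name__ == "__main__":
    # Build a tiny stand-in archive in memory (not a complete Word document)
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr(
            "word/document.xml",
            b'<?xml version="1.0" encoding="UTF-8" standalone="yes"?><doc/>',
        )
    print(docx_declared_encoding(buffer))  # UTF-8
```

If every file reports `UTF-8` here, the text inside the .docx probably needs no conversion, and the "ANSI" text may instead live in some other kind of file.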