Non UTF-8 compliant character "\x{0D}" at the end of output csv rows

Asked by: jfontana · Asked: 6/19/2023 · Last edited by: jfontana · Updated: 6/27/2023 · Views: 110

Q:

I have a very frustrating problem that I don't know how to solve. The following Python script processes a bunch of XML documents in a directory and extracts information from them. Using that information, it creates a CSV file.

import re
import time
import csv
from lxml import etree as et
from pathlib import Path
from joblib import Parallel, delayed
from tqdm import tqdm
import ftfy


st = time.time()

XMLDIR = Path('/Users/josepm.fontana/Downloads/CICA_CORPUS_XML_CLEAN')
files = [e for e in XMLDIR.iterdir() if e.is_file()]
xml_doc = [f for f in files if f.with_suffix(".xml")]
myCSV_FILE = "/Volumes/SanDisk1TB/_CORPUS_WORK/CSVs/TestDataSet19-6-23_YZh.csv"
time_log = Path(
    '/Volumes/SanDisk1TB/_CORPUS_WORK/TEXT_FILES/log_time_pathlib.txt')
results = Path(
    '/Volumes/SanDisk1TB/_CORPUS_WORK/TEXT_FILES/TestDataSet19-6-23_YZh.txt')


tok_path = et.XPath('//tok | //dtok')


def xml_extract(xml_doc):

    root_element = et.parse(xml_doc).getroot()
    autor = None
    data = None
    tipus = None
    dialecte = None

    header = root_element.find("header")
    if header is not None:
        for el in header:

            if el.get("type") == "autor":
                autor = el.text
                autor = ftfy.fix_text(autor)
            elif el.get("type") == "data":
                data = el.text
                data = ftfy.fix_text(data)
            elif el.get("type") == "tipologia":
                tipus = el.text
                tipus = ftfy.fix_text(tipus)
            elif el.get("type") == "dialecte":
                dialecte = el.text
                dialecte = ftfy.fix_text(dialecte)


    all_toks = tok_path(root_element)

    matching_toks = filter(lambda tok: tok.get('xpos') is not None and tok.get(
        'xpos').startswith('A') and not (tok.get('xpos').startswith('AX')), all_toks)

    for el in matching_toks:
        preceding_tok = el.xpath(
            "./preceding-sibling::tok[1][@lemma and @xpos]")
        preceding_tok_with_dtoks = el.xpath(
            "./preceding-sibling::tok[1][not(@lemma) and not(@xpos)]"
        )
        following_dtok_of_dtok = el.xpath("./preceding-sibling::dtok[1]")

        if el.tag == 'tok':
            tok_dtok = 'tok'
            Adj = "".join(el.itertext())
            Adj_lemma = el.get('lemma')
            Adj_xpos = el.get('xpos')
            Adj = ftfy.fix_text(Adj)

        elif el.tag == 'dtok':
            tok_dtok = 'dtok'
            Adj = el.get('form')
            Adj_lemma = el.get('lemma')
            Adj_xpos = el.get('xpos')
            Adj = ftfy.fix_text(Adj)

        pos = all_toks.index(el)

        RelevantPrecedingElements = all_toks[max(pos - 6, 0):pos]

        RelevantFollowingElements = all_toks[pos + 1:max(pos + 6, 1)]

        if RelevantPrecedingElements:
            prec1 = RelevantPrecedingElements[-1]
        else:
            prec1 = None

        if RelevantFollowingElements:
            foll1 = RelevantFollowingElements[0]
        else:
            foll1 = None

        ElementsContext = all_toks[max(pos - 6, 0):pos + 1]

        context_list = []

        if ElementsContext:
            for elem in ElementsContext:
                elem_text = "".join(elem.itertext())
                assert elem_text is not None
                context_list.append(elem_text)

        Adj = f"<{Adj}>"

        for elem in RelevantFollowingElements:
            elem_text = "".join(elem.itertext())
            assert elem_text is not None
            context_list.append(elem_text)

        fol_lem = foll1.get('lemma') if foll1 is not None else None
        prec_lem = prec1.get('lemma') if prec1 is not None else None
        fol_xpos = foll1.get('xpos') if foll1 is not None else None
        prec_xpos = prec1.get('xpos') if prec1 is not None else None
        
        
        fol_form = None
        if foll1 is not None:
            if foll1.tag == "tok":
                fol_form = foll1.text
            elif foll1.tag == "dtok":
                fol_form = foll1.get("form")
            
        prec_form = None
        if prec1 is not None:

            if prec1.tag == "tok":
                prec_form = prec1.text
            elif prec1.tag == "dtok":
                prec_form = prec1.get("form")

        context = " ".join(context_list).replace(
            " ,", ",").replace(" .", ".").replace("   ", " ").replace("  ", " ")

        llista = [
            context,
            prec_form,
            Adj,
            fol_form,
            prec_lem,
            Adj_lemma,
            fol_lem,
            prec_xpos,
            Adj_xpos,
            fol_xpos,
            tok_dtok,
            xml_doc.name,
            autor,
            data,
            tipus,
            dialecte,
        ]

        writer = csv.writer(csv_file, delimiter=";")
        writer.writerow(llista)
        with open(results, "a") as Results:
            Results.write(f"@@@ {context} @@@\n\n")
            Results.write(f"Source: {xml_doc.name}\n\n\n")


with open(myCSV_FILE, "a+", encoding="UTF8", newline='') as csv_file:

    #Parallel(n_jobs=-1,  prefer="threads")(delayed(xml_extract)(xml_doc) for xml_doc in tqdm(files))
    Parallel(n_jobs=-1, prefer="threads")(delayed(xml_extract)(xml_doc) for xml_doc in tqdm(files) if not xml_doc.name.startswith("."))

elapsed_time = time.time() - st

with open(
    time_log, "a"
) as Myfile:
    Myfile.write(f"\n \n The end: The whole process took {elapsed_time} \n")

The text files it creates are perfectly valid UTF-8. All the XML documents have been double- and triple-checked to make sure they are also properly encoded as UTF-8.

However, at the end of every row of the CSV file it creates there is a "\x{0D}" character.
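
(For reference, a minimal standard-library check of where a stray "\r" can come from: the csv module's default dialect terminates every row with "\r\n", even on macOS and Linux, and opening the file with newline='' passes that "\r" through to disk verbatim. The demo.csv path here is only for illustration:)

import csv

# csv.writer's default dialect ("excel") ends every row with "\r\n", and
# newline="" tells open() not to translate newlines, so the "\r" lands in
# the file verbatim.
with open("demo.csv", "w", newline="", encoding="utf-8") as f:
    csv.writer(f, delimiter=";").writerow(["a", "b"])
with open("demo.csv", "rb") as f:
    print(f.read())  # b'a;b\r\n'  <- note the 0x0D before the 0x0A

# Passing lineterminator="\n" makes the writer emit LF-only rows instead.
with open("demo.csv", "w", newline="", encoding="utf-8") as f:
    csv.writer(f, delimiter=";", lineterminator="\n").writerow(["a", "b"])
with open("demo.csv", "rb") as f:
    print(f.read())  # b'a;b\n'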

I don't understand this at all. This script is based on the following one, which creates properly formatted CSV files where this problem never occurs. The main difference is that in the problematic code I introduced parallelization via the joblib library, because otherwise it takes forever to process all these files.

import re
import time
import csv
from lxml import etree as et
from pathlib import Path


st = time.time()

#XMLDIR = Path('/Volumes/SanDisk1TB/_CORPUS_WORK/CICA_WORKING_NEW')
XMLDIR = Path('/Users/josepm.fontana/Downloads/CICA_CORPUS_XML_CLEAN')
files = [e for e in XMLDIR.iterdir() if e.is_file()]
xml_doc = [f for f in files if f.with_suffix(".xml")]
myCSV_FILE = "/Volumes/SanDisk1TB/_CORPUS_WORK/CSVs/clitic_context_testTEST2.csv"
time_log = Path('/Volumes/SanDisk1TB/_CORPUS_WORK/TEXT_FILES/log_time_pathlib.txt')
results = Path('/Volumes/SanDisk1TB/_CORPUS_WORK/TEXT_FILES/resultsTEST2.txt')



tok_path = et.XPath('//tok')

def xml_extract(root_element):

    all_toks = tok_path(root_element)

    matching_toks = filter(lambda tok: re.match(r'^[EeLl][LlOoAa][Ss]*$', "".join(tok.itertext())) is not None and not(tok.get('xpos').startswith('D')), all_toks)

    for el in matching_toks: 

        fake_clitic = "".join(el.itertext())
        pos = all_toks.index(el)


        RelevantPrecedingElements = all_toks[max(pos - 6, 0):pos]
        print(RelevantPrecedingElements)

        prec1 = RelevantPrecedingElements[-1]
        #foll1 = all_toks[pos + 1]

        RelevantFollowingElements = all_toks[pos + 1:max(pos + 6, 1)]
        #prec1 = RelevantFollowingElements[]
        #foll1 = all_toks[pos + 1]
        print(RelevantFollowingElements)

        foll1 = RelevantFollowingElements[0]

        context_list = []
        context_clean = []

        for elem in RelevantPrecedingElements:
            elem_text = "".join(elem.itertext())
            assert elem_text is not None
            context_list.append(elem_text)            
            context_clean.append(elem_text)

        # adjective = '<' + str(el.text) + '>'
        fake_clitic = f"<{fake_clitic}>"
        fake_clitic_clean = f"{el.text}"

        print(fake_clitic)
        context_list.append(fake_clitic)
        context_clean.append(fake_clitic_clean)

        for elem in RelevantFollowingElements:
            elem_text = "".join(elem.itertext())
            assert elem_text is not None
            context_list.append(elem_text)
            context_clean.append(elem_text)



        lema_fol = foll1.get('lemma') if foll1 is not None else None
        lema_prec = prec1.get('lemma') if prec1 is not None else None
        xpos_fol = foll1.get('xpos') if foll1 is not None else None
        xpos_prec = prec1.get('xpos') if prec1 is not None else None
        form_fol = foll1.text if foll1 is not None else None
        form_prec = prec1.text if prec1 is not None else None

        context = " ".join(context_list)
        clean_context = " ".join(context_clean).replace(" ,", ",").replace(" .", ".")
        print(f"Context is: {context}")


        llista = [
            context,
            lema_prec,
            xpos_prec,
            form_prec,
            fake_clitic,
            lema_fol,
            xpos_fol,
            form_fol,
        ]

        writer = csv.writer(csv_file, delimiter=";")
        writer.writerow(llista)
        with open(
            results, "a"
        ) as Results:
            Results.write(f"@@@ {context} @@@\n\n")
            Results.write(f"{clean_context}\n\n")
            Results.write(f"Source: {xml_doc.name}\n\n\n")

with open(myCSV_FILE, "a+", encoding="UTF8", newline="") as csv_file:

    for xml_doc in files:
        if xml_doc.name.startswith("."):
            continue
        doc = xml_doc.stem # this was 
        print(doc)
        start_file_time_beforeParse = time.time()
        print(start_file_time_beforeParse)
        print(
            f"{time.time() - st} seconds after the beginning of the process I'm starting to get the root of {xml_doc.name}"
        )
        file_root = et.parse(xml_doc).getroot()
        xml_extract(file_root)
        print(
            f"I ran through {xml_doc.name} in {time.time() - start_file_time_beforeParse} seconds!"
        )
        with open(
            time_log, "a"
        ) as Myfile:
            Myfile.write("Time it took to getroot and parse ")
            Myfile.write(xml_doc.name)
            Myfile.write("\n")
            Myfile.write("Time it took to loop through the entire ")
            Myfile.write(xml_doc.name)
            Myfile.write(" is: ")
            Myfile.write(f"{time.time() - start_file_time_beforeParse} seconds!")
            Myfile.write("\n")
            Myfile.write("\n")

elapsed_time = time.time() - st


with open(
    time_log, "a"
) as Myfile:
    Myfile.write(f"\n \n The end: The whole process took {elapsed_time} \n")


print("Execution time:", elapsed_time, "seconds")

I would appreciate any help you can offer. This is really frustrating.

Here are links to some sample XML files like the ones I am trying to process:

Sample XML files

EDIT:

My adaptation of Zach Young's script to handle the task in question:

import csv
import re
import time

from pathlib import Path

from lxml import etree as et

beg_main = time.time()

#xmls_dir = Path("./xmls")

xmls_dir = Path('/PathTo/CLEAN_COMP_TEST2')
files = [e for e in xmls_dir.iterdir() if e.is_file()]
xml_files = [f for f in files if f.with_suffix(".xml")]

csv_path = Path("/PathTo/My_Output.csv")


csv_file = open(csv_path, "w", newline="", encoding="utf-8")
writer = csv.writer(csv_file, delimiter=";")

results_path = Path("/PathTo/my_results.txt")


results = open(results_path, "w", encoding="utf-8")

times_path = Path("/PathTo/my_times.txt")
times = open(times_path, "w", encoding="utf-8")

tok_path = et.XPath('//tok | //dtok')

def xml_extract(doc_root, fname: str):
    all_toks = tok_path(doc_root)    
    
    matching_toks = filter(
        lambda tok: (
            tok.get('xpos') is not None
            and tok.get('xpos').startswith('A')
            and not tok.get('xpos').startswith('AX')
        ),
        all_toks,
    )
    
    for el in matching_toks:
        preceding_tok = el.xpath(
            "./preceding-sibling::tok[1][@lemma and @xpos]")
        preceding_tok_with_dtoks = el.xpath(
            "./preceding-sibling::tok[1][not(@lemma) and not(@xpos)]"
        )
        following_dtok_of_dtok = el.xpath("./preceding-sibling::dtok[1]")

        if el.tag == 'tok':
            tok_dtok = 'tok'
            Adj = "".join(el.itertext())
            Adj_lemma = el.get('lemma')
            Adj_xpos = el.get('xpos')

        elif el.tag == 'dtok':
            tok_dtok = 'dtok'
            Adj = el.get('form')
            Adj_lemma = el.get('lemma')
            Adj_xpos = el.get('xpos')

        pos = all_toks.index(el)

        RelevantPrecedingElements = all_toks[max(pos - 6, 0):pos]

        RelevantFollowingElements = all_toks[pos + 1:max(pos + 6, 1)]

        if RelevantPrecedingElements:
            prec1 = RelevantPrecedingElements[-1]
        else:
            prec1 = None

        if RelevantFollowingElements:
            foll1 = RelevantFollowingElements[0]
        else:
            foll1 = None

        ElementsContext = all_toks[max(pos - 6, 0):pos + 1]

        context_list = []

        if ElementsContext:
            for elem in ElementsContext:
                elem_text = "".join(elem.itertext())
                assert elem_text is not None
                context_list.append(elem_text)

        Adj = f"<{Adj}>"

        for elem in RelevantFollowingElements:
            elem_text = "".join(elem.itertext())
            assert elem_text is not None
            context_list.append(elem_text)


        fol_lem = foll1.get('lemma') if foll1 is not None else None
        prec_lem = prec1.get('lemma') if prec1 is not None else None
        fol_xpos = foll1.get('xpos') if foll1 is not None else None
        prec_xpos = prec1.get('xpos') if prec1 is not None else None
        

        fol_form = None

        if foll1 is not None:
            if foll1.tag == "tok":
                fol_form = foll1.text
            elif foll1.tag == "dtok":
                fol_form = foll1.get("form")
                
        prec_form = None
        if prec1 is not None:

            if prec1.tag == "tok":
                prec_form = prec1.text
            elif prec1.tag == "dtok":
                prec_form = prec1.get("form")
                
        context = " ".join(context_list).replace(
            " ,", ",").replace(" .", ".").replace("   ", " ").replace("  ", " ")

        #print(f"Context is: {context}")
        

        llista = [
            context,
            prec_form,
            Adj,
            fol_form,
            prec_lem,
            Adj_lemma,
            fol_lem,
            prec_xpos,
            Adj_xpos,
            fol_xpos,
            tok_dtok,
            xml_file.name,
            autor,
            data,
            tipus,
            dialecte,
        ]

        writer.writerow(llista)
        results.write(f"@@@ {context} @@@\n\n")
        results.write(f"Source: {fname}\n\n\n")


for xml_file in xml_files:
    if xml_file.name.startswith("."):
        continue

    beg_extract = time.time()
    doc_root = et.parse(xml_file, parser=None).getroot()
    obra = None
    autor = None
    data = None
    tipus = None
    dialecte = None

    header = doc_root.find("header")
    if header is not None:
        for el in header:
            if el.get("type") == "obra":
                obra = el.text
            elif el.get("type") == "autor":
                autor = el.text
            elif el.get("type") == "data":
                data = el.text
            elif el.get("type") == "tipologia":
                tipus = el.text
            elif el.get("type") == "dialecte":
                dialecte = el.text

    xml_extract(doc_root, xml_file.name)

    times.write(f"Time to extract {xml_file.name}: {time.time() - beg_extract}s\n")

elapsed = time.time() - beg_main
times.write(f"\n \n The end: The whole process took {elapsed}s\n")

print("Execution time:", elapsed, "seconds")

python csv utf-8 character-encoding

Comments

0 votes · JosefZ · 6/20/2023
Edit your question to improve your minimal reproducible example. In particular, stick to "minimal"...
1 vote · JosefZ · 6/20/2023
I don't know that syntax, but "\x{0D}" is the same as "\x0D", i.e. "\r" (U+000D, CARRIAGE RETURN (CR)). See also editpadpro.com/tricklinebreak.html
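
(A quick check of that equivalence in Python, for reference; "\x{0D}" is Perl/PCRE notation, and Python spells the same code point "\x0d" or "\r":)

# "\x{0D}" is Perl/PCRE notation for code point U+000D; Python writes the
# same character as "\x0d" or "\r".
assert "\x0d" == "\r"
print(hex(ord("\r")))  # 0xd
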
0 votes · jfontana · 6/20/2023
Thanks @JosefZ. Believe me, I would give a more minimal example if I could, but that is exactly the problem: this weird issue only happens when I use parallelization here, so it has to do with the interaction with the joblib library. As I said, I have no problem with the second script, yet as far as I can tell there is no difference in how the CSV file is written. Hopefully someone who is really good at this can examine the intricacies of the code and spot the problem.
1 vote · jfontana · 6/20/2023
Thanks for the insightful comments, Zach. I still have a lot to learn about encodings. The logging and the progress bar were introduced during a period of heavy debugging and monitoring and can be removed now. The character-encoding problem, however, is definitely related to faulty writes to the CSV from concurrent processes.
1 vote · jfontana · 6/20/2023
@Zach Young, I am not sure how to implement your suggestion, because the number of XML documents, and the length of some of them, is huge. That is why the results need to be written out and the document closed as soon as each XML document is finished. I don't see an easy way to hold all the information from 400+ documents in memory and then go back over it to open the CSV and text files and write.
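
(One way around that constraint, as a sketch: have each worker return its finished rows instead of writing them, and let the main thread be the only writer. The rows are short strings, far smaller than the parsed XML trees, so collecting them is much cheaper than holding 400+ parsed documents in memory. The extract_rows name is hypothetical; files and myCSV_FILE are as in the first script:)

import csv

from joblib import Parallel, delayed

def extract_rows(xml_doc):
    """Hypothetical variant of xml_extract: build and return the finished
    `llista` rows for one document instead of writing them."""
    rows = []
    # ... same parsing and extraction as in xml_extract, with
    # rows.append(llista) in place of writer.writerow(llista) ...
    return rows

# Workers only parse and extract; the single writer below cannot interleave.
all_rows = Parallel(n_jobs=-1, prefer="threads")(
    delayed(extract_rows)(f) for f in files if not f.name.startswith("."))

with open(myCSV_FILE, "w", encoding="UTF8", newline="") as csv_file:
    writer = csv.writer(csv_file, delimiter=";")
    for rows in all_rows:
        writer.writerows(rows)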

A:

1 vote · Zach Young · 6/21/2023 · #1

Based on our little discussion in the comments, I suggest starting with the following. You can open all the files for writing once, at the very top, then reference them wherever you need to write (not in parallel, though, just synchronously):

import csv
import re
import time

from pathlib import Path

from lxml import etree as et

beg_main = time.time()

xmls_dir = Path("./xmls")
files = [e for e in xmls_dir.iterdir() if e.is_file()]
xml_files = [f for f in files if f.with_suffix(".xml")]

csv_path = Path("./my_output.csv")
csv_file = open(csv_path, "w", newline="", encoding="utf-8")
writer = csv.writer(csv_file, delimiter=";")

results_path = Path("./my_results.txt")
results = open(results_path, "w", encoding="utf-8")

times_path = Path("./my_times.txt")
times = open(times_path, "w", encoding="utf-8")

tok_path = et.XPath("//tok")

def xml_extract(doc_root, fname: str):
    all_toks = tok_path(doc_root)

    matching_toks = filter(
        lambda tok: (
            re.match(r"^[EeLl][LlOoAa][Ss]*$", "".join(tok.itertext())) is not None
            and not (tok.get("xpos").startswith("D"))
        ),
        all_toks,
    )

    for el in matching_toks:
        fake_clitic = "".join(el.itertext())
        pos = all_toks.index(el)

        RelevantPrecedingElements = all_toks[max(pos - 6, 0) : pos]

        prec1 = RelevantPrecedingElements[-1]

        RelevantFollowingElements = all_toks[pos + 1 : max(pos + 6, 1)]

        foll1 = RelevantFollowingElements[0]

        context_list = []
        context_clean = []

        for elem in RelevantPrecedingElements:
            elem_text = "".join(elem.itertext())
            assert elem_text is not None
            context_list.append(elem_text)
            context_clean.append(elem_text)

        fake_clitic = f"<{fake_clitic}>"
        fake_clitic_clean = f"{el.text}"

        context_list.append(fake_clitic)
        context_clean.append(fake_clitic_clean)

        for elem in RelevantFollowingElements:
            elem_text = "".join(elem.itertext())
            assert elem_text is not None
            context_list.append(elem_text)
            context_clean.append(elem_text)

        lema_fol = foll1.get("lemma") if foll1 is not None else None
        lema_prec = prec1.get("lemma") if prec1 is not None else None
        xpos_fol = foll1.get("xpos") if foll1 is not None else None
        xpos_prec = prec1.get("xpos") if prec1 is not None else None
        form_fol = foll1.text if foll1 is not None else None
        form_prec = prec1.text if prec1 is not None else None

        context = " ".join(context_list)
        clean_context = " ".join(context_clean).replace(" ,", ",").replace(" .", ".")

        llista = [
            context,
            lema_prec,
            xpos_prec,
            form_prec,
            fake_clitic,
            lema_fol,
            xpos_fol,
            form_fol,
        ]

        writer.writerow(llista)

        results.write(f"@@@ {context} @@@\n\n")
        results.write(f"{clean_context}\n\n")
        results.write(f"Source: {fname}\n\n\n")

for xml_file in xml_files:
    if xml_file.name.startswith("."):
        continue

    beg_extract = time.time()
    doc_root = et.parse(xml_file, parser=None).getroot()
    xml_extract(doc_root, xml_file.name)

    times.write(f"Time to extract {xml_file.name}: {time.time() - beg_extract}s\n")

elapsed = time.time() - beg_main
times.write(f"\n \n The end: The whole process took {elapsed}s\n")

print("Execution time:", elapsed, "seconds")

When the program exits, Python closes the files for you, so you don't need all the with open(...) indentation.

I ran this version and yours against the 16 XML files you shared.

On my machine, opening the files once like this, versus opening them inside extract_xml, made some difference: my version ran in about 80% of the time (20% faster than yours). I have a dual-channel SSD, though, so my reads/writes are fast; if you don't have that kind of hardware, the opens/writes/closes will take longer. I don't know whether that is enough to account for the slowdown you experienced, though. Processing all 16 files in the ZIP you shared, mine ran in 0.0055 seconds versus 0.0066 seconds for yours. Also, in my trials I found that just commenting out your print/debug statements saved time as well.

Try my code on the sample XMLs you shared and see how it runs compared to yours.

As for the weird write errors: you will always have multiple agents trying to write at the same time. If you really want or need to pursue parallelism, you will have to figure out how to synchronize the writes so that only one process at a time attempts to (or can) write to any one file... which may defeat the whole reason you wanted to parallelize in the first place.
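
(A minimal sketch of one such synchronization, assuming the prefer="threads" backend from the question so a plain threading.Lock can be shared; write_lock and write_row are names introduced here, not part of the original scripts:)

import csv
import threading

from joblib import Parallel, delayed

write_lock = threading.Lock()  # one lock shared by all worker threads

def write_row(writer, row):
    # Holding the lock guarantees each row is written whole before any
    # other thread can touch the file, so rows can no longer interleave.
    with write_lock:
        writer.writerow(row)

with open("demo.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f, delimiter=";")
    Parallel(n_jobs=-1, prefer="threads")(
        delayed(write_row)(writer, [i, i * i]) for i in range(100))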

Lemme know how this goes for you. Good luck!

Comments

0 votes · jfontana · 6/28/2023
I edited my OP to add my adaptation of your script. The good news is that without parallelization there is no character-encoding problem. The bad news is that even though your script is slightly faster on the sample files, it became quite slow when I tried it on a more representative sample of XML documents. In particular, it choked on a 32 MB XML file, and unfortunately there are a lot of those. Maybe I did something wrong in adapting your code, but it is still slower than I need. I will keep looking. Thanks for your help.
0 votes · Zach Young · 6/28/2023
@jfontana, hey, glad it helped at least a little. I don't see that you have quantified how slow the process is: how long does that 32 MB file take to process? To root out the slow part of the process you will probably need to profile. If you need more help with that, you will need to create a new question (since you have already marked this answer as accepted, it probably won't get anyone's attention). I suggest looking into Python's profiling options and then asking a new question if you get stuck. Also, how often do you have to run this whole process?
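
(A minimal profiling sketch along those lines, using only the standard library's cProfile and pstats; it assumes xml_extract from the scripts above is defined in the running script and that A-02.xml, the slow file mentioned below, is in the working directory:)

import cProfile
import pstats

from lxml import etree as et

# One-off profiling run over the slow file; cProfile.run() executes the
# statement in the __main__ namespace, so xml_extract and doc_root must
# both be defined there.
doc_root = et.parse("A-02.xml").getroot()
cProfile.run("xml_extract(doc_root, 'A-02.xml')", "extract.prof")

# Print the 20 most expensive calls by cumulative time.
pstats.Stats("extract.prof").sort_stats("cumulative").print_stats(20)
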
1 vote · jfontana · 7/13/2023
Sorry, I didn't see this earlier. After I posted my last comment I moved on to other things and didn't check whether there were new posts; I only just saw the notification while doing some cleanup of my email queue. Anyway, here is the link to the file that takes so long to process: dropbox.com/home/Public/A-02.xml.zip
1 vote · jfontana · 7/13/2023
Thank you so much for the gist with the explanations. To me the explanations are even more valuable than the actual solution, so you really made my day.
1 vote · jfontana · 7/14/2023
Try this link instead: dropbox.com/s/k0lzq9tv5fi7jvv/A-02.xml.zip?dl=0 This one should work. I tried your alternative code this morning on the A-02.xml file. The results: ran op_extract in 43 seconds; ran my_extract_index in 0.68 seconds. That is a pretty astonishing difference when you consider the previous version: Time to extract A-02.xml: 16850.478314876556s
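
(A plausible explanation for that gap: the original loop calls all_toks.index(el) for every match, which rescans the whole token list and makes the pass quadratic in the number of tokens; on a 32 MB file that adds up. A sketch of the index-free version, as a drop-in replacement for the matching_toks loop inside xml_extract, assuming all_toks as defined there:)

# Sketch: take `pos` from enumerate() instead of calling all_toks.index(el)
# for every match, turning a quadratic rescan into one linear pass.
for pos, el in enumerate(all_toks):
    xpos = el.get('xpos')
    if xpos is None or not xpos.startswith('A') or xpos.startswith('AX'):
        continue  # same filter as matching_toks, inlined

    RelevantPrecedingElements = all_toks[max(pos - 6, 0):pos]
    RelevantFollowingElements = all_toks[pos + 1:pos + 6]
    # ... rest of the per-token processing unchanged ...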