Asked by: jfontana · Asked: 6/19/2023 · Last edited by: jfontana · Updated: 6/27/2023 · Views: 110
Non UTF-8 compliant character "\x{0D}" at the end of output csv rows
Q:
I have a really frustrating problem that I can't figure out how to solve. The following Python script processes a bunch of XML documents in a directory and extracts information from them. With that information, it creates a CSV file.
import re
import time
import csv
from lxml import etree as et
from pathlib import Path
from joblib import Parallel, delayed
from tqdm import tqdm
import ftfy

st = time.time()

XMLDIR = Path('/Users/josepm.fontana/Downloads/CICA_CORPUS_XML_CLEAN')
files = [e for e in XMLDIR.iterdir() if e.is_file()]
xml_doc = [f for f in files if f.with_suffix(".xml")]
myCSV_FILE = "/Volumes/SanDisk1TB/_CORPUS_WORK/CSVs/TestDataSet19-6-23_YZh.csv"
time_log = Path(
    '/Volumes/SanDisk1TB/_CORPUS_WORK/TEXT_FILES/log_time_pathlib.txt')
results = Path(
    '/Volumes/SanDisk1TB/_CORPUS_WORK/TEXT_FILES/TestDataSet19-6-23_YZh.txt')

tok_path = et.XPath('//tok | //dtok')


def xml_extract(xml_doc):
    root_element = et.parse(xml_doc).getroot()
    autor = None
    data = None
    tipus = None
    dialecte = None
    header = root_element.find("header")
    if header is not None:
        for el in header:
            if el.get("type") == "autor":
                autor = el.text
                autor = ftfy.fix_text(autor)
            elif el.get("type") == "data":
                data = el.text
                data = ftfy.fix_text(data)
            elif el.get("type") == "tipologia":
                tipus = el.text
                tipus = ftfy.fix_text(tipus)
            elif el.get("type") == "dialecte":
                dialecte = el.text
                dialecte = ftfy.fix_text(dialecte)
    all_toks = tok_path(root_element)
    matching_toks = filter(lambda tok: tok.get('xpos') is not None and tok.get(
        'xpos').startswith('A') and not (tok.get('xpos').startswith('AX')), all_toks)
    for el in matching_toks:
        preceding_tok = el.xpath(
            "./preceding-sibling::tok[1][@lemma and @xpos]")
        preceding_tok_with_dtoks = el.xpath(
            "./preceding-sibling::tok[1][not(@lemma) and not(@xpos)]"
        )
        following_dtok_of_dtok = el.xpath("./preceding-sibling::dtok[1]")
        if el.tag == 'tok':
            tok_dtok = 'tok'
            Adj = "".join(el.itertext())
            Adj_lemma = el.get('lemma')
            Adj_xpos = el.get('xpos')
            Adj = ftfy.fix_text(Adj)
        elif el.tag == 'dtok':
            tok_dtok = 'dtok'
            Adj = el.get('form')
            Adj_lemma = el.get('lemma')
            Adj_xpos = el.get('xpos')
            Adj = ftfy.fix_text(Adj)
        pos = all_toks.index(el)
        RelevantPrecedingElements = all_toks[max(pos - 6, 0):pos]
        RelevantFollowingElements = all_toks[pos + 1:max(pos + 6, 1)]
        if RelevantPrecedingElements:
            prec1 = RelevantPrecedingElements[-1]
        else:
            prec1 = None
        if RelevantFollowingElements:
            foll1 = RelevantFollowingElements[0]
        else:
            foll1 = None
        ElementsContext = all_toks[max(pos - 6, 0):pos + 1]
        context_list = []
        if ElementsContext:
            for elem in ElementsContext:
                elem_text = "".join(elem.itertext())
                assert elem_text is not None
                context_list.append(elem_text)
        Adj = f"<{Adj}>"
        for elem in RelevantFollowingElements:
            elem_text = "".join(elem.itertext())
            assert elem_text is not None
            context_list.append(elem_text)
        fol_lem = foll1.get('lemma') if foll1 is not None else None
        prec_lem = prec1.get('lemma') if prec1 is not None else None
        fol_xpos = foll1.get('xpos') if foll1 is not None else None
        prec_xpos = prec1.get('xpos') if prec1 is not None else None
        fol_form = None
        if foll1 is not None:
            if foll1.tag == "tok":
                fol_form = foll1.text
            elif foll1.tag == "dtok":
                fol_form = foll1.get("form")
        prec_form = None
        if prec1 is not None:
            if prec1.tag == "tok":
                prec_form = prec1.text
            elif prec1.tag == "dtok":
                prec_form = prec1.get("form")
        context = " ".join(context_list).replace(
            " ,", ",").replace(" .", ".").replace("  ", " ").replace("  ", " ")
        llista = [
            context,
            prec_form,
            Adj,
            fol_form,
            prec_lem,
            Adj_lemma,
            fol_lem,
            prec_xpos,
            Adj_xpos,
            fol_xpos,
            tok_dtok,
            xml_doc.name,
            autor,
            data,
            tipus,
            dialecte,
        ]
        writer = csv.writer(csv_file, delimiter=";")
        writer.writerow(llista)
        with open(results, "a") as Results:
            Results.write(f"@@@ {context} @@@\n\n")
            Results.write(f"Source: {xml_doc.name}\n\n\n")


with open(myCSV_FILE, "a+", encoding="UTF8", newline='') as csv_file:
    # Parallel(n_jobs=-1, prefer="threads")(delayed(xml_extract)(xml_doc) for xml_doc in tqdm(files))
    Parallel(n_jobs=-1, prefer="threads")(delayed(xml_extract)(xml_doc)
                                          for xml_doc in tqdm(files)
                                          if not xml_doc.name.startswith("."))

elapsed_time = time.time() - st
with open(time_log, "a") as Myfile:
    Myfile.write(f"\n \n The end: The whole process took {elapsed_time} \n")
The text files it creates are perfectly valid UTF-8. All the XML documents have been double- and triple-checked to make sure they are also correctly encoded as UTF-8.

However, at the end of every row in the CSV file it creates there is a "\x{0D}" character.
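For reference, "\x{0D}" is Perl-style notation for the carriage return character, "\r" in Python. A minimal sketch of my own (reusing the myCSV_FILE path from the script above) for inspecting what actually terminates each row is to read the file back in binary mode:

# Sketch: read the CSV back in binary mode and print the raw row endings.
# Endings like b'\r\n' vs. b'\r\r\n' show exactly which CR bytes (\x0D)
# terminate each row.
myCSV_FILE = "/Volumes/SanDisk1TB/_CORPUS_WORK/CSVs/TestDataSet19-6-23_YZh.csv"
with open(myCSV_FILE, "rb") as f:
    for raw in list(f)[:5]:
        print(raw[-4:])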
I don't understand this at all. The script is based on the following one, which creates correctly formatted CSV files where this problem does not occur. The main difference is that in the problematic code I introduced parallelization via the joblib library, because otherwise processing all these files takes a very long time.
import re
import time
import csv
from lxml import etree as et
from pathlib import Path

st = time.time()

#XMLDIR = Path('/Volumes/SanDisk1TB/_CORPUS_WORK/CICA_WORKING_NEW')
XMLDIR = Path('/Users/josepm.fontana/Downloads/CICA_CORPUS_XML_CLEAN')
files = [e for e in XMLDIR.iterdir() if e.is_file()]
xml_doc = [f for f in files if f.with_suffix(".xml")]
myCSV_FILE = "/Volumes/SanDisk1TB/_CORPUS_WORK/CSVs/clitic_context_testTEST2.csv"
time_log = Path('/Volumes/SanDisk1TB/_CORPUS_WORK/TEXT_FILES/log_time_pathlib.txt')
results = Path('/Volumes/SanDisk1TB/_CORPUS_WORK/TEXT_FILES/resultsTEST2.txt')

tok_path = et.XPath('//tok')


def xml_extract(root_element):
    all_toks = tok_path(root_element)
    matching_toks = filter(lambda tok: re.match(r'^[EeLl][LlOoAa][Ss]*$', "".join(tok.itertext())) is not None and not(tok.get('xpos').startswith('D')), all_toks)
    for el in matching_toks:
        fake_clitic = "".join(el.itertext())
        pos = all_toks.index(el)
        RelevantPrecedingElements = all_toks[max(pos - 6, 0):pos]
        print(RelevantPrecedingElements)
        prec1 = RelevantPrecedingElements[-1]
        #foll1 = all_toks[pos + 1]
        RelevantFollowingElements = all_toks[pos + 1:max(pos + 6, 1)]
        #prec1 = RelevantFollowingElements[]
        #foll1 = all_toks[pos + 1]
        print(RelevantFollowingElements)
        foll1 = RelevantFollowingElements[0]
        context_list = []
        context_clean = []
        for elem in RelevantPrecedingElements:
            elem_text = "".join(elem.itertext())
            assert elem_text is not None
            context_list.append(elem_text)
            context_clean.append(elem_text)
        # adjective = '<' + str(el.text) + '>'
        fake_clitic = f"<{fake_clitic}>"
        fake_clitic_clean = f"{el.text}"
        print(fake_clitic)
        context_list.append(fake_clitic)
        context_clean.append(fake_clitic_clean)
        for elem in RelevantFollowingElements:
            elem_text = "".join(elem.itertext())
            assert elem_text is not None
            context_list.append(elem_text)
            context_clean.append(elem_text)
        lema_fol = foll1.get('lemma') if foll1 is not None else None
        lema_prec = prec1.get('lemma') if prec1 is not None else None
        xpos_fol = foll1.get('xpos') if foll1 is not None else None
        xpos_prec = prec1.get('xpos') if prec1 is not None else None
        form_fol = foll1.text if foll1 is not None else None
        form_prec = prec1.text if prec1 is not None else None
        context = " ".join(context_list)
        clean_context = " ".join(context_clean).replace(" ,", ",").replace(" .", ".")
        print(f"Context is: {context}")
        llista = [
            context,
            lema_prec,
            xpos_prec,
            form_prec,
            fake_clitic,
            lema_fol,
            xpos_fol,
            form_fol,
        ]
        writer = csv.writer(csv_file, delimiter=";")
        writer.writerow(llista)
        with open(results, "a") as Results:
            Results.write(f"@@@ {context} @@@\n\n")
            Results.write(f"{clean_context}\n\n")
            Results.write(f"Source: {xml_doc.name}\n\n\n")


with open(myCSV_FILE, "a+", encoding="UTF8", newline="") as csv_file:
    for xml_doc in files:
        if xml_doc.name.startswith("."):
            continue
        doc = xml_doc.stem  # this was
        print(doc)
        start_file_time_beforeParse = time.time()
        print(start_file_time_beforeParse)
        print(
            f"{time.time() - st} seconds after the beginning of the process I'm starting to get the root of {xml_doc.name}"
        )
        file_root = et.parse(xml_doc).getroot()
        xml_extract(file_root)
        print(
            f"I ran through {xml_doc.name} in {time.time() - start_file_time_beforeParse} seconds!"
        )
        with open(time_log, "a") as Myfile:
            Myfile.write("Time it took to getroot and parse ")
            Myfile.write(xml_doc.name)
            Myfile.write("\n")
            Myfile.write("Time it took to loop through the entire ")
            Myfile.write(xml_doc.name)
            Myfile.write(" is: ")
            Myfile.write(f"{time.time() - start_file_time_beforeParse} seconds!")
            Myfile.write("\n")
            Myfile.write("\n")

elapsed_time = time.time() - st
with open(time_log, "a") as Myfile:
    Myfile.write(f"\n \n The end: The whole process took {elapsed_time} \n")

print("Execution time:", elapsed_time, "seconds")
I would greatly appreciate any help you can offer. This is really frustrating.

Here are links to some sample XML files like the ones I am trying to process:

EDIT:

Adapting Zach Young's script to the task at hand:
import csv
import re
import time
from pathlib import Path
from lxml import etree as et

beg_main = time.time()

#xmls_dir = Path("./xmls")
xmls_dir = Path('/PathTo/CLEAN_COMP_TEST2')
files = [e for e in xmls_dir.iterdir() if e.is_file()]
xml_files = [f for f in files if f.with_suffix(".xml")]

csv_path = Path("/PathTo/My_Output.csv")
csv_file = open(csv_path, "w", newline="", encoding="utf-8")
writer = csv.writer(csv_file, delimiter=";")

results_path = Path("/PathTo/my_results.txt")
results = open(results_path, "w", encoding="utf-8")

times_path = Path("/PathTo/my_times.txt")
times = open(times_path, "w", encoding="utf-8")

tok_path = et.XPath('//tok | //dtok')


def xml_extract(doc_root, fname: str):
    all_toks = tok_path(doc_root)
    matching_toks = filter(
        lambda tok:
            tok.get('xpos') is not None
            and tok.get('xpos').startswith('A')
            and not tok.get('xpos').startswith('AX'),
        all_toks
    )
    for el in matching_toks:
        preceding_tok = el.xpath(
            "./preceding-sibling::tok[1][@lemma and @xpos]")
        preceding_tok_with_dtoks = el.xpath(
            "./preceding-sibling::tok[1][not(@lemma) and not(@xpos)]"
        )
        following_dtok_of_dtok = el.xpath("./preceding-sibling::dtok[1]")
        if el.tag == 'tok':
            tok_dtok = 'tok'
            Adj = "".join(el.itertext())
            Adj_lemma = el.get('lemma')
            Adj_xpos = el.get('xpos')
        elif el.tag == 'dtok':
            tok_dtok = 'dtok'
            Adj = el.get('form')
            Adj_lemma = el.get('lemma')
            Adj_xpos = el.get('xpos')
        pos = all_toks.index(el)
        RelevantPrecedingElements = all_toks[max(pos - 6, 0):pos]
        RelevantFollowingElements = all_toks[pos + 1:max(pos + 6, 1)]
        if RelevantPrecedingElements:
            prec1 = RelevantPrecedingElements[-1]
        else:
            prec1 = None
        if RelevantFollowingElements:
            foll1 = RelevantFollowingElements[0]
        else:
            foll1 = None
        ElementsContext = all_toks[max(pos - 6, 0):pos + 1]
        context_list = []
        if ElementsContext:
            for elem in ElementsContext:
                elem_text = "".join(elem.itertext())
                assert elem_text is not None
                context_list.append(elem_text)
        Adj = f"<{Adj}>"
        for elem in RelevantFollowingElements:
            elem_text = "".join(elem.itertext())
            assert elem_text is not None
            context_list.append(elem_text)
        fol_lem = foll1.get('lemma') if foll1 is not None else None
        prec_lem = prec1.get('lemma') if prec1 is not None else None
        fol_xpos = foll1.get('xpos') if foll1 is not None else None
        prec_xpos = prec1.get('xpos') if prec1 is not None else None
        fol_form = None
        if foll1 is not None:
            if foll1.tag == "tok":
                fol_form = foll1.text
            elif foll1.tag == "dtok":
                fol_form = foll1.get("form")
        prec_form = None
        if prec1 is not None:
            if prec1.tag == "tok":
                prec_form = prec1.text
            elif prec1.tag == "dtok":
                prec_form = prec1.get("form")
        context = " ".join(context_list).replace(
            " ,", ",").replace(" .", ".").replace("  ", " ").replace("  ", " ")
        #print(f"Context is: {context}")
        llista = [
            context,
            prec_form,
            Adj,
            fol_form,
            prec_lem,
            Adj_lemma,
            fol_lem,
            prec_xpos,
            Adj_xpos,
            fol_xpos,
            tok_dtok,
            xml_file.name,
            autor,
            data,
            tipus,
            dialecte,
        ]
        writer.writerow(llista)
        results.write(f"@@@ {context} @@@\n\n")
        results.write(f"Source: {fname}\n\n\n")


for xml_file in xml_files:
    if xml_file.name.startswith("."):
        continue
    beg_extract = time.time()
    doc_root = et.parse(xml_file, parser=None).getroot()
    obra = None
    autor = None
    data = None
    tipus = None
    dialecte = None
    header = doc_root.find("header")
    if header is not None:
        for el in header:
            if el.get("type") == "obra":
                obra = el.text
            elif el.get("type") == "autor":
                autor = el.text
            elif el.get("type") == "data":
                data = el.text
            elif el.get("type") == "tipologia":
                tipus = el.text
            elif el.get("type") == "dialecte":
                dialecte = el.text
    xml_extract(doc_root, xml_file.name)
    times.write(f"Time to extract {xml_file.name}: {time.time() - beg_extract}s\n")

elapsed = time.time() - beg_main
times.write(f"\n \n The end: The whole process took {elapsed}s\n")
print("Execution time:", elapsed, "seconds")
A:

Going off our little discussion in the comments, I suggest starting with the following. You can open all the files once, at the very top, for writing, then reference them wherever you need to write (not in parallel, though, just synchronously):
import csv
import re
import time
from pathlib import Path
from lxml import etree as et

beg_main = time.time()

xmls_dir = Path("./xmls")
files = [e for e in xmls_dir.iterdir() if e.is_file()]
xml_files = [f for f in files if f.with_suffix(".xml")]

csv_path = Path("./my_output.csv")
csv_file = open(csv_path, "w", newline="", encoding="utf-8")
writer = csv.writer(csv_file, delimiter=";")

results_path = Path("./my_results.txt")
results = open(results_path, "w", encoding="utf-8")

times_path = Path("./my_times.txt")
times = open(times_path, "w", encoding="utf-8")

tok_path = et.XPath("//tok")


def xml_extract(doc_root, fname: str):
    all_toks = tok_path(doc_root)
    matching_toks = filter(
        lambda tok: (
            re.match(r"^[EeLl][LlOoAa][Ss]*$", "".join(tok.itertext())) is not None
            and not (tok.get("xpos").startswith("D"))
        ),
        all_toks,
    )
    for el in matching_toks:
        fake_clitic = "".join(el.itertext())
        pos = all_toks.index(el)
        RelevantPrecedingElements = all_toks[max(pos - 6, 0) : pos]
        prec1 = RelevantPrecedingElements[-1]
        RelevantFollowingElements = all_toks[pos + 1 : max(pos + 6, 1)]
        foll1 = RelevantFollowingElements[0]
        context_list = []
        context_clean = []
        for elem in RelevantPrecedingElements:
            elem_text = "".join(elem.itertext())
            assert elem_text is not None
            context_list.append(elem_text)
            context_clean.append(elem_text)
        fake_clitic = f"<{fake_clitic}>"
        fake_clitic_clean = f"{el.text}"
        context_list.append(fake_clitic)
        context_clean.append(fake_clitic_clean)
        for elem in RelevantFollowingElements:
            elem_text = "".join(elem.itertext())
            assert elem_text is not None
            context_list.append(elem_text)
            context_clean.append(elem_text)
        lema_fol = foll1.get("lemma") if foll1 is not None else None
        lema_prec = prec1.get("lemma") if prec1 is not None else None
        xpos_fol = foll1.get("xpos") if foll1 is not None else None
        xpos_prec = prec1.get("xpos") if prec1 is not None else None
        form_fol = foll1.text if foll1 is not None else None
        form_prec = prec1.text if prec1 is not None else None
        context = " ".join(context_list)
        clean_context = " ".join(context_clean).replace(" ,", ",").replace(" .", ".")
        llista = [
            context,
            lema_prec,
            xpos_prec,
            form_prec,
            fake_clitic,
            lema_fol,
            xpos_fol,
            form_fol,
        ]
        writer.writerow(llista)
        results.write(f"@@@ {context} @@@\n\n")
        results.write(f"{clean_context}\n\n")
        results.write(f"Source: {fname}\n\n\n")


for xml_file in xml_files:
    if xml_file.name.startswith("."):
        continue
    beg_extract = time.time()
    doc_root = et.parse(xml_file, parser=None).getroot()
    xml_extract(doc_root, xml_file.name)
    times.write(f"Time to extract {xml_file.name}: {time.time() - beg_extract}s\n")

elapsed = time.time() - beg_main
times.write(f"\n \n The end: The whole process took {elapsed}s\n")
print("Execution time:", elapsed, "seconds")
When your program exits, Python will close the files for you, so you don't need all the indentation of with open(...).
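If you do want the files closed deterministically without one level of indentation per file, one alternative (my suggestion, not part of the script above; the file names are placeholders) is contextlib.ExitStack from the standard library:

import csv
from contextlib import ExitStack

# Sketch: every file registered on the stack is closed when the block exits,
# keeping the flat structure of the script above.
with ExitStack() as stack:
    csv_file = stack.enter_context(
        open("my_output.csv", "w", newline="", encoding="utf-8"))
    results = stack.enter_context(open("my_results.txt", "w", encoding="utf-8"))
    writer = csv.writer(csv_file, delimiter=";")
    writer.writerow(["a", "b", "c"])
    results.write("done\n")
# Both files are guaranteed closed here.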
I ran this version and yours against the 16 XML files you shared.

On my machine this made some difference compared with opening the files inside extract_xml: mine ran in about 80% of the time (about 20% faster than yours). I do have a dual-channel SSD, though, so my reads/writes are fast; if you don't have that kind of hardware, the open/write/close cycle will take longer. I don't know whether that alone is enough to account for the slowdown you experienced. Processing all 16 files in the ZIP you shared took my version 0.0055 seconds versus 0.0066 seconds for yours. Also, in my trials I found that just commenting out your print/debug statements saved time as well.

Try my code on the sample XML you shared and see how it runs compared with yours.
As for the weird write errors: you had multiple agents trying to write at the same time, always. If you really want/need to pursue the parallelism, you'll need to figure out how to synchronize the writes so that only one process at a time tries to/can write to any one file... and that might defeat the whole reason you wanted to parallelize in the first place.
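For what it's worth, since the question's script runs joblib with prefer="threads", one minimal sketch of such synchronization is a single threading.Lock shared by all workers (safe_writerow is a hypothetical helper; a process-based backend would need multiprocessing primitives instead):

import csv
import threading

write_lock = threading.Lock()

def safe_writerow(writer, row):
    # Serialize access to the shared writer: only one thread writes at a time.
    with write_lock:
        writer.writerow(row)

# Usage sketch with a throwaway file:
with open("demo.csv", "w", newline="", encoding="utf-8") as f:
    w = csv.writer(f, delimiter=";")
    safe_writerow(w, ["col1", "col2"])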
Lemme know how this works out for you. Good luck!
Comments
"\x{0D}"
"\x0D"
"\r"
␍
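Since the stray character is a carriage return, one plausible source is the csv module itself: csv.writer terminates rows with "\r\n" by default. A sketch of forcing LF-only row endings (an assumption about the cause, not something confirmed in this thread; the file name is a placeholder):

import csv

# lineterminator="\n" overrides the writer's default "\r\n" row ending.
with open("lf_only.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f, delimiter=";", lineterminator="\n")
    writer.writerow(["context", "lemma", "xpos"])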