Asked by: xxgaryxx Asked: 11/13/2023 Updated: 11/14/2023 Views: 44
Iteration in Parsing Script: Value remains the same
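Q:
I am currently developing a parser that iterates over .txt files containing earnings call transcripts. The goal is to extract the parts spoken by the CEO. The code snippet below is part of a larger script that also extracts other information, such as the call date and the company. You can find the full transcript, together with the regular expression, here: https://regex101.com/r/mhKevB/1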
import re
presentation_part = """
--------------------------------------------------------------------------------
Inge G. Thulin, 3M Company - Chairman, CEO & President [3]
--------------------------------------------------------------------------------
Thank you, Bruce, and good morning, everyone. Coming off a strong 2017, our team opened the new year with broad-based organic growth across all business groups. We expanded margins and posted a double-digit increase in earnings per share while continuing to invest in our business and return cash to our shareholders.
"""
ceos_lname_clean = ['Thulin', 'Davis']
try:
    ceos_speaches_pres = []
    if len(ceos_lname_clean) != 0:
        for lname in ceos_lname_clean:
            # Alternative pattern that matches on the CEO's name in addition to the term "CEO"
            ceo_pattern = fr'(?m){lname}.*?(?:CEO|Chief Executive Officer)\b(?:(?!\n-+$).)*?\[\d+\]\s+^-+\s+((?s:.*?))(?=\s+^-+|\Z)'
            ceo_textparts_pres = re.findall(ceo_pattern, presentation_part, re.DOTALL | re.IGNORECASE)
            ceo_speech_presentation = " ".join(ceo_textparts_pres)
            ceos_speaches_pres.append(ceo_speech_presentation)
        #Overall_dict[folder][comp_path]["CEO Presentation Speech"] = ceos_speaches_pres ##Add the text to a dict
    else:  ##try for COO in case ceos_lname_clean is empty
        coos_speaches_pres = []
        for coo_lname in coos_lname_clean:
            # Alternative pattern that matches on the COO's name in addition to the term "COO"
            coo_pattern = fr'(?m){coo_lname}.*?(?:COO|Chief Operating Officer)\b(?:(?!\n-+$).)*?\[\d+\]\s+^-+\s+((?s:.*?))(?=\s+^-+|\Z)'
            coo_textparts_pres = re.findall(coo_pattern, presentation_part, re.DOTALL | re.IGNORECASE)
            coo_speech_presentation = " ".join(coo_textparts_pres)
            coos_speaches_pres.append(coo_speech_presentation)
        #Overall_dict[folder][comp_path]["COO Presentation Speech"] = coos_speaches_pres ##Add the text to a dict
except:
    print("PROBLEM")
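The snippet above successfully extracts the text spoken by Thulin. However, once it is integrated into the full script, a problem appears: ceo_textparts_pres retains the value from the previous iteration. That is, although ceo_textparts_pres should be empty for Davis, it still holds the text spoken by Thulin.
I have spent a whole day trying to solve this without success, and I am getting increasingly frustrated. Unfortunately, the full script is too long to post here, but even the smallest hint or suggestion about what might cause this would be greatly appreciated.
Thanks in advance for your help.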
A:
1 upvote
steviestickman
11/14/2023
#1
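The part of the regex pattern between the last name and the title, i.e. the .*? in {lname}.*?(CEO|Ch.Ex.Of.), matches across multiple lines because of the re.DOTALL flag. As a result, both Davis and Thulin match in the presentation part.
I suggest not using the re.DOTALL flag globally. Instead, turn it on only for specific parts with (?s:.*), or match newlines explicitly with (?:\n|.)*.
To demonstrate, I've added a test case below that contains both patterns. The commented-out line disables DOTALL for that part by using (?-s:.*?) instead of .*?, and it does not match the bad case.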
import re
presentation_part = """
today Davis
met miss Thulin
they were both CEO
on day number [3]
- bad case"""
ceos_lname_clean = ["Thulin", "Davis"]
ceos_speaches_pres = []
for lname in ceos_lname_clean:
    ceo_pattern = rf"(?m){lname}.*?(?:CEO|Chief Executive Officer)\b(?:(?!\n-+$).)*?\[\d+\]\s+^-+\s+((?s:.*?))(?=\s+^-+|\Z)"
    # ceo_pattern = rf"(?m){lname}(?-s:.*?)(?:CEO|Chief Executive Officer)\b(?:(?!\n-+$).)*?\[\d+\]\s+^-+\s+((?s:.*?))(?=\s+^-+|\Z)"
    ceo_textparts_pres = re.findall(
        ceo_pattern, presentation_part, re.DOTALL | re.IGNORECASE
    )
    ceo_speech_presentation = " ".join(ceo_textparts_pres)
    ceos_speaches_pres.append(ceo_speech_presentation)
print(ceos_speaches_pres)
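The same effect can be sketched on a transcript-shaped input. This is only an illustration under stated assumptions: the Thulin header line is taken from the question, while the participant line mentioning Davis and the second speaker are made up, and the helper names build_pattern and extract are hypothetical. The sketch also wraps the name in re.escape, which is an addition not present in the original pattern.

```python
import re

# Hypothetical excerpt: only the Thulin header comes from the question.
# The participant line and the second speaker are invented for illustration.
transcript = """\
Corporate Participants
 * J. Davis, Example Corp - Analyst
--------------------------------------------------------------------------------
Inge G. Thulin, 3M Company - Chairman, CEO & President [3]
--------------------------------------------------------------------------------
Thank you, Bruce, and good morning, everyone.
--------------------------------------------------------------------------------
Jane Roe, Example Corp - CFO [4]
--------------------------------------------------------------------------------
Thanks, Inge.
"""


def build_pattern(lname: str, scoped: bool) -> str:
    """Build the speaker pattern; `scoped` disables DOTALL between name and title."""
    name = re.escape(lname)  # assumption: names should be matched literally
    gap = r"(?-s:.*?)" if scoped else r".*?"
    return (
        rf"(?m){name}{gap}(?:CEO|Chief Executive Officer)\b"
        r"(?:(?!\n-+$).)*?\[\d+\]\s+^-+\s+((?s:.*?))(?=\s+^-+|\Z)"
    )


def extract(lname: str, scoped: bool) -> str:
    """Join all speech blocks found for one last name."""
    parts = re.findall(
        build_pattern(lname, scoped), transcript, re.DOTALL | re.IGNORECASE
    )
    return " ".join(parts)


for lname in ["Thulin", "Davis"]:
    print(lname, "| unscoped:", repr(extract(lname, scoped=False)))
    print(lname, "| scoped:  ", repr(extract(lname, scoped=True)))
```

With this input, the unscoped pattern starts at the earlier mention of Davis, runs across the line breaks until it reaches the CEO title in Thulin's header, and so returns Thulin's speech for Davis. The scoped variant keeps the name and the title on the same line, so Davis yields an empty string while Thulin still matches.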
Comments
0 upvotes
xxgaryxx
11/14/2023
Thank you very much! I would never have thought of that.