Iteration in Parsing Script: Value remains the same

Asked by: xxgaryxx  Asked: 11/13/2023  Updated: 11/14/2023  Views: 44

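Q:

I am currently developing a parser that iterates over .txt files containing earnings call transcripts. The goal is to extract the parts spoken by the CEO. The code snippet below is part of a larger script that extracts various pieces of information (for example, the call date and the company). You can find the complete transcript, including the regex, here: https://regex101.com/r/mhKevB/1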

import re

presentation_part = """
--------------------------------------------------------------------------------
Inge G. Thulin,  3M Company - Chairman, CEO & President    [3]
--------------------------------------------------------------------------------

          Thank you, Bruce, and good morning, everyone. Coming off a strong 2017, our team opened the new year with broad-based organic growth across all business groups. We expanded margins and posted a double-digit increase in earnings per share while continuing to invest in our business and return cash to our shareholders.
"""

ceos_lname_clean = ['Thulin', 'Davis']


try:
    ceos_speaches_pres = []
    if len(ceos_lname_clean) != 0: 
        for lname in ceos_lname_clean:
            ceo_pattern = fr'(?m){lname}.*?(?:CEO|Chief Executive Officer)\b(?:(?!\n-+$).)*?\[\d+\]\s+^-+\s+((?s:.*?))(?=\s+^-+|\Z)' # Alternative pattern that matches on the CEO's name in addition to the term CEO
            ceo_textparts_pres = re.findall(ceo_pattern, presentation_part, re.DOTALL | re.IGNORECASE)
            ceo_speech_presentation = " ".join(ceo_textparts_pres)
            ceos_speaches_pres.append(ceo_speech_presentation)
        #Overall_dict[folder][comp_path]["CEO Presentation Speech"] = ceos_speaches_pres ##Add the text to a dict

    else: ##try for COO in case ceos_lname_clean is empty
        coos_speaches_pres = [] 
        for coo_lname in coos_lname_clean:
            coo_pattern = fr'(?m){coo_lname}.*?(?:COO|Chief Operating Officer)\b(?:(?!\n-+$).)*?\[\d+\]\s+^-+\s+((?s:.*?))(?=\s+^-+|\Z)' # Alternative pattern that matches on the COO's name in addition to the term COO
            coo_textparts_pres = re.findall(coo_pattern, presentation_part, re.DOTALL | re.IGNORECASE)
            coo_speech_presentation = " ".join(coo_textparts_pres)
            coos_speaches_pres.append(coo_speech_presentation)
        #Overall_dict[folder][comp_path]["COO Presentation Speech"] = coos_speaches_pres ##Add the text to a dict
except Exception as exc:  # report the actual error instead of hiding it
    print("PROBLEM", exc)

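The snippet above successfully extracts the text spoken by Thulin. However, once it is integrated into the full script, a problem appears: ceo_textparts_pres retains the value from the previous iteration. That is, ceo_textparts_pres should stay empty for Davis, but it holds the text spoken by Thulin.

I have spent a whole day trying to solve this without success and am getting increasingly frustrated. Unfortunately, the full script is too long to post here, but even the smallest hint or suggestion about what could cause this would be greatly appreciated.

Thanks in advance for your help.

python regex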


A:

1 upvote  steviestickman  11/14/2023  #1

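The part of the regex pattern between the last name and the rank, that is, the .*? between {lname} and (CEO|Ch.Ex.Of.), matches across multiple lines because of the re.DOTALL flag. As a result, both Davis and Thulin match in the presentation part. I would suggest not using the re.DOTALL flag and instead turning it on only for specific parts with (?s:.*), or matching newlines explicitly, like so: (?:\n|.)*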
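As a quick illustration of the flag behaviour described here, the following minimal sketch compares the global re.DOTALL flag with the scoped inline groups. The two-line sample string is made up for illustration and is not taken from the transcript.

```python
import re

# Made-up two-line sample: the name and the title sit on different lines.
text = "today Davis\nthey were both CEO"

# Flag off: "." stops at the newline, so nothing matches.
no_flag = re.search(r"Davis.*?CEO", text)

# re.DOTALL: "." also matches "\n", so the match spans both lines.
with_flag = re.search(r"Davis.*?CEO", text, re.DOTALL)

# (?-s:...) turns DOTALL off inside the group, even with the flag passed.
scoped_off = re.search(r"Davis(?-s:.*?)CEO", text, re.DOTALL)

# (?s:...) turns DOTALL on inside the group only.
scoped_on = re.search(r"Davis(?s:.*?)CEO", text)

print(no_flag is None, with_flag is not None, scoped_off is None, scoped_on is not None)
# prints: True True True True
```

Scoped inline flag groups such as (?s:...) and (?-s:...) require Python 3.6 or later.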

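To demonstrate, I added a test case below with both patterns. The commented-out line uses (?-s:.*?) instead of .*? to disable DOTALL for that part, and it does not match the bad case.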

import re

presentation_part = """
today Davis
met miss Thulin
they were both CEO
on day number [3]
- bad case"""

ceos_lname_clean = ["Thulin", "Davis"]


ceos_speaches_pres = []
for lname in ceos_lname_clean:
    ceo_pattern = rf"(?m){lname}.*?(?:CEO|Chief Executive Officer)\b(?:(?!\n-+$).)*?\[\d+\]\s+^-+\s+((?s:.*?))(?=\s+^-+|\Z)"
    # ceo_pattern = rf"(?m){lname}(?-s:.*?)(?:CEO|Chief Executive Officer)\b(?:(?!\n-+$).)*?\[\d+\]\s+^-+\s+((?s:.*?))(?=\s+^-+|\Z)"
    ceo_textparts_pres = re.findall(
        ceo_pattern, presentation_part, re.DOTALL | re.IGNORECASE
    )
    ceo_speech_presentation = " ".join(ceo_textparts_pres)
    ceos_speaches_pres.append(ceo_speech_presentation)

print(ceos_speaches_pres)
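The same idea can be tried against a header in the format from the question. This is only a sketch under stated assumptions: the helper extract_speech and its same_line parameter are hypothetical names, the first transcript line is a placeholder standing in for any earlier mention of Davis, and re.escape is an extra precaution that the original pattern does not use.

```python
import re

# Header and speech line follow the question's format; the first line is a
# placeholder for any earlier mention of Davis in the transcript.
transcript = "\n".join([
    "Placeholder line that mentions Davis",
    "-" * 80,
    "Inge G. Thulin,  3M Company - Chairman, CEO & President    [3]",
    "-" * 80,
    "",
    "          Thank you, Bruce, and good morning, everyone.",
])


def extract_speech(lname, text, same_line=True):
    """Hypothetical helper: join all speech parts found for one last name."""
    # same_line=True keeps the name and the title on one line via (?-s:...).
    gap = r"(?-s:.*?)" if same_line else r".*?"
    pattern = (
        rf"(?m){re.escape(lname)}{gap}(?:CEO|Chief Executive Officer)\b"
        r"(?:(?!\n-+$).)*?\[\d+\]\s+^-+\s+((?s:.*?))(?=\s+^-+|\Z)"
    )
    return " ".join(re.findall(pattern, text, re.DOTALL | re.IGNORECASE))


print(repr(extract_speech("Thulin", transcript)))
print(repr(extract_speech("Davis", transcript)))
print(repr(extract_speech("Davis", transcript, same_line=False)))
```

With same_line=True only Thulin yields text and Davis yields an empty string. With same_line=False the Davis pattern runs across the line break into Thulin's header and returns Thulin's speech, which is the behaviour described in the question.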

Comments

0 upvotes  xxgaryxx  11/14/2023
Thank you so much! I never would have thought of that.