Problem with text find and replacement in Python

Asked by: fararmaoholcezoltar Asked: 9/29/2023 Last edited by: rioV8, fararmaoholcezoltar Updated: 9/29/2023 Views: 60

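Q:

I have a very specific function. I have 2 strings: one is a backup of the input to the code, and the second has been modified by steps such as replacing spaces and extracting information (the details are not important in this case).

I need to find matches between these strings even though the first string has been modified. After finding a match, I need to store the match from the original string (unmodified) and remove it from "sub_str"/"modified_sub_str".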

import re

def find_and_save(sub_str, main_str):
    # Convert both strings to lowercase and remove spaces, commas, and hyphens for case-insensitive matching
    sub_str_mod = re.escape(sub_str.lower().replace(" ", "").replace(",", "").replace("-", ""))
    main_str_mod = main_str.lower().replace(" ", "").replace(",", "").replace("-", "")

    # Use re.search() to find the substring in the modified main string
    match = re.search(sub_str_mod, main_str_mod)

    if match:
        start = match.start()
        end = match.end()

        count = 0
        original_start = 0
        original_end = 0

        for i, c in enumerate(main_str):
            if c not in [' ', ',', '-']:
                count += 1
            if count == start + 1:
                original_start = i
            if count == end:
                original_end = i + 1
                break

        original_sub_str = main_str[original_start:original_end]

        # If the whole sub_str is matching with some part of main_str, return an empty string as modified_sub_str
        if original_sub_str.lower().replace(" ", "").replace(",", "").replace("-", "") == sub_str_mod:
            modified_sub_str = ""
        else:
            # Remove the matching part from sub_str in a case-insensitive manner
            modified_sub_str = re.sub(re.escape(original_sub_str), '', sub_str, flags=re.IGNORECASE)

        return modified_sub_str, original_sub_str  # Returns the modified sub_str and the matched string in its original form
    else:
        return sub_str, None  # Returns sub_str as it was and None if no match is found

But I have a specific problem with this code. For example, if I have inputs like these

sub_str = "internationalworkshopongraphene/ceramiccomposites2016,wgcc2016"

main_str = "Roč. 37, č. 12, International Workshop on Graphene/Ceramic Composites 2016, WGCC 2016 (2017), s. 3773-3780 [print, online]" 

the code can find the match and return "original_sub_str", but it cannot remove the match from "modified_sub_str".

The same problem occurs with these inputs (each pair is "sub_str", then "main_str"):

"isnnm-2016,internationalsymposiumon"
"Roč. 2017, č. 65, ISNNM-2016, International Symposium on Novel and Nano Materials (2017), s. 76-82 [print, online]"

"fractographyofadvancedceramics5“fractographyfrommacro-tonano-scale”" 
"Roč. 37, č. 14, Fractography of Advanced Ceramics 5 “Fractography from MACRO- to NANO-scale” (2017), s. 4315-4322 [print, online]"

"73.zjazdchemikov,zborníkabstraktov"
"Roč. 17, č. 1, 73. Zjazd chemikov, zborník abstraktov (2021), s. 246-246 [print, online]" 

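I could not find a solution even with AI, but I know the problem is related to the replace function, unique symbols, and case sensitivity.

python replace extract text-mining

Comments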

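0 upvotes rioV8 9/29/2023
original_sub_str.lower().replace(...) is always == sub_str_mod
0 upvotes rioV8 9/29/2023
Also mention what function result you expect for the 4 cases
0 upvotes rioV8 9/29/2023
If there are no spaces in sub_str, how can you find original_sub_str (which has spaces) in sub_str? And why use re.search() if you want a literal search?
0 upvotes rioV8 9/29/2023
Why not use a debugger, step through the code, check the result of each step, and determine where and why it goes wrong?
0 upvotes fararmaoholcezoltar 9/29/2023
As I wrote, the function can find the match but cannot remove the match from "modified_sub_str", and for these inputs it returns modified_sub_str == sub_str. I am open to replacing re.search() if you can find a more effective way to find the match and then remove the match from the input "sub_str".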
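
The point raised in the comments about spaces can be checked directly. The following is a minimal sketch, using one of the input pairs from the question, of why the `re.sub` call in the `else` branch leaves `sub_str` untouched: the pattern is built from the original text, which still contains spaces, while `sub_str` contains none.

```python
import re

sub_str = "73.zjazdchemikov,zborníkabstraktov"
original_sub_str = "73. Zjazd chemikov, zborník abstraktov"

# The escaped pattern still requires literal spaces, but sub_str has none,
# so the pattern never matches and nothing is removed.
result = re.sub(re.escape(original_sub_str), "", sub_str, flags=re.IGNORECASE)

print(result == sub_str)  # True
```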

A:

0 upvotes Mahboob Nur 9/29/2023 #1

I have modified your find_and_save function to improve matching accuracy.

import re

def find_and_save(sub_str, main_str):
   
    sub_str_mod = re.escape(sub_str.lower().replace(" ", "").replace(",", "").replace("-", ""))
    main_str_mod = main_str.lower().replace(" ", "").replace(",", "").replace("-", "")

 
    match = re.search(sub_str_mod, main_str_mod)

    if match:
        start = match.start()
        end = match.end()

        count = 0
        original_start = 0
        original_end = 0

        for i, c in enumerate(main_str):
            if c not in [' ', ',', '-']:
                count += 1
            if count == start + 1:
                original_start = i
            if count == end:
                original_end = i + 1
                break

        original_sub_str = main_str[original_start:original_end]

       
        if original_sub_str.lower().replace(" ", "").replace(",", "").replace("-", "") == sub_str_mod:
            modified_sub_str = ""
        else:
            
            modified_sub_str = re.sub(re.escape(original_sub_str), '', sub_str, flags=re.IGNORECASE)

        return modified_sub_str, original_sub_str  
    else:
        return sub_str, None  # Returns sub_str as it was and None if no match is found

sub_str = "internationalworkshopongraphene/ceramiccomposites2016,wgcc2016"
main_str = "Roč. 37, č. 12, International Workshop on Graphene/Ceramic Composites 2016, WGCC 2016 (2017), s. 3773-3780 [print, online]"

modified_sub_str, original_sub_str = find_and_save(sub_str, main_str)
print("Modified Substring:", modified_sub_str)
print("Original Substring:", original_sub_str)

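Comments

0 upvotes rioV8 9/29/2023
What did you modify? It looks like you only removed the comments.
0 upvotes Mahboob Nur 9/29/2023
There are also some string formatting and indentation changes. Please look carefully and try running it. If it doesn't work, I will try again.
0 upvotes fararmaoholcezoltar 9/29/2023
Yes, it helped, but it still does not solve this example: "73.zjazdchemikov,zborníkabstraktov" "Roč. 17, č. 1, 73. Zjazd chemikov, zborník abstraktov (2021), s. 246-246 [print, online]"

1 upvote rioV8 9/29/2023 #2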

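sub_str_mod is a regex-escaped string. The "." is converted to "\.", so the comparison with the cleaned original_sub_str now fails, because original_sub_str contains no backslash. (Use a debugger next time.)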
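
A minimal sketch of the escaping effect described above, assuming Python 3.7 or later (older versions of `re.escape` escape more characters). The helper `clean_str` is redefined here only to keep the snippet self-contained.

```python
import re

def clean_str(s):
    # Lowercase and drop spaces, commas, and hyphens, as in the question.
    return s.lower().replace(" ", "").replace(",", "").replace("-", "")

sub_str = "73.zjazdchemikov,zborníkabstraktov"
cleaned = clean_str(sub_str)
escaped = re.escape(cleaned)

print(cleaned)             # contains a plain "."
print(escaped)             # the "." is now preceded by a backslash
print(cleaned == escaped)  # False, so the equality test in the question fails
```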

Remove re and do everything with literal string find.

Removed the else because the if test is always True.

def clean_str(s) -> str:
    return s.lower().replace(" ", "").replace(",", "").replace("-", "")

def find_and_save(sub_str, main_str):
    # Convert both strings to lowercase and remove spaces, commas, and hyphens for case-insensitive matching
    sub_str_mod = clean_str(sub_str)
    main_str_mod = clean_str(main_str)

    # find the substring in the modified main string
    start = main_str_mod.find(sub_str_mod)
    if start == -1:
        return sub_str, None  # Returns sub_str as it was and None if no match is found

    end = start + len(sub_str_mod)

    count = 0
    original_start = 0
    original_end = 0

    for i, c in enumerate(main_str):
        if c in (' ', ',', '-'):
            continue  # skipped characters must not move original_start
        count += 1
        if count == start + 1:
            original_start = i
        if count == end:
            original_end = i + 1
            break

    original_sub_str = main_str[original_start:original_end]

    # If the whole sub_str is matching with some part of main_str, return an empty string as modified_sub_str
    modified_sub_str = ""
    if clean_str(original_sub_str) == sub_str_mod:  # always True
        modified_sub_str = ""
    return modified_sub_str, original_sub_str  # Returns the modified sub_str and the matched string in its original form

Output for the 4 cases:

('', 'International Workshop on Graphene/Ceramic Composites 2016, WGCC 2016')
('', 'ISNNM-2016, International Symposium on')
('', 'Fractography of Advanced Ceramics 5 “Fractography from MACRO- to NANO-scale”')
('', '73. Zjazd chemikov, zborník abstraktov')
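
As an alternative to the counting loop, the mapping from cleaned positions back to original positions can be recorded while cleaning. The sketch below is not the answerer's code; the names `SKIP`, `clean_with_map`, and `find_original` are hypothetical, and it assumes the same three skipped characters (space, comma, hyphen) as the question.

```python
SKIP = {" ", ",", "-"}  # characters ignored during matching (assumed from the question)

def clean_with_map(s):
    """Return the cleaned string and, for each cleaned character, its index in s."""
    chars, positions = [], []
    for i, c in enumerate(s):
        if c in SKIP:
            continue
        # Lowercasing one character can yield several; map each back to i.
        for lc in c.lower():
            chars.append(lc)
            positions.append(i)
    return "".join(chars), positions

def find_original(sub_str, main_str):
    """Return the slice of main_str that matches sub_str after cleaning, or None."""
    sub_clean, _ = clean_with_map(sub_str)
    main_clean, positions = clean_with_map(main_str)
    if not sub_clean:
        return None
    start = main_clean.find(sub_clean)  # literal search, no regex involved
    if start == -1:
        return None
    end = start + len(sub_clean)
    return main_str[positions[start]:positions[end - 1] + 1]

main_str = "Roč. 17, č. 1, 73. Zjazd chemikov, zborník abstraktov (2021), s. 246-246 [print, online]"
print(find_original("73.zjazdchemikov,zborníkabstraktov", main_str))
# 73. Zjazd chemikov, zborník abstraktov
```

Because the start and end indices are looked up rather than counted, a skipped character directly after the first matched character cannot shift the start of the slice.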