Problem with text find and replacement in Python

Question:

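I have a very specific task. I have two strings: one is a backup of the input as it arrived in the code, and the second is a version that has been modified by steps such as removing spaces and extracting information (the details are not important here).

I need to find a match between these strings even though the first one has been modified. Once a match is found, I need to store the matched text from the original string (unmodified) and remove it from "sub_str"/"modified_sub_str".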

import re

def find_and_save(sub_str, main_str):
    # Convert both strings to lowercase and remove spaces, commas, and hyphens for case-insensitive matching
    sub_str_mod = re.escape(sub_str.lower().replace(" ", "").replace(",", "").replace("-", ""))
    main_str_mod = main_str.lower().replace(" ", "").replace(",", "").replace("-", "")

    # Use re.search() to find the substring in the modified main string
    match = re.search(sub_str_mod, main_str_mod)

    if match:
        start = match.start()
        end = match.end()

        count = 0
        original_start = 0
        original_end = 0

        for i, c in enumerate(main_str):
            if c not in [' ', ',', '-']:
                count += 1
            if count == start + 1:
                original_start = i
            if count == end:
                original_end = i + 1
                break

        original_sub_str = main_str[original_start:original_end]

        # If the whole sub_str is matching with some part of main_str, return an empty string as modified_sub_str
        if original_sub_str.lower().replace(" ", "").replace(",", "").replace("-", "") == sub_str_mod:
            modified_sub_str = ""
        else:
            # Remove the matching part from sub_str in a case-insensitive manner
            modified_sub_str = re.sub(re.escape(original_sub_str), '', sub_str, flags=re.IGNORECASE)

        return modified_sub_str, original_sub_str  # Returns the modified sub_str and the matched string in its original form
    else:
        return sub_str, None  # Returns sub_str as it was and None if no match is found

However, I have a specific problem with this code. For example, given an input like

sub_str = "internationalworkshopongraphene/ceramiccomposites2016,wgcc2016"

and

main_str = "Roč. 37, č. 12, International Workshop on Graphene/Ceramic Composites 2016, WGCC 2016 (2017), s. 3773-3780 [print, online]"

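the code finds the match and returns "original_sub_str", but it does not remove the match from "modified_sub_str".

The same problem occurs with these inputs, each listed as "sub_str" followed by "main_str":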

"isnnm-2016,internationalsymposiumon"
"Roč. 2017, č. 65, ISNNM-2016, International Symposium on Novel and Nano Materials (2017), s. 76-82 [print, online]"

"fractographyofadvancedceramics5“fractographyfrommacro-tonano-scale”" 
"Roč. 37, č. 14, Fractography of Advanced Ceramics 5 “Fractography from MACRO- to NANO-scale” (2017), s. 4315-4322 [print, online]"

"73.zjazdchemikov,zborníkabstraktov"
"Roč. 17, č. 1, 73. Zjazd chemikov, zborník abstraktov (2021), s. 246-246 [print, online]"

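Even with the help of AI I could not find a solution, but I know the problem involves the replace function, special characters, and case sensitivity.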

Tags: python, replace, extract, text-mining

Answer:


sub_str = "internationalworkshopongraphene/ceramiccomposites2016,wgcc2016"
main_str = "Roč. 37, č. 12, International Workshop on Graphene/Ceramic Composites 2016, WGCC 2016 (2017), s. 3773-3780 [print, online]"

modified_sub_str, original_sub_str = find_and_save(sub_str, main_str)
print("Modified Substring:", modified_sub_str)
print("Original Substring:", original_sub_str)
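
One likely cause can be sketched in isolation, assuming the match itself is located correctly. In the question's code, sub_str_mod has been passed through re.escape, so it contains a backslash wherever the input has a regex metacharacter such as '.'. The later equality check against the plain cleaned text therefore fails, and the code falls through to re.sub, which looks for the spaced original text inside the unspaced sub_str and finds nothing. The clean_str helper below only mirrors the chain of replace calls from the question, and the variable names are illustrative.

```python
import re

def clean_str(s):
    # Same normalization as in the question: lowercase, drop spaces, commas, hyphens
    return s.lower().replace(" ", "").replace(",", "").replace("-", "")

sub_str = "73.zjazdchemikov,zborníkabstraktov"
original_sub_str = "73. Zjazd chemikov, zborník abstraktov"

# What the question compares against: the cleaned sub_str after re.escape
escaped = re.escape(clean_str(sub_str))
cleaned_match = clean_str(original_sub_str)

# The escaped pattern has a backslash before '.', so plain equality fails
equal_check = (cleaned_match == escaped)

# Fallback branch: the spaced original text does not occur in the unspaced sub_str,
# so re.sub leaves sub_str unchanged
fallback = re.sub(re.escape(original_sub_str), "", sub_str, flags=re.IGNORECASE)

# Comparing against the unescaped cleaned string succeeds
fixed_check = (cleaned_match == clean_str(sub_str))

print(equal_check, fallback == sub_str, fixed_check)
```

Which characters re.escape touches depends on the Python version, so the other inputs may or may not take the same path. Comparing against the unescaped cleaned string avoids the issue either way.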

def clean_str(s) -> str:
    # Lowercase and drop spaces, commas, and hyphens so formatting differences are ignored
    return s.lower().replace(" ", "").replace(",", "").replace("-", "")

def find_and_save(sub_str, main_str):
    # Normalize both strings the same way; without re.escape they stay directly comparable
    sub_str_mod = clean_str(sub_str)
    main_str_mod = clean_str(main_str)

    # Find the substring in the normalized main string (an empty pattern counts as no match)
    start = main_str_mod.find(sub_str_mod) if sub_str_mod else -1
    if start == -1:
        return sub_str, None  # Returns sub_str as it was and None if no match is found

    end = start + len(sub_str_mod)

    # Indices in main_str of the characters that survive clean_str
    # (assumes lower() keeps the character count unchanged, which holds for these inputs)
    kept = [i for i, c in enumerate(main_str) if c not in (' ', ',', '-')]
    original_sub_str = main_str[kept[start]:kept[end - 1] + 1]

    # A match always covers the whole of sub_str, so nothing of it is left over
    return "", original_sub_str  # Returns the modified sub_str and the matched string in its original form

Output for the four cases:

('', 'International Workshop on Graphene/Ceramic Composites 2016, WGCC 2016')
('', 'ISNNM-2016, International Symposium on')
('', 'Fractography of Advanced Ceramics 5 “Fractography from MACRO- to NANO-scale”')
('', '73. Zjazd chemikov, zborník abstraktov')
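
An alternative that avoids mapping indices back by hand, offered as a sketch and not as the approach taken above: build a regular expression from the significant characters of sub_str that allows any run of spaces, commas, or hyphens between them, then search the untouched main_str directly. The function name find_original_span and the SEPARATORS constant are made up for this example.

```python
import re

SEPARATORS = " ,-"          # characters ignored when comparing (hypothetical constant)
SEP_CLASS = r"[ ,\-]*"      # regex: any run of those characters, possibly empty

def find_original_span(sub_str, main_str):
    """Return the part of main_str that matches sub_str, ignoring case and
    spaces, commas, and hyphens, or None if there is no match."""
    # Keep only the characters of sub_str that take part in the comparison
    chars = [c for c in sub_str if c not in SEPARATORS]
    if not chars:
        return None
    # Escape each character on its own and allow separators between neighbours
    pattern = SEP_CLASS.join(re.escape(c) for c in chars)
    match = re.search(pattern, main_str, flags=re.IGNORECASE)
    return match.group(0) if match else None

main_str = "Roč. 17, č. 1, 73. Zjazd chemikov, zborník abstraktov (2021), s. 246-246 [print, online]"
print(find_original_span("73.zjazdchemikov,zborníkabstraktov", main_str))
```

Because the pattern runs against the original main_str, the returned text keeps its original case and punctuation, and there is no dependence on lower() preserving string length.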
