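Asked by: fararmaoholcezoltar  Asked: 9/29/2023  Last edited by: rioV8, fararmaoholcezoltar  Updated: 9/29/2023  Views: 60
Problem with text find and replacement in Python
Q:
I have a very specific function. I have 2 strings: one is a backup of the input, and the second has been modified by steps such as replacing spaces and extracting information (the details do not matter here).
I need to find a match between these strings even though the first string has been modified. Once a match is found, I need to store the match as it appears in the original string (unmodified) and remove it from "sub_str"/"modified_sub_str".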
import re

def find_and_save(sub_str, main_str):
    # Convert both strings to lowercase and remove spaces, commas, and hyphens for case-insensitive matching
    sub_str_mod = re.escape(sub_str.lower().replace(" ", "").replace(",", "").replace("-", ""))
    main_str_mod = main_str.lower().replace(" ", "").replace(",", "").replace("-", "")
    # Use re.search() to find the substring in the modified main string
    match = re.search(sub_str_mod, main_str_mod)
    if match:
        start = match.start()
        end = match.end()
        count = 0
        original_start = 0
        original_end = 0
        for i, c in enumerate(main_str):
            if c not in [' ', ',', '-']:
                count += 1
                if count == start + 1:
                    original_start = i
                if count == end:
                    original_end = i + 1
                    break
        original_sub_str = main_str[original_start:original_end]
        # If the whole sub_str is matching with some part of main_str, return an empty string as modified_sub_str
        if original_sub_str.lower().replace(" ", "").replace(",", "").replace("-", "") == sub_str_mod:
            modified_sub_str = ""
        else:
            # Remove the matching part from sub_str in a case-insensitive manner
            modified_sub_str = re.sub(re.escape(original_sub_str), '', sub_str, flags=re.IGNORECASE)
        return modified_sub_str, original_sub_str  # Returns the modified sub_str and the matched string in its original form
    else:
        return sub_str, None  # Returns sub_str as it was and None if no match is found
However, I have a specific problem with this code. For example, given inputs like
sub_str = "internationalworkshopongraphene/ceramiccomposites2016,wgcc2016"
and
main_str = "Roč. 37, č. 12, International Workshop on Graphene/Ceramic Composites 2016, WGCC 2016 (2017), s. 3773-3780 [print, online]"
the code finds the match and returns "original_sub_str", but it does not remove the match from "modified_sub_str".
The same problem occurs with these inputs, each pair listed as "sub_str" followed by "main_str":
"isnnm-2016,internationalsymposiumon"
"Roč. 2017, č. 65, ISNNM-2016, International Symposium on Novel and Nano Materials (2017), s. 76-82 [print, online]"
"fractographyofadvancedceramics5“fractographyfrommacro-tonano-scale”"
"Roč. 37, č. 14, Fractography of Advanced Ceramics 5 “Fractography from MACRO- to NANO-scale” (2017), s. 4315-4322 [print, online]"
"73.zjazdchemikov,zborníkabstraktov"
"Roč. 17, č. 1, 73. Zjazd chemikov, zborník abstraktov (2021), s. 246-246 [print, online]"
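I could not find a solution even with the help of AI, but I know the problem involves the replace function, special symbols, and case sensitivity.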
Answers:
0 votes
Mahboob Nur
9/29/2023
#1
I have modified your find_and_save function to improve matching accuracy.
import re

def find_and_save(sub_str, main_str):
    sub_str_mod = re.escape(sub_str.lower().replace(" ", "").replace(",", "").replace("-", ""))
    main_str_mod = main_str.lower().replace(" ", "").replace(",", "").replace("-", "")
    match = re.search(sub_str_mod, main_str_mod)
    if match:
        start = match.start()
        end = match.end()
        count = 0
        original_start = 0
        original_end = 0
        for i, c in enumerate(main_str):
            if c not in [' ', ',', '-']:
                count += 1
                if count == start + 1:
                    original_start = i
                if count == end:
                    original_end = i + 1
                    break
        original_sub_str = main_str[original_start:original_end]
        if original_sub_str.lower().replace(" ", "").replace(",", "").replace("-", "") == sub_str_mod:
            modified_sub_str = ""
        else:
            modified_sub_str = re.sub(re.escape(original_sub_str), '', sub_str, flags=re.IGNORECASE)
        return modified_sub_str, original_sub_str
    else:
        return sub_str, None  # Returns sub_str as it was and None if no match is found

sub_str = "internationalworkshopongraphene/ceramiccomposites2016,wgcc2016"
main_str = "Roč. 37, č. 12, International Workshop on Graphene/Ceramic Composites 2016, WGCC 2016 (2017), s. 3773-3780 [print, online]"
modified_sub_str, original_sub_str = find_and_save(sub_str, main_str)
print("Modified Substring:", modified_sub_str)
print("Original Substring:", original_sub_str)
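Comments
0 votes
rioV8
9/29/2023
What did you modify? It looks like you only removed the comments.
0 votes
Mahboob Nur
9/29/2023
There are also some string formatting and indentation changes. Please look at it closely and try running it. If it doesn't work, I'll try again.
0 votes
fararmaoholcezoltar
9/29/2023
Yes, it helped, but it still does not solve this example: "73.zjazdchemikov,zborníkabstraktov" "Roč. 17, č. 1, 73. Zjazd chemikov, zborník abstraktov (2021), s. 246-246 [print, online]"
1 vote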
rioV8
9/29/2023
#2
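You are regex-escaping the string sub_str_mod. "." is converted to "\.", and now it can no longer be found in original_sub_str, because original_sub_str contains no backslash. (Next time, use a debugger.)
Remove re and do everything with plain string find.
I removed the else because the if test is always True.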
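To make the escaping problem concrete, here is a minimal sketch that reproduces it with the "73. Zjazd chemikov" example from the question. The variable names `matched` and `escaped` are only illustrative, and `clean_str` mirrors the helper used in this answer.

```python
import re

def clean_str(s: str) -> str:
    # Same normalisation as in the question: lowercase, drop spaces, commas, hyphens.
    return s.lower().replace(" ", "").replace(",", "").replace("-", "")

sub_str = "73.zjazdchemikov,zborníkabstraktov"
matched = "73. Zjazd chemikov, zborník abstraktov"  # the slice taken from main_str

# re.escape() puts a backslash in front of the dot, so the escaped
# pattern is no longer equal to the cleaned text it matched.
escaped = re.escape(clean_str(sub_str))
print(escaped)
print(clean_str(matched) == escaped)             # False: "\." vs "."
print(clean_str(matched) == clean_str(sub_str))  # True once nothing is escaped
```

The escaped string is only meaningful as a regex pattern; comparing it with `==` against ordinary text fails as soon as the input contains a character such as ".". Using plain `str.find`, as in the function that follows, avoids escaping altogether.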
def clean_str(s) -> str:
    return s.lower().replace(" ", "").replace(",", "").replace("-", "")

def find_and_save(sub_str, main_str):
    # Convert both strings to lowercase and remove spaces, commas, and hyphens for case-insensitive matching
    sub_str_mod = clean_str(sub_str)
    main_str_mod = clean_str(main_str)
    # find the substring in the modified main string
    start = main_str_mod.find(sub_str_mod)
    if start == -1:
        return sub_str, None  # Returns sub_str as it was and None if no match is found
    end = start + len(sub_str_mod)
    count = 0
    original_start = 0
    original_end = 0
    for i, c in enumerate(main_str):
        if c not in [' ', ',', '-']:
            count += 1
            if count == start + 1:
                original_start = i
            if count == end:
                original_end = i + 1
                break
    original_sub_str = main_str[original_start:original_end]
    # If the whole sub_str is matching with some part of main_str, return an empty string as modified_sub_str
    modified_sub_str = ""
    if clean_str(original_sub_str) == sub_str_mod:  # always True
        modified_sub_str = ""
    return modified_sub_str, original_sub_str  # Returns the modified sub_str and the matched string in its original form
Output for the 4 cases:
('', 'International Workshop on Graphene/Ceramic Composites 2016, WGCC 2016')
('', 'ISNNM-2016, International Symposium on')
('', 'Fractography of Advanced Ceramics 5 “Fractography from MACRO- to NANO-scale”')
('', '73. Zjazd chemikov, zborník abstraktov')
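As a possible variation on the counting loop, the position in the original string can be recovered from an index map built while cleaning. This is only a sketch, not part of the answer: the names `SKIP` and `find_original` are hypothetical, and it assumes the same three ignored characters (space, comma, hyphen) as the question.

```python
SKIP = {" ", ",", "-"}  # characters ignored when matching (assumed, as in the question)

def find_original(sub_str, main_str):
    """Return the slice of main_str that matches sub_str when case and
    SKIP characters are ignored, or None if there is no match."""
    kept_idx = []    # kept_idx[k] = index in main_str of the k-th cleaned character
    kept_chars = []  # the cleaned, lowercased characters themselves
    for i, c in enumerate(main_str):
        if c in SKIP:
            continue
        for ch in c.lower():  # lower() may yield more than one character
            kept_idx.append(i)
            kept_chars.append(ch)
    cleaned_main = "".join(kept_chars)
    cleaned_sub = "".join(c.lower() for c in sub_str if c not in SKIP)
    if not cleaned_sub:
        return None
    start = cleaned_main.find(cleaned_sub)
    if start == -1:
        return None
    first = kept_idx[start]
    last = kept_idx[start + len(cleaned_sub) - 1]
    return main_str[first:last + 1]

main_str = "Roč. 17, č. 1, 73. Zjazd chemikov, zborník abstraktov (2021), s. 246-246 [print, online]"
print(find_original("73.zjazdchemikov,zborníkabstraktov", main_str))
```

Because the map stores one original index per cleaned character, the start and end of the match are looked up directly instead of being counted, and no regular expression or escaping is involved.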
Comments
original_sub_str.lower().replace(...) is always == sub_str_mod
sub_str
original_sub_str
sub_str
re.search()