Asked by: alvas Asked: 11/11/2023 Last edited by: Wiktor Stribiżew, alvas Updated: 11/11/2023 Views: 47
Chomping left, right string by whitespace to iterate regex matches
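Q:
The goal is to extract the matching "words" (bounded by \b|$|\s), given the output of difflib's SequenceMatcher.get_matching_blocks(), for example:
s1 = "HYC00 Schulrucksack Damen, Causal Travel Schultaschen 14 Zoll Laptop Rucksack für Mädchen im Teenageralter Leichter Rucksack Wasserabweisend Bookbag College Boys Men Work Daypack"
s2 = "HYC00 School Backpack Women, Causal Travel School Bags 14 Inch Laptop Backpack for Teenage Girls Lightweight Backpack Water-Repellent Bookbag College Boys Men Work Daypack"
The expected matching blocks to extract are: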
['HYC00', 'Causal Travel', '14', 'Laptop', 'Bookbag College Boys Men Work Daypack']
The simple case is when a matching block from difflib is already bounded by \b|$|\s, e.g.
import re
from difflib import SequenceMatcher

s1 = "HYC00 Schulrucksack Damen, Causal Travel Schultaschen 14 Zoll Laptop Rucksack für Mädchen im Teenageralter Leichter Rucksack Wasserabweisend Bookbag College Boys Men Work Daypack"
s2 = "HYC00 School Backpack Women, Causal Travel School Bags 14 Inch Laptop Backpack for Teenage Girls Lightweight Backpack Water-Repellent Bookbag College Boys Men Work Daypack"

def is_substring_a_phrase(substring, s1):
    if substring:
        # Check if matching substring is bounded by word boundary.
        # re.escape keeps punctuation in the block from being read as regex syntax.
        match = re.findall(rf"\b{re.escape(substring)}(?=\s|$)", s1)
        if match:
            return match[0]

def matcher(s1, s2):
    x = SequenceMatcher(None, s1, s2)
    for m in x.get_matching_blocks():
        # Extract the substring.
        full_substring = s1[m.a:m.a+m.size].strip()
        match = is_substring_a_phrase(full_substring, s1)
        if match:
            yield match
            continue

# matcher is a generator, so materialize it to see the results.
list(matcher(s1, s2))
[Output]:
['14', 'Laptop', 'Bookbag College Boys Men Work Daypack']
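To see why only some phrases survive the boundary check, it can help to print the raw character-level blocks before any filtering. The following is a minimal sketch; the helper name raw_blocks is hypothetical and not part of difflib. Blocks that start or end in the middle of a word are the ones the regex rejects.

```python
from difflib import SequenceMatcher

s1 = "HYC00 Schulrucksack Damen, Causal Travel Schultaschen 14 Zoll Laptop Rucksack für Mädchen im Teenageralter Leichter Rucksack Wasserabweisend Bookbag College Boys Men Work Daypack"
s2 = "HYC00 School Backpack Women, Causal Travel School Bags 14 Inch Laptop Backpack for Teenage Girls Lightweight Backpack Water-Repellent Bookbag College Boys Men Work Daypack"

def raw_blocks(a, b):
    """Return the text of every non-empty character-level matching block of a."""
    sm = SequenceMatcher(None, a, b)
    # get_matching_blocks() ends with a zero-size sentinel, filtered out here.
    return [a[m.a:m.a + m.size] for m in sm.get_matching_blocks() if m.size]

# repr() makes leading and trailing spaces inside each block visible.
for block in raw_blocks(s1, s2):
    print(repr(block))
```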
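Then, to capture HYC00 and Causal Travel, whose matching blocks are "HYC00 Sch" and "men, Causal Travel" respectively, we have to do some "chomping" and drop the leftmost word, the rightmost word, or both, i.e.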
def matcher(s1, s2):
    x = SequenceMatcher(None, s1, s2)
    for m in x.get_matching_blocks():
        # Extract the substring.
        full_substring = s1[m.a:m.a+m.size].strip()
        match = is_substring_a_phrase(full_substring, s1)
        if match:
            yield match
            continue

        # Extract the left chomp substring (drop the first word).
        left = " ".join(s1[m.a:m.a+m.size].strip().split()[1:])
        match = is_substring_a_phrase(left, s1)
        if match:
            yield match
            continue

        # Extract the right chomp substring (drop the last word).
        right = " ".join(s1[m.a:m.a+m.size].strip().split()[:-1])
        match = is_substring_a_phrase(right, s1)
        if match:
            yield match
            continue

        # Extract the left and right chomp substring (drop the first and last words).
        leftright = " ".join(s1[m.a:m.a+m.size].strip().split()[1:-1])
        match = is_substring_a_phrase(leftright, s1)
        if match:
            yield match
            continue

# matcher is a generator, so materialize it to see the results.
list(matcher(s1, s2))
[Output]:
['HYC00',
'Causal Travel',
'14',
'Laptop',
'Bookbag College Boys Men Work Daypack']
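While the snippet above works as expected, my questions, in parts:
- Is there a way to avoid the repeated chomping code and the multiple if-else branches when extracting matching blocks bounded by \b|$|\s?
- Is there a direct way to tell .get_matching_blocks() to return only the parts bounded by \b|$|\s?
- Is there another way to achieve the same goal without using get_matching_blocks in this messy way?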
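On the first point, one possible way to fold the four near-identical branches into a single loop is sketched below. It is only a sketch: it assumes s1 separates words with single spaces, and the tuple of slice bounds and the use of re.search are choices made here for illustration.

```python
import re
from difflib import SequenceMatcher

s1 = "HYC00 Schulrucksack Damen, Causal Travel Schultaschen 14 Zoll Laptop Rucksack für Mädchen im Teenageralter Leichter Rucksack Wasserabweisend Bookbag College Boys Men Work Daypack"
s2 = "HYC00 School Backpack Women, Causal Travel School Bags 14 Inch Laptop Backpack for Teenage Girls Lightweight Backpack Water-Repellent Bookbag College Boys Men Work Daypack"

def is_substring_a_phrase(substring, text):
    """Return substring if it occurs in text on word boundaries, else None."""
    if not substring:
        return None
    # Escape the block so punctuation in it is matched literally.
    found = re.search(rf"\b{re.escape(substring)}(?=\s|$)", text)
    return found.group(0) if found else None

def matcher(s1, s2):
    """Yield the first word-bounded candidate of each character-level block."""
    for m in SequenceMatcher(None, s1, s2).get_matching_blocks():
        words = s1[m.a:m.a + m.size].split()
        # Candidate order: full block, left chomp, right chomp, both chomps.
        for start, stop in ((0, None), (1, None), (0, -1), (1, -1)):
            phrase = is_substring_a_phrase(" ".join(words[start:stop]), s1)
            if phrase:
                yield phrase
                break

print(list(matcher(s1, s2)))
```

The inner loop stops at the first candidate that passes the boundary check, which mirrors the yield-then-continue pattern of the longer version.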
A:
1 upvote
alvas
11/11/2023
#1
From a comment by @megaing:
from difflib import SequenceMatcher

s1 = "HYC00 Schulrucksack Damen, Causal Travel Schultaschen 14 Zoll Laptop Rucksack für Mädchen im Teenageralter Leichter Rucksack Wasserabweisend Bookbag College Boys Men Work Daypack"
s2 = "HYC00 School Backpack Women, Causal Travel School Bags 14 Inch Laptop Backpack for Teenage Girls Lightweight Backpack Water-Repellent Bookbag College Boys Men Work Daypack"

x = SequenceMatcher(None, s1.split(), s2.split())
for m in x.get_matching_blocks():
    # Extract the substring.
    full_substring = " ".join(s1.split()[m.a:m.a+m.size])
    print(full_substring)
[Output]:
HYC00
Causal Travel
14
Laptop
Bookbag College Boys Men Work Daypack