Chomping left, right string by whitespace to iterate regex matches

Asked by: alvas  Asked: 11/11/2023  Last edited by: Wiktor Stribiżew, alvas  Updated: 11/11/2023  Views: 47

Q:

The goal is to extract the matching "words" (bounded by \b|$|\s) given the output of difflib's SequenceMatcher.get_matching_blocks(), e.g.:

s1 = "HYC00 Schulrucksack Damen, Causal Travel Schultaschen 14 Zoll Laptop Rucksack für Mädchen im Teenageralter Leichter Rucksack Wasserabweisend Bookbag College Boys Men Work Daypack"

s2 = "HYC00 School Backpack Women, Causal Travel School Bags 14 Inch Laptop Backpack for Teenage Girls Lightweight Backpack Water-Repellent Bookbag College Boys Men Work Daypack"

The expected matching blocks to extract are:

['HYC00', 'Causal Travel', '14', 'Laptop', 'Bookbag College Boys Men Work Daypack']

The simple case is when the matching block from difflib is immediately bounded by \b|$|\s, e.g.

import re
from difflib import SequenceMatcher

s1 = "HYC00 Schulrucksack Damen, Causal Travel Schultaschen 14 Zoll Laptop Rucksack für Mädchen im Teenageralter Leichter Rucksack Wasserabweisend Bookbag College Boys Men Work Daypack"

s2 = "HYC00 School Backpack Women, Causal Travel School Bags 14 Inch Laptop Backpack for Teenage Girls Lightweight Backpack Water-Repellent Bookbag College Boys Men Work Daypack"

def is_substring_a_phrase(substring, s1):
  if substring:
    # Check if the matching substring is bounded by a word boundary.
    # re.escape keeps any regex metacharacters in the substring literal.
    match = re.findall(rf"\b{re.escape(substring)}(?=\s|$)", s1)
    if match: 
      return match[0]

def matcher(s1, s2):
  x = SequenceMatcher(None, s1, s2)
  for m in x.get_matching_blocks():
    # Extract the substring.
    full_substring = s1[m.a:m.a+m.size].strip()
    match = is_substring_a_phrase(full_substring, s1)
    if match:
      yield match
      continue

# matcher() is a generator, so collect its results into a list.
list(matcher(s1, s2))

[out]:

['14', 'Laptop', 'Bookbag College Boys Men Work Daypack']
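
To see why some expected phrases drop out, it can help to look at the raw character-level blocks before any boundary check. The sketch below is not part of the question's code: raw_blocks is a made-up helper name, and the two short strings are hypothetical toy inputs chosen so that the matching block stops in the middle of a word.

```python
from difflib import SequenceMatcher

def raw_blocks(a, b):
    """Return the text of every non-empty character-level matching block."""
    sm = SequenceMatcher(None, a, b)
    # The final block is always a zero-size sentinel, so filter it out.
    return [a[m.a:m.a + m.size] for m in sm.get_matching_blocks() if m.size]

# Hypothetical toy inputs: the shared prefix ends inside "car" / "cat".
print(raw_blocks("blue car", "blue cat"))  # ['blue ca']
```

The block 'blue ca' is not bounded by whitespace on the right, so a check like is_substring_a_phrase rejects it as a whole even though 'blue' is a shared word.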

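Then, to capture HYC00 and Causal Travel, the matching blocks are "HYC00 Sch" and "men, Causal Travel" respectively, so we have to do some "chomping" and drop the leftmost word, the rightmost word, or both, i.e.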

def matcher(s1, s2):
  x = SequenceMatcher(None, s1, s2)
  for m in x.get_matching_blocks():
    # Extract the substring.
    full_substring = s1[m.a:m.a+m.size].strip()
    match = is_substring_a_phrase(full_substring, s1)
    if match:
      yield match
      continue

    # Extract the left chomp substring.
    left = " ".join(s1[m.a:m.a+m.size].strip().split()[1:])
    match = is_substring_a_phrase(left, s1)
    if match:
      yield match
      continue


    # Extract the right chomp substring.
    right = " ".join(s1[m.a:m.a+m.size].strip().split()[:-1])
    match = is_substring_a_phrase(right, s1)
    if match:
      yield match
      continue


    # Extract the substring with both the left and right words chomped.
    leftright = " ".join(s1[m.a:m.a+m.size].strip().split()[1:-1])
    match = is_substring_a_phrase(leftright, s1)
    if match:
      yield match
      continue

# matcher() is a generator, so collect its results into a list.
list(matcher(s1, s2))

[out]:

['HYC00',
 'Causal Travel',
 '14',
 'Laptop',
 'Bookbag College Boys Men Work Daypack']

While the code snippet above works as expected, my question has several parts:

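  • Is there a way to avoid the repeated code for the various chomps and the multiple if-else branches when extracting the matching blocks bounded by \b|$|\s?
  • Is there a direct way to tell .get_matching_blocks() to return only the parts bounded by \b|$|\s?
  • Is there another way to achieve the same goal without using get_matching_blocks in this messy way?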
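
On the first bullet, one possible way to drop the repeated branches is to describe the four chomps as slices and loop over them. This is only a sketch under a few assumptions: the names CHOMPS, bounded_phrase and iter_bounded_phrases are made up for illustration, re.escape is added so that regex metacharacters inside a block are matched literally, and re-joining the split words assumes the text is single-spaced. The toy strings are hypothetical.

```python
import re
from difflib import SequenceMatcher

# Candidate slices in the same order as the original branches:
# full block, left chomp, right chomp, both ends chomped.
CHOMPS = (slice(None), slice(1, None), slice(None, -1), slice(1, -1))

def bounded_phrase(candidate, text):
    """Return candidate if it occurs in text bounded by \\b and whitespace/end."""
    if not candidate:
        return None
    found = re.search(rf"\b{re.escape(candidate)}(?=\s|$)", text)
    return found.group(0) if found else None

def iter_bounded_phrases(s1, s2):
    """Yield, per matching block, the first chomp variant that is word-bounded."""
    for block in SequenceMatcher(None, s1, s2).get_matching_blocks():
        words = s1[block.a:block.a + block.size].split()
        for chomp in CHOMPS:
            phrase = bounded_phrase(" ".join(words[chomp]), s1)
            if phrase:
                yield phrase
                break  # stop at the first variant that passes the check

# Hypothetical toy inputs; the raw block is "blue ca", the right chomp is "blue".
print(list(iter_bounded_phrases("blue car", "blue cat")))  # ['blue']
```

Because CHOMPS keeps the order of the original branches, the first bounded candidate wins, just as with the chained if blocks.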
python substring difflib

Comments

3 upvotes MegaIng 11/11/2023
Instead, you could first split both texts into words based on your delimiters, and then use SequenceMatcher on the two lists of words.

A:

1 upvote alvas 11/11/2023 #1

From @MegaIng's comment:

from difflib import SequenceMatcher

s1 = "HYC00 Schulrucksack Damen, Causal Travel Schultaschen 14 Zoll Laptop Rucksack für Mädchen im Teenageralter Leichter Rucksack Wasserabweisend Bookbag College Boys Men Work Daypack"

s2 = "HYC00 School Backpack Women, Causal Travel School Bags 14 Inch Laptop Backpack for Teenage Girls Lightweight Backpack Water-Repellent Bookbag College Boys Men Work Daypack"


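# Split once and compare the two word lists instead of the raw strings.
words1, words2 = s1.split(), s2.split()
x = SequenceMatcher(None, words1, words2)

for m in x.get_matching_blocks():
    # The last block is always a zero-size sentinel; skip it.
    if not m.size:
        continue
    # Join the matched run of words back into a phrase.
    full_substring = " ".join(words1[m.a:m.a+m.size])
    print(full_substring)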

[out]:

HYC00
Causal Travel
14
Laptop
Bookbag College Boys Men Work Daypack
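
On the second bullet: get_matching_blocks() takes no arguments, so as far as I can tell there is no option to restrict it to word boundaries, and tokenizing first seems to be the practical route. The same idea can be wrapped in a small function that returns a list. This is a sketch, not a definitive implementation: matching_phrases is a made-up name, splitting on whitespace is assumed to be the right delimiter, and autojunk=False is my own addition so that the popularity heuristic difflib applies to sequences of 200 or more items does not discard frequent words in longer texts.

```python
from difflib import SequenceMatcher

def matching_phrases(s1, s2):
    """Return runs of whitespace-delimited words shared by s1 and s2, in s1 order."""
    words1, words2 = s1.split(), s2.split()
    # autojunk=False is an assumption: it disables the "popular element" heuristic.
    sm = SequenceMatcher(None, words1, words2, autojunk=False)
    # Skip the zero-size sentinel block that always ends the list.
    return [" ".join(words1[m.a:m.a + m.size])
            for m in sm.get_matching_blocks() if m.size]

s1 = "HYC00 Schulrucksack Damen, Causal Travel Schultaschen 14 Zoll Laptop Rucksack für Mädchen im Teenageralter Leichter Rucksack Wasserabweisend Bookbag College Boys Men Work Daypack"
s2 = "HYC00 School Backpack Women, Causal Travel School Bags 14 Inch Laptop Backpack for Teenage Girls Lightweight Backpack Water-Repellent Bookbag College Boys Men Work Daypack"

print(matching_phrases(s1, s2))
# ['HYC00', 'Causal Travel', '14', 'Laptop', 'Bookbag College Boys Men Work Daypack']
```

Note that tokens keep their punctuation, so "Damen," and "Women," are compared with the trailing comma attached; stripping punctuation before matching would be a separate design decision.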