Python: Get substring of a string with a closest match to another string

Asked by: cadavre · Asked: 9/21/2023 · Updated: 9/21/2023 · Views: 50

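Q:

A nice algorithmic puzzle for you today. :)

I have two strings: one is a longer sentence, and the other is a shorter string that an LLM found within the longer sentence. Let's look at an example: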

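  • Long sentence: "If you're a coder you should consider buying a MacBook Pro 15inch with an M2 from Apple that will provide you with a plenty of computing power for your AI use-cases."
  • Short sentence: 'Apple MacBook Pro 15" M2'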

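I need to mark the part of the long string that most closely matches the short string. The result would be character position indices, [start:end].

An acceptable result could look like this (any one of the following):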

If you're a coder you should consider buying a MacBook Pro 15inch with an M2 from Apple that will provide you with a plenty of computing power for your AI use-cases.
                                               ^^^^^^^^^^^^^^^^^^ [47:65]
/or/
If you're a coder you should consider buying a MacBook Pro 15inch with an M2 from Apple that will provide you with a plenty of computing power for your AI use-cases.
                                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [47:76]
/or/
If you're a coder you should consider buying a MacBook Pro 15inch with an M2 from Apple that will provide you with a plenty of computing power for your AI use-cases.
                                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [47:87]

I have considered:

  • the membership operator (in),
  • difflib methods,
  • regular expressions,
  • Levenshtein,

but none of them really fits this case.
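
As a small illustration of why the first two options fall short on the example above, here is a sketch using only the standard library. The variable names are chosen for this illustration; the calls are plain difflib.

```python
from difflib import SequenceMatcher

long_string = (
    "If you're a coder you should consider buying a MacBook Pro 15inch "
    "with an M2 from Apple that will provide you with a plenty of "
    "computing power for your AI use-cases."
)
short_string = 'Apple MacBook Pro 15" M2'

# The membership operator only finds verbatim substrings.
print(short_string in long_string)

# difflib can report the longest block shared by both strings, but that
# block covers only part of the product name.
matcher = SequenceMatcher(None, long_string, short_string)
block = matcher.find_longest_match(0, len(long_string), 0, len(short_string))
span = (block.a, block.a + block.size)
print(span, repr(long_string[span[0]:span[1]]))
```

The longest shared block stops at "15" because the short string continues with a quote character where the long one continues with "inch", so some form of fuzzy scoring over candidate spans is still needed.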

The closest thing I can come up with is:

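  1. Get length = len(short_string), counted in words (3 in the example below).
  2. Split long_string by spaces into a set of substrings of length words each.
  3. Calculate the Levenshtein distance between each substring and short_string.
  4. The smallest distance wins.
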
short_string = "four five eight"
long_string = "one two three four five six seven eight nine"

length = 3

substrings = [
  "one two three",
  "two three four",
  "three four five",
  "four five six",
  "five six seven",
  "six seven eight",
  "seven eight nine"
]

for sentence in substrings:
  Levenshtein.distance(sentence, short_string)

winner = "four five six"

Any other ideas or open-source tools you can think of?

python string-comparison levenshtein-distance

A:

0 votes · blhsing · 9/21/2023 · #1

The following approach should serve your purpose well:

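  1. Split the short sentence into words and join them into a regex alternation pattern, separated by |.
  2. Use re.finditer to find all matches of that regex in the long sentence, which yields re.Match objects carrying the start and end index of every occurrence of a word from the short sentence.
  3. Use itertools.combinations to generate all pairs of Match objects. Each pair is used to slice the long sentence from the start index of the first Match object to the end index of the second Match object.
  4. Use the max function to pick the pair of Match objects whose slice of the long sentence is most similar to the short sentence, with the similarity computed by difflib's SequenceMatcher.ratio.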

So with:

import re
from difflib import SequenceMatcher
from itertools import combinations

def closest_substring(long, short):
    a, b = max(
        combinations(re.finditer('|'.join(short.split()), long), 2),
        key=lambda c: SequenceMatcher(None, long[c[0].start():c[1].end()], short).ratio()
    )
    return long[a.start():b.end()]

the following code:

long_string = "If you're a coder you should consider buying a MacBook Pro 15inch with an M2 from Apple that will provide you with a plenty of computing power for your AI use-cases."
short_string = 'Apple MacBook Pro 15" M2'
print(closest_substring(long_string, short_string))

would output:

MacBook Pro 15inch with an M2

Demo: Try it online!
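
Since the question asks for [start:end] indices rather than the substring itself, a variant of the function above could return the span directly. The following is a hedged sketch, not part of the original answer: closest_span is a name made up here, re.escape is added on the assumption that words from the short string may contain regex metacharacters, and combinations_with_replacement is swapped in for combinations so that a single matched word can also form a candidate slice.

```python
import re
from difflib import SequenceMatcher
from itertools import combinations_with_replacement

def closest_span(long_string, short_string):
    """Return (start, end) of the best slice, or None if no word matches."""
    # re.escape keeps words such as "C++" from being read as regex syntax.
    pattern = "|".join(re.escape(word) for word in short_string.split())
    if not pattern:
        return None
    matches = list(re.finditer(pattern, long_string))
    if not matches:
        return None
    # Pairing with replacement also covers a slice made of a single match.
    first, last = max(
        combinations_with_replacement(matches, 2),
        key=lambda pair: SequenceMatcher(
            None, long_string[pair[0].start():pair[1].end()], short_string
        ).ratio(),
    )
    return first.start(), last.end()

long_string = (
    "If you're a coder you should consider buying a MacBook Pro 15inch "
    "with an M2 from Apple that will provide you with a plenty of "
    "computing power for your AI use-cases."
)
short_string = 'Apple MacBook Pro 15" M2'
start, end = closest_span(long_string, short_string)
print((start, end), long_string[start:end])
```

On the example this selects the same slice as the function above, "MacBook Pro 15inch with an M2", which corresponds to the [47:76] option listed in the question. Note that the number of candidate pairs grows quadratically with the number of word matches, so this is best suited to short inputs.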