Python or R context-aware fuzzy matching

Asked by user3507584 on 10/24/2023 · Updated 10/25/2023 · Viewed 45 times

Q:

I am trying to match two string columns [foods1 and foods2] containing food descriptions. I applied an algorithm that weights word frequency, so less frequent words carry more weight, but it fails because it does not recognize the object being described.

For example, the foods1 item "bagel with raisins" is matched to "salad with raisins" in foods2 instead of to "bagel", because "raisins" is a less common word. However, as an actual object, "bagel with raisins" is closer to "bagel" than to "salad with raisins".

Example in R:

foods1 <- c('bagel plain','bagel with raisins and olives', 'hamburger','bagel with olives','bagel with raisins')
foods1_id <- seq.int(1,length(foods1))

foods2 <- c('bagel','pizza','salad with raisins','tuna and olives')
foods2_id <- c(letters[1:length(foods2)])

require(fedmatch)
fuzzy_result <- merge_plus(data1 = data.frame(foods1_id,foods1, stringsAsFactors = F), 
                           data2 = data.frame(foods2_id,foods2, stringsAsFactors = F),
                           by.x = "foods1",
                           by.y = "foods2", match_type = "fuzzy",  
                           fuzzy_settings = build_fuzzy_settings(method = "wgt_jaccard", nthread = 2,maxDist = .75), 
                           unique_key_1 = "foods1_id",
                           unique_key_2 = "foods2_id")

In the results, see row 3, where the foods1 item "bagel with raisins" is matched to the foods2 item "salad with raisins", and the last row, where the foods1 item "bagel with raisins and olives" is matched to the foods2 item "tuna and olives":

fuzzy_result
$matches
   foods2_id foods1_id                        foods1             foods2
1:         a         1                   bagel plain              bagel
2:         a         4             bagel with olives              bagel
3:         c         5            bagel with raisins salad with raisins
4:         d         2 bagel with raisins and olives    tuna and olives

Is there any fuzzy matching algorithm in R or Python that understands the object being matched? [So that "bagel" is considered closer to "bagel with raisins" than "salad with raisins" is.]

python r fuzzy-comparison

Comments

1 upvote · Suraj Shourie · 10/24/2023
A (more complex) option is to generate word embeddings for each food item, then compute a similarity score between every item in foods1 and every item in foods2, and then pair them up.
0 upvotes · user3507584 · 10/25/2023
What do you mean by word embeddings? :/

A:

1 upvote · Suraj Shourie · 10/25/2023 · #1

To expand on my comment, you could try the NLP concept of word embeddings, which are simply vector/numeric representations of words or sentences. A simplified way to think about word embeddings is that they are generated so as to capture the semantic meaning of words, so similar words end up in the same cluster.

For a small dataset like yours this may be overkill, but once the embeddings are generated you can use cosine similarity to find the foods that are closest to each other.
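
As a point of reference, cosine similarity between two embedding vectors is just their dot product divided by the product of their norms. Below is a minimal numpy sketch using made-up 3-dimensional vectors; real sentence embeddings typically have a few hundred dimensions:

# Minimal cosine-similarity sketch with made-up 3-dimensional vectors
import numpy as np

a = np.array([0.9, 0.1, 0.3])  # stand-in embedding, e.g. for "bagel with raisins"
b = np.array([0.8, 0.2, 0.4])  # stand-in embedding, e.g. for "bagel"

cos_sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(cos_sim)  # close to 1 when the vectors point in similar directions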

There are many pre-trained models you can use, although you may need to do some research to find the one that works best for your use case (you could also fine-tune a model if you have the data, but that is a separate topic).

See an un-optimized Python implementation below:

# init
# !pip install -U sentence-transformers
from sentence_transformers import SentenceTransformer, util
from scipy import spatial
from sklearn.metrics.pairwise import cosine_similarity
import pandas as pd
import numpy as np

sentences1 = ['bagel plain','bagel with raisins and olives', 'hamburger','bagel with olives','bagel with raisins', 'bagel']
sentences2 = ['bagel','pizza','salad with raisins','tuna and olives']
sentences = sentences1+ sentences2
sentences = list(set(sentences)) # keep unique items only

# Initialize the model
model = SentenceTransformer('all-MiniLM-L6-v2') # try different models

# Create embeddings for each sentence
embeddings = model.encode(sentences)

# loop through each item in sentences1, compute cosine similarity against every item in sentences2, and select the one with the highest similarity:
indices1 = [sentences.index(i) for i in sentences1]
indices2 = [sentences.index(i) for i in sentences2]
emb1, emb2 = embeddings[indices1], embeddings[indices2]

arr_cos, arr_sent = [], []
for i in range(len(sentences1)):
  cos = cosine_similarity(emb1[i].reshape(1,embeddings.shape[1]), emb2).flatten()
  idx = np.argmax(cos)
  # print(i, idx, cos.shape)
  arr_cos.append(cos[idx])
  arr_sent.append(sentences2[idx])

print(pd.DataFrame({'sent1': sentences1, 'paired': arr_sent, 'cosine': arr_cos}))

Output:

                           sent1              paired    cosine
0                    bagel plain               bagel  0.808948
1  bagel with raisins and olives  salad with raisins  0.638765
2                      hamburger               pizza  0.437424
3              bagel with olives               bagel  0.686805
4             bagel with raisins               bagel  0.707621
5                          bagel               bagel  1.000000
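
Continuing from the snippet above, if you also want to discard weak matches (roughly analogous to the maxDist cutoff in the fedmatch call in the question), you could filter the paired results by a cosine threshold; the 0.6 cutoff below is only an illustrative value, not a recommendation:

# Hypothetical post-processing: keep only pairs above an arbitrary cosine cutoff
df = pd.DataFrame({'sent1': sentences1, 'paired': arr_sent, 'cosine': arr_cos})
print(df[df['cosine'] >= 0.6])  # drops low-confidence pairs such as hamburger/pizza (0.44)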

Comments

0 upvotes · user3507584 · 10/30/2023
Thank you very much, Suraj! I managed to reproduce it. This is a great MWE that I can build on further! :-)