Asked by user3507584 · Asked 10/24/2023 · Updated 10/25/2023 · 45 views
Python or R context-aware fuzzy matching
Q:

I am trying to match two string columns of food descriptions, foods1 and foods2. I applied an algorithm that weights by word frequency, so that less frequent words carry more weight, but it fails because it does not recognize the object being described.

For example, the foods1 item 'bagel with raisins' gets matched to 'salad with raisins' in foods2 instead of to 'bagel', because 'raisins' is a less common word. Yet as an actual object, 'bagel with raisins' is closer to 'bagel' than to 'salad with raisins'.
Example in R:
foods1 <- c('bagel plain', 'bagel with raisins and olives', 'hamburger',
            'bagel with olives', 'bagel with raisins')
foods1_id <- seq.int(1, length(foods1))

foods2 <- c('bagel', 'pizza', 'salad with raisins', 'tuna and olives')
foods2_id <- letters[1:length(foods2)]

require(fedmatch)
fuzzy_result <- merge_plus(
  data1 = data.frame(foods1_id, foods1, stringsAsFactors = FALSE),
  data2 = data.frame(foods2_id, foods2, stringsAsFactors = FALSE),
  by.x = "foods1",
  by.y = "foods2",
  match_type = "fuzzy",
  fuzzy_settings = build_fuzzy_settings(method = "wgt_jaccard",
                                        nthread = 2, maxDist = .75),
  unique_key_1 = "foods1_id",
  unique_key_2 = "foods2_id"
)
In the results below, see how row 3 matches 'bagel with raisins' in foods1 to 'salad with raisins' in foods2, and the last row matches 'bagel with raisins and olives' in foods1 to 'tuna and olives' in foods2:

fuzzy_result$matches
   foods2_id foods1_id                        foods1             foods2
1:         a         1                   bagel plain              bagel
2:         a         4             bagel with olives              bagel
3:         c         5            bagel with raisins salad with raisins
4:         d         2 bagel with raisins and olives    tuna and olives
Is there any fuzzy-matching algorithm in R or Python that understands the object being matched, so that 'bagel' is considered closer to 'bagel with raisins' than 'salad with raisins' is?
A:

1 upvote
Suraj Shourie
10/25/2023
#1
To expand on my comment: you could try the NLP concept of word embeddings, which are simply vector/numeric representations of words or sentences. A simplified intuition is that embeddings are generated in a way that captures semantic meaning, so similar words end up clustered together.

For a small dataset like yours this may be overkill, but once the embeddings are generated you can use cosine similarity to find the foods closest to each other.
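As a toy illustration of that cosine-similarity step (the three-dimensional vectors below are made up for illustration; real sentence embeddings typically have hundreds of dimensions):

```python
import numpy as np

def cosine(u, v):
    # cosine similarity = dot product of the two vectors
    # divided by the product of their lengths
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# hypothetical embeddings: similar foods point in similar directions
bagel        = np.array([0.9, 0.1, 0.2])
bagel_raisin = np.array([0.8, 0.3, 0.2])
salad_raisin = np.array([0.1, 0.9, 0.3])

print(cosine(bagel_raisin, bagel))         # high: same underlying object
print(cosine(bagel_raisin, salad_raisin))  # lower: different object
```

A vector is always perfectly similar to itself (cosine 1.0), and unrelated directions score near 0, which is what makes the measure usable for nearest-neighbor matching.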
There are many pre-trained models you could use, though you may need to do some research to find the one best suited to your use case (you can also fine-tune a model on your own data, but that is a separate topic).

See an unoptimized Python implementation below:
# init
# !pip install -U sentence-transformers
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import pandas as pd
import numpy as np

sentences1 = ['bagel plain', 'bagel with raisins and olives', 'hamburger',
              'bagel with olives', 'bagel with raisins', 'bagel']
sentences2 = ['bagel', 'pizza', 'salad with raisins', 'tuna and olives']
sentences = list(set(sentences1 + sentences2))  # unique sentences

# Initialize the model
model = SentenceTransformer('all-MiniLM-L6-v2')  # try different models

# Create embeddings for each sentence
embeddings = model.encode(sentences)

# For each sentence in sentences1, compute cosine similarity against every
# sentence in sentences2 and keep the one with the highest similarity:
indices1 = [sentences.index(i) for i in sentences1]
indices2 = [sentences.index(i) for i in sentences2]
emb1, emb2 = embeddings[indices1], embeddings[indices2]

arr_cos, arr_sent = [], []
for i in range(len(sentences1)):
    cos = cosine_similarity(emb1[i].reshape(1, -1), emb2).flatten()
    idx = np.argmax(cos)
    arr_cos.append(cos[idx])
    arr_sent.append(sentences2[idx])

print(pd.DataFrame({'sent1': sentences1, 'paired': arr_sent, 'cosine': arr_cos}))
Output:
sent1 paired cosine
0 bagel plain bagel 0.808948
1 bagel with raisins and olives salad with raisins 0.638765
2 hamburger pizza 0.437424
3 bagel with olives bagel 0.686805
4 bagel with raisins bagel 0.707621
5 bagel bagel 1.000000
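As an aside, the per-item loop above can also be vectorized: normalize the embedding rows once, take a single matrix product to get all pairwise cosines, then argmax along each row. A minimal numpy-only sketch, with small made-up vectors standing in for the model embeddings:

```python
import numpy as np

# stand-in embeddings: one row per item (e.g. 3 foods1 items, 2 foods2 items)
emb1 = np.array([[0.9, 0.1], [0.2, 0.9], [0.8, 0.4]])
emb2 = np.array([[1.0, 0.0], [0.0, 1.0]])

# normalize the rows, then one matrix product gives all pairwise cosines
a = emb1 / np.linalg.norm(emb1, axis=1, keepdims=True)
b = emb2 / np.linalg.norm(emb2, axis=1, keepdims=True)
cos = a @ b.T                 # shape (len(emb1), len(emb2))

best = cos.argmax(axis=1)     # closest emb2 row for each emb1 row
print(best)                   # -> [0 1 0]
```

For lists of this size the loop is fine; the matrix form mainly pays off when both columns have thousands of entries.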
Comments

0 upvotes
user3507584
10/30/2023

Thanks so much, Suraj! I managed to reproduce it. This is a great MWE that I can build on further! :-)