Asked by user3507584 · Asked 10/24/2023 · Updated 10/25/2023 · 45 views
Python or R context-aware fuzzy matching
Q:

I am trying to match two string columns of food descriptions, foods1 and foods2. I applied an algorithm that weights by word frequency, so that less frequent words carry more weight, but it fails because it does not recognize the object being described.

For example, the foods1 item 'bagel with raisins' gets matched to 'salad with raisins' in foods2 instead of to 'bagel', because 'raisins' is a less common word. Yet as an actual object, 'bagel with raisins' is closer to 'bagel' than to 'salad with raisins'.
Example in R:
foods1 <- c('bagel plain', 'bagel with raisins and olives', 'hamburger',
            'bagel with olives', 'bagel with raisins')
foods1_id <- seq.int(1, length(foods1))

foods2 <- c('bagel', 'pizza', 'salad with raisins', 'tuna and olives')
foods2_id <- letters[1:length(foods2)]

require(fedmatch)
fuzzy_result <- merge_plus(
  data1 = data.frame(foods1_id, foods1, stringsAsFactors = FALSE),
  data2 = data.frame(foods2_id, foods2, stringsAsFactors = FALSE),
  by.x = "foods1",
  by.y = "foods2",
  match_type = "fuzzy",
  fuzzy_settings = build_fuzzy_settings(method = "wgt_jaccard",
                                        nthread = 2, maxDist = .75),
  unique_key_1 = "foods1_id",
  unique_key_2 = "foods2_id"
)
In the results below, see how row 3 matches 'bagel with raisins' in foods1 to 'salad with raisins' in foods2, and the last row matches 'bagel with raisins and olives' in foods1 to 'tuna and olives' in foods2:

fuzzy_result$matches
   foods2_id foods1_id                        foods1             foods2
1:         a         1                   bagel plain              bagel
2:         a         4             bagel with olives              bagel
3:         c         5            bagel with raisins salad with raisins
4:         d         2 bagel with raisins and olives    tuna and olives
Is there any fuzzy-matching algorithm in R or Python that understands the object being matched, so that 'bagel' is considered closer to 'bagel with raisins' than 'salad with raisins' is?
A:

1 upvote
Suraj Shourie
10/25/2023
#1
To expand on my comment: you could try the NLP concept of word embeddings, which are simply vector/numeric representations of words or sentences. A simplified intuition is that embeddings are generated in a way that captures semantic meaning, so similar words end up clustered together.

For a small dataset like yours this may be overkill, but once the embeddings are generated you can use cosine similarity to find the foods closest to each other.
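As a toy illustration of that cosine-similarity step (the three-dimensional vectors below are made up for illustration; real sentence embeddings typically have hundreds of dimensions):

```python
import numpy as np

def cosine(u, v):
    # cosine similarity = dot product of the two vectors
    # divided by the product of their lengths
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# hypothetical embeddings: similar foods point in similar directions
bagel        = np.array([0.9, 0.1, 0.2])
bagel_raisin = np.array([0.8, 0.3, 0.2])
salad_raisin = np.array([0.1, 0.9, 0.3])

print(cosine(bagel_raisin, bagel))         # high: same underlying object
print(cosine(bagel_raisin, salad_raisin))  # lower: different object
```

A vector is always perfectly similar to itself (cosine 1.0), and unrelated directions score near 0, which is what makes the measure usable for nearest-neighbor matching.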
There are many pre-trained models you could use, though you may need to do some research to find the one best suited to your use case (you can also fine-tune a model on your own data, but that is a separate topic).

See an unoptimized Python implementation below:
# init
# !pip install -U sentence-transformers
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import pandas as pd
import numpy as np

sentences1 = ['bagel plain', 'bagel with raisins and olives', 'hamburger',
              'bagel with olives', 'bagel with raisins', 'bagel']
sentences2 = ['bagel', 'pizza', 'salad with raisins', 'tuna and olives']
sentences = list(set(sentences1 + sentences2))  # unique sentences

# Initialize the model
model = SentenceTransformer('all-MiniLM-L6-v2')  # try different models

# Create embeddings for each sentence
embeddings = model.encode(sentences)

# For each sentence in sentences1, compute cosine similarity against every
# sentence in sentences2 and keep the one with the highest similarity:
indices1 = [sentences.index(i) for i in sentences1]
indices2 = [sentences.index(i) for i in sentences2]
emb1, emb2 = embeddings[indices1], embeddings[indices2]

arr_cos, arr_sent = [], []
for i in range(len(sentences1)):
    cos = cosine_similarity(emb1[i].reshape(1, -1), emb2).flatten()
    idx = np.argmax(cos)
    arr_cos.append(cos[idx])
    arr_sent.append(sentences2[idx])

print(pd.DataFrame({'sent1': sentences1, 'paired': arr_sent, 'cosine': arr_cos}))
Output:
sent1 paired cosine
0 bagel plain bagel 0.808948
1 bagel with raisins and olives salad with raisins 0.638765
2 hamburger pizza 0.437424
3 bagel with olives bagel 0.686805
4 bagel with raisins bagel 0.707621
5 bagel bagel 1.000000
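As an aside, the per-item loop above can also be vectorized: normalize the embedding rows once, take a single matrix product to get all pairwise cosines, then argmax along each row. A minimal numpy-only sketch, with small made-up vectors standing in for the model embeddings:

```python
import numpy as np

# stand-in embeddings: one row per item (e.g. 3 foods1 items, 2 foods2 items)
emb1 = np.array([[0.9, 0.1], [0.2, 0.9], [0.8, 0.4]])
emb2 = np.array([[1.0, 0.0], [0.0, 1.0]])

# normalize the rows, then one matrix product gives all pairwise cosines
a = emb1 / np.linalg.norm(emb1, axis=1, keepdims=True)
b = emb2 / np.linalg.norm(emb2, axis=1, keepdims=True)
cos = a @ b.T                 # shape (len(emb1), len(emb2))

best = cos.argmax(axis=1)     # closest emb2 row for each emb1 row
print(best)                   # -> [0 1 0]
```

For lists of this size the loop is fine; the matrix form mainly pays off when both columns have thousands of entries.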
Comments

0 upvotes
user3507584
10/30/2023

Thanks so much, Suraj! I managed to reproduce it. This is a great MWE that I can build on further! :-)