How to use a hashmap or other better methods to remove duplicate strings from a set of strings?

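Q:

I have a dataset of more than 20 million strings, with lengths ranging from 50 to 5000, and I want to remove the strings that are duplicates of (or very similar to) others. I found that faiss might solve this, but I am not sure whether it is the right approach. Here is my solution: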

import numpy as np
import faiss

# Random vectors standing in for the encoded strings
np.random.seed(42)
data = np.random.random((1000, 128)).astype('float32')

use_gpu = faiss.get_num_gpus() > 0

index = faiss.GpuIndexFlatL2(faiss.StandardGpuResources(), 128) if use_gpu else faiss.IndexFlatL2(128)

index.add(data)
query_vector = np.random.random((1, 128)).astype('float32')
k = 5  # number of nearest neighbors to return
distances, indices = index.search(query_vector, k)

print("the closest indices:", indices)
print("distances:", distances)

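The problem is that I would need to encode my strings first and then put them into the index, and I think that would still take a long time.

Any suggestions would be helpful.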

Tags: python, hash, hashmap

A:

from gensim.models import Word2Vec
import numpy as np
import faiss

# Example string data (replace with your own strings)
string_data = ["example sentence one", "example sentence two", "another sentence"]
tokenized = [sentence.split() for sentence in string_data]

# Build a Word2Vec model on the tokenized strings
word2vec_model = Word2Vec(tokenized, vector_size=128, window=5, min_count=1, workers=4)

def embed(tokens):
    # Average the word vectors so that every string maps to one 128-dim float32 vector
    return word2vec_model.wv[tokens].mean(axis=0).astype('float32')

# Get one vector per string
vector_data = np.array([embed(tokens) for tokens in tokenized])

# Build the FAISS index (keep a reference to the GPU resources while the index is in use)
use_gpu = faiss.get_num_gpus() > 0
if use_gpu:
    res = faiss.StandardGpuResources()
    index = faiss.GpuIndexFlatL2(res, 128)
else:
    index = faiss.IndexFlatL2(128)
index.add(vector_data)

# Example similarity search; every query word must be in the model's vocabulary
query_vector = embed("another sentence".split()).reshape(1, -1)
k = min(5, len(string_data))
distances, indices = index.search(query_vector, k)

print("Indices of the closest strings:", indices)
print("Distances:", distances)

Be sure to adjust the parameters and the method to your needs and the characteristics of your data.
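For exact duplicates, which the title asks about, no embedding is needed: a hash set is enough. The sketch below is one possible way to do it, under the assumption that strings differing only in letter case or whitespace should count as the same; the helper names `normalize` and `dedupe` are made up for illustration. It stores a 16-byte digest per string instead of the string itself, which keeps memory use modest even with 20 million strings of up to 5000 characters.

```python
import hashlib

def normalize(text):
    # Assumption: case and whitespace differences do not make strings distinct
    return " ".join(text.lower().split())

def dedupe(strings):
    """Yield each string whose normalized form has not been seen before."""
    seen = set()
    for s in strings:
        # Store a fixed-size digest rather than the full string to save memory
        digest = hashlib.blake2b(normalize(s).encode("utf-8"), digest_size=16).digest()
        if digest not in seen:
            seen.add(digest)
            yield s
```

Because `dedupe` is a generator, it can be fed lines straight from a file, for example `unique = list(dedupe(open("strings.txt", encoding="utf-8")))` with a hypothetical file name. Running this pass first also reduces the number of strings that have to be encoded and indexed for the near-duplicate search.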
