How to match strings between two different DataFrames in Python

Asked by: Annisa Lianda  Asked: 1/16/2023  Updated: 1/17/2023  Views: 59

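Q:

How can I match text between two different DataFrames? I am writing this in Python, but for some reason every value in the Match column is False, even though DataFrame 1 and DataFrame 2 do contain matching text.

Here is my code: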

import pandas as pd
from nltk.tokenize import word_tokenize

# List of search keywords
search_term = ["Gempa AND #gempa cianjur AND #gempa maluku", 
               "Sambo AND #ferdy sambo AND #brigadir j",
               "Lukas Enembe AND #lukas enembe tersangka AND #gubernur papua",
               "Puan Maharani AND #pdip AND #pilpres2024",
               "Putri Candrawathi AND #LPSK AND #brigadir yosua",
               "Resesi AND #resesi AND #APBD DKI",
               "IKN AND #ikn AND #ibu kota nusantara",
               "Piala AFF 2022 AND #piala aff 2022 AND #pssi",
               "Pemilu 2024 AND #partai politik pemilu 2024",
               "BMKG AND #BMKG",
               "Kripto AND #kripto AND #investasi",
               "Ekonomi AND #ekonomi indonesia AND #jokowi",
               "Elon Musk AND #elon musk",
               "Jokowi AND #Jokowi",
               "Puan AND #puan",
               "Ganjar Pranowo AND #Ganjar Pranowo AND #Pilpres 2024"]

# Build a one-column DataFrame from the list, then drop the "AND" operators
# and the "#" markers so that only the keywords remain
searc_term_df = pd.DataFrame(search_term, columns=['Search Term'])
searc_term_df['Search Term'] = searc_term_df['Search Term'].str.replace('AND', '', regex=False)
searc_term_df['Search Term'] = searc_term_df['Search Term'].str.replace('#', '', regex=False)

# Tokenize a sentence into individual words
def tokenize_data(tweet):
    return word_tokenize(tweet)

searc_term_df['Search Term'] = searc_term_df['Search Term'].apply(tokenize_data)

# Convert each token list to a string and strip the surrounding brackets
searc_term_df['Search Term'] = searc_term_df['Search Term'].astype(str).str.strip('[]')
# Remove single quotes from the string
searc_term_df['Search Term'] = searc_term_df['Search Term'].str.replace("'", '', regex=False)
searc_term_df

The output looks like this:

[image]

I want to match it against DataFrame 2, which looks like this:

[image]

Here is the code I use to match them, but every result comes out False:

df_all['Match'] = df_all['Text'].isin(searc_term_df['Search Term'])

This is the incorrect output:

[image]

Tags: python, pandas, dataframe, text, matching


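A:

0 votes  gustaph  1/17/2023  #1

My idea for solving this is to use a bag-of-words approach with scikit-learn's CountVectorizer.

First, I created a dataset to simulate your df_all['Text'] column, adding some of the words found in your searc_term_df.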
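
For context on why the original attempt returns only False: `Series.isin` checks whether each whole cell is equal to one of the given values; it does not look for substrings. The sketch below shows this on toy data and contrasts it with `Series.str.contains`, which is one possible substring-based alternative and not the approach used in the rest of this answer. The names `texts`, `keywords`, `exact` and `contains`, and the data, are illustrative assumptions.

```python
import re
import pandas as pd

# Toy stand-ins for df_all['Text'] and the search keywords (illustrative data)
texts = pd.Series(["Lorem ipsum gempa sit amet.", "Lorem ipsum dolor sit amet."])
keywords = ["gempa", "Jokowi"]

# isin: True only when the entire cell equals one of the values
exact = texts.isin(keywords)

# Substring alternative: one regex that matches any keyword, case-insensitively
pattern = "|".join(re.escape(k) for k in keywords)
contains = texts.str.contains(pattern, case=False, regex=True)

print(exact.tolist())     # [False, False]
print(contains.tolist())  # [True, False]
```

Because no tweet is exactly equal to a search-term string, `isin` has nothing to match, which is consistent with the all-False column in the question.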

import pandas as pd

test_text = [
    "Lorem ipsum gempa sit amet.",
    "Lorem ipsum dolor sit amet.",
    "Lorem ipsum Puan sit amet.",
    "Lorem ipsum Jokowi sit amet.",
    "Lorem ipsum dolor sit amet.",
    "Lorem ipsum dolor sit amet.",
    "Lorem ipsum Elon Musk amet.",
    "Lorem ipsum dolor sit amet.",
    "Lorem ipsum dolor sit amet.",
    "Lorem ipsum Pranowo sit amet.",
    "Lorem ipsum dolor, 2024 amet.",
]

df_all = pd.DataFrame(data=test_text, columns=["text"])

Then I instantiated the bag-of-words model, CountVectorizer().

from sklearn.feature_extraction.text import CountVectorizer
import numpy as np

# here I removed the commas from the search terms because they were also being
# interpreted as a match, but you can come up with another fancy pre-processing strategy
search_terms = searc_term_df["Search Term"].replace({",": ""}, regex=True)

# fits the model to your text data
vectorizer = CountVectorizer().fit(df_all["text"])

# creates a bag of words (BoW)
bow_text = vectorizer.transform(df_all["text"]).toarray()

# creates a bag of words of the search terms too
bow_search_terms = vectorizer.transform(search_terms).toarray()

# the dot product yields one row per text and one column per search term;
# an entry is non-zero only when the text and the term share a word.
# Summing over the search terms, a total of 0 means there is no match.
is_a_match = (np.dot(bow_text, bow_search_terms.transpose()).sum(axis=1) != 0)
df_all["match"] = is_a_match

df_all

Here is the output:

    text                            match
0   Lorem ipsum gempa sit amet.      True
1   Lorem ipsum dolor sit amet.     False
2   Lorem ipsum Puan sit amet.       True
3   Lorem ipsum Jokowi sit amet.     True
4   Lorem ipsum dolor sit amet.     False
5   Lorem ipsum dolor sit amet.     False
6   Lorem ipsum Elon Musk amet.      True
7   Lorem ipsum dolor sit amet.     False
8   Lorem ipsum dolor sit amet.     False
9   Lorem ipsum Pranowo sit amet.    True
10  Lorem ipsum dolor, 2024 amet.    True
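
To make the dot-product step easier to follow, here is a minimal sketch on a two-row toy example. The data and the names `texts`, `terms`, `bow_terms` and `overlap` are assumptions made for illustration. It relies on CountVectorizer's default settings, which lowercase the input and ignore tokens that are not in the fitted vocabulary.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Illustrative data: two texts and two already-cleaned search terms
texts = ["Lorem ipsum gempa sit amet.", "Lorem ipsum dolor sit amet."]
terms = ["Gempa gempa cianjur", "Elon Musk elon musk"]

# The vocabulary comes from the texts only, so words that appear
# only in the search terms (e.g. "cianjur") are silently dropped
vectorizer = CountVectorizer().fit(texts)
bow_text = vectorizer.transform(texts).toarray()
bow_terms = vectorizer.transform(terms).toarray()

# overlap[i, j] is non-zero only if text i and term j share a vocabulary word
overlap = bow_text @ bow_terms.T
is_a_match = overlap.sum(axis=1) != 0

print(overlap.tolist())     # [[2, 0], [0, 0]]
print(is_a_match.tolist())  # [True, False]
```

Note that a match here means sharing at least one word, not the full phrase, which is why the row containing only "2024" is flagged as True in the output above.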