Get topic names for each document

Asked by: Usman Rafiq  Asked: 2/17/2019  Updated: 2/17/2019  Views: 155

Q:

I am trying to do topic modeling on my documents using the example at this link: https://www.w3cschool.cn/doc_scikit_learn/scikit_learn-auto_examples-applications-topics_extraction_with_nmf_lda.html

My question: how can I find out which documents correspond to which topic?

Here is what I have done so far:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

n_features = 1000
n_topics = 8
n_top_words = 20

# Read the corpus: one document per line
with open('dataset.txt', 'r') as data_file:
    mydata = [line.strip() for line in data_file]

def print_top_words(model, feature_names, n_top_words):
    # Print the n_top_words highest-weighted terms for each topic
    for topic_idx, topic in enumerate(model.components_):
        print("Topic #%d:" % topic_idx)
        print(" ".join([feature_names[i]
                        for i in topic.argsort()[:-n_top_words - 1:-1]]))
    print()

# Term-frequency matrix for LDA (raw counts, English stop words removed,
# tokens of at least three characters)
tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2,
                                token_pattern=r'\b\w{2,}\w+\b',
                                max_features=n_features,
                                stop_words='english')
tf = tf_vectorizer.fit_transform(mydata)

# Fit an LDA model with three topics (the keyword was called n_topics in
# older scikit-learn releases and n_components in current ones)
lda = LatentDirichletAllocation(n_components=3, max_iter=5,
                                learning_method='online',
                                learning_offset=50.,
                                random_state=0)

lda.fit(tf)

print("\nTopics in LDA model:")
tf_feature_names = tf_vectorizer.get_feature_names()  # get_feature_names_out() in scikit-learn >= 1.0

print_top_words(lda, tf_feature_names, n_top_words)

# Find the most likely topic for each document
doc_topic = lda.transform(tf)      # document-topic distribution, shape (n_docs, n_topics)
for n in range(doc_topic.shape[0]):
    topic_most_pr = doc_topic[n].argmax()
    print("doc: {} topic: {}\n".format(n, topic_most_pr))

The expected output is:

Doc | Assigned Topic | Words_in_assigned_topic
  1 |              2 | science, humanbody, bones
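A minimal sketch of how that table could be produced from the code above, assuming lda, tf and tf_vectorizer have been fitted as shown in the question; the n_words_shown value and the column widths are illustrative choices, not part of the original code.

# Sketch only: map each document to its most probable topic and print that
# topic's top words in the Doc / Assigned Topic / Words layout asked for above.
# Assumes lda, tf and tf_vectorizer come from the code in the question;
# n_words_shown is an illustrative choice.
import numpy as np

n_words_shown = 3
feature_names = np.asarray(tf_vectorizer.get_feature_names())  # get_feature_names_out() on newer scikit-learn

# Comma-separated top words for every topic, taken from the fitted components
topic_words = [", ".join(feature_names[topic.argsort()[:-n_words_shown - 1:-1]])
               for topic in lda.components_]

doc_topic = lda.transform(tf)      # document-topic distribution, shape (n_docs, n_topics)
print("Doc | Assigned Topic | Words_in_assigned_topic")
for n, dist in enumerate(doc_topic):
    best = dist.argmax()           # index of the most probable topic for document n
    print("{:3d} | {:14d} | {}".format(n, best, topic_words[best]))

The same doc_topic array also holds the full probability of every topic for each document, in case more than the single best topic is wanted.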
python  scikit-learn  topic-modeling

Comments

0 votes  Jim Todd  2/17/2019
Have a look at this: stackoverflow.com/questions/26304191/...

A: No answers yet