[Web Scraping in Practice] Gensim, a Python Text Analysis Library
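Gensim is a Python library for text processing and vectorization. Below is a basic example that uses Gensim to rank a small set of documents by their similarity to a query: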
from gensim import corpora, models, similarities
# Sample text data
documents = [
"Human machine interface for lab abc computer",
"A survey of user opinion of computer system response time",
"The EPS user interface management system",
"System and human system engineering testing of CAD rental",
"Relation of user perceived response time to error measurement",
"The generation of random binary unordered trees",
"The intersection graph of paths in trees",
"Graph minors IV Widths of trees and well quasi ordering",
"Graph minors A survey"
]
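# Tokenize first: Dictionary expects lists of tokens, not raw strings
texts = [doc.lower().split() for doc in documents]
# Build the dictionary that maps each token to an integer id
dictionary = corpora.Dictionary(texts)
# Vectorize: each document becomes a bag-of-words list of (token id, count) pairs
corpus = [dictionary.doc2bow(text) for text in texts]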
# Train the TF-IDF model
tfidf = models.TfidfModel(corpus)
# Convert the corpus to TF-IDF vectors
corpus_tfidf = tfidf[corpus]
# Build a cosine-similarity index over the TF-IDF vectors
similarity_matrix = similarities.MatrixSimilarity(corpus_tfidf, num_features=len(dictionary))
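# Find the documents most similar to the query
query = "graph minors survey"
query_vec = dictionary.doc2bow(query.lower().split())
# Weight the query with the same TF-IDF model used for the indexed documents
sims = similarity_matrix[tfidf[query_vec]]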
print("Query:", query)
# Sort by descending similarity
for doc_id, sim in sorted(enumerate(sims), key=lambda item: -item[1]):
    print(f"{doc_id}: {documents[doc_id]} - Similarity: {sim:.4f}")
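This code first defines some sample text data, then builds a dictionary that maps each word in the documents to an integer id. Next it converts every document into a bag-of-words vector and trains a TF-IDF model on the result. It then builds a MatrixSimilarity index, which compares vectors by cosine similarity (this is a content-based similarity search, not a latent factor model), converts the query into a vector over the same dictionary, and looks up its similarity to every document in the collection. Finally, it prints the documents in order of decreasing similarity to the query. The example shows how Gensim can be used for basic text analysis and for building a simple content-based retrieval or recommendation step.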
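To make the description above more concrete, here is a minimal pure-Python sketch of the same idea: count words, weight them by TF-IDF, and rank documents by cosine similarity. It is an illustration, not Gensim's implementation. The function names (`tokenize`, `weigh`, `tfidf_vectors`, `cosine`, `rank`) are made up for this sketch, and the weighting (raw count times log2(N/df), followed by L2 normalization) is an assumption that should be close to Gensim's defaults but is not guaranteed to match them exactly.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace, as the example above does."""
    return text.lower().split()


def weigh(tokens, idf):
    """Weight raw term counts by IDF and scale the result to unit length."""
    # Terms that never occurred in the indexed documents are dropped.
    counts = Counter(t for t in tokens if t in idf)
    vec = {t: c * idf[t] for t, c in counts.items() if idf[t] > 0}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}


def tfidf_vectors(token_lists):
    """Return (vectors, idf): one sparse {term: weight} dict per document."""
    n_docs = len(token_lists)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in token_lists for term in set(tokens))
    # Assumed weighting: log2(N / df); terms present in every document get 0.
    idf = {term: math.log2(n_docs / count) for term, count in df.items()}
    return [weigh(tokens, idf) for tokens in token_lists], idf


def cosine(a, b):
    """For unit-length sparse vectors, the dot product is the cosine."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def rank(documents, query):
    """Return (doc_id, similarity) pairs, most similar first."""
    vectors, idf = tfidf_vectors([tokenize(d) for d in documents])
    q = weigh(tokenize(query), idf)
    scores = [(i, cosine(q, v)) for i, v in enumerate(vectors)]
    return sorted(scores, key=lambda pair: -pair[1])


docs = [
    "Graph minors A survey",
    "The intersection graph of paths in trees",
    "Human machine interface for lab abc computer",
]
for doc_id, score in rank(docs, "graph minors survey"):
    print(f"{doc_id}: {docs[doc_id]} - Similarity: {score:.4f}")
```

With these three documents, the one that shares the most rare words with the query is ranked first, the one that shares only the common word "graph" comes second, and the one with no words in common scores zero. In practice Gensim's sparse matrix index is the better choice, since it handles large corpora far more efficiently than per-document dictionaries.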