A simple LDA example with sklearn

line1="小明很喜欢学习计算机,有很多计算机的书。"
line2="小红很喜欢计算机,听过很多计算机的课"
line3="猫正在吃一条咸鱼"
import jieba
res1 = ' '.join(jieba.cut(line1))
print(res1)
res2 = ' '.join(jieba.cut(line2))
print(res2)
res3 = ' '.join(jieba.cut(line3))
print(res3)
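# Load the stop-word list from a file.
# Read it as text (UTF-8 is assumed here) so the entries are str rather than bytes;
# bytes entries never match the str tokens that CountVectorizer produces, so with
# a working list the vocabulary may be smaller than in the output shown below.
stpwrdpath = "stop_words.txt"
with open(stpwrdpath, encoding="utf-8") as stpwrd_dic:
    # Convert the stop-word list to a Python list, one entry per line
    stpwrdlst = stpwrd_dic.read().splitlines()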
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
corpus = [res1,res2,res3]
print("***********************")
print(corpus)
cntVector = CountVectorizer(stop_words=stpwrdlst)
cntTf = cntVector.fit_transform(corpus)
print("***************")
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print(cntVector.get_feature_names_out().tolist())
print("***************")
print(cntVector.vocabulary_)
print("***************")
print(type(cntTf))
print("***********************")
print(cntTf)
print("***********************")
print(cntTf.todense())
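# learning_offset only affects learning_method='online'; with the default
# 'batch' method it is ignored
lda = LatentDirichletAllocation(n_components=2,
                                learning_offset=50.,
                                random_state=0)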
docres = lda.fit_transform(cntTf)
print(docres)
print(lda.components_)

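# Save the trained model with joblib and load it back
# (the file is a binary pickle despite the .txt name)
import joblib
joblib.dump(lda,"lda_model.txt")
lda_model = joblib.load("lda_model.txt")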
print("***********************")
docres = lda_model.transform(cntTf)
print(docres)
print(lda_model.components_)

Output:

Building prefix dict from the default dictionary ...
Loading model from cache /tmp/jieba.cache
Loading model cost 0.874 seconds.
Prefix dict has been built successfully.
小明 很 喜欢 学习 计算机 , 有 很多 计算机 的 书 。
小红 很 喜欢 计算机 , 听过 很多 计算机 的 课
猫 正在 吃 一条 咸鱼
***********************
['小明 很 喜欢 学习 计算机 , 有 很多 计算机 的 书 。', '小红 很 喜欢 计算机 , 听过 很多 计算机 的 课', '猫 正在 吃 一条 咸鱼']
***************
['一条', '听过', '咸鱼', '喜欢', '学习', '小明', '小红', '很多', '正在', '计算机']
***************
{'小明': 5, '喜欢': 3, '学习': 4, '计算机': 9, '很多': 7, '小红': 6, '听过': 1, '正在': 8, '一条': 0, '咸鱼': 2}
***************
<class 'scipy.sparse.csr.csr_matrix'>
***********************
  (0, 5)	1
  (0, 3)	1
  (0, 4)	1
  (0, 9)	2
  (0, 7)	1
  (1, 3)	1
  (1, 9)	2
  (1, 7)	1
  (1, 6)	1
  (1, 1)	1
  (2, 8)	1
  (2, 0)	1
  (2, 2)	1
***********************
[[0 0 0 1 1 1 0 1 0 2]
 [0 1 0 1 0 0 1 1 0 2]
 [1 0 1 0 0 0 0 0 1 0]]
[[0.07562513 0.92437487]
 [0.07562515 0.92437485]
 [0.87268309 0.12731691]]
[[1.49691732 0.50826822 1.49691732 0.50850296 0.50826695 0.50826695
  0.50826822 0.50850296 1.49691732 0.50856976]
 [0.50308268 1.49173178 0.50308268 2.49149704 1.49173305 1.49173305
  1.49173178 2.49149704 0.50308268 4.49143024]]
***********************
[[0.07562513 0.92437487]
 [0.07562515 0.92437485]
 [0.87268309 0.12731691]]
[[1.49691732 0.50826822 1.49691732 0.50850296 0.50826695 0.50826695
  0.50826822 0.50850296 1.49691732 0.50856976]
 [0.50308268 1.49173178 0.50308268 2.49149704 1.49173305 1.49173305
  1.49173178 2.49149704 0.50308268 4.49143024]]
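
To make the last two matrices easier to read, here is a minimal sketch in plain Python that picks the dominant topic of each sentence and the top words of each topic. The numbers are copied from the output above; the helper names (dominant_topic, top_words) are made up for illustration and are not part of sklearn. Dividing each row of components_ by its sum turns the pseudo-counts into a word distribution per topic. Topic indices here are 0-based, so index 1 is what the text calls topic 2.

```python
# Values copied from the printed output above.
feature_names = ['一条', '听过', '咸鱼', '喜欢', '学习',
                 '小明', '小红', '很多', '正在', '计算机']
doc_topic = [
    [0.07562513, 0.92437487],
    [0.07562515, 0.92437485],
    [0.87268309, 0.12731691],
]
components = [
    [1.49691732, 0.50826822, 1.49691732, 0.50850296, 0.50826695,
     0.50826695, 0.50826822, 0.50850296, 1.49691732, 0.50856976],
    [0.50308268, 1.49173178, 0.50308268, 2.49149704, 1.49173305,
     1.49173305, 1.49173178, 2.49149704, 0.50308268, 4.49143024],
]


def dominant_topic(row):
    """Return the index of the largest entry in one document's topic row."""
    return max(range(len(row)), key=lambda k: row[k])


def top_words(weights, names, n=3):
    """Return the n highest-weighted words of one topic as (word, share) pairs."""
    total = sum(weights)  # normalizing by the row sum gives a distribution
    order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    return [(names[i], weights[i] / total) for i in order[:n]]


dominant = [dominant_topic(row) for row in doc_topic]
topics = [top_words(row, feature_names) for row in components]

print(dominant)
for k, words in enumerate(topics):
    print(k, [(w, round(p, 3)) for w, p in words])
```

The first two sentences end up with dominant topic index 1 and the third with index 0, and "计算机" is the top word of topic index 1.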

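The steps in the code are as follows. First, jieba segments each sentence into words. Next, the stop-word list is read from a file (stop_words.txt; such lists are easy to find online). CountVectorizer then builds a term-count vector for each sentence and returns them as a sparse matrix (csr_matrix), cntTf; it assigns every word in the vocabulary an integer index starting from 0. Note that its default token_pattern keeps only tokens of two or more characters, which is why single-character words such as 很, 有, 的, 书, 课, 猫 and 吃 do not appear in the vocabulary.

Finally, the sparse matrix is passed to the LDA model. The first two sentences clearly lean toward the same topic, topic 2 (the second column of the document-topic matrix, about 0.92 for each). Because “计算机” appears twice in each of them, it has the highest weight in that topic. The third sentence leans toward topic 1 (about 0.87); none of its words is repeated, so no word in that topic has an especially high weight.

fit_transform first trains the model and then predicts on the same data, whereas transform predicts directly with an already trained model. LatentDirichletAllocation has no save function of its own, so joblib's dump and load functions can be used to save and reload the model.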
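
As a rough illustration of what the CountVectorizer step does with the segmented sentences, the sketch below rebuilds the vocabulary and the count matrix by hand using only the standard library. It is not sklearn's implementation: the regular expression is CountVectorizer's documented default token_pattern, the variable names are made up for this example, and stop-word filtering and lowercasing are left out.

```python
import re
from collections import Counter

# CountVectorizer's default token_pattern: tokens of 2+ word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

# The space-joined jieba output, copied from the printed corpus.
corpus = [
    "小明 很 喜欢 学习 计算机 , 有 很多 计算机 的 书 。",
    "小红 很 喜欢 计算机 , 听过 很多 计算机 的 课",
    "猫 正在 吃 一条 咸鱼",
]

# Tokenize: single-character words and punctuation are dropped here.
tokenized = [TOKEN_PATTERN.findall(doc) for doc in corpus]

# Sort the distinct terms and number them from 0, like vocabulary_.
terms = sorted({term for doc in tokenized for term in doc})
vocabulary = {term: idx for idx, term in enumerate(terms)}

# Build one dense count row per document.
counts = []
for doc in tokenized:
    row = [0] * len(vocabulary)
    for term, n in Counter(doc).items():
        row[vocabulary[term]] = n
    counts.append(row)

print(vocabulary)
for row in counts:
    print(row)
```

The rows produced this way match the dense matrix printed by cntTf.todense() in the output above, which suggests that the stop-word list played no part in that run and that the short words were removed by the token pattern alone.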

Original link: https://blog.csdn.net/fangfanglovezhou/article/details/120210170