Suppose the corpus contains only 2 documents, $d_1$ and $d_2$, where
$d_1=(A,B,C,D,A)$ consists of 5 words and $d_2=(B,E,A,B)$ consists of 4 words.
1.TF
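TF stands for term frequency: how often a term occurs in a given document, computed as (number of occurrences of the term in the document) / (total number of words in the document). For documents $d_1$ and $d_2$: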
| | $d_1$ | $d_2$ |
|---|---|---|
| A | $\frac{2}{5}$ | $\frac{1}{4}$ |
| B | $\frac{1}{5}$ | $\frac{2}{4}$ |
| C | $\frac{1}{5}$ | $\frac{0}{4}$ |
| D | $\frac{1}{5}$ | $\frac{0}{4}$ |
| E | $\frac{0}{5}$ | $\frac{1}{4}$ |
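Note: the dictionary built from the corpus has 5 entries (A, B, C, D, E), so every document is vectorized into a vector of length 5.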
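The TF table above can be reproduced with a few lines of Python. This is a minimal sketch: the names `term_frequency`, `vocab`, `tf_d1` and `tf_d2` are illustrative choices, and `Fraction` is used only so the results stay exact (note that it reduces $\frac{2}{4}$ to $\frac{1}{2}$).

```python
from collections import Counter
from fractions import Fraction

# Toy corpus from the text
d1 = ["A", "B", "C", "D", "A"]
d2 = ["B", "E", "A", "B"]

# Dictionary (vocabulary) built from the whole corpus, in sorted order
vocab = sorted(set(d1) | set(d2))  # ['A', 'B', 'C', 'D', 'E']


def term_frequency(doc, vocab):
    """Return TF for every vocabulary term: occurrences / document length."""
    counts = Counter(doc)
    return [Fraction(counts[term], len(doc)) for term in vocab]


tf_d1 = term_frequency(d1, vocab)
tf_d2 = term_frequency(d2, vocab)

for term, a, b in zip(vocab, tf_d1, tf_d2):
    print(term, a, b)
```

Each document yields one value per vocabulary term, which is why both vectors have length 5 even though C, D and E do not occur in every document.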
2.IDF
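IDF stands for inverse document frequency. It is based on the ratio of the total number of documents to the number of documents in which term $t$ appears: $IDF(t)=\ln\left(\frac{1+|D|}{|D_t|}\right)$, where $|D|$ is the number of documents in the corpus and $|D_t|$ is the number of documents containing $t$ (other variants exist, for example with a different logarithm base; the natural-logarithm form is the one used here). The IDF value of each term in the corpus is: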
| Term | $IDF(t)$ |
|---|---|
| A | $\ln\left(\frac{1+2}{2}\right)$ |
| B | $\ln\left(\frac{1+2}{2}\right)$ |
| C | $\ln\left(\frac{1+2}{1}\right)$ |
| D | $\ln\left(\frac{1+2}{1}\right)$ |
| E | $\ln\left(\frac{1+2}{1}\right)$ |
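As a quick numeric check of the IDF table, the sketch below evaluates $\ln((1+|D|)/|D_t|)$ for every term of the toy corpus. The helper name `idf` and the variable `idf_values` are illustrative, not part of any library.

```python
import math

# Toy corpus from the text
docs = [
    ["A", "B", "C", "D", "A"],  # d1
    ["B", "E", "A", "B"],       # d2
]
vocab = sorted({term for doc in docs for term in doc})


def idf(term, docs):
    """IDF(t) = ln((1 + |D|) / |D_t|), the formula used in this article."""
    n_docs = len(docs)                                   # |D|
    doc_freq = sum(1 for doc in docs if term in doc)     # |D_t|
    return math.log((1 + n_docs) / doc_freq)


idf_values = {term: idf(term, docs) for term in vocab}

for term, value in idf_values.items():
    print(term, round(value, 4))
# A and B appear in both documents: ln(3/2) ≈ 0.4055
# C, D and E appear in one document: ln(3/1) ≈ 1.0986
```

Terms that occur in fewer documents receive a larger IDF, so they contribute more to the final vector.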
3.Combining TF and IDF: vector representation of a document
Taking $d_1$ as an example, with components ordered as (A, B, C, D, E):
$$d_1=(x_1,x_2,x_3,x_4,x_5)=\left(\frac{2}{5}\times\ln\left(\frac{1+2}{2}\right),\ \frac{1}{5}\times\ln\left(\frac{1+2}{2}\right),\ \frac{1}{5}\times\ln\left(\frac{1+2}{1}\right),\ \frac{1}{5}\times\ln\left(\frac{1+2}{1}\right),\ \frac{0}{5}\times\ln\left(\frac{1+2}{1}\right)\right)$$
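The vector above can be evaluated numerically by multiplying each TF by the corresponding IDF. The following is a minimal sketch of that hand calculation; `tfidf_vector`, `v1` and `v2` are illustrative names.

```python
import math
from collections import Counter

# Toy corpus from the text
d1 = ["A", "B", "C", "D", "A"]
d2 = ["B", "E", "A", "B"]
corpus = [d1, d2]
vocab = sorted(set(d1) | set(d2))  # ['A', 'B', 'C', 'D', 'E']


def tfidf_vector(doc, corpus, vocab):
    """TF-IDF with TF = count / len(doc) and IDF = ln((1 + |D|) / |D_t|)."""
    counts = Counter(doc)
    n_docs = len(corpus)
    vector = []
    for term in vocab:
        tf = counts[term] / len(doc)
        doc_freq = sum(1 for other in corpus if term in other)
        vector.append(tf * math.log((1 + n_docs) / doc_freq))
    return vector


v1 = tfidf_vector(d1, corpus, vocab)
v2 = tfidf_vector(d2, corpus, vocab)

print([round(x, 4) for x in v1])  # [0.1622, 0.0811, 0.2197, 0.2197, 0.0]
```

Note that scikit-learn's `TfidfVectorizer` will not reproduce these exact numbers with its default settings: it uses raw counts as TF, a smoothed IDF of the form $\ln\frac{1+n}{1+df(t)}+1$, and L2-normalizes each row.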
4.TfidfVectorizer parameters explained
(1)max_df:
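When building the vocabulary, terms whose document frequency is strictly higher than the given threshold are ignored (corpus-specific stop words). If the value is a float, it represents a proportion of documents; if it is an integer, it is an absolute document count. This parameter is ignored if vocabulary is not None.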
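The effect of max_df can be seen on the toy corpus. This is a sketch that assumes scikit-learn is installed; the corpus strings and variable names are illustrative. Two settings are needed only because the toy terms are single uppercase letters: `lowercase=False` keeps them as written, and the custom `token_pattern` is required because the default pattern drops single-character tokens.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus from the text, one string per document
docs = ["A B C D A", "B E A B"]

# Keep single-character, case-sensitive tokens (the defaults would drop them)
common = dict(lowercase=False, token_pattern=r"(?u)\b\w+\b")

# Default max_df=1.0 (float): nothing is filtered out
vocab_all = sorted(TfidfVectorizer(**common).fit(docs).vocabulary_)

# Float: drop terms appearing in more than 50% of the documents
vocab_float = sorted(TfidfVectorizer(max_df=0.5, **common).fit(docs).vocabulary_)

# Integer: drop terms appearing in more than 1 document
vocab_int = sorted(TfidfVectorizer(max_df=1, **common).fit(docs).vocabulary_)

# With a fixed vocabulary, max_df is ignored
vocab_fixed = sorted(
    TfidfVectorizer(max_df=0.5, vocabulary=["A", "B"], **common).fit(docs).vocabulary_
)

print(vocab_all)    # ['A', 'B', 'C', 'D', 'E']
print(vocab_float)  # ['C', 'D', 'E']
print(vocab_int)    # ['C', 'D', 'E']
print(vocab_fixed)  # ['A', 'B']
```

A and B occur in both documents, so they exceed both thresholds and are treated as corpus-specific stop words. Note the difference between `max_df=1.0` (a proportion, keeps everything) and `max_df=1` (an absolute count).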
(2)max_features:
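If not None, the vocabulary is built from only the top max_features terms, ordered by term frequency across the corpus. This parameter is ignored if vocabulary is not None.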
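A short sketch of max_features on the same toy corpus, again assuming scikit-learn is installed and using illustrative variable names. Across the corpus, A and B each occur 3 times while C, D and E occur once, so limiting the vocabulary to 2 features keeps A and B.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus from the text, one string per document
docs = ["A B C D A", "B E A B"]

# Keep single-character, case-sensitive tokens (the defaults would drop them)
common = dict(lowercase=False, token_pattern=r"(?u)\b\w+\b")

# Keep only the 2 terms with the highest total count across the corpus
vectorizer = TfidfVectorizer(max_features=2, **common)
matrix = vectorizer.fit_transform(docs)

vocab_top2 = sorted(vectorizer.vocabulary_)
print(vocab_top2)    # ['A', 'B']
print(matrix.shape)  # (2, 2): 2 documents, 2 features
```

The resulting document vectors have length 2 instead of 5, since the dictionary now contains only the retained terms.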