Information Retrieval (11): Text Classification

This chapter overlaps heavily with other courses, so only the non-overlapping parts are written up here.

1 Document classification

The classification is often hierarchical, with categories nested into subcategories.
Steps of text classification:

  • Indexing
  • Dimension reduction
  • Weighting
  • Classifier Evaluation/Optimization
1.1 Indexing
  • Word (for Chinese, this requires word segmentation)
  • Character
  • N-gram
  • Phrase (identified syntactically or statistically)
  • Concept: Car, Bus … (from WordNet, HowNet)
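
The word, character, and n-gram units listed above can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and whitespace splitting is assumed as the word tokenizer, which works for English but not for Chinese, where a real word segmenter would be needed.

```python
def word_terms(text):
    """Index by whitespace-separated, lowercased words (assumes English-like text)."""
    return text.lower().split()


def char_terms(text):
    """Index by individual characters, skipping whitespace."""
    return [ch for ch in text if not ch.isspace()]


def ngram_terms(tokens, n):
    """Index by contiguous n-grams over any token sequence (words or characters)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


doc = "the bus hit the car"
words = word_terms(doc)                                  # word units
word_bigrams = ngram_terms(words, 2)                     # word 2-grams
char_trigrams = ngram_terms(char_terms("信息检索"), 3)    # character 3-grams
```

Character n-grams are a common fallback for Chinese text because they need no segmenter, at the cost of a larger and noisier feature set.
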
1.2 Dimension reduction
  • Reduces training time: training time for some methods is quadratic or worse in the number of features (e.g. KNN, SVM)
  • Can improve generalization (performance): eliminates noise features and avoids overfitting

Types of reduction:

  • Feature Selection: select a subset of the original feature set according to a score function and a threshold. This addresses non-informative features and problem scale.
  • Feature Extraction (Reparameterization): create a new feature space from the original features. This addresses dimensions that are not orthogonal (correlated features).
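
A minimal sketch of feature selection by score function and threshold follows. Document frequency is assumed as the score function purely for illustration (it is not a method named in these notes), and the function names and toy documents are hypothetical.

```python
from collections import Counter


def document_frequency(docs):
    """Score each term by the number of documents that contain it."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # count each term at most once per document
    return df


def select_features(docs, score, threshold):
    """Keep the terms whose score is at least `threshold`."""
    scores = score(docs)
    return {term for term, s in scores.items() if s >= threshold}


docs = [
    ["stock", "market", "falls"],
    ["team", "wins", "match"],
    ["stock", "prices", "rise"],
    ["team", "loses", "match"],
]
vocabulary = select_features(docs, document_frequency, threshold=2)
# Project every document onto the reduced feature set.
reduced_docs = [[t for t in tokens if t in vocabulary] for tokens in docs]
```

Any other score function with the same shape (term to number) can be passed in place of `document_frequency` without changing `select_features`.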

Feature selection: how?

  • Hypothesis-testing statistics: are we confident that the value of one categorical variable (whether a certain term occurs) is associated with the value of another (the class, e.g. sports/finance)? Example: the chi-square test.
  • Information theory: how much information does the value of one categorical variable give you about the value of another? Example: mutual information.
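
Both scores can be computed from a 2x2 contingency table of term occurrence against class membership. The sketch below assumes the standard formulas for the chi-square statistic and for mutual information in bits. The argument names (`n11` = documents that contain the term and belong to the class, `n10` = contain the term but not in the class, and so on) and the counts are made up for illustration.

```python
import math


def chi_square(n11, n10, n01, n00):
    """Chi-square statistic of a 2x2 term/class table (assumes non-zero marginals)."""
    n = n11 + n10 + n01 + n00
    numerator = n * (n11 * n00 - n10 * n01) ** 2
    denominator = (n11 + n01) * (n11 + n10) * (n10 + n00) * (n01 + n00)
    return numerator / denominator


def mutual_information(n11, n10, n01, n00):
    """Mutual information (in bits) between term occurrence and class membership."""
    n = n11 + n10 + n01 + n00
    term_totals = {1: n11 + n10, 0: n01 + n00}    # term present / absent
    class_totals = {1: n11 + n01, 0: n10 + n00}   # in class / not in class
    cells = {(1, 1): n11, (1, 0): n10, (0, 1): n01, (0, 0): n00}
    mi = 0.0
    for (t, c), count in cells.items():
        if count > 0:  # an empty cell contributes nothing (0 * log 0 = 0)
            mi += (count / n) * math.log2(n * count / (term_totals[t] * class_totals[c]))
    return mi


# A term that is clearly associated with the class.
chi2_assoc = chi_square(30, 10, 10, 50)           # about 34.03
mi_assoc = mutual_information(30, 10, 10, 50)

# A term that occurs independently of the class: both scores are zero.
chi2_indep = chi_square(20, 20, 30, 30)
mi_indep = mutual_information(20, 20, 30, 30)
```

In either case the terms are ranked by score and only those above a threshold (or the top k) are kept as features.
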
1.3 Weighting

The well-known tf*idf:

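  • The higher the tf (term frequency within a document), the more important the term is to that document;
  • The higher the df (document frequency across the collection), the less important the term is.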
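
The two rules above can be made concrete with a small sketch. The notes do not fix a formula, so this assumes one common variant, raw term frequency multiplied by log(N / df). The function name and the toy documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document, with weight = tf * log(N / df)."""
    n_docs = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # document frequency: one count per document
    weights = []
    for tokens in docs:
        tf = Counter(tokens)    # term frequency within this document
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights


docs = [
    ["goal", "goal", "match", "the"],
    ["stock", "market", "the"],
    ["match", "the", "team"],
]
weights = tf_idf(docs)
# In the first document, "goal" (tf = 2, df = 1) outweighs "match" (tf = 1, df = 2),
# and "the" (present in every document) gets weight 0.
```

Other variants dampen tf with a logarithm or smooth the idf term; the ranking intuition stays the same.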

Others …

Classifier
  • Bayesian classifier
  • Decision tree classifier
  • Rocchio
  • Neural network classifier
  • KNN
  • SVM Classifier
  • Committee (Boosting)

These classifiers were all covered in the machine learning course, so they are not covered again here.

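Copyright notice: this is an original article by weixin_41332009, licensed under CC 4.0 BY-SA. When reposting, please include a link to the original source and this notice.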