This chapter overlaps heavily with other courses, so these notes cover only the parts that do not overlap.
1 Document classification
The classification is often hierarchical.
Steps of text classification:
- Indexing
- Dimension reduction
- Weighting
- Classifier Evaluation/Optimization
1.1 Indexing
- Word (for Chinese text this requires word segmentation)
- Character
- N-gram
- Phrase (identified syntactically or statistically)
- Concept: Car, Bus … (from resources such as WordNet, HowNet)
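The first three indexing units above can be sketched in a few lines. This is only an illustration: the function names are made up for this example, and whitespace splitting stands in for a real tokenizer (Chinese text would need a word segmenter instead).

```python
def word_tokens(text):
    """Index by word: lowercase and split on whitespace (hypothetical helper)."""
    return text.lower().split()


def char_ngrams(text, n):
    """Index by character n-gram: every run of n consecutive characters.

    With n=1 this is plain character indexing.
    """
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def word_ngrams(tokens, n):
    """Index by word n-gram: every run of n consecutive tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


doc = "The car hit the bus"
print(word_tokens(doc))
print(char_ngrams("bus", 2))
print(word_ngrams(word_tokens(doc), 2))
```

Character n-grams need no segmentation step, which is one reason they are convenient for Chinese: `char_ngrams("分类器", 2)` yields `["分类", "类器"]`.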
1.2 Dimension reduction
- Reduces training time: training time for some methods is quadratic or worse in the number of features (e.g. KNN, SVM)
- Can improve generalization (performance): eliminates noise features and helps avoid overfitting
Types of reduction:
- Feature selection: select a subset of the original feature set according to a score function and a threshold. This addresses non-informative features and problem scale.
- Feature extraction (reparameterization): create a new feature space. This addresses dimensions that are not orthogonal.
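Feature selection as described above (a score function plus a threshold) might look like the following minimal sketch. Document frequency is used as the score purely as an assumption for illustration; the notes do not prescribe a particular score here, and `document_frequency` and `select_features` are hypothetical names.

```python
from collections import Counter


def document_frequency(docs):
    """Score function: the number of documents that contain each term."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # count each term at most once per document
    return df


def select_features(docs, score_fn, threshold):
    """Keep the terms whose score is at least `threshold`."""
    scores = score_fn(docs)
    return {term for term, score in scores.items() if score >= threshold}


docs = [
    ["cheap", "car", "sale"],
    ["car", "bus", "route"],
    ["bus", "car", "crash"],
]
vocab = select_features(docs, document_frequency, threshold=2)
print(sorted(vocab))  # ['bus', 'car']
```

Because `score_fn` is passed in, the same selection step works with any other per-term score.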
Feature selection: how?
- Hypothesis-testing statistics: are we confident that the value of one categorical variable (the presence of a certain term) is associated with the value of another (the class, e.g. sports/finance)? Example: the chi-square test.
- Information theory: how much information does the value of one categorical variable give you about the value of another? Example: mutual information.
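Both scores above can be computed from a 2x2 contingency table of term presence against class membership. The sketch below uses the standard 2x2 chi-square statistic and the standard mutual information of two binary variables. The names `n11`, `n10`, `n01`, `n00` (documents with term and in class, with term and not in class, and so on) and the toy data are assumptions made for this example.

```python
import math


def contingency(docs, labels, term, cls):
    """Count documents by (contains term?, belongs to class?)."""
    n11 = n10 = n01 = n00 = 0
    for tokens, label in zip(docs, labels):
        has_term = term in tokens
        in_class = label == cls
        if has_term and in_class:
            n11 += 1
        elif has_term:
            n10 += 1
        elif in_class:
            n01 += 1
        else:
            n00 += 1
    return n11, n10, n01, n00


def chi_square(n11, n10, n01, n00):
    """Chi-square statistic for a 2x2 table; 0.0 if a row or column is empty."""
    n = n11 + n10 + n01 + n00
    denom = (n11 + n01) * (n11 + n10) * (n10 + n00) * (n01 + n00)
    if denom == 0:
        return 0.0
    return n * (n11 * n00 - n10 * n01) ** 2 / denom


def mutual_information(n11, n10, n01, n00):
    """Mutual information (in bits) between term presence and class membership."""
    n = n11 + n10 + n01 + n00
    cells = [
        # (joint count, term-side total, class-side total)
        (n11, n11 + n10, n11 + n01),
        (n10, n11 + n10, n10 + n00),
        (n01, n01 + n00, n11 + n01),
        (n00, n01 + n00, n10 + n00),
    ]
    mi = 0.0
    for joint, term_total, class_total in cells:
        if joint > 0:  # empty cells contribute nothing
            mi += joint / n * math.log2(n * joint / (term_total * class_total))
    return mi


docs = [["goal", "match"], ["goal", "team"], ["stock", "market"], ["stock", "bank"]]
labels = ["sports", "sports", "finance", "finance"]
table = contingency(docs, labels, "goal", "sports")
print(table, chi_square(*table), mutual_information(*table))
```

A term that is independent of the class scores 0 under both measures, so ranking terms by either score and cutting at a threshold drops the non-informative ones.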
1.3 Weighting
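The well-known tf*idf weighting:
- The higher the term frequency (tf) in a document, the more important the term is to that document;
- The higher the document frequency (df) across the collection, the less important the term is.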
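The two rules above can be sketched as follows. The notes do not give an exact formula, so this assumes one common variant, weight = tf * log(N / df), where N is the number of documents; `tf_idf` is a hypothetical helper name.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    n_docs = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # document frequency: one count per document
    weights = []
    for tokens in docs:
        tf = Counter(tokens)  # raw term frequency within this document
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights


docs = [["car", "car", "bus"], ["bus", "train"]]
for doc_weights in tf_idf(docs):
    print(doc_weights)
```

In this toy collection "bus" appears in every document, so its idf is log(1) = 0 and it gets zero weight, while "car" is boosted by appearing twice in a single document.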
Others …
Classifier
- Bayesian classifier
- Decision tree classifier
- Rocchio
- Neural network classifier
- KNN
- SVM Classifier
- Committee (boosting)
These classifiers were all covered in the machine learning course, so they are not discussed here.
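Copyright notice: this is an original article by weixin_41332009, released under the CC 4.0 BY-SA license. When reposting, please include a link to the original source and this notice.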