Clustered Files
Some problems may arise:
- (1) Some clusters may be too small (for example, a single loose document forming a cluster by itself);
- (2) Some clusters may become too large (if the documents are more homogeneous);
- (3) The number of clusters produced may become excessively large if (1) and (2) are not handled properly;
- (4) Clusters may overlap.
One remedy: if a node has too many children, split it (similar to a B-tree).
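The splitting idea can be sketched as follows. This is only an illustration: the `ClusterNode` class, the `split_if_overfull` function, and the `max_children` limit are hypothetical names, and cutting the child list in half is an assumed stand-in for whatever similarity-based split a real clustering system would use.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ClusterNode:
    # Hypothetical cluster-tree node: a label plus child clusters or documents.
    label: str
    children: List["ClusterNode"] = field(default_factory=list)


def split_if_overfull(parent: ClusterNode, index: int, max_children: int) -> bool:
    """Split parent.children[index] into two siblings if it has too many children.

    Assumption: the children are simply cut in half by position, the way a
    B-tree splits an overfull node. A real system would group by similarity.
    """
    node = parent.children[index]
    if len(node.children) <= max_children:
        return False
    mid = len(node.children) // 2
    left = ClusterNode(node.label + "-a", node.children[:mid])
    right = ClusterNode(node.label + "-b", node.children[mid:])
    # Replace the overfull node with its two halves in the parent.
    parent.children[index:index + 1] = [left, right]
    return True


docs = [ClusterNode(f"doc{i}") for i in range(5)]
root = ClusterNode("root", [ClusterNode("sports", docs)])
changed = split_if_overfull(root, 0, max_children=4)
print(changed, [(c.label, len(c.children)) for c in root.children])
# True [('sports-a', 2), ('sports-b', 3)]
```

As in a B-tree, the split adds a child to the parent, so the parent may in turn need to be checked against the same limit.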
Treatment of overlap among clusters:
- If precision comes first (non-relevant items must be avoided): the cluster structure should consist of a large number of small, disjoint clusters, because you do not want to introduce any irrelevant items.
- If recall comes first: partly overlapping clusters are more effective.
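Relevance Feedback
In the first round, the user marks each result as relevant or not relevant; the system then uses this feedback to improve the results. The process can be repeated for several iterations.
Example: the first search is for "bike".
The user marks the results.
Results of the next search:
How it works: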
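(1) Method 1: Standard Rocchio Method
- Start with an initial query Q (term1, term2, term3, …, termk)
- Use the known relevant ($D_r$) and irrelevant ($D_n$) sets of documents, and include the initial query Q
- In its standard form, the new query is:

$$Q' = \alpha Q + \frac{\beta}{|D_r|}\sum_{d_j \in D_r} d_j - \frac{\gamma}{|D_n|}\sum_{d_j \in D_n} d_j$$

where $\alpha$, $\beta$, and $\gamma$ are tunable weights for the original query, the relevant documents, and the irrelevant documents.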
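An example:
Suppose the original query Q has weights (5, 0, 3, 0, 1) over 5 terms. The results the user marked as "relevant" contain the 5 terms with frequencies (2, 1, 2, 0, 0), and the results marked as "not relevant" contain them with frequencies (1, 0, 0, 0, 2). Following the formula above, the improved Q' is computed as follows:
Geometrically, relevance feedback looks like this: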
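The calculation can be sketched in a few lines of Python. Several things here are assumptions rather than values taken from the text: the weights alpha = 1, beta = 0.5, gamma = 0.25 are a commonly used choice, each frequency vector is treated as a single document, and the function name `rocchio` is only illustrative. The update computed is alpha·Q plus beta times the centroid of the relevant documents, minus gamma times the centroid of the irrelevant ones.

```python
def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.5, gamma=0.25):
    """Standard Rocchio update; the default weights are assumed, not from the source."""

    def centroid(docs):
        # Mean vector of a set of documents; the zero vector if the set is empty.
        if not docs:
            return [0.0] * len(query)
        return [sum(column) / len(docs) for column in zip(*docs)]

    rel_c = centroid(relevant)
    non_c = centroid(nonrelevant)
    return [alpha * q + beta * r - gamma * n
            for q, r, n in zip(query, rel_c, non_c)]


q = [5, 0, 3, 0, 1]
relevant_docs = [[2, 1, 2, 0, 0]]      # one relevant document (assumed)
nonrelevant_docs = [[1, 0, 0, 0, 2]]   # one irrelevant document (assumed)
new_q = rocchio(q, relevant_docs, nonrelevant_docs)
print(new_q)  # [5.75, 0.5, 4.0, 0.0, 0.5]
```

With other weights the numbers change; a negative component, if one appears, is often clipped to zero in practice.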
(2) Method 2: Ide Regular Method
Since more feedback should perhaps increase the degree of reformulation, do not normalize for amount of feedback.
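A minimal sketch of this variant, assuming the usual Ide Regular form in which the relevant and irrelevant document vectors are summed rather than averaged. The function name `ide_regular`, the unit weights, and the sample vectors are illustrative assumptions.

```python
def ide_regular(query, relevant, nonrelevant, alpha=1.0, beta=1.0, gamma=1.0):
    """Ide Regular update: add and subtract raw document vectors, with no
    division by the number of feedback documents."""
    new_q = [alpha * q for q in query]
    for doc in relevant:
        new_q = [w + beta * d for w, d in zip(new_q, doc)]
    for doc in nonrelevant:
        new_q = [w - gamma * d for w, d in zip(new_q, doc)]
    return new_q


q = [5, 0, 3, 0, 1]
rel = [2, 1, 2, 0, 0]
one_doc = ide_regular(q, [rel], [])
two_docs = ide_regular(q, [rel, rel], [])
print(one_doc)   # [7.0, 1.0, 5.0, 0.0, 1.0]
print(two_docs)  # [9.0, 2.0, 7.0, 0.0, 1.0]
```

The second call shows the intended effect: two relevant documents move the query twice as far as one, whereas an averaged update would leave it unchanged.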
Comparison of Methods
Overall, experimental results indicate no clear preference for any one of the specific methods.
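However, asking users to mark results every time is far too much of a burden for them. Implicit feedback is therefore now widely used: for example, results the user clicks on, or results the user spends a long time viewing, are treated as positive feedback.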
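As a rough illustration, implicit positive feedback could be derived from an interaction log like this. The log format `(doc_id, clicked, dwell_seconds)`, the function name, and the 30-second threshold are all hypothetical choices made for the sketch.

```python
def implicit_positive_feedback(events, min_dwell_seconds=30.0):
    """Return the ids of documents treated as positive feedback.

    A document counts as positive if it was clicked or was viewed for at
    least min_dwell_seconds (an assumed threshold).
    """
    positives = []
    for doc_id, clicked, dwell_seconds in events:
        if clicked or dwell_seconds >= min_dwell_seconds:
            if doc_id not in positives:  # keep first-seen order, no duplicates
                positives.append(doc_id)
    return positives


log = [
    ("d1", True, 5.0),
    ("d2", False, 2.0),
    ("d3", False, 45.0),
    ("d1", True, 60.0),
]
print(implicit_positive_feedback(log))  # ['d1', 'd3']
```

The resulting document ids can then play the role of the relevant set in a Rocchio-style query update, with no explicit marking by the user.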
SMART System Procedure - a summary of the preceding material
Typical SMART automatic indexing process:

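Copyright notice: this is an original article by weixin_41332009, licensed under CC 4.0 BY-SA. When reposting, please include a link to the original source and this notice.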