Information Retrieval (9): The SMART Experimental Information Retrieval System and Its Features

Some problems may arise:

(1) Some clusters may be too small (a single document per cluster, i.e. a "loose" document);
(2) Some clusters may become too large (if the documents are more homogeneous);
(3) The number of clusters produced may become excessively large if (1) and (2) are not properly handled;
(4) Clusters may overlap.

One solution: if a node has too many children, it can be split (similar to a B-tree).
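The split idea can be sketched as follows. This is only an illustration under stated assumptions: documents are term-weight vectors compared by cosine similarity, and `MAX_CHILDREN`, `split_node`, and the seed choice (the two least similar children) are hypothetical names and heuristics, not SMART's actual procedure.

```python
import math

MAX_CHILDREN = 4  # hypothetical capacity of one cluster node


def cosine(a, b):
    """Cosine similarity of two equal-length term-weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def split_node(children):
    """Split an over-full node in two; return it unchanged if it fits."""
    if len(children) <= MAX_CHILDREN:
        return [children]
    # Seeds: the pair of children that are least similar to each other.
    pairs = [(i, j) for i in range(len(children))
             for j in range(i + 1, len(children))]
    s1, s2 = min(pairs, key=lambda p: cosine(children[p[0]], children[p[1]]))
    left, right = [], []
    for doc in children:
        # Each child joins the seed it is more similar to (ties go left).
        if cosine(doc, children[s1]) >= cosine(doc, children[s2]):
            left.append(doc)
        else:
            right.append(doc)
    return [left, right]


docs = [(1, 0), (2, 0), (3, 1), (0, 1), (0, 2), (1, 3)]
groups = split_node(docs)
print(groups)
```

In this toy input the six children separate into one group dominated by the first term and one dominated by the second, so each resulting node is back under the capacity limit.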
Treatment of overlap among clusters:

If precision comes first (non-relevant items must be avoided): the cluster structure should consist of a large number of small, disjoint clusters, because you do not want to introduce any irrelevant items.
If recall comes first: partly overlapping clusters are more effective.
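The two policies can be contrasted with a small sketch. It assumes clusters are represented by centroid vectors and that membership is decided by cosine similarity; the function name `assign` and the threshold value are hypothetical and chosen only for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length term-weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def assign(doc, centroids, threshold=None):
    """Return the indices of the clusters that `doc` is placed in.

    threshold=None   -> disjoint: only the single best-matching cluster.
    threshold=float  -> overlapping: every cluster at or above the threshold
                        (falling back to the best one if none qualifies).
    """
    sims = [cosine(doc, c) for c in centroids]
    best = max(range(len(sims)), key=sims.__getitem__)
    if threshold is None:
        return [best]
    hits = [i for i, s in enumerate(sims) if s >= threshold]
    return hits or [best]


centroids = [(1, 0, 0), (0, 1, 0), (0, 0, 1)]
doc = (2, 2, 0)
disjoint = assign(doc, centroids)                    # precision-first
overlapping = assign(doc, centroids, threshold=0.5)  # recall-first
print(disjoint, overlapping)
```

The document is equally close to the first two centroids. The disjoint policy keeps it in one cluster only, while the overlapping policy places it in both, so a query matching either cluster can still reach it.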

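With relevance feedback, the user first marks each result as relevant or non-relevant, and the system then uses that feedback to improve the results. This can be repeated for several iterations.
For example, the first search is for "bike":
[Figure: initial search results for "bike"]
The user marks the results.

The next round then returns results refined by that feedback.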

How it works:
(1) Method 1: Standard Rocchio Method

Start with an initial query Q (term1, term2, term3, …, termk).
Use the known relevant ( $D_r$ ) and non-relevant ( $D_n$ ) sets of documents, and include the initial query Q.
The new query would be:
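In its commonly cited textbook form, the Rocchio update can be written as below. The weights $\alpha$, $\beta$, $\gamma$ are tuning constants whose values the text does not state, so they are left symbolic here; $|D_r|$ and $|D_n|$ denote the number of documents in each set, and $d_j$ is a document vector.

$$
Q' = \alpha Q + \frac{\beta}{|D_r|} \sum_{d_j \in D_r} d_j - \frac{\gamma}{|D_n|} \sum_{d_j \in D_n} d_j
$$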

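An example:
The original query Q has weights (5,0,3,0,1) over 5 terms. The result the user marked as "relevant" has term frequencies (2,1,2,0,0), and the result marked as "non-relevant" has term frequencies (1,0,0,0,2). According to the formula above, the improved Q' is computed as follows: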
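The computation can be sketched as follows, assuming the standard Rocchio form (the weighted query, plus the weighted mean of the relevant vectors, minus the weighted mean of the non-relevant vectors) and the illustrative weights $\alpha = 1$, $\beta = 0.5$, $\gamma = 0.25$. These weights are an assumption, since the text does not state them. With one relevant and one non-relevant document this gives $Q' = (5,0,3,0,1) + 0.5\,(2,1,2,0,0) - 0.25\,(1,0,0,0,2) = (5.75,\ 0.5,\ 4,\ 0,\ 0.5)$.

```python
def rocchio(query, relevant, non_relevant, alpha=1.0, beta=0.5, gamma=0.25):
    """Rocchio update; the default weights are illustrative assumptions."""

    def centroid(docs):
        # Mean vector of a document set (all zeros if the set is empty).
        if not docs:
            return [0.0] * len(query)
        return [sum(col) / len(docs) for col in zip(*docs)]

    rel = centroid(relevant)
    non = centroid(non_relevant)
    return [alpha * q + beta * r - gamma * n
            for q, r, n in zip(query, rel, non)]


new_query = rocchio(
    query=[5, 0, 3, 0, 1],
    relevant=[[2, 1, 2, 0, 0]],
    non_relevant=[[1, 0, 0, 0, 2]],
)
print(new_query)  # [5.75, 0.5, 4.0, 0.0, 0.5]
```

Terms that occur in the relevant document (terms 1 to 3) gain weight, while the term that occurs mainly in the non-relevant document (term 5) loses weight.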

Geometrically, relevance feedback moves the query vector toward the relevant documents and away from the non-relevant ones.

(2) Method 2: Ide Regular Method
Since more feedback should perhaps increase the degree of reformulation, do not normalize for the amount of feedback.
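A minimal sketch of this variant is below. It assumes the usual statement of Ide regular, $Q' = \alpha Q + \beta \sum_{d_j \in D_r} d_j - \gamma \sum_{d_j \in D_n} d_j$, which is the Rocchio update without the division by $|D_r|$ and $|D_n|$. The default weights of 1 are an illustrative assumption.

```python
def ide_regular(query, relevant, non_relevant,
                alpha=1.0, beta=1.0, gamma=1.0):
    """Ide regular update: sums, not means, of the feedback vectors."""

    def total(docs):
        # Component-wise sum of a document set (all zeros if empty).
        if not docs:
            return [0.0] * len(query)
        return [sum(col) for col in zip(*docs)]

    rel = total(relevant)
    non = total(non_relevant)
    return [alpha * q + beta * r - gamma * n
            for q, r, n in zip(query, rel, non)]


q = [5, 0, 3, 0, 1]
d_rel = [2, 1, 2, 0, 0]
d_non = [1, 0, 0, 0, 2]

one_vote = ide_regular(q, [d_rel], [d_non])
two_votes = ide_regular(q, [d_rel, d_rel], [d_non])
print(one_vote)   # [6.0, 1.0, 5.0, 0.0, -1.0]
print(two_votes)  # [8.0, 2.0, 7.0, 0.0, -1.0]
```

Marking a second relevant document with the same vector shifts the query further, which is the intended effect of not normalizing. A mean-based update would give the same result for one or two identical relevant documents.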

Comparison of Methods
Overall, experimental results indicate no clear preference for any one of the specific methods.

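However, asking users to mark results every time is far too painful for them. So implicit feedback is now widely used instead: for example, results that the user clicks on, or spends a long time viewing, are treated as positive feedback.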
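As a rough sketch, implicit positives could be derived from an interaction log like this. The `Interaction` record, its field names, and the 30-second dwell threshold are all hypothetical choices made for illustration; a real system would tune or learn them.

```python
from dataclasses import dataclass

DWELL_THRESHOLD_S = 30.0  # hypothetical cut-off for a "long" view


@dataclass
class Interaction:
    doc_id: str
    clicked: bool
    dwell_seconds: float


def implicit_positives(log, dwell_threshold=DWELL_THRESHOLD_S):
    """Return doc ids treated as positive feedback, in first-seen order."""
    positives = []
    for event in log:
        is_positive = event.clicked or event.dwell_seconds >= dwell_threshold
        if is_positive and event.doc_id not in positives:
            positives.append(event.doc_id)
    return positives


log = [
    Interaction("d1", clicked=True, dwell_seconds=5.0),
    Interaction("d2", clicked=False, dwell_seconds=2.0),
    Interaction("d3", clicked=False, dwell_seconds=45.0),
    Interaction("d1", clicked=True, dwell_seconds=60.0),
]
print(implicit_positives(log))  # ['d1', 'd3']
```

The resulting ids can play the role of the relevant set $D_r$ in either update method, with no explicit marking by the user.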

[Figure: typical SMART automatic indexing process]