During this week's lessons, you will learn how to evaluate an information retrieval system (a search engine). This includes the basic measures for evaluating a set of retrieved results, the major measures for evaluating a ranked list, such as average precision (AP) and normalized discounted cumulative gain (nDCG), and practical issues in evaluation, including statistical significance testing and pooling.
Keep your eyes open for the following key terms or phrases as you complete the readings and interact with the lectures. These topics will help you better understand the content in this module.
After you actively engage in the learning experiences in this module, you should be able to:
Develop your answers to the following guiding questions while completing the readings and working on assignments throughout the week.
Let:
Suppose:
Then:
Consider this matrix:
Doc \ Action | Retrieved | Not Retrieved
---|---|---
Relevant | a | b
Not Relevant | c | d
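To make the measures concrete, precision and recall can be read directly off this table; this is just a restatement of the standard definitions in terms of the cell labels $a$, $b$, $c$, $d$ above:

$$\text{precision} = \frac{a}{a+c}, \qquad \text{recall} = \frac{a}{a+b}$$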
Let:
Suppose:
Doc (judgment) | Precision | Recall
---|---|---
$D_1+$ | 1/1 | 1/10
$D_2+$ | 2/2 | 2/10
$D_3-$ | 2/3 | 2/10
$D_4-$ | 2/4 | 2/10
$D_5+$ | 3/5 | 3/10
$D_6-$ | 3/6 | 3/10
$D_7-$ | 3/7 | 3/10
$D_8+$ | 4/8 | 4/10
$D_9-$ | 4/9 | 4/10
$D_{10}-$ | 4/10 | 4/10
Then, averaging the precision at the ranks where a relevant document is retrieved (ranks 1, 2, 5, and 8) over the 10 relevant documents in the collection, the average precision of this ranked list is $(1/1 + 2/2 + 3/5 + 4/8)/10 = 0.31$.
For a single system and a single query, the average precision of a ranked list $L$ of $n$ documents is:

$$avp(L) = \frac{1}{|Rel|}\sum_{i=1}^n p(i)$$

where $|Rel|$ is the total number of relevant documents in the collection and

$$p(i) = \begin{cases} 0, & \text{if } D_i \text{ is judged as not relevant}\\ \frac{\text{number of relevant documents in the top } i}{i}, & \text{if } D_i \text{ is judged as relevant} \end{cases}$$
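To make the formula concrete, here is a minimal Python sketch (my own illustration; the function and variable names are not from the course materials) that computes average precision from a list of binary judgments. Applied to the example ranked list above, it reproduces the value 0.31:

```python
def average_precision(judgments, total_relevant):
    """Average precision of a ranked list.

    judgments: 0/1 relevance judgments in rank order.
    total_relevant: number of relevant documents in the whole collection (|Rel|).
    """
    hits = 0            # relevant documents seen so far
    precision_sum = 0.0
    for rank, rel in enumerate(judgments, start=1):
        if rel:
            hits += 1
            precision_sum += hits / rank   # p(i) at each relevant document
    # Relevant documents that were never retrieved contribute 0 to the sum.
    return precision_sum / total_relevant

# Relevant documents at ranks 1, 2, 5, and 8; |Rel| = 10.
print(average_precision([1, 1, 0, 0, 1, 0, 0, 1, 0, 0], 10))  # ~0.31
```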
When a system is evaluated over multiple queries, the Mean Average Precision (MAP) is the arithmetic mean of the average precisions over those queries (or topics). Let $\mathcal{L} = L_1, L_2, ..., L_m$ be the ranked lists returned for $m$ different queries. Then:

$$MAP(\mathcal{L}) = \frac{1}{m} \sum\limits_{i=1}^m avp(L_i)$$

The geometric Mean Average Precision (gMAP) is more sensitive than MAP to the queries whose average precision falls far below the average, so it better reflects improvements on hard queries. It is defined as:

$$gMAP(\mathcal{L}) = \Big( \prod\limits_{i=1}^m avp(L_i) \Big)^{\frac{1}{m}}$$

or, equivalently, in log space:

$$gMAP(\mathcal{L}) = \exp \Big( \frac{1}{m} \sum\limits_{i=1}^m \ln avp(L_i) \Big)$$

The reciprocal rank is a special case of average precision for a query with exactly one relevant document in the collection: the average precision then equals $\frac{1}{r}$, where $r$ is the rank at which that single relevant document is retrieved. Its arithmetic mean over queries is the mean reciprocal rank (MRR).
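The relationship between MAP and gMAP is easy to see in code. This is a minimal sketch (my own; the per-query AP scores are made-up numbers for illustration):

```python
import math

def mean_average_precision(avp_scores):
    """Arithmetic mean of per-query average precision scores (MAP)."""
    return sum(avp_scores) / len(avp_scores)

def geometric_map(avp_scores):
    """Geometric mean of per-query average precision scores (gMAP).

    Computed in log space; assumes every score is strictly positive,
    since a single zero would drive the geometric mean to zero.
    """
    return math.exp(sum(math.log(s) for s in avp_scores) / len(avp_scores))

# Hypothetical AP scores for five queries, one of which is very hard.
scores = [0.60, 0.55, 0.70, 0.65, 0.05]
print(mean_average_precision(scores))  # 0.51
print(geometric_map(scores))           # ~0.38, pulled down by the hard query
```

The hard query barely moves MAP but sharply lowers gMAP, which is exactly the sensitivity described above.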
Comparing two retrieval methods by precision at a fixed cutoff of $k$ documents can be misleading: the two methods generally place different documents in their top $k$, and a single cutoff ignores where the relevant documents actually appear in the ranking. MAP is more useful for comparing methods because it aggregates the precision at every rank where a relevant document is retrieved, so it characterizes the entire ranked list. In this sense, average precision can be interpreted as the precision a user can expect to achieve with a given retrieval method.
Since documents are ranked by their estimated probability of relevance, a user tends to examine only the top $k$ documents, and how far down the list a user is willing to go is a subjective preference. In a question-answering search engine, moreover, the top result is expected to be the correct answer, so the very top of the ranking matters most.
For multi-level (graded) relevance judgments, use Cumulative Gain (CG) and Discounted Cumulative Gain (DCG):
Let:
Then:
Let:
Then:
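Written out, one common formulation of these measures (my restatement, not copied from the slides) discounts the gain at rank $i$ by $\log_2 i$; here $r_i$ is the graded relevance of the document at rank $i$ and $k$ is the cutoff:

$$CG@k = \sum_{i=1}^{k} r_i, \qquad DCG@k = r_1 + \sum_{i=2}^{k} \frac{r_i}{\log_2 i}, \qquad nDCG@k = \frac{DCG@k}{IDCG@k}$$

where $IDCG@k$ is the $DCG@k$ of the ideal ranking, i.e. the documents sorted by decreasing relevance.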
DCG values by themselves are not directly comparable across queries, because queries differ in how many highly relevant documents exist. nDCG therefore normalizes DCG by the ideal DCG, the DCG of the best possible ranking, which puts every query on the same scale and ensures comparability across queries.
MAP does not need such a normalization, because the average precision of an ideal ranked list is always 1.
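Here is a minimal Python sketch of DCG and nDCG (my own, using the base-2 log discount above; the graded judgments in the example are made up for illustration):

```python
import math

def dcg(gains):
    """DCG of a ranked list of graded relevance values (no discount at rank 1)."""
    return sum(g if i == 1 else g / math.log2(i)
               for i, g in enumerate(gains, start=1))

def ndcg(gains):
    """nDCG: DCG divided by the DCG of the ideal (relevance-sorted) ranking."""
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0

# Hypothetical grades: 2 = highly relevant, 1 = marginally relevant, 0 = not relevant.
ranking = [2, 1, 0, 2, 0]
print(ndcg(ranking))  # below 1.0 because a highly relevant document sits at rank 4
```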
Statistical significance tests provide a way to assess the variance of the average precision scores across different queries. If the variance is large, the comparison could easily come out differently on another set of queries, which makes an observed difference in overall scores unreliable on its own.
One popular statistical significance test is the Wilcoxon signed-rank test.
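In practice such a test is run on paired per-query scores. A minimal sketch using SciPy's `scipy.stats.wilcoxon` is shown below; the two lists of average precision scores are made-up numbers, and the 0.05 threshold is only a conventional choice:

```python
from scipy.stats import wilcoxon

# Hypothetical per-query average precision for two systems on the same 8 queries.
system_a = [0.31, 0.45, 0.12, 0.60, 0.27, 0.50, 0.33, 0.41]
system_b = [0.36, 0.44, 0.21, 0.67, 0.30, 0.58, 0.35, 0.47]

# Paired signed-rank test on the per-query differences.
statistic, p_value = wilcoxon(system_a, system_b)
print(f"W = {statistic}, p = {p_value:.3f}")

if p_value < 0.05:
    print("The difference between the two systems is statistically significant.")
else:
    print("The observed difference could be due to query-to-query variation.")
```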