During this week's lessons, you will learn how to evaluate an information retrieval system (a search engine). This includes the basic measures for evaluating a set of retrieved results, the major measures for evaluating a ranked list, such as average precision (AP) and normalized discounted cumulative gain (nDCG), and practical issues in evaluation, including statistical significance testing and pooling.
Keep your eyes open for the following key terms or phrases as you complete the readings and interact with the lectures. These topics will help you better understand the content in this module.
After you actively engage in the learning experiences in this module, you should be able to:
Develop your answers to the following guiding questions while completing the readings and working on assignments throughout the week.
Let:
Suppose:
Then:
Consider this matrix:
Doc \ Action | Retrieved | Not Retrieved
---|---|---
Relevant | a | b
Not Relevant | c | d
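To make the measures concrete, precision and recall can be read directly off this table; this is just a restatement of the standard definitions in terms of the cell labels $a$, $b$, $c$, $d$ above:

$$\text{precision} = \frac{a}{a+c}, \qquad \text{recall} = \frac{a}{a+b}$$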
Let:
Suppose:
Doc (judgment) | Precision | Recall
---|---|---
$D_1+$ | 1/1 | 1/10
$D_2+$ | 2/2 | 2/10
$D_3-$ | 2/3 | 2/10
$D_4-$ | 2/4 | 2/10
$D_5+$ | 3/5 | 3/10
$D_6-$ | 3/6 | 3/10
$D_7-$ | 3/7 | 3/10
$D_8+$ | 4/8 | 4/10
$D_9-$ | 4/9 | 4/10
$D_{10}-$ | 4/10 | 4/10
Then, averaging the precision at the ranks where a relevant document is retrieved (ranks 1, 2, 5, and 8) over the 10 relevant documents in the collection, the average precision of this ranked list is $(1/1 + 2/2 + 3/5 + 4/8)/10 = 0.31$.
For a single system and a single query, the average precision of a ranked list $L$ of $n$ documents is:

$$avp(L) = \frac{1}{|Rel|}\sum_{i=1}^n p(i)$$

where $|Rel|$ is the total number of relevant documents in the collection and

$$p(i) = \begin{cases} 0, & \text{if } D_i \text{ is judged as not relevant}\\ \frac{\text{number of relevant documents in the top } i}{i}, & \text{if } D_i \text{ is judged as relevant} \end{cases}$$
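To make the formula concrete, here is a minimal Python sketch (my own illustration; the function and variable names are not from the course materials) that computes average precision from a list of binary judgments. Applied to the example ranked list above, it reproduces the value 0.31:

```python
def average_precision(judgments, total_relevant):
    """Average precision of a ranked list.

    judgments: 0/1 relevance judgments in rank order.
    total_relevant: number of relevant documents in the whole collection (|Rel|).
    """
    hits = 0            # relevant documents seen so far
    precision_sum = 0.0
    for rank, rel in enumerate(judgments, start=1):
        if rel:
            hits += 1
            precision_sum += hits / rank   # p(i) at each relevant document
    # Relevant documents that were never retrieved contribute 0 to the sum.
    return precision_sum / total_relevant

# Relevant documents at ranks 1, 2, 5, and 8; |Rel| = 10.
print(average_precision([1, 1, 0, 0, 1, 0, 0, 1, 0, 0], 10))  # ~0.31
```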
When a system is evaluated over multiple queries, the Mean Average Precision (MAP) is the arithmetic mean of the average precisions over those queries (or topics). Let $\mathcal{L} = L_1, L_2, ..., L_m$ be the ranked lists returned for $m$ different queries. Then:

$$MAP(\mathcal{L}) = \frac{1}{m} \sum\limits_{i=1}^m avp(L_i)$$

The geometric Mean Average Precision (gMAP) is more sensitive than MAP to the queries whose average precision falls far below the average, so it better reflects improvements on hard queries. It is defined as:

$$gMAP(\mathcal{L}) = \Big( \prod\limits_{i=1}^m avp(L_i) \Big)^{\frac{1}{m}}$$

or, equivalently, in log space:

$$gMAP(\mathcal{L}) = \exp \Big( \frac{1}{m} \sum\limits_{i=1}^m \ln avp(L_i) \Big)$$

The reciprocal rank is a special case of average precision for a query with exactly one relevant document in the collection: the average precision then equals $\frac{1}{r}$, where $r$ is the rank at which that single relevant document is retrieved. Its arithmetic mean over queries is the mean reciprocal rank (MRR).
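The relationship between MAP and gMAP is easy to see in code. This is a minimal sketch (my own; the per-query AP scores are made-up numbers for illustration):

```python
import math

def mean_average_precision(avp_scores):
    """Arithmetic mean of per-query average precision scores (MAP)."""
    return sum(avp_scores) / len(avp_scores)

def geometric_map(avp_scores):
    """Geometric mean of per-query average precision scores (gMAP).

    Computed in log space; assumes every score is strictly positive,
    since a single zero would drive the geometric mean to zero.
    """
    return math.exp(sum(math.log(s) for s in avp_scores) / len(avp_scores))

# Hypothetical AP scores for five queries, one of which is very hard.
scores = [0.60, 0.55, 0.70, 0.65, 0.05]
print(mean_average_precision(scores))  # 0.51
print(geometric_map(scores))           # ~0.38, pulled down by the hard query
```

The hard query barely moves MAP but sharply lowers gMAP, which is exactly the sensitivity described above.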
Comparing two retrieval methods by precision at a fixed cutoff of $k$ documents can be misleading: the two methods generally place different documents in their top $k$, and a single cutoff ignores where the relevant documents actually appear in the ranking. MAP is more useful for comparing methods because it aggregates the precision at every rank where a relevant document is retrieved, so it characterizes the entire ranked list. In this sense, average precision can be interpreted as the precision a user can expect to achieve with a given retrieval method.
Since documents are ranked by their estimated probability of relevance, a user tends to examine only the top $k$ documents, and how far down the list a user is willing to go is a subjective preference. In a question-answering search engine, moreover, the top result is expected to be the correct answer, so the very top of the ranking matters most.
For multi-level (graded) relevance judgments, use Cumulative Gain (CG) and Discounted Cumulative Gain (DCG):
Let:
Then:
Let:
Then:
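Written out, one common formulation of these measures (my restatement, not copied from the slides) discounts the gain at rank $i$ by $\log_2 i$; here $r_i$ is the graded relevance of the document at rank $i$ and $k$ is the cutoff:

$$CG@k = \sum_{i=1}^{k} r_i, \qquad DCG@k = r_1 + \sum_{i=2}^{k} \frac{r_i}{\log_2 i}, \qquad nDCG@k = \frac{DCG@k}{IDCG@k}$$

where $IDCG@k$ is the $DCG@k$ of the ideal ranking, i.e. the documents sorted by decreasing relevance.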
DCG values by themselves are not directly comparable across queries, because queries differ in how many highly relevant documents exist. nDCG therefore normalizes DCG by the ideal DCG, the DCG of the best possible ranking, which puts every query on the same scale and ensures comparability across queries.
MAP does not need such a normalization, because the average precision of an ideal ranked list is always 1.
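Here is a minimal Python sketch of DCG and nDCG (my own, using the base-2 log discount above; the graded judgments in the example are made up for illustration):

```python
import math

def dcg(gains):
    """DCG of a ranked list of graded relevance values (no discount at rank 1)."""
    return sum(g if i == 1 else g / math.log2(i)
               for i, g in enumerate(gains, start=1))

def ndcg(gains):
    """nDCG: DCG divided by the DCG of the ideal (relevance-sorted) ranking."""
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0

# Hypothetical grades: 2 = highly relevant, 1 = marginally relevant, 0 = not relevant.
ranking = [2, 1, 0, 2, 0]
print(ndcg(ranking))  # below 1.0 because a highly relevant document sits at rank 4
```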
Statistical significance tests provide a way to assess the variance of the average precision scores across different queries. If the variance is large, the comparison could easily come out differently on another set of queries, which makes an observed difference in overall scores unreliable on its own.
One popular statistical significance test is the Wilcoxon signed-rank test.
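In practice such a test is run on paired per-query scores. A minimal sketch using SciPy's `scipy.stats.wilcoxon` is shown below; the two lists of average precision scores are made-up numbers, and the 0.05 threshold is only a conventional choice:

```python
from scipy.stats import wilcoxon

# Hypothetical per-query average precision for two systems on the same 8 queries.
system_a = [0.31, 0.45, 0.12, 0.60, 0.27, 0.50, 0.33, 0.41]
system_b = [0.36, 0.44, 0.21, 0.67, 0.30, 0.58, 0.35, 0.47]

# Paired signed-rank test on the per-query differences.
statistic, p_value = wilcoxon(system_a, system_b)
print(f"W = {statistic}, p = {p_value:.3f}")

if p_value < 0.05:
    print("The difference between the two systems is statistically significant.")
else:
    print("The observed difference could be due to query-to-query variation.")
```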