
Add IFIR to retrieval datasets #2682


Closed
SighingSnow wants to merge 1 commit

Conversation

SighingSnow

Code Quality

  • Code Formatted: Format the code using make lint to maintain consistent style.

Documentation

  • Updated Documentation: Add or update documentation to reflect the changes introduced in this PR.

Testing

  • New Tests Added: Write tests to cover new functionality. Validate with make test-with-coverage.
  • Tests Passed: Run tests locally using make test or make test-with-coverage to ensure no existing functionality is broken.

Adding datasets checklist

Reason for dataset addition: ...

  • I have run the following models on the task (adding the results to the PR). These can be run using the mteb -m {model_name} -t {task_name} command; see the sketch after this checklist.
    • sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
    • intfloat/multilingual-e5-small
  • I have checked that the performance is neither trivial (both models gain close to perfect scores) nor random (both models gain close to random scores).
  • If the dataset is too big (e.g. >2048 examples), consider using self.stratified_subsampling() under dataset_transform()
  • I have filled out the metadata object in the dataset file (find documentation on it here).
  • Run tests locally to make sure nothing is broken using make test.
  • Run the formatter to format the code using make lint.
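
For reference, a minimal sketch (not part of this PR) of how the two baseline models above could be run through mteb's Python API once the task is registered; "IFIR" is a placeholder for whatever the final task name turns out to be, and the mteb CLI command in the checklist is equivalent.

import mteb

baselines = [
    "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2",
    "intfloat/multilingual-e5-small",
]

for model_name in baselines:
    model = mteb.get_model(model_name)
    tasks = mteb.get_tasks(tasks=["IFIR"])  # placeholder task name
    evaluation = mteb.MTEB(tasks=tasks)
    evaluation.run(model, output_folder=f"results/{model_name}")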

Comment on lines 11 to 80
DOMAINS = [
    "fiqa",
    "nfcorpus",
    "scifact_open",
    "aila",
    "fire",
    "pm",
    "cds"
]

DOMAINS_langs = {split: ["eng"] for split in DOMAINS}


def load_ifir_data(
    self,
    path: str,
    domains: list,
    eval_splits: list,
    cache_dir: str | None = None,
    revision: str | None = None,
):
    corpus = {domain: {split: None for split in eval_splits} for domain in DOMAINS}
    queries = {domain: {split: None for split in eval_splits} for domain in DOMAINS}
    relevant_docs = {
        domain: {split: None for split in eval_splits} for domain in DOMAINS
    }

    for domain in domains:
        domain_corpus = datasets.load_dataset(
            path, "corpus", split=domain, cache_dir=cache_dir, revision=revision
        )
        domain_queries = datasets.load_dataset(
            path, "queries", split=domain, cache_dir=cache_dir, revision=revision
        )
        qrels = datasets.load_dataset(
            path, "qrels", split=domain, cache_dir=cache_dir, revision=revision
        )
        corpus[domain]["test"] = {
            e["_id"]: {"text": e["text"]} for e in domain_corpus
        }
        queries[domain]["test"] = {
            e["_id"]: e["text"] for e in domain_queries
        }
        relevant_docs[domain]["test"] = {}

        for e in qrels:
            qid = e["query-id"]
            doc_id = e["doc-id"]
            if qid not in relevant_docs[domain]["test"]:
                relevant_docs[domain]["test"][qid] = defaultdict(dict)
            relevant_docs[domain]["test"][qid].update({doc_id: 1})

    corpus = datasets.DatasetDict(corpus)
    queries = datasets.DatasetDict(queries)
    relevant_docs = datasets.DatasetDict(relevant_docs)
    return corpus, queries, relevant_docs


def load_data(self, **kwargs):
    if self.data_loaded:
        return

    self.corpus, self.queries, self.relevant_docs = self.load_ifir_data(
        path=self.metadata_dict["dataset"]["path"],
        domains=DOMAINS,
        eval_splits=self.metadata_dict["eval_splits"],
        cache_dir=kwargs.get("cache_dir", None),
        revision=self.metadata_dict["dataset"]["revision"],
    )
    self.data_loaded = True
Member


Can you move these functions into the class? I think you've copied them from BrightRetrieval, which has this structure because it has 2 different tasks from the same repo.
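
For illustration, a rough sketch (assumed class name, base classes, and import path; not the PR's actual code) of what the suggested structure looks like with the loaders moved onto the task class:

from mteb.abstasks import AbsTaskRetrieval, MultilingualTask  # import path may differ by mteb version


class IFIRRetrieval(MultilingualTask, AbsTaskRetrieval):  # assumed name and bases
    # metadata = TaskMetadata(...)  # unchanged from the PR

    def load_data(self, **kwargs):
        if self.data_loaded:
            return
        self.corpus, self.queries, self.relevant_docs = self._load_ifir_data(
            path=self.metadata_dict["dataset"]["path"],
            domains=DOMAINS,
            eval_splits=self.metadata_dict["eval_splits"],
            cache_dir=kwargs.get("cache_dir", None),
            revision=self.metadata_dict["dataset"]["revision"],
        )
        self.data_loaded = True

    def _load_ifir_data(self, path, domains, eval_splits, cache_dir=None, revision=None):
        # same body as load_ifir_data above, just moved into the class as a method
        ...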

Author


Yes, no problem.

Comment on lines +11 to +19
DOMAINS = [
    "fiqa",
    "nfcorpus",
    "scifact_open",
    "aila",
    "fire",
    "pm",
    "cds"
]
Member


Author


Technically, the implementation is the same.

However, we construct a new dataset based on these datasets.

@Samoed
Member

Samoed commented May 9, 2025

Overall I think it would be better to integrate your tasks separately and after that create a benchmark in benchmarks.py

@SighingSnow
Author

SighingSnow commented May 9, 2025

Overall I think it would be better to integrate your tasks separately and after that create a benchmark in benchmarks.py

Maybe like RAR-b? They separate their tasks, but I think the implementation may not be that elegant.

Our paper measures instruction-following retrieval ability across these domains. I think it's similar to BRIGHT to some extent.

@Samoed
Member

Samoed commented May 9, 2025

Yes, you can do it like this
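
For concreteness, a rough sketch of what a benchmarks.py entry could look like once the per-domain tasks are registered; the Benchmark fields mirror mteb's existing entries, and all task names below are hypothetical placeholders, not names defined in this PR.

from mteb import get_tasks
from mteb.benchmarks import Benchmark  # import path may differ by mteb version

IFIR = Benchmark(
    name="IFIR",
    tasks=get_tasks(
        tasks=[  # hypothetical per-domain task names
            "IFIRFiQA",
            "IFIRNFCorpus",
            "IFIRSciFactOpen",
            "IFIRAILA",
            "IFIRFiRE",
            "IFIRPM",
            "IFIRCDS",
        ]
    ),
    description="Instruction-following retrieval across the IFIR domains.",
    reference=None,  # placeholder: link to the IFIR paper / dataset
    citation=None,  # placeholder: BibTeX for the IFIR paper
)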

@KennethEnevoldsen
Contributor

@SighingSnow it is possible to create an AggregateTask that simply aggregates scores across the individual tasks (instead of implementing it as one task). It seems like this is kind of what you are doing here.
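
For illustration, a rough sketch of the AggregateTask route; the class and field names follow mteb's aggregated-task abstraction at the time of writing and may differ between versions, and the per-domain task names are again hypothetical.

from mteb import get_tasks
from mteb.abstasks.aggregated_task import AbsTaskAggregate, AggregateTaskMetadata  # assumed module path


class IFIRAggregate(AbsTaskAggregate):
    # Field names are assumptions; check AbsTaskAggregate for the exact schema.
    metadata = AggregateTaskMetadata(
        name="IFIR",
        description="Aggregate score over the per-domain IFIR instruction-following retrieval tasks.",
        reference=None,  # placeholder: link to the IFIR paper
        tasks=get_tasks(tasks=["IFIRFiQA", "IFIRNFCorpus", "IFIRSciFactOpen"]),  # hypothetical names
        main_score="ndcg_at_10",
        type="Retrieval",
        eval_splits=["test"],
        bibtex_citation="",
    )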

@SighingSnow
Author

Thank you. I will revise my code; it will probably be done next week.

@SighingSnow SighingSnow closed this by deleting the head repository May 30, 2025