Add IFIR in retrieval dataset. #2682
Conversation
```python
import datasets  # needed by the loaders below
from collections import defaultdict

DOMAINS = [
    "fiqa",
    "nfcorpus",
    "scifact_open",
    "aila",
    "fire",
    "pm",
    "cds",
]

DOMAINS_langs = {domain: ["eng"] for domain in DOMAINS}


def load_ifir_data(
    self,
    path: str,
    domains: list,
    eval_splits: list,
    cache_dir: str | None = None,
    revision: str | None = None,
):
    # Nested layout: {domain: {split: data}} for corpus, queries, and qrels.
    corpus = {domain: {split: None for split in eval_splits} for domain in domains}
    queries = {domain: {split: None for split in eval_splits} for domain in domains}
    relevant_docs = {
        domain: {split: None for split in eval_splits} for domain in domains
    }

    for domain in domains:
        # Each domain is stored as a split of the "corpus"/"queries"/"qrels" configs.
        domain_corpus = datasets.load_dataset(
            path, "corpus", split=domain, cache_dir=cache_dir, revision=revision
        )
        domain_queries = datasets.load_dataset(
            path, "queries", split=domain, cache_dir=cache_dir, revision=revision
        )
        qrels = datasets.load_dataset(
            path, "qrels", split=domain, cache_dir=cache_dir, revision=revision
        )
        corpus[domain]["test"] = {
            e["_id"]: {"text": e["text"]} for e in domain_corpus
        }
        queries[domain]["test"] = {e["_id"]: e["text"] for e in domain_queries}
        relevant_docs[domain]["test"] = {}

        for e in qrels:
            qid = e["query-id"]
            doc_id = e["doc-id"]
            if qid not in relevant_docs[domain]["test"]:
                relevant_docs[domain]["test"][qid] = defaultdict(dict)
            relevant_docs[domain]["test"][qid].update({doc_id: 1})

    corpus = datasets.DatasetDict(corpus)
    queries = datasets.DatasetDict(queries)
    relevant_docs = datasets.DatasetDict(relevant_docs)
    return corpus, queries, relevant_docs


def load_data(self, **kwargs):
    if self.data_loaded:
        return

    self.corpus, self.queries, self.relevant_docs = self.load_ifir_data(
        path=self.metadata_dict["dataset"]["path"],
        domains=DOMAINS,
        eval_splits=self.metadata_dict["eval_splits"],
        cache_dir=kwargs.get("cache_dir", None),
        revision=self.metadata_dict["dataset"]["revision"],
    )
    self.data_loaded = True
```
Can you move these functions into the class? I think you've copied them from BrightRetrieval, which has this structure because it has two different tasks from the same repo.
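A minimal sketch of what moving the loaders into the task class could look like, assuming the task subclasses mteb's AbsTaskRetrieval; the class name `IFIRRetrieval` is a placeholder, not the PR's actual code:

```python
from mteb.abstasks.AbsTaskRetrieval import AbsTaskRetrieval


class IFIRRetrieval(AbsTaskRetrieval):  # hypothetical name for the PR's task class
    DOMAINS = ["fiqa", "nfcorpus", "scifact_open", "aila", "fire", "pm", "cds"]

    def load_ifir_data(self, path, domains, eval_splits, cache_dir=None, revision=None):
        # body identical to the module-level load_ifir_data above,
        # just moved inside the class
        ...

    def load_data(self, **kwargs):
        if self.data_loaded:
            return
        self.corpus, self.queries, self.relevant_docs = self.load_ifir_data(
            path=self.metadata_dict["dataset"]["path"],
            domains=self.DOMAINS,
            eval_splits=self.metadata_dict["eval_splits"],
            cache_dir=kwargs.get("cache_dir", None),
            revision=self.metadata_dict["dataset"]["revision"],
        )
        self.data_loaded = True
```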
Yes, no problem.
```python
DOMAINS = [
    "fiqa",
    "nfcorpus",
    "scifact_open",
    "aila",
    "fire",
    "pm",
    "cds",
]
```
Technically, the implementation is the same. However, we construct a new dataset based on these datasets.
Overall I think it would be better to integrate your tasks separately and after that create a benchmark.
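A rough sketch of that suggestion, assuming one task class per IFIR domain and mteb's Benchmark container; the per-domain task names here are hypothetical:

```python
import mteb
from mteb.benchmarks import Benchmark  # assuming this container class

# Register one task per domain (e.g. IFIRFiQA, IFIRNFCorpus, ...) individually,
# then group them into a named benchmark. Task names are placeholders.
IFIR_BENCHMARK = Benchmark(
    name="IFIR",
    tasks=mteb.get_tasks(tasks=["IFIRFiQA", "IFIRNFCorpus"]),  # hypothetical names
    description="Instruction-following retrieval across expert domains.",
)
```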
Signed-off-by: SighingSnow <[email protected]>
Maybe like RAR-b? They separate their tasks, but I think that implementation may not be that elegant. Our paper measures instruction-following retrieval ability covering these domains; I think it's similar to BRIGHT to some extent.
Yes, you can do it like this.
@SighingSnow it is possible to create an AggregateTask that simply aggregates scores across the individual tasks (instead of implementing it as one task). It seems like this is kind of what you are doing here.
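A sketch of the AggregateTask route, assuming mteb's AbsTaskAggregate and AggregateTaskMetadata (the exact field set may differ; check the current aggregated_task module, and note the task names below are hypothetical):

```python
import mteb
from mteb.abstasks.aggregated_task import AbsTaskAggregate, AggregateTaskMetadata


class IFIR(AbsTaskAggregate):  # hypothetical aggregate over the per-domain tasks
    metadata = AggregateTaskMetadata(
        name="IFIR",
        description="Aggregated instruction-following retrieval over IFIR domains.",
        tasks=mteb.get_tasks(tasks=["IFIRFiQA", "IFIRNFCorpus"]),  # hypothetical names
        main_score="ndcg_at_10",
        type="Retrieval",
        eval_splits=["test"],
    )
```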
Thank you, I will revise my code. It will probably be done next week.
Code Quality
- [ ] Code Formatted: Format the code using `make lint` to maintain consistent style.

Documentation
- [ ] Updated Documentation: Relevant documentation is updated as needed.

Testing
- [ ] New Tests Added: New functionality is covered by tests, e.g. run with `make test-with-coverage`.
- [ ] Tests Passed: Run tests locally using `make test` or `make test-with-coverage` to ensure no existing functionality is broken.

Adding datasets checklist
Reason for dataset addition: ...
- [ ] I have run the following models on the task (adding the results to the PR). These can be run using the `mteb -m {model_name} -t {task_name}` command.
  - [ ] `sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2`
  - [ ] `intfloat/multilingual-e5-small`
- [ ] If the dataset is too big (e.g. >2048 examples), consider using `self.stratified_subsampling()` under `dataset_transform()`.
- [ ] Run tests locally to make sure nothing is broken using `make test`.
- [ ] Run the formatter to format the code using `make lint`.
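For the model-run items above, a minimal sketch of the equivalent Python invocation; the task name `"IFIR"` is an assumption and should be whatever name the task is registered under in this PR:

```python
import mteb
from sentence_transformers import SentenceTransformer

# Run one of the checklist models on the new task.
# "IFIR" is a placeholder for the registered task name.
model = SentenceTransformer("intfloat/multilingual-e5-small")
tasks = mteb.get_tasks(tasks=["IFIR"])
evaluation = mteb.MTEB(tasks=tasks)
results = evaluation.run(model, output_folder="results")
```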