This repository contains the code for a pipeline to train and evaluate a biomedical retrieval model using the GPL framework. The pipeline consists of three main stages: building a corpus with hard negatives, training the model, and evaluating its performance on various benchmark datasets. All fine-tuned models and created datasets are available in this HuggingFace Collection.
Create a new conda environment. Ensure that the Python version is below 3.11 (otherwise faiss-gpu will fail to install):

```bash
conda create -n <env_name> python=3.10.14
```

Activate it using:

```bash
conda activate <env_name>
```
Before running any of the scripts, ensure you have the necessary libraries installed. You can install them using the provided requirements.txt file:
```bash
pip install -r requirements.txt
```

The first stage, building the corpus, involves generating a dataset of queries, positive passages, and hard negatives from the PubMed abstract dataset. This is accomplished in two steps.
The build-corpus/pubmed-parser.py script downloads PubMed abstracts and constructs 2-hop citation graphs. For each starting abstract, it fetches the abstracts of its cited papers (1-hop) and the papers they cite (2-hop).
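For orientation, the 2-hop expansion can be sketched as below. This is illustrative only: it assumes the NCBI E-utilities `elink` endpoint (with the `pubmed_pubmed_refs` link) as the citation source, whereas the actual parser may fetch citations differently and also downloads the abstracts themselves.

```python
# Minimal sketch of the 2-hop expansion (illustrative; not the actual parser).
import requests

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi"

def get_cited_pmids(pmid: str) -> list[str]:
    """Return the PMIDs of papers cited by `pmid` (its reference list)."""
    params = {"dbfrom": "pubmed", "db": "pubmed", "id": pmid,
              "linkname": "pubmed_pubmed_refs", "retmode": "json"}
    data = requests.get(EUTILS, params=params, timeout=30).json()
    linksetdbs = data["linksets"][0].get("linksetdbs", [])
    return [str(link) for link in linksetdbs[0]["links"]] if linksetdbs else []

def build_two_hop_graph(start_pmid: str) -> dict:
    one_hop = get_cited_pmids(start_pmid)                        # papers cited by the start abstract
    two_hop = {pmid: get_cited_pmids(pmid) for pmid in one_hop}  # papers cited by the 1-hop papers
    return {"pmid": start_pmid, "one_hop": one_hop, "two_hop": two_hop}
```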
To run this script:
```bash
python build-corpus/pubmed-parser.py
```

This will create the following file:
- `2hop-citation-graphs.jsonl`: Contains the 2-hop citation graphs, with each line representing a starting PMID and its corresponding 1-hop and 2-hop abstracts.
The build-corpus/pubmed-query-scoring.py script takes the citation graphs from the previous step and generates queries and hard negatives. It uses the T5 Doc2Query model to create a query for each positive abstract and then traverses the citation graph to find diverse hard negatives.
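Query generation with a T5 doc2query model typically follows the pattern below; the checkpoint name is an assumption, and the script may use a different doc2query variant or generation settings.

```python
# Generate one synthetic query for a positive abstract (checkpoint name is an assumption).
from transformers import T5ForConditionalGeneration, T5Tokenizer

CHECKPOINT = "doc2query/msmarco-t5-base-v1"  # assumed doc2query checkpoint
tokenizer = T5Tokenizer.from_pretrained(CHECKPOINT)
model = T5ForConditionalGeneration.from_pretrained(CHECKPOINT)

abstract = "Aspirin irreversibly inhibits cyclooxygenase-1, reducing platelet aggregation..."
inputs = tokenizer(abstract, return_tensors="pt", truncation=True, max_length=512)
outputs = model.generate(**inputs, max_length=64, do_sample=True, top_k=10, num_return_sequences=1)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```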
To run this script:
```bash
python build-corpus/pubmed-query-scoring.py
```

This will produce the following file, which will be used for training:
- `hard-negatives-traversal.jsonl`: A JSONL file where each line contains a query, a positive passage, and a list of hard negative passages.
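For reference, a record from this file can be inspected as shown below; the key names are assumptions and should be checked against the script's actual output schema.

```python
# Peek at one training record (key names are illustrative, not guaranteed).
import json

with open("hard-negatives-traversal.jsonl") as f:
    example = json.loads(next(f))

print(example["query"])           # generated query (assumed key)
print(example["positive"][:200])  # positive abstract (assumed key)
print(len(example["negatives"]))  # hard negatives from the graph traversal (assumed key)
```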
The `train.py` script fine-tunes the GTE models using the data generated in the previous step. It uses a multiple negatives ranking loss to train the model to distinguish between positive and negative passages for a given query.
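For orientation, multiple negatives ranking loss fine-tuning with sentence-transformers usually looks like the minimal sketch below; it assumes `query`/`positive`/`negatives` keys in the training file and placeholder hyperparameters, and is not a copy of `train.py`.

```python
# Minimal MultipleNegativesRankingLoss fine-tuning sketch (hyperparameters are placeholders).
import json
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("thenlper/gte-base")

train_examples = []
with open("hard-negatives-traversal.jsonl") as f:
    for line in f:
        rec = json.loads(line)
        for neg in rec["negatives"]:  # one (query, positive, hard negative) triplet per negative
            train_examples.append(InputExample(texts=[rec["query"], rec["positive"], neg]))

loader = DataLoader(train_examples, shuffle=True, batch_size=32)
loss = losses.MultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=100, output_path="output/gte")
```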
To start the training process:
```bash
python train.py
```

The script will save the fine-tuned model to the following directory:
- `output/`: This folder will contain the trained model artifacts. The specific sub-folder will depend on the `MODEL_NAME` set in the `train.py` script. For example, if `MODEL_NAME` is `'thenlper/gte-base'`, the model will be saved in `output/gte/`.
The final stage is to evaluate the performance of the trained model on various benchmark datasets. The evaluation scripts use the beir library, and the datasets are available from the BEIR GitHub repository. Make sure to download the necessary datasets and place them in the `eval_datasets/` directory. For LoTTE, the datasets are downloaded from IR Datasets.
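A typical dense-retrieval evaluation with the beir library looks roughly like the sketch below; the dataset and model paths are placeholders, and the actual script may differ.

```python
# Evaluate the fine-tuned model on one BEIR dataset (paths are placeholders).
from beir.datasets.data_loader import GenericDataLoader
from beir.retrieval import models
from beir.retrieval.evaluation import EvaluateRetrieval
from beir.retrieval.search.dense import DenseRetrievalExactSearch as DRES

corpus, queries, qrels = GenericDataLoader("eval_datasets/scifact").load(split="test")

dense_model = DRES(models.SentenceBERT("output/gte"), batch_size=128)
retriever = EvaluateRetrieval(dense_model, score_function="cos_sim")

results = retriever.retrieve(corpus, queries)
ndcg, _map, recall, precision = retriever.evaluate(qrels, results, retriever.k_values)
print(ndcg["NDCG@10"])
```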
To run the evaluation:
```bash
python eval/beir-evaluation.py
```

The `eval/cqadupstack.py` script evaluates the model on the CQADupStack benchmark, which consists of sub-datasets from different domains.
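CQADupStack results are conventionally reported as the average over its twelve sub-forums; a hedged sketch of that loop (the directory layout is an assumption) is shown below.

```python
# Average nDCG@10 over the CQADupStack sub-forums (directory layout is an assumption).
from beir.datasets.data_loader import GenericDataLoader
from beir.retrieval import models
from beir.retrieval.evaluation import EvaluateRetrieval
from beir.retrieval.search.dense import DenseRetrievalExactSearch as DRES

SUBFORUMS = ["android", "english", "gaming", "gis", "mathematica", "physics",
             "programmers", "stats", "tex", "unix", "webmasters", "wordpress"]

retriever = EvaluateRetrieval(DRES(models.SentenceBERT("output/gte"), batch_size=128),
                              score_function="cos_sim")

scores = []
for forum in SUBFORUMS:
    corpus, queries, qrels = GenericDataLoader(f"eval_datasets/cqadupstack/{forum}").load(split="test")
    results = retriever.retrieve(corpus, queries)
    ndcg, _map, recall, precision = retriever.evaluate(qrels, results, retriever.k_values)
    scores.append(ndcg["NDCG@10"])

print("CQADupStack average nDCG@10:", sum(scores) / len(scores))
```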
To run this evaluation:
```bash
python eval/cqadupstack.py
```

The `eval/lotte-evaluation.py` script evaluates the model on the LoTTE benchmark.
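LoTTE is commonly reported with Success@5, i.e. whether any relevant passage appears in a query's top 5 results. The script's exact metrics are not shown here, but the measure itself is simple to illustrate:

```python
# Success@k illustration on toy data; not the evaluation script itself.
def success_at_k(rankings: dict[str, list[str]], relevant: dict[str, set[str]], k: int = 5) -> float:
    """Fraction of queries with at least one relevant passage in the top-k ranking."""
    hits = sum(1 for qid, ranked in rankings.items() if set(ranked[:k]) & relevant.get(qid, set()))
    return hits / len(rankings)

rankings = {"q1": ["p3", "p7", "p1"], "q2": ["p9", "p2"]}
relevant = {"q1": {"p1"}, "q2": {"p4"}}
print(success_at_k(rankings, relevant))  # 0.5 -- only q1 has a relevant passage in its top 5
```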
To run the LoTTE evaluation, you need to provide the path to the model and the data directories:
```bash
python eval/lotte-evaluation.py --model_path output/gte --data_dir eval_datasets/lotte --rankings_dir rankings --split test
```

To test the query encoding and retrieval latency, we run evaluations using the MSMARCO dataset. Run the script using the following command:
```bash
python eval/latency.py
```
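A rough way to measure query-encoding latency, assuming the fine-tuned model is loaded with sentence-transformers (the actual script may also time index search over the MSMARCO corpus):

```python
# Time query encoding over a batch of placeholder queries.
import time
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("output/gte")
queries = ["what is the treatment for hypertension"] * 1000  # placeholder MSMARCO-style queries

start = time.perf_counter()
model.encode(queries, batch_size=64, show_progress_bar=False)
elapsed = time.perf_counter() - start
print(f"Encoded {len(queries)} queries in {elapsed:.2f}s ({1000 * elapsed / len(queries):.2f} ms/query)")
```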
If you use this work, please cite:

```bibtex
@misc{sinha2025bicaeffectivebiomedicaldense,
      title={BiCA: Effective Biomedical Dense Retrieval with Citation-Aware Hard Negatives},
author={Aarush Sinha and Pavan Kumar S and Roshan Balaji and Nirav Pravinbhai Bhatt},
year={2025},
eprint={2511.08029},
archivePrefix={arXiv},
primaryClass={cs.IR},
url={https://arxiv.org/abs/2511.08029},
}
```