
LiveDRBench: Deep Research as Claim Discovery

arXiv Paper | Hugging Face Dataset | Evaluation Code

Deep Research as Claim Discovery: We propose a formal characterization of the deep research (DR) problem and introduce a new benchmark, LiveDRBench, to evaluate the performance of DR systems. To enable objective evaluation, we define DR using an intermediate output representation that encodes key claims uncovered during search—separating the reasoning challenge from surface-level report generation.

Figure: Characterizing the Deep Research task.

Dataset Details

The benchmark consists of 100 challenging DR tasks over scientific topics (e.g., dataset discovery, materials discovery, novelty search, prior art discovery) and public interest events (e.g., the Oscars). The data was collected between May and June 2025. We plan to keep the benchmark live and release periodic updates with new tasks.

Each task consists of (a) a prompt with a short description of the task and the expected output format, and (b) ground-truth JSON containing the claims and references that should be uncovered. We also include an evaluation script for computing the performance of DR systems using standard information-retrieval metrics, namely precision, recall, and F1.

The benchmark contains eight categories: SciFacts-Geo, SciFacts-Materials, NovelDatasets identification, NovelDatasets identification and extraction, NovelDatasets peer retrieval, PriorArt search, Entities, and Flight incidents. A preview of the dataset is available on Hugging Face.

A detailed discussion of LiveDRBench, including how it was developed and tested, can be found in our arXiv paper.

Usage

First, install the required dependencies:

pip install datasets openai tqdm 

To use LiveDRBench's questions, you can load the benchmark using the Hugging Face datasets library:

from datasets import load_dataset

livedrbench = load_dataset("microsoft/LiveDRBench", "v1-full")['test']
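Each record in the split contains the task prompt and the metadata needed to build a predictions file. Below is a minimal sketch for inspecting an example; the field names "key" and "prompt" are assumptions and may not match the released schema exactly:

# Inspect the first task; the field names "key" and "prompt" are assumptions
# here and may differ from the released schema.
example = livedrbench[0]
print(example.keys())           # list the fields actually present in the split
print(example.get("key"))       # unique task identifier (used in the predictions file below)
print(example.get("prompt"))    # task description and expected output format, if present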

To evaluate predictions on LiveDRBench, provide a predictions file with the following JSON schema:

[
  {
    "key": str,                             // Unique identifier from livedrbench.csv
    "preds": List[List[dict | str] | dict]  // Predictions in the format specified by each question in livedrbench.csv
  },
  ...
]
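For example, a predictions file in this schema could be written as follows. This is only an illustrative sketch: the key and prediction values are placeholders, and the structure of each "preds" entry must follow the output format specified by the corresponding question.

import json

# Illustrative sketch only: the "key" values must match the benchmark's task
# identifiers, and each "preds" entry must follow the output format that the
# task's prompt specifies (a dict, a string, or a list of these).
predictions = [
    {
        "key": "example-task-id",                           # placeholder identifier
        "preds": [{"claim": "...", "reference": "..."}],    # placeholder prediction
    },
]

with open("predictions.json", "w") as f:
    json.dump(predictions, f, indent=2)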

Then, run the evaluation script with an OpenAI API key. This script will compute precision, recall, and F1 scores for each benchmark category.

python src/evaluate.py \
  --openai_api_key YOUR_API_KEY \
  --preds_file path/to/your/predictions.json \
  [--openai_model_name gpt-4o] \
  [--num_threads 8] \
  [--debug]
  • --openai_api_key (required): Your OpenAI API key.
  • --preds_file (required): Path to the predictions JSON file.
  • --openai_model_name (optional): Model to use as judge (default: gpt-4o).
  • --num_threads (optional): Number of parallel threads (default: 8).
  • --debug (optional): Enable debug mode (runs without multithreading).
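The reported scores follow the standard set-based definitions of precision, recall, and F1 over matched claims. The sketch below only illustrates these definitions; it is not the repository's implementation, which relies on an LLM judge to decide whether a predicted claim matches a ground-truth claim.

# Illustration of the metric definitions only; the actual evaluator uses an
# LLM judge to determine num_matched between predicted and ground-truth claims.
def prf1(num_matched: int, num_predicted: int, num_gold: int) -> tuple[float, float, float]:
    precision = num_matched / num_predicted if num_predicted else 0.0
    recall = num_matched / num_gold if num_gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

print(prf1(num_matched=3, num_predicted=4, num_gold=5))  # (0.75, 0.6, 0.666...)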

Intended Uses

The LiveDRBench repository is best suited for loading the companion benchmark and evaluating existing models. The code and the benchmark are shared with the research community to facilitate reproduction of our results and foster further research in this area. LiveDRBench is intended to be used by domain experts who are independently capable of evaluating the quality of outputs before acting on them.

Out-of-scope Uses

  • LiveDRBench is not well suited for training new Deep Research models; it only provides a test set. To avoid accidental test-set leakage, we encrypt the answers in the benchmark, following the procedure used in the BrowseComp benchmark's release (a sketch of that decryption scheme appears after this list).

  • The LiveDRBench dataset is not representative of all kinds of Deep Research queries, especially those that require assessing the writing quality of long reports.

  • We do not recommend using the LiveDRBench repository or the dataset in commercial or real-world applications without further testing and development. They are being released for research purposes.

  • LiveDRBench should not be used in highly regulated domains where inaccurate outputs could suggest actions that lead to injury or negatively impact an individual's legal, financial, or life opportunities.
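As noted in the first bullet above, the ground-truth answers are encrypted following the BrowseComp release. The sketch below shows BrowseComp's XOR-plus-SHA-256 decryption scheme for reference; whether LiveDRBench uses exactly this key derivation, and which password the evaluation code expects, are assumptions to verify against src/evaluate.py.

import base64
import hashlib

# Sketch of the BrowseComp-style scheme; the exact key derivation and the
# password used by LiveDRBench are assumptions to check against the repository.
def derive_key(password: str, length: int) -> bytes:
    # Repeat the SHA-256 digest of the password until it covers the ciphertext.
    digest = hashlib.sha256(password.encode()).digest()
    return (digest * (length // len(digest) + 1))[:length]

def decrypt(ciphertext_b64: str, password: str) -> str:
    # XOR the base64-decoded ciphertext with the derived key stream.
    encrypted = base64.b64decode(ciphertext_b64)
    key = derive_key(password, len(encrypted))
    return bytes(c ^ k for c, k in zip(encrypted, key)).decode()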

Data Creation: Problem Inversion

Figure: The problem inversion process.

Creating LiveDRBench involves a problem inversion process that allows the benchmark to be easily updated with new instances, given a set of existing reasoning problems. The first step is to find a long-context or document reasoning problem that includes a question based on the document and its ground-truth answer. In the second step, this problem is inverted to create a new question asking for an event or entity consistent with the properties mentioned in the answer. In the third step, the question is refined (e.g., more properties are added) so that it admits a unique answer. Finally, the ground-truth set of reference documents is updated if additional documents provide the same answer.

For example, existing data from the Curie benchmark consists of scientific papers and questions that could be answered based on each paper. The data was transformed to create questions that must be answered without access to the paper and thus require non-trivial search and reasoning. The final ground-truth answers for each question were verified by MSR researchers.

While we aim to cover a broad set of scientific fields and world events, the dataset primarily covers the fields of materials science, geospatial analysis, and computer science, and world events including flight incidents, the Oscars, and the Olympiads. We acknowledge that many scientific fields and geographic areas may not be well covered.

Note: LiveDRBench does not contain links to external data sources. LiveDRBench includes data from an existing scientific dataset, Curie. All queries are answerable using publicly available information.

Best Practices

Best performance is achieved by providing an API key directly to the evaluation code. LiveDRBench should not be the only measure used to understand the performance of a DR model; additional methods specific to the model's use case should also be used to determine its overall performance.

We strongly encourage users to use LLMs that support robust Responsible AI mitigations, such as Azure OpenAI (AOAI) services. Such services continually update their safety and RAI mitigations with the latest industry standards for responsible use. For more on AOAI's best practices when employing foundation models in scripts and applications, see the Azure OpenAI documentation.

Users are reminded to be mindful of data privacy concerns and are encouraged to review the privacy policies associated with any models and data storage solutions interfacing with LiveDRBench.

It is the user’s responsibility to ensure that the use of LiveDRBench repo and dataset complies with relevant data protection regulations and organizational guidelines.

License

Code in this GitHub repository is licensed under the MIT License. The LiveDRBench dataset is released under the CDLA v2 license.

Contact

If you have suggestions or questions, please raise an issue on GitHub or contact us at [email protected].

Citing LiveDRBench

@inproceedings{livedrbench2025,
  title={Characterizing Deep Research: A Benchmark and Formal Definition},
  author={Java, Abhinav and Khandelwal, Ashmit and Midigeshi, Sukruta and Halfaker, Aaron and Deshpande, Amit and Goyal, Navin and Gupta, Ankur and Natarajan, Nagarajan and Sharma, Amit},
  booktitle={arXiv preprint arXiv:2508.04183},
  year={2025}
}
