DrQA is an open-domain question answering system that reads large text corpora, most famously Wikipedia, and answers natural language questions by extracting answer spans from the text. It follows a two-stage pipeline: a fast document retriever first narrows the corpus down to a handful of candidate articles, and a neural machine reader then predicts the exact answer span within their paragraphs. The retriever relies on classic IR features (TF-IDF weighting over n-gram counts) so that it stays lightweight and scales to millions of documents. The reader is a neural model trained on supervised QA data to estimate the start and end positions of the answer within a paragraph, and it can be adapted to new domains through fine-tuning or distant supervision. The repository includes scripts to build the Wikipedia index, train the reader, and evaluate end-to-end performance. DrQA popularized a practical recipe for combining IR with neural reading, and it remains a strong baseline for open-domain QA research and for prototyping QA systems.
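To make the retrieval stage concrete, here is a minimal sketch of TF-IDF ranking over unigram and bigram counts. It is not DrQA's own code or API: the names `TfidfRetriever`, `ngrams`, and `top_k`, the regex tokenizer, and the smoothed IDF formula are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


def ngrams(text):
    """Lowercase, tokenize, and return unigram + bigram counts (illustrative tokenizer)."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return Counter(tokens + bigrams)


class TfidfRetriever:
    """Hypothetical in-memory retriever; a real index would be sparse and stored on disk."""

    def __init__(self, docs):
        # docs maps a document title to its text.
        self.titles = list(docs)
        self.counts = [ngrams(docs[title]) for title in self.titles]
        doc_freq = Counter()
        for counts in self.counts:
            doc_freq.update(counts.keys())
        n_docs = len(self.titles)
        # Smoothed IDF (an assumption): rarer n-grams receive larger weights.
        self.idf = {
            gram: math.log((1 + n_docs) / (1 + df)) + 1.0
            for gram, df in doc_freq.items()
        }

    def top_k(self, question, k=5):
        """Return the k highest-scoring (title, score) pairs for the question."""
        query = ngrams(question)
        scored = []
        for title, counts in zip(self.titles, self.counts):
            # Sum log-scaled term frequency times IDF over n-grams shared with the query.
            score = sum(
                math.log1p(counts[gram]) * self.idf[gram]
                for gram in query
                if gram in counts
            )
            scored.append((title, score))
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:k]


docs = {
    "Paris": "Paris is the capital and largest city of France.",
    "Berlin": "Berlin is the capital of Germany.",
    "Python": "Python is a programming language.",
}
retriever = TfidfRetriever(docs)
candidates = retriever.top_k("What is the capital of France?", k=2)
```

In a full pipeline, the paragraphs of the returned candidates would then be passed to the reader. Bigrams matter here because they let "of France" outweigh the unigram matches that the Berlin article shares with the question.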
Features
- Scalable TF-IDF–based retriever over large corpora
- Neural span extractor trained for precise start/end predictions
- End-to-end pipeline from indexing to answering questions
- Tools for distant supervision and domain adaptation
- Reproducible training and evaluation scripts for standard datasets
- Modular components, so the retriever or the reader can be swapped out and custom corpora plugged in
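The start/end prediction mentioned above can be turned into an answer with a small decoding step. The sketch below assumes the reader has already produced one start probability and one end probability per token. It picks the span that maximizes their product, subject to a maximum span length. The function name `best_span`, the `max_len` default, and the probabilities in the example are illustrative assumptions, not values taken from the repository.

```python
def best_span(start_probs, end_probs, max_len=15):
    """Return (start, end, score) maximizing start_probs[start] * end_probs[end].

    The span is inclusive and constrained to start <= end < start + max_len.
    """
    if not start_probs or len(start_probs) != len(end_probs):
        raise ValueError("start_probs and end_probs must be non-empty and equal in length")
    best_start, best_end, best_score = 0, 0, -1.0
    for i, p_start in enumerate(start_probs):
        # Only consider end positions at or after the start, within the length limit.
        for j in range(i, min(i + max_len, len(end_probs))):
            score = p_start * end_probs[j]
            if score > best_score:
                best_start, best_end, best_score = i, j, score
    return best_start, best_end, best_score


# Made-up per-token probabilities for a seven-token paragraph.
tokens = ["DrQA", "was", "released", "by", "Facebook", "AI", "Research"]
start_probs = [0.05, 0.05, 0.05, 0.05, 0.60, 0.10, 0.10]
end_probs = [0.05, 0.05, 0.05, 0.05, 0.10, 0.20, 0.50]

start, end, score = best_span(start_probs, end_probs)
answer = " ".join(tokens[start:end + 1])
```

The length cap keeps the search cheap and prevents a confident start from being paired with a confident but distant end. When several paragraphs are read, the span scores can be compared across paragraphs to choose a single final answer.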