This repository offers a carefully curated selection of research papers centered on deep text hashing. It is based on our survey paper, A Survey on Deep Text Hashing: Efficient Semantic Text Retrieval with Binary Representation. The list will be updated regularly. Should you come across any inaccuracies or overlooked works, you are warmly encouraged to open an issue or submit a pull request.
We have implemented several deep text hashing models using the PyTorch framework. Our foundational code structure is inspired by the VDSH repository.
| Model | Paper | Venue | Status |
|---|---|---|---|
| VDSH | Variational deep semantic hashing for text documents | SIGIR'2017 | ✅ |
| NbrReg | Deep semantic text hashing with weak supervision | SIGIR'2018 | ✅ |
| NASH | Toward end-to-end neural architecture for generative semantic hashing | ACL'2018 | ✅ |
| B-VAE | A binary variational autoencoder for hashing | CIARP'2019 | ✅ |
| Doc2Hash | Learning discrete latent variables for documents retrieval | NAACL'2019 | ✅ |
| RBSH | Unsupervised neural generative semantic hashing | SIGIR'2019 | ✅ |
| AMMI | Learning discrete structured representations by adversarially maximizing mutual information | ICML'2020 | ✅ |
| PairRec | Unsupervised semantic hashing with pairwise reconstruction | SIGIR'2020 | ✅ |
| WISH | Unsupervised few-bits semantic hashing with implicit topics modeling | EMNLP'2020 | ✅ |
| MISH | Unsupervised multi-index semantic hashing | WWW'2021 | ✅ |
| SNUH | Integrating semantics and neighborhood information | ACL'2021 | ✅ |
| SSB-VAE | Self-supervised bernoulli autoencoders for semi-supervised hashing | CIARP'2021 | ✅ |
| SMASH | An efficient and robust semantic hashing framework | TOIS'2023 | ✅ |
| HierHash | Multi-grained prototype-induced hierarchical generative model | EMNLP'2024 | ✅ |
| DHSH | De-confusing hard samples for text semantic hashing | ICASSP'2025 | ✅ |
Note: Due to variations in data preprocessing, the results of different models may deviate from those reported in the original papers. We are actively working to standardize both the data processing pipeline and evaluation metrics.
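Since the note above concerns evaluation metrics, the following minimal sketch shows one metric commonly reported for text hashing: precision of the top-k documents ranked by Hamming distance. It is an illustration under stated assumptions, not the repository's evaluation code; the function names `hamming_distance` and `precision_at_k` are hypothetical, and single-label data is assumed.

```python
from typing import List, Sequence


def hamming_distance(a: Sequence[int], b: Sequence[int]) -> int:
    """Count the positions at which two equal-length binary codes differ."""
    if len(a) != len(b):
        raise ValueError("codes must have the same length")
    return sum(x != y for x, y in zip(a, b))


def precision_at_k(
    query_code: Sequence[int],
    query_label: int,
    db_codes: List[Sequence[int]],
    db_labels: List[int],
    k: int,
) -> float:
    """Fraction of the k nearest database codes that share the query's label."""
    # Rank database items by Hamming distance to the query. sorted() is
    # stable, so ties keep their original database order.
    ranked = sorted(
        range(len(db_codes)),
        key=lambda i: hamming_distance(query_code, db_codes[i]),
    )
    top = ranked[:k]
    if not top:
        return 0.0
    hits = sum(db_labels[i] == query_label for i in top)
    return hits / len(top)
```

Tie-breaking among equal Hamming distances is one of the details that can differ between implementations, which may contribute to small deviations from published numbers.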
DeepTextHashing/
├── models/ # Model implementations
│ ├── VDSH/
│ ├── NbrReg/
│ ├── NASH/
│ ├── B-VAE/
│ ├── Doc2Hash/
│ ├── RBSH/
│ ├── AMMI/
│ ├── PairRec/
│ ├── WISH/
│ ├── MISH/
│ ├── SNUH/
│ ├── SSB-VAE/
│ ├── SMASH/
│ ├── HierHash/
│ └── DHSH/
├── textdata/ # Dataset loading utilities
├── utils/ # Preprocessing and evaluation utilities
└── requirements.txt
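The layout above gives each model its own directory under models/. As a rough sketch, the helper below lists which model directories are ready to train in a local checkout. It assumes every trainable model ships a train.sh script directly inside its directory; the function name `list_trainable_models` is hypothetical and not part of the repository.

```python
from pathlib import Path
from typing import List


def list_trainable_models(repo_root: str) -> List[str]:
    """Return names of sub-directories of models/ that contain a train.sh."""
    models_dir = Path(repo_root) / "models"
    if not models_dir.is_dir():
        # Not a checkout of the expected layout; report nothing to train.
        return []
    return sorted(
        child.name
        for child in models_dir.iterdir()
        if child.is_dir() and (child / "train.sh").is_file()
    )
```

Calling `list_trainable_models(".")` from the repository root would return the directory names that can be substituted for `{model_name}` in the training command.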
pip install -r requirements.txt

Refer to the code in the utils/ folder to preprocess the dataset:

python utils/preprocess.py --dataset ng20

Once data preparation is complete, train any model with:

sh models/{model_name}/train.sh

For example:

sh models/VDSH/train.sh

We have compiled a selection of widely used benchmark datasets for text hashing research. These datasets span diverse domains and differ in scale and label type; a download link is provided for each. For a detailed introduction to each dataset, please refer to our survey.
| Dataset | Instances | Categories | Label Type | Link |
|---|---|---|---|---|
| 20Newsgroups | 18,846 | 20 | Single-label | link |
| Agnews | 127,600 | 4 | Single-label | link |
| Reuters | 10,788 | 90/20 | Multi-label | link |
| DBpedia | 60,000 | 14 | Single-label | link |
| RCV1 | 804,414 | 103/4 | Multi-label | link |
| TMC | 28,596 | 22 | Multi-label | link |
| NYT | 11,527 | 26 | Single-label | link |
| Yahooanswer | 1,460,000 | 10 | Single-label | link |
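The Label Type column in the table above affects how retrieval relevance is judged. A convention often used in the hashing literature treats a retrieved document as relevant when it shares at least one label with the query, which reduces to label equality in the single-label case. The sketch below illustrates that convention; whether this repository applies exactly this rule is an assumption, and `is_relevant` is a hypothetical name.

```python
from typing import Collection


def is_relevant(query_labels: Collection[str], doc_labels: Collection[str]) -> bool:
    """True when the two label collections share at least one label.

    Pass labels as lists or sets, not bare strings: a single-label
    document is represented as a collection with one element.
    """
    return not set(query_labels).isdisjoint(doc_labels)
```

For example, a multi-label query tagged `["earn", "acq"]` would match a document tagged `["acq", "trade"]`, while single-label documents match only when their one label is identical.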
- De-confusing Hard Samples for Text Semantic Hashing. In ICASSP'2025 Paper.
- Document Hashing with Multi-Grained Prototype-Induced Hierarchical Generative Model. In EMNLP'2024 Paper.
- Efficient similar exercise retrieval model based on unsupervised semantic hashing. In JCA'2024.
- Towards Efficient Coarse-grained Dialogue Response Selection. In TOIS'2023 Paper.
- An efficient and robust semantic hashing framework for similar text search. In TOIS'2023 Paper Code.
- Exploiting Multiple Features for Hash Codes Learning with Semantic-Alignment-Promoting Variational Auto-encoder. In NLPCC'2023 Paper.
- Intra-category aware hierarchical supervised document hashing. In TKDE'2022 Paper Code.
- Accelerating code search with deep hashing and code classification. In ACL'2022 Paper.
- LASH: Large-scale academic deep semantic hashing. In TKDE'2021 Paper.
- Efficient passage retrieval with hashing for open-domain question answering. In ACL'2021 Paper Code.
- Refining BERT embeddings for document hashing via mutual information maximization. In EMNLP'2021 Paper Code.
- Integrating Semantics and Neighborhood Information with Graph-Driven Generative Models for Document Retrieval. In ACL/IJCNLP'2021 Paper Code.
- Unsupervised multi-index semantic hashing. In WWW'2021 Paper Code.
- Self-supervised bernoulli autoencoders for semi-supervised hashing. In CIARP'2021 Paper Code.
- Conditional text hashing utilizing pair-wise multi class labels. In ICICEL'2020 Paper.
- Discrete wasserstein autoencoders for document retrieval. In ICASSP'2020 Paper.
- Pairwise supervised hashing with Bernoulli variational auto-encoder and self-control gradient estimator. In UAI'2020 Paper.
- Efficient implicit unsupervised text hashing using adversarial autoencoder. In WWW'2020 Paper Code.
- Generative semantic hashing enhanced via Boltzmann machines. In WWW'2020 Paper.
- Unsupervised few-bits semantic hashing with implicit topics modeling. In EMNLP'2020 Paper Code.
- Learning discrete structured representations by adversarially maximizing mutual information. In ICML'2020 Paper Code.
- node2hash: Graph aware deep semantic text hashing. In Inf.Process.Manag.'2020 Paper Code.
- Unsupervised semantic hashing with pairwise reconstruction. In SIGIR'2020 Paper Code.
- Hashing based answer selection. In AAAI'2020 Paper.
- Document Hashing with Mixture-Prior Generative Models. In EMNLP'2019 Paper.
- Doc2hash: Learning discrete latent variables for documents retrieval. In NAACL'2019 Paper Code.
- A binary variational autoencoder for hashing. In CIARP'2019 Paper.
- Unsupervised neural generative semantic hashing. In SIGIR'2019 Paper.
- Short text analysis based on dual semantic extension and deep hashing in microblog. In TIST'2019 Paper.
- Variational deep semantic text hashing with pairwise labels. In IMCOM'2019 Paper.
- Nash: Toward end-to-end neural architecture for generative semantic hashing. In ACL'2018 Paper Code.
- Deep semantic text hashing with weak supervision. In SIGIR'2018 Paper.
- Variational deep semantic hashing for text documents. In SIGIR'2017 Paper Code.
- A Document Modeling Method Based on Deep Generative Model and Spectral Hashing. In KSEM'2016 Paper.
- Understanding short texts through semantic enrichment and hashing. In TKDE'2015 Paper.
- Convolutional neural networks for text hashing. In IJCAI'2015 Paper.
If you find this repository helpful, please cite our survey:
@article{he2025survey,
title={A Survey on Deep Text Hashing: Efficient Semantic Text Retrieval with Binary Representation},
author={He, Liyang and Huang, Zhenya and Yang, Cheng and Li, Rui and Zhang, Zheng and Zhang, Kai and Li, Zhi and Liu, Qi and Chen, Enhong},
journal={arXiv preprint arXiv:2510.27232},
year={2025}
}