hly1998/DeepTextHashing

A Survey on Deep Text Hashing: Efficient Semantic Text Retrieval with Binary Representation

This repository collects research papers on deep text hashing and accompanies our survey paper, A Survey on Deep Text Hashing: Efficient Semantic Text Retrieval with Binary Representation. The list is updated regularly. If you find inaccuracies or missing works, please open an issue or submit a pull request.

Models

We have implemented several deep text hashing models using the PyTorch framework. Our foundational code structure is inspired by the VDSH repository.

Implemented Models

| Model | Paper | Venue | Status |
|---|---|---|---|
| VDSH | Variational deep semantic hashing for text documents | SIGIR'2017 | |
| NbrReg | Deep semantic text hashing with weak supervision | SIGIR'2018 | |
| NASH | Toward end-to-end neural architecture for generative semantic hashing | ACL'2018 | |
| B-VAE | A binary variational autoencoder for hashing | CIARP'2019 | |
| Doc2Hash | Learning discrete latent variables for documents retrieval | NAACL'2019 | |
| RBSH | Unsupervised neural generative semantic hashing | SIGIR'2019 | |
| AMMI | Learning discrete structured representations by adversarially maximizing mutual information | ICML'2020 | |
| PairRec | Unsupervised semantic hashing with pairwise reconstruction | SIGIR'2020 | |
| WISH | Unsupervised few-bits semantic hashing with implicit topics modeling | EMNLP'2020 | |
| MISH | Unsupervised multi-index semantic hashing | WWW'2021 | |
| SNUH | Integrating semantics and neighborhood information | ACL'2021 | |
| SSB-VAE | Self-supervised Bernoulli autoencoders for semi-supervised hashing | CIARP'2021 | |
| SMASH | An efficient and robust semantic hashing framework | TOIS'2023 | |
| HierHash | Multi-grained prototype-induced hierarchical generative model | EMNLP'2024 | |
| DHSH | De-confusing hard samples for text semantic hashing | ICASSP'2025 | |

Note: Due to variations in data preprocessing, the results of different models may deviate from those reported in the original papers. We are actively working to standardize both the data processing pipeline and evaluation metrics.
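
To make the evaluation side concrete, here is a minimal sketch of one metric commonly reported in the text hashing literature: precision of the top-k items retrieved by Hamming distance. It is not the evaluation code in utils/. The function names, the packing of hash codes into Python integers, and the single-label relevance rule are all assumptions made for illustration.

```python
from typing import Hashable, List, Sequence


def hamming_distance(a: int, b: int) -> int:
    """Count the differing bits between two hash codes packed as integers."""
    return bin(a ^ b).count("1")


def precision_at_k(
    query_code: int,
    query_label: Hashable,
    db_codes: Sequence[int],
    db_labels: Sequence[Hashable],
    k: int,
) -> float:
    """Fraction of the k nearest database items (by Hamming distance)
    that share the query's label. Assumes single-label data."""
    # sorted() is stable, so ties keep their original database order.
    ranked: List[int] = sorted(
        range(len(db_codes)),
        key=lambda i: hamming_distance(query_code, db_codes[i]),
    )
    top = ranked[:k]
    if not top:
        return 0.0
    hits = sum(1 for i in top if db_labels[i] == query_label)
    return hits / len(top)


# Toy database of 4-bit codes (hypothetical values).
db_codes = [0b0000, 0b0001, 0b1111, 0b0011]
db_labels = ["a", "a", "b", "b"]
p2 = precision_at_k(0b0000, "a", db_codes, db_labels, k=2)
p3 = precision_at_k(0b0000, "a", db_codes, db_labels, k=3)
```

Differences in how ties are broken, how k is chosen, and how relevance is defined are typical reasons why reproduced numbers drift from published ones.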

Project Structure

DeepTextHashing/
├── models/              # Model implementations
│   ├── VDSH/
│   ├── NbrReg/
│   ├── NASH/
│   ├── B-VAE/
│   ├── Doc2Hash/
│   ├── RBSH/
│   ├── AMMI/
│   ├── PairRec/
│   ├── WISH/
│   ├── MISH/
│   ├── SNUH/
│   ├── SSB-VAE/
│   ├── SMASH/
│   ├── HierHash/
│   └── DHSH/
├── textdata/            # Dataset loading utilities
├── utils/               # Preprocessing and evaluation utilities
└── requirements.txt

Quick Start

1. Installation

pip install -r requirements.txt

2. Data Preprocessing

Refer to the code in the utils/ folder to preprocess the dataset:

python utils/preprocess.py --dataset ng20
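
For orientation, many of the models listed above take a bag-of-words or TF-IDF vector per document as input. The sketch below shows what such a step could look like using only the standard library. It is an assumption about the general shape of the preprocessing, not a description of utils/preprocess.py. The helper names, the whitespace tokenizer, and the smoothed IDF formula are hypothetical choices.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def build_vocab(docs: List[str], min_df: int = 1) -> Tuple[Dict[str, int], Counter]:
    """Map each word that appears in at least min_df documents to an index."""
    df: Counter = Counter()
    for doc in docs:
        df.update(set(doc.lower().split()))  # count each word once per document
    words = sorted(w for w, c in df.items() if c >= min_df)
    return {w: i for i, w in enumerate(words)}, df


def tfidf_vector(doc: str, vocab: Dict[str, int], df: Counter, n_docs: int) -> List[float]:
    """Dense TF-IDF vector: raw term frequency times (log(N / df) + 1)."""
    counts = Counter(w for w in doc.lower().split() if w in vocab)
    vec = [0.0] * len(vocab)
    for word, tf in counts.items():
        idf = math.log(n_docs / df[word]) + 1.0
        vec[vocab[word]] = tf * idf
    return vec


docs = ["the cat sat", "the dog sat", "a cat"]
vocab, df = build_vocab(docs)
vec = tfidf_vector("the cat cat", vocab, df, n_docs=len(docs))
```

Real pipelines usually add stop-word removal, vocabulary size limits, and sparse storage, which this sketch leaves out.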

3. Training

Once data preparation is complete, train any model with:

sh models/{model_name}/train.sh

For example:

sh models/VDSH/train.sh
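
If you want to launch several models from a script, a small wrapper can build the same command from a model name. This is a convenience sketch only: `MODEL_NAMES` is copied from the project structure above, and `train_command` is a hypothetical helper that does not exist in the repository.

```python
from pathlib import PurePosixPath
from typing import List

# Directory names under models/, taken from the project structure.
MODEL_NAMES = [
    "VDSH", "NbrReg", "NASH", "B-VAE", "Doc2Hash", "RBSH", "AMMI", "PairRec",
    "WISH", "MISH", "SNUH", "SSB-VAE", "SMASH", "HierHash", "DHSH",
]


def train_command(model_name: str, repo_root: str = ".") -> List[str]:
    """Return the argv list equivalent to `sh models/{model_name}/train.sh`."""
    if model_name not in MODEL_NAMES:
        raise ValueError(f"unknown model: {model_name!r}")
    script = PurePosixPath(repo_root) / "models" / model_name / "train.sh"
    return ["sh", str(script)]


cmd = train_command("VDSH")
```

From the repository root, the result can be passed to `subprocess.run(cmd, check=True)`.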

Datasets

We have compiled a selection of widely used benchmark datasets for text hashing research. The datasets span diverse domains and differ in scale and label type; the table below also gives a download link for each. For a detailed introduction to the datasets, please refer to our survey.

| Dataset | Instances | Categories | Label Type | Link |
|---|---|---|---|---|
| 20Newsgroups | 18,846 | 20 | Single-label | link |
| Agnews | 127,600 | 4 | Single-label | link |
| Reuters | 10,788 | 90/20 | Multi-label | link |
| DBpedia | 60,000 | 14 | Single-label | link |
| RCV1 | 804,414 | 103/4 | Multi-label | link |
| TMC | 28,596 | 22 | Multi-label | link |
| NYT | 11,527 | 26 | Single-label | link |
| Yahooanswer | 1,460,000 | 10 | Single-label | link |
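
The label type affects how retrieval relevance is judged. A convention often used in hashing work, assumed here and not necessarily the one adopted in the survey or in this repository, is that a retrieved document is relevant if it shares at least one label with the query. For single-label data this reduces to label equality. The function name below is hypothetical.

```python
from typing import Hashable, Iterable


def is_relevant(query_labels: Iterable[Hashable], doc_labels: Iterable[Hashable]) -> bool:
    """True if the two documents share at least one label.

    Single-label documents are passed as one-element collections,
    so the same rule covers both label types.
    """
    return bool(set(query_labels) & set(doc_labels))


single = is_relevant(["sports"], ["sports"])                # same single label
multi = is_relevant(["grain", "wheat"], ["wheat", "corn"])  # one shared label
none = is_relevant(["grain"], ["crude"])                    # no shared label
```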

Paper List

Meaning of the Marker

Marker Meaning
Reconstruction-based method
Applying a prior on the latent representation (X: G=Gaussian, B=Bernoulli, M=Mixture, C=Categorical, BM=Boltzmann, GA=Graph)
Pseudo-similarity-based method
Maximal mutual information method
Learning semantic from categories
Learning semantic from relevance
Promoting code balance
Promoting few-bit code
Using quantization method (X: Loss=quantization loss, Sgn=Signum, Sigmoid, Tanh, STanh=scaled tanh)
Promoting the robustness of hash codes
Optimization of gradients during backpropagation in discrete layers
Adaptation to hashing index
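
Two of the ideas above, quantization and code balance, can be illustrated in a few lines. The sketch compares sign thresholding with per-dimension median thresholding, which is one common way to make each bit split the data evenly, and shows a scaled tanh as a smooth stand-in for the sign function. The definition `tanh(beta * z)` is an assumption about what "scaled tanh" means here. All function names are hypothetical, and no specific paper's method is reproduced.

```python
import math
import statistics
from typing import List


def sign_binarize(latents: List[List[float]]) -> List[List[int]]:
    """Set a bit to 1 where the latent value is positive."""
    return [[1 if v > 0 else 0 for v in row] for row in latents]


def median_binarize(latents: List[List[float]]) -> List[List[int]]:
    """Threshold each dimension at its median over the dataset,
    so that roughly half of the documents get a 1 in every bit."""
    dims = len(latents[0])
    thresholds = [statistics.median(row[d] for row in latents) for d in range(dims)]
    return [[1 if row[d] > thresholds[d] else 0 for d in range(dims)] for row in latents]


def bit_balance(codes: List[List[int]]) -> List[float]:
    """Fraction of ones per bit; 0.5 means a perfectly balanced bit."""
    n = len(codes)
    return [sum(code[d] for code in codes) / n for d in range(len(codes[0]))]


def scaled_tanh(z: float, beta: float) -> float:
    """Smooth relaxation of sign(z); approaches +/-1 as beta grows."""
    return math.tanh(beta * z)


# Toy latent vectors (hypothetical values): dimension 0 is always positive.
latents = [[0.1, 2.0], [0.4, 3.0], [0.9, -1.0], [0.7, 5.0]]
sign_balance = bit_balance(sign_binarize(latents))
median_codes = median_binarize(latents)
median_balance = bit_balance(median_codes)
```

On this toy data, sign thresholding wastes the first bit because every document receives the same value, while median thresholding keeps both bits informative.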

Papers

  • De-confusing Hard Samples for Text Semantic Hashing. In ICASSP'2025 Paper.

  • Document Hashing with Multi-Grained Prototype-Induced Hierarchical Generative Model. In EMNLP'2024 Paper.

  • Efficient similar exercise retrieval model based on unsupervised semantic hashing. In JCA'2024.

  • Towards Efficient Coarse-grained Dialogue Response Selection. In TOIS'2023 Paper.

  • An efficient and robust semantic hashing framework for similar text search. In TOIS'2023 Paper Code.

  • Exploiting Multiple Features for Hash Codes Learning with Semantic-Alignment-Promoting Variational Auto-encoder. In NLPCC'2023 Paper.

  • Intra-category aware hierarchical supervised document hashing. In TKDE'2022 Paper Code.

  • Accelerating code search with deep hashing and code classification. In ACL'2022 Paper.

  • LASH: Large-scale academic deep semantic hashing. In TKDE'2021 Paper.

  • Efficient passage retrieval with hashing for open-domain question answering. In ACL'2021 Paper Code.

  • Refining BERT embeddings for document hashing via mutual information maximization. In EMNLP'2021 Paper Code.

  • Integrating Semantics and Neighborhood Information with Graph-Driven Generative Models for Document Retrieval. In ACL/IJCNLP'2021 Paper Code.

  • Unsupervised multi-index semantic hashing. In WWW'2021 Paper Code.

  • Self-supervised Bernoulli autoencoders for semi-supervised hashing. In CIARP'2021 Paper Code.

  • Conditional text hashing utilizing pair-wise multi class labels. In ICICEL'2020 Paper.

  • Discrete Wasserstein autoencoders for document retrieval. In ICASSP'2020 Paper.

  • Pairwise supervised hashing with Bernoulli variational auto-encoder and self-control gradient estimator. In UAI'2020 Paper.

  • Efficient implicit unsupervised text hashing using adversarial autoencoder. In WWW'2020 Paper Code.

  • Generative semantic hashing enhanced via Boltzmann machines. In WWW'2020 Paper.

  • Unsupervised few-bits semantic hashing with implicit topics modeling. In EMNLP'2020 Paper Code.

  • Learning discrete structured representations by adversarially maximizing mutual information. In ICML'2020 Paper Code.

  • node2hash: Graph aware deep semantic text hashing. In Inf.Process.Manag.'2020 Paper Code.

  • Unsupervised semantic hashing with pairwise reconstruction. In SIGIR'2020 Paper Code.

  • Hashing based answer selection. In AAAI'2020 Paper.

  • Document Hashing with Mixture-Prior Generative Models. In EMNLP'2019 Paper.

  • Doc2hash: Learning discrete latent variables for documents retrieval. In NAACL'2019 Paper Code.

  • A binary variational autoencoder for hashing. In CIARP'2019 Paper.

  • Unsupervised neural generative semantic hashing. In SIGIR'2019 Paper.

  • Short text analysis based on dual semantic extension and deep hashing in microblog. In TIST'2019 Paper.

  • Variational deep semantic text hashing with pairwise labels. In IMCOM'2019 Paper.

  • NASH: Toward end-to-end neural architecture for generative semantic hashing. In ACL'2018 Paper Code.

  • Deep semantic text hashing with weak supervision. In SIGIR'2018 Paper.

  • Variational deep semantic hashing for text documents. In SIGIR'2017 Paper Code.

  • A Document Modeling Method Based on Deep Generative Model and Spectral Hashing. In KSEM'2016 Paper.

  • Understanding short texts through semantic enrichment and hashing. In TKDE'2015 Paper.

  • Convolutional neural networks for text hashing. In IJCAI'2015 Paper.

Citation

If you find this repository helpful, please cite our survey:

@article{he2025survey,
  title={A Survey on Deep Text Hashing: Efficient Semantic Text Retrieval with Binary Representation},
  author={He, Liyang and Huang, Zhenya and Yang, Cheng and Li, Rui and Zhang, Zheng and Zhang, Kai and Li, Zhi and Liu, Qi and Chen, Enhong},
  journal={arXiv preprint arXiv:2510.27232},
  year={2025}
}

About

A Python implementation of several deep text hashing (also called deep semantic hashing) models.
