AI-powered emoji search engine. This application understands the meaning (semantics) behind your query to find the most relevant emojis. It supports over 50 languages and offers compound word analysis.
Semantic Emoji Search is a web application built with Streamlit that leverages vector embeddings to perform semantic searches. Unlike traditional search engines that rely on exact text matches, this tool uses an embedding model to understand the context and sentiment of your input.
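For intuition, here is a minimal sketch of embedding-based matching, assuming the default sentence-transformers model listed below; the emoji descriptions are toy stand-ins, not actual entries from the dataset:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")

# Toy stand-ins for the LLM-generated emoji descriptions
descriptions = {
    "😢": "a face crying, expressing sadness or grief",
    "🌧️": "a cloud with rain falling, gloomy weather",
    "🍕": "a slice of pizza with cheese and toppings",
}

query_vec = model.encode("sad", convert_to_tensor=True)
desc_vecs = model.encode(list(descriptions.values()), convert_to_tensor=True)

# Rank emojis by cosine similarity instead of exact text overlap
scores = util.cos_sim(query_vec, desc_vecs)[0]
for emoji, score in sorted(zip(descriptions, scores.tolist()), key=lambda x: x[1], reverse=True):
    print(emoji, round(score, 3))
```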
Key features include:
- Semantic Understanding: Finds emojis conceptually related to your query (e.g., "sad" might return 😢, 🌧️, or 💔).
- Multilingual Support: Supports 50+ languages.
- Compound Word Analysis: Automatically breaks down compound words (e.g., "Sunflower" → "Sun" + "Flower") to provide granular emoji suggestions for each part (Supports 25+ languages).
- Interactive UI: Clean and responsive interface with grid visualizations.
The application utilizes the badrex/LLM-generated-emoji-descriptions dataset.
The system supports state-of-the-art embedding models to capture semantic meaning:
- `intfloat/multilingual-e5-base`
- `sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2` (Default & Recommended)
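Either model can be loaded through the standard sentence-transformers API; the sketch below (model names taken from the list above) shows how semantically equivalent queries in different languages land close together in the embedding space:

```python
from sentence_transformers import SentenceTransformer, util

# Default model; "intfloat/multilingual-e5-base" can be swapped in for comparison
model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")

# "sad" in English, German, and Japanese
vectors = model.encode(["sad", "traurig", "悲しい"], convert_to_tensor=True)
print(util.cos_sim(vectors, vectors))  # off-diagonal similarities should be high
```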
- Normal Search: Supports 50+ languages. Details are in the publication Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation. The model covers the following 50+ languages: ar, bg, ca, cs, da, de, el, en, es, et, fa, fi, fr, fr-ca, gl, gu, he, hi, hr, hu, hy, id, it, ja, ka, ko, ku, lt, lv, mk, mn, mr, ms, my, nb, nl, pl, pt, pt-br, ro, ru, sk, sl, sq, sr, sv, th, tr, uk, ur, vi, zh-cn, zh-tw.
- Compound Search: Supports 25+ languages (based on WordNet OMW-1.4).
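The exact compound-splitting logic lives in the app itself; as a rough illustration of the idea, langdetect can supply the input language and WordNet (via OMW-1.4) can validate candidate sub-words. Note that `split_compound` below is a hypothetical helper, not a function from this project:

```python
import nltk
from langdetect import detect
from nltk.corpus import wordnet as wn

nltk.download("wordnet", quiet=True)
nltk.download("omw-1.4", quiet=True)

def split_compound(word, lang="eng"):
    """Return the first (head, tail) split where both parts exist in WordNet."""
    word = word.lower()
    for i in range(2, len(word) - 1):
        head, tail = word[:i], word[i:]
        if wn.synsets(head, lang=lang) and wn.synsets(tail, lang=lang):
            return head, tail
    return None

# langdetect returns ISO 639-1 codes (e.g. "de"); WordNet expects ISO 639-3 (e.g. "deu"),
# so a small mapping step is needed for non-English input.
print(detect("Sonnenblume"))        # likely "de" (language detection is probabilistic)
print(split_compound("sunflower"))  # ("sun", "flower")
```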
Before you begin, ensure you have the following installed on your system:
- Python 3.13 or higher
- pip (Python package installer) or uv (Project manager)
- Clone the repository:

  ```bash
  git clone <repository_url>
  cd emoji-search
  ```
- Install Dependencies: It is recommended to use a virtual environment.

  Using `pip`:

  ```bash
  pip install streamlit chromadb sentence-transformers nltk langdetect
  ```

  Using `uv` (if applicable):

  ```bash
  uv sync
  ```
- NLTK Data: The application will attempt to download the necessary NLTK data (WordNet, OMW-1.4) on the first run. Ensure you have an internet connection.
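If the automatic download is blocked (for example, on a machine without internet access at runtime), the same data can be fetched manually in a Python shell beforehand:

```python
import nltk

# WordNet plus the Open Multilingual Wordnet used for compound analysis
nltk.download("wordnet")
nltk.download("omw-1.4")
```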
The application expects a pre-populated vector database for the embeddings.
- ChromaDB: Ensure the `chroma_db` directory is present in the root of the project. This directory should contain the vector collection `LLM-generated-emoji-multilingual-MiniLM-L12-v2`.
Note: If the database is missing, the search functionality will not work.
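For reference, this is roughly how a persisted ChromaDB collection is opened and queried; the directory and collection names come from above, while the query text and result handling are illustrative assumptions:

```python
import chromadb
from sentence_transformers import SentenceTransformer

# Open the database shipped in the project root
client = chromadb.PersistentClient(path="chroma_db")
collection = client.get_collection("LLM-generated-emoji-multilingual-MiniLM-L12-v2")

# Embed the query with the same model family used to build the collection
model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")
query_embedding = model.encode("sad").tolist()

# Fetch the ten closest emoji descriptions
results = collection.query(query_embeddings=[query_embedding], n_results=10)
print(results["documents"][0])  # matched description texts; metadata layout depends on how the DB was built
```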
To start the application, run the following command in your terminal:
```bash
streamlit run app.py
```

Once the server starts, your default web browser will open to http://localhost:8501.
- Enter a Phrase: Type any word or phrase into the search box (e.g., "Spicy food", "แมวน่ารัก", "Bonjour").
- View Results: The most semantically relevant emojis will appear instantly.
- Compound Analysis: If you enter a compound word (like "Firefly"), the app will also show results for the individual components ("Fire" + "Fly").
This project makes use of several open-source libraries and datasets:
- Streamlit - For the web application framework.
- ChromaDB - For the vector database storage and retrieval.
- Sentence Transformers - For generating semantic embeddings.
- NLTK - For natural language processing and WordNet integration.
- LangDetect - For language detection.
- Open Multilingual Wordnet (OMW) - For multilingual support.