A clean, from-scratch PyTorch implementation of DINO (Self-Distillation with No Labels), the groundbreaking self-supervised learning method that discovers meaningful visual representations without any human annotations.
Description: This animation demonstrates the training progression of our self-implemented DINO model, showing improved feature extraction and segmentation capabilities over time.
Self-Supervised Learning (SSL) is a revolutionary paradigm where models learn representations from unlabeled data by creating their own supervision signals. Instead of relying on human-annotated labels, SSL systems generate "pretext tasks" that enable models to learn rich, transferable representations.
- Contrastive Learning: Learning by comparing similar and dissimilar examples
- Predictive Tasks: Predicting hidden or transformed parts of the input
- Raw Unlabeled Data: Start with massive collections of images (e.g., from the internet)
- Pretext Task: Create an artificial task where labels are automatically generated
- Model Training: Train a model to solve this pretext task
- Representation Learning: Discard the task-specific head and use the learned features for downstream tasks with minimal labeled data
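The recipe above can be sketched in a few lines. This is an illustrative, framework-free sketch; `backbone`, `pretext_head`, and `make_pretext_labels` are hypothetical placeholders, not functions from this repository:

```python
# Illustrative sketch of the SSL recipe (names are placeholders, not real APIs).
def ssl_pretrain(images, backbone, pretext_head, make_pretext_labels):
    """Train on an automatically labeled pretext task, then keep only the backbone."""
    for img in images:
        x, y = make_pretext_labels(img)   # labels generated from the data itself
        features = backbone(x)            # the representation being learned
        loss = pretext_head(features, y)  # task-specific head scores the features
        # a real training loop would backpropagate `loss` and update both modules here
    return backbone                       # the pretext head is discarded downstream
```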
| Task | Description |
|---|---|
| Rotation | Predict the rotation angle (0°, 90°, 180°, 270°) applied to an image |
| Jigsaw Puzzles | Reassemble shuffled patches of an image |
| Image Inpainting | Predict missing parts of an image |
| Instance Discrimination | Contrast different views of an image against other images |
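The rotation task from the table is a good minimal example of automatic label generation. The sketch below uses a nested list with a simple rotate helper as a stand-in for a real image tensor:

```python
import random

def rotate90(grid, times):
    """Rotate a 2D list 90 degrees clockwise `times` times."""
    for _ in range(times):
        grid = [list(row) for row in zip(*grid[::-1])]
    return grid

def make_rotation_example(image, rng=random):
    """Return a (rotated_image, label) pair; the label is generated for free."""
    label = rng.randrange(4)  # 0 -> 0 deg, 1 -> 90, 2 -> 180, 3 -> 270
    return rotate90(image, label), label
```

A model trained to predict `label` from the rotated input must learn about object orientation, and therefore about objects, without any human annotation.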
DINO (self-DIstillation with NO labels) is an SSL algorithm that uses a simple yet powerful self-distillation framework to learn semantically meaningful image representations, even discovering object segmentation capabilities without any labels.
```python
# Conceptual sketch of DINO's core mechanism (not runnable as-is)
teacher_network = VisionTransformer()  # Processes only global views
student_network = VisionTransformer()  # Processes global and local views

# Teacher weights are an exponential moving average (EMA) of student weights
teacher_network.weights = EMA(student_network.weights)

# Self-distillation loss: the student learns to match the teacher's output
loss = distillation_loss(
    student_network(local_view),
    teacher_network(global_view),
)
```

The EMA update rule for a parameter vector is:
$$\theta_{\text{teacher}} \leftarrow m \, \theta_{\text{teacher}} + (1 - m) \, \theta_{\text{student}}$$

Where:
- $\theta_{\text{teacher}}$: Teacher model parameters
- $\theta_{\text{student}}$: Student model parameters
- $m$: Momentum coefficient (typically close to 1, e.g., 0.99 or 0.996)
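A minimal sketch of this EMA update on plain Python lists (the real implementation would operate on PyTorch parameter tensors):

```python
def ema_update(teacher_params, student_params, momentum=0.996):
    """EMA step: theta_teacher <- m * theta_teacher + (1 - m) * theta_student."""
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher_params, student_params)]
```

With `momentum=0.996`, each step moves the teacher only 0.4% of the way toward the student, which is what makes the teacher's targets slow and stable.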
Why EMA is Used in DINO:
- Stability: the teacher provides consistent, slowly evolving targets for the student to learn from.
- Collapse avoidance: EMA helps avoid the trivial solution where both networks output constant representations.
- Ensembling: the teacher acts as an ensemble of recent student checkpoints, capturing robust features.
- Global Views (e.g., 224×224 crops) → Teacher and Student
- Local Views (e.g., 96×96 crops) → Student only
- Sharpening: a low softmax temperature on the teacher output produces "peaky" distributions
- Centering: subtracting a running mean from the teacher output prevents any single dimension from dominating
- Momentum Encoder: Stable targets via exponential moving average
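The centering and sharpening steps above can be sketched on a plain list of teacher logits. This is a simplified illustration of the mechanism, not the repository's implementation:

```python
import math

def sharpen_and_center(logits, center, temperature=0.04):
    """Subtract the running center, then apply a low-temperature softmax."""
    scaled = [(x - c) / temperature for x, c in zip(logits, center)]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def update_center(center, batch_mean, momentum=0.9):
    """EMA update of the center using the current batch mean of teacher outputs."""
    return [momentum * c + (1.0 - momentum) * b
            for c, b in zip(center, batch_mean)]
```

Sharpening alone would push every output toward a single dimension; centering alone would push outputs toward uniform. Applying both in tension is what prevents collapse.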
| Feature | Benefit |
|---|---|
| No Labels Needed | Learns entirely from image structure |
| Emergent Segmentation | Discovers objects without segmentation labels |
| Excellent Features | State-of-the-art performance with linear probes |
| Conceptual Simplicity | Avoids complex contrastive learning mechanisms |
# Setup Instructions
## 1. Clone the Repository
```bash
git clone https://github.com/basaanithanaveenkumar/object-detection-BBD.git
cd object-detection-BBD
mkdir -p data
python scripts/download_dataset.py
mv data/100k/val data/100k/valid
python scripts/convert_to_coco.py
```

This setup process:
- Clones the object detection project repository
- Creates the necessary directory structure
- Downloads the BDD (Berkeley DeepDrive) dataset
- Renames the validation directory to match expected conventions
- Converts the BDD dataset format to standard COCO format for compatibility with object detection frameworks
Based on the original paper:
Emerging Properties in Self-Supervised Vision Transformers
Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, Armand Joulin
arXiv | Official Implementation
MIT License - see LICENSE for details.
Contributions welcome! Please feel free to submit issues and pull requests.
⭐ If this project helps your research, please give it a star!
