Stars
The official repository for ERNIE 4.5 and ERNIEKit – its industrial-grade development toolkit based on PaddlePaddle.
Seed1.5-VL, a vision-language foundation model designed to advance general-purpose multimodal understanding and reasoning, achieving state-of-the-art performance on 38 out of 60 public benchmarks.
The repository provides code for running inference with the Meta Segment Anything Model 2 (SAM 2), links for downloading the trained model checkpoints, and example notebooks that show how to use th…
The official implementation of "Patch-as-Decodable-Token: Towards Unified Multi-Modal Vision Tasks in MLLMs"
Detect Anything via Next Point Prediction (Based on Qwen2.5-VL-3B)
Official code for UniConvNet (ICCV 2025)
[CVPR 2025 Highlight] Official code and models for Encoder-only Mask Transformer (EoMT).
Reference PyTorch implementation and models for DINOv3
Implementation of "YOLOv13: Real-Time Object Detection with Hypergraph-Enhanced Adaptive Visual Perception".
[ICCV 2025] EA-ViT: Efficient Adaptation for Elastic Vision Transformer
💫 Models for the spaCy Natural Language Processing (NLP) library
PyTorch Re-Implementation of "The Sparsely-Gated Mixture-of-Experts Layer" by Noam Shazeer et al. https://arxiv.org/abs/1701.06538
P^2HCT: Plug-and-Play Hierarchical C2F Transformer for Multi-Scale Feature Fusion
New generation of CLIP with fine-grained discrimination capability (ICML 2025)
The official implementation of [CVPR 2025] "5%>100%: Breaking Performance Shackles of Full Fine-Tuning on Visual Recognition Tasks".
[CVPR 2024] Real-Time Open-Vocabulary Object Detection
Official PyTorch implementation of "Multi-modal Queried Object Detection in the Wild" (accepted by NeurIPS 2023)
RF-DETR is a real-time object detection and segmentation model architecture developed by Roboflow, SOTA on COCO and designed for fine-tuning.
Official repository of 'Visual-RFT: Visual Reinforcement Fine-Tuning' & 'Visual-ARFT: Visual Agentic Reinforcement Fine-Tuning'
Qwen3-VL is the multimodal large language model series developed by the Qwen team, Alibaba Cloud.
A fork to add multimodal model training to open-r1
Solve Visual Understanding with Reinforced VLMs