Stars
A fully open-source humanoid arm for physical AI research and deployment in contact-rich environments.
This repository provides training and evaluation code for `MCTR` on the MMPTracking and MTMC_NVIDIA datasets.
Kyutai's Speech-To-Text and Text-To-Speech models based on the Delayed Streams Modeling framework.
A ComfyUI extension for BAGEL (Unified Model for Multimodal Understanding and Generation)
The simplest, fastest repository for training/finetuning small-sized VLMs.
Kimi-Audio, an open-source audio foundation model excelling in audio understanding, generation, and conversation
The Open-Source Multimodal AI Agent Stack: Connecting Cutting-Edge AI Models and Agent Infra
[NeurIPS 2025] SpatialLM: Training Large Language Models for Structured Indoor Modeling
GRAB: A Dataset of Whole-Body Human Grasping of Objects
A mini, open-weights version of our Proxy assistant.
The official PyTorch implementation of the paper "MotionGPT: Finetuned LLMs are General-Purpose Motion Generators"
[Technical Report 2023] PhysHOI: Physics-Based Imitation of Dynamic Human-Object Interaction
Easily fine-tune, evaluate and deploy gpt-oss, Qwen3, DeepSeek-R1, or any open source LLM / VLM!
🌐 Make websites accessible for AI agents. Automate tasks online with ease.
WiLoR: End-to-end 3D hand localization and reconstruction in-the-wild
A Library for Differentiable Logic Gate Networks
openpilot is an operating system for robotics. Currently, it upgrades the driver assistance system on 300+ supported cars.
Official code for "F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching"
Official code implementation of General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model
Official PyTorch implementation for "VidToMe: Video Token Merging for Zero-Shot Video Editing" (CVPR 2024)
[ICLR 2024] LLM-grounded Video Diffusion Models (LVD): official implementation for the LVD paper
[NeurIPS 2023] Self-supervised Object-Centric Learning for Videos
[CVPR 2023] Code for "3D Concept Learning and Reasoning from Multi-View Images" (evelinehong/3D-CLR-Official, forked from zsh2000/3D-CLR)
Track-Anything is a flexible and interactive tool for video object tracking and segmentation, based on Segment Anything, XMem, and E2FGVI.
Code for NeurIPS 2022 paper "Learning Viewpoint-Agnostic Visual Representations by Recovering Tokens in 3D Space"
[NeurIPS 2024] Depth Anything V2. A More Capable Foundation Model for Monocular Depth Estimation