Stars
An open-source AI agent that brings the power of Gemini directly into your terminal.
Open-source deep-learning framework for building, training, and fine-tuning deep learning models using state-of-the-art Physics-ML methods
EasyR1: An Efficient, Scalable, Multi-Modality RL Training Framework based on veRL
Awesome-LLM: a curated list of Large Language Models
ConceptAttention: A method for interpreting multi-modal diffusion transformers.
Code for MetaMorph: Multimodal Understanding and Generation via Instruction Tuning
[NeurIPS 2024 Best Paper Award][GPT beats diffusion🔥] [scaling laws in visual generation📈] Official impl. of "Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction". A…
Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities
[WACV'25 Oral] Fine-Tuning Image-Conditional Diffusion Models is Easier than You Think
State-of-the-art deep learning based audio codec supporting both mono 24 kHz audio and stereo 48 kHz audio.
Implementation of MusicLM, Google's new SOTA model for music generation using attention networks, in PyTorch
SALMONN family: A suite of advanced multi-modal LLMs
A text-conditional diffusion probabilistic model capable of generating high-fidelity audio.
Official repo for Images that Sound: special spectrograms, generated by diffusion models, that can be seen as images and played as sound
An AI-Powered Speech Processing Toolkit and Open Source SOTA Pretrained Models, Supporting Speech Enhancement, Separation, and Target Speaker Extraction, etc.
Cache-Augmented Generation: A Simple, Efficient Alternative to RAG
Medical o1: towards complex medical reasoning with LLMs
SONAR, a new multilingual and multimodal fixed-size sentence embedding space, with a full suite of speech and text encoders and decoders.
Large Concept Models: Language modeling in a sentence representation space
Qwen2.5-VL is the multimodal large language model series developed by the Qwen team, Alibaba Cloud.
GPT-4V Histopathology In-Context Learning
Get up and running with Llama 3.3, DeepSeek-R1, Phi-4, Gemma 3, Mistral Small 3.1 and other large language models.
Large Language Model Text Generation Inference
Versatile audio super resolution (any -> 48kHz) with AudioSR.
State-of-the-art audio codec with 90x compression factor. Supports 44.1kHz, 24kHz, and 16kHz mono/stereo audio.
(ICLR'25) PaPaGei: Open Foundation Models for Optical Physiological Signals