Starred repositories
Precision Alignment, Infinite Possibilities
Official implementation of "Continuous Autoregressive Language Models"
The official implementation of PeriodWave and PeriodWave-Turbo
Elucidated Text-To-Audio (ETTA) is a SOTA text-to-audio model with a holistic understanding of the design space and trained with synthetic captions.
VibeVoice: Expressive, longform conversational speech synthesis. (Community fork)
SoulX-Podcast is an inference codebase by the Soul AI team for generating high-fidelity podcasts from text.
Training, inference, and testing of the SAC speech codec model.
OmniVinci is an omni-modal LLM for joint understanding of vision, audio, and language.
kyutai-labs / nanoGPTaudio
Forked from karpathy/nanoGPT
Code for the blog "Neural audio codecs: how to get audio into LLMs"
[NAACL 2025] WaveFM: A High-Fidelity and Efficient Vocoder Based on Flow Matching
PyTorch implementation of Audio Flamingo: Series of Advanced Audio Understanding Language Models
LongCat Audio Tokenizer and Detokenizer
Data Pipeline, Models, and Benchmark for Omni-Captioner.
PESQ (Perceptual Evaluation of Speech Quality) Wrapper for Python Users (narrow band and wide band)
FLM-Audio is an audio-language subversion of RoboEgo/FLM-Ego -- an omnimodal model with native full duplexity.
Speech To Speech: an effort for an open-sourced and modular GPT4-o
A CLI text-to-speech tool using the Kokoro model, supporting multiple languages, voices (with blending), and various input formats including EPUB books and PDF documents.
Official implementation of DNSMOS Pro (accepted at INTERSPEECH 2024).
Language modelling on RVQ tokens with minimal codes
Intelligent automation and multi-agent orchestration for Claude Code
Claude Code is an agentic coding tool that lives in your terminal, understands your codebase, and helps you code faster by executing routine tasks, explaining complex code, and handling git workflo…
A comprehensive toolkit for podcast evaluation. https://arxiv.org/abs/2510.00485
Official code for "Semantic-VAE: Semantic-Alignment Latent Representation for Better Speech Synthesis"
Evaluating Bias in Spoken Dialogue LLMs for Real-World Decisions and Recommendations
ACE-Step: A Step Towards Music Generation Foundation Model
Ming-UniAudio: Speech LLM for Joint Understanding, Generation and Editing with Unified Representation
AIR-Bench: Benchmarking Large Audio-Language Models via Generative Comprehension
