OneLLM


A "drop-in" replacement for OpenAI's client that offers a unified interface for interacting with large language models from various providers, with support for hundreds of models, built-in fallback mechanisms, and enhanced reliability features.


Important

Support this project by starring this repo on GitHub!

More stars → more visibility → more contributors → better features → more robust tool for everyone 🎉

🌟 Star this repo on GitHub →

Thank you for your support! 🙏


📚 Table of Contents

  • Overview
  • Getting Started
  • Key Features
  • Supported Providers
  • Architecture
  • API Design
  • Advanced Features
  • Migration from OpenAI
  • Model Naming Convention
  • Configuration
  • Test Coverage
  • Documentation
  • Call for Contributions
  • License
  • Acknowledgements

👉 Overview

OneLLM is a lightweight, provider-agnostic Python library that offers a unified interface for interacting with large language models (LLMs) from various providers. It simplifies the integration of LLMs into applications by providing a consistent API while abstracting away provider-specific implementation details.

The library follows the OpenAI client API design pattern, making it familiar to developers already using OpenAI and enabling easy migration for existing applications. Simply change your import statements and instantly gain access to hundreds of models across dozens of providers while maintaining your existing code structure.

With support for 19 implemented providers (and more planned), OneLLM gives you access to more than 300 unique language models through a single, consistent interface - from the latest proprietary models to open-source alternatives, all accessible through familiar OpenAI-compatible patterns.

Note

Ready for Use: OneLLM now supports 19 providers with 300+ models! From cloud APIs to local models, you can access them all through a single, unified interface. Contributions are welcome to help add even more providers!


🚀 Getting Started

Installation

# Basic installation (includes OpenAI compatibility and download utility)
pip install OneLLM

# For all providers (includes dependencies for future provider support)
pip install "OneLLM[all]"

Download Models for Local Use

OneLLM includes a built-in utility for downloading GGUF models:

# Download a model from HuggingFace (saves to ~/llama_models by default)
onellm download --repo-id "TheBloke/Llama-2-7B-GGUF" --filename "llama-2-7b.Q4_K_M.gguf"

# Download to a custom directory
onellm download -r "microsoft/Phi-3-mini-4k-instruct-gguf" -f "Phi-3-mini-4k-instruct-q4.gguf" -o /path/to/models
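
Once downloaded, a GGUF file can be run directly through the llama.cpp provider. A minimal sketch, reusing the file from the first download command above (see the Model Naming Convention section for the llama_cpp/ prefix):

# Use the downloaded GGUF model locally via llama.cpp
from onellm import ChatCompletion

response = ChatCompletion.create(
    model="llama_cpp/llama-2-7b.Q4_K_M.gguf",  # provider prefix + GGUF filename
    messages=[{"role": "user", "content": "Say hello in one short sentence."}]
)

print(response.choices[0].message["content"])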

Quick Win: Your First LLM Call

# Basic usage with OpenAI-compatible syntax
from onellm import ChatCompletion

response = ChatCompletion.create(
    model="openai/gpt-4o-mini",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello, how are you?"}
    ]
)

print(response.choices[0].message["content"])
# Output: I'm doing well, thank you for asking! I'm here and ready to help you...

For more detailed examples, check out the examples directory.


✨ Key Features

| Feature | Description |
|---------|-------------|
| 📦 Drop-in replacement | Use your existing OpenAI code with minimal changes |
| 🔄 Provider-agnostic | Support for 300+ models across 19 implemented providers |
| 🔁 Automatic fallback | Seamlessly switch to alternative models when needed |
| 🔄 Auto-retry mechanism | Retry the same model multiple times before failing |
| 🧩 OpenAI-compatible | Familiar interface for developers used to OpenAI |
| 📺 Streaming support | Real-time streaming responses from supported providers |
| 🖼️ Multi-modal capabilities | Support for text, images, and audio across compatible models |
| 🏠 Local LLM support | Run models locally via Ollama and llama.cpp |
| ⬇️ Model downloads | Built-in CLI to download GGUF models from HuggingFace |
| 🧹 Unicode artifact cleaning | Automatic removal of invisible characters to prevent AI detection |
| 🏷️ Consistent naming | Clear provider/model-name format for attribution |
| 🧪 Comprehensive tests | 96% test coverage ensuring reliability |
| 📄 Apache-2.0 license | Open-source license that protects contributions |

🌐 Supported Providers

OneLLM currently supports 19 providers with more on the way:

Cloud API Providers (17)

  • Anthropic - Claude family of models
  • Anyscale - Configurable AI platform
  • AWS Bedrock - Access to multiple model families
  • Azure OpenAI - Microsoft-hosted OpenAI models
  • Cohere - Command models with RAG
  • DeepSeek - Chinese LLM provider
  • Fireworks - Fast inference platform
  • Moonshot - Kimi models with long-context capabilities
  • Google AI Studio - Gemini models via API key
  • Groq - Ultra-fast inference for Llama, Mixtral
  • Mistral - Mistral Large, Medium, Small
  • OpenAI - GPT-4o, o3-mini, DALL-E, Whisper, etc.
  • OpenRouter - Gateway to 100+ models
  • Perplexity - Search-augmented models
  • Together AI - Open-source model hosting
  • Vertex AI - Google Cloud's enterprise Gemini
  • X.AI - Grok models

Local Providers (2)

  • Ollama - Run models locally with easy management
  • llama.cpp - Direct GGUF model execution

Notable Models Available

Through these providers, you gain access to hundreds of models, including:

| Model Family | Notable Models |
|--------------|----------------|
| OpenAI Family | GPT-4o, GPT-4 Turbo, o3 |
| Claude Family | Claude 3 Opus, Claude 3 Sonnet, Claude 3 Haiku |
| Llama Family | Llama 3 70B, Llama 3 8B, Code Llama |
| Mistral Family | Mistral Large, Mistral 7B, Mixtral |
| Gemini Family | Gemini Pro, Gemini Ultra, Gemini Flash |
| Embeddings | Ada-002, text-embedding-3-small/large, Cohere embeddings |
| Multimodal | GPT-4 Vision, Claude 3 Vision, Gemini Pro Vision |

🏗️ Architecture

OneLLM follows a modular architecture with clear separation of concerns:

---
config:
  look: handDrawn
  theme: mc
  themeVariables:
    background: 'transparent'
    primaryColor: '#fff0'
    secondaryColor: 'transparent'
    tertiaryColor: 'transparent'
    mainBkg: 'transparent'

  flowchart:
    layout: fixed
---
flowchart TD
    %% User API Layer
    User(User Application) --> ChatCompletion
    User --> Completion
    User --> Embedding
    User --> OpenAIClient["OpenAI Client Interface"]

    subgraph API["Public API Layer"]
        ChatCompletion["ChatCompletion\n.create() / .acreate()"]
        Completion["Completion\n.create() / .acreate()"]
        Embedding["Embedding\n.create() / .acreate()"]
        OpenAIClient
    end

    %% Core logic
    subgraph Core["Core Logic"]
        Router["Provider Router"]
        Config["Configuration\nEnvironment Variables\nAPI Keys"]
        FallbackManager["Fallback Manager"]
        RetryManager["Retry Manager"]
    end

    %% Provider Layer
    BaseProvider["Provider Interface<br>(Base Class)"]

    subgraph Implementations["Provider Implementations"]
        OpenAI["OpenAI"]
        Anthropic["Anthropic"]
        GoogleProvider["Google"]
        Groq["Groq"]
        Ollama["Local LLMs"]
        OtherProviders["20+ Others"]
    end

    %% Utilities
    subgraph Utilities["Utilities"]
        Streaming["Streaming<br>Handlers"]
        TokenCounting["Token<br>Counter"]
        ErrorHandling["Error<br>Handling"]
        Types["Type<br>Definitions"]
        Models["Response<br>Models"]
    end

    %% External services
    OpenAIAPI["OpenAI API"]
    AnthropicAPI["Anthropic API"]
    GoogleAPI["Google API"]
    GroqAPI["Groq API"]
    LocalModels["Ollama/llama.cpp"]
    OtherAPIs["..."]

    %% Connections
    ChatCompletion --> Router
    Completion --> Router
    Embedding --> Router
    OpenAIClient --> Router

    Router --> Config
    Router --> FallbackManager
    FallbackManager --> RetryManager

    RetryManager --> BaseProvider

    BaseProvider --> OpenAI
    BaseProvider --> Anthropic
    BaseProvider --> GoogleProvider
    BaseProvider --> Groq
    BaseProvider --> Ollama
    BaseProvider --> OtherProviders

    BaseProvider --> Streaming
    BaseProvider --> TokenCounting
    BaseProvider --> ErrorHandling

    OpenAI --> OpenAIAPI
    Anthropic --> AnthropicAPI
    GoogleProvider --> GoogleAPI
    Groq --> GroqAPI
    Ollama --> LocalModels
    OtherProviders --> OtherAPIs

Core Components

  1. Public Interface Classes

    • ChatCompletion - For chat-based interactions
    • Completion - For text completion
    • Embedding - For generating embeddings
  2. Provider Interface

    • Abstract base class defining required methods (see the illustrative sketch after this list)
    • Provider-specific implementations
  3. Configuration System

    • Environment variable support
    • Runtime configuration options
  4. Error Handling

    • Standardized error types
    • Provider-specific error mapping
  5. Fallback System

    • Automatic retries with alternative models
    • Configurable fallback chains
    • Graceful degradation options
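
For orientation, the sketch below shows roughly what the provider interface in item 2 describes; the class and method names here are hypothetical and do not reflect OneLLM's actual internals:

# Hypothetical illustration of the provider abstraction (names are not OneLLM's real API)
from abc import ABC, abstractmethod


class ProviderBase(ABC):
    """Abstract base class each provider implements (illustrative only)."""

    @abstractmethod
    def create_chat_completion(self, messages: list, model: str, **kwargs):
        """Send the unified request to the provider and return an OpenAI-style response."""


class ExampleProvider(ProviderBase):
    def create_chat_completion(self, messages: list, model: str, **kwargs):
        # 1. Translate the unified arguments into the provider's native request format
        # 2. Call the provider's API
        # 3. Map the provider's response back onto OpenAI-compatible response objects
        ...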

🔌 API Design

OneLLM mirrors the OpenAI Python client library API for familiarity:

Chat Completions

from onellm import ChatCompletion

# Basic usage (identical to OpenAI's client)
response = ChatCompletion.create(
    model="openai/gpt-4",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello, how are you?"}
    ]
)

# Streaming
for chunk in ChatCompletion.create(
    model="anthropic/claude-3-opus",
    messages=[{"role": "user", "content": "Write a poem about AI"}],
    stream=True
):
    print(chunk.choices[0].delta.content, end="", flush=True)

# With fallback options
response = ChatCompletion.create(
    model=["openai/gpt-4", "anthropic/claude-3-opus", "mistral/mistral-large"],
    messages=[{"role": "user", "content": "Explain quantum computing"}]
)

# With retries and fallbacks for enhanced reliability
response = ChatCompletion.create(
    model="openai/gpt-4",
    messages=[{"role": "user", "content": "Explain quantum computing"}],
    retries=3,  # Try gpt-4 up to 3 additional times before failing or using fallbacks
    fallback_models=["anthropic/claude-3-opus", "mistral/mistral-large"]
)

# JSON mode for structured outputs
response = ChatCompletion.create(
    model="openai/gpt-4o",
    messages=[
        {"role": "user", "content": "List the top 3 planets by size"}
    ],
    response_format={"type": "json_object"}  # Request structured JSON output
)
print(response.choices[0].message["content"])  # Outputs valid, parseable JSON

# Multi-modal with images
response = ChatCompletion.create(
    model="openai/gpt-4-vision",
    messages=[
        {"role": "user", "content": [
            {"type": "text", "text": "What's in this image?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}}
        ]}
    ]
)

# Multi-modal with audio
response = ChatCompletion.create(
    model="anthropic/claude-3-sonnet",
    messages=[
        {"role": "user", "content": [
            {"type": "text", "text": "Transcribe and analyze this audio clip"},
            {"type": "audio_url", "audio_url": {"url": "https://example.com/audio.mp3"}}
        ]}
    ]
)

Completions

from onellm import Completion

response = Completion.create(
    model="groq/llama3-70b",
    prompt="Once upon a time",
    max_tokens=100
)

Embeddings

from onellm import Embedding

response = Embedding.create(
    model="openai/text-embedding-3-small",
    input="The quick brown fox jumps over the lazy dog"
)
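
The returned vectors can be used directly for similarity comparisons. A minimal sketch, assuming the response mirrors OpenAI's layout (one item per input under data) and that list inputs are accepted as in OpenAI's API:

import math

from onellm import Embedding

response = Embedding.create(
    model="openai/text-embedding-3-small",
    input=["The quick brown fox", "A fast auburn fox"]
)

# Assumes an OpenAI-style response layout: response.data[i]["embedding"]
vec_a = response.data[0]["embedding"]
vec_b = response.data[1]["embedding"]

# Cosine similarity between the two embeddings
dot = sum(a * b for a, b in zip(vec_a, vec_b))
norms = math.sqrt(sum(a * a for a in vec_a)) * math.sqrt(sum(b * b for b in vec_b))
print(dot / norms)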

🛠️ Advanced Features

Fallback Chains for Enhanced Reliability

OneLLM includes built-in fallback support to handle API errors gracefully:

response = ChatCompletion.create(
    model="openai/gpt-4",
    messages=[{"role": "user", "content": "Hello, how are you?"}],
    fallback_models=[
        "anthropic/claude-3-haiku",  # Try Claude if GPT-4 fails
        "openai/gpt-3.5-turbo"       # Try GPT-3.5 if Claude fails
    ]
)

If the primary model fails due to service unavailability, rate limiting, or other retriable errors, OneLLM automatically tries the fallback models in sequence.
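
If the primary model and every fallback fail, the call ultimately surfaces an error like any other failed request, so normal exception handling applies. A minimal sketch (using a broad except, since the specific exception classes are not covered here):

from onellm import ChatCompletion

try:
    response = ChatCompletion.create(
        model="openai/gpt-4",
        messages=[{"role": "user", "content": "Hello, how are you?"}],
        fallback_models=["anthropic/claude-3-haiku", "openai/gpt-3.5-turbo"]
    )
    print(response.choices[0].message["content"])
except Exception as exc:  # narrow this to OneLLM's error types in real code
    # Reached only after the primary model and all fallbacks have failed
    print(f"All models failed: {exc}")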

Automatic Retries

For transient errors, you can configure OneLLM to retry the same model multiple times before falling back to alternatives:

response = ChatCompletion.create(
    model="openai/gpt-4",
    messages=[{"role": "user", "content": "Hello, how are you?"}],
    retries=3,  # Will try the same model up to 3 additional times if it fails
    fallback_models=["anthropic/claude-3-haiku", "openai/gpt-3.5-turbo"]
)

This is implemented using the fallback mechanism under the hood, making it both powerful and efficient.

Fallback Chains + Automatic Retries Architecture

---
config:
  look: handDrawn
  theme: mc
  themeVariables:
    background: 'transparent'
    primaryColor: '#fff0'
    secondaryColor: 'transparent'
    tertiaryColor: 'transparent'
    mainBkg: 'transparent'

  flowchart:
    layout: fixed
---
flowchart TD
    START(["Client Request"]) --> REQUEST["Chat/Completion Request"]
    REQUEST --> PRIMARY["Primary Model<br>e.g., openai/gpt-4"]

    PRIMARY --> API_CHECK{"API<br>Available?"}
    API_CHECK -->|Yes| MODEL_CHECK{"Model<br>Available?"}
    MODEL_CHECK -->|Yes| QUOTA_CHECK{"Quota/Rate<br>Limits OK?"}
    QUOTA_CHECK -->|Yes| SUCCESS["Successful Response"]
    SUCCESS --> RESPONSE(["Return to Client"])

    API_CHECK -->|No| RETRY_DECISION{"Retry<br>Count < Max?"}
    MODEL_CHECK -->|No| RETRY_DECISION
    QUOTA_CHECK -->|No| RETRY_DECISION

    RETRY_DECISION -->|Yes| RETRY["Retry with Delay<br>(Same Model)"]
    RETRY --> PRIMARY

    RETRY_DECISION -->|No| FALLBACK_CHECK{"Fallbacks<br>Available?"}

    FALLBACK_CHECK -->|Yes| FALLBACK_MODEL["Next Fallback Model<br>e.g., anthropic/claude-3-haiku"]
    FALLBACK_MODEL --> FALLBACK_TRY["Try Fallback"]
    FALLBACK_TRY --> FALLBACK_API_CHECK{"API<br>Available?"}

    FALLBACK_API_CHECK -->|Yes| FALLBACK_SUCCESS["Successful Response"]
    FALLBACK_SUCCESS --> RESPONSE

    FALLBACK_API_CHECK -->|No| NEXT_FALLBACK{"More<br>Fallbacks?"}
    NEXT_FALLBACK -->|Yes| FALLBACK_MODEL
    NEXT_FALLBACK -->|No| ERROR["Error Response"]

    FALLBACK_CHECK -->|No| ERROR
    ERROR --> RESPONSE

JSON Mode for Structured Outputs

For applications that require structured data, OneLLM supports JSON mode with compatible providers:

response = ChatCompletion.create(
    model="openai/gpt-4o",
    messages=[
        {"role": "user", "content": "List the top 3 programming languages with their key features"}
    ],
    response_format={"type": "json_object"}  # Request JSON output
)

# The response contains valid, parseable JSON
json_response = response.choices[0].message["content"]
print(json_response)  # Structured JSON data

# Parse it with standard libraries
import json
structured_data = json.loads(json_response)

When using providers that don't natively support JSON mode, OneLLM automatically adds system instructions requesting JSON-formatted responses.

Asynchronous Support

Both synchronous and asynchronous APIs are available:

# Synchronous
response = ChatCompletion.create(
    model="openai/gpt-4",
    messages=[{"role": "user", "content": "Hello"}]
)

# Asynchronous with fallbacks and retries
response = await ChatCompletion.acreate(
    model="openai/gpt-4",
    messages=[{"role": "user", "content": "Hello"}],
    retries=2,
    fallback_models=["anthropic/claude-3-opus"]
)
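
Since acreate is a coroutine, it must be awaited inside an async function. The sketch below wraps it in one and fans out several prompts concurrently with asyncio.gather:

import asyncio

from onellm import ChatCompletion


async def ask(prompt: str) -> str:
    response = await ChatCompletion.acreate(
        model="openai/gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        fallback_models=["anthropic/claude-3-haiku"]
    )
    return response.choices[0].message["content"]


async def main():
    # Issue several requests concurrently
    answers = await asyncio.gather(
        ask("Summarize the theory of relativity in one sentence."),
        ask("Name three common uses of text embeddings."),
    )
    for answer in answers:
        print(answer)


asyncio.run(main())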

🔄 Migration from OpenAI

OneLLM provides multiple ways to migrate from the OpenAI client, including a fully compatible client interface:

Option 1: Complete Drop-in Replacement (Identical Interface)

OneLLM is a complete drop-in replacement for the OpenAI client with the OpenAI library included as a default dependency:

# Before
from openai import OpenAI
client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Hello world"}]
)

# After - 100% identical client interface
from onellm import OpenAI  # or Client
client = OpenAI()            # completely compatible with OpenAI's client
response = client.chat.completions.create(
    model="gpt-4",  # automatically adds "openai/" prefix when needed
    messages=[{"role": "user", "content": "Hello world"}]
)

Option 2: Streamlined Direct API (Fewer Lines)

# Before
from openai import OpenAI
client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Hello world"}]
)

# After - more concise
from onellm import ChatCompletion
response = ChatCompletion.create(
    model="openai/gpt-4",  # explicitly using provider prefix
    messages=[{"role": "user", "content": "Hello world"}]
)

Option 3: Model Fallback (Enhanced Reliability)

# Adding fallback options with ChatCompletion
from onellm import ChatCompletion
response = ChatCompletion.create(
    model="openai/gpt-4",
    messages=[{"role": "user", "content": "Hello world"}],
    fallback_models=[
        "anthropic/claude-3-haiku",
        "openai/gpt-3.5-turbo"
    ],
    # optional config
    fallback_config={
        "log_fallbacks": True
    }
)

# Using retries with fallbacks for enhanced reliability
from onellm import ChatCompletion
response = ChatCompletion.create(
    model="openai/gpt-4",
    messages=[{"role": "user", "content": "Hello world"}],
    retries=3,  # Will retry gpt-4 up to 3 additional times before using fallbacks
    fallback_models=[
        "anthropic/claude-3-haiku",
        "openai/gpt-3.5-turbo"
    ]
)

# Using fallback with the client interface
from onellm import OpenAI
client = OpenAI()
response = client.chat.completions.create(
    model="openai/gpt-4",
    messages=[{"role": "user", "content": "Hello world"}],
    fallback_models=[
        "anthropic/claude-3-opus",
        "groq/llama3-70b"
    ]
)

🏷️ Model Naming Convention

Models are specified using a provider prefix to clearly identify the source:

| Provider | Format | Example |
|----------|--------|---------|
| OpenAI | openai/{model} | openai/gpt-4 |
| Google | google/{model} | google/gemini-pro |
| Anthropic | anthropic/{model} | anthropic/claude-3-opus |
| Groq | groq/{model} | groq/llama3-70b |
| Mistral | mistral/{model} | mistral/mistral-large |
| Ollama | ollama/{model}@host:port | ollama/llama3:8b@localhost:11434 |
| llama.cpp | llama_cpp/{model.gguf} | llama_cpp/llama-3-8b-q4_K_M.gguf |
| XAI (Grok) | xai/{model} | xai/grok-beta |
| Cohere | cohere/{model} | cohere/command-r-plus |
| AWS Bedrock | bedrock/{model} | bedrock/claude-3-5-sonnet |
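
For the local providers, connection details ride along in the model string. A minimal sketch using the ollama/{model}@host:port format from the table above (assuming an Ollama server is running locally and the llama3:8b model has been pulled):

from onellm import ChatCompletion

# Host and port of the local Ollama server are part of the model string
response = ChatCompletion.create(
    model="ollama/llama3:8b@localhost:11434",
    messages=[{"role": "user", "content": "Give me one fun fact about llamas."}]
)

print(response.choices[0].message["content"])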

⚙️ Configuration

OneLLM can be configured through environment variables or at runtime:

# Environment variables
# OPENAI_API_KEY=sk-...
# ANTHROPIC_API_KEY=sk-...

# Runtime configuration
import onellm

onellm.openai_api_key = "sk-..."  # OpenAI API key
onellm.anthropic_api_key = "sk-..."  # Anthropic API key

# Configure fallback behavior
onellm.config.fallback = {
    "enabled": True,
    "default_chains": {
        "chat": ["openai/gpt-4", "anthropic/claude-3-opus", "groq/llama3-70b"],
        "embedding": ["openai/text-embedding-3-small", "cohere/embed-english"]
    },
    "retry_delay": 1.0,
    "max_retries": 3
}
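
Keys can also be supplied purely through the environment, which keeps them out of source code. A minimal sketch setting them programmatically (the variable names match those shown above; in practice they would come from a shell profile or secrets manager):

import os

# Environment variables are picked up automatically by OneLLM;
# set them before making any requests
os.environ["OPENAI_API_KEY"] = "sk-..."
os.environ["ANTHROPIC_API_KEY"] = "sk-..."

from onellm import ChatCompletion

response = ChatCompletion.create(
    model="openai/gpt-4o-mini",
    messages=[{"role": "user", "content": "Hello"}]
)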

🧪 Test Coverage

OneLLM maintains comprehensive test coverage to ensure reliability and compatibility:

| Coverage Metric | Value |
|-----------------|-------|
| Overall Package Coverage | 96% (357 passing tests) |
| Modules with 100% Coverage | 16 modules |
| Minimum Module Coverage | 90% |
| Provider Implementations | 93-95% coverage |

This extensive test coverage ensures reliable operation across all supported providers and models.


📖 Documentation

OneLLM uses a code-first documentation approach:

  1. Examples Directory: The examples/ directory contains well-documented example scripts that demonstrate all key features of the library:

    • Each example includes detailed frontmatter explaining its purpose and relationship to the codebase
    • Examples range from basic usage to advanced features like fallbacks, retries, and multi-modal capabilities
    • Running the examples is the fastest way to understand the library's capabilities
  2. Code Docstrings: All public APIs, classes, and methods have comprehensive docstrings:

    • Detailed parameter descriptions
    • Return value documentation
    • Exception information
    • Usage examples
  3. Type Annotations: The codebase uses Python type annotations throughout:

    • Provides IDE autocompletion support
    • Makes argument requirements clear
    • Enables static type checking

This approach keeps documentation tightly coupled with code, ensuring it stays up-to-date as the library evolves. To get started, we recommend examining the examples that match your use case.


🤝 Call for Contributions

We're building something amazing with OneLLM, and we'd love your help to make it even better! There are many ways to contribute:

  • Code contributions: Add new providers, enhance existing ones, or improve core functionality
  • Bug reports: Help us identify and fix issues
  • Documentation: Improve examples, clarify API usage, or fix typos
  • Feature requests: Share your ideas for making OneLLM more powerful
  • Testing: Help ensure reliability across different environments and use cases

Getting started is easy:

  1. Check out our open issues for good first contributions
  2. Fork the repository and create a feature branch
  3. Make your improvements and run tests
  4. Submit a pull request with a clear description of your changes

For complete details on contribution guidelines, code style, provider development, testing standards, and our contributor license agreement, please read our CONTRIBUTING.md.


📄 License

OneLLM is licensed under the Apache-2.0 license.

Why Apache 2.0?

I chose the Apache 2.0 license to make OneLLM easy to adopt, integrate, and build on. This license:

  • Allows you to freely use, modify, and distribute the library in both open-source and proprietary software
  • Encourages wide adoption by individuals, startups, and enterprises alike
  • Includes a clear patent grant for legal peace of mind
  • Enables flexible usage without the complexity of copyleft restrictions

Whether you’re building internal tools or commercial applications, Apache 2.0 gives you the freedom to use OneLLM however you need – no strings attached.


🌟 Acknowledgements

This project stands on the shoulders of many excellent open-source projects and would not be possible without the collaborative spirit of the developer community.

Special thanks to all the LLM providers whose APIs this library integrates with, and to the early adopters who tested the library and provided crucial feedback.

🙏 Thank You

Thank you for trying out OneLLM! Your interest and support mean a lot to this project. Whether you're using it in your applications, experimenting with different LLM providers, or just exploring the capabilities, your participation helps drive this project forward.

If you find OneLLM useful in your work:

  • Consider starring the repository on GitHub
  • Share your experiences or use cases with the community
  • Let us know how we can make it better for your needs

Thank you for your support!

~ Ran Aroussi
𝕏 / @aroussi
