OpenVINO™ GenAI is a library of the most popular Generative AI model pipelines, optimized execution methods, and samples that run on top of the highly performant OpenVINO Runtime.
The library is optimized for resource consumption and friendly to PC and laptop execution. It requires no external dependencies to run generative models, as it already includes all the core functionality (e.g. tokenization via openvino-tokenizers).
- Introduction to OpenVINO™ GenAI
- Install OpenVINO™ GenAI
- Build OpenVINO™ GenAI
- Supported Models
- Model Preparation Guide
Explore blogs for your first hands-on experience with OpenVINO GenAI, or follow the quick start steps below:
- Install OpenVINO GenAI from PyPI:
```sh
pip install openvino-genai
```
- Obtain a model, e.g. export a model from Hugging Face to OpenVINO IR format (see the Model Preparation Guide for more details):
```sh
optimum-cli export openvino --model TinyLlama/TinyLlama-1.1B-Chat-v1.0 --weight-format int4 --trust-remote-code TinyLlama_1_1b_v1_ov
```
- Run inference:
```python
import openvino_genai as ov_genai

pipe = ov_genai.LLMPipeline("TinyLlama_1_1b_v1_ov", "CPU")  # Use CPU or GPU as the device without any other code change
print(pipe.generate("What is OpenVINO?", max_new_tokens=100))
```
The OpenVINO™ GenAI library provides lightweight C++ and Python APIs to run the following generative AI scenarios:
- Text generation using Large Language Models (LLMs) - Chat with local Llama, Phi, Qwen and other models
- Image processing using Visual Language Models (VLMs) - Analyze images/videos with LLaVa, MiniCPM-V and other models
- Image generation using Diffusers - Generate images with Stable Diffusion & Flux models (see the sketch after this list)
- Speech recognition using Whisper - Convert speech to text using Whisper models
- Speech generation using SpeechT5 - Convert text to speech using SpeechT5 TTS models
- Semantic search using Text Embedding - Compute embeddings for documents and queries to enable efficient retrieval in RAG workflows
- Text Rerank for Retrieval-Augmented Generation (RAG) - Score documents by their relevance to a query to improve retrieval quality in RAG workflows
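Each scenario above is exposed through a dedicated pipeline class with an interface similar to the `LLMPipeline` shown earlier. As an example, here is a minimal image generation sketch, assuming a Stable Diffusion model has already been exported to OpenVINO IR in a local `stable_diffusion_ov` directory (the directory name, prompt, and generation parameters are illustrative placeholders):
```python
import openvino_genai as ov_genai
from PIL import Image  # pip install pillow

# Load a text-to-image pipeline from an exported OpenVINO IR model directory
# ("stable_diffusion_ov" is a placeholder path).
pipe = ov_genai.Text2ImagePipeline("stable_diffusion_ov", "CPU")

# Generate one image; the result is an OpenVINO tensor holding RGB pixel data.
image_tensor = pipe.generate(
    "A sunset over a mountain lake, oil painting",
    width=512,
    height=512,
    num_inference_steps=20,
)

# Convert the first (and only) image in the batch to a PIL image and save it.
Image.fromarray(image_tensor.data[0]).save("image.png")
```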
The library efficiently supports LoRA adapters for the text and image generation scenarios (a usage sketch follows this list):
- Load multiple adapters per model
- Select active adapters for every generation
- Mix multiple adapters with coefficients via alpha blending
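A minimal sketch of adapter usage with the text generation pipeline, following the pattern of the library's LoRA samples (the model directory, adapter file, and the 0.75 alpha value are illustrative placeholders):
```python
import openvino_genai as ov_genai

# Load a LoRA adapter from a .safetensors file (placeholder path).
adapter = ov_genai.Adapter("adapter.safetensors")

# Register the adapter when creating the pipeline ("model_dir" is a placeholder).
pipe = ov_genai.LLMPipeline("model_dir", "CPU", adapters=ov_genai.AdapterConfig(adapter))

# Generate with the adapter active, blended in with alpha = 0.75.
print(pipe.generate("What is OpenVINO?", max_new_tokens=100,
                    adapters=ov_genai.AdapterConfig(adapter, 0.75)))

# Generate with no active adapters by passing an empty AdapterConfig.
print(pipe.generate("What is OpenVINO?", max_new_tokens=100,
                    adapters=ov_genai.AdapterConfig()))
```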
All scenarios run on top of OpenVINO Runtime, which supports inference on CPU, GPU and NPU. See here for the platform support matrix.
The OpenVINO™ GenAI library provides a transparent way to use state-of-the-art generation optimizations:
- Speculative decoding, which employs two models of different sizes and uses the large model to periodically verify and correct the output of the small draft model (a sketch follows this list). See here for a more detailed overview
- KVCache token eviction algorithm, which reduces the memory footprint of the KVCache by pruning less impactful tokens
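A minimal sketch of speculative decoding with the text generation pipeline, following the library's speculative decoding sample (the two model directory names are placeholders, and the choice of 5 assistant tokens is illustrative):
```python
import openvino_genai as ov_genai

# num_assistant_tokens controls how many tokens the small draft model proposes
# before the large main model verifies them.
config = ov_genai.GenerationConfig()
config.max_new_tokens = 100
config.num_assistant_tokens = 5

# Wrap the small model as a draft model and attach it to the main pipeline
# ("draft_model_dir" and "main_model_dir" are placeholder paths).
draft = ov_genai.draft_model("draft_model_dir", "CPU")
pipe = ov_genai.LLMPipeline("main_model_dir", "CPU", draft_model=draft)

print(pipe.generate("What is OpenVINO?", config))
```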
Additionally, the OpenVINO™ GenAI library implements a continuous batching approach for using OpenVINO within LLM serving. The continuous batching functionality can be used in LLM serving frameworks and supports the following features (a configuration sketch follows the list):
- Prefix caching, which internally caches prompt fragments of previous generation requests together with the corresponding KVCache entries and reuses them when a new query repeats the same prefix
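A minimal sketch of enabling prefix caching in the continuous batching pipeline (the model directory, cache size, and prompts are illustrative placeholders):
```python
import openvino_genai as ov_genai

# Configure the continuous batching scheduler; enable_prefix_caching lets the
# pipeline reuse KVCache entries when a new request repeats a known prefix.
scheduler_config = ov_genai.SchedulerConfig()
scheduler_config.cache_size = 2  # total KVCache size in GB (illustrative value)
scheduler_config.enable_prefix_caching = True

# "model_dir" is a placeholder path to an exported OpenVINO IR model.
pipe = ov_genai.ContinuousBatchingPipeline("model_dir", scheduler_config, "CPU")

config = ov_genai.GenerationConfig()
config.max_new_tokens = 100

# Batched generation: one GenerationConfig per prompt.
prompts = ["What is OpenVINO?", "What is OpenVINO Runtime?"]
for result in pipe.generate(prompts, [config] * len(prompts)):
    print(result.m_generation_ids[0])  # first generated sequence per prompt
```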
Continuous batching functionality is used within OpenVINO Model Server (OVMS) to serve LLMs; see here for more details.
The OpenVINO™ GenAI repository is licensed under Apache License Version 2.0. By contributing to the project, you agree to the license and copyright terms therein and release your contribution under these terms.
