Optimum-NVIDIA delivers the best inference performance on the NVIDIA platform through Hugging Face. Run LLaMA 2 at 1,200 tokens/second (up to 28x faster than stock transformers) by changing just a single line in your existing transformers code.
You can use a Docker container to try Optimum-NVIDIA today. Images are available on the Hugging Face Docker Hub.
```bash
docker pull huggingface/optimum-nvidia
```

An Optimum-NVIDIA package that can be installed with pip will be made available soon.
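Once pulled, you can start the container with GPU access. A minimal sketch, assuming the NVIDIA Container Toolkit is installed on the host (the flags below are standard Docker options, not specific to this image):

```bash
# Launch an interactive session with all GPUs exposed to the container;
# --gpus all requires the NVIDIA Container Toolkit on the host.
docker run -it --rm --gpus all huggingface/optimum-nvidia
```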
Hugging Face pipelines provide a simple yet powerful abstraction to quickly set up inference. If you already have a pipeline from transformers, you can unlock the performance benefits of Optimum-NVIDIA by just changing one line.
```diff
- from transformers.pipelines import pipeline
+ from optimum.nvidia.pipelines import pipeline

pipe = pipeline('text-generation', 'meta-llama/Llama-2-7b-chat-hf', use_fp8=True)
pipe("Describe a real-world application of AI in sustainable energy.")
```

If you want control over advanced features like quantization and token selection strategies, we recommend using the generate() API. Just like with pipelines, switching from existing transformers code is super simple.
```diff
- from transformers import LlamaForCausalLM
+ from optimum.nvidia import LlamaForCausalLM
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf", padding_side="left")

model = LlamaForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf",
+   use_fp8=True,
)

model_inputs = tokenizer(
    ["How is autonomous vehicle technology transforming the future of transportation and urban planning?"],
    return_tensors="pt",
).to("cuda")

generated_ids = model.generate(
    **model_inputs,
    top_k=40,
    top_p=0.7,
    repetition_penalty=10,
)

tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
```

To learn more about text generation with LLMs, check out this guide!
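Because the tokenizer is created with padding_side="left", batching several prompts into a single generate() call also works. A minimal sketch, assuming generate() mirrors the transformers API and reusing the EOS token as the pad token (a common convention, not something Optimum-NVIDIA requires):

```python
prompts = [
    "What is FP8 quantization?",
    "Summarize the benefits of reduced-precision inference in one sentence.",
]

# Assumption: the model has no dedicated pad token, so we reuse EOS.
tokenizer.pad_token = tokenizer.eos_token

# Left padding keeps each prompt flush against its generated continuation.
model_inputs = tokenizer(prompts, padding=True, return_tensors="pt").to("cuda")

generated_ids = model.generate(**model_inputs, max_new_tokens=128)
for text in tokenizer.batch_decode(generated_ids, skip_special_tokens=True):
    print(text)
```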
We test Optimum-NVIDIA on RTX 4090, L40S, and H100 Tensor Core GPUs, though it is expected to work on any GPU based on the following architectures:
- Volta
- Turing
- Ampere
- Hopper
- Ada Lovelace

Note that FP8 support is only available on GPUs based on the Hopper and Ada Lovelace architectures.
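If you're unsure which bucket your GPU falls into, you can check its CUDA compute capability with plain PyTorch: Ada Lovelace parts report (8, 9) and Hopper parts report (9, 0), and FP8 requires (8, 9) or newer. A small sketch (ordinary PyTorch, not part of the Optimum-NVIDIA API):

```python
import torch

# Ada Lovelace GPUs report compute capability (8, 9); Hopper reports (9, 0).
# FP8 kernels require compute capability (8, 9) or newer.
major, minor = torch.cuda.get_device_capability()
print(f"Compute capability {major}.{minor} -> FP8 supported: {(major, minor) >= (8, 9)}")
```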
Optimum-NVIDIA works on Linux and will support Windows soon.
Optimum-NVIDIA currently accelerates text-generation with LlamaForCausalLM, and we are actively working to expand support to include more model architectures and tasks.
Check out our Contributing Guide to get involved!