
XLLM

A work-in-progress toolkit for Large Language Model (LLM) profiling and inference acceleration.

Roadmap

  • llama.cpp profiling
  • GPU profiling metrics

llama.cpp Profiling

Setup steps (a sketch of the commands follows this list):

  • Clone llama.cpp. We tested with llama.cpp version b1752; you can check out the same version via git checkout b1752.
  • Copy the files in the xllm profiler directory into your local clone of llama.cpp.
  • In the Makefile, change the CUDA path to point to your local install.
  • Compile and install the libkineto dependency. Feel free to use our customized kineto if you are running on TAMU HPRC.
  • Update the Makefile accordingly.
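
A minimal sketch of these steps, assuming the profiler files ship under xllm/profiler and that libkineto is built from the upstream pytorch/kineto repository (substitute our customized kineto on TAMU HPRC; check kineto's README for the exact build steps):

    # Clone llama.cpp and pin the tested version
    git clone https://github.com/ggerganov/llama.cpp.git
    cd llama.cpp
    git checkout b1752

    # Copy the xllm profiler files into the clone (source path assumed)
    cp -r /path/to/xllm/profiler ./profiler

    # Build and install libkineto (typical CMake flow)
    git clone --recursive https://github.com/pytorch/kineto.git
    cmake -S kineto/libkineto -B kineto-build
    cmake --build kineto-build
    sudo cmake --install kineto-build

    # Edit the Makefile so its CUDA paths match your local install, then build
    make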

To start profiling, simply run kineto_profiler under llama.cpp/profiler with the same arguments as ./main in llama.cpp.
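
For example, a run mirroring a typical ./main invocation might look like this (the model path is a placeholder; -m, -p, and -n are standard llama.cpp flags):

    cd profiler
    ./kineto_profiler -m ../models/mixtral-8x7b-instruct.Q4_K_M.gguf -p "Building a website can be done in" -n 128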

Trace Analysis

We leverage Holistic Trace Analysis (HTA) to provide insights into llama.cpp's LLM inference behavior. We provide Jupyter notebooks to play with for initial analyses of the tracing results.
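
As a sketch of the environment setup, assuming HTA's PyPI package name is HolisticTraceAnalysis:

    # Install HTA and Jupyter, then open the provided notebooks
    pip install HolisticTraceAnalysis jupyter
    jupyter notebook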

GPU Profiling Metrics

Building on the CUDA and CUPTI APIs, tensor_usage_collector collects all available GPU profiling metrics.

Update the Makefile under xllm/profiling/profiler to your desired CUDA version, matching the CUDA version displayed by nvidia-smi (this matters especially when multiple CUDA installations are present). Then run make, followed by tensor_usage_collector. All available metrics will be collected into tensor_usage_results.csv in the same directory.
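
A minimal sketch of a collection run (the CUDA path shown is an assumption; match it to what nvidia-smi reports):

    cd xllm/profiling/profiler
    # Check which CUDA version the driver reports
    nvidia-smi
    # Edit the Makefile so its CUDA path matches (e.g. /usr/local/cuda-12.1), then:
    make
    ./tensor_usage_collector
    # All available metrics are written next to the binary
    head tensor_usage_results.csv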

Results can be analyzed with the Jupyter notebooks in this folder.
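
As a quick sanity check before opening the notebooks, the CSV can be inspected directly from the shell:

    # Pretty-print the first rows of the collected metrics
    column -s, -t < tensor_usage_results.csv | head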

Models

Mixtral-8x7B-Instruct: https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1
