A work-in-progress toolkit for Large Language Model (LLM) profiling and inference acceleration.
- llama.cpp profiling
- GPU profiling metrics
Clone llama.cpp. We tested with llama.cpp version b1752; you can check out the same version via `git checkout b1752`.
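For example, assuming the upstream `ggerganov/llama.cpp` repository:

```sh
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
git checkout b1752
```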
Next, set up the profiler inside your llama.cpp clone:

- Copy the files in the xllm profiler into your local clone of llama.cpp.
- In the Makefile, change the CUDA path to point to your local CUDA installation.
- Compile and install the libkineto dependency, then update the Makefile to point to it (see the build sketch after this list). If you are running on TAMU HPRC, please feel free to use our customized kineto.
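A minimal sketch of building libkineto from source, assuming the upstream `pytorch/kineto` repository and its standard CMake build (the install prefix is illustrative):

```sh
git clone --recursive https://github.com/pytorch/kineto.git
cd kineto/libkineto
mkdir build && cd build
cmake -DCMAKE_INSTALL_PREFIX=/usr/local ..  # illustrative prefix; match your Makefile
make -j"$(nproc)"
sudo make install
```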
To start profiling, simply run `kineto_profiler` under `llama.cpp/profiler` with the same arguments as `./main` in llama.cpp.
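For example (the model path and generation flags below are illustrative; `kineto_profiler` accepts whatever arguments `./main` accepts):

```sh
cd llama.cpp/profiler
./kineto_profiler -m ../models/mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf \
    -p "Explain the attention mechanism." -n 128
```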
We leverage Holistic Trace Analysis (HTA) to provide insights into llama.cpp LLM inference. We provide Jupyter notebooks with initial trace analyses to play with.
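A minimal sketch of loading a collected trace with HTA outside the notebooks; the `traces/` directory is a placeholder for wherever `kineto_profiler` wrote its output:

```python
from hta.trace_analysis import TraceAnalysis

# Point HTA at the directory containing the collected Kineto traces
# ("traces/" is a placeholder path).
analyzer = TraceAnalysis(trace_dir="traces/")

# Breakdown of GPU time into compute, non-compute, and idle time.
temporal_df = analyzer.get_temporal_breakdown()

# Per-kernel breakdown: which CUDA kernels dominate the run.
kernel_type_df, kernel_df = analyzer.get_gpu_kernel_breakdown()
print(kernel_type_df.head())
```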
Using the CUDA and CUPTI APIs, we can collect all available GPU profiling metrics with `tensor_usage_collector`.
Run `make` under `xllm/profiling/profiler` (please update the Makefile to the CUDA version matching the one displayed by `nvidia-smi`, especially when you have multiple CUDA installations), then run `tensor_usage_collector`. All available metrics will be collected into `tensor_usage_results.csv` in the same directory.
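The full sequence (the CUDA path in the comment is illustrative):

```sh
cd xllm/profiling/profiler
# In the Makefile, point CUDA at the toolkit matching `nvidia-smi`,
# e.g. /usr/local/cuda-12.2 (illustrative path).
make
./tensor_usage_collector   # writes tensor_usage_results.csv next to the binary
```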
Results can be analyzed with the Jupyter notebooks under this folder.
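As a quick sanity check outside the notebooks, the CSV can be inspected with pandas; the exact columns depend on which metrics your GPU exposes:

```python
import pandas as pd

# Load the collected metrics; column names depend on the
# CUPTI metrics available on your GPU.
df = pd.read_csv("tensor_usage_results.csv")
print(df.head())
print(df.describe())
```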
- Mixtral-8x7B-Instruct: https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1