Scripts for vllm-model-bash efforts
Usage:

```bash
bash vllm_bench.sh config.yaml
```

This harness automates:

- Launching and monitoring `vllm serve` for multiple models
- Running `vllm bench serve` benchmarks with per-model overrides
- Collecting Nsight Systems (`nsys`) profiling traces
- Generating structured results and per-model summaries
Each benchmark run:
- Launches a vLLM server based on the YAML config
- Runs concurrency sweeps and collects latency/throughput metrics
- Optionally profiles GPU activity via Nsight Systems (`nsys`) and/or the PyTorch Profiler
- Produces organized outputs under a specified directory
Ideal for performance characterization, MLPerf inference testing, and multi-level GPU profiling at scale.
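Conceptually, each run wraps a flow like the sketch below. This is a simplified illustration (model name, port, and flags are placeholders); the real script adds monitoring, per-model overrides, and profiling hooks:

```bash
# Launch a vLLM server in the background (model and port are illustrative).
vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8000 &
SERVER_PID=$!

# Poll vLLM's health endpoint until the server is ready to accept requests.
until curl -sf http://localhost:8000/health > /dev/null; do
  sleep 5
done

# Run the benchmark sweep against the live server.
vllm bench serve --model meta-llama/Llama-3.1-8B-Instruct --port 8000

# Tear the server down once the sweep finishes.
kill "$SERVER_PID"
```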
Install these packages:
```bash
sudo apt-get install jq curl -y
pip install yq
```

For GPU profiling capabilities:
- Nsight Systems: System-wide performance analysis, CUDA graph tracing
- Nsight Compute: Detailed kernel-level analysis
- PyTorch Profiler: Python/PyTorch-level CPU and GPU profiling with memory tracking
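A quick way to confirm the profiling tools are present before a run (standard version checks, not part of the harness itself):

```bash
nsys --version                      # Nsight Systems CLI
ncu --version                       # Nsight Compute CLI
python -c "import torch.profiler"   # PyTorch Profiler ships with torch
```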
Nsight Systems captures system-wide GPU activity, CUDA graphs, and NVTX ranges. Example configuration:
```yaml
profiling:
  nsys_launch_args: "--trace=cuda,nvtx,osrt --cuda-graph-trace=node"
  nsys_start_args: "--force-overwrite=true --gpu-metrics-devices=cuda-visible"
```

Outputs: `.qdrep` files viewable in the Nsight Systems GUI
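For reference, those two argument strings correspond to an interactive `nsys` capture roughly like the following. This is an illustrative sketch of the equivalent manual workflow, not the harness's literal invocation:

```bash
# Launch the server under nsys injection; collection does not start yet.
nsys launch --trace=cuda,nvtx,osrt --cuda-graph-trace=node \
  vllm serve <model> --port 8000 &

# Begin capture around the benchmark window, then stop it.
nsys start --force-overwrite=true --gpu-metrics-devices=cuda-visible
vllm bench serve --model <model> --port 8000
nsys stop
```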
The PyTorch Profiler captures Python-level CPU/GPU activity, memory allocations, and operator traces. Example configuration:
```yaml
profiling:
  torch_profiler:
    enabled: true
    record_shapes: true    # Record tensor shapes
    profile_memory: true   # Track memory allocations
    with_stack: false      # Include Python stack traces
    with_flops: false      # Include FLOP estimates
```

Outputs:
- Chrome trace files (`.json`) - viewable in `chrome://tracing`
- PyTorch `.pt` trace files - loadable with `torch.load()`
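Since `jq` is already installed, a Chrome trace can be sanity-checked from the shell. The path below is illustrative; standard Chrome traces store their events under a top-level `traceEvents` key:

```bash
# Count the recorded profiler events in a Chrome trace (path is illustrative).
jq '.traceEvents | length' results/<model>/torch_profiler_trace.json
```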
You can enable both `nsys` and the PyTorch Profiler simultaneously:
```yaml
profile: true  # Enables nsys
profiling:
  nsys_launch_args: "--trace=cuda,nvtx,osrt --cuda-graph-trace=node"
  nsys_start_args: "--force-overwrite=true --gpu-metrics-devices=cuda-visible"
  torch_profiler:
    enabled: true
    record_shapes: true
    profile_memory: true
```
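With that saved as `config.yaml`, a fully profiled run and report inspection looks like the following (the report path is illustrative; the actual layout depends on the configured output directory):

```bash
bash vllm_bench.sh config.yaml

# Open the resulting Nsight Systems report in the GUI.
nsys-ui results/<model>/profile.qdrep
```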