Description
LocalAI version:
latest (quay.io/go-skynet/local-ai:master-cublas-cuda12-ffmpeg)
Environment, CPU architecture, OS, and Version:
Linux srv3 5.19.0-1010-nvidia-lowlatency #10-Ubuntu SMP PREEMPT_DYNAMIC Wed Apr 26 00:40:27 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Describe the bug
Can't set dtype='half' for the vLLM backend, neither through the model .yaml nor through docker run arguments.
To Reproduce
Create vllm.yaml inside the models folder:
name: vllm
backend: vllm
parameters:
  model: "TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ"
# Quantization method (optional)
quantization: "gptq"
# GPU memory utilization (vLLM default is 0.9, i.e. 90%)
gpu_memory_utilization: 0.7
# Trust remote code from Hugging Face
trust_remote_code: true
# Uncomment to enable eager execution
# enforce_eager: true
# Uncomment to set the CPU swap space per GPU (in GiB)
# swap_space: 2
# Maximum length of a sequence (prompt + output)
max_model_len: 32000
tensor_parallel_size: 8
cuda: true
Start LocalAI:
sudo docker run --rm -ti --gpus all -p 8080:8080 -e DEBUG=true -e MODELS_PATH=/models -e THREADS=1 -v /opt/localai/models:/models --name localai quay.io/go-skynet/local-ai:master-cublas-cuda12-ffmpeg
Run inference:
curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "vllm",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Who won the world series in 2020?"}
]
}'
Result:
{"error":{"code":500,"message":"could not load model (no success): Unexpected err=ValueError('Bfloat16 is only supported on GPUs with compute capability of at least 8.0. Your NVIDIA GeForce RTX 2080 Ti GPU has compute capability 7.5. You can use float16 instead by explicitly setting the
dtype flag in CLI, for example: --dtype=half.'), type(err)=\u003cclass 'ValueError'\u003e","type":""}}
Expected behavior
I should be able to set dtype='half' (float16) for the vLLM backend, e.g. from the model .yaml, as sketched below.
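Something along these lines would work for me; note that the dtype key here is hypothetical, it is exactly the option this issue asks LocalAI to accept and forward to vLLM:

name: vllm
backend: vllm
parameters:
  model: "TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ"
quantization: "gptq"
# Hypothetical option requested in this issue, passed through to vLLM's dtype setting
dtype: "float16"   # or "half"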
Logs
Additional context
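For comparison, when running vLLM's OpenAI-compatible server directly (outside LocalAI), the data type is set with the CLI flag the error message refers to. This is a sketch of a plain vLLM invocation, not a LocalAI command, and the exact flags may differ between vLLM versions:

python -m vllm.entrypoints.openai.api_server \
  --model TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ \
  --quantization gptq \
  --dtype half

The request here is for LocalAI's vLLM backend to expose an equivalent option, since GPUs with compute capability below 8.0 (such as the RTX 2080 Ti) cannot use the bfloat16 default.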