Description
LocalAI version:
latest (quay.io/go-skynet/local-ai:master-cublas-cuda12-ffmpeg)
Environment, CPU architecture, OS, and Version:
Linux srv3 5.19.0-1010-nvidia-lowlatency #10-Ubuntu SMP PREEMPT_DYNAMIC Wed Apr 26 00:40:27 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Describe the bug
Can't set dtype='half' for the vLLM backend, neither through the model .yaml nor through docker run arguments.
To Reproduce
Create vllm.yaml inside the models folder:
name: vllm
backend: vllm
parameters:
  model: "TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ"
# Quantization method (optional)
quantization: "gptq"
# GPU memory utilization (vLLM default is 0.9, i.e. 90%)
gpu_memory_utilization: 0.7
# Trust remote code from Hugging Face
trust_remote_code: true
# Uncomment to enable eager execution
# enforce_eager: true
# Uncomment to set the CPU swap space per GPU (in GiB)
# swap_space: 2
# Maximum length of a sequence (prompt + output)
max_model_len: 32000
tensor_parallel_size: 8
cuda: true
Start LocalAI:
sudo docker run --rm -ti --gpus all -p 8080:8080 -e DEBUG=true -e MODELS_PATH=/models -e THREADS=1 -v /opt/localai/models:/models --name localai quay.io/go-skynet/local-ai:master-cublas-cuda12-ffmpeg
Run inference:
curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "vllm",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Who won the world series in 2020?"}
]
}'
Result:
{"error":{"code":500,"message":"could not load model (no success): Unexpected err=ValueError('Bfloat16 is only supported on GPUs with compute capability of at least 8.0. Your NVIDIA GeForce RTX 2080 Ti GPU has compute capability 7.5. You can use float16 instead by explicitly setting the
dtype flag in CLI, for example: --dtype=half.'), type(err)=\u003cclass 'ValueError'\u003e","type":""}}
Expected behavior
I should be able to set dtype='half' (float16) for the vLLM backend, e.g. from the model .yaml, as sketched below.
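Something along these lines would work for me; note that the dtype key here is hypothetical, it is exactly the option this issue asks LocalAI to accept and forward to vLLM:

name: vllm
backend: vllm
parameters:
  model: "TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ"
quantization: "gptq"
# Hypothetical option requested in this issue, passed through to vLLM's dtype setting
dtype: "float16"   # or "half"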
Logs
Additional context
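For comparison, when running vLLM's OpenAI-compatible server directly (outside LocalAI), the data type is set with the CLI flag the error message refers to. This is a sketch of a plain vLLM invocation, not a LocalAI command, and the exact flags may differ between vLLM versions:

python -m vllm.entrypoints.openai.api_server \
  --model TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ \
  --quantization gptq \
  --dtype half

The request here is for LocalAI's vLLM backend to expose an equivalent option, since GPUs with compute capability below 8.0 (such as the RTX 2080 Ti) cannot use the bfloat16 default.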