
Can't change dtype inside vLLM settings #1863

Open
@telekoteko

Description


LocalAI version:
latest

Environment, CPU architecture, OS, and Version:
Linux srv3 5.19.0-1010-nvidia-lowlatency #10-Ubuntu SMP PREEMPT_DYNAMIC Wed Apr 26 00:40:27 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

Describe the bug
Can't set dtype='half' for the vLLM backend, either through the model's .yaml config or through docker run arguments.

To Reproduce
Create vllm.yaml inside the models folder:


name: vllm
backend: vllm
parameters:
  model: "TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ"

# Quantization method (optional)
quantization: "gptq"
# Limit GPU memory utilization (vLLM default is 0.9, i.e. 90%)
gpu_memory_utilization: 0.7
# Trust remote code from Hugging Face
trust_remote_code: true
# Uncomment to enable eager execution
# enforce_eager: true
# Uncomment to specify the size of the CPU swap space per GPU (in GiB)
# swap_space: 2
# Maximum length of a sequence (including prompt and output)
max_model_len: 32000
tensor-parallel-size: 8
cuda: true

Start LocalAI
sudo docker run --rm -ti --gpus all -p 8080:8080 -e DEBUG=true -e MODELS_PATH=/models -e THREADS=1 -v /opt/localai/models:/models --name localai quay.io/go-skynet/local-ai:master-cublas-cuda12-ffmpeg

Run inference

curl http://localhost:8080/v1/chat/completions     -H "Content-Type: application/json"     -d '{
        "model": "vllm",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Who won the world series in 2020?"}
        ]
    }'

Result:
{"error":{"code":500,"message":"could not load model (no success): Unexpected err=ValueError('Bfloat16 is only supported on GPUs with compute capability of at least 8.0. Your NVIDIA GeForce RTX 2080 Ti GPU has compute capability 7.5. You can use float16 instead by explicitly setting the `dtype` flag in CLI, for example: --dtype=half.'), type(err)=<class 'ValueError'>","type":""}}

Expected behavior
I should be able to set dtype='half' for the vLLM backend, so the model loads as float16 on GPUs with compute capability below 8.0.
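For illustration, a config that explicitly requests float16 might look like the sketch below. Whether LocalAI's vLLM backend accepts and forwards a top-level `dtype` field to vLLM's `--dtype` engine argument is exactly what this issue is about, so treat the field name and placement as an assumption, not a documented option:

```yaml
name: vllm
backend: vllm
parameters:
  model: "TheBloke/Mixtral-8x7B-Instruct-v0.1-GPTQ"
# Assumed field: would need to be mapped by LocalAI to vLLM's --dtype=half
dtype: "half"
quantization: "gptq"
gpu_memory_utilization: 0.7
trust_remote_code: true
max_model_len: 32000
```

If no such field is wired through, the backend presumably falls back to the model's default torch_dtype (bfloat16 for Mixtral), which triggers the ValueError above on compute capability 7.5 GPUs.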
