Misc. bug: Illegal CUDA memory access in ggml_backend_cuda_cpy_tensor_async #13449

Open
lee-b opened this issue May 11, 2025 · 5 comments

@lee-b

lee-b commented May 11, 2025

Name and Version

ghcr.io/ggerganov/llama.cpp:server-cuda

docker.compose.image=sha256:f608f747701dc4df42f89cab23a6d6b556889f4454737395e988fd3a94e41b45

llama-cpp-chat-1 | build: 5332 (7c28a74) with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu

Operating systems

Linux

Which llama.cpp modules do you know to be affected?

llama-server

Command line

image: ghcr.io/ggerganov/llama.cpp:server-cuda

    command:
      - "-a"
      - "llama-cpp-chat"

      - "-a"
      - "llama-cpp-chat"

      - "-ctv"
      - "q4_0"

      - "-ctv"
      - "q4_0"

      - "--slot-save-path"
      - "/prompt-cache/slot-saves"

      - "-fa"
      - "-m"
      - "/models/mradermacher--Dolphin-Mistral-24B-Venice-Edition-i1-GGUF/Dolphin-Mistral-24B-Venice-Edition.i1-IQ4_XS.gguf"
      - "-np"
      - "2"
      - "-c"
      - "131072"
      - "-ngl"
      - "500"
      - "--cache-reuse"
      - "25"


    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: [ "0", "2", "3" ]
              capabilities: [gpu]

Problem description & steps to reproduce

I'm seeing quite frequent CUDA illegal memory access errors.

I've seen this with Scout and/or Maverick before too, but this particular instance occurred while running mradermacher's Dolphin-Mistral-24B-Venice-Edition-i1-GGUF/Dolphin-Mistral-24B-Venice-Edition.i1-IQ4_XS.gguf.

I believe I have ruled out a card/VRAM/PCIe issue, since I consistently get this error across various combinations of CUDA_VISIBLE_DEVICES (and I confirmed in nvtop that the cards previously reported as problematic were excluded in those runs). The cards are on risers, but they are short, high-quality ones, with a different riser on each card, so I don't think that's relevant. The cards also run at different PCIe versions and lane counts.
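Since the crash happens in a peer-to-peer copy, one more thing worth checking is which GPU pairs on this board actually report direct peer-access capability. The following is only a hedged diagnostic sketch using standard CUDA runtime calls (not llama.cpp code); device indices are whatever the driver enumerates:

// peer_check.cu -- print peer-access capability for every GPU pair.
// Compile with: nvcc -o peer_check peer_check.cu
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    int n = 0;
    if (cudaGetDeviceCount(&n) != cudaSuccess) {
        fprintf(stderr, "failed to query device count\n");
        return 1;
    }
    for (int src = 0; src < n; ++src) {
        for (int dst = 0; dst < n; ++dst) {
            if (src == dst) continue;
            int can = 0;
            cudaDeviceCanAccessPeer(&can, src, dst);   // capability query only
            printf("device %d -> device %d : peer access %s\n",
                   src, dst, can ? "supported" : "NOT supported");
        }
    }
    return 0;
}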

First Bad Commit

No response

Relevant log output

llama-cpp-chat-1  | /app/ggml/src/ggml-cuda/ggml-cuda.cu:75: CUDA error
llama-cpp-chat-1  | CUDA error: an illegal memory access was encountered
llama-cpp-chat-1  |   current device: 0, in function ggml_backend_cuda_cpy_tensor_async at /app/ggml/src/ggml-cuda/ggml-cuda.cu:2428
llama-cpp-chat-1  |   cudaMemcpyPeerAsync(dst->data, cuda_ctx_dst->device, src->data, cuda_ctx_src->device, ggml_nbytes(dst), cuda_ctx_src->stream())
@lee-b
Author

lee-b commented May 11, 2025

Possibly related (though with a different crash point): a CUDA illegal memory access issue with Scout, #13281.

@lee-b
Author

lee-b commented May 11, 2025

Just got this with a Q6_K_L model, so not an I-quant issue.

llama-cpp-chat-1  | /app/ggml/src/ggml-cuda/ggml-cuda.cu:75: CUDA error
llama-cpp-chat-1  | CUDA error: an illegal memory access was encountered
llama-cpp-chat-1  |   current device: 0, in function ggml_backend_cuda_cpy_tensor_async at /app/ggml/src/ggml-cuda/ggml-cuda.cu:2428
llama-cpp-chat-1  |   cudaMemcpyPeerAsync(dst->data, cuda_ctx_dst->device, src->data, cuda_ctx_src->device, ggml_nbytes(dst), cuda_ctx_src->stream())

@JohannesGaessler
Collaborator

What hardware are you using?

@JohannesGaessler
Collaborator

This isn't going to fix the bug, but you can try compiling with GGML_CUDA_NO_PEER_COPY.
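For anyone trying this: GGML_CUDA_NO_PEER_COPY is a compile-time switch, so it has to be set when building llama.cpp/ggml rather than at runtime. The sketch below only illustrates the general shape of such a flag (the function name and fallback behaviour are made up for illustration, not the actual ggml source): when the flag is defined, the backend declines the async peer copy so the caller falls back to a slower, non-peer path.

// Illustrative only -- NOT the ggml source.
#include <cuda_runtime.h>
#include <cstddef>

static bool try_async_peer_copy(void * dst, int dst_dev,
                                const void * src, int src_dev,
                                size_t nbytes, cudaStream_t stream) {
#ifdef GGML_CUDA_NO_PEER_COPY
    // Flag defined at build time: report "not handled" so the caller
    // uses a plain (non-peer) copy path instead.
    (void) dst; (void) dst_dev; (void) src; (void) src_dev;
    (void) nbytes; (void) stream;
    return false;
#else
    // Default path: direct device-to-device async copy, as in the backtrace.
    return cudaMemcpyPeerAsync(dst, dst_dev, src, src_dev, nbytes, stream) == cudaSuccess;
#endif
}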

@lee-b
Author

lee-b commented May 11, 2025

What hardware are you using?

4x Founders Edition 3090s, an MSI MAG X670E Tomahawk, short (15–20 cm) PCIe riser cables, a 7950X CPU, and 128 GB of system RAM. The risers seem to be high quality, and unfortunately they're necessary to fit the 3-slot cards on the board. It's definitely a bit more janky than I'd like, though, and could be causing some PCIe bus issue.
