Eval bug: Phi-4 mini in iOS with xcframework #12232

Closed
Animaxx opened this issue Mar 6, 2025 · 12 comments

Comments

@Animaxx

Animaxx commented Mar 6, 2025

Name and Version

Running on a real iOS device with the Phi-4 model gives "llama_init_from_model: failed to initialize Metal backend"

Operating systems

Mac

GGML backends

Metal

Hardware

iPhone 16 Pro

Models

Phi-4 mini Q4_K_L
https://huggingface.co/bartowski/microsoft_Phi-4-mini-instruct-GGUF/blob/main/microsoft_Phi-4-mini-instruct-Q4_K_L.gguf

Problem description & steps to reproduce

Unable to load Phi-4 mini, but able to load other (older) models.
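
For context, a minimal repro sketch of the load path through the xcframework's C API. This is illustrative only: the function names come from the public llama.h, while the wrapper function, parameter values, and error handling are assumptions.

// Minimal repro sketch (not from the original report): load the model and
// create a context via the llama.cpp C API exposed by the xcframework.
#include "llama.h"

struct llama_context * try_load(const char * model_path) {
    struct llama_model_params mparams = llama_model_default_params();
    mparams.n_gpu_layers = 99; // offload all layers to Metal, as in the log

    struct llama_model * model = llama_model_load_from_file(model_path, mparams);
    if (model == NULL) {
        return NULL;
    }

    struct llama_context_params cparams = llama_context_default_params();
    cparams.n_ctx = 8192; // matches n_ctx in the log below

    // on the failing device this returns NULL after
    // "llama_init_from_model: failed to initialize Metal backend"
    struct llama_context * ctx = llama_init_from_model(model, cparams);
    if (ctx == NULL) {
        llama_model_free(model);
    }
    return ctx;
}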

First Bad Commit

No response

Relevant log output

totalRAM 7.4770813, freeRAM 0.15818787, modelSize 2.64
RAMPressue normal

llama_model_load_from_file_impl: using device Metal (Apple A18 Pro GPU) - 5461 MiB free
llama_model_loader: loaded meta data with 40 key-value pairs and 196 tensors from /var/mobile/Containers/Data/Application/46F3617C-4A90-4449-8323-6961420AB5FC/Documents/Phi-4-mini-instruct-Q4_K_L.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = phi3
llama_model_loader: - kv   1:              phi3.rope.scaling.attn_factor f32              = 1.190238
llama_model_loader: - kv   2:                               general.type str              = model
llama_model_loader: - kv   3:                               general.name str              = Phi 4 Mini Instruct
llama_model_loader: - kv   4:                           general.finetune str              = instruct
llama_model_loader: - kv   5:                           general.basename str              = Phi-4
llama_model_loader: - kv   6:                         general.size_label str              = mini
llama_model_loader: - kv   7:                            general.license str              = mit
llama_model_loader: - kv   8:                       general.license.link str              = https://huggingface.co/microsoft/Phi-...
llama_model_loader: - kv   9:                               general.tags arr[str,3]       = ["nlp", "code", "text-generation"]
llama_model_loader: - kv  10:                          general.languages arr[str,1]       = ["multilingual"]
llama_model_loader: - kv  11:                        phi3.context_length u32              = 131072
llama_model_loader: - kv  12:  phi3.rope.scaling.original_context_length u32              = 4096
llama_model_loader: - kv  13:                      phi3.embedding_length u32              = 3072
llama_model_loader: - kv  14:                   phi3.feed_forward_length u32              = 8192
llama_model_loader: - kv  15:                           phi3.block_count u32              = 32
llama_model_loader: - kv  16:                  phi3.attention.head_count u32              = 24
llama_model_loader: - kv  17:               phi3.attention.head_count_kv u32              = 8
llama_model_loader: - kv  18:      phi3.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  19:                  phi3.rope.dimension_count u32              = 96
llama_model_loader: - kv  20:                        phi3.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv  21:              phi3.attention.sliding_window u32              = 262144
llama_model_loader: - kv  22:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  23:                         tokenizer.ggml.pre str              = gpt-4o
llama_model_loader: - kv  24:                      tokenizer.ggml.tokens arr[str,200064]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  25:                  tokenizer.ggml.token_type arr[i32,200064]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  26:                      tokenizer.ggml.merges arr[str,199742]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "e r", ...
llama_model_loader: - kv  27:                tokenizer.ggml.bos_token_id u32              = 199999
llama_model_loader: - kv  28:                tokenizer.ggml.eos_token_id u32              = 199999
llama_model_loader: - kv  29:            tokenizer.ggml.unknown_token_id u32              = 199999
llama_model_loader: - kv  30:            tokenizer.ggml.padding_token_id u32              = 199999
llama_model_loader: - kv  31:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  32:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  33:                    tokenizer.chat_template str              = {% for message in messages %}{% if me...
llama_model_loader: - kv  34:               general.quantization_version u32              = 2
llama_model_loader: - kv  35:                          general.file_type u32              = 15
llama_model_loader: - kv  36:                      quantize.imatrix.file str              = /models_out/Phi-4-mini-instruct-GGUF/...
llama_model_loader: - kv  37:                   quantize.imatrix.dataset str              = /training_dir/calibration_datav3.txt
llama_model_loader: - kv  38:             quantize.imatrix.entries_count i32              = 128
llama_model_loader: - kv  39:              quantize.imatrix.chunks_count i32              = 123
llama_model_loader: - type  f32:   67 tensors
llama_model_loader: - type q8_0:    1 tensors
llama_model_loader: - type q4_K:   80 tensors
llama_model_loader: - type q5_K:   32 tensors
llama_model_loader: - type q6_K:   16 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_K - Medium
print_info: file size   = 2.45 GiB (5.49 BPW) 
init_tokenizer: initializing tokenizer for type 2
load: control token: 200028 '<|tag|>' is not marked as EOG
load: control token: 200027 '<|tool_response|>' is not marked as EOG
load: control token: 200026 '<|/tool_call|>' is not marked as EOG
load: control token: 200025 '<|tool_call|>' is not marked as EOG
load: control token: 200022 '<|system|>' is not marked as EOG
load: control token: 200018 '<|endofprompt|>' is not marked as EOG
load: control token: 200021 '<|user|>' is not marked as EOG
load: control token: 200024 '<|/tool|>' is not marked as EOG
load: control token: 200023 '<|tool|>' is not marked as EOG
load: control token: 200019 '<|assistant|>' is not marked as EOG
load: special tokens cache size = 12
load: token to piece cache size = 1.3333 MB
print_info: arch             = phi3
print_info: vocab_only       = 0
print_info: n_ctx_train      = 131072
print_info: n_embd           = 3072
print_info: n_layer          = 32
print_info: n_head           = 24
print_info: n_head_kv        = 8
print_info: n_rot            = 96
print_info: n_swa            = 262144
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 3
print_info: n_embd_k_gqa     = 1024
print_info: n_embd_v_gqa     = 1024
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-05
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: n_ff             = 8192
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 2
print_info: rope scaling     = linear
print_info: freq_base_train  = 10000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 4096
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = 3B
print_info: model params     = 3.84 B
print_info: general.name     = Phi 4 Mini Instruct
print_info: vocab type       = BPE
print_info: n_vocab          = 200064
print_info: n_merges         = 199742
print_info: BOS token        = 199999 '<|endoftext|>'
print_info: EOS token        = 199999 '<|endoftext|>'
print_info: EOT token        = 200020 '<|end|>'
print_info: UNK token        = 199999 '<|endoftext|>'
print_info: PAD token        = 199999 '<|endoftext|>'
print_info: LF token         = 198 'Ċ'
print_info: EOG token        = 199999 '<|endoftext|>'
print_info: EOG token        = 200020 '<|end|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: layer   0 assigned to device Metal
load_tensors: layer   1 assigned to device Metal
load_tensors: layer   2 assigned to device Metal
load_tensors: layer   3 assigned to device Metal
load_tensors: layer   4 assigned to device Metal
load_tensors: layer   5 assigned to device Metal
load_tensors: layer   6 assigned to device Metal
load_tensors: layer   7 assigned to device Metal
load_tensors: layer   8 assigned to device Metal
load_tensors: layer   9 assigned to device Metal
load_tensors: layer  10 assigned to device Metal
load_tensors: layer  11 assigned to device Metal
load_tensors: layer  12 assigned to device Metal
load_tensors: layer  13 assigned to device Metal
load_tensors: layer  14 assigned to device Metal
load_tensors: layer  15 assigned to device Metal
load_tensors: layer  16 assigned to device Metal
load_tensors: layer  17 assigned to device Metal
load_tensors: layer  18 assigned to device Metal
load_tensors: layer  19 assigned to device Metal
load_tensors: layer  20 assigned to device Metal
load_tensors: layer  21 assigned to device Metal
load_tensors: layer  22 assigned to device Metal
load_tensors: layer  23 assigned to device Metal
load_tensors: layer  24 assigned to device Metal
load_tensors: layer  25 assigned to device Metal
load_tensors: layer  26 assigned to device Metal
load_tensors: layer  27 assigned to device Metal
load_tensors: layer  28 assigned to device Metal
load_tensors: layer  29 assigned to device Metal
load_tensors: layer  30 assigned to device Metal
load_tensors: layer  31 assigned to device Metal
load_tensors: layer  32 assigned to device Metal
load_tensors: tensor 'token_embd.weight' (q8_0) (and 0 others) cannot be used with preferred buffer type CPU_AARCH64, using CPU instead
ggml_backend_metal_log_allocated_size: allocated buffer, size =  2510.53 MiB, ( 2510.61 /  5461.34)
load_tensors: offloading 32 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 33/33 layers to GPU
load_tensors:   CPU_Mapped model buffer size =   622.76 MiB
load_tensors: Metal_Mapped model buffer size =  2510.53 MiB
Using 5 threads
llama_init_from_model: n_seq_max     = 1
llama_init_from_model: n_ctx         = 8192
llama_init_from_model: n_ctx_per_seq = 8192
llama_init_from_model: n_batch       = 8192
llama_init_from_model: n_ubatch      = 512
llama_init_from_model: flash_attn    = 1
llama_init_from_model: freq_base     = 10000.0
llama_init_from_model: freq_scale    = 1
llama_init_from_model: n_ctx_per_seq (8192) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
ggml_metal_init: allocating
ggml_metal_init: picking default device: Apple A18 Pro GPU
ggml_metal_init: using embedded metal library
Compiler failed with XPC_ERROR_CONNECTION_INTERRUPTED
Compiler failed with XPC_ERROR_CONNECTION_INTERRUPTED
Compiler failed with XPC_ERROR_CONNECTION_INTERRUPTED
MTLCompiler: Compilation failed with XPC_ERROR_CONNECTION_INTERRUPTED on 3 try
ggml_metal_init: error: Error Domain=MTLLibraryErrorDomain Code=3 "Compiler encountered an internal error" UserInfo={NSLocalizedDescription=Compiler encountered an internal error}
ggml_backend_metal_device_init: error: failed to allocate context
llama_init_from_model: failed to initialize Metal backend
Could not load context!
Memory pressure warning received
totalRAM 7.4770813, freeRAM 0.16456604, modelSize 2.64
currentPressue warning
llama_model_load_from_file_impl: using device Metal (Apple A18 Pro GPU) - 2950 MiB free
llama_model_loader: loaded meta data with 40 key-value pairs and 196 tensors from /var/mobile/Containers/Data/Application/46F3617C-4A90-4449-8323-6961420AB5FC/Documents/Phi-4-mini-instruct-Q4_K_L.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = phi3
llama_model_loader: - kv   1:              phi3.rope.scaling.attn_factor f32              = 1.190238
llama_model_loader: - kv   2:                               general.type str              = model
llama_model_loader: - kv   3:                               general.name str              = Phi 4 Mini Instruct
llama_model_loader: - kv   4:                           general.finetune str              = instruct
llama_model_loader: - kv   5:                           general.basename str              = Phi-4
llama_model_loader: - kv   6:                         general.size_label str              = mini
llama_model_loader: - kv   7:                            general.license str              = mit
llama_model_loader: - kv   8:                       general.license.link str              = https://huggingface.co/microsoft/Phi-...
llama_model_loader: - kv   9:                               general.tags arr[str,3]       = ["nlp", "code", "text-generation"]
llama_model_loader: - kv  10:                          general.languages arr[str,1]       = ["multilingual"]
llama_model_loader: - kv  11:                        phi3.context_length u32              = 131072
llama_model_loader: - kv  12:  phi3.rope.scaling.original_context_length u32              = 4096
llama_model_loader: - kv  13:                      phi3.embedding_length u32              = 3072
llama_model_loader: - kv  14:                   phi3.feed_forward_length u32              = 8192
llama_model_loader: - kv  15:                           phi3.block_count u32              = 32
llama_model_loader: - kv  16:                  phi3.attention.head_count u32              = 24
llama_model_loader: - kv  17:               phi3.attention.head_count_kv u32              = 8
llama_model_loader: - kv  18:      phi3.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  19:                  phi3.rope.dimension_count u32              = 96
llama_model_loader: - kv  20:                        phi3.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv  21:              phi3.attention.sliding_window u32              = 262144
llama_model_loader: - kv  22:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  23:                         tokenizer.ggml.pre str              = gpt-4o
llama_model_loader: - kv  24:                      tokenizer.ggml.tokens arr[str,200064]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  25:                  tokenizer.ggml.token_type arr[i32,200064]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  26:                      tokenizer.ggml.merges arr[str,199742]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "e r", ...
llama_model_loader: - kv  27:                tokenizer.ggml.bos_token_id u32              = 199999
llama_model_loader: - kv  28:                tokenizer.ggml.eos_token_id u32              = 199999
llama_model_loader: - kv  29:            tokenizer.ggml.unknown_token_id u32              = 199999
llama_model_loader: - kv  30:            tokenizer.ggml.padding_token_id u32              = 199999
llama_model_loader: - kv  31:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  32:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  33:                    tokenizer.chat_template str              = {% for message in messages %}{% if me...
llama_model_loader: - kv  34:               general.quantization_version u32              = 2
llama_model_loader: - kv  35:                          general.file_type u32              = 15
llama_model_loader: - kv  36:                      quantize.imatrix.file str              = /models_out/Phi-4-mini-instruct-GGUF/...
llama_model_loader: - kv  37:                   quantize.imatrix.dataset str              = /training_dir/calibration_datav3.txt
llama_model_loader: - kv  38:             quantize.imatrix.entries_count i32              = 128
llama_model_loader: - kv  39:              quantize.imatrix.chunks_count i32              = 123
llama_model_loader: - type  f32:   67 tensors
llama_model_loader: - type q8_0:    1 tensors
llama_model_loader: - type q4_K:   80 tensors
llama_model_loader: - type q5_K:   32 tensors
llama_model_loader: - type q6_K:   16 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_K - Medium
print_info: file size   = 2.45 GiB (5.49 BPW) 
init_tokenizer: initializing tokenizer for type 2
load: control token: 200028 '<|tag|>' is not marked as EOG
load: control token: 200027 '<|tool_response|>' is not marked as EOG
load: control token: 200026 '<|/tool_call|>' is not marked as EOG
load: control token: 200025 '<|tool_call|>' is not marked as EOG
load: control token: 200022 '<|system|>' is not marked as EOG
load: control token: 200018 '<|endofprompt|>' is not marked as EOG
load: control token: 200021 '<|user|>' is not marked as EOG
load: control token: 200024 '<|/tool|>' is not marked as EOG
load: control token: 200023 '<|tool|>' is not marked as EOG
load: control token: 200019 '<|assistant|>' is not marked as EOG
load: special tokens cache size = 12
load: token to piece cache size = 1.3333 MB
print_info: arch             = phi3
print_info: vocab_only       = 0
print_info: n_ctx_train      = 131072
print_info: n_embd           = 3072
print_info: n_layer          = 32
print_info: n_head           = 24
print_info: n_head_kv        = 8
print_info: n_rot            = 96
print_info: n_swa            = 262144
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 3
print_info: n_embd_k_gqa     = 1024
print_info: n_embd_v_gqa     = 1024
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-05
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: n_ff             = 8192
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 2
print_info: rope scaling     = linear
print_info: freq_base_train  = 10000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 4096
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = 3B
print_info: model params     = 3.84 B
print_info: general.name     = Phi 4 Mini Instruct
print_info: vocab type       = BPE
print_info: n_vocab          = 200064
print_info: n_merges         = 199742
print_info: BOS token        = 199999 '<|endoftext|>'
print_info: EOS token        = 199999 '<|endoftext|>'
print_info: EOT token        = 200020 '<|end|>'
print_info: UNK token        = 199999 '<|endoftext|>'
print_info: PAD token        = 199999 '<|endoftext|>'
print_info: LF token         = 198 'Ċ'
print_info: EOG token        = 199999 '<|endoftext|>'
print_info: EOG token        = 200020 '<|end|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: layer   0 assigned to device Metal
load_tensors: layer   1 assigned to device Metal
load_tensors: layer   2 assigned to device Metal
load_tensors: layer   3 assigned to device Metal
load_tensors: layer   4 assigned to device Metal
load_tensors: layer   5 assigned to device Metal
load_tensors: layer   6 assigned to device Metal
load_tensors: layer   7 assigned to device Metal
load_tensors: layer   8 assigned to device Metal
load_tensors: layer   9 assigned to device Metal
load_tensors: layer  10 assigned to device Metal
load_tensors: layer  11 assigned to device Metal
load_tensors: layer  12 assigned to device Metal
load_tensors: layer  13 assigned to device Metal
load_tensors: layer  14 assigned to device Metal
load_tensors: layer  15 assigned to device Metal
load_tensors: layer  16 assigned to device Metal
load_tensors: layer  17 assigned to device Metal
load_tensors: layer  18 assigned to device Metal
load_tensors: layer  19 assigned to device Metal
load_tensors: layer  20 assigned to device Metal
load_tensors: layer  21 assigned to device Metal
load_tensors: layer  22 assigned to device Metal
load_tensors: layer  23 assigned to device Metal
load_tensors: layer  24 assigned to device Metal
load_tensors: layer  25 assigned to device Metal
load_tensors: layer  26 assigned to device Metal
load_tensors: layer  27 assigned to device Metal
load_tensors: layer  28 assigned to device Metal
load_tensors: layer  29 assigned to device Metal
load_tensors: layer  30 assigned to device Metal
load_tensors: layer  31 assigned to device Metal
load_tensors: layer  32 assigned to device Metal
load_tensors: tensor 'token_embd.weight' (q8_0) (and 0 others) cannot be used with preferred buffer type CPU_AARCH64, using CPU instead
ggml_backend_metal_log_allocated_size: allocated buffer, size =  2510.53 MiB, ( 5021.52 /  5461.34)
load_tensors: offloading 32 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 33/33 layers to GPU
load_tensors:   CPU_Mapped model buffer size =   622.76 MiB
load_tensors: Metal_Mapped model buffer size =  2510.53 MiB
Using 5 threads
llama_init_from_model: n_seq_max     = 1
llama_init_from_model: n_ctx         = 8192
llama_init_from_model: n_ctx_per_seq = 8192
llama_init_from_model: n_batch       = 8192
llama_init_from_model: n_ubatch      = 512
llama_init_from_model: flash_attn    = 1
llama_init_from_model: freq_base     = 10000.0
llama_init_from_model: freq_scale    = 1
llama_init_from_model: n_ctx_per_seq (8192) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
ggml_metal_init: allocating
ggml_metal_init: picking default device: Apple A18 Pro GPU
ggml_metal_init: using embedded metal library
Compiler failed with XPC_ERROR_CONNECTION_INTERRUPTED
Compiler failed with XPC_ERROR_CONNECTION_INTERRUPTED
Compiler failed with XPC_ERROR_CONNECTION_INTERRUPTED
MTLCompiler: Compilation failed with XPC_ERROR_CONNECTION_INTERRUPTED on 3 try
ggml_metal_init: error: Error Domain=MTLLibraryErrorDomain Code=3 "Compiler encountered an internal error" UserInfo={NSLocalizedDescription=Compiler encountered an internal error}
ggml_backend_metal_device_init: error: failed to allocate context
llama_init_from_model: failed to initialize Metal backend
Could not load context!
@danbev
Collaborator

danbev commented Mar 10, 2025

I don't have a real iPhone 16 Pro device to try this out on, but I've tried loading this model in a simulator (using the llama.swiftui example) and also in a standalone Swift example that uses the framework.

It looks like this error happens when the embedded Metal library is being compiled by Metal. This is done in
ggml/src/ggml-metal/ggml-metal.m:

static struct ggml_backend_metal_context * ggml_metal_init(ggml_backend_dev_t dev) {
    ...
                metal_library = [device newLibraryWithSource:src options:options error:&error];
                if (error) {
                    GGML_LOG_ERROR("%s: error: %s\n", __func__, [[error description] UTF8String]);
                    return NULL;
                }

When newLibraryWithSource is called, it takes the kernel sources (the contents of the file ggml/src/ggml-metal/ggml-metal.metal) and sends them to the Metal compiler XPC service for compilation.

(lldb) p src
(__NSCFString *) 0x0000000148058000 @"#define GGML_COMMON_DECL_METAL\n#define GGML_COMMON_IMPL_METAL\n#if defined(GGML_METAL_EMBED_LIBRARY)\n#ifndef GGML_COMMON_DECL\n\n#if defined(GGML_COMMON_DECL_C)\n#include <stdint.h>\n\ntypedef uint16_t ggml_half;\ntypedef uint32_t ggml_half2;\n\n#define GGML_COMMON_AGGR_U\n#define GGML_COMMON_AGGR_S\n\n#define GGML_COMMON_DECL\n#elif defined(GGML_COMMON_DECL_CPP)\n#include <cstdint>\n\ntypedef uint16_t ggml_half;\ntypedef uint32_t ggml_half2;\n\n// std-c++ allow anonymous unions but some compiler warn on it\n#define GGML_COMMON_AGGR_U data\n// std-c++ do not allow it.\n#define GGML_COMMON_AGGR_S data\n\n#define GGML_COMMON_DECL\n#elif defined(GGML_COMMON_DECL_METAL)\n#include <metal_stdlib>\n\ntypedef half  ggml_half;\ntypedef half2 ggml_half2;\n\n#define GGML_COMMON_AGGR_U\n#define GGML_COMMON_AGGR_S\n\n#define GGML_COMMON_DECL\n#elif defined(GGML_COMMON_DECL_CUDA)\n#if defined(GGML_COMMON_DECL_MUSA)\n#include <musa_fp16.h>\n#else\n#include <cuda_fp16.h>\n#endif\n#include <cstdint>\n\ntypedef half  ggml_half;\ntypedef half2 ggml_half2;\n\n#define GGML_"

So the XPC_ERROR_CONNECTION_INTERRUPTED error is perhaps happening because the XPC service is being interrupted somehow while trying to compile src (the Metal compute shader code).
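
As an aside, Metal can also load a precompiled library, which skips the runtime compile and therefore the MTLCompiler XPC service entirely. A hypothetical Objective-C sketch, assuming a default.metallib is bundled with the app (this is not what the embedded-library build currently does):

// Hypothetical alternative: load a precompiled .metallib so that no shader
// source has to be sent to the MTLCompiler XPC service at runtime.
#import <Metal/Metal.h>

id<MTLDevice> device = MTLCreateSystemDefaultDevice();
NSError * error = nil;
NSURL * url = [[NSBundle mainBundle] URLForResource:@"default" withExtension:@"metallib"];
id<MTLLibrary> metal_library = [device newLibraryWithURL:url error:&error];
if (error != nil) {
    NSLog(@"failed to load precompiled metallib: %@", error);
}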

In the output above we can see the following message:

Memory pressure warning received

Perhaps the system is terminating the XPC service because of memory pressure.
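
If that is the case, one hypothetical mitigation (not part of llama.cpp) would be for the app to check its memory headroom before attempting to initialize the Metal backend:

// Sketch: gate Metal initialization on the process's available memory.
// os_proc_available_memory() is a real iOS 13+ API; gating on it and the
// required_bytes threshold are assumptions for illustration.
#include <os/proc.h>
#include <stdbool.h>
#include <stddef.h>

static bool have_headroom_for_metal_init(size_t required_bytes) {
    return os_proc_available_memory() >= required_bytes;
}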

Unable to load Phi-4 mini, but able to load other (older) models.

Are the older models you were able to load smaller or around the same size as this model?

@ggerganov
Member

I also experience this error sporadically on my iPhone 13 mini. Try to disable residency sets and see if it helps.

@Animaxx
Author

Animaxx commented Mar 14, 2025

Sure. Recently the problem hasn't happened again with the Phi-4 mini model, but the other (older) models include a variety of sizes, such as Qwen 2.5 1.5B Q8 and Qwen 2.5 7B Q4. Hope that helps.

@Animaxx
Author

Animaxx commented Mar 14, 2025

I think the "Memory pressure warning received" happens during the retry; on the first load the pressure is normal.

@ggerganov
Member

Try to disable residency set usage by applying this patch:

diff --git a/ggml/src/ggml-metal/ggml-metal.m b/ggml/src/ggml-metal/ggml-metal.m
index e51a4169a..e0595351b 100644
--- a/ggml/src/ggml-metal/ggml-metal.m
+++ b/ggml/src/ggml-metal/ggml-metal.m
@@ -24,12 +24,12 @@
 #endif
 
 // create residency sets only on macOS >= 15.0
-#if !TARGET_CPU_X86_64 && TARGET_OS_OSX && __MAC_OS_X_VERSION_MAX_ALLOWED >= 150000 || \
-    TARGET_OS_IOS && __IPHONE_OS_VERSION_MAX_ALLOWED >= 180000 || \
-    TARGET_OS_TV && __TV_OS_VERSION_MAX_ALLOWED >= 180000 || \
-    TARGET_OS_VISION && __VISION_OS_VERSION_MAX_ALLOWED >= 200000
-#define GGML_METAL_HAS_RESIDENCY_SETS 1
-#endif
+//#if !TARGET_CPU_X86_64 && TARGET_OS_OSX && __MAC_OS_X_VERSION_MAX_ALLOWED >= 150000 || \
+//    TARGET_OS_IOS && __IPHONE_OS_VERSION_MAX_ALLOWED >= 180000 || \
+//    TARGET_OS_TV && __TV_OS_VERSION_MAX_ALLOWED >= 180000 || \
+//    TARGET_OS_VISION && __VISION_OS_VERSION_MAX_ALLOWED >= 200000
+//#define GGML_METAL_HAS_RESIDENCY_SETS 1
+//#endif
 
 // globals
 

And then you have to rebuild the XCFramework. Let us know if the issue persists in this case.

@Animaxx
Author

Animaxx commented Mar 15, 2025

Thanks, let me try with this one.

@Animaxx
Author

Animaxx commented Mar 17, 2025

Disabling residency sets seems to work better in my case.

@ggerganov
Member

Yes, I'm thinking that it should be disabled on iOS, tvOS and visionOS because these usually have very limited memory.

@jhen0409 What do you think? Did you observe any benefits from enabling it in #11648?

@jhen0409
Collaborator

Yes, I'm thinking that it should be disabled on iOS, tvOS and visionOS because these usually have very limited memory.

@jhen0409 What do you think? Did you observe any benefits from enabling it in #11648?

I found a slight warm-up delay in the bench when residency sets are disabled, but in most cases on iOS I don't think it affects the user experience. I believe it would be better to disable this by default for devices with limited memory.

Also, the primary reason I attempted to enable it is because I also have an iOS app running on macOS (Designed for iPad target).

@Animaxx
Author

Animaxx commented Mar 28, 2025

Just to follow up on this one: should we change from

#if !TARGET_CPU_X86_64 && TARGET_OS_OSX && __MAC_OS_X_VERSION_MAX_ALLOWED >= 150000 || \
    TARGET_OS_IOS && __IPHONE_OS_VERSION_MAX_ALLOWED >= 180000 || \
    TARGET_OS_TV && __TV_OS_VERSION_MAX_ALLOWED >= 180000 || \
    TARGET_OS_VISION && __VISION_OS_VERSION_MAX_ALLOWED >= 200000
#define GGML_METAL_HAS_RESIDENCY_SETS 1
#endif

to

#if !TARGET_CPU_X86_64 && TARGET_OS_OSX && __MAC_OS_X_VERSION_MAX_ALLOWED >= 150000 || \
    TARGET_OS_VISION && __VISION_OS_VERSION_MAX_ALLOWED >= 200000
#define GGML_METAL_HAS_RESIDENCY_SETS 1
#endif

@ggerganov
Member

@Animaxx I would also remove the Vision check.

Also, the primary reason I attempted to enable it is because I also have an iOS app running on macOS (Designed for iPad target).

Hm, I guess that's a valid use case that would require keeping the TARGET_OS_IOS check. Not sure if there is anything else we can check to distinguish this scenario.

Anyway, I think it is better to remove the check for now so that iPhone and iPad don't panic.
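
For reference, dropping the iOS, tvOS, and visionOS branches as suggested would leave something like the following (a sketch of the proposed end state, not the merged change):

// create residency sets only on macOS >= 15.0
#if !TARGET_CPU_X86_64 && TARGET_OS_OSX && __MAC_OS_X_VERSION_MAX_ALLOWED >= 150000
#define GGML_METAL_HAS_RESIDENCY_SETS 1
#endif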

@github-actions github-actions bot added the stale label Apr 28, 2025
Contributor

This issue was closed because it has been inactive for 14 days since being marked as stale.
