Added support for overriding tensor buffer types #2007
Open
zpin wants to merge 2 commits into abetlen:main from zpin:override_tensor
Conversation
Could you explain the usage of this parameter? I tried:

python3 -m llama_cpp.server --model /home/LLM/DeepSeek-Coder-V2-Lite-Instruct-Q4_K_M.gguf --port 8002 --verbose True --n_gpu_layers 99 ---tensor_buft_overrides exp=CPU

and

python3 -m llama_cpp.server --model /home/arda/LLM/DeepSeek-Coder-V2-Lite-Instruct-Q4_K_M.gguf --port 8002 --verbose True --n_gpu_layers 99 ---override_tensor exp=CPU

Both of them lead to an error. Here is the log:

usage: __main__.py [-h] [--model MODEL] [--model_alias MODEL_ALIAS] [--n_gpu_layers N_GPU_LAYERS]
[--split_mode SPLIT_MODE] [--main_gpu MAIN_GPU] [--tensor_split [TENSOR_SPLIT ...]]
[--vocab_only VOCAB_ONLY] [--use_mmap USE_MMAP] [--use_mlock USE_MLOCK]
[--kv_overrides [KV_OVERRIDES ...]] [--rpc_servers RPC_SERVERS] [--seed SEED] [--n_ctx N_CTX]
[--n_batch N_BATCH] [--n_ubatch N_UBATCH] [--n_threads N_THREADS]
[--n_threads_batch N_THREADS_BATCH] [--rope_scaling_type ROPE_SCALING_TYPE]
[--rope_freq_base ROPE_FREQ_BASE] [--rope_freq_scale ROPE_FREQ_SCALE]
[--yarn_ext_factor YARN_EXT_FACTOR] [--yarn_attn_factor YARN_ATTN_FACTOR]
[--yarn_beta_fast YARN_BETA_FAST] [--yarn_beta_slow YARN_BETA_SLOW]
[--yarn_orig_ctx YARN_ORIG_CTX] [--mul_mat_q MUL_MAT_Q] [--logits_all LOGITS_ALL]
[--embedding EMBEDDING] [--offload_kqv OFFLOAD_KQV] [--flash_attn FLASH_ATTN]
[--last_n_tokens_size LAST_N_TOKENS_SIZE] [--lora_base LORA_BASE] [--lora_path LORA_PATH]
[--numa NUMA] [--chat_format CHAT_FORMAT] [--clip_model_path CLIP_MODEL_PATH] [--cache CACHE]
[--cache_type CACHE_TYPE] [--cache_size CACHE_SIZE]
[--hf_tokenizer_config_path HF_TOKENIZER_CONFIG_PATH]
[--hf_pretrained_model_name_or_path HF_PRETRAINED_MODEL_NAME_OR_PATH]
[--hf_model_repo_id HF_MODEL_REPO_ID] [--draft_model DRAFT_MODEL]
[--draft_model_num_pred_tokens DRAFT_MODEL_NUM_PRED_TOKENS] [--type_k TYPE_K] [--type_v TYPE_V]
[--verbose VERBOSE] [--host HOST] [--port PORT] [--ssl_keyfile SSL_KEYFILE]
[--ssl_certfile SSL_CERTFILE] [--api_key API_KEY] [--interrupt_requests INTERRUPT_REQUESTS]
[--disable_ping_events DISABLE_PING_EVENTS] [--root_path ROOT_PATH] [--config_file CONFIG_FILE]
__main__.py: error: unrecognized arguments: ---tensor_buft_overrides exp=CPU
It was merged, but it is not used in the main branch yet; that's why.
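If the server rejects the flag, a quick way to check whether the installed llama-cpp-python build exposes the new constructor parameter at all is to inspect the Llama signature. This is a minimal diagnostic sketch; the name override_tensor is taken from the PR description below and should be treated as an assumption.

```python
# Diagnostic sketch: does the installed llama-cpp-python build expose the new
# parameter from this PR? The name "override_tensor" comes from the PR
# description and may differ in the version you have installed.
import inspect

from llama_cpp import Llama

params = inspect.signature(Llama.__init__).parameters
print("override_tensor" in params)  # False means the installed build predates this change
```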
Equivalent to the -ot llama.cpp argument. It can be passed as an optional string to the Llama class using the new override_tensor parameter, in the same format as the llama.cpp argument. This provides more control over how memory is used, letting you selectively place specific tensors on different devices, which is especially helpful when running large MoE models.
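A minimal usage sketch based on the description above, assuming the parameter name override_tensor and the llama.cpp-style "pattern=buffer_type" value; the model path is a placeholder and the exp=CPU pattern is the one used in the question earlier in this thread.

```python
# Minimal sketch based on the PR description: keep MoE expert tensors on the CPU
# while offloading everything else to the GPU. The parameter name
# "override_tensor" and the "pattern=buffer_type" format follow llama.cpp's
# -ot / --override-tensor flag; treat both as assumptions about this PR.
from llama_cpp import Llama

llm = Llama(
    model_path="./DeepSeek-Coder-V2-Lite-Instruct-Q4_K_M.gguf",  # placeholder path
    n_gpu_layers=99,
    override_tensor="exp=CPU",  # tensors whose names match "exp" use the CPU buffer type
)
```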