Description
Describe the bug
I got the error [WARNING:swift] max_model_len(40960) - num_tokens(118444) < max_tokens(1024). Setting max_tokens: -77484 even though I set max_model_len to 131072 together with rope_scaling. I passed both rope_scaling and max_model_len for inference, and also added the rope_scaling field to the original model's config.json. At first the model picks up the max_model_len I defined, but it then gets overridden, probably because swift is still reading the positional-embedding value from config.json instead of using the updated rope_scaling.
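The numbers in the warning line up with that suspicion. A quick arithmetic check (my own, not swift's code; every value is taken from the warning above and the yarn config at the bottom of this issue):

# Sketch only: values copied from the warning and the yarn config below
expected_len = int(32768 * 4.0)   # original_max_position_embeddings * factor = 131072
used_len = 40960                  # the max_model_len swift reports in the warning
num_tokens = 118444               # prompt length of the failing sample
print(expected_len)               # 131072, what --max_model_len requested
print(used_len - num_tokens)      # -77484, the invalid max_tokens in the error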
Script I used:
NPROC_PER_NODE=1 \
CUDA_VISIBLE_DEVICES=0 \
swift infer \
    --model Qwen/Qwen3-8B \
    --temperature 0 \
    --infer_backend vllm \
    --val_dataset /home/yekyung/git/CLIPPER/data/sampled/dev.jsonl \
    --use_hf 1 \
    --rope_scaling 'yarn' \
    --max_length 128000 \
    --max_model_len 131072 \
    --max_new_tokens 1024
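For comparison, loading the same model directly with vLLM and forcing the same yarn settings via hf_overrides (a sketch; it assumes a vLLM build that exposes the hf_overrides argument, and it is not how ms-swift wires things internally):

# Hedged comparison script, not ms-swift code
from vllm import LLM

llm = LLM(
    model="Qwen/Qwen3-8B",
    max_model_len=131072,
    hf_overrides={
        "rope_scaling": {
            "rope_type": "yarn",
            "factor": 4.0,
            "original_max_position_embeddings": 32768,
        }
    },
)
# vLLM should log "Using max model len 131072" here, just like the INFO line in
# the output below, so the 40960 cap looks like it is applied on the swift side
# when max_tokens is clamped.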
Error I got:
[INFO:swift] Loading the model using model_dir: /home/yekyung/.cache/huggingface/hub/models--Qwen--Qwen3-8B/snapshots/9c925d64d72725edaf899c6cb9c377fd0709d9c5
[INFO:swift] default_system: None
[INFO:swift] max_length: 128000
[INFO:swift] response_prefix: ''
[INFO:swift] agent_template: hermes
INFO 07-09 19:56:24 [config.py:841] This model supports multiple tasks: {'classify', 'reward', 'generate', 'embed'}. Defaulting to 'generate'.
INFO 07-09 19:56:24 [config.py:1472] Using max model len 131072
INFO 07-09 19:56:24 [config.py:1988] Disabling V1 multiprocessing for external launcher.
INFO 07-09 19:56:24 [config.py:2285] Chunked prefill is enabled with max_num_batched_tokens=2048.
...... loading the model ......
[INFO:swift] swift.version: 3.6.0.dev0
[INFO:swift] request_config: RequestConfig(max_tokens=1024, temperature=0.0, top_k=None, top_p=None, repetition_penalty=None, num_beams=1, stop=[], seed=None, stream=False, logprobs=False, top_logprobs=None, n=1, best_of=None, presence_penalty=0.0, frequency_penalty=0.0, length_penalty=1.0)
[rank0]:[W709 19:57:05.788547688 ProcessGroupNCCL.cpp:4718] [PG ID 0 PG GUID 0 Rank 0] using GPU 0 as device used by this process is currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. You can pecify device_id in init_process_group() to force use of a particular device.
[INFO:swift] val_dataset: Dataset({
features: ['messages'],
num_rows: 100
})
[WARNING:swift] max_model_len(40960) - num_tokens(118444) < max_tokens(1024). Setting max_tokens: -77484
[rank0]: Traceback (most recent call last):
[rank0]: File "/home/yekyung/git/ms-swift/swift/cli/infer.py", line 5, in
[rank0]: infer_main()
[rank0]: File "/home/yekyung/git/ms-swift/swift/llm/infer/infer.py", line 279, in infer_main
[rank0]: return SwiftInfer(args).main()
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/yekyung/git/ms-swift/swift/llm/base.py", line 49, in main
[rank0]: result = self.run()
[rank0]: ^^^^^^^^^^
[rank0]: File "/home/yekyung/git/ms-swift/swift/llm/infer/infer.py", line 91, in run
[rank0]: result = self.infer_dataset()
[rank0]: ^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/yekyung/git/ms-swift/swift/llm/infer/infer.py", line 237, in infer_dataset
[rank0]: result_list += self._batch_infer(shard_dataset, request_config)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/yekyung/git/ms-swift/swift/llm/infer/infer.py", line 266, in _batch_infer
[rank0]: resp_list = self.infer(val_dataset, request_config, template=self.template, use_tqdm=True, **self.infer_kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/yekyung/git/ms-swift/swift/llm/infer/infer_engine/vllm_engine.py", line 446, in infer
[rank0]: generation_config = self._prepare_generation_config(request_config)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/yekyung/git/ms-swift/swift/llm/infer/infer_engine/vllm_engine.py", line 307, in _prepare_generation_config
[rank0]: res = SamplingParams(**kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/yekyung/vllm/vllm/sampling_params.py", line 375, in post_init
[rank0]: self._verify_args()
[rank0]: File "/home/yekyung/vllm/vllm/sampling_params.py", line 430, in _verify_args
[rank0]: raise ValueError(
[rank0]: ValueError: max_tokens must be at least 1, got -77484.
Your hardware and system info
H100
torch 2.7.1
Additional context
config.json of the model (rope_scaling section I added):
"rope_scaling": {
"rope_type": "yarn",
"factor": 4.0,
"original_max_position_embeddings": 32768
},
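To double-check that the edit above is actually picked up, here is the kind of plain-transformers check I would run against the cached snapshot (a sketch, outside swift/vLLM):

from transformers import AutoConfig

cfg = AutoConfig.from_pretrained(
    "/home/yekyung/.cache/huggingface/hub/models--Qwen--Qwen3-8B/snapshots/9c925d64d72725edaf899c6cb9c377fd0709d9c5"
)
print(cfg.rope_scaling)             # prints the yarn block above if the edit took effect
print(cfg.max_position_embeddings)  # the stock value (the 40960 in the warning),
                                    # which seems to be what swift falls back to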