
max_model_len is not working #4893

Open
@mungg

Description

Describe the bug

I get the warning [WARNING:swift] max_model_len(40960) - num_tokens(118444) < max_tokens(1024). Setting max_tokens: -77484, even though I've set max_model_len to 131072 with rope_scaling.

I set both rope_scaling and max_model_len for inference, and I also added the rope_scaling field to the original model's config.json. At first the model picks up the max_model_len I defined, but it is later overridden, probably because swift is still reading the positional-embedding limit from config.json instead of using the updated rope_scaling.
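If I read the warning right, swift seems to clamp max_tokens against whatever context length it resolved internally, and that length ends up being 40960 rather than the YaRN-extended value. A minimal sketch of that arithmetic, assuming this is roughly how the clamp works (the variable names and logic below are my reconstruction, not ms-swift's actual code):

# Hypothetical reconstruction of the clamp that produces the warning.
max_model_len = 40960    # the limit swift resolves internally
num_tokens = 118444      # prompt length of one long validation sample
max_tokens = 1024        # --max_new_tokens

if max_model_len - num_tokens < max_tokens:
    max_tokens = max_model_len - num_tokens   # 40960 - 118444 = -77484
    print(f"Setting max_tokens: {max_tokens}")

# The negative max_tokens is then handed to vLLM, which rejects it (see the traceback below).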

Script I used:

NPROC_PER_NODE=1 \
CUDA_VISIBLE_DEVICES=0 \
swift infer \
    --model Qwen/Qwen3-8B \
    --temperature 0 \
    --infer_backend vllm \
    --val_dataset /home/yekyung/git/CLIPPER/data/sampled/dev.jsonl \
    --use_hf 1 \
    --rope_scaling 'yarn' \
    --max_length 128000 \
    --max_model_len 131072 \
    --max_new_tokens 1024


Error I got:

[INFO:swift] Loading the model using model_dir: /home/yekyung/.cache/huggingface/hub/models--Qwen--Qwen3-8B/snapshots/9c925d64d72725edaf899c6cb9c377fd0709d9c5
[INFO:swift] default_system: None
[INFO:swift] max_length: 128000
[INFO:swift] response_prefix: ''
[INFO:swift] agent_template: hermes
INFO 07-09 19:56:24 [config.py:841] This model supports multiple tasks: {'classify', 'reward', 'generate', 'embed'}. Defaulting to 'generate'.
INFO 07-09 19:56:24 [config.py:1472] Using max model len 131072
INFO 07-09 19:56:24 [config.py:1988] Disabling V1 multiprocessing for external launcher.
INFO 07-09 19:56:24 [config.py:2285] Chunked prefill is enabled with max_num_batched_tokens=2048.

...... loading the model ......

[INFO:swift] swift.version: 3.6.0.dev0
[INFO:swift] request_config: RequestConfig(max_tokens=1024, temperature=0.0, top_k=None, top_p=None, repetition_penalty=None, num_beams=1, stop=[], seed=None, stream=False, logprobs=False, top_logprobs=None, n=1, best_of=None, presence_penalty=0.0, frequency_penalty=0.0, length_penalty=1.0)
[rank0]:[W709 19:57:05.788547688 ProcessGroupNCCL.cpp:4718] [PG ID 0 PG GUID 0 Rank 0] using GPU 0 as device used by this process is currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. You can specify device_id in init_process_group() to force use of a particular device.
[INFO:swift] val_dataset: Dataset({
features: ['messages'],
num_rows: 100
})
[WARNING:swift] max_model_len(40960) - num_tokens(118444) < max_tokens(1024). Setting max_tokens: -77484
[rank0]: Traceback (most recent call last):
[rank0]: File "/home/yekyung/git/ms-swift/swift/cli/infer.py", line 5, in
[rank0]: infer_main()
[rank0]: File "/home/yekyung/git/ms-swift/swift/llm/infer/infer.py", line 279, in infer_main
[rank0]: return SwiftInfer(args).main()
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/yekyung/git/ms-swift/swift/llm/base.py", line 49, in main
[rank0]: result = self.run()
[rank0]: ^^^^^^^^^^
[rank0]: File "/home/yekyung/git/ms-swift/swift/llm/infer/infer.py", line 91, in run
[rank0]: result = self.infer_dataset()
[rank0]: ^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/yekyung/git/ms-swift/swift/llm/infer/infer.py", line 237, in infer_dataset
[rank0]: result_list += self._batch_infer(shard_dataset, request_config)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/yekyung/git/ms-swift/swift/llm/infer/infer.py", line 266, in _batch_infer
[rank0]: resp_list = self.infer(val_dataset, request_config, template=self.template, use_tqdm=True, **self.infer_kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/yekyung/git/ms-swift/swift/llm/infer/infer_engine/vllm_engine.py", line 446, in infer
[rank0]: generation_config = self._prepare_generation_config(request_config)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/yekyung/git/ms-swift/swift/llm/infer/infer_engine/vllm_engine.py", line 307, in _prepare_generation_config
[rank0]: res = SamplingParams(**kwargs)
[rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]: File "/home/yekyung/vllm/vllm/sampling_params.py", line 375, in post_init
[rank0]: self._verify_args()
[rank0]: File "/home/yekyung/vllm/vllm/sampling_params.py", line 430, in _verify_args
[rank0]: raise ValueError(
[rank0]: ValueError: max_tokens must be at least 1, got -77484.
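The final ValueError is just vLLM's SamplingParams validation rejecting the negative value. A minimal way to see the same check in isolation (assuming direct construction runs the same validation swift hits here):

from vllm import SamplingParams

# Constructing SamplingParams with a non-positive max_tokens triggers the
# same check that fails in the traceback above.
SamplingParams(max_tokens=-77484)
# ValueError: max_tokens must be at least 1, got -77484.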

Your hardware and system info

H100
torch 2.7.1

Additional context

config.json of model

"rope_scaling": {
"rope_type": "yarn",
"factor": 4.0,
"original_max_position_embeddings": 32768
},
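With this rope_scaling, the extended context should be factor * original_max_position_embeddings = 131072, which is exactly what I passed as --max_model_len; the 40960 in the warning instead looks like the stock max_position_embeddings of Qwen3-8B. A small sketch of that arithmetic (the default_len value is my reading of the unmodified config, so treat it as an assumption):

# Expected YaRN-extended context vs. the limit swift's warning reports.
rope_scaling = {
    "rope_type": "yarn",
    "factor": 4.0,
    "original_max_position_embeddings": 32768,
}

extended_len = int(rope_scaling["factor"] * rope_scaling["original_max_position_embeddings"])
print(extended_len)  # 131072 -- the value I passed via --max_model_len

default_len = 40960  # max_position_embeddings in the unmodified Qwen3-8B config (assumption)
print(default_len)   # matches the 40960 in swift's warning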
