[model] support minimax #4610

Merged: 2 commits into modelscope:main on Jun 16, 2025

Conversation

Jintao-Huang (Collaborator) commented on Jun 16, 2025

Using transformers as the inference backend:
GPU memory usage: 8 * 80GiB

import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0,1,2,3,4,5,6,7'

from transformers import QuantoConfig
from swift.llm import PtEngine, RequestConfig, InferRequest

# int8 weight quantization (optimum-quanto) so the model fits into 8 * 80GiB.
quantization_config = QuantoConfig(weights='int8')
messages = [{
    'role': 'system',
    'content': 'You are a helpful assistant.'
}, {
    'role': 'user',
    'content': 'who are you?'
}]
engine = PtEngine('MiniMax/MiniMax-M1-40k', quantization_config=quantization_config)
infer_request = InferRequest(messages=messages)
request_config = RequestConfig(max_tokens=128, temperature=0)
resp = engine.infer([infer_request], request_config=request_config)
response = resp[0].choices[0].message.content
print(f'response: {response}')

"""
<think>
Okay, the user asked "who are you?" I need to respond in a way that's helpful and clear. Let me start by introducing myself as an AI assistant. I should mention that I'm here to help with information, answer questions, and assist with tasks. Maybe keep it friendly and open-ended so they know they can ask for more details if needed. Let me make sure the response is concise but informative.
</think>

I'm an AI assistant designed to help with information, answer questions, and assist with various tasks. Feel free to ask me anything, and I'll do my best to help! 😊
"""

Using vLLM as the inference backend:
GPU memory usage: 8 * 80GiB
Note: you need to manually edit config.json, changing config['architectures'] = ["MiniMaxM1ForCausalLM"] to config['architectures'] = ["MiniMaxText01ForCausalLM"]. A scripted version of this edit is sketched below.
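
A minimal sketch of that edit, assuming the checkpoint has already been downloaded locally (the path below is hypothetical):

import json

config_path = '/path/to/MiniMax-M1-40k/config.json'  # hypothetical local checkpoint path
with open(config_path, 'r', encoding='utf-8') as f:
    config = json.load(f)
# Swap in the architecture name that vLLM recognizes.
config['architectures'] = ['MiniMaxText01ForCausalLM']
with open(config_path, 'w', encoding='utf-8') as f:
    json.dump(config, f, ensure_ascii=False, indent=2)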

import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0,1,2,3,4,5,6,7'
# Fall back to vLLM's V0 engine.
os.environ['VLLM_USE_V1'] = '0'

# The main guard is needed because vLLM spawns worker processes for tensor parallelism.
if __name__ == '__main__':
    from swift.llm import VllmEngine, RequestConfig, InferRequest

    messages = [{
        'role': 'system',
        'content': 'You are a helpful assistant.'
    }, {
        'role': 'user',
        'content': 'who are you?'
    }]
    engine = VllmEngine(
        'MiniMax/MiniMax-M1-40k',
        tensor_parallel_size=8,
        quantization='experts_int8',
        max_model_len=4096,
        enforce_eager=True)
    infer_request = InferRequest(messages=messages)
    request_config = RequestConfig(max_tokens=128, temperature=0)
    resp = engine.infer([infer_request], request_config=request_config)
    response = resp[0].choices[0].message.content
    print(f'response: {response}')

Using the command line:

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
swift infer \
    --model MiniMax/MiniMax-M1-40k \
    --tensor_parallel_size 8 \
    --vllm_quantization experts_int8 \
    --max_model_len 4096 \
    --enforce_eager true \
    --infer_backend vllm \
    --stream true \
    --max_new_tokens 2048 \
    --temperature 0.7
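
For serving rather than interactive inference, the same flags should carry over to swift deploy, which exposes an OpenAI-compatible endpoint. The client sketch below is an assumption layered on top of this PR; the base URL, port, and served model name are placeholders, not verified here:

from openai import OpenAI

# Hypothetical client for a locally deployed endpoint; base_url and the
# model name are assumptions, not part of this PR.
client = OpenAI(api_key='EMPTY', base_url='http://127.0.0.1:8000/v1')
resp = client.chat.completions.create(
    model='MiniMax-M1-40k',
    messages=[{'role': 'user', 'content': 'who are you?'}],
    max_tokens=128,
    temperature=0,
)
print(resp.choices[0].message.content)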

GPU memory usage: [screenshot]

Jintao-Huang merged commit 7c5be95 into modelscope:main on Jun 16, 2025
1 of 2 checks passed
Jintao-Huang added a commit that referenced this pull request Jun 18, 2025