[model] support minimax #4610

Merged: 2 commits into modelscope:main on Jun 16, 2025

Conversation

Jintao-Huang (Collaborator) commented on Jun 16, 2025

Using transformers as the inference backend:
GPU memory usage: 8 * 80GiB

import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0,1,2,3,4,5,6,7'

from transformers import QuantoConfig
from swift.llm import PtEngine, RequestConfig, InferRequest

# int8 weight quantization (optimum-quanto) so the model fits into 8 * 80GiB.
quantization_config = QuantoConfig(weights='int8')
messages = [{
    'role': 'system',
    'content': 'You are a helpful assistant.'
}, {
    'role': 'user',
    'content': 'who are you?'
}]
engine = PtEngine('MiniMax/MiniMax-M1-40k', quantization_config=quantization_config)
infer_request = InferRequest(messages=messages)
request_config = RequestConfig(max_tokens=128, temperature=0)
resp = engine.infer([infer_request], request_config=request_config)
response = resp[0].choices[0].message.content
print(f'response: {response}')

"""
<think>
Okay, the user asked "who are you?" I need to respond in a way that's helpful and clear. Let me start by introducing myself as an AI assistant. I should mention that I'm here to help with information, answer questions, and assist with tasks. Maybe keep it friendly and open-ended so they know they can ask for more details if needed. Let me make sure the response is concise but informative.
</think>

I'm an AI assistant designed to help with information, answer questions, and assist with various tasks. Feel free to ask me anything, and I'll do my best to help! 😊
"""

Using vLLM as the inference backend:
GPU memory usage: 8 * 80GiB
Note: you need to manually edit config.json, changing config['architectures'] = ["MiniMaxM1ForCausalLM"] to config['architectures'] = ["MiniMaxText01ForCausalLM"]. A scripted version of this edit is sketched below.
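
A minimal sketch of that edit, assuming the checkpoint has already been downloaded locally (the path below is hypothetical):

import json

config_path = '/path/to/MiniMax-M1-40k/config.json'  # hypothetical local checkpoint path
with open(config_path, 'r', encoding='utf-8') as f:
    config = json.load(f)
# Swap in the architecture name that vLLM recognizes.
config['architectures'] = ['MiniMaxText01ForCausalLM']
with open(config_path, 'w', encoding='utf-8') as f:
    json.dump(config, f, ensure_ascii=False, indent=2)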

import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0,1,2,3,4,5,6,7'
# Fall back to vLLM's V0 engine.
os.environ['VLLM_USE_V1'] = '0'

# The main guard is needed because vLLM spawns worker processes for tensor parallelism.
if __name__ == '__main__':
    from swift.llm import VllmEngine, RequestConfig, InferRequest

    messages = [{
        'role': 'system',
        'content': 'You are a helpful assistant.'
    }, {
        'role': 'user',
        'content': 'who are you?'
    }]
    engine = VllmEngine(
        'MiniMax/MiniMax-M1-40k',
        tensor_parallel_size=8,
        quantization='experts_int8',
        max_model_len=4096,
        enforce_eager=True)
    infer_request = InferRequest(messages=messages)
    request_config = RequestConfig(max_tokens=128, temperature=0)
    resp = engine.infer([infer_request], request_config=request_config)
    response = resp[0].choices[0].message.content
    print(f'response: {response}')

Using the command line:

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
swift infer \
    --model MiniMax/MiniMax-M1-40k \
    --tensor_parallel_size 8 \
    --vllm_quantization experts_int8 \
    --max_model_len 4096 \
    --enforce_eager true \
    --infer_backend vllm \
    --stream true \
    --max_new_tokens 2048 \
    --temperature 0.7
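
For serving rather than interactive inference, the same flags should carry over to swift deploy, which exposes an OpenAI-compatible endpoint. The client sketch below is an assumption layered on top of this PR; the base URL, port, and served model name are placeholders, not verified here:

from openai import OpenAI

# Hypothetical client for a locally deployed endpoint; base_url and the
# model name are assumptions, not part of this PR.
client = OpenAI(api_key='EMPTY', base_url='http://127.0.0.1:8000/v1')
resp = client.chat.completions.create(
    model='MiniMax-M1-40k',
    messages=[{'role': 'user', 'content': 'who are you?'}],
    max_tokens=128,
    temperature=0,
)
print(resp.choices[0].message.content)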

GPU memory usage: [screenshot]

Jintao-Huang merged commit 7c5be95 into modelscope:main on Jun 16, 2025
1 of 2 checks passed
Jintao-Huang added a commit that referenced this pull request Jun 18, 2025