
Feat: add Kwai-Keye transformers #39292

Open · wants to merge 13 commits into main

Conversation

@Kwai-Keye commented on Jul 9, 2025

Add support for Kwai-Keye/Keye-VL-8B-Preview model

Description

This pull request adds support for the Keye-VL-8B-Preview model developed by Kwai-Keye. Keye-VL-8B-Preview is a vision-language model with strong performance on video understanding, visual perception, and reasoning tasks.

The model repository can be found at https://huggingface.co/Kwai-Keye/Keye-VL-8B-Preview.

Key Changes

  1. Added model configuration files for Keye-VL-8B-Preview

  2. Implemented model architecture code based on the official specifications

  3. Added tokenizer support for the model's specific tokenization requirements (a standalone loading sketch follows this list)

  4. Included example usage scripts in the documentation
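As a quick sanity check for item 3, the tokenizer can be loaded on its own via AutoTokenizer. This snippet is an illustration only and is not part of this PR's code:

from transformers import AutoTokenizer

# Load only the tokenizer; the full processor (tokenizer + image processor)
# is shown in the usage example below.
tokenizer = AutoTokenizer.from_pretrained("Kwai-Keye/Keye-VL-8B-Preview", trust_remote_code=True)
print(tokenizer("Describe this image.")["input_ids"])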

Model Architecture

The model consists of two components (a configuration sketch follows the list):

  • A SigLIP vision encoder for processing image/video inputs
  • A Qwen3 decoder for language understanding and generation
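
To see how the two towers are composed, the snippet below inspects the checkpoint's nested configuration. The vision_config/text_config field names are assumptions based on how other transformers vision-language models nest their configs, not something this PR confirms:

from transformers import AutoConfig

config = AutoConfig.from_pretrained("Kwai-Keye/Keye-VL-8B-Preview", trust_remote_code=True)
# Field names below are assumed from the usual transformers VLM layout.
print(type(config.vision_config).__name__)  # SigLIP-style vision encoder config
print(type(config.text_config).__name__)    # Qwen3-style decoder config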

Usage Example

import torch
from transformers import KeyeForConditionalGeneration, AutoProcessor
from PIL import Image
import requests

# Load the model in half precision with SDPA attention, sharded across available devices.
model = KeyeForConditionalGeneration.from_pretrained(
    "Kwai-Keye/Keye-VL-8B-Preview",
    torch_dtype=torch.float16,
    device_map="auto",
    attn_implementation="sdpa",
    trust_remote_code=True
)
processor = AutoProcessor.from_pretrained("Kwai-Keye/Keye-VL-8B-Preview", trust_remote_code=True)

# Example image URL and a chat-style message that references it.
url = "https://s1-11508.kwimgs.com/kos/nlav11508/mllm_all/ziran_jiafeimao_11.jpg"
messages = [
    {
        "role":"user",
        "content":[
            {
                "type":"image",
                "image": url,
            },
            {
                "type":"text",
                "text":"Describe this image."
            }
        ]
    }
]

# Download the image and wrap it in a list to match the single image message.
image_inputs = [Image.open(requests.get(url, stream=True).raw)]

# Render the chat messages into the model's prompt format.
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

inputs = processor(
    text=[text],
    images=image_inputs,
    videos=None,
    padding=True,
    return_tensors="pt",
).to(model.device)
generated_ids = model.generate(**inputs, max_new_tokens=128)
# Strip the prompt tokens so only the newly generated text is decoded.
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
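
For video inputs, a hedged variant of the call above passes frames through the processor's videos argument. The frame format shown (a list of PIL frames per video) mirrors other transformers video-language processors and is an assumption, not verified against this PR; frame_0.jpg and so on are placeholder files:

# Hypothetical video variant: the accepted frame format is assumed from
# other transformers video-language processors, not confirmed by this PR.
frames = [Image.open(f"frame_{i}.jpg") for i in range(8)]  # placeholder frames
video_messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "video": frames},
            {"type": "text", "text": "Describe this video."},
        ],
    }
]
video_text = processor.apply_chat_template(
    video_messages, tokenize=False, add_generation_prompt=True
)
video_inputs = processor(
    text=[video_text], videos=[frames], padding=True, return_tensors="pt"
).to(model.device)
video_ids = model.generate(**video_inputs, max_new_tokens=128)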

Checklist

  • Model code is properly formatted and follows transformers coding guidelines
  • Documentation is updated with usage examples
  • All new and existing tests pass locally with the changes

We believe that integrating Keye-VL-8B-Preview into the transformers library will provide users with another powerful option for vision-language tasks. We welcome any feedback or suggestions for improving this integration.

github-actions bot commented on Jul 9, 2025

[For maintainers] Suggested jobs to run (before merge)

run-slow: auto
