CogVLM
A state-of-the-art open visual language model
..., web demo, and OpenAI-Vision–style APIs), along with quantization options that reduce VRAM needs (e.g., 4-bit). It includes checkpoints for chat, base, and grounding variants, plus recipes for model-parallel inference and LoRA fine-tuning. The documentation covers task prompts for general dialogue, visual grounding (box→caption, caption→box, caption+boxes), and GUI agent workflows that produce structured actions with bounding boxes.