
SlideChat: A Large Vision-Language Assistant for Whole-Slide Pathology Image Understanding (CVPR2025)

🍎 Homepage | 🤗 Dataset | 🤗 Model | 📖 Paper

Abstract: Despite the progress made by multimodal large language models (MLLMs) in computational pathology, they remain limited by a predominant focus on patch-level analysis, missing essential contextual information at the whole-slide level. The lack of large-scale instruction datasets and the gigapixel scale of whole slide images (WSIs) pose significant developmental challenges. In this paper, we present SlideChat, the first vision-language assistant capable of understanding gigapixel whole-slide images, exhibiting excellent multimodal conversational capability and responding to complex instructions across diverse pathology scenarios. To support its development, we created SlideInstruction, the largest instruction-following dataset for WSIs, consisting of 4.2K WSI captions and 176K VQA pairs with multiple categories. Furthermore, we propose SlideBench, a multimodal benchmark that incorporates captioning and VQA tasks to assess SlideChat's capabilities in varied clinical settings such as microscopy and diagnosis. Compared to both general and specialized MLLMs, SlideChat exhibits exceptional capabilities, achieving state-of-the-art performance on 18 of 22 tasks.


Update

  • 🚀[2025-03-18]: We have released SlideInstruction, SlideBench and SlideChat!
  • 🚀[2025-02-27]: Accepted by CVPR2025!🌟

Release

We release SlideChat, SlideInstruction, and SlideBench as open-source resources, hoping to facilitate research and development in computational pathology.

  • SlideChat: The first large vision-language assistant for whole-slide pathology image analysis, capable of generating comprehensive descriptions and contextually relevant responses.
  • SlideInstruction: The largest comprehensive WSI instruction-following dataset, derived from pathology reports.
  • SlideBench: A WSI multimodal benchmark including SlideBench-Caption, SlideBench-VQA-TCGA, and SlideBench-VQA-BCNB. Before open-sourcing, SlideBench-VQA-TCGA underwent a second round of expert review with pathologists, further enhancing data quality. The initial version covers 10 cancer types with 1,494 samples (SlideBench-VQA-TCGA.csv). We later expanded it to 31 additional cancer types, rigorously validated by experts, yielding 3,176 samples (SlideBench-VQA-TCGA-plus.csv). SlideChat achieved an accuracy of 75.23% on the initial version and 74.12% on the expanded version.

Usage

This project is built upon Xtuner. To get started:

conda create --name xtuner-env python=3.10 -y
conda activate xtuner-env
git clone https://github.com/uni-medical/SlideChat.git
cd SlideChat
pip install -e .

Pre-requisites

Download the JSON file containing WSI IDs (TCGA) and conversation data from the Dataset. The input image file is in CSV format and contains 512-dimensional feature representations for all patches within the WSI. Example files are provided in the ./dataset/ folder. For slide downloading and processing, please refer to CLAM and DSMIL.
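As a quick sanity check before training, the snippet below loads one feature CSV and the conversation JSON and verifies the expected shapes. This is a minimal sketch based on the description above: the file names are hypothetical, and the exact CSV column layout and JSON schema may differ from what is assumed here.

import json
import numpy as np
import pandas as pd

# Load the patch-feature CSV for one WSI (assumed: one row per patch and
# 512 feature columns; adjust if the CSV also stores patch IDs or coordinates).
features = pd.read_csv("./dataset/example_slide.csv")  # hypothetical example path
patch_embeddings = features.to_numpy(dtype=np.float32)
print(f"{patch_embeddings.shape[0]} patches, {patch_embeddings.shape[1]} dims")
assert patch_embeddings.shape[1] == 512, "expected 512-dim patch features"

# Load the conversation JSON (assumed: a list of records keyed by WSI ID
# with the associated instruction/response turns).
with open("./dataset/example_conversations.json") as f:
    conversations = json.load(f)
print(f"{len(conversations)} conversation records")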

Training

SlideChat serializes each input WSI into a sequence of patches, converting each patch into a visual embedding with the patch-level encoder CONCH. A slide-level encoder then interacts with these features to generate contextual embeddings, and a multimodal projector maps the slide-level features into a unified space aligned with the LLM. SlideChat is trained in two stages: (1) Cross-Domain Alignment: SlideChat is trained to generate descriptive captions using 4.2K WSI-caption pairs from SlideInstruction; only the slide-level encoder and projector are updated, while the patch-level encoder and LLM weights remain fixed. (2) Visual Instruction Learning: we use 176K WSI VQA pairs from SlideInstruction, making the slide encoder, projector, and LLM fully trainable to ensure comprehensive adaptability.
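The pipeline and the two-stage freezing scheme can be pictured with the PyTorch-style sketch below. This is an illustrative sketch, not the released implementation: the class name, module arguments, and the assumption that patch features arrive pre-extracted from the per-slide CSVs are simplifications for clarity.

import torch
import torch.nn as nn

class SlideChatSketch(nn.Module):
    """Illustrative sketch of the SlideChat pipeline (not the official code)."""

    def __init__(self, slide_encoder: nn.Module, projector: nn.Module, llm: nn.Module):
        super().__init__()
        self.slide_encoder = slide_encoder  # slide-level encoder over patch tokens
        self.projector = projector          # maps slide features into the LLM space
        self.llm = llm                      # causal language model

    def set_stage(self, stage: int):
        # Stage 1 (cross-domain alignment): only slide encoder + projector are trained.
        # Stage 2 (visual instruction learning): the LLM is unfrozen as well.
        for p in self.slide_encoder.parameters():
            p.requires_grad = True
        for p in self.projector.parameters():
            p.requires_grad = True
        for p in self.llm.parameters():
            p.requires_grad = (stage == 2)

    def forward(self, patch_features, text_embeds):
        # patch_features: (num_patches, 512) pre-extracted CONCH embeddings for one WSI
        slide_tokens = self.slide_encoder(patch_features)   # contextual slide tokens
        visual_embeds = self.projector(slide_tokens)        # projected into the LLM space
        # Visual tokens are prepended to the text embeddings before the LLM;
        # prompt templating and loss masking are omitted for brevity.
        inputs = torch.cat([visual_embeds, text_embeds], dim=0).unsqueeze(0)
        return self.llm(inputs_embeds=inputs)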


Config files are in configs/.

NPROC_PER_NODE=${GPU_NUM} xtuner train \
 <your config file path>  \
  --deepspeed <deepspeed config file path> \
  --work-dir <workdir path>

# stage1 example
NPROC_PER_NODE=${GPU_NUM} xtuner train \
 configs/slidechat/stage_1.py \
  --deepspeed configs/deepspeed/deepspeed_zero2.json \
  --work-dir work_dirs/stage1

# stage2 example
NPROC_PER_NODE=${GPU_NUM} xtuner train \
 configs/slidechat/stage_2.py \
  --deepspeed configs/deepspeed/deepspeed_zero2.json \
  --work-dir work_dirs/stage2

Here, ${GPU_NUM} is the number of GPUs to use per node.

For a detailed explanation of the configuration file, please refer here.

  • llm_name_or_path: The Hugging Face path or identifier of the LLM, e.g., internlm/internlm2-chat-7b or Qwen/Qwen2.5-7B-Instruct.
  • data_path: Training data (.json) path.
  • evaluation_images: Evaluation data path.

LLAVAModel Hyperparameters:

  • freeze_llm: Freeze the parameters of the LLM.
  • pretrained_pth: For stage 2 training, the checkpoint file produced by stage 1; otherwise set to None.
  • train_stage: The training phase, either '1' or '2' (an illustrative config excerpt follows this list).
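As an illustration, a stage-1 configuration might set these fields roughly as follows. This is a hypothetical excerpt for orientation only; the import path and file names are assumptions, so consult the actual files in configs/slidechat/ for the authoritative values and the full set of options.

# Hypothetical excerpt; see configs/slidechat/stage_1.py for the real config.
from xtuner.model import LLaVAModel  # import path is an assumption

llm_name_or_path = 'Qwen/Qwen2.5-7B-Instruct'    # Hugging Face LLM path
data_path = 'dataset/stage1_train.json'          # training data (.json); assumed file name
evaluation_images = 'dataset/example_slide.csv'  # evaluation data path; assumed file name

model = dict(
    type=LLaVAModel,
    freeze_llm=True,      # stage 1: LLM weights stay fixed
    pretrained_pth=None,  # set to the stage-1 checkpoint when train_stage='2'
    train_stage='1',      # '1' for cross-domain alignment, '2' for instruction tuning
    # remaining fields (encoders, projector, tokenizer, ...) omitted
)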

Inference

xtuner test <your config file path> \
--checkpoint <your checkpoint path> \
--test_slide_csv  <your test file> \
--test_output_csv <result file> \
--local_rank 0

# example
xtuner test configs/slidechat/stage_2.py \
--checkpoint stage2_pth \
--test_slide_csv  SlideBench-VQA(TCGA).csv \
--test_output_csv output_my_test.csv \
--local_rank 0
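To score the resulting predictions against the SlideBench-VQA ground truth, something along the following lines can be used. The column names (prediction, answer) are assumptions for illustration; adjust them to match the headers of the actual output and benchmark CSVs.

import pandas as pd

# Hypothetical column names; adapt to the real CSV headers.
results = pd.read_csv("output_my_test.csv")
correct = (results["prediction"].str.strip().str.upper()
           == results["answer"].str.strip().str.upper())
print(f"Accuracy: {correct.mean():.2%} on {len(results)} questions")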

Contact

Citation

BibTeX:

@article{chen2024slidechat,
  title={SlideChat: A Large Vision-Language Assistant for Whole-Slide Pathology Image Understanding},
  author={Chen, Ying and Wang, Guoan and Ji, Yuanfeng and Li, Yanjun and Ye, Jin and Li, Tianbin and Hu, Ming and Yu, Rongshan and Qiao, Yu and He, Junjun},
  journal={arXiv preprint arXiv:2410.11761},
  year={2024}
}
