Tahoe-100M CVAE for Drug Response Prediction

Overview

This project contains code and Jupyter notebooks to train and evaluate a Conditional Variational Autoencoder (CVAE) on a subset of the Tahoe-100M dataset. The goal is to model single-cell gene expression changes induced by drug perturbations, enabling better understanding of transcriptional response to treatment.

We work with a 1M-cell subset of the Tahoe-100M dataset: we explore and preprocess the data, define and train a CVAE model, and analyze the results both interactively (in notebooks/) and through standalone scripts (in src/).

Project Structure

.
├── data/                # Raw and processed data
├── models/              # Saved trained models
├── notebooks/           # Jupyter notebooks for exploration
│   ├── loading_data.ipynb
│   └── model_dev.ipynb
├── src/                 # Source code for production usage
│   ├── data_utils.py    # Data loading and preprocessing
│   ├── model.py         # CVAE model architecture
│   ├── train.py         # Training script
│   └── predict.py       # Prediction script
├── requirements.txt     # Python dependencies
├── environment.yml      # Conda environment
└── README.md

Notebooks

The notebooks/ folder contains two primary notebooks:

1. `loading_data.ipynb`

Purpose: Data preprocessing and preparation
Main tasks:

Import required libraries
Load raw data from the Tahoe-100M dataset
Clean and preprocess expression matrices and metadata
Feature engineering and data validation
Save processed data to disk for training
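
As an illustration of the "clean and preprocess expression matrices" step, here is a minimal sketch of library-size normalization followed by a log transform, a common choice for single-cell counts. It is an assumption that the notebook does something similar; the function name `normalize_log1p`, the `target_sum` of 10,000, and the toy counts are all hypothetical. It uses only the standard library so the arithmetic stays explicit.

```python
import math


def normalize_log1p(counts, target_sum=10_000.0):
    """Scale each cell (row) to `target_sum` total counts, then apply log(1 + x)."""
    normalized = []
    for cell in counts:
        total = sum(cell)
        if total == 0:
            # Empty cell: keep zeros rather than dividing by zero.
            normalized.append([0.0] * len(cell))
            continue
        scale = target_sum / total
        normalized.append([math.log1p(value * scale) for value in cell])
    return normalized


# Toy matrix: 3 cells x 4 genes of raw counts (made-up values, not Tahoe-100M data).
raw_counts = [
    [0, 2, 8, 0],
    [5, 5, 0, 0],
    [0, 0, 0, 0],
]
processed = normalize_log1p(raw_counts)
```

On real data the same transform is usually applied to sparse matrices, for example with scanpy's `sc.pp.normalize_total` and `sc.pp.log1p`, rather than with Python lists.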

2. `model_dev.ipynb`

Purpose: Model development, training, and evaluation
Main tasks:

Load preprocessed data
Define CVAE model architecture
Implement training loop and evaluation metrics
Visualize training loss and prediction accuracy
Analyze predictions on a hold-out set
Save best-performing models to models/ directory
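
To make the "conditional" part and the training objective concrete, the sketch below shows how a drug label could be attached to the model input and how a typical VAE-style loss is assembled. This is not the code in `model_dev.ipynb` or `src/model.py`: the one-hot conditioning, the squared-error reconstruction term, the `beta` weight, and the drug labels are assumptions made for illustration, and the repository's model may use a different likelihood.

```python
import math


def one_hot(label, vocabulary):
    """Encode a categorical condition (e.g. a drug name) as a one-hot vector."""
    return [1.0 if item == label else 0.0 for item in vocabulary]


def conditioned_input(expression, drug, drug_vocabulary):
    """Concatenate expression values with the one-hot drug label."""
    return list(expression) + one_hot(drug, drug_vocabulary)


def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, exp(logvar)) || N(0, I) ) for a diagonal Gaussian."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, logvar))


def cvae_loss(x, x_reconstructed, mu, logvar, beta=1.0):
    """Squared reconstruction error plus a beta-weighted KL term."""
    reconstruction = sum((a - b) ** 2 for a, b in zip(x, x_reconstructed))
    return reconstruction + beta * kl_to_standard_normal(mu, logvar)


drugs = ["DMSO", "drug_A", "drug_B"]  # hypothetical condition labels
encoder_input = conditioned_input([0.5, 1.2], "drug_A", drugs)
loss = cvae_loss(
    x=[0.5, 1.2],
    x_reconstructed=[0.4, 1.0],
    mu=[0.0, 0.0],
    logvar=[0.0, 0.0],
)
```

In a real model the encoder would produce `mu` and `logvar` from `encoder_input`, and the decoder would receive the sampled latent vector together with the same drug encoding.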

Production Code

The src/ directory contains modular Python scripts for reproducible and scalable training and inference.

Training

To train the CVAE model from preprocessed data:

python src/train.py

Prediction

To use a trained model for predictions:

python src/predict.py

Setup & Installation

Requirements

Python 3.8 or higher
Jupyter Notebook or JupyterLab (optional)
Git

Installation

Using pip:

pip install -r requirements.txt

Using conda:

conda env create -f environment.yml
conda activate tahoe-cvae

Dataset

We use a subset (~1M cells) from the Tahoe-100M dataset, which includes:

Single-cell gene expression matrices
Drug perturbation metadata
Cell line and sample metadata

Download and extract the data into the data/ directory before running preprocessing scripts or training.
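
A small pre-flight check can catch a missing or empty data/ directory before a long run starts. The helper below is a hypothetical sketch, not part of src/; the function name `list_data_files` is an assumption, and it makes no assumption about file names or formats.

```python
from pathlib import Path


def list_data_files(data_dir="data"):
    """Return all files under `data_dir`, raising if it is missing or empty."""
    root = Path(data_dir)
    if not root.is_dir():
        raise FileNotFoundError(f"Data directory not found: {root}")
    # Search recursively so extracted sub-folders are included.
    files = sorted(p for p in root.rglob("*") if p.is_file())
    if not files:
        raise FileNotFoundError(f"No data files found under: {root}")
    return files
```

Calling `list_data_files("data")` from the repository root before preprocessing or training would fail early with a clear message if the download step was skipped.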

Usage

Interactive Workflow

1. Run `notebooks/loading_data.ipynb` to preprocess and store data
2. Run `notebooks/model_dev.ipynb` to train and evaluate the model

Script Workflow

Train model:
```
python src/train.py
```
Predict using trained model:
```
python src/predict.py
```

Contributing

We welcome contributions! Here's how to get started:

Fork the repository
Create your feature branch:
```
git checkout -b feature/AmazingFeature
```
Commit your changes:
```
git commit -m 'Add some AmazingFeature'
```
Push to your branch:
```
git push origin feature/AmazingFeature
```
Open a Pull Request

License

This project is licensed under the GNU General Public License. See the LICENSE file for more details.

Acknowledgments

Tahoe-100M Dataset
Inspired by the CVAE architecture from scVI and scGen
Thanks to all contributors and open-source maintainers

