This project contains code and Jupyter notebooks to train and evaluate a Conditional Variational Autoencoder (CVAE) on a subset of the Tahoe-100M dataset. The goal is to model the single-cell gene expression changes induced by drug perturbations, in order to better understand transcriptional responses to treatment.
We explore the Tahoe-100M dataset (focusing on a 1M-cell subset), preprocess the data, define and train a CVAE model, and analyze the results both interactively in notebooks and via production-ready scripts.
.
├── data/ # Raw and processed data
├── models/ # Saved trained models
├── notebooks/ # Jupyter notebooks for exploration
│ ├── loading_data.ipynb
│ └── model_dev.ipynb
├── src/ # Source code for production usage
│ ├── data_utils.py # Data loading and preprocessing
│ ├── model.py # CVAE model architecture
│ ├── train.py # Training script
│ └── predict.py # Prediction script
├── requirements.txt # Python dependencies
├── environment.yaml # Conda environment
└── README.md
The notebooks/ folder contains two primary notebooks:
loading_data.ipynb
Purpose: Data preprocessing and preparation
Main tasks:
- Import required libraries
- Load raw data from the Tahoe-100M dataset
- Clean and preprocess expression matrices and metadata
- Feature engineering and data validation
- Save processed data to disk for training
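The cleaning and validation steps listed above can be sketched as follows. This is only an illustration under stated assumptions, not the notebook's actual code: the function names, the toy matrix, the drug labels, and the `target_sum` of 10,000 counts per cell are all hypothetical, and a real pipeline would typically work on sparse arrays rather than Python lists.

```python
import math


def drop_empty_cells(counts, labels):
    """Validation step: remove cells with zero total counts, keeping labels aligned."""
    kept = [(row, label) for row, label in zip(counts, labels) if sum(row) > 0]
    return [row for row, _ in kept], [label for _, label in kept]


def normalize_log1p(counts, target_sum=1e4):
    """Scale each cell (row) to `target_sum` total counts, then apply log(1 + x).

    `target_sum` is an assumed value, not one taken from this project.
    """
    normalized = []
    for row in counts:
        total = sum(row)
        scale = target_sum / total if total > 0 else 0.0
        normalized.append([math.log1p(value * scale) for value in row])
    return normalized


# Toy example: 3 cells x 4 genes with hypothetical drug labels.
raw_counts = [[0, 2, 8, 0], [5, 5, 0, 0], [0, 0, 0, 0]]
drug_labels = ["drug_a", "control", "drug_a"]

counts, labels = drop_empty_cells(raw_counts, drug_labels)
expression = normalize_log1p(counts)
print(len(expression), labels)
```

Filtering before normalizing avoids dividing by a zero library size, and keeping the labels aligned with the filtered rows matters because the metadata is later used as the conditioning signal.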
model_dev.ipynb
Purpose: Model development, training, and evaluation
Main tasks:
- Load preprocessed data
- Define CVAE model architecture
- Implement training loop and evaluation metrics
- Visualize training loss and prediction accuracy
- Analyze predictions on a hold-out set
- Save best-performing models to the models/ directory
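To make the training objective concrete, here is a minimal sketch of a typical CVAE loss: a reconstruction term plus the KL divergence between the approximate posterior and a standard normal prior. It assumes a mean-squared-error reconstruction term and a `beta` weight, neither of which is confirmed by this project; models such as scVI use count-based likelihoods instead. All function names are hypothetical.

```python
import math


def reconstruction_mse(x, x_hat):
    """Mean squared error between an input profile and its reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)


def kl_to_standard_normal(mu, logvar):
    """KL(N(mu, exp(logvar)) || N(0, I)), summed over latent dimensions."""
    return -0.5 * sum(
        1.0 + lv - m ** 2 - math.exp(lv) for m, lv in zip(mu, logvar)
    )


def cvae_loss(x, x_hat, mu, logvar, beta=1.0):
    """Negative ELBO under the assumptions above; `beta` weights the KL term."""
    return reconstruction_mse(x, x_hat) + beta * kl_to_standard_normal(mu, logvar)


# One cell with two genes and a one-dimensional latent code.
loss = cvae_loss(x=[1.0, 3.0], x_hat=[1.0, 1.0], mu=[1.0], logvar=[0.0])
print(loss)
```

In a conditional VAE the encoder and decoder both receive the condition (for example the drug label) as extra input; the loss itself has the same form as for an unconditional VAE.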
The src/ directory contains modular Python scripts for reproducible and scalable training and inference.
To train the CVAE model from preprocessed data:
python src/train.py
To use a trained model for predictions:
python src/predict.py
- Python 3.8 or higher
- Jupyter Notebook or JupyterLab (optional)
- Git
Using pip:
pip install -r requirements.txt
Using conda:
conda env create -f environment.yaml
conda activate tahoe-cvae
We use a subset (~1M cells) from the Tahoe-100M dataset, which includes:
- Single-cell gene expression matrices
- Drug perturbation metadata
- Cell line and sample metadata
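One common way to feed perturbation and cell line metadata to a CVAE is to one-hot encode each label and concatenate it with the expression vector. The sketch below illustrates that idea only; the label values, function names, and the choice of one-hot encoding are assumptions and may differ from what src/data_utils.py does.

```python
def build_vocabulary(labels):
    """Map each distinct label to an integer index (sorted for reproducibility)."""
    return {label: index for index, label in enumerate(sorted(set(labels)))}


def one_hot(label, vocabulary):
    """Return a one-hot vector for `label` under the given vocabulary."""
    vector = [0.0] * len(vocabulary)
    vector[vocabulary[label]] = 1.0
    return vector


def conditioned_input(expression_row, drug, cell_line, drug_vocab, cell_line_vocab):
    """Concatenate expression values with one-hot drug and cell line vectors."""
    return (
        list(expression_row)
        + one_hot(drug, drug_vocab)
        + one_hot(cell_line, cell_line_vocab)
    )


# Hypothetical metadata for three cells.
drugs = ["drug_a", "control", "drug_a"]
cell_lines = ["line_1", "line_1", "line_2"]

drug_vocab = build_vocabulary(drugs)
cell_line_vocab = build_vocabulary(cell_lines)
row = conditioned_input([0.5, 1.2], "drug_a", "line_2", drug_vocab, cell_line_vocab)
print(row)
```

Sorting the labels before indexing keeps the encoding stable across runs, which matters when a saved model is later loaded for prediction.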
Download and extract the data into the data/ directory before running preprocessing scripts or training.
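Before launching preprocessing or training, it can help to confirm that the data is actually in place. The helper below is a hypothetical pre-flight check, not part of this repository; it makes no assumption about file names or formats and only verifies that the directory exists and is non-empty.

```python
from pathlib import Path


def list_data_files(data_dir="data"):
    """Return all files under `data_dir`, or raise if the directory is missing or empty."""
    root = Path(data_dir)
    if not root.is_dir():
        raise FileNotFoundError(f"Data directory not found: {root}")
    # Search recursively so that nested extraction folders are included.
    files = sorted(path for path in root.rglob("*") if path.is_file())
    if not files:
        raise FileNotFoundError(
            f"No files found under {root}; download and extract the dataset first."
        )
    return files
```

Calling `list_data_files("data")` from the repository root returns the sorted file paths, or fails early with a clear message instead of a less obvious error later in the pipeline.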
- Run notebooks/loading_data.ipynb to preprocess and store data
- Run notebooks/model_dev.ipynb to train and evaluate the model
- Train model:
  python src/train.py
- Predict using trained model:
  python src/predict.py
We welcome contributions! Here's how to get started:
- Fork the repository
- Create your feature branch:
git checkout -b feature/AmazingFeature
- Commit your changes:
git commit -m 'Add some AmazingFeature'
- Push to your branch:
git push origin feature/AmazingFeature
- Open a Pull Request
This project is licensed under the GNU General Public License. See the LICENSE file for more details.
- Tahoe-100M Dataset
- Inspired by the CVAE architecture from scVI and scGen
- Thanks to all contributors and open-source maintainers