
MIMIC-III Multimodal and Multitask

Quick start

  1. Process the data (instructions below)
  2. Define your own models under src/model/multi_modality_model_hy.py
    1. Define an encoder for each modality (Time Series, Text, and Tabular), each inheriting from ModalityEncoder
    2. Define your own MultiModalEncoder using those encoders.
    3. Define the task-specific components that map from the multimodal encoding to output classes. These should inherit from TaskSpecificComponent.
    4. The above will all be instantiated and used in a MultiModalMultiTaskWrapper in the training script.
  3. Instantiate your models and train! An example is given under src/experiments/example_trainer.py
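The steps above can be sketched as follows. The class names (ModalityEncoder, MultiModalEncoder, TaskSpecificComponent, MultiModalMultiTaskWrapper) follow the repository; the method names, signatures, and toy encoders here are illustrative assumptions, not the repo's exact API, and real encoders would be neural network modules.

```python
class ModalityEncoder:
    """Base class: maps one modality's input to an embedding."""
    def encode(self, x):
        raise NotImplementedError

class TimeSeriesEncoder(ModalityEncoder):
    def encode(self, x):
        return [sum(x)]          # placeholder for an LSTM over the series

class TextEncoder(ModalityEncoder):
    def encode(self, x):
        return [float(len(x))]   # placeholder for a CNN over the tokens

class TabularEncoder(ModalityEncoder):
    def encode(self, x):
        return list(x)           # placeholder for an embedding lookup

class MultiModalEncoder:
    """Fuses the per-modality embeddings into one global embedding."""
    def __init__(self, ts_enc, text_enc, tab_enc):
        self.encoders = (ts_enc, text_enc, tab_enc)

    def encode(self, ts, text, tab):
        fused = []
        for enc, x in zip(self.encoders, (ts, text, tab)):
            fused.extend(enc.encode(x))
        return fused

class TaskSpecificComponent:
    """Base class: maps the global embedding to one task's outputs."""
    def forward(self, z):
        raise NotImplementedError

class MeanHead(TaskSpecificComponent):
    def forward(self, z):
        return sum(z) / len(z)   # placeholder for an FC head

class MultiModalMultiTaskWrapper:
    """One shared multimodal encoder, one task-specific head per task."""
    def __init__(self, encoder, heads):
        self.encoder = encoder
        self.heads = heads       # dict: task name -> TaskSpecificComponent

    def forward(self, ts, text, tab):
        z = self.encoder.encode(ts, text, tab)
        return {name: head.forward(z) for name, head in self.heads.items()}
```

The point of the wrapper is that every task head sees the same shared embedding z, so the encoders are trained jointly across tasks.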

Structure

.
├── data
├── mimic3benchmark
│   ├── evaluation
│   ├── resources
│   └── scripts
├── har_code
│   ├── readers.py
│   └── mimic3models
│      ├── decompensation
│      │   └── logistic
│      ├── in_hospital_mortality
│      │   └── logistic    
│      ├── keras_models
│      ├── length_of_stay
│      │   └── logistic
│      ├── multitask
│      ├── phenotyping
│      │   └── logistic
│      └── resources
└── src
    ├── models
    │   
    └── experiments
    
  1. data: the folder that stores all data; it is not included but is generated by mimic3benchmark
  2. mimic3benchmark: the folder containing the data preprocessing pipeline and some other utilities to read the data (derived from Harutyunyan)
  3. har_code: the folder containing the Harutyunyan benchmark models (derived from Harutyunyan)
  4. src: where experiments are run, plus scripts to process some of the other modalities
    1. Scripts to process the text (derived from Khandanga)
    2. The multimodal and multi-task model definitions
    3. The experiment training scripts

Previous work

Time series

Our work is based on the initial processing by Harutyunyan et al., which is done in mimic3benchmark and mimic3models (moved under har_code here). Some changes were made and we include the necessary files; changes are listed at the bottom of this document. Original code is available at this link: Link to Harutyunyan code repository

Text

To process the text data, we follow the approach laid out by Khandanga et al. Some changes were made; these are also listed at the bottom of this document. Their code can be found here

Data

All MIMIC-III raw data (text and time series) must be obtained by the user.

Preprocessing the time-series data:

Instructions and code from Harutyunyan et al.

  1. Download all data

  2. The following command takes MIMIC-III CSVs, generates one directory per SUBJECT_ID and writes ICU stay information to data/{SUBJECT_ID}/stays.csv, diagnoses to data/{SUBJECT_ID}/diagnoses.csv, and events to data/{SUBJECT_ID}/events.csv. This step might take around an hour.

    python -m mimic3benchmark.scripts.extract_subjects {PATH TO MIMIC-III CSVs} data/root/
    
  3. The following command attempts to fix some issues (e.g., missing ICU stay IDs) and removes the events that have missing information. About 80% of events remain after removing all suspicious rows (more information can be found in mimic3benchmark/scripts/more_on_validating_events.md).

    python -m mimic3benchmark.scripts.validate_events data/root/
    
  4. The next command breaks up per-subject data into separate episodes (pertaining to ICU stays). Time series of events are stored in {SUBJECT_ID}/episode{#}_timeseries.csv (where # counts distinct episodes) while episode-level information (patient age, gender, ethnicity, height, weight) and outcomes (mortality, length of stay, diagnoses) are stored in {SUBJECT_ID}/episode{#}.csv. This script requires two files, one that maps event ITEMIDs to clinical variables and another that defines valid ranges for clinical variables (for detecting outliers, etc.). Outlier detection is disabled in the current version.

    python -m mimic3benchmark.scripts.extract_episodes_from_subjects data/root/
    
  5. The next command splits the whole dataset into training and testing sets. Note that the train/test split is the same for all tasks.

    python -m mimic3benchmark.scripts.split_train_and_test data/root/
    
  6. The following commands generate task-specific datasets, which can later be used in models. These commands are independent; if you plan to work on only one benchmark task, you can run only the corresponding command.

    python -m mimic3benchmark.scripts.create_multitask data/root/ data/multitask/
    

After the above command is done, there will be a directory data/{task} for each created benchmark task. These directories have two sub-directories: train and test. Each of them contains a set of ICU stays and one file named listfile.csv, which lists all samples in that particular set. Each row of listfile.csv has the following form: icu_stay, period_length, label(s). A row specifies a sample for which the input is the collection of ICU events of icu_stay that occurred in the first period_length hours of the stay, and the target is/are the label(s). In the in-hospital mortality prediction task period_length is always 48 hours, so it is not listed in the corresponding listfiles.
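A minimal sketch of reading such a listfile with Python's csv module. The two rows below are made-up example contents, and the exact header names are an assumption; check the listfile.csv generated for your task for the real column names.

```python
import csv
import io

# Hypothetical listfile contents; real files are produced by the
# create_* scripts and may use different column names.
listfile = io.StringIO(
    "icu_stay,period_length,label\n"
    "10011_episode1_timeseries.csv,48.0,1\n"
    "10026_episode1_timeseries.csv,48.0,0\n"
)

samples = []
for row in csv.DictReader(listfile):
    # Each sample: which stay file to load, how many hours of events
    # to use as input, and the prediction target.
    samples.append((row["icu_stay"], float(row["period_length"]), int(row["label"])))
```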

Preprocess the text data

  1. cd into the src folder
  2. Run the extract_notes.py file under the scripts folder. This extracts the text data for each ICU stay and saves it in JSON files named {patient_id}_{# of their stay}. The keys are the time points at which the notes were written and the values are the note text; if a stay has no notes, it is skipped.

    python3 scripts/extract_notes.py

  3. Run the extract_T0.py file under the scripts folder. This extracts the time point at which time-series recording starts, which is useful in later steps.

    python3 scripts/extract_T0.py
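Since the per-stay JSON maps note timestamps to note text, a downstream consumer typically loads the file and orders the notes chronologically. The file contents and timestamp format below are illustrative assumptions, not taken from the repo:

```python
import json
import io

# Hypothetical contents of one per-stay notes file ({patient_id}_{stay #}):
# keys are the times notes were written, values are the note text.
raw = io.StringIO(json.dumps({
    "2150-01-02 11:30:00": "Pt admitted to ICU ...",
    "2150-01-01 08:00:00": "Initial nursing note ...",
}))

notes = json.load(raw)
# Sort notes chronologically before feeding them to the text encoder.
# ISO-like timestamps sort correctly as strings.
ordered = sorted(notes.items())
```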

Models

Models are defined under multimodal/models/multi_modality_model_hy.py. This file defines

  1. the base class ModalityEncoder - which defines the architecture of an encoder for a modality
  2. the class MultiModalEncoder - which takes 3 encoders that inherit from ModalityEncoder and generates the global embedding
  3. the base class TaskSpecificComponent - which defines the structure of the task-specific component for a given task
  4. the base class MultiModalMultiTaskWrapper - which combines the above classes into the full MM-MT model. This requires
    1. A MultiModalEncoder that encodes all the modalities. This encoder requires:
      1. A Time Series ModalityEncoder
      2. A Text ModalityEncoder
      3. A Tabular ModalityEncoder
    2. A TaskSpecificComponent per task of interest. In this work we had six tasks, leading to six TaskSpecificComponents.

Included in multi_modality_model_hy.py are ModalityEncoders for time series (LSTMModel), text (Text_CNN), and tabular (TabularEmbedding) data. Also included is our default TaskSpecificComponent, FCTaskComponent, which is a fully connected linear layer followed by Dropout, ReLU, and an output layer. The specific parameters per task are defined in the training script.
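As a rough illustration of what such a head computes, here is a dependency-free sketch of a linear layer followed by ReLU and an output layer. The actual FCTaskComponent is a PyTorch module; the weights and dimensions below are made-up toy values, and dropout is omitted since it is inactive at evaluation time.

```python
def linear(x, weights, bias):
    """Dense layer: one dot product per output unit."""
    return [sum(wi * xi for wi, xi in zip(w, x)) + b
            for w, b in zip(weights, bias)]

def relu(x):
    return [max(0.0, v) for v in x]

def fc_task_head(z, hidden_w, hidden_b, out_w, out_b):
    """Linear -> (Dropout at train time) -> ReLU -> output layer."""
    h = relu(linear(z, hidden_w, hidden_b))
    return linear(h, out_w, out_b)
```

Each task gets its own head with its own output dimension (e.g., one logit for mortality, several for phenotyping), all applied to the same shared embedding.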

Training

Given a model definition, any scheme for training can be used. Under the experiments folder, we provide dataloaders for the data, for ease of use when training models. In addition, under multitasking/multitasking.py we provide the base training script we used as an example. Changes to data locations on the user's local machine will be necessary.
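A common pattern when training a multi-task model of this shape is to reduce the per-task losses to a single (optionally weighted) scalar before backpropagation. This is a sketch of that aggregation step under that assumption, not a transcription of the repo's training script:

```python
def combine_task_losses(task_losses, task_weights=None):
    """Sum per-task losses into one scalar training objective.

    task_losses:  dict mapping task name -> scalar loss for this batch
    task_weights: optional dict of per-task weights (default 1.0), useful
                  when tasks have losses on very different scales
    """
    task_weights = task_weights or {}
    return sum(task_weights.get(name, 1.0) * loss
               for name, loss in task_losses.items())
```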

Changelog

Time series changes

mimic3benchmark

  • mimic3csv.py
  • preprocessing.py
  • readers.py
  • subjects.py

mimic3models

  • metrics.py
  • preprocessing.py

Text processing changes

  • added extract_discharge_summary.py
  • extract_notes.py
  • extract_T0.py
  • added generate_pretrained_embeddings.py

License

Work will be published under MIT License upon final release.
