- Process data (instructions below).
- Define your own models under `src/model/multi_modality_model_hy.py`:
  - Define an encoder for each modality (Time Series, Text, and Tabular), each inheriting from `ModalityEncoder` (a minimal sketch follows this list).
  - Define your own `MultiModalEncoder` using those encoders.
  - Define the task-specific components that map from the multimodal encoding to output classes. These should inherit from `TaskSpecificComponent`.
  - The above will all be instantiated and used in a `MultiModalMultiTaskWrapper` in the training script.
- Instantiate your models and train! An example is given under `src/experiments/example_trainer.py`.
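As a minimal sketch of the first step, assuming a PyTorch implementation: the constructor arguments, embedding size, and import path here are illustrative assumptions, not the repository's actual API.

```python
import torch.nn as nn

# Import path follows the README's description; adjust to your local layout.
from src.model.multi_modality_model_hy import ModalityEncoder

# Hypothetical sketch of a custom time-series encoder; constructor
# arguments and embedding size are illustrative assumptions.
class MyTimeSeriesEncoder(ModalityEncoder):
    def __init__(self, n_features, embed_dim):
        super().__init__()
        self.lstm = nn.LSTM(n_features, embed_dim, batch_first=True)

    def forward(self, x):            # x: (batch, time, n_features)
        _, (h, _) = self.lstm(x)
        return h[-1]                 # (batch, embed_dim) embedding
```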
.
├── data
├── mimic3benchmark
│   ├── evaluation
│   ├── resources
│   └── scripts
├── har_code
│   ├── readers.py
│   └── mimic3models
│       ├── decompensation
│       │   └── logistic
│       ├── in_hospital_mortality
│       │   └── logistic
│       ├── keras_models
│       ├── length_of_stay
│       │   └── logistic
│       ├── multitask
│       ├── phenotyping
│       │   └── logistic
│       └── resources
└── src
    ├── models
    └── experiments
- data: the folder that stores all data; it is not included here but is generated by mimic3benchmark
- mimic3benchmark: the folder containing the data preprocessing pipeline and other utilities to read the data (derived from Harutyunyan)
- har_code: the folder containing the Harutyunyan benchmark models (derived from Harutyunyan)
- src: where experiments are run, along with scripts to process some of the other modalities
  - Scripts to process the text (derived from Khandanga)
  - The multimodal and multi-task model definitions
  - The experiment training scripts
Our work builds on the initial processing by Harutyunyan et al., which is done in mimic3benchmark and mimic3models (moved to be under har_code here). Some changes were made, and we include the necessary files; the changes are listed at the bottom of this document. The original code is available at this link: Link to Harutyunyan code repository
To process the text data, we follow the approach laid out by Khandanga et al. Some changes were made; these are also listed at the bottom of this document. Their code can be found here
All MIMIC-III raw data (text and time series) must be obtained by the user.
Instructions and code from Harutyunyan et al.
- Download all data.
- The following command takes MIMIC-III CSVs, generates one directory per `SUBJECT_ID`, and writes ICU stay information to `data/{SUBJECT_ID}/stays.csv`, diagnoses to `data/{SUBJECT_ID}/diagnoses.csv`, and events to `data/{SUBJECT_ID}/events.csv`. This step might take around an hour.

      python -m mimic3benchmark.scripts.extract_subjects {PATH TO MIMIC-III CSVs} data/root/

- The following command attempts to fix some issues (ICU stay ID is missing) and removes the events that have missing information. About 80% of events remain after removing all suspicious rows (more information can be found in `mimic3benchmark/scripts/more_on_validating_events.md`).

      python -m mimic3benchmark.scripts.validate_events data/root/

- The next command breaks up per-subject data into separate episodes (pertaining to ICU stays). Time series of events are stored in `{SUBJECT_ID}/episode{#}_timeseries.csv` (where # counts distinct episodes), while episode-level information (patient age, gender, ethnicity, height, weight) and outcomes (mortality, length of stay, diagnoses) are stored in `{SUBJECT_ID}/episode{#}.csv`. This script requires two files: one that maps event ITEMIDs to clinical variables and another that defines valid ranges for clinical variables (for detecting outliers, etc.). Outlier detection is disabled in the current version.

      python -m mimic3benchmark.scripts.extract_episodes_from_subjects data/root/

- The next command splits the whole dataset into training and testing sets. Note that the train/test split is the same for all tasks.

      python -m mimic3benchmark.scripts.split_train_and_test data/root/

- The following commands generate task-specific datasets, which can later be used in models. These commands are independent; if you are going to work only on one benchmark task, you can run only the corresponding command.

      python -m mimic3benchmark.scripts.create_multitask data/root/ data/multitask/
After the above command is done, there will be a directory `data/{task}` for each created benchmark task.
These directories have two sub-directories: `train` and `test`.
Each of them contains a number of ICU stays and one file named `listfile.csv`, which lists all samples in that particular set.
Each row of `listfile.csv` has the following form: `icu_stay, period_length, label(s)`.
A row specifies a sample for which the input is the collection of ICU events of `icu_stay` that occurred in the first `period_length` hours of the stay, and the target is/are the `label(s)`.
In the in-hospital mortality prediction task, `period_length` is always 48 hours, so it is not listed in the corresponding listfiles. A sketch of parsing these rows is given below.
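As an illustration, here is a hedged sketch of parsing a listfile of this form. The path and the assumption of a header row are based on the description above, not guaranteed properties of the generated files.

```python
import csv

# Hypothetical sketch: parse a listfile.csv of the form
#   icu_stay, period_length, label(s)
# The header row and exact column layout are assumptions.
def read_listfile(path):
    samples = []
    with open(path) as f:
        reader = csv.reader(f)
        next(reader)                       # skip the assumed header row
        for row in reader:
            icu_stay = row[0]              # e.g. an episode timeseries file name
            period_length = float(row[1])  # hours of the stay used as input
            labels = row[2:]               # one or more task labels
            samples.append((icu_stay, period_length, labels))
    return samples

# Example usage (path is illustrative):
# samples = read_listfile("data/multitask/train/listfile.csv")
```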
- cd into the `src` folder.
- Run `extract_notes.py` under the `scripts` folder. This will extract the text data for each ICU stay and save it in JSON files named `{patient_id}_{# of their stay}`. The keys are the time points at which the notes were written and the values are the note text; if a stay has no notes taken, that stay is skipped. (A sketch of reading these files follows these steps.)

      python3 scripts/extract_notes.py

- Run `extract_T0.py` under the `scripts` folder. This extracts the time point at which the time-series data start recording, which is useful in later steps.

      python3 scripts/extract_T0.py
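For illustration, a hedged sketch of reading one of the extracted note files; the directory, file extension, and timestamp format are assumptions, not the script's guaranteed output.

```python
import json

# Hypothetical sketch: load the notes for one ICU stay.
# The {patient_id}_{stay_number} naming follows the description above;
# the path and .json extension are illustrative assumptions.
with open("data/notes/12345_1.json") as f:
    notes = json.load(f)

# Keys are the times the notes were written; values are the note text.
for written_at, text in sorted(notes.items()):
    print(written_at, text[:80])   # first 80 characters of each note
```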
Models are defined under `multimodal/models/multi_modality_model_hy.py`. This file defines:
- base class `ModalityEncoder`, which defines the architecture of an encoder for a modality
- class `MultiModalEncoder`, which takes 3 encoders that inherit from `ModalityEncoder` and generates the global embedding
- base class `TaskSpecificComponent`, which defines the structure of the task-specific component for a given task
- base class `MultiModalMultiTaskWrapper`, which combines the above classes into the full MM-MT model. This requires:
  - A `MultiModalEncoder` that encodes all the modalities. This encoder requires:
    - A Time Series `ModalityEncoder`
    - A Text `ModalityEncoder`
    - A Tabular `ModalityEncoder`
  - A `TaskSpecificComponent` per task of interest. In this work we had six tasks, leading to six `TaskSpecificComponent`s.

A sketch of how these classes fit together is given below.
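This is a minimal sketch of the assembly, assuming a PyTorch implementation. The constructor signatures, the forward interface, and the concatenation-based fusion are illustrative assumptions, not the repository's actual API.

```python
import torch
import torch.nn as nn

# Hypothetical sketch: signatures and the concatenation-based fusion
# are assumptions, not the repository's actual API.
class MultiModalEncoder(nn.Module):
    def __init__(self, ts_encoder, text_encoder, tab_encoder):
        super().__init__()
        self.ts_encoder = ts_encoder
        self.text_encoder = text_encoder
        self.tab_encoder = tab_encoder

    def forward(self, ts, text, tab):
        # Encode each modality, then fuse into one global embedding.
        embeddings = [self.ts_encoder(ts),
                      self.text_encoder(text),
                      self.tab_encoder(tab)]
        return torch.cat(embeddings, dim=-1)

class MultiModalMultiTaskWrapper(nn.Module):
    def __init__(self, encoder, task_heads):
        super().__init__()
        self.encoder = encoder
        self.task_heads = nn.ModuleDict(task_heads)  # one head per task

    def forward(self, ts, text, tab):
        z = self.encoder(ts, text, tab)
        # One prediction per task from the shared global embedding.
        return {name: head(z) for name, head in self.task_heads.items()}
```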
Included in `multi_modality_model_hy.py` are `ModalityEncoder`s for time series (`LSTMModel`), text (`Text_CNN`), and tabular (`TabularEmbedding`) data.
Also included is our default `TaskSpecificComponent`, `FCTaskComponent`, which is a fully connected linear layer followed by Dropout, ReLU, and an output layer. The specific parameters per task are defined in the training script. A sketch of such a head follows.
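A hedged sketch of what such a task head might look like; the layer sizes, dropout rate, and constructor signature are illustrative assumptions.

```python
import torch.nn as nn

# Hypothetical sketch of an FC task head as described above:
# a fully connected layer, Dropout, ReLU, then an output layer.
# Sizes and signature are illustrative assumptions.
class FCTaskHead(nn.Module):
    def __init__(self, in_dim, hidden_dim, n_outputs, dropout=0.5):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.Dropout(dropout),
            nn.ReLU(),
            nn.Linear(hidden_dim, n_outputs),  # task-specific output layer
        )

    def forward(self, z):   # z: the global multimodal embedding
        return self.net(z)
```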
Given a model definition, any scheme for training can be used. Under the experiments folder, we provide dataloaders for the data to allow for
ease of use when training models. In addition, under `multitasking/multitasking.py` we provide the base training script we used as an example.
Changes to data locations on the user's local machine will be necessary. A minimal training-loop sketch follows.
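For orientation, here is a minimal, hedged training-loop sketch. The dataloader, batch layout, per-task losses, and all names are illustrative assumptions; see the provided training script for the real setup.

```python
import torch

# Hypothetical sketch of one multi-task training epoch. The dataloader,
# batch layout, and loss choices are illustrative assumptions.
def train_epoch(model, loader, optimizer, criteria):
    model.train()
    for ts, text, tab, targets in loader:   # targets: dict of task labels
        optimizer.zero_grad()
        outputs = model(ts, text, tab)      # dict: task name -> prediction
        # Sum the per-task losses into a single multi-task objective.
        loss = sum(criteria[name](outputs[name], targets[name])
                   for name in outputs)
        loss.backward()
        optimizer.step()

# Example wiring (names are illustrative):
# criteria = {"mortality": torch.nn.BCEWithLogitsLoss(), ...}
```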
mimic3benchmark
- mimic3csv.py
- preprocessing.py
- readers.py
- subjects.py

mimic3models
- metrics.py
- preprocessing.py

scripts
- added extract_discharge_summary.py
- extract_notes.py
- extract_T0.py
- added generate_pretrained_embeddings.py
Work will be published under the MIT License upon final release.