
MIMIC-III Multimodal and Multitask

Quick start

  1. Process the data (instructions below)
  2. Define your own models under src/model/multi_modality_model_hy.py
    1. Define an encoder for each modality (Time Series, Text, and Tabular), each inheriting from ModalityEncoder
    2. Define your own MultiModalEncoder using those encoders.
    3. Define the task-specific components that map from the multimodal encoding to output classes. These should inherit from TaskSpecificComponent.
    4. The above will all be instantiated and used in a MultiModalMultiTaskWrapper in the training script.
  3. Instantiate your models and train! An example is given under src/experiments/example_trainer.py
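The steps above can be sketched as follows. The class names (ModalityEncoder, MultiModalEncoder, TaskSpecificComponent, MultiModalMultiTaskWrapper) follow the repository; the method names, signatures, and toy encoders here are illustrative assumptions, not the repo's exact API, and real encoders would be neural network modules.

```python
class ModalityEncoder:
    """Base class: maps one modality's input to an embedding."""
    def encode(self, x):
        raise NotImplementedError

class TimeSeriesEncoder(ModalityEncoder):
    def encode(self, x):
        return [sum(x)]          # placeholder for an LSTM over the series

class TextEncoder(ModalityEncoder):
    def encode(self, x):
        return [float(len(x))]   # placeholder for a CNN over the tokens

class TabularEncoder(ModalityEncoder):
    def encode(self, x):
        return list(x)           # placeholder for an embedding lookup

class MultiModalEncoder:
    """Fuses the per-modality embeddings into one global embedding."""
    def __init__(self, ts_enc, text_enc, tab_enc):
        self.encoders = (ts_enc, text_enc, tab_enc)

    def encode(self, ts, text, tab):
        fused = []
        for enc, x in zip(self.encoders, (ts, text, tab)):
            fused.extend(enc.encode(x))
        return fused

class TaskSpecificComponent:
    """Base class: maps the global embedding to one task's outputs."""
    def forward(self, z):
        raise NotImplementedError

class MeanHead(TaskSpecificComponent):
    def forward(self, z):
        return sum(z) / len(z)   # placeholder for an FC head

class MultiModalMultiTaskWrapper:
    """One shared multimodal encoder, one task-specific head per task."""
    def __init__(self, encoder, heads):
        self.encoder = encoder
        self.heads = heads       # dict: task name -> TaskSpecificComponent

    def forward(self, ts, text, tab):
        z = self.encoder.encode(ts, text, tab)
        return {name: head.forward(z) for name, head in self.heads.items()}
```

The point of the wrapper is that every task head sees the same shared embedding z, so the encoders are trained jointly across tasks.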

Structure

.
├── data
├── mimic3benchmark
│   ├── evaluation
│   ├── resources
│   └── scripts
├── har_code
│   ├── readers.py
│   └── mimic3models
│      ├── decompensation
│      │   └── logistic
│      ├── in_hospital_mortality
│      │   └── logistic    
│      ├── keras_models
│      ├── length_of_stay
│      │   └── logistic
│      ├── multitask
│      ├── phenotyping
│      │   └── logistic
│      └── resources
└── src
    ├── models
    │   
    └── experiments
    
  1. data: the folder that stores all data; it is not included but is generated by mimic3benchmark
  2. mimic3benchmark: the folder containing the data preprocessing pipeline and some other utilities to read the data (derived from Harutyunyan)
  3. har_code: the folder containing the Harutyunyan benchmark models (derived from Harutyunyan)
  4. src: where experiments are run, plus scripts to process some of the other modalities
    1. Scripts to process the text (derived from Khandanga)
    2. The multimodal and multi-task model definitions
    3. The experiment training scripts

Previous work

Time series

Our work is based on the initial processing by Harutyunyan et al., which is done in mimic3benchmark and mimic3models (moved under har_code here). Some changes were made and we include the necessary files; changes are listed at the bottom of this document. Original code is available at this link: Link to Harutyunyan code repository

Text

To process the text data, we follow the approach laid out by Khandanga et al. Some changes were made; these are also listed at the bottom of this document. Their code can be found here

Data

All MIMIC-III raw data (text and time series) must be obtained by the user.

Preprocessing the time-series data:

Instructions and code from Harutyunyan et al.

  1. Download all data

  2. The following command takes MIMIC-III CSVs, generates one directory per SUBJECT_ID and writes ICU stay information to data/{SUBJECT_ID}/stays.csv, diagnoses to data/{SUBJECT_ID}/diagnoses.csv, and events to data/{SUBJECT_ID}/events.csv. This step might take around an hour.

    python -m mimic3benchmark.scripts.extract_subjects {PATH TO MIMIC-III CSVs} data/root/
    
  3. The following command attempts to fix some issues (e.g., missing ICU stay IDs) and removes the events that have missing information. About 80% of events remain after removing all suspicious rows (more information can be found in mimic3benchmark/scripts/more_on_validating_events.md).

    python -m mimic3benchmark.scripts.validate_events data/root/
    
  4. The next command breaks up per-subject data into separate episodes (pertaining to ICU stays). Time series of events are stored in {SUBJECT_ID}/episode{#}_timeseries.csv (where # counts distinct episodes) while episode-level information (patient age, gender, ethnicity, height, weight) and outcomes (mortality, length of stay, diagnoses) are stored in {SUBJECT_ID}/episode{#}.csv. This script requires two files, one that maps event ITEMIDs to clinical variables and another that defines valid ranges for clinical variables (for detecting outliers, etc.). Outlier detection is disabled in the current version.

    python -m mimic3benchmark.scripts.extract_episodes_from_subjects data/root/
    
  5. The next command splits the whole dataset into training and testing sets. Note that the train/test split is the same for all tasks.

    python -m mimic3benchmark.scripts.split_train_and_test data/root/
    
  6. The following commands generate task-specific datasets, which can later be used in models. These commands are independent; if you plan to work on only one benchmark task, you can run only the corresponding command.

    python -m mimic3benchmark.scripts.create_multitask data/root/ data/multitask/
    

After the above command is done, there will be a directory data/{task} for each created benchmark task. These directories have two sub-directories: train and test. Each of them contains a set of ICU stays and one file named listfile.csv, which lists all samples in that particular set. Each row of listfile.csv has the following form: icu_stay, period_length, label(s). A row specifies a sample for which the input is the collection of ICU events of icu_stay that occurred in the first period_length hours of the stay, and the target is/are the label(s). In the in-hospital mortality prediction task period_length is always 48 hours, so it is not listed in the corresponding listfiles.
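A minimal sketch of reading such a listfile with Python's csv module. The two rows below are made-up example contents, and the exact header names are an assumption; check the listfile.csv generated for your task for the real column names.

```python
import csv
import io

# Hypothetical listfile contents; real files are produced by the
# create_* scripts and may use different column names.
listfile = io.StringIO(
    "icu_stay,period_length,label\n"
    "10011_episode1_timeseries.csv,48.0,1\n"
    "10026_episode1_timeseries.csv,48.0,0\n"
)

samples = []
for row in csv.DictReader(listfile):
    # Each sample: which stay file to load, how many hours of events
    # to use as input, and the prediction target.
    samples.append((row["icu_stay"], float(row["period_length"]), int(row["label"])))
```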

Preprocess the text data

  1. cd into the src folder
  2. Run the extract_notes.py file under the scripts folder. This extracts the text data for each ICU stay and saves it in JSON files named {patient_id}_{# of their stay}. The keys are the time points at which the notes were written and the values are the note text; if a stay has no notes, it is skipped.

    python3 scripts/extract_notes.py

  3. Run the extract_T0.py file under the scripts folder. This extracts the time point at which time-series recording starts, which is useful in later steps.

    python3 scripts/extract_T0.py
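Since the per-stay JSON maps note timestamps to note text, a downstream consumer typically loads the file and orders the notes chronologically. The file contents and timestamp format below are illustrative assumptions, not taken from the repo:

```python
import json
import io

# Hypothetical contents of one per-stay notes file ({patient_id}_{stay #}):
# keys are the times notes were written, values are the note text.
raw = io.StringIO(json.dumps({
    "2150-01-02 11:30:00": "Pt admitted to ICU ...",
    "2150-01-01 08:00:00": "Initial nursing note ...",
}))

notes = json.load(raw)
# Sort notes chronologically before feeding them to the text encoder.
# ISO-like timestamps sort correctly as strings.
ordered = sorted(notes.items())
```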

Models

Models are defined under multimodal/models/multi_modality_model_hy.py. This file defines

  1. the base class ModalityEncoder - which defines the architecture of an encoder for a modality
  2. the class MultiModalEncoder - which takes 3 encoders that inherit from ModalityEncoder and generates the global embedding
  3. the base class TaskSpecificComponent - which defines the structure of the task-specific component for a given task
  4. the base class MultiModalMultiTaskWrapper - which combines the above classes into the full MM-MT model. This requires
    1. A MultiModalEncoder that encodes all the modalities. This encoder requires:
      1. A Time Series ModalityEncoder
      2. A Text ModalityEncoder
      3. A Tabular ModalityEncoder
    2. A TaskSpecificComponent per task of interest. In this work we had six tasks, leading to six TaskSpecificComponents.

Included in multi_modality_model_hy.py are ModalityEncoders for time series (LSTMModel), text (Text_CNN), and tabular (TabularEmbedding) data. Also included is our default TaskSpecificComponent, FCTaskComponent, which is a fully connected linear layer followed by Dropout, ReLU, and an output layer. The specific parameters per task are defined in the training script.
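As a rough illustration of what such a head computes, here is a dependency-free sketch of a linear layer followed by ReLU and an output layer. The actual FCTaskComponent is a PyTorch module; the weights and dimensions below are made-up toy values, and dropout is omitted since it is inactive at evaluation time.

```python
def linear(x, weights, bias):
    """Dense layer: one dot product per output unit."""
    return [sum(wi * xi for wi, xi in zip(w, x)) + b
            for w, b in zip(weights, bias)]

def relu(x):
    return [max(0.0, v) for v in x]

def fc_task_head(z, hidden_w, hidden_b, out_w, out_b):
    """Linear -> (Dropout at train time) -> ReLU -> output layer."""
    h = relu(linear(z, hidden_w, hidden_b))
    return linear(h, out_w, out_b)
```

Each task gets its own head with its own output dimension (e.g., one logit for mortality, several for phenotyping), all applied to the same shared embedding.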

Training

Given a model definition, any scheme for training can be used. Under the experiments folder, we provide dataloaders for the data, for ease of use when training models. In addition, under multitasking/multitasking.py we provide the base training script we used as an example. Changes to data locations on the user's local machine will be necessary.
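A common pattern when training a multi-task model of this shape is to reduce the per-task losses to a single (optionally weighted) scalar before backpropagation. This is a sketch of that aggregation step under that assumption, not a transcription of the repo's training script:

```python
def combine_task_losses(task_losses, task_weights=None):
    """Sum per-task losses into one scalar training objective.

    task_losses:  dict mapping task name -> scalar loss for this batch
    task_weights: optional dict of per-task weights (default 1.0), useful
                  when tasks have losses on very different scales
    """
    task_weights = task_weights or {}
    return sum(task_weights.get(name, 1.0) * loss
               for name, loss in task_losses.items())
```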

Changelog

Time series changes

mimic3benchmark

  • mimic3csv.py
  • preprocessing.py
  • readers.py
  • subjects.py

mimic3models

  • metrics.py
  • preprocessing.py

Text processing changes

  • added extract_discharge_summary.py
  • extract_notes.py
  • extract_T0.py
  • added generate_pretrained_embeddings.py

License

Work will be published under MIT License upon final release.
