The aim of this project is to develop a model that can accurately classify the kind of hospitalization required for patients with SARS-CoV-2 (COVID-19), based on their clinical data and test results.
The final model determines whether a patient needs to be admitted to the regular ward, to the intensive care unit, or does not require admission at all.
The dataset used in this project comes from the Hospital Israelita Albert Einstein in São Paulo, Brazil, and can be found here.
The dataset contains around 5,000 records and over 100 features. The biggest challenge with this dataset was the enormous amount of missing data. The chart below is a histogram of row counts by number of empty features; a majority of rows are missing values for almost 90% of the features.
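A minimal sketch of how such a per-row missingness histogram can be produced with pandas is shown below; the file name is a placeholder for the dataset export.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Placeholder file name for the Einstein dataset export.
df = pd.read_excel("dataset.xlsx")

# Count how many features are empty in each row, then histogram those counts.
missing_per_row = df.isna().sum(axis=1)
missing_per_row.hist(bins=30)
plt.xlabel("Number of empty features in a row")
plt.ylabel("Number of rows")
plt.show()
```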
This absence of data is partly due to the nature of the problem itself. Most of the records belong to patients who were never admitted at all, and lab tests were generally only performed on patients who were. Since most of the features are lab test results, many features are empty for many records simply because the labs were never run.
The following chart ranks the features from most to least complete. The fill levels are shown separately for admitted and not-admitted patients.
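One way to compute these fill levels, sketched under the assumption that an "admitted" boolean flag has been derived from the ward columns in the dataset:

```python
# Fraction of non-missing values per feature, split by the hypothetical
# "admitted" flag (True = admitted to any ward).
fill_by_group = df.notna().groupby(df["admitted"]).mean().T
fill_by_group.columns = ["not admitted", "admitted"]
fill_by_group.sort_values("admitted", ascending=False).plot.barh(figsize=(8, 20))
plt.xlabel("Fraction of non-missing values")
plt.show()
```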
Additionally, there is a strong class imbalance between the not hospitalized, regular ward, semi_intensive and intensive care classes.
After fitting a preliminary Logistic Regression model on the dataset, I saw that the "semi_intensive" class was performing very poorly.
I then used a second regression to reclassify those data points as either regular ward or intensive care. The hope was that, while the semi_intensive data was hard to predict on its own because it sits on the margin between two other classes, those points could instead be used to strengthen the other two classes.
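A minimal sketch of this relabeling step, assuming a feature matrix `X` and a label Series `y` with class names "none", "regular", "semi_intensive" and "intensive" (all hypothetical names):

```python
from sklearn.linear_model import LogisticRegression

# Train only on the two "neighbouring" classes.
two_class_mask = y.isin(["regular", "intensive"])
relabeler = LogisticRegression(max_iter=1000)
relabeler.fit(X[two_class_mask], y[two_class_mask])

# Assign each semi-intensive record to whichever of the two classes it
# most resembles, then fold it into that class.
semi_mask = y.eq("semi_intensive")
y = y.copy()
y.loc[semi_mask] = relabeler.predict(X[semi_mask])
```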
To address the challenge of missing data, I employed an iterative regression technique: the data that IS present for a feature serves as a training set to predict that feature's missing values, and the now-complete feature is then added to the training set used for the next feature.
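scikit-learn's IterativeImputer implements a closely related idea, so a sketch of the imputation step could look like the following (the numeric feature matrix `X_numeric` is an assumed name):

```python
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge

# Each feature with missing values is modelled as a function of the others;
# the imputer cycles through the features, refining the estimates each round.
imputer = IterativeImputer(estimator=BayesianRidge(), max_iter=10, random_state=0)
X_imputed = imputer.fit_transform(X_numeric)
```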
As for developing the actual model, I fit a variety of models to the data and selected the best one. The following dataframe shows the per-class performance metrics for each model variant.
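The exact set of candidates isn't spelled out above, but a comparison loop along these lines illustrates the approach:

```python
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import classification_report
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

# Illustrative candidate models, all evaluated on the same imputed data.
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
}

for name, model in candidates.items():
    preds = cross_val_predict(model, X_imputed, y, cv=5)
    print(name)
    print(classification_report(y, preds))  # per-class precision / recall / F1
```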
To verify the coefficients computed by the best model, I used hypothesis testing to compare the means of features between admitted and not-admitted patients, to see whether there was a statistically significant difference between those two populations for each feature. The idea is that if a significant difference exists, then those features do in fact bear heavily on determining which category a patient falls into.
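One way to run these tests is a Welch's t-test per feature; `top_features` and the "admitted" flag are assumed names from the earlier steps:

```python
from scipy import stats

# Compare each top feature's mean between admitted and not-admitted patients.
for feature in top_features:
    a = df.loc[df["admitted"], feature].dropna()
    b = df.loc[~df["admitted"], feature].dropna()
    t_stat, p_value = stats.ttest_ind(a, b, equal_var=False)  # Welch's t-test
    print(f"{feature}: t = {t_stat:.2f}, p = {p_value:.4f}")
```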
A Logistic Regression model using SMOTE to address the imbalance issue appears to be the best performing model.
The following chart shows the most important features identified by the model.
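A sketch of how such a model and its feature ranking can be put together with imbalanced-learn, assuming `X_imputed`, `y` and a `feature_names` list from the earlier steps:

```python
import pandas as pd
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

# SMOTE oversamples the minority classes so the logistic regression
# sees a balanced class distribution during fitting.
pipeline = Pipeline([
    ("smote", SMOTE(random_state=0)),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipeline.fit(X_imputed, y)

# Rank features by the magnitude of their coefficients (one row per class).
coefs = pd.DataFrame(
    pipeline.named_steps["clf"].coef_,
    columns=feature_names,
    index=pipeline.named_steps["clf"].classes_,
)
top_features = coefs.abs().max(axis=0).sort_values(ascending=False).head(15)
print(top_features)
```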
The following dataframe shows the results of the relevant hypothesis tests on the means of the top features for the not-admitted and admitted populations. In most cases, the null hypothesis that the means of the two populations are equal is rejected.
Additional data would be very helpful. There are some easy-to-obtain data points that are missing, such as BMI, pre-existing conditions, and smoking history. Of course, patient privacy might be a concern.