Applying Machine Learning to Understand Water Security and Water Access Inequality in Underserved Colonia Communities
This repository contains the source code for reproducing the results shown in the Colonias paper.
The original dataset was obtained from the Rural Community Assistance Partnership (RCAP) through its GIS web platform, the Phase II Colonia Web Map.
- Colonias.gdb: original dataset from RCAP
- colonias_Y_norm.csv: preprocessed dataset for colonias with public water services, derived from the original dataset (Colonias.gdb)
- colonias_N_norm.csv: preprocessed dataset for colonias without public water services, derived from the original dataset (Colonias.gdb)
- parameters_Y.csv: clustering results (Silhouette Score) under different damping factors for colonias with public water services
- parameters_N.csv: clustering results (Silhouette Score) under different damping factors for colonias without public water services
- colonias_Y_norm_labeled.csv: colonias_Y_norm.csv with the optimal clustering labels attached, for colonias with public water services
- colonias_N_norm_labeled.csv: colonias_N_norm.csv with the optimal clustering labels attached, for colonias without public water services
colonias_N_norm.csv and colonias_Y_norm.csv are the inputs to the Affinity Propagation algorithm.
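As a rough illustration of how a normalized table could be clustered and scored under different damping factors, here is a minimal sketch. It is not the code from the notebooks: the data are synthetic, the helper names `mean_abs_distance` and `sweep_damping` are hypothetical, and the distance used (mean absolute difference, which is what Gower distance reduces to for numeric features already scaled to [0, 1]) is an assumption about the preprocessing.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation
from sklearn.metrics import silhouette_score


def mean_abs_distance(X):
    """Pairwise mean absolute difference between rows (zero on the diagonal)."""
    X = np.asarray(X, dtype=float)
    return np.abs(X[:, None, :] - X[None, :, :]).mean(axis=2)


def sweep_damping(distance, dampings, max_iter=200):
    """Return a list of (damping, n_clusters, silhouette) for each damping."""
    similarity = -distance  # Affinity Propagation expects similarities
    rows = []
    for damping in dampings:
        model = AffinityPropagation(
            damping=damping, max_iter=max_iter, affinity="precomputed"
        )
        labels = model.fit_predict(similarity)
        label_set = set(labels.tolist())
        n_clusters = len(label_set - {-1})
        # Silhouette is undefined for a single cluster, for one cluster per
        # sample, or when the run did not converge (all labels are -1).
        if -1 not in label_set and 2 <= n_clusters < len(labels):
            score = silhouette_score(distance, labels, metric="precomputed")
        else:
            score = float("nan")
        rows.append((damping, n_clusters, score))
    return rows


# Synthetic stand-in for a normalized table such as colonias_Y_norm.csv.
rng = np.random.RandomState(0)
centers = np.array([[0.1, 0.1], [0.5, 0.9], [0.9, 0.2]])
X = np.clip(np.vstack([c + 0.03 * rng.randn(20, 2) for c in centers]), 0, 1)

distance = mean_abs_distance(X)
results = sweep_damping(distance, dampings=[0.5, 0.7, 0.9])

# Keep the damping with the highest Silhouette Score, if any run was valid.
valid = [r for r in results if not np.isnan(r[2])]
best = max(valid, key=lambda r: r[2]) if valid else None
print(results)
print(best)
```

A table like parameters_Y.csv would then correspond to the `results` rows written out to disk. The damping values and `max_iter` above are placeholders rather than the settings used in the paper.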
Selected attributes and corresponding descriptions are as follows:
- Python 2.7+
- scikit-learn == 0.21
- gower 0.1.2
- pandas
- numpy
- graphviz 0.20.1
- matplotlib
You can reproduce our workflow by following the steps below.
1. ap_optimal_param.ipynb: compares clustering results (Silhouette Score) under different damping factors and iterations. Running it generates parameters_Y.csv and parameters_N.csv (these can also be found under the folder dataset/).
2. params_SS.ipynb: plots Silhouette Score values under different damping factors and iterations for colonias with/without public water services.
3. ap_get_labels.ipynb: takes as input the optimal parameters from step 1, i.e. those with the highest Silhouette Score, and outputs the cluster labels obtained under those parameters, which are saved in csv files.
4. Attach the labels to the original csv files (colonias_N_norm.csv, colonias_Y_norm.csv) to create new csv files (colonias_N_norm_labeled.csv, colonias_Y_norm_labeled.csv). (You can also update the label information in the same csv files.)
5. decision_tree.ipynb: generates decision trees for the clustering results of colonias with/without public water services.
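The last two steps of the workflow, attaching labels and fitting a decision tree on them, can be sketched roughly as follows. This is an illustrative sketch only, not the notebook code: the attribute names `attr_a` and `attr_b`, the column name `label`, the toy values, and the tree depth are all hypothetical, and `export_text` (available from scikit-learn 0.21) is used here for a plain-text view in place of a graphviz rendering.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical stand-in for a normalized table such as colonias_Y_norm.csv.
features = pd.DataFrame({
    "attr_a": [0.05, 0.10, 0.15, 0.85, 0.90, 0.95],
    "attr_b": [0.20, 0.80, 0.40, 0.30, 0.70, 0.50],
})
# Hypothetical cluster labels, as would come out of the clustering step.
labels = [0, 0, 0, 1, 1, 1]

# Attach the labels as a new column; writing this frame to disk would give
# a "_labeled" csv file.
labeled = features.assign(label=labels)
# labeled.to_csv("colonias_Y_norm_labeled.csv", index=False)

# Fit a shallow tree that predicts the cluster label from the attributes,
# so the splits describe what distinguishes the clusters.
feature_names = ["attr_a", "attr_b"]
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(labeled[feature_names], labeled["label"])

rules = export_text(tree, feature_names=feature_names)
print(rules)
```

Keeping the tree shallow makes the printed rules easier to read as a description of each cluster, at the cost of possibly not separating every cluster perfectly on real data.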
Under the folder maps/, you can visualize clustering results on the map to understand the geographical distribution of water security in colonias.
- maps/jsons: contains water security information with clustering labels and geographical locations for colonias with/without public water services.
- maps/shapefiles: contains shapefiles of country boundaries and county outlines for the 4 colonia states (Arizona, California, New Mexico, and Texas).

