
Commit 390e8bc

full report submitted
1 parent cbc7255 commit 390e8bc

File tree

1 file changed (+32 −17 lines)


finding_donors.ipynb

Lines changed: 32 additions & 17 deletions
@@ -745,6 +745,17 @@
 "print \"Naive Predictor: [Accuracy score: {:.4f}, F-score: {:.4f}]\".format(accuracy, fscore)"
 ]
 },
+{
+"cell_type": "markdown",
+"metadata": {},
+"source": [
+"**Answer:** \n",
+"The results of running the cell code above are:\n",
+"\n",
+"Naive Predictor accuracy score: 0.1986\n",
+"Naive Predictor F-score: 0.2365"
+]
+},
 {
 "cell_type": "markdown",
 "metadata": {},
@@ -915,7 +926,9 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"**Answer: ** To aid in the selection of three models I generated a collection of bar graphs indicating the resulting f-score on the test set as well as the training times for each classifier. See figure above. I chose three classifiers yielding the highest f-score and reasonably short training times. Amongst the slow algorithms (righ-panel) I chose Gradient Boosting as it yielded the highest f-score of all classifiers tested; among the intermediate algorithms (middle panel) I chose logistic regression which runs fast and yields the best score in that group. In the left panel, I chose linear SVC as it yielded the hightes f-score in that group; Random Forest could have been a good choice as well.\n",
+"**Answer:** To aid in the selection of three models I generated a collection of bar graphs showing the f-score on the test set and the training time for each classifier; see the bar-chart figure above. \n",
+"\n",
+"I chose the three classifiers yielding the highest f-scores and reasonably short training times. Among the slow algorithms (right panel) I chose Gradient Boosting, as it yielded the highest f-score of all classifiers tested; among the intermediate algorithms (middle panel) I chose Logistic Regression, which runs fast and yields the best score in that group. In the left panel I chose Linear SVC, as it yielded the highest f-score in that group; Random Forest could have been a good choice as well.\n",
 "\n",
 "\n",
 "** 1. Gradient Boosting:**\n",
@@ -1520,25 +1533,24 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"**Answer:** To answer this question accurately we need to apply feature selection engineering methods. However, we may attempt first to gain insight about the features that could be informative and discriminating and/or those that could provide the highest information gain, say using a \"rule of thumb\" or heuristic criterion. \n",
+"**Answer:** \n",
+"To answer this question accurately we need to apply feature selection engineering methods. However, we may first attempt to gain insight into the features that could be informative and discriminating and/or those that could provide the highest information gain, say using a \"rule of thumb\" or heuristic criterion. \n",
 "\n",
-"First, we can remove subsets of features from the training dataset and observe whether their removal maintains or improves our predictive performance metric. Features that don't affect performance metrics according to an established criterion or tolerance will be deemed irrelevant. Features that imply some form or redundancy would be removed accordingly as well. \n",
+"First, we can remove subsets of features from the training dataset and observe whether their removal maintains or improves our predictive performance metric. Features that don't affect the performance metrics according to an established criterion or tolerance will be deemed irrelevant. Features that imply some form of redundancy or linear dependence on others would be removed accordingly as well. \n",
 "\n",
 "Second, we can try to estimate the purity of the feature or subfeatures that split into the classes of interest. Let us take a look at the subplots given above, each representing a selected feature, e.g. education, occupation, etc. For categorical features we identify subfeatures representing a unique value of the parent feature and which split the data into the classes '<=50K' (black color) and '>50K' (green color); for continuous data each subfeature represents an interval and each interval splits the data into the same corresponding classes.\n",
 "\n",
-"Intution from visual inspection (\"rule of thumb\"):\n",
+"Intuition from visual inspection (\"rule of thumb\"):\n",
 "\n",
-"When targeting a specific feature, we aim to identify the subfeatures with the highest purity, i.e. a clear separation or discrimination between two classes of interest; preferentially in our case, observing a higher proportion of individuals earning '>50K' in some of the subfeatures, i.e taller green bars somewhere in the attribute distribution.\n",
+"When targeting a specific feature, we aim to identify the subfeatures with the highest purity, i.e. a clear separation or discrimination between the two classes of interest; preferentially, in our case, observing a higher proportion of individuals earning '>50K' in some of the subfeatures, i.e. taller green bars somewhere in the attribute distribution. A crude measure of this is the following ratio: \n",
 "\n",
-"A crude measure of this is the following ratio: R =(No. '>50K')/(No. '<=50K') > threshold >= 0.5 (50%)\n",
-"or equivalently with largest group percentage: GP = (No. '>50K')/((No. '<=50K') + (No. '<=50K')). \n",
+"R = (No. '>50K')/(No. '<=50K') > threshold >= 0.5 (50%), or equivalently with the largest group percentage: \n",
+"GP = (No. '>50K')/((No. '>50K') + (No. '<=50K')). \n",
 "\n",
-"In addition, these numbers should be representative of the total number of individuals earning '>50K'\n",
-"GT = (No. '>50K')/(Total No. '50K'). Here (Total No. '50K') = 11,208 \n",
-"So we are also looking for MAXIMAL values of GT when possible\n",
+"In addition, the proportion of individuals belonging to a given class within a group should be representative of the total number of individuals earning '>50K'.\n",
 "\n",
 "** 1. Age:**\n",
-"The class distribution appears to be the envelope of a normal distribution, which is good for expectation values in relation to '>50K' class. Near the center we have subfeatures corresponding to age groups 40-50, 50-60 with a significant proportion of '>50K', R >0.5.\n",
+"The class distribution appears to be the envelope of a normal distribution, which is good for expectation values in relation to the '>50K' class. Near the center we have subfeatures corresponding to the age groups 40-50 and 50-60, with a significant proportion of '>50K', R > 0.5.\n",
 "\n",
 "** 2. Capital loss:**\n",
 "This feature contains 3 subfeatures (capital gain amounts of 10-20K, 20-30K and 90-100K) with majority '>50K' class and clearly discriminated from the rest of the subfeatures. Among all features, capital gain fulfills all of our simplified feature importance criteria.\n",
@@ -1550,9 +1562,9 @@
 "Contains the subfeatures \"50-60h\" and \"60-70h\" which contain a significant proportion of '>50K' earners, R > 0.5. In addition, the \"40-50h\" group seems discriminating, i.e. most people working 40 to 50 hours per week would earn less than 50K.\n",
 "\n",
 "** 5. Education Level:**\n",
-"It contains subfeatures \"Masters\", \"Doctorate\" and \"Prof. School\" with majority '>50K' class, and \"Bachelors\", with a high percentage of \">50K\" earners in that group R > 0.5. Althought these groups discriminate between the two income classes, the '>50K' (green) distribution spreads a bit more among the rest of the groups in comparison to capital gain and loss. \n",
+"It contains the subfeatures \"Masters\", \"Doctorate\" and \"Prof. School\" with majority '>50K' class, and \"Bachelors\", with a high percentage of \">50K\" earners in that group, R > 0.5. Although these groups discriminate between the two income classes, the '>50K' (green) distribution spreads a bit more among the rest of the groups in comparison to capital gain and loss. \n",
 "\n",
-"The rest of the features do not fulfill our \"rule of thumb\" criteria except for \"Occupation\" and \"Civil Status\", which contain several \">50K\" groups with R>0.5, but no '>50K' majority groups. In addition, 'occupation' seems to be correlated or dependent on the education feature (possibly redundant)."
+"The rest of the features do not fulfill our \"rule of thumb\" criteria except for \"Occupation\" and \"Civil Status\", which contain several \">50K\" groups with R > 0.5, but no '>50K' majority groups. In addition, 'occupation' seems to be correlated with or dependent on the education feature (possibly redundant).\n"
 ]
 },
 {
@@ -1615,9 +1627,12 @@
 "cell_type": "markdown",
 "metadata": {},
 "source": [
-"**Answer: ** There is qualitative agreement between these results and those discussed in question 6. In particular, the features \"Capital-gain and Capital-loss confirm our expectations regarding them as being the features that have the largest weight and impact on the model's performance metrics. The same comparison applies broadly to the \"education\" feature. (Notice that sklearn refers to the \"education number\" assigned to the categorical attributes in our \"education\" feature in question 6; it is understood that they are equivalent).\n",
+"**Answer:** \n",
+"There is qualitative agreement between these results and those discussed in question 6. In particular, the features \"Age\", \"Capital-gain\" and \"Capital-loss\" confirm our expectations regarding them as the features that have the largest weight and impact on the model's performance metrics. The same comparison applies broadly to the \"education\" and \"civil status\" features. (Notice that sklearn refers to the \"education number\" assigned to the categorical attributes of the \"education\" feature in question 6; it is understood that they are equivalent.)\n",
+"\n",
+"Notice that sklearn's feature selection is highly dependent on the hyperparameters passed to the classifier and/or the resulting optimized hyperparameters. In addition, sklearn's feature weighting takes place over the full set of 103 scaled and encoded features, not over the original features being ranked in question 6. \n",
 "\n",
-"Regarding the \"marital status\" it seems that sklearn's implementation picked the subfeature \"married-civ-spouse\" as being the relevant feature while discarding the rest of subfeatures within \"marital status\". This also seems reasonable and confirms our expectations; We need to notice however that sklearn feature selection takes place over the full set of 103 scaled and encoded features, not the original features being ranked in question 6. Feature selection over the final set of features makes more sense as several subfeatures whithin the main attributes discussed in the bar plots in question carried little information about the '>50K' class, and in some cases the class was absent, e.g. such as in the subfeature \"married-AF-spouse\" or \"Married spouse-absent\", etc."
+"In my opinion, feature selection over the final set of features makes more sense, as several attributes within the main features (see bar plots) carry little information about the '>50K' class."
 ]
 },
 {
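
Note: the comparison above relies on scikit-learn's feature-importance ranking computed over the 103 one-hot encoded columns. A hedged sketch of extracting that ranking from a fitted tree-based model; the names `best_clf` (a fitted estimator exposing `feature_importances_`) and `features_final` (the encoded feature DataFrame) are assumptions.

```python
# Illustrative only: list the top encoded features by importance weight.
import numpy as np

importances = best_clf.feature_importances_     # assumed fitted tree ensemble
order = np.argsort(importances)[::-1]           # most important first
for idx in order[:5]:
    print("{:<35} {:.4f}".format(features_final.columns[idx], importances[idx]))
```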
@@ -1766,9 +1781,9 @@
 "metadata": {},
 "source": [
 "**Answer:** \n",
-"As shown in the bar plots above, feature selection produces a reduction on the performance metrics accuracy and f-score. In particular there is a 1.2% reduction in accuracy and a 3.2% reduction in the f-score.\n",
+"As shown in the bar plots above, feature selection produces a reduction in the performance metrics, accuracy and f-score. In particular, there is a 1.2% reduction in accuracy and a 3.2% reduction in the f-score.\n",
 "\n",
-"If training time was a factor I would definitively use the reduced data model as it is an order of magnitude faster. I belive this improvement in training time would be worthwhile for training a much lager data set at the expense of just 3.2% reduction on f-score."
+"If training time were a factor I would definitely use the reduced-data model, as it is an order of magnitude faster. I believe this improvement in training time would be worthwhile for training a much larger data set at the expense of just a 3.2% reduction in the f-score."
 ]
 },
 {
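
Note: the final answer above weighs a roughly 1.2% accuracy and 3.2% f-score drop against an order-of-magnitude shorter training time on the reduced feature set. A hedged sketch of that comparison; `best_clf`, the split DataFrames `X_train`/`X_test`/`y_train`/`y_test`, and the top-five column selection are assumptions following the importance ranking discussed earlier.

```python
# Sketch of the reduced-feature comparison (assumed variable names throughout).
import numpy as np
from sklearn.base import clone
from sklearn.metrics import accuracy_score, fbeta_score

top_idx = np.argsort(best_clf.feature_importances_)[::-1][:5]   # five heaviest encoded columns
X_train_reduced = X_train.iloc[:, top_idx]
X_test_reduced = X_test.iloc[:, top_idx]

clf_reduced = clone(best_clf).fit(X_train_reduced, y_train)
pred = clf_reduced.predict(X_test_reduced)
print("Reduced model -- accuracy: {:.4f}, F0.5: {:.4f}".format(
    accuracy_score(y_test, pred), fbeta_score(y_test, pred, beta=0.5)))
```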
