Merge pull request udacity#55 from udacity/hotfix/project-files

jared-weed · web-flow · commit 1dc4e0cfb305 · 2016-07-22T13:20:00.000-04:00
Hotfix project questions
diff --git a/projects/boston_housing/boston_housing.ipynb b/projects/boston_housing/boston_housing.ipynb
@@ -118,7 +118,7 @@
     "### Question 1 - Feature Observation\n",
     "As a reminder, we are using three features from the Boston housing dataset: `'RM'`, `'LSTAT'`, and `'PTRATIO'`. For each data point (neighborhood):\n",
     "- `'RM'` is the average number of rooms among homes in the neighborhood.\n",
-    "- `'LSTAT'` is the percentage of all Boston homeowners who have a greater net worth than homeowners in the neighborhood.\n",
+    "- `'LSTAT'` is the percentage of homeowners in the neighborhood considered \"lower class\" (working poor).\n",
     "- `'PTRATIO'` is the ratio of students to teachers in primary and secondary schools in the neighborhood.\n",
     "\n",
     "_Using your intuition, for each of the three features above, do you think that an increase in the value of that feature would lead to an **increase** in the value of `'MDEV'` or a **decrease** in the value of `'MDEV'`? Justify your answer for each._  \n",
@@ -511,7 +511,7 @@
     "| Feature | Client 1 | Client 2 | Client 3 |\n",
     "| :---: | :---: | :---: | :---: |\n",
     "| Total number of rooms in home | 5 rooms | 4 rooms | 8 rooms |\n",
-    "| Household net worth (income) | Top 34th percent | Bottom 45th percent | Top 7th percent |\n",
+    "| Neighborhood poverty level (as %) | 17% | 32% | 3% |\n",
     "| Student-teacher ratio of nearby schools | 15-to-1 | 22-to-1 | 12-to-1 |\n",
     "*What price would you recommend each client sell his/her home at? Do these prices seem reasonable given the values for the respective features?*  \n",
     "**Hint:** Use the statistics you calculated in the **Data Exploration** section to help justify your response.  \n",
@@ -528,9 +528,9 @@
    "outputs": [],
    "source": [
     "# Produce a matrix for client data\n",
-    "client_data = [[5, 34, 15], # Client 1\n",
-    "               [4, 55, 22], # Client 2\n",
-    "               [8, 7, 12]]  # Client 3\n",
+    "client_data = [[5, 17, 15], # Client 1\n",
+    "               [4, 32, 22], # Client 2\n",
+    "               [8, 3, 12]]  # Client 3\n",
     "\n",
     "# Show predictions\n",
     "for i, price in enumerate(reg.predict(client_data)):\n",
diff --git a/projects/creating_customer_segments/customer_segments.ipynb b/projects/creating_customer_segments/customer_segments.ipynb
@@ -660,14 +660,21 @@
     "## Conclusion"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "In this final section, you will investigate ways that you can make use of the clustered data. First, you will consider how the different groups of customers, the ***customer segments***, may be affected differently by a specific delivery scheme. Next, you will consider how giving a label to each customer (which *segment* that customer belongs to) can provide additional features for the customer data. Finally, you will compare the ***customer segments*** to a hidden variable present in the data, to see whether the clustering identified certain relationships."
+   ]
+  },
   {
    "cell_type": "markdown",
    "metadata": {
     "collapsed": true
    },
    "source": [
     "### Question 10\n",
-    "Companies often run [A/B tests](https://en.wikipedia.org/wiki/A/B_testing) when making small changes to their products or services to determine whether that change affects its customers positively or negatively. The wholesale distributor wants to consider changing its delivery service from 5 days a week to 3 days a week, but will only do so if it affects their customers positively. *How would you use the customer segments you found above to perform an A/B test for this change?*  \n",
+    "Companies will often run [A/B tests](https://en.wikipedia.org/wiki/A/B_testing) when making small changes to their products or services to determine whether making that change will affect their customers positively or negatively. The wholesale distributor is considering changing its delivery service from the current 5 days a week to 3 days a week. However, the distributor will only make this change in delivery service for customers that react positively. *How can the wholesale distributor use the customer segments to determine which customers, if any, would react positively to the change in delivery service?*  \n",
     "**Hint:** Can we assume the change affects all customers equally? How can we determine which group of customers it affects the most?"
    ]
   },
@@ -683,9 +690,9 @@
    "metadata": {},
    "source": [
     "### Question 11\n",
-    "Additional structure is derived from originally unlabeled data when using clustering techniques. Since each customer has a segment it best identifies with (depending on the clustering algorithm applied), we can consider *'customer segment'* as an **engineered feature** for the data. Assume the wholesale distributor recently acquired ten new customers and has made estimates for each customer's annual spending of the six product categories. Knowing these estimates, the wholesale distributor wants to classify each new customer to one of the customer segments to determine the most appropriate delivery service.  \n",
-    "*Describe a supervised learning strategy you could use to make classification predictions for the ten new customers.*  \n",
-    "**Hint:** What other input feature could the supervised learner use besides the six product features to help make a prediction?"
+    "Additional structure is derived from originally unlabeled data when using clustering techniques. Since each customer has a ***customer segment*** it best identifies with (depending on the clustering algorithm applied), we can consider *'customer segment'* as an **engineered feature** for the data. Assume the wholesale distributor recently acquired ten new customers and each provided estimates for anticipated annual spending of each product category. Knowing these estimates, the wholesale distributor wants to classify each new customer to a ***customer segment*** to determine the most appropriate delivery service.  \n",
+    "*How can the wholesale distributor label the new customers using only their estimated product spending and the* ***customer segment*** *data?*  \n",
+    "**Hint:** A supervised learner could be used to train on the original customers. What would be the target variable?"
    ]
   },
   {
diff --git a/projects/student_intervention/student_intervention.ipynb b/projects/student_intervention/student_intervention.ipynb
@@ -239,15 +239,28 @@
    "metadata": {},
    "source": [
     "## Training and Evaluating Models\n",
-    "In this section, you will choose 3 supervised learning models that are appropriate for this problem and available in `scikit-learn`. You will first discuss the reasoning behind choosing these three models by considering what you know about the data and each model's strengths and weaknesses. You will then fit the model to varying sizes of training data (100 data points, 200 data points, and 300 data points) and measure the F<sub>1</sub> score. You will need to produce three tables (one for each model) that shows the training set size, training time, prediction time, F<sub>1</sub> score on the training set, and F<sub>1</sub> score on the testing set."
+    "In this section, you will choose 3 supervised learning models that are appropriate for this problem and available in `scikit-learn`. You will first discuss the reasoning behind choosing these three models by considering what you know about the data and each model's strengths and weaknesses. You will then fit each model to varying sizes of training data (100 data points, 200 data points, and 300 data points) and measure the F<sub>1</sub> score. You will need to produce three tables (one for each model) that show the training set size, training time, prediction time, F<sub>1</sub> score on the training set, and F<sub>1</sub> score on the testing set.\n",
+    "\n",
+    "**The following supervised learning models are currently available in** [`scikit-learn`](http://scikit-learn.org/stable/supervised_learning.html) **that you may choose from:**\n",
+    "- Gaussian Naive Bayes (GaussianNB)\n",
+    "- Decision Trees\n",
+    "- Ensemble Methods (Bagging, AdaBoost, Random Forest, Gradient Boosting)\n",
+    "- K-Nearest Neighbors (KNeighbors)\n",
+    "- Stochastic Gradient Descent (SGDC)\n",
+    "- Support Vector Machines (SVM)\n",
+    "- Logistic Regression"
    ]
   },
   {
    "cell_type": "markdown",
    "metadata": {},
    "source": [
     "### Question 2 - Model Application\n",
-    "*List three supervised learning models that are appropriate for this problem. What are the general applications of each model? What are their strengths and weaknesses? Given what you know about the data, why did you choose these models to be applied?*"
+    "*List three supervised learning models that are appropriate for this problem. For each model chosen*\n",
+    "- Describe one real-world application in industry where the model can be applied. *(You may need to do a small bit of research for this — give references!)* \n",
+    "- What are the strengths of the model; when does it perform well? \n",
+    "- What are the weaknesses of the model; when does it perform poorly?\n",
+    "- What makes this model a good candidate for the problem, given what you know about the data?"
    ]
   },
   {
@@ -413,7 +426,7 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "### Question 3 - Chosing the Best Model\n",
+    "### Question 3 - Choosing the Best Model\n",
     "*Based on the experiments you performed earlier, in one to two paragraphs, explain to the board of supervisors what single model you chose as the best model. Which model is generally the most appropriate based on the available data, limited resources, cost, and performance?*"
    ]
   },
@@ -429,7 +442,7 @@
    "metadata": {},
    "source": [
     "### Question 4 - Model in Layman's Terms\n",
-    "*In one to two paragraphs, explain to the board of directors in layman's terms how the final model chosen is supposed to work. For example if you've chosen to use a decision tree or a support vector machine, how does the model go about making a prediction?*"
+    "*In one to two paragraphs, explain to the board of directors in layman's terms how the final model chosen is supposed to work. Be sure that you are describing the major qualities of the model, such as how the model is trained and how the model makes a prediction. Avoid using advanced mathematical or technical jargon, such as describing equations or discussing the algorithm implementation.*"
    ]
   },
   {
diff --git a/projects/titanic_survival_exploration/README.md b/projects/titanic_survival_exploration/README.md
@@ -36,14 +36,14 @@ This will open the iPython Notebook software and project file in your web browse
 
 The dataset used in this project is included as `titanic_data.csv`. This dataset is provided by Udacity and contains the following attributes:
 
-- `survival` ? Survival (0 = No; 1 = Yes)
-- `pclass` ? Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
-- `name` ? Name
-- `sex` ? Sex
-- `age` ? Age
-- `sibsp` ? Number of Siblings/Spouses Aboard
-- `parch` ? Number of Parents/Children Aboard
-- `ticket` ? Ticket Number
-- `fare` ? Passenger Fare
-- `cabin` ? Cabin
-- `embarked` ? Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)
+- `survival` : Survival (0 = No; 1 = Yes)
+- `pclass` : Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
+- `name` : Name
+- `sex` : Sex
+- `age` : Age
+- `sibsp` : Number of Siblings/Spouses Aboard
+- `parch` : Number of Parents/Children Aboard
+- `ticket` : Ticket Number
+- `fare` : Passenger Fare
+- `cabin` : Cabin
+- `embarked` : Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)
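
Note on the `boston_housing` hunk: the sketch below is not part of this commit. It shows how each row of the updated `client_data` lines up with the three features named in Question 1, so the values can be checked against the client table (17%, 32%, 3%). The `FEATURES` tuple and the `label_clients` helper are hypothetical names introduced for illustration, and the column order (`'RM'`, `'LSTAT'`, `'PTRATIO'`) is an assumption taken from the order in which the notebook text and table list the features.

```python
# Hypothetical helper, not in the notebook: pairs each client row with feature names.
# Assumed column order, taken from the order of the features in the question text.
FEATURES = ("RM", "LSTAT", "PTRATIO")

# Values as they appear on the "+" side of the diff.
client_data = [[5, 17, 15],  # Client 1
               [4, 32, 22],  # Client 2
               [8, 3, 12]]   # Client 3


def label_clients(rows, features=FEATURES):
    """Return one {feature: value} dict per client row."""
    labeled = []
    for row in rows:
        # Guard against rows that do not match the expected feature count.
        if len(row) != len(features):
            raise ValueError(
                "expected {} values, got {}".format(len(features), len(row))
            )
        labeled.append(dict(zip(features, row)))
    return labeled


for i, client in enumerate(label_clients(client_data), start=1):
    print("Client {}: {}".format(i, client))
```

Reading the middle column as `'LSTAT'` is what makes the table and the matrix agree after this change. The old values (34, 55, 7) corresponded to the previous net-worth wording instead.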