Commit 87a7569

committed
pylab bugfix
1 parent 254e778 commit 87a7569

38 files changed

+16198
-3465
lines changed

Lines changed: 83 additions & 0 deletions
@@ -0,0 +1,83 @@
1+
{
2+
"metadata": {
3+
"name": "A0.%20Before%20You%20Begin"
4+
},
5+
"nbformat": 3,
6+
"nbformat_minor": 0,
7+
"worksheets": [
8+
{
9+
"cells": [
10+
{
11+
"cell_type": "markdown",
12+
"metadata": {},
13+
"source": [
14+
"Before you begin\n",
15+
"================"
16+
]
17+
},
18+
{
19+
"cell_type": "markdown",
20+
"metadata": {},
21+
"source": [
22+
"### About the content\n",
23+
"\n",
24+
"The content for OpenDST beta is written in IPython Notebook, a technology that lets you run executable code from a browser and see the results right there in the browser. For this edition of the content the executable code is Python. With IPython Notebook the executable code could also be R or even MATLAB."
25+
]
26+
},
27+
{
28+
"cell_type": "markdown",
29+
"metadata": {},
30+
"source": [
31+
"### IPython Notebook and Python\n",
32+
"\n",
33+
"* If you are new to Python see http://python.org - it will be best to spend some time learning basic Python syntax before attempting the interactive parts of these lessons.\n",
34+
"\n",
35+
"* If you are new to IPython Notebook see http://ipython.org/notebook - prepare to be impressed.\n",
36+
"\n",
37+
"* If you are new to numpy/scipy/matplotlib/pandas see http://www.scipy.org/ - it is not necessary to have a mastery of all these to get value out of these lessons. However, the better you know these technologies, the faster you will be able to apply these techniques.\n",
38+
"\n",
39+
"* In any case you can always read the Overview lesson for each technique and learn the material in parallel."
40+
]
41+
},
42+
{
43+
"cell_type": "markdown",
44+
"metadata": {},
45+
"source": [
46+
"### Important \n",
47+
" \n",
48+
"Please note this very important point before continuing. \n",
49+
"\n",
50+
"__When you execute code somewhere in the middle of a notebook, the code fragment may be dependent on imports, intermediate results in local variables and other such initializations in preceding cells. So it is better to execute code cells sequentially, starting at the top, when new to IPython Notebook, to avoid being confused by such extraneous errors.__ \n",
51+
"\n",
52+
"Even so, if you want to dive in without executing each preceding code cell yourself, use the menu item 'Cell->Run All'. This executes each code cell for you sequentially once, so from that point on all dependencies on prior code cells are satisfied.\n",
53+
"While the code is running you'll see a flashing message towards the top right corner of the notebook, in the grey menubar, which says 'Kernel Busy'.\n",
54+
"This means the code is being executed once through sequentially. Expand the browser to full screen to make sure you can see the message. When the message goes away and stays that way for 10 seconds or so, the execution is complete. You may now dive right in. \n",
55+
"\n",
56+
"If, during 'run all' you see a pink colored warning message, it can be ignored - it appears to be a Python or IPython issue that does not affect our content. To get rid of it for aesthetic reasons, click on the cell above it and hit shift-enter. It should disappear. Basically, executing the code that generated the message a second time makes the message disappear. "
57+
]
58+
},
59+
{
60+
"cell_type": "markdown",
61+
"metadata": {},
62+
"source": [
63+
"### Executing code etc.\n",
64+
"\n",
65+
"* To execute the code in a particular cell, click on the cell and hit shift-enter.\n",
66+
"* Before you execute the code in an arbitrary cell it is good to run all the code once so that all imports and variables are initialised.\n",
67+
"* To execute all code in a notebook click Menu->Cell->RunAll. You should do this at the start of reading each new notebook.\n",
68+
"* If you see a pink colored warning message, it can be ignored. To get rid of it for aesthetic reasons, click on the cell above it and hit shift-enter. It should disappear. \n",
69+
"* On the other hand if you get a full blown exception, it means something went wrong. Typically it means you didn't run Menu->Cell->RunAll and something did not get initialised. The error message itself should tell you more.\n",
70+
"\n",
71+
"* You will occasionally see a code cell that says \"TRY THIS\". Do actually try what it's asking you to. The notebook is meant to be interactive.\n",
72+
"* The content gets progressively more challenging and the intent is that a community support system will develop around the content. In the meantime there's StackOverflow.\n",
73+
"\n",
74+
"* Make sure you have all the supporting images and datasets in the right locations or you'll see lots of exceptions. The dataset and image directories are included and should work without error unless they have been moved around.\n",
75+
"* See installation instructions at http://opendst.com for a fresh install from the GitHub repo or from a zip file.\n",
76+
"* If you're having problems that are hard to fix, a fresh installation, from instructions at opendst.com, is recommended."
77+
]
78+
}
79+
],
80+
"metadata": {}
81+
}
82+
]
83+
}

notebooks/.ipynb_checkpoints/A1. Linear Regression - Overview-checkpoint.ipynb

Lines changed: 466 additions & 0 deletions
Large diffs are not rendered by default.

notebooks/.ipynb_checkpoints/A2. Linear Regression - Data Exploration - Lending Club-checkpoint.ipynb

Lines changed: 566 additions & 0 deletions
Large diffs are not rendered by default.
Lines changed: 227 additions & 0 deletions
@@ -0,0 +1,227 @@
1+
{
2+
"cells": [
3+
{
4+
"cell_type": "markdown",
5+
"metadata": {},
6+
"source": [
7+
"Linear Regression - Analysis\n",
8+
"============\n",
9+
"***\n",
10+
"\n",
11+
"We're going to pick up where we left off at the end of the exploration and define a linear model with two independent variables determining the dependent variable, Interest Rate.\n",
12+
"\n",
13+
"Our investigation is now defined as:\n",
14+
"\n",
15+
"_Investigate FICO Score and Loan Amount as predictors of Interest Rate for the Lending Club sample of 2,500 loans._\n",
16+
"\n",
17+
"We use multiple linear regression (one dependent variable, several independent variables) to model the variance in Interest Rate with FICO Score and Loan Amount using:\n",
18+
"\n",
19+
"$$InterestRate = a_0 + a_1 * FICOScore + a_2 * LoanAmount$$\n",
20+
"\n",
21+
"We're going to use modeling software to generate the model coefficients $a_0$, $a_1$ and $a_2$ and then some error estimates that we'll only touch upon lightly at this point. \n"
22+
]
23+
},
24+
{
25+
"cell_type": "code",
26+
"execution_count": 1,
27+
"metadata": {
28+
"collapsed": false
29+
},
30+
"outputs": [
31+
{
32+
"name": "stdout",
33+
"output_type": "stream",
34+
"text": [
35+
"Populating the interactive namespace from numpy and matplotlib\n",
36+
"Coefficients: [ 72.88279832 -0.08844242]\n",
37+
"Intercept: 0.000210747768548\n",
38+
"P-Values: [ 0.00000000e+000 0.00000000e+000 5.96972978e-203]\n",
39+
"R-Squared: 0.656632624649\n"
40+
]
41+
}
42+
],
43+
"source": [
44+
"%pylab inline\n",
45+
"import pylab as pl\n",
46+
"import numpy as np\n",
47+
"#from sklearn import datasets, linear_model\n",
48+
"import pandas as pd\n",
49+
"import statsmodels.api as sm\n",
50+
"\n",
51+
"# import the cleaned up dataset\n",
52+
"df = pd.read_csv('../datasets/loanf.csv')\n",
53+
"\n",
54+
"intrate = df['Interest.Rate']\n",
55+
"loanamt = df['Loan.Amount']\n",
56+
"fico = df['FICO.Score']\n",
57+
"\n",
58+
"# reshape the data from a pandas Series to columns \n",
59+
"# the dependent variable\n",
60+
"y = np.matrix(intrate).transpose()\n",
61+
"# the independent variables shaped as columns\n",
62+
"x1 = np.matrix(fico).transpose()\n",
63+
"x2 = np.matrix(loanamt).transpose()\n",
64+
"\n",
65+
"# put the two columns together to create an input matrix \n",
66+
"# if we had n independent variables we would have n columns here\n",
67+
"x = np.column_stack([x1,x2])\n",
68+
"\n",
69+
"# create a linear model and fit it to the data\n",
70+
"X = sm.add_constant(x)\n",
71+
"model = sm.OLS(y,X)\n",
72+
"f = model.fit()\n",
73+
"\n",
74+
"print 'Coefficients: ', f.params[1:3]\n",
75+
"print 'Intercept: ', f.params[0]\n",
76+
"print 'P-Values: ', f.pvalues\n",
77+
"print 'R-Squared: ', f.rsquared\n"
78+
]
79+
},
80+
{
81+
"cell_type": "markdown",
82+
"metadata": {},
83+
"source": [
84+
"So we have a lot of numbers here and we're going to understand some of them.\n",
85+
"\n",
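86+
"Coefficients: $a_1$ and $a_2$, the slopes for FICO Score and Loan Amount respectively.  \n",
87+
"Intercept: $a_0$. Note that `sm.add_constant` puts the constant column first by default, so the intercept is `f.params[0]`; in the numbers recorded above, 72.88 is the intercept $a_0$, -0.0884 is $a_1$ and 0.00021 is $a_2$.\n",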
88+
"\n",
89+
"How good are these numbers, how reliable? We need to have some idea. After all we are estimating. We're going to learn a very simple pragmatic way to use a couple of these. \n",
90+
"\n",
91+
"Let's look at the last two sets of numbers, the P-Values and R-Squared. \n",
92+
"We are going to talk loosely here so as to give some flavor of why these are important.\n",
93+
"But this is by no means a formal explanation.\n",
94+
"\n",
95+
"P-Values are probabilities. Informally, each number is the probability of seeing a coefficient at least this far from zero purely by chance if the true coefficient were zero. To be fairly confident in a coefficient we want this probability to be close to zero. The convention is that it needs to be 0.05 or less.\n",
96+
"For now suffice it to say that if this is true for each of our coefficients then we have good confidence in the model. If the P-Value for one or other of the coefficients is greater than 0.05 then we have less confidence in that particular dimension being useful in modeling and predicting.\n",
97+
"\n",
98+
"$R$-$squared$ or $R^2$ is a measure of how much of the variance in the data is captured by the model. What does this mean? For now let's understand this as a measure of how well the model captures the **spread** of the observed values not just the average trend. \n",
99+
"\n",
100+
"R is the multiple correlation coefficient: the correlation between the observed values of Y and the values the model predicts from the X's. A correlation lies between -1 and 1, so $R^2$ lies between 0 and 1.  \n",
101+
"\n",
102+
"A high $R^2$ would be close to 1.0, a low one close to 0. The value we have, about 0.66, is a reasonably good one. It corresponds to an R in the neighborhood of 0.8.\n",
103+
"The details of these error estimates deserve a separate discussion which we defer until another time.\n",
104+
"\n",
105+
"In summary we have a multiple linear regression model for Interest Rate based on FICO Score and Loan Amount which is well described by the parameters above.\n"
106+
]
107+
},
108+
{
109+
"cell_type": "code",
110+
"execution_count": 2,
111+
"metadata": {
112+
"collapsed": false
113+
},
114+
"outputs": [
115+
{
116+
"data": {
117+
"text/html": [
118+
"<style>\n",
119+
" @font-face {\n",
120+
" font-family: \"Computer Modern\";\n",
121+
" src: url('http://mirrors.ctan.org/fonts/cm-unicode/fonts/otf/cmunss.otf');\n",
122+
" }\n",
123+
" div.cell{\n",
124+
" width:800px;\n",
125+
" margin-left:auto;\n",
126+
" margin-right:auto;\n",
127+
" }\n",
128+
" h1 {\n",
129+
" font-family: \"Charis SIL\", Palatino, serif;\n",
130+
" }\n",
131+
" h4{\n",
132+
" margin-top:12px;\n",
133+
" margin-bottom: 3px;\n",
134+
" }\n",
135+
" div.text_cell_render{\n",
136+
" font-family: Computer Modern, \"Helvetica Neue\", Arial, Helvetica, Geneva, sans-serif;\n",
137+
" line-height: 145%;\n",
138+
" font-size: 120%;\n",
139+
" width:800px;\n",
140+
" margin-left:auto;\n",
141+
" margin-right:auto;\n",
142+
" }\n",
143+
" .CodeMirror{\n",
144+
" font-family: \"Source Code Pro\", source-code-pro,Consolas, monospace;\n",
145+
" }\n",
146+
" .prompt{\n",
147+
" display: None;\n",
148+
" }\n",
149+
" .text_cell_render h5 {\n",
150+
" font-weight: 300;\n",
151+
" font-size: 16pt;\n",
152+
" color: #4057A1;\n",
153+
" font-style: italic;\n",
154+
" margin-bottom: .5em;\n",
155+
" margin-top: 0.5em;\n",
156+
" display: block;\n",
157+
" }\n",
158+
" \n",
159+
" .warning{\n",
160+
" color: rgb( 240, 20, 20 )\n",
161+
" }\n",
162+
"</style>\n",
163+
"<script>\n",
164+
" MathJax.Hub.Config({\n",
165+
" TeX: {\n",
166+
" extensions: [\"AMSmath.js\"]\n",
167+
" },\n",
168+
" tex2jax: {\n",
169+
" inlineMath: [ ['$','$'], [\"\\\\(\",\"\\\\)\"] ],\n",
170+
" displayMath: [ ['$$','$$'], [\"\\\\[\",\"\\\\]\"] ]\n",
171+
" },\n",
172+
" displayAlign: 'center', // Change this to 'center' to center equations.\n",
173+
" \"HTML-CSS\": {\n",
174+
" styles: {'.MathJax_Display': {\"margin\": 4}}\n",
175+
" }\n",
176+
" });\n",
177+
"</script>"
178+
],
179+
"text/plain": [
180+
"<IPython.core.display.HTML at 0x1085cd4d0>"
181+
]
182+
},
183+
"execution_count": 2,
184+
"metadata": {},
185+
"output_type": "execute_result"
186+
}
187+
],
188+
"source": [
189+
"from IPython.core.display import HTML\n",
190+
"def css_styling():\n",
191+
" styles = open(\"../styles/custom.css\", \"r\").read()\n",
192+
" return HTML(styles)\n",
193+
"css_styling()"
194+
]
195+
},
196+
{
197+
"cell_type": "code",
198+
"execution_count": 2,
199+
"metadata": {
200+
"collapsed": false
201+
},
202+
"outputs": [],
203+
"source": []
204+
}
205+
],
206+
"metadata": {
207+
"kernelspec": {
208+
"display_name": "Python 2",
209+
"language": "python",
210+
"name": "python2"
211+
},
212+
"language_info": {
213+
"codemirror_mode": {
214+
"name": "ipython",
215+
"version": 2
216+
},
217+
"file_extension": ".py",
218+
"mimetype": "text/x-python",
219+
"name": "python",
220+
"nbconvert_exporter": "python",
221+
"pygments_lexer": "ipython2",
222+
"version": "2.7.8"
223+
}
224+
},
225+
"nbformat": 4,
226+
"nbformat_minor": 0
227+
}
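
The analysis notebook above reads the intercept and slopes out of the fitted parameter vector by position and then discusses $R^2$ and R. As a rough illustration of how those pieces relate, here is a minimal sketch that fits the same kind of model, InterestRate = a_0 + a_1 * FICOScore + a_2 * LoanAmount, with plain NumPy least squares instead of statsmodels. The data is synthetic (it is not the Lending Club sample), and the helper name `fit_ols` and the generating values 70.0, -0.085 and 0.0002 are assumptions made up for this example. The code is written to run under both Python 2.7 and Python 3.

```python
import numpy as np


def fit_ols(predictors, y):
    """Least-squares fit of y = a0 + a1*x1 + ... (hypothetical helper).

    Returns (params, r_squared, fitted), where params[0] is the intercept
    because the column of ones is placed first in the design matrix.
    """
    predictors = np.asarray(predictors, dtype=float)
    y = np.asarray(y, dtype=float)
    # Constant column first, so the intercept is params[0].
    X = np.column_stack([np.ones(len(y)), predictors])
    params = np.linalg.lstsq(X, y, rcond=None)[0]
    fitted = X.dot(params)
    # R^2 = 1 - (residual sum of squares) / (total sum of squares).
    ss_res = np.sum((y - fitted) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return params, 1.0 - ss_res / ss_tot, fitted


# Synthetic stand-in data; the generating values are assumptions.
rng = np.random.RandomState(0)
fico = rng.uniform(640, 830, size=500)
loan_amount = rng.uniform(1000, 35000, size=500)
noise = rng.normal(0.0, 2.0, size=500)
interest_rate = 70.0 - 0.085 * fico + 0.0002 * loan_amount + noise

params, r_squared, fitted = fit_ols(
    np.column_stack([fico, loan_amount]), interest_rate)
intercept = params[0]        # a0
coefficients = params[1:]    # a1 (FICO Score), a2 (Loan Amount)

# R is the correlation between observed and fitted values.
multiple_r = np.corrcoef(interest_rate, fitted)[0, 1]

print("Intercept (a0): {0:.4f}".format(intercept))
print("Coefficients (a1, a2): {0}".format(coefficients))
print("R-Squared: {0:.4f}".format(r_squared))
print("R: {0:.4f}".format(multiple_r))
```

Because the model includes an intercept, squaring the correlation between the observed and fitted values gives the same number as $R^2$, which is the relationship the notebook relies on when it reads an R of about 0.8 from an $R^2$ of about 0.66.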
Lines changed: 63 additions & 0 deletions
@@ -0,0 +1,63 @@
1+
{
2+
"cells": [
3+
{
4+
"cell_type": "markdown",
5+
"metadata": {},
6+
"source": [
7+
"### This is a placeholder. \n",
8+
"The data set used is the same as in Linear Regression where the data exploration was done in depth. \n",
9+
"So this section is just a placeholder and the content is identical to the data exploration lesson in Linear Regression.  \n",
10+
"Should we use a different data set for Logistic Regression in future, this is where the data exploration will go.\n"
11+
]
12+
},
13+
{
14+
"cell_type": "code",
15+
"execution_count": 1,
16+
"metadata": {
17+
"collapsed": false
18+
},
19+
"outputs": [
20+
{
21+
"name": "stdout",
22+
"output_type": "stream",
23+
"text": [
24+
"Populating the interactive namespace from numpy and matplotlib\n"
25+
]
26+
}
27+
],
28+
"source": [
29+
"%pylab inline"
30+
]
31+
},
32+
{
33+
"cell_type": "code",
34+
"execution_count": null,
35+
"metadata": {
36+
"collapsed": true
37+
},
38+
"outputs": [],
39+
"source": []
40+
}
41+
],
42+
"metadata": {
43+
"kernelspec": {
44+
"display_name": "Python 2",
45+
"language": "python",
46+
"name": "python2"
47+
},
48+
"language_info": {
49+
"codemirror_mode": {
50+
"name": "ipython",
51+
"version": 2
52+
},
53+
"file_extension": ".py",
54+
"mimetype": "text/x-python",
55+
"name": "python",
56+
"nbconvert_exporter": "python",
57+
"pygments_lexer": "ipython2",
58+
"version": "2.7.8"
59+
}
60+
},
61+
"nbformat": 4,
62+
"nbformat_minor": 0
63+
}
