Unit III - I
• SUPERVISED –
o Learn through examples of which we know the desired output (what we want to predict).
o Supervised learning is analogous to training a child to walk. You will hold the child’s hand, show
him how to take his foot forward, walk yourself for a demonstration and so on, until the child
learns to walk on his own.
o Is this a cat or a dog?
o Are these emails spam or not?
o Predict the market value of houses, given the square meters, number of rooms, neighborhood,
etc.
o Types
▪ Classification - Output is a discrete variable (e.g., cat/dog)
▪ Regression - Output is continuous (e.g., price, temperature)
(Figure: Classification vs. Regression)
o Algorithms for Supervised Learning
• k-Nearest Neighbours
• Decision Trees
• Naive Bayes
• Logistic Regression
• Support Vector Machines
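As a quick illustration of supervised classification (a minimal sketch using scikit-learn's built-in iris dataset, which is not part of these notes), a k-Nearest Neighbours classifier is shown labelled examples and then asked to predict labels for examples it has not seen:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Labelled examples: features (measurements) and the desired output (species)
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Learn from the labelled training examples
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

# Predict classes for unseen examples and check accuracy
print(knn.predict(X_test[:5]))
print(knn.score(X_test, y_test))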
• UNSUPERVISED –
o There is no desired output. Learn something about the data. Latent relationships.
o In unsupervised learning, we do not specify a target variable to the machine; rather, we ask the
machine “What can you tell me about X?”. More specifically, we may ask questions such as given
a huge data set X, “What are the five best groups we can make out of X?” or “What features occur
together most frequently in X?”.
o I have photos and want to put them in 20 groups.
o I want to find anomalies in the credit card usage patterns of my customers.
o Useful for learning structure in the data (clustering), hidden correlations, reduce dimensionality,
etc.
o k-means clustering
o Cluster Identification
o DBSCAN Clustering (Density-based Spatial Clustering of Applications with Noise)
o Principal component analysis (PCA)
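A minimal sketch of unsupervised clustering with scikit-learn's KMeans (the points below are made up purely for illustration): no desired output is given, we only ask the algorithm to find groups.
import numpy as np
from sklearn.cluster import KMeans

# Unlabelled data: no target variable is specified
points = np.array([[1, 2], [1, 4], [1, 0],
                   [10, 2], [10, 4], [10, 0]])

# Ask for 2 groups; the algorithm discovers them from the data alone
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(kmeans.labels_)           # which group each point was placed in
print(kmeans.cluster_centers_)  # the centre of each group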
• REINFORCEMENT –
o An agent interacts with an environment and watches the result of the interaction. Environment
gives feedback via a positive or negative reward signal.
o Consider training a pet dog: we train our pet to bring a ball to us. We throw the ball a certain
distance and ask the dog to fetch it back to us. Every time the dog does this right, we reward the
dog. Slowly, the dog learns that doing the job right earns a reward, and it then starts doing the job
the right way every time in the future.
o Agent: It is an assumed entity which performs actions in an environment to gain some reward.
o Environment (e): A scenario that an agent has to face.
o Reward (R): An immediate return given to the agent when it performs a specific action or task.
o State (s): State refers to the current situation returned by the environment.
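A toy sketch of the agent-environment-reward loop (the one-action environment and the reward rule below are invented purely for illustration): the agent tries actions, observes the reward, and gradually prefers the action that pays off.
import random

# Hypothetical environment: action 1 ("fetch") earns a reward, action 0 does not
def environment(action):
    return 1 if action == 1 else 0        # reward R

value = {0: 0.0, 1: 0.0}                  # the agent's estimate of each action's worth
for episode in range(100):
    # mostly pick the best-looking action, but explore occasionally
    action = random.choice([0, 1]) if random.random() < 0.1 else max(value, key=value.get)
    reward = environment(action)
    value[action] += 0.1 * (reward - value[action])   # move the estimate toward the observed reward

print(value)   # the "fetch" action typically ends up with the higher estimated value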
o Data Gathering
• Collect data from various sources
• The more the better: Some algorithms need large amounts of data to be useful (e.g., neural
networks).
• The quantity and quality of data dictate the model accuracy
o Data Preprocessing
• Clean data to have homogeneity
• Is there anything wrong with the data?
▪ Missing values
▪ Outliers
▪ Bad encoding (for text)
▪ Wrongly-labelled examples
▪ Biased data
▪ Do I have many more samples of one class than the rest?
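A short sketch (with a made-up DataFrame and column names) of checking for and handling some of the problems listed above using pandas:
import numpy as np
import pandas as pd

# Hypothetical raw data containing a missing value and an implausible outlier
df = pd.DataFrame({"age": [25, 31, np.nan, 29, 200],
                   "label": ["yes", "no", "yes", "yes", "no"]})

print(df.isnull().sum())                           # how many missing values per column
df["age"] = df["age"].fillna(df["age"].median())   # fill missing values with the median
df = df[df["age"] < 120]                           # drop an obviously wrong outlier
print(df["label"].value_counts())                  # check whether one class dominates (biased data)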
o Feature Engineering
• A feature is an individual measurable property of a phenomenon being observed
• Our inputs are represented by a set of features.
• To classify spam email, features could be:
▪ Number of words that have been ch4ng3d like this.
▪ Language of the email (0=English, 1=Spanish)
▪ Number of emojis
• Extracts more information from existing data, makes data more useful
• With good features, most algorithms can learn faster
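A sketch of turning a raw email into the kind of numeric features listed above (the helper function and the rules inside it are invented for illustration):
import re

def email_features(text):
    # Hypothetical features, mirroring the bullets above
    changed_words = len(re.findall(r"\b\w*\d+\w*\b", text))   # words containing digits, e.g. "ch4ng3d"
    language = 0                                               # 0 = English, 1 = Spanish (assumed known)
    emojis = text.count(":)") + text.count(":(")               # crude emoticon count
    return [changed_words, language, emojis]

print(email_features("You have w0n a fr33 prize :)"))   # [2, 0, 1]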
o Algorithm Selection
• Selecting the right machine learning model
• Supervised learning models - the outcome is known; refine the model until the output reaches the
desired accuracy level. Unsupervised learning models - the outcome is unknown; the model has to
discover groups or structure in the data on its own. Reinforcement learning - learns by trial and
error; commonly applied in business environments.
o Training
• “Learning” is done at this stage. We use the part of the data set allocated for training to teach our
model to differentiate between the two fruits (say, apples and oranges). If we view our model in mathematical terms, the
inputs i.e. our 2 features would have coefficients. These coefficients are called the weights of
features. There would also be a constant or y-intercept involved. This is referred to as the bias of
the model. The process of determining their values is one of trial and error. Initially, we pick random
values for them and provide inputs. The achieved output is compared with the actual output, and the
difference is minimized by trying different values for the weights and bias. The iterations are
repeated using different entries from our training data set until the model reaches the desired
level of accuracy (a tiny numerical sketch of this trial-and-error idea appears at the end of this section).
• Training requires patience and experimentation. It is also useful to have knowledge of the field
where the model would be implemented. For instance, if a machine learning model is to be used
for identifying high risk clients for an insurance company, the knowledge of how the insurance
industry operates would expedite the process of training as more educated guesses can be made
during the iterations. Training can prove to be highly rewarding if the model starts to succeed in
its role. It is comparable to when a child learns to ride a bicycle. Initially, they may have multiple
falls but, after a while, they develop a better grasp of the process and are able to react better to
different situations while riding the bicycle.
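A deliberately tiny numerical sketch of the trial-and-error idea above, for a single feature: start from a random weight and bias, compare the model's output with the actual output, and nudge both values to shrink the difference (the data and the plain gradient-descent updates here are made up for illustration and are not the exact procedure used later in these notes).
import random

# Made-up training data: y is roughly 2*x + 1
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.1, 5.0, 6.9, 9.2]

w, b = random.random(), random.random()     # random initial weight (coefficient) and bias
lr = 0.01                                   # size of each correction step

for _ in range(2000):                       # repeated passes over the training data
    for x, y in zip(xs, ys):
        pred = w * x + b                    # model output
        error = pred - y                    # difference from the actual output
        w -= lr * error * x                 # adjust the weight to reduce the error
        b -= lr * error                     # adjust the bias to reduce the error

print(w, b)   # ends up near 2 and 1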
o Evaluation
• After the model is trained, it needs to be tested to see if it would operate well in real world
situations. That is why the part of the data set created for evaluation is used to check the model’s
proficiency. This puts the model in a scenario where it encounters situations that were not a part
of its training. In our case, it could mean trying to identify a type of an apple or an orange that is
completely new to the model. However, through its training, the model should be capable
enough to extrapolate the information and deem whether the fruit is an apple or an orange.
• Evaluation becomes highly important when it comes to commercial applications. Evaluation
allows data scientists to check whether the goals they set out to achieve were met or not. If the
results are not satisfactory, then the prior steps need to be revisited so that the root cause behind the
model’s underperformance can be identified and, subsequently, rectified. If the evaluation is not
done properly, then the model may not excel at fulfilling its desired commercial purpose. This
could mean that the company that designed and sold the model loses goodwill with the
client. It could also mean damage to the company’s reputation, as future clients may become
hesitant when it comes to trusting the company’s acumen regarding machine learning models.
Therefore, evaluation of the model is essential for avoiding the aforementioned ill-effects.
o Hyperparameter Tuning
• If the evaluation is successful, we proceed to the step of hyperparameter tuning. This step tries to
improve upon the positive results achieved during the evaluation step. For our example, we
would see if we can make our model even better at recognizing apples and oranges. There are
different ways we can go about improving the model. One of them is revisiting the training step
and using multiple sweeps of the training data set for training the model. This could lead to greater
accuracy as the longer duration of training provides more exposure and improves quality of the
model. Another way to go about it is refining the initial values given to the model. Random initial
values often produce poor results as they are gradually refined by trial and error. However, if we
can come up with better initial values or perhaps initiate the model using a distribution instead of
a value then our results could get better. There are other parameters that we could play around
with in order to refine the model.
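One common way to automate part of this search is a grid search over candidate hyperparameter values (a generic scikit-learn sketch, not the specific tuning procedure described above): every combination is trained and evaluated, and the best-scoring one is kept.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Candidate hyperparameter values to try
param_grid = {"n_neighbors": [1, 3, 5, 7, 9]}

search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search.fit(X, y)

print(search.best_params_)   # the value of n_neighbors that scored best
print(search.best_score_)    # its cross-validated accuracy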
REGRESSION
• Regression - to find the relationship between variables.
• In Machine Learning, and in statistical modeling, that relationship is used to predict the outcome of future
events.
• Linear regression uses the relationship between the data points to draw a straight line through all of them.
The line can be used to predict future values.
import matplotlib.pyplot as plt

x = [5,7,8,7,2,17,2,9,4,11,12,9,6]
y = [99,86,87,88,111,86,103,87,94,78,77,85,86]

plt.scatter(x, y)
plt.show()
import matplotlib.pyplot as plt
from scipy import stats

x = [5,7,8,7,2,17,2,9,4,11,12,9,6]
y = [99,86,87,88,111,86,103,87,94,78,77,85,86]

# Fit a straight line and get the slope, intercept and correlation coefficient r
slope, intercept, r, p, std_err = stats.linregress(x, y)

def myfunc(x):
    return slope * x + intercept

# y-value on the fitted line for every x
mymodel = list(map(myfunc, x))

plt.scatter(x, y)
plt.plot(x, mymodel)
plt.show()
It is important to know what the relationship between the values on the x-axis and the values on the y-axis is; if
there is no relationship, linear regression cannot be used to predict anything.
This relationship - the coefficient of correlation - is called r. The r value ranges from -1 to 1, where 0 means no
relationship and 1 (or -1) means 100% related.
For the example above, r comes out to be about -0.75859. This shows that there is a relationship - not perfect, but it indicates that we
could use linear regression for future predictions.
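Since the fitted line above came from stats.linregress, the r it returns can be printed directly (a small check against the value quoted above):
from scipy import stats

x = [5,7,8,7,2,17,2,9,4,11,12,9,6]
y = [99,86,87,88,111,86,103,87,94,78,77,85,86]

slope, intercept, r, p, std_err = stats.linregress(x, y)
print(r)   # about -0.758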
Another example: use the fitted line to predict the y value for x = 10 (the slope and intercept come from the linregress call above).

def myfunc(x):
    return slope * x + intercept

speed = myfunc(10)
print(speed)

The predicted speed comes out to be about 85.6, which could also be read off the regression plot.
Another example: the same plotting code, but for a different data set (values not shown here) whose x and y have almost no relationship.

plt.scatter(x, y)
plt.plot(x, mymodel)
plt.show()
The resulting r of 0.013 indicates a very bad relationship and tells us that this data set is not suitable for linear
regression. If the data points clearly will not fit a linear regression (a straight line through all the data points), polynomial
regression might be a better choice. Polynomial regression, like linear regression, uses the relationship between the
variables x and y to find the best way to draw a line through the data points.
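A sketch of polynomial regression with numpy (the data values below are illustrative only): numpy.polyfit finds the polynomial coefficients and numpy.poly1d turns them into a callable model.
import numpy as np
import matplotlib.pyplot as plt

# Illustrative data that bends, so a straight line would fit poorly
x = [1, 2, 3, 5, 6, 7, 8, 9, 10, 12, 13, 14, 15, 16, 18, 19, 21, 22]
y = [100, 90, 80, 60, 60, 55, 60, 65, 70, 70, 75, 76, 78, 79, 90, 99, 99, 100]

mymodel = np.poly1d(np.polyfit(x, y, 3))   # fit a degree-3 polynomial

myline = np.linspace(1, 22, 100)
plt.scatter(x, y)
plt.plot(myline, mymodel(myline))
plt.show()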
Features and Labels
Regression is a form of supervised machine learning, which is where the scientist teaches the machine by showing
it features and then showing it what the correct answer is, over and over, to teach the machine. Once the
machine is taught, the scientist will usually "test" the machine on some unseen data, where the scientist still
knows what the correct answer is, but the machine doesn't. The machine's answers are compared to the known
answers, and the machine's accuracy can be measured. If the accuracy is high enough, the scientist may consider
actually employing the algorithm in the real world. A popular use with regression is to predict stock prices.
import quandl
import pandas as pd
df = quandl.get("WIKI/GOOGL")
print(df.head())
Date Open High Low ... Adj. Low Adj. Close Adj. Volume ...
[5 rows x 12 columns]
The adjusted columns are the most useful ones. The regular columns are prices on the day, but stocks have things
called stock splits, where suddenly 1 share becomes something like 2 shares, so the value of a share is halved
but the value of the company has not halved. Adjusted columns are adjusted for stock splits over time, which
makes them more reliable for analysis. (Here, Adj. Volume is the number of shares traded in a day.)
Adj. Close HL_PCT PCT-change Adj. Volume
Date
2004-08-19 50.322842 3.712563 0.324968 44659000.0
2004-08-20 54.322689 0.710922 7.227007 22834300.0
2004-08-23 54.869377 3.729433 -1.227880 18256100.0
2004-08-24 52.597363 6.417469 -5.726357 15247300.0
2004-08-25 53.164113 1.886792 1.183658 9188600.0
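The table above shows derived columns that are never constructed in these notes; a sketch of how they are typically built from the adjusted columns is below (the exact formulas are an assumption based on the column names and the feature description further on):
df = df[['Adj. Open', 'Adj. High', 'Adj. Low', 'Adj. Close', 'Adj. Volume']]

# Assumed definitions: high-low spread and open-to-close change, as percentages
df['HL_PCT'] = (df['Adj. High'] - df['Adj. Low']) / df['Adj. Close'] * 100.0
df['PCT-change'] = (df['Adj. Close'] - df['Adj. Open']) / df['Adj. Open'] * 100.0

df = df[['Adj. Close', 'HL_PCT', 'PCT-change', 'Adj. Volume']]
print(df.head())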
import math
import numpy as np
from sklearn import preprocessing
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
preprocessing is the module used to do some cleaning/scaling of data prior to machine learning, and
train_test_split is used in the testing stages. Finally, we're also importing the LinearRegression algorithm from
Scikit-learn, which will be used as the machine learning algorithm for our results.
Here, our features are actually: the current price, the high-minus-low percent, and the percent-change volatility. The
label shall be the price at some determined point in the future.
We're saying we want to forecast out 1% of the entire length of the dataset. Thus, if our data is 100 days of stock
prices, we want to be able to predict the price 1 day out into the future.
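The 'label' column used below is also never created in these notes; a sketch of the usual construction (assuming the label is the adjusted close shifted forecast_out rows into the future, with forecast_out set to 1% of the dataset length):
forecast_col = 'Adj. Close'
forecast_out = int(math.ceil(0.01 * len(df)))      # forecast 1% of the dataset length into the future

# The label for each row is the adjusted close forecast_out days later
df['label'] = df[forecast_col].shift(-forecast_out)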
It is a typical convention in machine learning code to define X (capital X) as the features, and y (lowercase y) as
the label that corresponds to the features.
X (the features) is defined as our entire dataframe EXCEPT for the label column, converted to a numpy array. We do
this using the .drop method, which can be applied to dataframes and returns a new dataframe. Next, the y variable is
defined, which is our label: simply the label column of the dataframe, converted to a numpy array.
X = np.array(df.drop(['label'], axis=1))
y = np.array(df['label'])
print( len(X), len(y))
3389 3389
Before moving on to training and testing, we're going to do some pre-processing. Generally, we want our features
in machine learning to be in a range of -1 to 1. This may do nothing, but it usually speeds up processing and can
also help with accuracy. Because this range is so popularly used, it is included in the preprocessing module of
Scikit-Learn. To utilize this, we can apply preprocessing.scale to our X variable:
X = preprocessing.scale(X)
train_test_split is a function in Sklearn model selection for splitting data arrays into two subsets: for training data
and for testing data. With this function, we don't need to divide the dataset manually. By default,
Sklearn train_test_split will make random partitions for the two subsets.
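The call itself (it also appears further below) splits off 20% of the rows for testing:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)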
The return here is the training set of features, testing set of features, training set of labels, and testing set of
labels. Now, we're ready to define our classifier. There are many classifiers in general available through Scikit-
Learn. We will use linear regression.
clf=LinearRegression()
Once we have defined the classifier, we're ready to train it. With Scikit-Learn (sklearn), we train with .fit:
clf.fit(X_train,y_train)
When we call the fit method, it estimates the best representative function for the data points (which could be a line,
a polynomial, or discrete borders around groups). With that representation, we can calculate new data points. Take
linear regression for example: when we call fit on a dataset of points, it gives us a function that represents the line
that best fits all the points. With that line function we can estimate other results. In general, fit finds the coefficients
for the equation specified by the algorithm being used.
To get the coefficient of determination of the prediction we can use the score() method as follows:
accuracy = clf.score(X_test,y_test)
print(accuracy)
0.9779914257461039
We can see the accuracy (the R² score) is about 0.978, i.e. roughly 98%.
sklearn.linear_model.LinearRegression is a statistical model that studies the relationship
between a dependent variable (Y) and a given set of independent variables (X). The relationship is
established by fitting a best line.
sklearn.linear_model.LinearRegression is the class used to implement linear regression. Its constructor
parameters include fit_intercept (whether to calculate the intercept for the model), copy_X (whether to copy the
feature matrix before fitting), and n_jobs (the number of CPU cores to use; -1 means use all cores). For example:
clf = LinearRegression(n_jobs=-1)
The same data can be used with other machine learning algorithms, such as SVM (here, support vector regression).
We need to choose a kernel, which can be linear, poly, rbf, or sigmoid; rbf is the default.
from sklearn import svm

# Try each kernel and compare the test-set scores
for k in ['linear', 'poly', 'rbf', 'sigmoid']:
    clf = svm.SVR(kernel=k)
    clf.fit(X_train, y_train)
    accuracy = clf.score(X_test, y_test)
    print(k, accuracy)
linear 0.960075071072
poly 0.63712232551
rbf 0.802831714511
sigmoid -0.125347960903
In prediction we do not know the exact date; we just predict, say, the next 10 days. So y is correct in that it is the
price, but it does not correspond to a specific date: it is simply the prediction for the next 10 days.
X = np.array(df.drop(['label'], axis=1))
X = preprocessing.scale(X)
X_lately = X[-forecast_out:]     # the most recent rows, which have no label yet - used for forecasting
X = X[:-forecast_out]            # keep only the rows that do have a label, for training and testing
df.dropna(inplace=True)
y = np.array(df['label'])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
clf=LinearRegression()
clf.fit(X_train,y_train)
accuracy = clf.score(X_test,y_test)
print(accuracy)
0.9779914257461039
forecast_set = clf.predict(X_lately)
print(forecast_set, accuracy, forecast_out)
import datetime
import matplotlib.pyplot as plt
from matplotlib import style
style.use('ggplot')
df['Forecast'] = np.nan   # the entire column starts out full of NaN
print(df.head())
Adj. Close HL_PCT PCT-change Adj. Volume label Forecast
Date
2004-08-19 50.322842 3.712563 0.324968 44659000.0 69.078238 NaN
2004-08-20 54.322689 0.710922 7.227007 22834300.0 67.839414 NaN
2004-08-23 54.869377 3.729433 -1.227880 18256100.0 68.912727 NaN
2004-08-24 52.597363 6.417469 -5.726357 15247300.0 70.668146 NaN
2004-08-25 53.164113 1.886792 1.183658 9188600.0 71.219849 NaN
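Between the head above and the tail below, the forecast values have to be written into new rows indexed by future dates; a sketch of that step (assuming consecutive calendar days, one per forecast value):
last_date = df.iloc[-1].name                 # last date in the existing data
next_unix = last_date.timestamp() + 86400    # start one day (86400 seconds) later

for value in forecast_set:
    next_date = datetime.datetime.fromtimestamp(next_unix)
    next_unix += 86400
    # NaN for every existing column, the forecast value in the 'Forecast' column
    df.loc[next_date] = [np.nan for _ in range(len(df.columns) - 1)] + [value]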
print(df.tail())
2018-03-10 05:30:00 NaN NaN ... NaN 1201.948128
2018-03-11 05:30:00 NaN NaN ... NaN 1138.138879
2018-03-12 05:30:00 NaN NaN ... NaN 1084.069635
[5 rows x 6 columns]
df['Adj. Close'].plot()
df['Forecast'].plot()
plt.legend(loc=4)   # loc=4 places the legend at the lower right
plt.xlabel('Date')
plt.ylabel('Price')
plt.show()
The definition of a simple straight line: y = mx + b, where m is the slope and b is the y-intercept.
The slope, m, of the best-fit line is defined as:

m = ( mean(x) * mean(y) - mean(x*y) ) / ( mean(x)^2 - mean(x^2) )
import statistics as st
import numpy as np

xs = np.array([1,2,3,4,5], dtype=np.float64)
ys = np.array([5,4,6,5,6], dtype=np.float64)

def best_fit_slope(xs, ys):
    m = (((st.mean(xs)*st.mean(ys)) - st.mean(xs*ys)) /
         ((st.mean(xs)**2) - st.mean(xs**2)))
    return m

m = best_fit_slope(xs, ys)
print(m)
from statistics import mean

xs = np.array([1,2,3,4,5], dtype=np.float64)
ys = np.array([5,4,6,5,6], dtype=np.float64)

def best_fit_slope_and_intercept(xs, ys):
    m = (((mean(xs)*mean(ys)) - mean(xs*ys)) /
         ((mean(xs)*mean(xs)) - mean(xs*xs)))
    b = mean(ys) - m*mean(xs)
    return m, b

m, b = best_fit_slope_and_intercept(xs, ys)
print(m, b)   # 0.3 4.3
Now, we have to create the best-fit line for the data points above:

regression_line = []
for x in xs:
    regression_line.append((m*x) + b)
Considering the above example of linear regression, we can to some extent easily judge how accurate the regression
line is. But when we have a dataset of 5 million data points, or a far more complex model, how accurate will our
result be? In such cases, eyeballing the best-fit line is not of much use. The standard way to check for errors is to use
squared errors, i.e. r squared, also called the coefficient of determination.
The distance between the regression line's y values and the data's y values is the error; we then square it. The
line's squared error is either the mean or the sum of these squared differences.
Our best-fit line equation is the result of a proof that is used to discover the calculation for the best-fit
regression line, where the regression line is the line that has the least squared error. In order to normalize
the error (errors can be either positive or negative), we square the errors. Another way is to use the
absolute value of the error. The squared version is preferred when outliers should be penalized more heavily
while making decisions; moreover, squared error is the convention most commonly followed.
Squared error on its own is relative to the dataset, so we need something more: the coefficient of determination.
The equation is essentially r² = 1 - (squared error of the regression line) / (squared error of the mean-y line). The
mean y line is quite literally the mean of all of the y values from the dataset; if we were to graph it, it would be a
flat, horizontal line. Thus, we compute the squared error of the mean y line and of the regression line. The objective
is to discern how much of the error is simply a result of variation in the data, as opposed to the regression line being
a poor fit. The ratio is subtracted from 1 to get a value between 0 and 1, read as a percentage. The goal is to have
r squared close to 1.
def squared_error(ys_orig, ys_line):
    return sum((ys_line - ys_orig) * (ys_line - ys_orig))

def coefficient_of_determination(ys_orig, ys_line):
    y_mean_line = [mean(ys_orig) for y in ys_orig]    # a flat line at the mean of y
    squared_error_regr = squared_error(ys_orig, ys_line)
    squared_error_y_mean = squared_error(ys_orig, y_mean_line)
    return 1 - (squared_error_regr / squared_error_y_mean)

r_squared = coefficient_of_determination(ys, regression_line)
print(r_squared)
If we care about predicting exact future values, r squared is indeed very useful. If we're interested in predicting
motion/direction, then our best fit line is actually pretty good so far, and r squared shouldn't carry as much
weight.