
UNIT III

TYPES OF MACHINE LEARNING ALGORITHMS

• SUPERVISED –
o Learn through examples of which we know the desired output (what we want to predict).
o Supervised learning is analogous to teaching a child to walk: you hold the child’s hand, show them
how to move one foot forward, walk alongside them as a demonstration, and so on, until the child
learns to walk on their own.
o Is this a cat or a dog?
o Are these emails spam or not?
o Predict the market value of houses, given the square meters, number of rooms, neighborhood,
etc.
o Types
▪ Classification - Output is a discrete variable (e.g., cat/dog)
▪ Regression - Output is continuous (e.g., price, temperature)

[Figure: examples of classification (discrete classes) and regression (continuous output)]
o Algorithms for Supervised Learning

• k-Nearest Neighbours
• Decision Trees
• Naive Bayes
• Logistic Regression
• Support Vector Machines
• UNSUPERVISED –
o There is no desired output. Learn something about the data. Latent relationships.
o In unsupervised learning, we do not specify a target variable to the machine; rather, we ask the
machine “What can you tell me about X?”. More specifically, we may ask questions such as, given
a huge data set X, “What are the five best groups we can make out of X?” or “What features occur
together most frequently in X?”.
o I have photos and want to put them in 20 groups.
o I want to find anomalies in the credit card usage patterns of my customers.
o Useful for learning structure in the data (clustering), hidden correlations, reduce dimensionality,
etc.

o Algorithms for Unsupervised Learning

o k-means clustering
o Cluster Identification
o DBSCAN Clustering (Density-based Spatial Clustering of Applications with Noise)
o Principal component analysis (PCA)
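
As a quick illustration of the first of these, here is a minimal k-means sketch; it assumes scikit-learn is installed, and the six 2-D points are made-up data, not from any particular dataset.

import numpy as np
from sklearn.cluster import KMeans

# made-up 2-D points forming two obvious groups
X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)       # cluster index assigned to each point
print(labels)                        # e.g. [1 1 1 0 0 0]
print(kmeans.cluster_centers_)       # coordinates of the two cluster centres

DBSCAN and PCA follow the same fit / fit_transform pattern in scikit-learn.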
• REINFORCEMENT –
o An agent interacts with an environment and watches the result of the interaction. Environment
gives feedback via a positive or negative reward signal.
o Consider training a pet dog to fetch a ball. We throw the ball a certain distance and ask the dog to
bring it back to us. Every time the dog does this correctly, we reward it. Gradually, the dog learns
that doing the task correctly earns a reward, and it starts doing the task correctly every time.
o Agent: It is an assumed entity which performs actions in an environment to gain some reward.
o Environment (e): A scenario that an agent has to face.
o Reward (R): An immediate return given to the agent when it performs a specific action or task.
o State (s): State refers to the current situation returned by the environment.
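
The definitions above can be made concrete with a tiny sketch of the agent-environment loop; the ToyEnvironment class and its reward rule are invented purely for illustration and are not part of any reinforcement learning library.

import random

class ToyEnvironment:
    """Hypothetical environment (e): returns a new state and a reward for each action."""
    def __init__(self):
        self.state = 0                                    # state (s)
    def step(self, action):
        reward = 1 if action == self.state % 2 else -1    # reward (R)
        self.state += 1
        return self.state, reward

env = ToyEnvironment()
total_reward = 0
for _ in range(10):
    action = random.choice([0, 1])       # the agent chooses an action
    state, reward = env.step(action)     # the environment responds with feedback
    total_reward += reward               # the agent's objective is to maximize this
print(total_reward)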

STEPS TO SOLVE A MACHINE LEARNING PROBLEM

o Data Gathering
• Collect data from various sources
• The more the better: Some algorithms need large amounts of data to be useful (e.g., neural
networks).
• The quantity and quality of data dictate the model accuracy
o Data Preprocessing
• Clean data to have homogeneity
• Is there anything wrong with the data?
▪ Missing values
▪ Outliers
▪ Bad encoding (for text)
▪ Wrongly-labelled examples
▪ Biased data
▪ Do I have many more samples of one class than the rest?
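
A minimal pandas sketch of these checks, using a made-up five-row dataframe (the column names are illustrative only):

import numpy as np
import pandas as pd

# toy dataframe standing in for a real dataset
df = pd.DataFrame({'height': [1.7, 1.6, np.nan, 1.8, 9.9],    # one missing value, one outlier
                   'label':  ['cat', 'cat', 'cat', 'cat', 'dog']})

print(df.isnull().sum())            # missing values per column
print(df.describe())                # min/max ranges help spot outliers
print(df['label'].value_counts())   # class balance - are we biased towards one class?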

o Feature Engineering
• A feature is an individual measurable property of a phenomenon being observed
• Our inputs are represented by a set of features.
• To classify spam email, features could be (a toy extraction sketch follows this list):
▪ Number of words that have been ch4ng3d like this.
▪ Language of the email (0=English, 1=Spanish)
▪ Number of emojis
• Extracts more information from existing data, makes data more useful
• With good features, most algorithms can learn faster
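
As a toy sketch of turning raw email text into the spam features listed above (the helper function, the word-obfuscation test, and the emoji list are all made up for illustration):

def extract_features(email_text, language_code):
    # count words that mix digits and letters, like "ch4ng3d"
    words = email_text.split()
    obfuscated = sum(1 for w in words
                     if any(c.isdigit() for c in w) and any(c.isalpha() for c in w))
    # count a small, illustrative set of emojis
    emojis = sum(email_text.count(e) for e in ['🙂', '😀', '🎉'])
    return [obfuscated, language_code, emojis]

print(extract_features("Your acc0unt was ch4ng3d 🎉", 0))   # -> [2, 0, 1]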
o Algorithm Selection
• Selecting the right machine learning model
• Supervised learning models - the outcome is known, so we refine the model until its output reaches the
desired accuracy level. Unsupervised learning models - the outcome is unknown, so the data is first
grouped (clustered) before useful conclusions can be drawn. Reinforcement learning - learning by trial
and error, which suits decision-making in business environments.
o Training
• “Learning” is done at this stage. We use the part of the data set allocated for training to teach our
model to differentiate between the two fruits (say, apples and oranges, the example used in the
sections below). If we view our model in mathematical terms, the inputs, i.e. our two features, would
have coefficients. These coefficients are called the weights of the features. There would also be a
constant or y-intercept involved; this is referred to as the bias of the model. The process of
determining their values is one of trial and error. Initially, we pick random
values for them and provide inputs. The achieved output is compared with actual output and the
difference is minimized by trying different values of weights and biases. The iterations are
repeated using different entries from our training data set until the model reaches the desired
level of accuracy.
• Training requires patience and experimentation. It is also useful to have knowledge of the field
where the model would be implemented. For instance, if a machine learning model is to be used
for identifying high risk clients for an insurance company, the knowledge of how the insurance
industry operates would expedite the process of training as more educated guesses can be made
during the iterations. Training can prove to be highly rewarding if the model starts to succeed in
its role. It is comparable to when a child learns to ride a bicycle. Initially, they may have multiple
falls but, after a while, they develop a better grasp of the process and are able to react better to
different situations while riding the bicycle
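
A minimal sketch of this trial-and-error idea, assuming two made-up features and made-up target values: start from random weights and a bias, and keep any change that reduces the error between the achieved and actual outputs.

import random

X = [(1.0, 2.0), (2.0, 1.0), (3.0, 3.0), (4.0, 2.0)]   # two features per sample
y = [5.0, 4.0, 9.0, 8.0]                               # desired (actual) outputs

def error(w1, w2, b):
    # total squared difference between achieved output and actual output
    return sum((w1 * x1 + w2 * x2 + b - t) ** 2 for (x1, x2), t in zip(X, y))

w1, w2, b = random.random(), random.random(), random.random()   # random initial values
best = error(w1, w2, b)
for _ in range(20000):                                  # repeated iterations over the data
    cand = [v + random.uniform(-0.1, 0.1) for v in (w1, w2, b)]
    if error(*cand) < best:                             # keep only changes that reduce the error
        (w1, w2, b), best = cand, error(*cand)
print(w1, w2, b, best)                                  # weights, bias, remaining error

In practice, libraries replace this random search with gradient descent, but the overall loop - guess, measure the error, adjust - is the same.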
o Evaluation
• After the model is trained, it needs to be tested to see if it would operate well in real world
situations. That is why the part of the data set created for evaluation is used to check the model’s
proficiency. This puts the model in a scenario where it encounters situations that were not a part
of its training. In our case, it could mean trying to identify a type of an apple or an orange that is
completely new to the model. However, through its training, the model should be capable
enough to extrapolate the information and deem whether the fruit is an apple or an orange.
• Evaluation becomes highly important when it comes to commercial applications. Evaluation
allows data scientists to check whether the goals they set out to achieve were met or not. If the
results are not satisfactory, then the prior steps need to be revisited so that the root cause behind the
model’s underperformance can be identified and, subsequently, rectified. If the evaluation is not
done properly, the model may not excel at fulfilling its desired commercial purpose. This could mean
that the company that designed and sold the model loses goodwill with the client. It could also mean
damage to the company’s reputation, as future clients may become hesitant about trusting the
company’s acumen regarding machine learning models. Therefore, evaluation of the model is
essential for avoiding the aforementioned ill effects.
o Hyperparameter Tuning
• If the evaluation is successful, we proceed to the step of hyperparameter tuning. This step tries to
improve upon the positive results achieved during the evaluation step. For our example, we
would see if we can make our model even better at recognizing apples and oranges. There are
different ways we can go about improving the model. One of them is revisiting the training step and
using multiple sweeps of the training data set to train the model. This could lead to greater
accuracy as the longer duration of training provides more exposure and improves quality of the
model. Another way to go about it is refining the initial values given to the model. Random initial
values often produce poor results as they are gradually refined by trial and error. However, if we
can come up with better initial values or perhaps initiate the model using a distribution instead of
a value then our results could get better. There are other parameters that we could play around
with in order to refine the model.
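
One common way to automate this search is scikit-learn's GridSearchCV, sketched below on a made-up dataset with an illustrative parameter grid (the SVR estimator and the specific grid values are assumptions, not the only choices):

import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV

rng = np.random.RandomState(0)
X = rng.rand(100, 2)                                   # toy features
y = 3 * X[:, 0] + X[:, 1] + 0.05 * rng.rand(100)       # toy target

param_grid = {'kernel': ['linear', 'rbf'], 'C': [0.1, 1, 10]}
search = GridSearchCV(SVR(), param_grid, cv=5)         # 5-fold cross-validated grid search
search.fit(X, y)
print(search.best_params_, search.best_score_)         # best hyperparameters found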

REGRESSION
• Regression - to find the relationship between variables.
• In Machine Learning, and in statistical modeling, that relationship is used to predict the outcome of future
events.
• Linear regression uses the relationship between the data points to draw a straight line through them.
The line can be used to predict future values.

import matplotlib.pyplot as plt

x = [5,7,8,7,2,17,2,9,4,11,12,9,6]
y = [99,86,87,88,111,86,103,87,94,78,77,85,86]
plt.scatter(x, y)
plt.show()

import matplotlib.pyplot as plt
from scipy import stats
x = [5,7,8,7,2,17,2,9,4,11,12,9,6]
y = [99,86,87,88,111,86,103,87,94,78,77,85,86]

slope, intercept, r, p, std_err = stats.linregress(x, y)

def myfunc(x):
    return slope * x + intercept

mymodel = list(map(myfunc, x))

plt.scatter(x, y)
plt.plot(x, mymodel)
plt.show()

It is important to know what the relationship between the values on the x-axis and the values on the y-axis is; if
there is no relationship, linear regression cannot be used to predict anything.

This relationship - the coefficient of correlation - is called r. The r value ranges from -1 to 1, where 0 means no
relationship, and 1 (or -1) means the variables are fully related.

For the example above, r comes out to be -0.75859. This shows that there is a relationship - not perfect, but strong
enough that we could use linear regression for future predictions.

Another eg.,

from scipy import stats


x = [5,7,8,7,2,17,2,9,4,11,12,9,6]
y = [99,86,87,88,111,86,103,87,94,78,77,85,86]

slope, intercept, r, p, std_err = stats.linregress(x, y)

def myfunc(x):
    return slope * x + intercept

speed = myfunc(10)
print(speed)

The predicted speed comes out to be 85.6, which can also be read off the fitted line in the plot above.

Another eg.,

import matplotlib.pyplot as plt


from scipy import stats
x = [89,43,36,36,95,10,66,34,38,20,26,29,48,64,6,5,36,66,72,40]
y = [21,46,3,35,67,95,53,72,58,10,26,34,90,33,38,20,56,2,47,15]

slope, intercept, r, p, std_err = stats.linregress(x, y)


print(r)
def myfunc(x):
    return slope * x + intercept

mymodel = list(map(myfunc, x))

plt.scatter(x, y)
plt.plot(x, mymodel)
plt.show()

The result: 0.013 indicates a very bad relationship, and tells us that this data set is not suitable for linear
regression. If data points clearly will not fit a linear regression (a straight line through all data points), it might be
ideal for polynomial regression. Polynomial regression, like linear regression, uses the relationship between the
variables x and y to find the best way to draw a line through the data points.

Features and Labels

• Feature is an individual measurable property or characteristic of a phenomenon.


• Briefly, feature is input; label is output. This applies to both classification and regression problems. A
feature is one column of the data in your input set. For instance, if we're trying to predict the type of pet
someone will choose, our input features might include age, home region, family income, etc.
• The output we get from our model after training it is called a label.
• Feature engineering is the process of creating new input features for machine learning. Features are
extracted from raw data.
• Feature selection is the process of identifying and selecting a subset of input variables that are most
relevant to the target variable. Perhaps the simplest case is when there are numerical input variables and
a numerical target for regression predictive modeling (a sketch of this case follows the list).
• Simple linear regression just takes a single feature, while multiple linear regression takes multiple x
values.
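
For that simplest case - numerical inputs and a numerical target - here is a brief scikit-learn sketch using SelectKBest with the f_regression score; the toy data below is made up so that only two of the five features actually matter.

import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.RandomState(0)
X = rng.rand(100, 5)                                    # five candidate input features
y = 4 * X[:, 0] + 2 * X[:, 2] + 0.05 * rng.rand(100)    # target depends only on features 0 and 2

selector = SelectKBest(score_func=f_regression, k=2)
X_selected = selector.fit_transform(X, y)               # keep the 2 most relevant features
print(selector.get_support())                           # e.g. [ True False  True False False]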

Regression is a form of supervised machine learning, in which the scientist teaches the machine by showing
it features together with the correct answers, over and over. Once the
machine is taught, the scientist will usually "test" the machine on some unseen data, where the scientist still
knows what the correct answer is, but the machine doesn't. The machine's answers are compared to the known
answers, and the machine's accuracy can be measured. If the accuracy is high enough, the scientist may consider
actually employing the algorithm in the real world. A popular use with regression is to predict stock prices.

import quandl
import pandas as pd
df = quandl.get("WIKI/GOOGL")
print(df.head())

              Open    High     Low  ...   Adj. Low  Adj. Close  Adj. Volume
Date                                ...
2004-08-19  100.01  104.06   95.96  ...  48.128568   50.322842   44659000.0
2004-08-20  101.01  109.08  100.50  ...  50.405597   54.322689   22834300.0
2004-08-23  110.76  113.48  109.05  ...  54.693835   54.869377   18256100.0
2004-08-24  111.24  111.60  103.57  ...  51.945350   52.597363   15247300.0
2004-08-25  104.76  108.00  103.88  ...  52.100830   53.164113    9188600.0

[5 rows x 12 columns]

Adjusted columns are the most ideal ones. Regular columns here are prices on the day, but stocks have things
called stock splits, where suddenly 1 share becomes something like 2 shares, thus the value of a share is halved,
but the value of the company has not halved. Adjusted columns are adjusted for stock splits over time, which
makes them more reliable for analysis. (Here, Adj. Volume is the number of shares traded in a day.)

df = df[['Adj. Open', 'Adj. High', 'Adj. Low', 'Adj. Close', 'Adj. Volume']]

df['HL_PCT'] = (df['Adj. High'] - df['Adj. Close']) / df['Adj. Close'] * 100.0
df['PCT-change'] = (df['Adj. Close'] - df['Adj. Open']) / df['Adj. Open'] * 100.0
df = df[['Adj. Close', 'HL_PCT', 'PCT-change', 'Adj. Volume']]
print(df.head())

Adj. Close HL_PCT PCT-change Adj. Volume
Date
2004-08-19 50.322842 3.712563 0.324968 44659000.0
2004-08-20 54.322689 0.710922 7.227007 22834300.0
2004-08-23 54.869377 3.729433 -1.227880 18256100.0
2004-08-24 52.597363 6.417469 -5.726357 15247300.0
2004-08-25 53.164113 1.886792 1.183658 9188600.0

import math
import numpy as np
from sklearn import preprocessing
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

preprocessing is the module used to do some cleaning/scaling of data prior to machine learning, and
train_test_split is used in the testing stages. Finally, we're also importing the LinearRegression algorithm from
Scikit-learn, which will be used as the machine learning algorithm for our results.

Here, our features are: current price, high-minus-low percent, and the percent-change volatility. The label is the
price at some determined point in the future.

forecast_col = 'Adj. Close'


df.fillna(value=-99999, inplace=True)
forecast_out = int(math.ceil(0.01 * len(df)))
df['label'] = df[forecast_col].shift(-forecast_out)

df.dropna(inplace=True)  # drop rows with NaN from the dataframe


print(df.head())

Adj. Close HL_PCT PCT-change Adj. Volume label


Date
2004-08-19 50.322842 3.712563 0.324968 44659000.0 69.078238
2004-08-20 54.322689 0.710922 7.227007 22834300.0 67.839414
2004-08-23 54.869377 3.729433 -1.227880 18256100.0 68.912727
2004-08-24 52.597363 6.417469 -5.726357 15247300.0 70.668146
2004-08-25 53.164113 1.886792 1.183658 9188600.0 71.219849

We're saying we want to forecast out 1% of the entire length of the dataset. Thus, if our data is 100 days of stock
prices, we want to be able to predict the price 1 day out into the future.
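
To make the shift step concrete, here is a tiny standalone sketch on made-up prices (not the stock dataframe above); shifting the column up by two rows makes each row's label the price two rows into the future and leaves NaN labels at the end, which is why dropna() is called before training.

import pandas as pd

toy = pd.DataFrame({'price': [10, 11, 12, 13, 14]})
toy['label'] = toy['price'].shift(-2)    # label = price two rows into the future
print(toy)                               # the last two labels are NaN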

Training & Testing

It is a typical standard with machine learning in code to define X (capital x), as the features, and y (lowercase y) as
the label that corresponds to the features.

X (features) is defined as our entire dataframe EXCEPT for the label column, converted to a numpy array. We do
this using the .drop method, which can be applied to dataframes and returns a new dataframe. Next, the y variable
(our label) is defined as simply the label column of the dataframe, converted to a numpy array.

X = np.array(df.drop(['label'], 1))
y = np.array(df['label'])
print( len(X), len(y))
3389 3389
Before moving on to training and testing, we're going to do some pre-processing. Generally, we want our features
in machine learning to be in a range of -1 to 1. This may do nothing, but it usually speeds up processing and can
also help with accuracy. Because this range is so popularly used, it is included in the preprocessing module of
Scikit-Learn. To utilize this, we can apply preprocessing.scale to our X variable:

X = preprocessing.scale(X)

train_test_split is a function in Sklearn model selection for splitting data arrays into two subsets: for training data
and for testing data. With this function, we don't need to divide the dataset manually. By default,
Sklearn train_test_split will make random partitions for the two subsets.

train_test_split(X, y, train_size=0.*,test_size=0.*, random_state=*)

• X, y. The first parameter is the dataset we're selecting to use.


• train_size. This parameter sets the size of the training dataset. There are three options: None, which is
the default, an int, which gives the exact number of samples, and a float between 0.0 and 1.0, which gives
the proportion of the dataset.
• test_size. This parameter specifies the size of the testing dataset. By default it complements the training
size, and it is set to 0.25 when the training size is also left at its default.
• random_state. By default the split is random (performed using np.random). Passing an integer instead
makes the split reproducible.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

The return here is the training set of features, testing set of features, training set of labels, and testing set of
labels. Now, we're ready to define our classifier. There are many classifiers in general available through Scikit-
Learn. We will use linear regression.

clf=LinearRegression()

Once we have defined the classifer, we're ready to train it. With Scikit-Learn (sklearn), we train with .fit:

clf.fit(X_train,y_train)

When we call the fit method, it estimates the best representative function for the data points (which could be a
line, a polynomial, or discrete borders around groups). With that representation, we can calculate new data points.
Take linear regression for example: when we call fit on a dataset of points, it gives us a function representing the
line that best fits the points. With that line function we can estimate other results. In short, fit finds the
coefficients for the equation specified by the algorithm being used.
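
If we want to look at what fit found, the learned coefficients and intercept can be read off the trained LinearRegression object (these attributes only exist after fit has been called):

print(clf.coef_)        # one learned weight per input feature
print(clf.intercept_)   # the learned bias / y-intercept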

Our classifier is now trained. Now we can test it!

To get the coefficient of determination of the prediction, we can use the score() method as follows:

accuracy = clf.score(X_test,y_test)
print(accuracy)
0.9779914257461039
We can see the accuracy is about 0.978, i.e. roughly 98%.

sklearn.linear_model.LinearRegression models the relationship between a dependent variable (Y) and a given set
of independent variables (X) by fitting a best-fit line; it is the module used to implement linear regression. The
following table lists the parameters used by the LinearRegression module:

Sr.No  Parameter & Description

1  fit_intercept - Boolean, optional, default True
   Used to calculate the intercept for the model. No intercept will be used in the calculation if this is set to False.

2  normalize - Boolean, optional, default False
   If this parameter is set to True, the regressor X will be normalized before regression. The normalization is done
   by subtracting the mean and dividing by the L2 norm. If fit_intercept = False, this parameter is ignored.

3  copy_X - Boolean, optional, default True
   By default this is True, which means X will be copied. If it is set to False, X may be overwritten.

4  n_jobs - int or None, optional, default None
   The number of jobs to use for the computation. With n_jobs, an algorithm that can be threaded can run with
   higher performance; if we pass -1, the algorithm will use all available threads.

clf = LinearRegression(n_jobs=-1)

The same data can be used with other machine learning algorithms, such as SVM (here, support vector regression).
We need to choose a kernel, which can be linear, poly, rbf, or sigmoid; rbf is the default.

from sklearn import svm   # needed for svm.SVR

for k in ['linear', 'poly', 'rbf', 'sigmoid']:
    clf = svm.SVR(kernel=k)
    clf.fit(X_train, y_train)
    accuracy = clf.score(X_test, y_test)
    print(k, accuracy)

linear 0.960075071072
poly 0.63712232551
rbf 0.802831714511
sigmoid -0.125347960903

It is quite evident that the linear kernel performed the best.

Forecasting and Predicting

In prediction we do not know the exact dates in advance; we simply forecast, say, the next 10 days. So y is correct -
it is the price - but it does not correspond to a specific date in the data; it is predicted for each of the coming days.

X = np.array(df.drop(['label'], 1))
X = preprocessing.scale(X)

X_lately = X[-forecast_out:]
df.dropna(inplace=True)
y = np.array(df['label'])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
clf=LinearRegression()
clf.fit(X_train,y_train)
accuracy = clf.score(X_test,y_test)
print(accuracy)
0.9779914257461039
forecast_set = clf.predict(X_lately)
print(forecast_set, accuracy, forecast_out)

[1076.46498923 1091.24242607 1105.18083269 1100.23008741 1093.83987146
 1091.09429591 1089.51684425 1086.57132285 1080.30542794 1075.88015771
 1073.59999077 1091.96338526 1110.4256518  1115.73520437 1130.50350186
 1134.5933292  1133.69274224 1130.87599329 1132.84437211 1150.28553626
 1150.06031949 1160.45075311 1157.098267   1164.65845891 1184.02555688
 1197.24805468 1191.77456851 1203.63310946 1209.46647331 1206.94226105
 1197.71032614 1203.68897129 1201.94812829 1138.13887908 1084.06963491] 0.9779914257461039 35

import datetime
import matplotlib.pyplot as plt
from matplotlib import style

style.use('ggplot')
df['Forecast'] = np.nan                  # new column, initially all NaN

last_date = df.iloc[-1].name             # last date in the dataframe
last_unix = last_date.timestamp()
one_day = 86400                          # number of seconds in a day
next_unix = last_unix + one_day

# give each forecast value its own future date so the x-axis can show dates
for i in forecast_set:
    next_date = datetime.datetime.fromtimestamp(next_unix)   # the next future date
    next_unix += 86400
    # the feature columns of these future rows are set to NaN;
    # only the final column (Forecast) receives the forecast value i
    df.loc[next_date] = [np.nan for _ in range(len(df.columns) - 1)] + [i]

print(df.head())
Adj. Close HL_PCT PCT-change Adj. Volume label Forecast
Date
2004-08-19 50.322842 3.712563 0.324968 44659000.0 69.078238 NaN
2004-08-20 54.322689 0.710922 7.227007 22834300.0 67.839414 NaN
2004-08-23 54.869377 3.729433 -1.227880 18256100.0 68.912727 NaN
2004-08-24 52.597363 6.417469 -5.726357 15247300.0 70.668146 NaN
2004-08-25 53.164113 1.886792 1.183658 9188600.0 71.219849 NaN

print(df.tail())

Adj. Close HL_PCT ... label Forecast


Date ...
2018-03-08 05:30:00 NaN NaN ... NaN 1197.710326
2018-03-09 05:30:00 NaN NaN ... NaN 1203.688971

2018-03-10 05:30:00 NaN NaN ... NaN 1201.948128
2018-03-11 05:30:00 NaN NaN ... NaN 1138.138879
2018-03-12 05:30:00 NaN NaN ... NaN 1084.069635

[5 rows x 6 columns]

df['Adj. Close'].plot()
df['Forecast'].plot()
plt.legend(loc=4)   # legend in the bottom-right corner
plt.xlabel('Date')
plt.ylabel('Price')
plt.show()

REGRESSION – HOW TO PROGRAM THE BEST FIT SLOPE

The definition of a simple straight line: y = mx + b, where m is the slope and b is the y-intercept.
The slope, m, of the best-fit line is defined as:

m = ( mean(x) * mean(y) - mean(x*y) ) / ( mean(x)^2 - mean(x^2) )
import statistics as st
import numpy as np

xs = np.array([1,2,3,4,5], dtype=np.float64)
ys = np.array([5,4,6,5,6], dtype=np.float64)

def best_fit_slope(xs, ys):
    m = (((st.mean(xs) * st.mean(ys)) - st.mean(xs * ys)) /
         ((st.mean(xs) ** 2) - st.mean(xs ** 2)))
    return m

m = best_fit_slope(xs,ys)
print(m)

REGRESSION – HOW TO PROGRAM THE BEST FIT LINE


The calculation for the best-fit line's y-intercept is: b = mean(y) - m * mean(x)

from statistics import mean


import numpy as np

xs = np.array([1,2,3,4,5], dtype=np.float64)
ys = np.array([5,4,6,5,6], dtype=np.float64)

def best_fit_slope_and_intercept(xs, ys):
    m = (((mean(xs) * mean(ys)) - mean(xs * ys)) /
         ((mean(xs) * mean(xs)) - mean(xs * xs)))

    b = mean(ys) - m * mean(xs)

    return m, b

m, b = best_fit_slope_and_intercept(xs,ys)

print(m,b)

Now, we have to create the best-fit line for these data points.

regression_line = []
for x in xs:
    regression_line.append((m * x) + b)

import matplotlib.pyplot as plt


from matplotlib import style
style.use('ggplot')
plt.scatter(xs,ys,color='#003F72',label='data')
plt.plot(xs, regression_line, label='regression line')
plt.legend(loc=4)
plt.show()

Regression - R Squared and Coefficient of Determination Theory

Considering the above example of linear regression, we can to some extent judge how accurate the regression
line is. But when we have a dataset of 5 million data points, or a far more complex model, how accurate will our
result be? In such cases, eyeballing even the best-fit line is not of much use. The standard way to check the error
is with squared errors, summarized by r squared, the coefficient of determination.

The distance between the regression line's y values and the data's y values is the error; we then square it. The
line's squared error is either the mean or the sum of these squared distances.
Our best-fit line equation is the result of a proof used to derive the calculation for the best-fit regression line,
where the regression line is the line with the least squared error. We square the errors in order to normalize
them, since errors can be either positive or negative. Another option is to use the absolute value of the error;
squaring is preferred when outliers should weigh more heavily in the decision, and squared error is by far the
most commonly used choice.

Squared error on its own is relative to the dataset, so we need something more: the coefficient of determination,

r^2 = 1 - SE(regression line) / SE(mean of y)
The equation is essentially 1 minus the ratio of the squared error of the regression line to the squared error of
the mean-y line. The mean-y line is quite literally the mean of all of the y values from the dataset; if we were to
graph it, it would be a flat, horizontal line. Thus, we compute the squared error of the regression line and of the
average y. The objective is to discern how much of the error is simply a result of variation in the data itself, as
opposed to the regression line being a poor fit. Subtracting the ratio from 1 gives a value between 0 and 1 (often
read as a percentage); the goal is to have r squared close to 1.

def squared_error(ys_orig, ys_line):
    return sum((ys_line - ys_orig) * (ys_line - ys_orig))

def coefficient_of_determination(ys_orig, ys_line):
    y_mean_line = [mean(ys_orig) for y in ys_orig]
    squared_error_regr = squared_error(ys_orig, ys_line)
    squared_error_y_mean = squared_error(ys_orig, y_mean_line)
    return 1 - (squared_error_regr / squared_error_y_mean)

r_squared = coefficient_of_determination(ys,regression_line)
print(r_squared)
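
As a quick hand check of these formulas on the sample data above (xs = [1,2,3,4,5], ys = [5,4,6,5,6]): m = (3*5.2 - 16.2) / (9 - 11) = 0.3 and b = 5.2 - 0.3*3 = 4.3, so the regression line's y values are [4.6, 4.9, 5.2, 5.5, 5.8]. Its squared error is 1.9, the mean-y line's squared error is 2.8, and r_squared = 1 - 1.9/2.8 ≈ 0.32 - so the printed value should be roughly 0.32, meaning the line explains only about a third of the variation.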

If we care about predicting exact future values, r squared is indeed very useful. If we're interested in predicting
motion/direction, then our best fit line is actually pretty good so far, and r squared shouldn't carry as much
weight.
