CS-605 Data - Analytics - Lab Complete Manual (2) - 1672730238
LAB MANUAL
(CS 605)
(Data Analytics Lab)
VI SEM (CS)
CHAMELI DEVI GROUP OF INSTITUTIONS, INDORE. DEPARTMENT OF COMPUTER SCIENCE & ENGG
DATA ANALYTICS LABORATORY 2
DEPARTMENT OF
COMPUTER SCIENCE & ENGINEERING
CERTIFICATE
Gandhi Proudhyogiki Vishwavidhyalaya, Bhopal for ……… Semester of the Computer Science & Engineering
Signature of
Faculty In-charge
INDEX
Sl. No.   Name of the Experiment   Date of Conduction   Signature of Faculty-in-Charge
1         Introduction to R programming language with its installation and packages.
EXPT. No. – 1: Introduction to R programming language with its installation and packages.
To Install RStudio
1. Go to "www.rstudio.com" and click on the "Download RStudio" button.
2. Click on "Download RStudio Desktop."
3. Click on the version recommended for your system, or the latest Mac version, save the .dmg file on your
computer, double-click it to open, and then drag and drop it to your applications folder.
To Install R Packages
The directory where packages are stored is called the library. R comes with a standard set of packages. Others
are available for download and installation. Once installed, they have to be loaded into the session to be used.
.libPaths() # get library location
library() # see all packages installed
search() # see packages currently loaded
Installing and Loading Packages
It turns out the ability to estimate ordered logistic or probit regression is included in the MASS (Modern
Applied Statistics with S) package.
To install this package, run the following command:
> install.packages("MASS")
Installing a package does not make it available automatically. You have to load the package each time you start an
R session:
> library("MASS")
R can now use all the functions provided by the MASS package. To see which functions are implemented in
the MASS package, type:
> library(help = "MASS")
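The install-then-load steps above can be combined into a small check. The following is a minimal sketch, assuming MASS is the package of interest; the variable names pkg and installed are illustrative and not part of the original manual.

```r
# Name of the package to check; MASS is the package used in this experiment.
pkg <- "MASS"

# requireNamespace() returns TRUE when the package is installed and loadable.
installed <- requireNamespace(pkg, quietly = TRUE)

if (installed) {
  # character.only = TRUE lets library() take the name from a variable.
  library(pkg, character.only = TRUE)
  message(pkg, " is loaded")
} else {
  message(pkg, " is not installed; run install.packages(\"", pkg, "\") first")
}
```

Checking before loading avoids the error that library() raises when the package is missing.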
The Workspace
To keep different projects in different physical directories, here are some standard commands for managing
your workspace.
getwd() # print the current working directory.
ls() # list the objects in the current workspace.
setwd(mydirectory) # change to the directory stored in mydirectory
setwd("c:/docs/mydir") # note / instead of \ in Windows
# View and set options for the session
help(options) # learn about available options
options() # view current option settings
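The workspace commands can be tried safely as in the sketch below. It assumes a temporary directory stands in for a real project folder; old_dir, scratch_value and has_scratch are illustrative names.

```r
# Remember the current working directory so it can be restored afterwards.
old_dir <- getwd()

# tempdir() is used only for illustration; substitute your own project folder.
setwd(tempdir())
print(getwd())

# Create an object and confirm that ls() lists it.
scratch_value <- 42
has_scratch <- "scratch_value" %in% ls()
print(has_scratch)

# Return to the original directory.
setwd(old_dir)
```

Restoring the original directory at the end keeps later file reads from silently pointing at the wrong folder.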
Viva Questions:
1. What is the importance of R Programming?
2. Explain the features of R.
3. How to add R Packages in a program?
4. How to get currently loaded packages information in R?
5. Explain the process of listing objects in current workspace.
EXPT. No. – 2: Write a program to implement various data structures of R.
Vectors
To create a vector with more than one element, use the c() function, which combines the elements
into a vector.
# Create a vector.
apple <- c('red','green',"yellow")
print(apple)
Lists
A list is an R-object which can contain many different types of elements inside it like vectors, functions and
even another list inside it.
# Create a list.
list1 <- list(c(2,5,3),21.3,sin)
# Print the list.
print(list1)
The above code will produce the following result −
[[1]]
[1] 2 5 3
[[2]]
[1] 21.3
[[3]]
function (x) .Primitive("sin")
Matrices
A matrix is a two-dimensional rectangular data set. It can be created using a vector input to the matrix
function.
# Create a matrix.
M = matrix( c('a','a','b','c','b','a'), nrow = 2, ncol = 3, byrow = TRUE)
print(M)
The above code will produce the following result −
[,1] [,2] [,3]
[1,] "a" "a" "b"
[2,] "c" "b" "a"
Arrays
While matrices are confined to two dimensions, arrays can have any number of dimensions. The array() function
takes a dim attribute which sets the size of each dimension. In the example below we create an
array of two 3x3 matrices; the two input values are recycled to fill all 18 cells.
# Create an array.
a <- array(c('green','yellow'),dim = c(3,3,2))
print(a)
The above code will produce the following result −
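, , 1

     [,1]     [,2]     [,3]
[1,] "green"  "yellow" "green"
[2,] "yellow" "green"  "yellow"
[3,] "green"  "yellow" "green"

, , 2

     [,1]     [,2]     [,3]
[1,] "yellow" "green"  "yellow"
[2,] "green"  "yellow" "green"
[3,] "yellow" "green"  "yellow"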
Factors
Factors are R objects created from a vector. A factor stores the vector along with the distinct values of
its elements as labels. The labels are always character, irrespective of whether the input vector is numeric,
character or Boolean. Factors are useful in statistical modeling.
Factors are created using the factor() function. The nlevels() function gives the count of levels.
# Create a vector.
apple_colors <- c('green','green','yellow','red','red','red','green')
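The vector above can be turned into a factor as in the following minimal sketch; the name factor_apple is illustrative.

```r
# Create a vector.
apple_colors <- c('green','green','yellow','red','red','red','green')

# Create a factor object from the vector.
factor_apple <- factor(apple_colors)

# Print the factor, its distinct levels and the number of levels.
print(factor_apple)
print(levels(factor_apple))
print(nlevels(factor_apple))
```

By default the levels are the sorted distinct values, so this factor has the three levels green, red and yellow.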
Data Frames
Data frames are tabular data objects. Unlike a matrix, each column of a data frame can contain a different mode of
data: the first column can be numeric while the second column can be character and the third column can be
logical. A data frame is a list of vectors of equal length.
Data Frames are created using the data.frame() function.
# Create the data frame.
BMI <- data.frame(
gender = c("Male", "Male","Female"),
height = c(152, 171.5, 165),
weight = c(81,93, 78),
Age = c(42,38,26)
)
print(BMI)
The above code will produce the following result −
gender height weight Age
1 Male 152.0 81 42
2 Male 171.5 93 38
3 Female 165.0 78 26
Variable
A variable provides named storage that the programs can manipulate.
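# Assignment using equal operator.
var.1 = c(0,1,2,3)
# Assignment using leftward operator.
var.2 <- c("learn","R")
# Assignment using rightward operator.
c(TRUE,1) -> var.3
print(var.1)
cat ("var.1 is ", var.1 ,"\n")
cat ("var.2 is ", var.2 ,"\n")
cat ("var.3 is ", var.3 ,"\n")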
The above code will produce the following result −
[1] 0 1 2 3
var.1 is 0 1 2 3
var.2 is learn R
var.3 is 1 1
Viva Questions:
1. Describe data types in R.
2. How to declare variables in R?
3. What is the difference between array and matrices?
4. Explain the types of modes of R-object.
5. Differentiate between List and Vector.
Theory:
An operator is a symbol that tells the interpreter to perform a specific mathematical or logical manipulation.
The following table shows the arithmetic operators supported by the R language. The operators act on each
element of the vector.
Operator: +
Description: Adds two vectors.
Example:
v <- c( 2,5.5,6)
t <- c(8, 3, 4)
print(v+t)
It produces the following result −
[1] 10.0 8.5 10.0

Operator: -
Description: Subtracts second vector from the first.
Example:
v <- c( 2,5.5,6)
t <- c(8, 3, 4)
print(v-t)
It produces the following result −
[1] -6.0 2.5 2.0

Operator: *
Description: Multiplies both vectors.
Example:
v <- c( 2,5.5,6)
t <- c(8, 3, 4)
print(v*t)
It produces the following result −
[1] 16.0 16.5 24.0

Operator: /
Description: Divides the first vector by the second.
Example:
v <- c( 2,5.5,6)
t <- c(8, 3, 4)
print(v/t)
It produces the following result −
[1] 0.250000 1.833333 1.500000

Operator: %%
Description: Gives the remainder of the first vector with the second.
Example:
v <- c( 2,5.5,6)
t <- c(8, 3, 4)
print(v%%t)
It produces the following result −
[1] 2.0 2.5 2.0

Operator: %/%
Description: The result of division of first vector with second (quotient).
Example:
v <- c( 2,5.5,6)
t <- c(8, 3, 4)
print(v%/%t)
It produces the following result −
[1] 0 1 1

Operator: ^
Description: The first vector raised to the exponent of second vector.
Example:
v <- c( 2,5.5,6)
t <- c(8, 3, 4)
print(v^t)
It produces the following result −
[1] 256.000 166.375 1296.000
Relational Operators
Following table shows the relational operators supported by R language. Each element of the first vector is
compared with the corresponding element of the second vector. The result of comparison is a Boolean value.
Operator: >
Description: Checks if each element of the first vector is greater than the corresponding element of the second vector.
Example:
v <- c(2,5.5,6,9)
t <- c(8,2.5,14,9)
print(v>t)
It produces the following result −
[1] FALSE TRUE FALSE FALSE

Operator: <
Description: Checks if each element of the first vector is less than the corresponding element of the second vector.
Example:
v <- c(2,5.5,6,9)
t <- c(8,2.5,14,9)
print(v < t)
It produces the following result −
[1] TRUE FALSE TRUE FALSE

Operator: ==
Description: Checks if each element of the first vector is equal to the corresponding element of the second vector.
Example:
v <- c(2,5.5,6,9)
t <- c(8,2.5,14,9)
print(v == t)
It produces the following result −
[1] FALSE FALSE FALSE TRUE

Operator: <=
Description: Checks if each element of the first vector is less than or equal to the corresponding element of the second vector.
Example:
v <- c(2,5.5,6,9)
t <- c(8,2.5,14,9)
print(v<=t)
It produces the following result −
[1] TRUE FALSE TRUE TRUE

Operator: >=
Description: Checks if each element of the first vector is greater than or equal to the corresponding element of the second vector.
Example:
v <- c(2,5.5,6,9)
t <- c(8,2.5,14,9)
print(v>=t)
It produces the following result −
[1] FALSE TRUE FALSE TRUE

Operator: !=
Description: Checks if each element of the first vector is unequal to the corresponding element of the second vector.
Example:
v <- c(2,5.5,6,9)
t <- c(8,2.5,14,9)
print(v!=t)
It produces the following result −
[1] TRUE TRUE TRUE FALSE
Logical Operators
Following table shows the logical operators supported by R language.
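Operator: &&
Description: Called the logical AND operator. It compares single logical values and gives TRUE only if both are TRUE. In R 4.3.0 and later, passing a vector of length greater than one to && is an error, so the example selects the first element of each vector explicitly.
Example:
v <- c(3,0,TRUE,2+2i)
t <- c(1,3,TRUE,2+3i)
print(v[1] && t[1])
It produces the following result −
[1] TRUE

Operator: ||
Description: Called the logical OR operator. It compares single logical values and gives TRUE if at least one of them is TRUE. The same length restriction applies as for &&.
Example:
v <- c(0,0,TRUE,2+2i)
t <- c(0,3,TRUE,2+3i)
print(v[1] || t[1])
It produces the following result −
[1] FALSE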
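R also provides the element-wise logical operators &, | and !, which work on whole vectors. The following is a minimal sketch contrasting them with &&; the vector names a and b are illustrative.

```r
a <- c(TRUE, FALSE, TRUE)
b <- c(TRUE, TRUE, FALSE)

# Element-wise operators return one result per position.
elementwise_and <- a & b
elementwise_or  <- a | b
negated         <- !a

# && works on single values, so pick one element from each vector.
single_and <- a[1] && b[1]

print(elementwise_and)
print(elementwise_or)
print(negated)
print(single_and)
```

Use & and | when filtering vectors or data frame rows, and && and || inside if() conditions, where a single TRUE or FALSE is expected.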
Assignment Operators
These operators are used to assign values to vectors.
Operator: <- or = or <<-
Description: Called Left Assignment.
Example:
v1 <- c(3,1,TRUE,2+3i)
v2 <<- c(3,1,TRUE,2+3i)
v3 = c(3,1,TRUE,2+3i)
print(v1)
print(v2)
print(v3)
It produces the following result −
[1] 3+0i 1+0i 1+0i 2+3i
[1] 3+0i 1+0i 1+0i 2+3i
[1] 3+0i 1+0i 1+0i 2+3i

Operator: -> or ->>
Description: Called Right Assignment.
Example:
c(3,1,TRUE,2+3i) -> v1
c(3,1,TRUE,2+3i) ->> v2
print(v1)
print(v2)
It produces the following result −
[1] 3+0i 1+0i 1+0i 2+3i
[1] 3+0i 1+0i 1+0i 2+3i
Miscellaneous Operators
These operators are used for specific purposes and not for general mathematical or logical computation.

Operator: :
Description: Colon operator. It creates the series of numbers in sequence for a vector.
Example:
v <- 2:8
print(v)
It produces the following result −
[1] 2 3 4 5 6 7 8

Operator: %in%
Description: This operator is used to identify if an element belongs to a vector.
Example:
v1 <- 8
v2 <- 12
t <- 1:10
print(v1 %in% t)
print(v2 %in% t)
It produces the following result −
[1] TRUE
[1] FALSE

Operator: %*%
Description: This operator performs matrix multiplication. The example multiplies a matrix with its transpose.
Example:
M = matrix( c(2,6,5,1,10,4), nrow = 2,ncol = 3,byrow = TRUE)
t = M %*% t(M)
print(t)
It produces the following result −
     [,1] [,2]
[1,]   65   82
[2,]   82  117
Decision Making:
R provides the following types of decision-making statements.
Sr.No. Statement & Description
1 if statement
An if statement consists of a Boolean expression followed by one or more statements.
Syntax:
if(boolean_expression) {
   # statement(s) will execute if the boolean expression is true.
}
Example:
x <- 30L
if(is.integer(x)) {
   print("X is an Integer")
}
When the above code is executed, it produces the following result −
[1] "X is an Integer"
2 if...else statement
An if statement can be followed by an optional else statement, which executes when the
Boolean expression is false.
Syntax:
if(boolean_expression) {
   # statement(s) will execute if the boolean expression is true.
} else {
   # statement(s) will execute if the boolean expression is false.
}
Example:
x <- c("what","is","truth")
if("Truth" %in% x) {
   print("Truth is found")
} else {
   print("Truth is not found")
}
When the above code is executed, it produces the following result (the comparison is case-sensitive) −
[1] "Truth is not found"
if("Truth" %in% x) {
print("Truth is found the first time")
} else if ("truth" %in% x) {
print("truth is found the second time")
} else {
print("No truth found")
}
When the above code is compiled and executed, it produces the following result −
[1] "truth is found the second time"
3 switch statement
A switch statement allows a variable to be tested for equality against a list of values. Each
value is called a case, and the variable being switched on is checked for each case.
Syntax:
The basic syntax for creating a switch statement in R is −
switch(expression, case1, case2, case3....)
> switch(2,"red","green","blue")
[1] "green"
> switch(1,"red","green","blue")
[1] "red"
> x <- switch(4,"red","green","blue")
> x
NULL
> x <- switch(0,"red","green","blue")
> x
NULL
> switch("color", "color" = "red", "shape" = "square", "length" = 5)
[1] "red"
> switch("length", "color" = "red", "shape" = "square", "length" = 5)
[1] 5
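When switch() is given a character value, an unnamed final argument acts as the default case. The following is a minimal sketch; describe_shape and its cases are hypothetical and exist only for illustration.

```r
# Hypothetical helper: map a shape name to a short description.
describe_shape <- function(shape) {
  switch(shape,
         "circle"   = "no corners",
         "triangle" = "three corners",
         "square"   = "four corners",
         "unknown shape")  # unnamed last argument is the default
}

print(describe_shape("square"))
print(describe_shape("hexagon"))
```

Without the default, an unmatched character value makes switch() return NULL invisibly, just as the out-of-range numeric examples above do.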
Viva Questions:
1. Explain types of operators in R.
2. What is the difference between Element wise Logical AND & Logical AND operator?
3. Describe the types of decision-making statements in R.
4. Differentiate between %in% and %*% operator.
5. Differentiate between Break and Return keyword.
EXPT. No. – 4: Write a program to implement functions and loops in R.
2 while loop
Repeats a statement or group of statements while a given condition is true. It tests the
condition before executing the loop body.
Syntax
The basic syntax for creating a while loop in R is −
while (test_expression) {
statement
}
Example
v <- c("Hello","while loop")
cnt <- 2
while (cnt < 7) {
print(v)
cnt = cnt + 1
}
When the above code is compiled and executed, it produces the following result −
[1] "Hello" "while loop"
[1] "Hello" "while loop"
[1] "Hello" "while loop"
[1] "Hello" "while loop"
[1] "Hello" "while loop"
3 for loop
Executes a block of statements once for each element of a vector or list, so the number of iterations is fixed by the length of the input.
Syntax
The basic syntax for creating a for loop statement in R is −
for (value in vector) {
statements
}
R’s for loops are particularly flexible in that they are not limited to integers, or even
numbers in the input. We can pass character vectors, logical vectors, lists or expressions.
Example
v <- LETTERS[1:4]
for ( i in v) {
print(i)
}
When the above code is compiled and executed, it produces the following result −
[1] "A"
[1] "B"
[1] "C"
[1] "D"
v <- c("Hello","loop")
cnt <- 2
repeat {
   print(v)
   cnt <- cnt + 1
   if(cnt > 5) {
      break
   }
}
When the above code is compiled and executed, it produces the following result −
[1] "Hello" "loop"
[1] "Hello" "loop"
[1] "Hello" "loop"
[1] "Hello" "loop"
next statement
The next statement is useful when we want to skip the current iteration of a loop
without terminating it. On encountering next, R skips the remaining statements in the loop body and
starts the next iteration of the loop.
Example
v <- LETTERS[1:6]
for ( i in v) {
if (i == "D") {
next
}
print(i)
}
When the above code is compiled and executed, it produces the following result −
[1] "A"
[1] "B"
[1] "C"
[1] "E"
[1] "F"
Functions:
A function is a set of statements organized together to perform a specific task. R has a large number of in-built
functions and the user can create their own functions.
Simple examples of in-built functions are seq(), mean(), max(), sum(x) and paste(...) etc. They are directly
called by user written programs.
# Create a sequence of numbers from 32 to 44.
print(seq(32,44))
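A user-defined function is created with the function keyword and assigned to a name. The following is a minimal sketch; the body of new.function shown here is an assumed reconstruction, chosen so that calling it with no arguments prints the squares of 1 to 5.

```r
# Define a function without arguments (assumed body, for illustration).
new.function <- function() {
  for (i in 1:5) {
    print(i^2)
  }
}
```

The function does nothing until it is called by name followed by parentheses.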
new.function()
The above code will produce the following result −
[1] 1
[1] 4
[1] 9
[1] 16
[1] 25
[1] 18
[1] 45
Math functions
R has an array of mathematical functions.
Function        Description
abs(x)          Takes the absolute value of x
log(x,base=y)   Takes the logarithm of x with base y; if base is not specified, returns the natural logarithm
exp(x)          Returns the exponential of x
sqrt(x)         Returns the square root of x
factorial(x)    Returns the factorial of x (x!)
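The functions in the table can be tried on simple inputs, as in the minimal sketch below; the variable names are illustrative.

```r
abs_value   <- abs(-7.5)         # absolute value
log_base2   <- log(8, base = 2)  # logarithm with an explicit base
natural_log <- log(exp(1))       # natural logarithm when base is omitted
exp_zero    <- exp(0)            # exponential
square_root <- sqrt(16)          # square root
fact_five   <- factorial(5)      # 5! = 5 * 4 * 3 * 2 * 1

print(c(abs_value, log_base2, natural_log, exp_zero, square_root, fact_five))
```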
Statistical functions
The standard R installation contains a wide range of statistical functions. Some of the important functions are:
Basic statistical functions:
Function Description
mean(x) Mean of x
median(x) Median of x
var(x) Variance of x
sd(x) Standard deviation of x
Examples:
Mean: Calculate the sum of all the values and divide it by the total number of values in the data set.
> x <- c(1,2,3,4,5,1,2,3,1,2,4,5,2,3,1,1,2,3,5,6) # our data set
> mean.result = mean(x) # calculate mean
> print (mean.result)
[1] 2.8
Median: The middle value of the data set.
> x <- c(1,2,3,4,5,1,2,3,1,2,4,5,2,3,1,1,2,3,5,6) # our data set
> median.result = median(x) # calculate median
> print (median.result)
[1] 2.5
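Mode: The most frequently occurring number in the data set. R has no built-in function for calculating the
statistical mode, so the user can create a custom function. Note that naming it mode masks base R's mode()
function, which reports the storage mode of an object.
> mode <- function(x) {
+ ux <- unique(x)
+ ux[which.max(tabulate(match(x, ux)))]
+ }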
> x <- c(1,2,3,4,5,1,2,3,1,2,4,5,2,3,1,1,2,3,5,6) # our data set
> mode.result = mode(x) # calculate mode (with our custom function named ‘mode’)
> print (mode.result)
[1] 1
Variance: How far a set of data values are spread out from their mean.
> variance.result = var(x) # calculate variance
> print (variance.result)
[1] 2.484211
Standard Deviation: A measure that is used to quantify the amount of variation or dispersion of a set of
data values.
> sd.result = sqrt(var(x)) # calculate standard deviation
> print (sd.result)
[1] 1.576138
General functions
General functions such as cbind(), rbind() and range() each perform a specific task and take arguments
to return an output.
1. cbind() function combines vector, matrix or data frame by columns.
cbind(x1,x2,...)
x1,x2:vector, matrix, data frames
data1.csv:
Subtype,Gender,Expression
A,m,-0.54
A,f,-0.8
B,f,-1.03
C,m,-0.41
data2.csv:
Age,City
32,New York
21,Houston
34,Seattle
67,Houston
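Column binding of the two data sets above can be sketched as follows. The sketch builds the data frames in code instead of reading data1.csv and data2.csv, which is an assumption made so that it runs without the files; combined is an illustrative name.

```r
# Contents of data1.csv, entered directly.
data1 <- data.frame(
  Subtype    = c("A", "A", "B", "C"),
  Gender     = c("m", "f", "f", "m"),
  Expression = c(-0.54, -0.8, -1.03, -0.41)
)

# Contents of data2.csv, entered directly.
data2 <- data.frame(
  Age  = c(32, 21, 34, 67),
  City = c("New York", "Houston", "Seattle", "Houston")
)

# cbind() places the columns side by side; both inputs need the same number of rows.
combined <- cbind(data1, data2)
print(combined)
```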
2. rbind() function combines vector, matrix or data frame by rows.
rbind(x1,x2,...)
x1,x2:vector, matrix, data frames
data1.csv:
Subtype Gender Expression
A m -0.54
A f -0.8
B f -1.03
C m -0.41
data2.csv:
Subtype Gender Expression
D m 3.22
D f 1.02
D f 0.21
D m -0.04
D m 2.11
B m -1.21
A f -0.2
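Row binding of the two data sets above can be sketched as follows. As an assumption, the data frames are built in code instead of being read from the csv files; stacked is an illustrative name.

```r
# Contents of data1.csv, entered directly.
data1 <- data.frame(
  Subtype    = c("A", "A", "B", "C"),
  Gender     = c("m", "f", "f", "m"),
  Expression = c(-0.54, -0.8, -1.03, -0.41)
)

# Contents of data2.csv, entered directly.
data2 <- data.frame(
  Subtype    = c("D", "D", "D", "D", "D", "B", "A"),
  Gender     = c("m", "f", "f", "m", "m", "m", "f"),
  Expression = c(3.22, 1.02, 0.21, -0.04, 2.11, -1.21, -0.2)
)

# rbind() stacks the rows; both inputs need the same column names.
stacked <- rbind(data1, data2)
print(stacked)
```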
> range(y,na.rm=TRUE)
[1] -4 43
> range(y,finite=TRUE)
[1] -4 43
Viva Questions:
1. What is loop statement and loop control statement?
2. Differentiate between repeat and while loop.
3. Differentiate between break and next statement.
4. Give examples of in-built functions in R.
5. How to calculate power of a number in R?
EXPT. No. – 5: Write a program to implement Statistical data analysis through R using Excel/CSV files.
Aim: To study and implement Statistical data analysis using functions like mean, mode, and median.
Theory:
Statistics is defined as the study of the collection, analysis, interpretation, presentation, and organization of
data.
Why Statistics?
Statistical methods help ensure that your data are interpreted correctly, and that apparent
relationships are really "significant" or meaningful and have not simply happened by chance.
The functions are mean, median and mode.
Mean
It is calculated by taking the sum of the values and dividing by the number of values in a data series.
The function mean() is used to calculate this in R.
Syntax
The basic syntax for calculating mean in R is −
mean(x, trim = 0, na.rm = FALSE, ...)
Following is the description of the parameters used −
x is the input vector.
trim is the fraction (0 to 0.5) of observations to drop from each end of the sorted vector.
na.rm is used to remove the missing values from the input vector.
Example
# Create a vector.
x <- c(12,7,3,4.2,18,2,54,-21,8,-5)
# Find Mean.
result.mean <- mean(x)
print(result.mean)
The above code will produce the following result −
[1] 8.22
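The trim and na.rm parameters can be explored with the same vector. The following is a minimal sketch; the variable names are illustrative.

```r
x <- c(12,7,3,4.2,18,2,54,-21,8,-5)

# trim = 0.3 drops 3 values from each end of the sorted vector,
# leaving 3, 4.2, 7 and 8.
trimmed_mean <- mean(x, trim = 0.3)
print(trimmed_mean)

# A missing value makes the plain mean NA unless na.rm = TRUE is given.
x_with_na <- c(x, NA)
mean_with_na    <- mean(x_with_na)
mean_without_na <- mean(x_with_na, na.rm = TRUE)
print(mean_with_na)
print(mean_without_na)
```

Trimming reduces the influence of extreme values such as 54 and -21 on the result.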
Median
The middle most value in a data series is called the median. The median() function is used in R to calculate
this value.
Syntax
The basic syntax for calculating median in R is −
median(x, na.rm = FALSE)
Following is the description of the parameters used −
x is the input vector.
na.rm is used to remove the missing values from the input vector.
Example
# Create the vector.
x <- c(12,7,3,4.2,18,2,54,-21,8,-5)
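The median of this vector can be computed as in the following minimal sketch; median.result is an illustrative name.

```r
# Create the vector.
x <- c(12,7,3,4.2,18,2,54,-21,8,-5)

# Find the median. With ten values, it is the average of the two
# middle values of the sorted vector (4.2 and 7).
median.result <- median(x)
print(median.result)
```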
Mode
The mode is the value that has the highest number of occurrences in a set of data. Unlike mean and median, the mode
can be computed for both numeric and character data.
R does not have a standard built-in function to calculate the mode, so the user creates a function to calculate
the mode of a data set. This function takes a vector as input and gives the mode value as output.
Example
# Create the function.
getmode <- function(v) {
uniqv <- unique(v)
uniqv[which.max(tabulate(match(v, uniqv)))]
}
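The function can be applied to numeric and character vectors alike. The following is a minimal sketch; the sample vectors are illustrative values, and getmode is repeated so that the block runs on its own.

```r
# Same function as above, repeated to keep the sketch self-contained.
getmode <- function(v) {
  uniqv <- unique(v)
  uniqv[which.max(tabulate(match(v, uniqv)))]
}

# Numeric data: 2 and 3 both occur four times; the first one found wins.
v <- c(2,1,2,3,1,2,3,4,1,5,5,3,2,3)
numeric_mode <- getmode(v)
print(numeric_mode)

# Character data.
charv <- c("o","it","the","it","it")
character_mode <- getmode(charv)
print(character_mode)
```

Because which.max() returns the first maximum, ties are resolved in favour of the value that appears first in the vector.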
In R, the user can read data from files stored outside the R environment and can also write data into files that
are stored and accessed by the operating system. R can read and write various file formats such as csv,
excel and xml.
This section shows how to read data from a csv file and then write data into a csv file. The file should be present in
the current working directory so that R can read it. Alternatively, set a different working directory and read files from
there.
Getting and Setting the Working Directory
The user can check which directory the R workspace is pointing to using the getwd() function, and can set a
new working directory using the setwd() function.
# Get and print current working directory.
print(getwd())
The above code will produce the following result −
id name salary start_date dept
1 1 Rick 623.30 2012-01-01 IT
2 2 Dan 515.20 2013-09-23 Operations
3 3 Michelle 611.00 2014-11-15 IT
4 4 Ryan 729.00 2014-05-11 HR
5 5 Gary 843.25 2015-03-27 Finance
6 6 Nina 578.00 2013-05-21 IT
7 7 Simon 632.80 2013-07-30 Operations
8 8 Guru 722.50 2014-06-17 Finance
print(is.data.frame(data))
print(ncol(data))
print(nrow(data))
The above code will produce the following result −
[1] TRUE
[1] 5
[1] 8
Once the data has been read into a data frame, the user can apply all the functions applicable to data frames.
Get the maximum salary
# Create a data frame.
data <- read.csv("input.csv")
retval <- subset(data, salary == max(salary))
print(retval)
The above code will produce the following result −
id name salary start_date dept
5 5 Gary 843.25 2015-03-27 Finance
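Reading, filtering and writing back can be sketched end to end as below. As assumptions, the sketch creates its own small input file in a temporary directory (using the first four employees from the table above, without start_date) and writes to a file named output.csv; csv_path, staff, it_staff and out_path are illustrative names.

```r
# Build a small input file so the sketch does not depend on an existing input.csv.
csv_path <- file.path(tempdir(), "input.csv")
employees <- data.frame(
  id     = 1:4,
  name   = c("Rick", "Dan", "Michelle", "Ryan"),
  salary = c(623.30, 515.20, 611.00, 729.00),
  dept   = c("IT", "Operations", "IT", "HR")
)
write.csv(employees, csv_path, row.names = FALSE)

# Read the file into a data frame.
staff <- read.csv(csv_path)

# Keep only the people working in the IT department.
it_staff <- subset(staff, dept == "IT")

# Write the filtered data into a new csv file and read it back to check.
out_path <- file.path(tempdir(), "output.csv")
write.csv(it_staff, out_path, row.names = FALSE)
print(read.csv(out_path))
```

Passing row.names = FALSE keeps write.csv() from adding an extra column of row numbers to the output file.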
Microsoft Excel is the most widely used spreadsheet program and stores data in the .xls or .xlsx format. R
can read directly from these files using Excel-specific packages such as XLConnect,
xlsx and gdata. Here the xlsx package is used. R can also write into an Excel file using this package.
Install xlsx Package
Use the following command in the R console to install the "xlsx" package. R may ask to install some
additional packages on which this package depends. Run the same command with the required package
name to install each additional package.
install.packages("xlsx")
Verify and Load the "xlsx" Package
Use the following command to verify and load the "xlsx" package.
# Verify the package is installed.
any(grepl("xlsx",installed.packages()))
# Load the library into R workspace.
library("xlsx")
# Read the first worksheet in the file input.xlsx.
data <- read.xlsx("input.xlsx", sheetIndex = 1)
print(data)
The above code will produce the following result −
id name salary start_date dept
1 1 Rick 623.30 2012-01-01 IT
2 2 Dan 515.20 2013-09-23 Operations
3 3 Michelle 611.00 2014-11-15 IT
4 4 Ryan 729.00 2014-05-11 HR
5 5 Gary 843.25 2015-03-27 Finance
6 6 Nina 578.00 2013-05-21 IT
7 7 Simon 632.80 2013-07-30 Operations
8 8 Guru 722.50 2014-06-17 Finance
Viva Questions:
EXPT. No. - 6. WORKING WITH R CHARTS AND GRAPH
Aim: Visualize information using Pie chart, Bar chart, Histograms, Line Graphs and Scatterplots.
Theory:
The R programming language has numerous libraries to create charts and graphs. A pie chart is a representation of
values as slices of a circle with different colors. The slices are labeled, and the numbers corresponding to each
slice are also represented in the chart.
In R the pie chart is created using the pie() function which takes positive numbers as a vector input. The
additional parameters are used to control labels, color, title etc.
Syntax
The basic syntax for creating a pie-chart using the R is −
pie(x, labels, radius, main, col, clockwise)
Following is the description of the parameters used −
x is a vector containing the numeric values used in the pie chart.
labels is used to give description to the slices.
radius indicates the radius of the circle of the pie chart (value between −1 and +1).
main indicates the title of the chart.
col indicates the color palette.
clockwise is a logical value indicating if the slices are drawn clockwise or anti clockwise.
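The parameters above can be combined as in the following minimal sketch. The city values and labels are made-up sample data, and the chart is written to a PDF file in a temporary directory (an assumption so that it also runs without a display); in RStudio the pdf() and dev.off() lines can be dropped to show the chart in the Plots pane.

```r
# Sample data (illustrative values only).
x <- c(21, 62, 10, 53)
labels <- c("London", "New York", "Singapore", "Mumbai")

# Open a PDF device; the file name is an arbitrary choice.
chart_file <- file.path(tempdir(), "city_pie.pdf")
pdf(chart_file)

# Draw the pie chart with a title, one color per slice and clockwise slices.
pie(x, labels = labels, radius = 0.8, main = "City pie chart",
    col = rainbow(length(x)), clockwise = TRUE)

# Close the device so the file is written.
dev.off()
```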
Syntax
The features of the bar chart can be expanded by adding more parameters. The main parameter is used to
add a title. The col parameter is used to add colors to the bars. The names.arg parameter is a vector having the same number
of values as the input vector to describe the meaning of each bar.
Histograms Chart:
A histogram represents the frequencies of values of a variable bucketed into ranges. A histogram is similar to a bar
chart, but the difference is that it groups the values into continuous ranges. The height of each bar in a histogram
represents the number of values present in that range.
R creates histogram using hist() function. This function takes a vector as an input and uses some more
parameters to plot histograms.
Syntax
The basic syntax for creating a histogram using R is −
hist(v,main,xlab,xlim,ylim,breaks,col,border)
To specify the range of values allowed in X axis and Y axis, we can use the xlim and ylim parameters.
The width of each of the bar can be decided by using breaks.
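The effect of breaks, xlim and ylim can be seen in the following minimal sketch. The input values are made-up sample data, and the chart is written to a PDF file in a temporary directory (an assumption so that it runs without a display).

```r
# Sample data (illustrative values only).
v <- c(9, 13, 21, 8, 36, 22, 12, 41, 31, 33, 19)

chart_file <- file.path(tempdir(), "histogram.pdf")
pdf(chart_file)

# Explicit break points give bars of width 10: (0,10], (10,20], ... (40,50].
h <- hist(v, breaks = seq(0, 50, by = 10), main = "Histogram of v",
          xlab = "Weight", xlim = c(0, 50), ylim = c(0, 5),
          col = "lightblue", border = "black")

dev.off()

# hist() also returns the counts for each range.
print(h$counts)
```

Passing a vector of break points fixes the ranges exactly, whereas a single number for breaks is only treated as a suggestion.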
Line Graphs:
A line chart is a graph that connects a series of points by drawing line segments between them. These points
are ordered by one of their coordinate values (usually the x-coordinate). Line charts are usually used for
identifying trends in data.
The plot() function in R is used to create the line graph.
Syntax
plot(v,type,col,xlab,ylab)
Following is the description of the parameters used −
v is a vector containing the numeric values.
type takes the value "p" to draw only the points, "l" to draw only the lines and "o" to draw both points
and lines.
xlab is the label for x axis.
ylab is the label for y axis.
main is the Title of the chart.
col is used to give colors to both the points and lines.
The features of the line chart can be expanded by using additional parameters. We add color to the points and
lines, give a title to the chart and add labels to the axes.
More than one line can be drawn on the same chart by using the lines() function.
After the first line is plotted, the lines() function can use an additional vector as input to draw the second line in
the chart.
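A chart with two lines can be sketched as follows. The rainfall values are made-up sample data, and the chart is written to a PDF file in a temporary directory (an assumption so that it runs without a display).

```r
# Sample data for two series (illustrative values only).
v <- c(7, 12, 28, 3, 41)
second <- c(14, 7, 6, 19, 3)

chart_file <- file.path(tempdir(), "line_chart.pdf")
pdf(chart_file)

# First line: points and lines ("o"), with axis labels and a title.
plot(v, type = "o", col = "red", xlab = "Month", ylab = "Rain fall",
     main = "Rain fall chart")

# Second line drawn on the same chart.
lines(second, type = "o", col = "blue")

dev.off()
```

The axis limits are fixed by the first plot() call, so plot the series with the larger range first or set ylim explicitly.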
Scatterplots
Scatterplots show many points plotted in the Cartesian plane. Each point represents the values of two variables.
One variable is chosen in the horizontal axis and another in the vertical axis.
The simple scatterplot is created using the plot() function.
Syntax
The basic syntax for creating scatterplot in R is −
plot(x, y, main, xlab, ylab, xlim, ylim, axes)
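A scatterplot can be sketched with the mtcars data set that ships with R; choosing wt and mpg as the two variables, and the axis limits shown, are assumptions made for illustration. The chart is written to a PDF file in a temporary directory so that it runs without a display.

```r
# Use two columns of the built-in mtcars data set.
input <- mtcars[, c("wt", "mpg")]

chart_file <- file.path(tempdir(), "scatterplot.pdf")
pdf(chart_file)

# Weight on the horizontal axis, mileage on the vertical axis.
plot(x = input$wt, y = input$mpg,
     main = "Weight vs Mileage",
     xlab = "Weight", ylab = "Mileage",
     xlim = c(2.5, 5), ylim = c(15, 30))

dev.off()
```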
Scatterplot Matrices
When we have more than two variables and we want to examine the relationship between each variable and the
remaining ones, we use a scatterplot matrix. We use the pairs() function to create matrices of scatterplots.
Syntax
The basic syntax for creating scatterplot matrices in R is −
pairs(formula, data)
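A scatterplot matrix can be sketched with pairs() and a one-sided formula naming the variables to compare. The sketch uses the built-in mtcars data set; the choice of the four columns is an assumption made for illustration, and the chart is written to a PDF file in a temporary directory so that it runs without a display.

```r
chart_file <- file.path(tempdir(), "scatterplot_matrix.pdf")
pdf(chart_file)

# Each variable in the formula is plotted against every other one.
pairs(~ wt + mpg + disp + cyl, data = mtcars,
      main = "Scatterplot Matrix")

dev.off()
```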
Exercise:
1. Create a dataset of temperatures in a week and plot a barplot that will have labels.
temperature <- c(28, 35, 31, 40, 29, 41, 42)
days <- c("Sun", "Mon", "Tues", "Wed",
          "Thurs", "Fri", "Sat")
barplot(temperature, main = "Maximum Temperatures in a Week",
        xlab = "Days",
        ylab = "Degree in Celsius",
        names.arg = days,
        col = "darkred")
Viva Questions:
EXPT. No. - 7. LINEAR REGRESSION ANALYSIS USING R
Aim: Study of Regression Analysis and perform Linear Regression Analysis on Salary_Data_Set.
Regression analysis is a very widely used statistical tool to establish a relationship model between two
variables. One of these variables is called the predictor variable, whose value is gathered through experiments. The
other variable is called the response variable, whose value is derived from the predictor variable.
In linear regression these two variables are related through an equation in which the exponent (power) of both
variables is 1. Mathematically, a linear relationship represents a straight line when plotted as a graph. A
non-linear relationship, where the exponent of any variable is not equal to 1, creates a curve.
The general mathematical equation for a linear regression is −
y = ax + b
Following is the description of the parameters used −
y is the response variable.
x is the predictor variable.
a and b are constants which are called the coefficients.
Steps to Establish a Regression
A simple example of regression is predicting the weight of a person when their height is known. To do this we
need to have the relationship between the height and weight of a person.
The steps to create the relationship are −
Carry out the experiment of gathering a sample of observed values of height and corresponding weight.
Create a relationship model using the lm() function in R.
Find the coefficients from the model created and create the mathematical equation using these.
Get a summary of the relationship model to know the average error in prediction, also called the residuals.
To predict the weight of new persons, use the predict() function in R.
Input Data
Below is the sample data representing the observations −
# Values of height
151, 174, 138, 186, 128, 136, 179, 163, 152, 131
# Values of weight.
63, 81, 56, 91, 47, 57, 76, 72, 62, 48
lm() Function
This function creates the relationship model between the predictor and the response variable.
Syntax
The basic syntax for lm() function in linear regression is −
lm(formula,data)
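The lm() call can be tried on the height and weight observations listed under Input Data. The following is a minimal sketch; the names height, weight, sample_data and relation are chosen here for illustration and are not prescribed by the manual.

```r
# Sample observations from the Input Data section.
height <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)  # predictor
weight <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)            # response
sample_data <- data.frame(height, weight)

# Fit weight = a * height + b by least squares.
relation <- lm(weight ~ height, data = sample_data)

# coef() returns the intercept (b) and the slope (a).
print(coef(relation))

# summary() reports the residuals, standard errors and R-squared.
print(summary(relation))
```

In the formula weight ~ height, the response variable is written on the left of the tilde and the predictor on the right.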
predict() Function
Syntax
The basic syntax for predict() in linear regression is −
predict(object, newdata)
Following is the description of the parameters used −
object is the model which has already been created using the lm() function.
newdata is the data frame containing the new values of the predictor variable.
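As an illustration of predict(), the sketch below refits the height and weight model and predicts the weight of a person who is 170 cm tall. The value 170 and the names relation, new_person and predicted_weight are assumptions made for this example.

```r
# Sample observations from the Input Data section.
height <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
weight <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)

# Model object created with lm(); this is what predict() receives as 'object'.
relation <- lm(weight ~ height)

# newdata is a data frame whose column name matches the predictor
# used in the formula (here: height).
new_person <- data.frame(height = 170)

predicted_weight <- predict(relation, newdata = new_person)
print(predicted_weight)
```

If the column name in newdata does not match the predictor name in the formula, predict() cannot locate the new values, so the names must agree.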
Exercise:
The overall idea of regression is to examine two things –
Does a set of predictor variables do a good job in predicting an outcome (dependent) variable?
Which variables, in particular, are significant predictors of the outcome variable, and in what way
do they–indicated by the magnitude and sign of the beta estimates–impact the outcome variable?
Simple linear regression estimates are used to explain the relationship between one dependent variable and one
independent variable. Here we apply the linear regression approach to one of our datasets: the salary dataset
of an organization that decides salary based on the number of years an employee has worked in the organization.
We need to find out whether there is any relation between the number of years an employee has worked and the
salary he/she gets. We then test whether the model built on the training dataset also works well on the test
dataset.
Step #1:
The first thing you need to do is create the dataset (copy and paste it into an Excel sheet and save it as
Salary_Data.csv in the working directory).
YearsExperience Salary
1.1 39343
1.3 46205
1.5 37731
2 43525
2.2 39891
2.9 56642
3 60150
3.2 54445
3.2 64445
3.7 57189
3.9 63218
4 55794
4 56957
4.1 57081
4.5 61111
4.9 67938
5.1 66029
5.3 83088
5.9 81363
6 93940
6.8 91738
7.1 98273
7.9 101302
8.2 113812
8.7 109431
9 105582
9.5 116969
9.6 112635
10.3 122391
10.5 121872
Step #2:
Next, open RStudio, since we are going to implement the regression in the R environment.
Step #3: In this step we carry out the whole operation in RStudio. The commands, with brief explanations, are
as follows –
setwd("C:/Users/hp/Desktop")
Now we are going to load the dataset into RStudio. In this case, we have a CSV (comma separated values) file,
so we use read.csv() to load the Salary_Data.csv dataset into the R environment. We also assign the dataset to
a variable; here we name the variable raw_data.
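The loading step can be sketched as follows. So that the example runs on its own, it first writes the first five rows of the table to a temporary file; the path csv_path and this writing step are assumptions made for the sketch. In the lab, Salary_Data.csv would already be in the working directory and only the read.csv() call would be needed.

```r
# Write the first few rows of the salary table to a temporary CSV file.
# This stands in for the Salary_Data.csv file saved in Step #1.
csv_path <- file.path(tempdir(), "Salary_Data.csv")
writeLines(c("YearsExperience,Salary",
             "1.1,39343",
             "1.3,46205",
             "1.5,37731",
             "2,43525",
             "2.2,39891"), csv_path)

# read.csv() returns a data frame; the first line is used as the header.
raw_data <- read.csv(csv_path)

str(raw_data)           # column names and types
print(head(raw_data))   # first rows of the data frame
```

With the file in the working directory, the call reduces to raw_data <- read.csv("Salary_Data.csv").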
Now, to view the dataset in RStudio, use the name of the variable into which we loaded the dataset in the
previous step.
View(raw_data)
Before building the model, the dataset is divided into training data and testing data. The training data is used
to fit the model. The testing data, on the other hand, is held back during fitting; it is used to assess how well
the model was trained and to estimate model properties.
To do the splitting, we need to install the caTools package and load it with library().
install.packages('caTools')
library(caTools)
Now, we will set the seed. When we split the whole dataset into the training dataset and the test dataset, the
seed makes the random partition reproducible, so the same rows are selected every time the code is run.
set.seed(123)
Now, after setting the seed, we finally split the dataset. In general, it is recommended to split the dataset in
the ratio of 3:1; that is, 75% of the original dataset would be the training dataset and 25% would be the test
dataset. Here, however, we have only 30 rows, so it is more appropriate to allot 20 rows (i.e. 2/3) to the
training dataset and 10 rows (i.e. 1/3) to the test dataset.
split = sample.split(raw_data$Salary, SplitRatio = 2/3)
Here sample.split() is the function that splits the original dataset. It returns a logical vector with one entry
per row. Its first argument is the column on the basis of which we want to split the dataset; here we have split
on the basis of the Salary column. SplitRatio specifies the fraction allocated to the training dataset.
Now, the subset with split = TRUE will be assigned to the training dataset and the subset with split = FALSE
will be assigned to the test dataset.
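The assignment of rows to the two datasets can be sketched as below. The data frame is built inline from the Step #1 table so that the example is self-contained. The names training_set and test_set follow the plotting code used later in this experiment. The base R fallback for machines without caTools is an assumption of this sketch, not part of the manual.

```r
# Salary dataset from Step #1, entered inline.
raw_data <- data.frame(
  YearsExperience = c(1.1, 1.3, 1.5, 2, 2.2, 2.9, 3, 3.2, 3.2, 3.7,
                      3.9, 4, 4, 4.1, 4.5, 4.9, 5.1, 5.3, 5.9, 6,
                      6.8, 7.1, 7.9, 8.2, 8.7, 9, 9.5, 9.6, 10.3, 10.5),
  Salary = c(39343, 46205, 37731, 43525, 39891, 56642, 60150, 54445,
             64445, 57189, 63218, 55794, 56957, 57081, 61111, 67938,
             66029, 83088, 81363, 93940, 91738, 98273, 101302, 113812,
             109431, 105582, 116969, 112635, 122391, 121872)
)

set.seed(123)
if (requireNamespace("caTools", quietly = TRUE)) {
  # Logical vector: TRUE marks rows for the training dataset.
  split <- caTools::sample.split(raw_data$Salary, SplitRatio = 2/3)
} else {
  # Fallback (assumption): mark 20 randomly chosen rows as TRUE with base R.
  split <- seq_len(nrow(raw_data)) %in% sample(nrow(raw_data), 20)
}

training_set <- subset(raw_data, split == TRUE)
test_set     <- subset(raw_data, split == FALSE)

cat("Training rows:", nrow(training_set), " Test rows:", nrow(test_set), "\n")
```

Every row lands in exactly one of the two datasets, so the two row counts always add up to the 30 rows of the original data.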
Step #5: Fitting Simple Linear Regression to the Training Dataset.
Now, we will make a linear regression model that fits our training dataset. The lm() function is used to do
so. lm() is used to fit linear models. It can be used to carry out regression, single stratum analysis of variance
and analysis of covariance.
There are a number of arguments for the lm() function, but here we are not going to use all of them.
The first argument specifies the formula that we want to use for our linear model. Here, we have used Years of
Experience as the independent variable to predict the dependent variable, which is the Salary. The second
argument specifies which dataset we want to feed to the regressor to build our model. We are going to use the
training dataset to feed the regressor.
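The fitting step can be sketched as follows. The object name regressor and the column names come from the commands used elsewhere in this experiment. To keep the example free of package dependencies, the training rows are drawn with base R sample() as a stand-in for sample.split(), so the selected rows may differ from those obtained in the lab.

```r
# Salary dataset from Step #1, entered inline.
raw_data <- data.frame(
  YearsExperience = c(1.1, 1.3, 1.5, 2, 2.2, 2.9, 3, 3.2, 3.2, 3.7,
                      3.9, 4, 4, 4.1, 4.5, 4.9, 5.1, 5.3, 5.9, 6,
                      6.8, 7.1, 7.9, 8.2, 8.7, 9, 9.5, 9.6, 10.3, 10.5),
  Salary = c(39343, 46205, 37731, 43525, 39891, 56642, 60150, 54445,
             64445, 57189, 63218, 55794, 56957, 57081, 61111, 67938,
             66029, 83088, 81363, 93940, 91738, 98273, 101302, 113812,
             109431, 105582, 116969, 112635, 122391, 121872)
)

# Stand-in for sample.split(): choose 20 of the 30 rows for training.
set.seed(123)
train_rows   <- sample(nrow(raw_data), 20)
training_set <- raw_data[train_rows, ]

# First argument: formula (response ~ predictor). Second: the training data.
regressor <- lm(formula = Salary ~ YearsExperience, data = training_set)

# Intercept and slope of the fitted line.
print(coef(regressor))
```

The slope is the estimated change in salary for one additional year of experience, and the intercept is the estimated salary at zero years.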
After training our model on the training dataset, it is the time to analyze our model. To do so, write the
following command in the R console:
summary(regressor)
library(ggplot2)
ggplot() +
geom_point(aes(x = training_set$YearsExperience,
y = training_set$Salary), colour = 'red') +
geom_line(aes(x = training_set$YearsExperience,
y = predict(regressor, newdata = training_set)),
colour = 'blue') +
ggtitle('Salary vs Experience (Training Set)') +
xlab('Years of Experience') +
ylab('Salary')
Output:
The blue straight line in the graph represents the regressor that we have fitted on the training dataset. Since
we are working with simple linear regression, a straight line is obtained. The red dots represent the actual
training dataset.
Although the model does not predict the results exactly, the values it predicts are close to the actual values.
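How close the predictions are can be checked numerically on the test dataset. The sketch below is one way to do this, assuming a base R sample() split as a stand-in for sample.split(); the names y_pred, comparison and rmse, and the use of root mean squared error as the measure, are choices made for this example.

```r
# Salary dataset from Step #1, entered inline.
raw_data <- data.frame(
  YearsExperience = c(1.1, 1.3, 1.5, 2, 2.2, 2.9, 3, 3.2, 3.2, 3.7,
                      3.9, 4, 4, 4.1, 4.5, 4.9, 5.1, 5.3, 5.9, 6,
                      6.8, 7.1, 7.9, 8.2, 8.7, 9, 9.5, 9.6, 10.3, 10.5),
  Salary = c(39343, 46205, 37731, 43525, 39891, 56642, 60150, 54445,
             64445, 57189, 63218, 55794, 56957, 57081, 61111, 67938,
             66029, 83088, 81363, 93940, 91738, 98273, 101302, 113812,
             109431, 105582, 116969, 112635, 122391, 121872)
)

# Stand-in for sample.split(): 20 rows for training, the other 10 for testing.
set.seed(123)
train_rows   <- sample(nrow(raw_data), 20)
training_set <- raw_data[train_rows, ]
test_set     <- raw_data[-train_rows, ]

regressor <- lm(Salary ~ YearsExperience, data = training_set)

# Predict salaries for the rows the model has not seen.
y_pred <- predict(regressor, newdata = test_set)

# Actual versus predicted salary, side by side.
comparison <- data.frame(YearsExperience = test_set$YearsExperience,
                         Actual = test_set$Salary,
                         Predicted = round(y_pred))
print(comparison)

# Root mean squared error: typical size of a prediction error, in salary units.
rmse <- sqrt(mean((test_set$Salary - y_pred)^2))
cat("Test RMSE:", round(rmse), "\n")
```

A small RMSE relative to the salary values indicates that the fitted line generalises to the held-back rows.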
ggplot() +
geom_point(aes(x = test_set$YearsExperience,
y = test_set$Salary),
colour = 'red') +
geom_line(aes(x = training_set$YearsExperience,
y = predict(regressor, newdata = training_set)),
colour = 'blue') +
ggtitle('Salary vs Experience (Test Set)') +
xlab('Years of Experience') +
ylab('Salary')
Output:
# Splitting the dataset into the Training set and Test set
install.packages('caTools')
library(caTools)
set.seed(123)
# Visualising the Training set results
library(ggplot2)
ggplot() +
  geom_point(aes(x = training_set$YearsExperience,
                 y = training_set$Salary),
             colour = 'red') +
  geom_line(aes(x = training_set$YearsExperience,
                y = predict(regressor, newdata = training_set)),
            colour = 'blue') +
  ggtitle('Salary vs Experience (Training set)') +
  xlab('Years of experience') +
  ylab('Salary')
Viva Questions:
1. What is Linear Regression Analysis?
2. What is the use of the lm() function?
3. What is the use of the predict() function?
4. How did you visualize the test set results?
5. What is examined by regression?
Syntax
The basic syntax for lm() function in multiple regression is −
lm(y ~ x1+x2+x3...,data)
Following is the description of the parameters used −
formula is a symbol presenting the relation between the response variable and the predictor variables.
data is the data frame on which the formula will be applied.
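A multiple regression call of this form can be sketched with the built-in mtcars data set. The choice of mpg as the response and disp, hp and wt as predictors, as well as the values in new_car, are assumptions made for illustration.

```r
# Keep only the response (mpg) and three predictors from mtcars.
input <- mtcars[, c("mpg", "disp", "hp", "wt")]

# y ~ x1 + x2 + x3: one response, several predictors.
model <- lm(mpg ~ disp + hp + wt, data = input)

# One intercept plus one coefficient per predictor.
print(coef(model))

# Predict mpg for a hypothetical car; column names must match the predictors.
new_car <- data.frame(disp = 221, hp = 102, wt = 2.91)
predicted_mpg <- predict(model, newdata = new_car)
print(predicted_mpg)
```

Each coefficient is the estimated change in the response for a one-unit change in that predictor while the other predictors are held fixed.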
Following is the description of the parameters used −
formula is the symbol presenting the relationship between the variables.
data is the data set giving the values of these variables.
family is an R object that specifies the details of the model. Its value is binomial for logistic regression.
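A logistic regression model with these three parameters can be sketched as below, assuming the built-in mtcars data set. The formula and the names input and am.data follow the output shown in this experiment; prob_manual is a name introduced here for illustration.

```r
# Response "am" (0 = automatic, 1 = manual) and three predictors.
input <- mtcars[, c("am", "cyl", "hp", "wt")]

# family = binomial makes glm() fit a logistic regression.
am.data <- glm(formula = am ~ cyl + hp + wt, data = input, family = binomial)

# Fitted probability that each car has a manual transmission.
prob_manual <- predict(am.data, type = "response")
print(round(head(prob_manual), 3))
```

With type = "response", predict() returns probabilities between 0 and 1 rather than values on the log-odds scale.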
print(summary(am.data))
When we execute the above code, it produces the following result −
Call:
glm(formula = am ~ cyl + hp + wt, family = binomial, data = input)
Deviance Residuals:
     Min        1Q    Median        3Q       Max
-2.17272  -0.14907  -0.01464   0.14116   1.27641
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) 19.70288    8.11637   2.428   0.0152 *
cyl          0.48760    1.07162   0.455   0.6491
hp           0.03259    0.01886   1.728   0.0840 .
wt          -9.14947    4.15332  -2.203   0.0276 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Conclusion
In the summary as the p-value in the last column is more than 0.05 for the variables "cyl" and "hp", we
consider them to be insignificant in contributing to the value of the variable "am". Only weight (wt) impacts
the "am" value in this regression model.
Viva Questions:
pnorm()
This function gives the probability that a normally distributed random number is less than the given value. It
is also called the "Cumulative Distribution Function".
qnorm()
This function takes the probability value and gives a number whose cumulative value matches the probability
value.
rnorm()
This function is used to generate random numbers whose distribution is normal. It takes the sample size as
input and generates that many random numbers. We draw a histogram to show the distribution of the generated
numbers.
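The three functions can be tried together as in the sketch below. The specific inputs (1.96, 0.975, mean 50, standard deviation 5) and the variable names are illustrative assumptions.

```r
# pnorm(): probability that a standard normal value is below 1.96.
p <- pnorm(1.96)
print(p)

# qnorm(): the value below which 97.5% of the standard normal distribution lies.
q <- qnorm(0.975)
print(q)

# qnorm() is the inverse of pnorm(), so this recovers 1.5.
round_trip <- qnorm(pnorm(1.5))
print(round_trip)

# rnorm(): 1000 random values from a normal distribution with mean 50, sd 5.
set.seed(42)
samples <- rnorm(1000, mean = 50, sd = 5)

# The histogram should show the familiar bell shape centred near 50.
hist(samples, main = "Normal Distribution", xlab = "Value")
```

Setting the seed before rnorm() makes the generated sample, and therefore the histogram, reproducible.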
BINOMIAL DISTRIBUTION
The binomial distribution model deals with finding the probability of success of an event which has only two
possible outcomes in a series of experiments. For example, tossing a coin always gives a head or a tail. The
probability of finding exactly 3 heads when tossing a coin 10 times is computed using the binomial
distribution.
R has four in-built functions for the binomial distribution. They are described below.
dbinom(x, size, prob)
pbinom(x, size, prob)
qbinom(p, size, prob)
rbinom(n, size, prob)
Following is the description of the parameters used −
x is a vector of numbers.
p is a vector of probabilities.
n is number of observations.
size is the number of trials.
prob is the probability of success of each trial.
dbinom()
This function gives the probability of exactly x successes, that is, the probability distribution at each point.
pbinom()
This function gives the cumulative probability of an event. It is a single value representing the probability.
qbinom()
This function takes the probability value and gives a number whose cumulative value matches the probability
value.
rbinom()
This function generates the required number of random values for a given number of trials and probability of
success.
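The coin-tossing example can be worked through with the four functions as follows. This is a minimal sketch assuming a fair coin (prob = 0.5) tossed 10 times; the variable names are illustrative.

```r
# dbinom(): probability of exactly 3 heads in 10 tosses.
# By hand: choose(10, 3) / 2^10 = 120 / 1024 = 0.1171875
p_exact <- dbinom(3, size = 10, prob = 0.5)
print(p_exact)

# pbinom(): probability of 3 or fewer heads in 10 tosses.
# By hand: (1 + 10 + 45 + 120) / 1024 = 0.171875
p_at_most <- pbinom(3, size = 10, prob = 0.5)
print(p_at_most)

# qbinom(): smallest number of heads whose cumulative probability reaches 0.5.
median_heads <- qbinom(0.5, size = 10, prob = 0.5)
print(median_heads)

# rbinom(): 8 simulated experiments, each counting heads in 10 tosses.
set.seed(1)
draws <- rbinom(8, size = 10, prob = 0.5)
print(draws)
```

dbinom() answers "exactly x" questions while pbinom() answers "x or fewer" questions, which is why the second probability is the larger one.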
POISSON DISTRIBUTION:
Poisson Regression involves regression models in which the response variable is in the form of counts and not
fractional numbers, for example the number of births or the number of wins in a football match series. The
values of the response variable also follow a Poisson distribution.
The general mathematical equation for Poisson regression is −
log(y) = a + b1x1 + b2x2 + ... + bnxn
Following is the description of the parameters used −
y is the response variable.
a and b are the numeric coefficients.
x is the predictor variable.
The function used to create the Poisson regression model is the glm() function.
Syntax
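For a Poisson model, glm() is called with a formula, a data frame and family = poisson. The sketch below assumes the built-in warpbreaks data set (number of yarn breaks per loom, by wool type and tension level), which matches the coefficient names woolB, tensionM and tensionH; the object names output and rate_ratios are illustrative.

```r
# Count response "breaks" modelled by two categorical predictors.
output <- glm(formula = breaks ~ wool + tension,
              data = warpbreaks,
              family = poisson)

print(summary(output))

# Coefficients are on the log scale; exponentiating gives the multiplicative
# effect of each level on the expected number of breaks.
rate_ratios <- exp(coef(output))
print(round(rate_ratios, 3))
```

A rate ratio below 1 means that level is associated with fewer expected breaks than the reference level.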
Deviance Residuals:
    Min       1Q   Median       3Q      Max
-3.6871  -1.6503  -0.4269   1.1902   4.2616
Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  3.69196    0.04541  81.302  < 2e-16 ***
woolB       -0.20599    0.05157  -3.994 6.49e-05 ***
tensionM    -0.32132    0.06027  -5.332 9.73e-08 ***
tensionH    -0.51849    0.06396  -8.107 5.21e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Viva Questions:
ANALYSIS OF COVARIANCE:
We use regression analysis to create models which describe the effect of variation in predictor variables on the
response variable. Sometimes we have a categorical variable with values like Yes/No or Male/Female, and
simple regression analysis gives a separate result for each value of the categorical variable. In such a
scenario, we can study the effect of the categorical variable by using it along with the predictor variable and
comparing the regression lines for each level of the categorical variable. Such an analysis is termed
Analysis of Covariance, also called ANCOVA.
Example
Consider the R built-in data set mtcars. In it we observe that the field "am" represents the type of transmission
(auto or manual). It is a categorical variable with values 0 and 1. The miles per gallon value (mpg) of a car can
also depend on it, besides the value of horsepower ("hp").
We study the effect of the value of "am" on the regression between "mpg" and "hp". It is done by using
the aov() function followed by the anova() function to compare the multiple regressions.
Input Data
Create a data frame containing the fields "mpg", "hp" and "am" from the data set mtcars. Here we take "mpg"
as the response variable, "hp" as the predictor variable and "am" as the categorical variable.
input <- mtcars[,c("am","mpg","hp")]
print(head(input))
When we execute the above code, it produces the following result −
                  am  mpg  hp
Mazda RX4          1 21.0 110
Mazda RX4 Wag      1 21.0 110
Datsun 710         1 22.8  93
Hornet 4 Drive     0 21.4 110
Hornet Sportabout  0 18.7 175
Valiant            0 18.1 105
ANCOVA Analysis
We create a regression model taking "hp" as the predictor variable and "mpg" as the response variable taking
into account the interaction between "am" and "hp".
Model with interaction between categorical variable and predictor variable
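The comparison described in this experiment, aov() followed by anova(), can be sketched as follows. The object names result1, result2 and comparison are illustrative, and the second model (without the interaction term) is the comparison model assumed for this sketch.

```r
input <- mtcars[, c("am", "mpg", "hp")]

# hp * am expands to hp + am + hp:am, i.e. both main effects plus
# the interaction between the predictor and the categorical variable.
result1 <- aov(mpg ~ hp * am, data = input)
print(summary(result1))

# The same model without the interaction term.
result2 <- aov(mpg ~ hp + am, data = input)
print(summary(result2))

# anova() tests whether dropping the interaction changes the fit significantly.
comparison <- anova(result1, result2)
print(comparison)
```

If the p-value reported by anova() is greater than 0.05, the two models do not differ significantly, which indicates that the interaction between "hp" and "am" is not needed to explain "mpg".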
is the amount of rainfall in a region in different months of the year. The R language provides many functions to
create, manipulate and plot time series data. The data for a time series is stored in an R object called a
time-series object. It is also an R data object, like a vector or data frame.
The time series object is created by using the ts() function.
Syntax
The basic syntax for ts() function in time series analysis is −
timeseries.object.name <- ts(data, start, end, frequency)
Following is the description of the parameters used −
data is a vector or matrix containing the values used in the time series.
start specifies the start time for the first observation in time series.
end specifies the end time for the last observation in time series.
frequency specifies the number of observations per unit time.
Except for the parameter "data", all other parameters are optional.
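A time series object for the monthly rainfall example can be created as in the sketch below. The rainfall values, the starting year 2012 and the name rainfall.timeseries are hypothetical and are used only to illustrate the call.

```r
# Hypothetical monthly rainfall values (in mm) for one year; illustrative only.
rainfall <- c(79.9, 117.4, 86.5, 133.4, 63.5, 91.8,
              68.5, 99.8, 78.4, 98.5, 88.2, 107.1)

# start = c(2012, 1): the first observation is January 2012.
# frequency = 12: twelve observations per year, i.e. monthly data.
rainfall.timeseries <- ts(rainfall, start = c(2012, 1), frequency = 12)

# Printing a monthly series lays the values out by year and month.
print(rainfall.timeseries)

# Line chart of the series against time.
plot(rainfall.timeseries, main = "Monthly Rainfall", ylab = "Rainfall (mm)")
```

Other common frequency values are 4 for quarterly data and 1 for yearly data.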
Viva Questions: