CS-605 Data - Analytics - Lab Complete Manual (2) - 1672730238

The document describes various data structures in R including vectors, lists, matrices, arrays, and factors. It provides code examples to create each type of data structure and explains the basic syntax. Vectors can contain multiple elements of the same type, lists can contain different data types, matrices are two-dimensional, arrays can have multiple dimensions, and factors store a vector along with unique labels.

Uploaded by

ullukanu
Copyright
© © All Rights Reserved

DATA ANALYTICS LABORATORY 1

LAB MANUAL
(CS 605)
(Data Analytics Lab)

VI SEM (CS)

CHAMELI DEVI GROUP OF


INSTITUTIONS, INDORE

CHAMELI DEVI GROUP OF INSTITUTIONS, INDORE. DEPARTMENT OF COMPUTER SCIENCE & ENGG
DATA ANALYTICS LABORATORY 2

CHAMELI DEVI GROUP OF INSTITUTIONS


INDORE (M.P.)

DEPARTMENT OF
COMPUTER SCIENCE & ENGINEERING

CERTIFICATE

This is to certify that Mr./Ms……………………………………………………………… with RGTU

Enrollment No. 0832 ..…………………………..has satisfactorily completed the course of experiments in

…………………….……………………………………………...………laboratory, as prescribed by Rajiv

Gandhi Proudhyogiki Vishwavidhyalaya, Bhopal for ……… Semester of the Computer Science & Engineering

Department during year 20….…  ....

Signature of
Faculty In-charge


INDEX
Sl. No.   Name of the Experiment   (Date of Conduction)   (Signature of Faculty-in-Charge)

1    Introduction to R programming language with its installation and packages.
2    Write a program to implement various data structures of R.
3    Write a program to implement decision control and operators in R.
4    Write a program to implement functions and loops in R.
5    Write a program to implement Statistical data analysis through R using Excel/CSV files.
6    Visualize information using pie chart, bar chart, histograms, line graphs and scatter-plots.
7    Study of regression analysis and perform linear regression analysis on salary_data_set.
8    Study of regression analysis and perform linear regression analysis on salary_data_set.
9    Study of normal, Poisson and binomial distribution in R.
10   Study of covariance and time series analysis in R.

EXPT. No. – 1: Introduction to R programming language with its installation and packages.

Aim: To understand R language and its installation.


R is a programming language and free software environment for statistical computing and graphics that is
supported by the R Foundation for Statistical Computing. The R language is widely used among statisticians
and data miners for developing statistical software and data analysis.
R is an interpreted language; users typically access it through a command line interpreter. If a user types 2+2 at
the R command prompt and presses enter, the computer replies with 4, as shown below:
>2+2
[1] 4
To Install R
1. Open an internet browser and go to "www.r-project.org".
2. Click the "download R" link in the middle of the page under "Getting Started."
3. Select a CRAN location (a mirror site) and click the corresponding link.
4. Click on the "Download R for Windows" link at the top of the page (or the link for your operating system).
5. Click on the file containing the latest version of R.
6. Save the installer file (.exe on Windows, .pkg on macOS), double-click it to open, and follow the installation instructions.
7. Now that R is installed, you need to download and install RStudio.

To Install RStudio
1. Go to "www.rstudio.com" and click on the "Download RStudio" button.
2. Click on "Download RStudio Desktop."
3. Click on the version recommended for your system and save the installer on your computer.
4. On Windows, double-click the downloaded .exe file and follow the installation instructions. On macOS, double-click the .dmg file to open it, and then drag and drop RStudio to your Applications folder.

To Install R Packages
The directory where packages are stored is called the library. R comes with a standard set of packages. Others
are available for download and installation. Once installed, they have to be loaded into the session to be used.
.libPaths() # get library location
library() # see all packages installed
search() # see packages currently loaded
Installing and Loading Packages
As an example, the functions for estimating ordered logistic or probit regression are included in the MASS (Modern
Applied Statistics with S) package.
To install this package, run the following command:
> install.packages("MASS")
Installing a package does not make it available automatically. To use it, you have to load the package each time
you start an R session:
> library("MASS")
R now knows all the functions that are provided by the MASS package. To see what functions are implemented in
the MASS package, type:
> library(help = "MASS")

Maintaining your Library


Packages are frequently updated. Depending on the developer this could happen very often. To keep the
packages updated enter this every once in a while:
> update.packages ( )

The Workspace
To keep different projects in different physical directories, here are some standard commands for managing
your workspace.
getwd() # print the current working directory
ls() # list the objects in the current workspace
setwd(mydirectory) # change to the directory whose path is stored in mydirectory
setwd("c:/docs/mydir") # note / instead of \ in Windows
# View and set options for the session
help(options) # learn about available options
options() # view current option settings
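
The workspace commands above can be tried safely with a throwaway folder. The following is a minimal sketch, assuming a hypothetical folder name lab_project created under R's temporary directory; it is not part of the original manual.

```r
# Remember the current directory so it can be restored afterwards.
old_dir <- getwd()

# "lab_project" is a hypothetical folder name, created under R's temp directory.
project_dir <- file.path(tempdir(), "lab_project")
dir.create(project_dir, showWarnings = FALSE)

setwd(project_dir)                # switch to the project directory
in_project <- basename(getwd())   # name of the directory we are now in

scores <- c(10, 20, 30)           # create an object in the workspace
objects_now <- ls()               # names of the objects currently defined

setwd(old_dir)                    # restore the original directory

print(in_project)
print("scores" %in% objects_now)
```

Saving the old directory first and restoring it at the end keeps the rest of the session unaffected.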

Viva Questions:
1. What is the importance of R Programming?
2. Explain the features of R.
3. How to add R Packages in a program?
4. How to get currently loaded packages information in R?
5. Explain the process of listing objects in current workspace.

EXPT. No. – 2: Write a program to implement various data structures of R.

Aim: Study of data types and variables in R

Vectors
To create a vector with more than one element, use the c() function, which combines the elements into a
vector. All elements of a vector have the same type.
# Create a vector.
apple <- c('red','green',"yellow")
print(apple)

# Get the class of the vector.
print(class(apple))
The above code will produce the following result −
[1] "red" "green" "yellow"
[1] "character"
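
Because a vector holds elements of a single type, c() converts mixed inputs to a common type. A small sketch of this coercion (the variable names mixed and nums are illustrative only):

```r
# Mixing a number, a string and a logical value: everything becomes character.
mixed <- c(1, "a", TRUE)
print(mixed)          # "1" "a" "TRUE"
print(class(mixed))   # "character"

# Mixing numeric, logical and integer values: everything becomes numeric.
nums <- c(1.5, TRUE, 2L)
print(nums)           # 1.5 1.0 2.0
print(class(nums))    # "numeric"
```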

Lists
A list is an R-object which can contain many different types of elements inside it like vectors, functions and
even another list inside it.
# Create a list.
list1 <- list(c(2,5,3),21.3,sin)

# Print the list.
print(list1)
The above code will produce the following result −
[[1]]
[1] 2 5 3

[[2]]
[1] 21.3

[[3]]
function (x) .Primitive("sin")
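
List elements can also be named and then accessed individually. The sketch below uses a hypothetical list called student to show the difference between [ ] (returns a list), [[ ]] and $ (return the element itself).

```r
# A list with named elements of different types (hypothetical data).
student <- list(name = "Asha", marks = c(72, 85, 90), passed = TRUE)

first_as_list <- student[1]        # single brackets: a list of length 1
first_value <- student[["name"]]   # double brackets: the element itself
avg <- mean(student$marks)         # $ also extracts the element by name

student$grade <- "A"               # add a new element to the list

print(first_value)
print(avg)
print(length(student))
```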

Matrices
A matrix is a two-dimensional rectangular data set. It can be created using a vector input to the matrix
function.
# Create a matrix.
M = matrix( c('a','a','b','c','b','a'), nrow = 2, ncol = 3, byrow = TRUE)
print(M)
The above code will produce the following result −
     [,1] [,2] [,3]
[1,] "a"  "a"  "b"
[2,] "c"  "b"  "a"
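
The byrow argument decides whether the input vector fills the matrix row by row or column by column (the default). A short sketch with numeric values, using illustrative variable names:

```r
values <- 1:6

# Fill row by row: first row is 1 2 3, second row is 4 5 6.
by_row <- matrix(values, nrow = 2, ncol = 3, byrow = TRUE)

# Default (byrow = FALSE): fill column by column, so the first column is 1 2.
by_col <- matrix(values, nrow = 2, ncol = 3)

print(by_row)
print(by_col)
print(dim(by_row))    # number of rows and columns: 2 3
print(by_row[2, 1])   # element in row 2, column 1: 4
print(by_col[2, 1])   # element in row 2, column 1: 2
print(by_row[, 3])    # the whole third column: 3 6
```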

Arrays
While matrices are confined to two dimensions, arrays can have any number of dimensions. The array function
takes a dim attribute which creates the required number of dimensions. In the example below, we create an
array with two elements, each of which is a 3x3 matrix. The two input values are repeated (recycled) until
all 18 positions of the array are filled.
# Create an array.
a <- array(c('green','yellow'),dim = c(3,3,2))
print(a)
The above code will produce the following result −
, , 1

     [,1]     [,2]     [,3]
[1,] "green"  "yellow" "green"
[2,] "yellow" "green"  "yellow"
[3,] "green"  "yellow" "green"

, , 2

     [,1]     [,2]     [,3]
[1,] "yellow" "green"  "yellow"
[2,] "green"  "yellow" "green"
[3,] "yellow" "green"  "yellow"
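
Elements of an array are addressed with one index per dimension. The sketch below uses a numeric array (named num_array here purely for illustration) so that the positions are easy to follow; values are filled column by column within each 3x3 slice.

```r
# 18 values arranged as two 3x3 matrices.
num_array <- array(1:18, dim = c(3, 3, 2))

print(dim(num_array))      # 3 3 2
print(num_array[2, 3, 1])  # row 2, column 3 of the first matrix: 8
print(num_array[1, 1, 2])  # row 1, column 1 of the second matrix: 10

# Leaving an index empty selects everything along that dimension.
second_slice <- num_array[, , 2]
print(dim(second_slice))   # 3 3
```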

Factors
Factors are R objects which are created using a vector. A factor stores the vector along with the distinct
values of the elements in the vector as labels (levels). The labels are always character, irrespective of
whether the input vector is numeric, character or Boolean. Factors are useful in statistical modeling.
Factors are created using the factor() function. The nlevels() function gives the count of levels.
# Create a vector.
apple_colors <- c('green','green','yellow','red','red','red','green')

# Create a factor object.
factor_apple <- factor(apple_colors)

# Print the factor.
print(factor_apple)
print(nlevels(factor_apple))
The above code will produce the following result −
[1] green green yellow red red red green
Levels: green red yellow
[1] 3
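
Internally a factor stores integer codes that point into its levels. The following sketch inspects the levels, the codes and the frequency of each level for the same vector; the variable names other than those in the example above are illustrative.

```r
apple_colors <- c('green','green','yellow','red','red','red','green')
factor_apple <- factor(apple_colors)

# The distinct labels, sorted alphabetically by default.
print(levels(factor_apple))       # "green" "red" "yellow"

# The integer code of each element (its position in the levels).
codes <- as.integer(factor_apple)
print(codes)                      # 1 1 3 2 2 2 1

# How many times each level occurs.
counts <- table(factor_apple)
print(counts)

# The order of the levels can be chosen explicitly.
custom <- factor(apple_colors, levels = c("red", "yellow", "green"))
print(levels(custom))             # "red" "yellow" "green"
```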

Data Frames
Data frames are tabular data objects. Unlike in a matrix, each column of a data frame can contain a different
mode of data: the first column can be numeric while the second column is character and the third column is
logical. A data frame is a list of vectors of equal length.
Data frames are created using the data.frame() function.
# Create the data frame.
BMI <- data.frame(
  gender = c("Male", "Male","Female"),
  height = c(152, 171.5, 165),
  weight = c(81,93, 78),
  Age = c(42,38,26)
)
print(BMI)
The above code will produce the following result −
  gender height weight Age
1   Male  152.0     81  42
2   Male  171.5     93  38
3 Female  165.0     78  26
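
Once a data frame exists, its columns and rows can be inspected and filtered. A brief sketch on the same BMI data; the derived column name bmi and the variable names older and mean_height are assumptions made for illustration.

```r
BMI <- data.frame(
  gender = c("Male", "Male", "Female"),
  height = c(152, 171.5, 165),
  weight = c(81, 93, 78),
  Age = c(42, 38, 26)
)

str(BMI)                            # compact overview: one line per column

mean_height <- mean(BMI$height)     # $ extracts a single column as a vector
older <- BMI[BMI$Age > 30, ]        # keep only the rows where Age exceeds 30

# Add a derived column: weight (kg) divided by height (m) squared.
BMI$bmi <- BMI$weight / (BMI$height / 100)^2

print(mean_height)
print(older)
print(ncol(BMI))
```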

Variables
A variable provides named storage that programs can manipulate.
# Assignment using equal operator.
var.1 = c(0,1,2,3)

# Assignment using leftward operator.
var.2 <- c("learn","R")

# Assignment using rightward operator.
c(TRUE,1) -> var.3

print(var.1)
cat ("var.1 is ", var.1 ,"\n")
cat ("var.2 is ", var.2 ,"\n")
cat ("var.3 is ", var.3 ,"\n")
The above code will produce the following result −
[1] 0 1 2 3
var.1 is 0 1 2 3
var.2 is learn R
var.3 is 1 1

Viva Questions:
1. Describe data types in R.
2. How to declare variables in R?
3. What is the difference between array and matrices?
4. Explain the types of modes of R-object.
5. Differentiate between List and Vector.

EXPT. No. – 3: Write a program to implement decision control and operators in R.

Aim: Study of operators and decision-making statement in R.

Theory:
An operator is a symbol that tells the interpreter to perform a specific mathematical or logical manipulation.

Arithmetic Operators
The following table shows the arithmetic operators supported by the R language. The operators act on each
element of the vector.

Operator: +
Description: Adds two vectors.
Example:
v <- c(2,5.5,6)
t <- c(8,3,4)
print(v+t)
It produces the following result −
[1] 10.0 8.5 10.0

Operator: -
Description: Subtracts the second vector from the first.
Example:
v <- c(2,5.5,6)
t <- c(8,3,4)
print(v-t)
It produces the following result −
[1] -6.0 2.5 2.0

Operator: *
Description: Multiplies both vectors.
Example:
v <- c(2,5.5,6)
t <- c(8,3,4)
print(v*t)
It produces the following result −
[1] 16.0 16.5 24.0

Operator: /
Description: Divides the first vector by the second.
Example:
v <- c(2,5.5,6)
t <- c(8,3,4)
print(v/t)
It produces the following result −
[1] 0.250000 1.833333 1.500000

Operator: %%
Description: Gives the remainder of dividing the first vector by the second.
Example:
v <- c(2,5.5,6)
t <- c(8,3,4)
print(v%%t)
It produces the following result −
[1] 2.0 2.5 2.0

Operator: %/%
Description: Gives the integer quotient of dividing the first vector by the second.
Example:
v <- c(2,5.5,6)
t <- c(8,3,4)
print(v%/%t)
It produces the following result −
[1] 0 1 1

Operator: ^
Description: Raises the first vector to the exponent given by the second vector.
Example:
v <- c(2,5.5,6)
t <- c(8,3,4)
print(v^t)
It produces the following result −
[1] 256.000 166.375 1296.000
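
When the two operands have different lengths, R repeats (recycles) the shorter one. A small sketch of this behaviour, using an illustrative vector named short:

```r
v <- c(2, 5.5, 6, 9)

# A single number is recycled across every element of v.
doubled <- v * 2
print(doubled)     # 4 11 12 18

# A length-2 vector is recycled as 10, 100, 10, 100.
short <- c(10, 100)
shifted <- v + short
print(shifted)     # 12.0 105.5 16.0 109.0
```

R gives a warning if the longer length is not a multiple of the shorter one, which usually signals a mistake.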

Relational Operators
The following table shows the relational operators supported by the R language. Each element of the first vector is
compared with the corresponding element of the second vector. The result of the comparison is a Boolean value.

Operator: >
Description: Checks if each element of the first vector is greater than the corresponding element of the second vector.
Example:
v <- c(2,5.5,6,9)
t <- c(8,2.5,14,9)
print(v>t)
It produces the following result −
[1] FALSE TRUE FALSE FALSE

Operator: <
Description: Checks if each element of the first vector is less than the corresponding element of the second vector.
Example:
v <- c(2,5.5,6,9)
t <- c(8,2.5,14,9)
print(v < t)
It produces the following result −
[1] TRUE FALSE TRUE FALSE

Operator: ==
Description: Checks if each element of the first vector is equal to the corresponding element of the second vector.
Example:
v <- c(2,5.5,6,9)
t <- c(8,2.5,14,9)
print(v == t)
It produces the following result −
[1] FALSE FALSE FALSE TRUE

Operator: <=
Description: Checks if each element of the first vector is less than or equal to the corresponding element of the second vector.
Example:
v <- c(2,5.5,6,9)
t <- c(8,2.5,14,9)
print(v<=t)
It produces the following result −
[1] TRUE FALSE TRUE TRUE

Operator: >=
Description: Checks if each element of the first vector is greater than or equal to the corresponding element of the second vector.
Example:
v <- c(2,5.5,6,9)
t <- c(8,2.5,14,9)
print(v>=t)
It produces the following result −
[1] FALSE TRUE FALSE TRUE

Operator: !=
Description: Checks if each element of the first vector is unequal to the corresponding element of the second vector.
Example:
v <- c(2,5.5,6,9)
t <- c(8,2.5,14,9)
print(v!=t)
It produces the following result −
[1] TRUE TRUE TRUE FALSE

Logical Operators
The following table shows the logical operators supported by the R language. Numeric and complex values are
treated as TRUE when they are non-zero and as FALSE when they are zero.

Operator: &
Description: It is called the Element-wise Logical AND operator. It combines each element of the first vector with
the corresponding element of the second vector and gives an output TRUE if both the elements are TRUE.
Example:
v <- c(3,1,TRUE,2+3i)
t <- c(4,1,FALSE,2+3i)
print(v&t)
It produces the following result −
[1] TRUE TRUE FALSE TRUE

Operator: |
Description: It is called the Element-wise Logical OR operator. It combines each element of the first vector with
the corresponding element of the second vector and gives an output TRUE if at least one of the elements is TRUE.
Example:
v <- c(3,0,TRUE,2+2i)
t <- c(4,0,FALSE,2+3i)
print(v|t)
It produces the following result −
[1] TRUE FALSE TRUE TRUE

Operator: !
Description: It is called the Logical NOT operator. It takes each element of the vector and gives the opposite
logical value.
Example:
v <- c(3,0,TRUE,2+2i)
print(!v)
It produces the following result −
[1] FALSE TRUE FALSE FALSE

The logical operators && and || work on single values and give a single logical value as output. Older versions
of R silently used only the first element of each vector; since R 4.3.0, passing a vector with more than one
element to && or || is an error, so the first element is selected explicitly in the examples below.

Operator: &&
Description: Called the Logical AND operator. Takes a single value on each side and gives TRUE only if both
are TRUE.
Example:
v <- c(3,0,TRUE,2+2i)
t <- c(1,3,TRUE,2+3i)
print(v[1] && t[1])
It produces the following result −
[1] TRUE

Operator: ||
Description: Called the Logical OR operator. Takes a single value on each side and gives TRUE if at least one
of them is TRUE.
Example:
v <- c(0,0,TRUE,2+2i)
t <- c(0,3,TRUE,2+3i)
print(v[1] || t[1])
It produces the following result −
[1] FALSE
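
In practice && and || are mostly used inside if conditions, where each side is a single TRUE or FALSE. They also stop evaluating as soon as the result is known. The sketch below illustrates this; the variable names are illustrative, and all() and any() are shown as one common way to reduce a whole vector to a single logical value.

```r
x <- 7
# Both sides are single values, so && is appropriate here.
is_big_number <- is.numeric(x) && x > 5
print(is_big_number)      # TRUE

# Short-circuit: the right-hand side is skipped when the left side is FALSE,
# so the comparison with NULL is never evaluated.
y <- NULL
safe_check <- !is.null(y) && y > 0
print(safe_check)         # FALSE

# To test a whole vector, reduce it to one value first.
v <- c(3, 0, 1)
print(all(v > 0))         # FALSE, because one element is 0
print(any(v > 0))         # TRUE, because at least one element is positive
```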

Assignment Operators
These operators are used to assign values to vectors.

Operator: <- or = or <<-
Description: Called Left Assignment.
Example:
v1 <- c(3,1,TRUE,2+3i)
v2 <<- c(3,1,TRUE,2+3i)
v3 = c(3,1,TRUE,2+3i)
print(v1)
print(v2)
print(v3)
It produces the following result −
[1] 3+0i 1+0i 1+0i 2+3i
[1] 3+0i 1+0i 1+0i 2+3i
[1] 3+0i 1+0i 1+0i 2+3i

Operator: -> or ->>
Description: Called Right Assignment.
Example:
c(3,1,TRUE,2+3i) -> v1
c(3,1,TRUE,2+3i) ->> v2
print(v1)
print(v2)
It produces the following result −
[1] 3+0i 1+0i 1+0i 2+3i
[1] 3+0i 1+0i 1+0i 2+3i

Miscellaneous Operators
These operators are used for specific purposes and not for general mathematical or logical computation.

Operator: :
Description: Colon operator. It creates a series of numbers in sequence for a vector.
Example:
v <- 2:8
print(v)
It produces the following result −
[1] 2 3 4 5 6 7 8

Operator: %in%
Description: This operator is used to identify if an element belongs to a vector.
Example:
v1 <- 8
v2 <- 12
t <- 1:10
print(v1 %in% t)
print(v2 %in% t)
It produces the following result −
[1] TRUE
[1] FALSE

Operator: %*%
Description: This operator performs matrix multiplication. In the example it is used to multiply a matrix with its transpose.
Example:
M = matrix( c(2,6,5,1,10,4), nrow = 2,ncol = 3,byrow = TRUE)
t = M %*% t(M)
print(t)
It produces the following result −
     [,1] [,2]
[1,]   65   82
[2,]   82  117
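
Matrix multiplication with %*% requires the number of columns of the first matrix to equal the number of rows of the second, whereas * multiplies element by element. A small sketch with illustrative matrices A and B:

```r
A <- matrix(1:6, nrow = 2)   # 2 x 3 matrix, filled column by column
B <- matrix(1:6, nrow = 3)   # 3 x 2 matrix

# Matrix product: (2 x 3) times (3 x 2) gives a 2 x 2 result.
P <- A %*% B
print(P)
print(dim(P))                # 2 2

# Element-wise product: both operands must have the same shape.
E <- A * A
print(E)
```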

Decision Making:
R provides the following types of decision-making statements.

1 if statement
An if statement consists of a Boolean expression followed by one or more statements.
Syntax:
if(boolean_expression) {
  # statement(s) will execute if the boolean expression is true.
}
Example:
x <- 30L
if(is.integer(x)) {
  print("X is an Integer")
}
When the above code is executed, it produces the following result −
[1] "X is an Integer"

2 if...else statement
An if statement can be followed by an optional else statement, which executes when the
Boolean expression is false.
Syntax:
if(boolean_expression) {
  # statement(s) will execute if the boolean expression is true.
} else {
  # statement(s) will execute if the boolean expression is false.
}
Example:
x <- c("what","is","truth")

if("Truth" %in% x) {
  print("Truth is found")
} else {
  print("Truth is not found")
}
When the above code is executed, it produces the following result −
[1] "Truth is not found"
Here "Truth" and "truth" are two different strings.

The if...else if...else Statement:
An if statement can be followed by an optional else if...else statement, which is very useful
to test various conditions using a single if...else if statement.
Syntax:
The basic syntax for creating an if...else if...else statement in R is −
if(boolean_expression 1) {
  # Executes when the boolean expression 1 is true.
} else if( boolean_expression 2) {
  # Executes when the boolean expression 2 is true.
} else if( boolean_expression 3) {
  # Executes when the boolean expression 3 is true.
} else {
  # Executes when none of the above conditions is true.
}
Example:
x <- c("what","is","truth")

if("Truth" %in% x) {
  print("Truth is found the first time")
} else if ("truth" %in% x) {
  print("truth is found the second time")
} else {
  print("No truth found")
}
When the above code is executed, it produces the following result −
[1] "truth is found the second time"

3 switch statement
A switch statement allows a variable to be tested for equality against a list of values. Each
value is called a case, and the variable being switched on is checked for each case. If the
expression is a number, the case at that position is returned; if it is a character string, the
case with the matching name is returned. If nothing matches, the result is NULL.
Syntax:
The basic syntax for creating a switch statement in R is −
switch(expression, case1, case2, case3....)
Example:
> switch(2,"red","green","blue")
[1] "green"
> switch(1,"red","green","blue")
[1] "red"
> x <- switch(4,"red","green","blue")
> x
NULL
> x <- switch(0,"red","green","blue")
> x
NULL
> switch("color", "color" = "red", "shape" = "square", "length" = 5)
[1] "red"
> switch("length", "color" = "red", "shape" = "square", "length" = 5)
[1] 5
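
When switching on a character string, an unnamed final argument acts as a default, and a case left empty falls through to the next case. The sketch below shows both features; the function name describe_day and its labels are hypothetical.

```r
# describe_day() is a hypothetical helper used only for illustration.
describe_day <- function(day) {
  switch(day,
    "sat" = ,                  # empty case: falls through to the next one
    "sun" = "weekend",
    "mon" = "start of week",
    "other weekday"            # unnamed last argument: the default
  )
}

print(describe_day("sat"))     # "weekend"
print(describe_day("mon"))     # "start of week"
print(describe_day("wed"))     # "other weekday"
```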

Viva Questions:
1. Explain types of operators in R.
2. What is the difference between Element wise Logical AND & Logical AND operator?
3. Describe the types of decision-making statements in R.
4. Differentiate between %in% and %*% operator.
5. Differentiate between Break and Return keyword.

EXPT. No. – 4: Write a program to implement functions and loops in R.

Aim: Study of loops statement, loop control statement and functions in R.


Theory:
The R programming language provides the following kinds of loops to handle looping requirements.

1 repeat loop
Executes a sequence of statements again and again until a break statement inside the loop body
stops it.
Syntax:
The basic syntax for creating a repeat loop in R is −
repeat {
  commands
  if(condition) {
    break
  }
}
Example:
v <- c("Hello","loop")
cnt <- 2
repeat {
  print(v)
  cnt <- cnt+1
  if(cnt > 5) {
    break
  }
}
When the above code is executed, it produces the following result −
[1] "Hello" "loop"
[1] "Hello" "loop"
[1] "Hello" "loop"
[1] "Hello" "loop"

2 while loop
Repeats a statement or group of statements while a given condition is true. It tests the
condition before executing the loop body.
Syntax:
The basic syntax for creating a while loop in R is −
while (test_expression) {
  statement
}
Example:
v <- c("Hello","while loop")
cnt <- 2
while (cnt < 7) {
  print(v)
  cnt = cnt + 1
}
When the above code is executed, it produces the following result −
[1] "Hello" "while loop"
[1] "Hello" "while loop"
[1] "Hello" "while loop"
[1] "Hello" "while loop"
[1] "Hello" "while loop"

3 for loop
Executes the loop body once for each element of a vector or list, assigning the current element
to the loop variable.
Syntax:
The basic syntax for creating a for loop statement in R is −
for (value in vector) {
  statements
}
R's for loops are particularly flexible in that they are not limited to integers, or even
numbers in the input. We can pass character vectors, logical vectors, lists or expressions.
Example:
v <- LETTERS[1:4]
for ( i in v) {
  print(i)
}
When the above code is executed, it produces the following result −
[1] "A"
[1] "B"
[1] "C"
[1] "D"
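
When the position of each element is needed as well as its value, a for loop can iterate over indices. The following sketch uses seq_along(), which also behaves sensibly for an empty vector, unlike 1:length(v). The variable names empty and count are illustrative.

```r
v <- LETTERS[1:4]

# seq_along(v) gives 1 2 3 4, one index per element.
for (i in seq_along(v)) {
  cat(i, v[i], "\n")
}

# With an empty vector, seq_along() gives no indices, so the body never runs.
empty <- character(0)
count <- 0
for (i in seq_along(empty)) {
  count <- count + 1
}
print(count)               # 0

# By contrast, 1:length(empty) is 1:0, which counts down: 1 0.
print(1:length(empty))
```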

Loop Control Statements
Loop control statements change execution from its normal sequence.
R supports the following control statements.

1 break statement
Terminates the loop statement and transfers execution to the statement immediately
following the loop.
Example:
v <- c("Hello","loop")
cnt <- 2
repeat {
  print(v)
  cnt <- cnt + 1

  if(cnt > 5) {
    break
  }
}
When the above code is executed, it produces the following result −
[1] "Hello" "loop"
[1] "Hello" "loop"
[1] "Hello" "loop"
[1] "Hello" "loop"

2 next statement
The next statement is useful when we want to skip the current iteration of a loop
without terminating it. On encountering next, R skips the rest of the loop body and
starts the next iteration of the loop.
Example:
v <- LETTERS[1:6]
for ( i in v) {
  if (i == "D") {
    next
  }
  print(i)
}
When the above code is executed, it produces the following result −
[1] "A"
[1] "B"
[1] "C"
[1] "E"
[1] "F"

Functions:
A function is a set of statements organized together to perform a specific task. R has a large number of in-built
functions and users can create their own functions.
Simple examples of in-built functions are seq(), mean(), max(), sum() and paste(). They are called directly
by user-written programs.
# Create a sequence of numbers from 32 to 44.
print(seq(32,44))

# Find mean of numbers from 25 to 82.
print(mean(25:82))

# Find sum of numbers from 41 to 68.
print(sum(41:68))
The above code will produce the following result −
[1] 32 33 34 35 36 37 38 39 40 41 42 43 44
[1] 53.5
[1] 1526
User-defined Function
Users can create their own functions in R. They are specific to what a user wants, and once created they can
be used like the built-in functions. Below is an example of how a function is created and used.
# Create a function to print squares of numbers in sequence.
new.function <- function(a) {
  for(i in 1:a) {
    b <- i^2
    print(b)
  }
}
Calling a Function
# Create a function to print squares of numbers in sequence.
new.function <- function(a) {
  for(i in 1:a) {
    b <- i^2
    print(b)
  }
}

# Call the function new.function supplying 6 as an argument.
new.function(6)
The above code will produce the following result −
[1] 1
[1] 4
[1] 9
[1] 16
[1] 25
[1] 36
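
The function above prints its results. A function can instead return a value, which the caller can store and reuse. A minimal sketch, assuming a hypothetical function name squares:

```r
# squares() is a hypothetical name; it returns the squares of 1..n as a vector.
squares <- function(n) {
  seq_len(n)^2       # the value of the last expression is returned
}

result <- squares(6)
print(result)        # 1 4 9 16 25 36

# Because the result is returned rather than printed, it can be used further.
total <- sum(result)
print(total)         # 91
```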

Calling a Function without an Argument
# Create a function without an argument.
new.function <- function() {
  for(i in 1:5) {
    print(i^2)
  }
}

# Call the function without supplying an argument.
new.function()
The above code will produce the following result −
[1] 1
[1] 4
[1] 9
[1] 16
[1] 25

Calling a Function with Argument Values (by position and by name)
The arguments to a function call can be supplied in the same sequence as defined in the function, or they can be
supplied in a different sequence but assigned to the names of the arguments.
# Create a function with arguments.
new.function <- function(a,b,c) {
  result <- a * b + c
  print(result)
}

# Call the function by position of arguments.
new.function(5,3,11)

# Call the function by names of the arguments.
new.function(a = 11, b = 5, c = 3)
When we execute the above code, it produces the following result −
[1] 26
[1] 58

Calling a Function with Default Argument
We can define default values for the arguments in the function definition and call the function without
supplying any argument to get the default result. We can also call such functions by supplying new values for
the arguments to get a non-default result.
# Create a function with arguments.
new.function <- function(a = 3, b = 6) {
  result <- a * b
  print(result)
}

# Call the function without giving any argument.
new.function()

# Call the function with new values for the arguments.
new.function(9,5)

The above code will produce the following result −
[1] 18
[1] 45

Lazy Evaluation of Function
Arguments to functions are evaluated lazily, which means they are evaluated only when needed by the
function body.
# Create a function with arguments.
new.function <- function(a, b) {
  print(a^2)
  print(a)
  print(b)
}

# Evaluate the function without supplying one of the arguments.
new.function(6)
The above code will produce the following result −
[1] 36
[1] 6
Error in print(b) : argument "b" is missing, with no default
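
Two consequences of lazy evaluation can be sketched as follows. The function names describe and double_first are hypothetical; missing() is the base R function that reports whether an argument was supplied.

```r
# describe() checks whether b was supplied before using it.
describe <- function(a, b) {
  if (missing(b)) {
    return(a^2)        # b is never touched, so no error occurs
  }
  a^2 + b
}

print(describe(6))     # 36
print(describe(6, 4))  # 40

# double_first() never uses b, so the expression passed for b is never evaluated.
double_first <- function(a, b) {
  a * 2
}
print(double_first(3, stop("this is never evaluated")))   # 6
```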

Math functions
R has an array of mathematical functions.
Function          Description
abs(x)            Takes the absolute value of x
log(x, base=y)    Takes the logarithm of x with base y; if base is not specified, returns the natural logarithm
exp(x)            Returns the exponential of x
sqrt(x)           Returns the square root of x
factorial(x)      Returns the factorial of x (x!)

# sequence of numbers from 45 to 55, both included, incremented by 1
x_vector <- seq(45,55, by = 1)
# logarithm
log(x_vector)
Output:
## [1] 3.806662 3.828641 3.850148 3.871201 3.891820 3.912023 3.931826
## [8] 3.951244 3.970292 3.988984 4.007333
# exponential
exp(x_vector)
# square root
sqrt(x_vector)
Output:
## [1] 6.708204 6.782330 6.855655 6.928203 7.000000 7.071068 7.141428
## [8] 7.211103 7.280110 7.348469 7.416198
# factorial
factorial(x_vector)
Output:
## [1] 1.196222e+56 5.502622e+57 2.586232e+59 1.241392e+61 6.082819e+62
## [6] 3.041409e+64 1.551119e+66 8.065818e+67 4.274883e+69 2.308437e+71
## [11] 1.269640e+73

Statistical functions
The standard R installation contains a wide range of statistical functions. Some of the important functions are:
Basic statistic functions:
Function     Description
mean(x)      Mean of x
median(x)    Median of x
var(x)       Variance of x
sd(x)        Standard deviation of x
Examples:
Mean: The sum of all the values divided by the total number of values in the data set.
> x <- c(1,2,3,4,5,1,2,3,1,2,4,5,2,3,1,1,2,3,5,6) # our data set
> mean.result = mean(x) # calculate mean
> print (mean.result)
[1] 2.8
Median: The middle value of the data set.
> x <- c(1,2,3,4,5,1,2,3,1,2,4,5,2,3,1,1,2,3,5,6) # our data set
> median.result = median(x) # calculate median
> print (median.result)
[1] 2.5
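Mode: The most frequently occurring number in the data set. R has no built-in function for calculating the
statistical mode (the built-in mode() function reports the storage mode of an object), so the user can create a
custom function. Note that naming the custom function 'mode' hides the built-in function for the rest of the session.
> mode <- function(x) {
+   ux <- unique(x)
+   ux[which.max(tabulate(match(x, ux)))]
+ }
> x <- c(1,2,3,4,5,1,2,3,1,2,4,5,2,3,1,1,2,3,5,6) # our data set
> mode.result = mode(x) # calculate mode (with our custom function named 'mode')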
> print (mode.result)
[1] 1

Variance: How far a set of data values are spread out from their mean.
> variance.result = var(x) # calculate variance
> print (variance.result)
[1] 2.484211

Standard Deviation: A measure that is used to quantify the amount of variation or dispersion of a set of
data values.
> sd.result = sqrt(var(x)) # calculate standard deviation
> print (sd.result)
[1] 1.576138
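
The standard deviation can also be obtained directly with sd(), and a statistical mode helper can be given a name that does not clash with the built-in mode(). The sketch below assumes the hypothetical name stat_mode; when several values are tied for most frequent, it returns the one that appears first in the data.

```r
x <- c(1,2,3,4,5,1,2,3,1,2,4,5,2,3,1,1,2,3,5,6)   # the same data set

# sd() gives the same result as taking the square root of the variance.
print(sd(x))
print(isTRUE(all.equal(sd(x), sqrt(var(x)))))      # TRUE

# stat_mode() is a hypothetical name chosen to avoid hiding base::mode().
stat_mode <- function(values) {
  distinct <- unique(values)
  distinct[which.max(tabulate(match(values, distinct)))]
}

print(stat_mode(x))    # 1 (1 and 2 both occur five times; 1 appears first)
print(mode(x))         # the built-in mode() still works: "numeric"
```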

General functions
General functions such as cbind(), rbind() and range() each perform a specific task: they take arguments
and return an output.
1. The cbind() function combines vectors, matrices or data frames by columns.
cbind(x1,x2,...)
x1, x2: vectors, matrices or data frames

data1.csv:
Subtype,Gender,Expression
A,m,-0.54
A,f,-0.8
B,f,-1.03
C,m,-0.41

data2.csv:
Age,City
32,New York
21,Houston
34,Seattle
67,Houston

Read in the data from the files:
> x <- read.csv("data1.csv",header=T,sep=",")
> x2 <- read.csv("data2.csv",header=T,sep=",")
> x3 <- cbind(x,x2)
> x3
  Subtype Gender Expression Age     City
1       A      m      -0.54  32 New York
2       A      f      -0.80  21  Houston
3       B      f      -1.03  34  Seattle
4       C      m      -0.41  67  Houston

The two datasets must have the same number of rows.

2. The rbind() function combines vectors, matrices or data frames by rows.

rbind(x1,x2,...)
x1, x2: vectors, matrices or data frames

data1.csv:
Subtype,Gender,Expression
A,m,-0.54
A,f,-0.8
B,f,-1.03
C,m,-0.41

data2.csv:
Subtype,Gender,Expression
D,m,3.22
D,f,1.02
D,f,0.21
D,m,-0.04
D,m,2.11
B,m,-1.21
A,f,-0.2

Read in the data from the files:
> x <- read.csv("data1.csv",header=T,sep=",")
> x2 <- read.csv("data2.csv",header=T,sep=",")
> x3 <- rbind(x,x2)
> x3
   Subtype Gender Expression
1        A      m      -0.54
2        A      f      -0.80
3        B      f      -1.03
4        C      m      -0.41
5        D      m       3.22
6        D      f       1.02
7        D      f       0.21
8        D      m      -0.04
9        D      m       2.11
10       B      m      -1.21
11       A      f      -0.20

The two datasets must have the same column names.
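
Both functions can be tried without any CSV files by building small data frames directly. The sketch below reuses a few of the values shown above; the variable names expr, people and more_rows are illustrative.

```r
# Small data frames built in code instead of being read from files.
expr <- data.frame(
  Subtype = c("A", "A", "B", "C"),
  Gender = c("m", "f", "f", "m"),
  Expression = c(-0.54, -0.8, -1.03, -0.41)
)
people <- data.frame(
  Age = c(32, 21, 34, 67),
  City = c("New York", "Houston", "Seattle", "Houston")
)

# cbind(): same number of rows, columns are placed side by side.
combined_cols <- cbind(expr, people)
print(dim(combined_cols))    # 4 rows, 5 columns

# rbind(): same column names, rows are stacked.
more_rows <- data.frame(
  Subtype = c("D", "D"),
  Gender = c("m", "f"),
  Expression = c(3.22, 1.02)
)
combined_rows <- rbind(expr, more_rows)
print(dim(combined_rows))    # 6 rows, 3 columns
```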

3. range() function get a vector of the minimum and maximum values.

range(..., na.rm = FALSE, finite = FALSE)


...: numeric vector
na.rm: whether NA should be removed, if not, NA will be returned
finite: whether non-finite elements should be omitted

>x <- c(1,2.3,2,3,4,8,12,43,-4,-1)


>r <- range(x)
>r
[1] -4 43
>diff(r)
[1] 47

Missing value affect the results:


>y<- c(x,NA)
>y
[1] 1.0 2.3 2.0 3.0 4.0 8.0 12.0 43.0 -4.0 -1.0 NA
>range(y)
[1] NA NA


With na.rm=TRUE the missing value is ignored and the result is meaningful:

>range(y,na.rm=TRUE)
[1] -4 43

> range(y,finite=TRUE)
[1] -4 43
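On the vector y above, na.rm=TRUE and finite=TRUE give the same answer. The two options differ only when the data contain infinite values, as this small sketch with an illustrative vector z shows:

```r
z <- c(1, 5, NA, Inf)

r_default <- range(z)                 # the missing value propagates
r_narm    <- range(z, na.rm = TRUE)   # NA dropped, Inf kept
r_finite  <- range(z, finite = TRUE)  # NA and Inf both dropped

print(r_default)
print(r_narm)
print(r_finite)
```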

Viva Questions:
1. What is loop statement and loop control statement?
2. Differentiate between repeat and while loop.
3. Differentiate between break and next statement.
4. Give examples of in-built functions in R.
5. How to calculate power of a number in R?

EXPT. No. – 5: Write a program to implement Statistical data analysis through R using Excel/CSV files.

Aim: To study and implement Statistical data analysis using functions like mean, mode, and median.
Theory:

Statistics is defined as the study of the collection, analysis, interpretation, presentation, and organization of
data.
Why Statistics?
Statistical methods help ensure that data are interpreted correctly, and that apparent relationships are really
“significant” or meaningful rather than having occurred simply by chance.
The functions covered here are mean, median and mode.
Mean
It is calculated by taking the sum of the values and dividing by the number of values in a data series.
The function mean() is used to calculate this in R.
Syntax
The basic syntax for calculating mean in R is −
mean(x, trim = 0, na.rm = FALSE, ...)
Following is the description of the parameters used −
• x is the input vector.
• trim is the fraction (0 to 0.5) of observations to drop from each end of the sorted vector.
• na.rm is a logical value indicating whether missing values are removed from the input vector.
Example
# Create a vector.
x <- c(12,7,3,4.2,18,2,54,-21,8,-5)

# Find Mean.
result.mean <- mean(x)
print(result.mean)
The above code will produce the following result −
[1] 8.22
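The trim and na.rm parameters can be tried on the same vector. This is a small sketch; the variable names trimmed and with_na are illustrative.

```r
x <- c(12, 7, 3, 4.2, 18, 2, 54, -21, 8, -5)

# trim = 0.1 drops 10% of the values (here one value) from each end
# of the sorted vector before averaging.
trimmed <- mean(x, trim = 0.1)
print(trimmed)

# A single NA makes the mean NA unless na.rm = TRUE is given.
with_na <- c(x, NA)
print(mean(with_na))
print(mean(with_na, na.rm = TRUE))
```

Trimming removes the extremes -21 and 54 here, so the trimmed mean is less sensitive to outliers than the plain mean.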

Median
The middle value of a sorted data series is called the median; for an even number of values it is the average of
the two middle values. The median() function is used in R to calculate this value.
Syntax
The basic syntax for calculating median in R is −
median(x, na.rm = FALSE)
Following is the description of the parameters used −
• x is the input vector.
• na.rm is a logical value indicating whether missing values are removed from the input vector.
Example
# Create the vector.
x <- c(12,7,3,4.2,18,2,54,-21,8,-5)

# Find the median.


median.result <- median(x)
print(median.result)
The above code will produce the following result −
[1] 5.6

Mode
The mode is the value that has the highest number of occurrences in a set of data. Unlike mean and median, the mode
can be computed for both numeric and character data.
R does not have a standard built-in function to calculate the mode, so we write a user-defined function for it.
This function takes a vector as input and returns the mode as output. When several values are tied for the
highest count, it returns the one that appears first in the vector.
Example
# Create the function.
getmode <- function(v) {
  uniqv <- unique(v)
  uniqv[which.max(tabulate(match(v, uniqv)))]
}

# Create the vector with numbers.


v <- c(2,1,2,3,1,2,3,4,1,5,5,3,2,3)

# Calculate the mode using the user function.


result <- getmode(v)
print(result)

# Create the vector with characters.


charv <- c("o","it","the","it","it")

# Calculate the mode using the user function.


result <- getmode(charv)
print(result)
The above code will produce the following result −
[1] 2
[1] "it"
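In the numeric vector above, the values 2 and 3 both occur four times, and getmode() reports only the first of them. A possible variant that returns every tied value is sketched below; the name getmodes is hypothetical and not part of R.

```r
# Return all values that share the highest count (hypothetical helper).
getmodes <- function(v) {
  uniqv  <- unique(v)
  counts <- tabulate(match(v, uniqv))   # occurrences of each unique value
  uniqv[counts == max(counts)]
}

v <- c(2, 1, 2, 3, 1, 2, 3, 4, 1, 5, 5, 3, 2, 3)
print(getmodes(v))

charv <- c("o", "it", "the", "it", "it")
print(getmodes(charv))
```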

CSV Files interface:

In R, the user can read data from files stored outside the R environment and can also write data into files, which
are stored and accessed by the operating system. R can read and write various file formats such as csv,
excel, xml etc.
This section shows how to read data from a csv file and then write data into a csv file. The file should be present in
the current working directory so that R can find it. We can also set a different working directory and read files
from there.
Getting and Setting the Working Directory
The user can check which directory the R workspace is pointing to using the getwd() function, and can set a
new working directory using the setwd() function.
# Get and print current working directory.
print(getwd())

# Set current working directory.


setwd("C:/web/com")

# Get and print current working directory.


print(getwd())
The above code will produce the following result −
[1] "C:/web/com"
This result depends on the OS and current directory where user is working.

Input as CSV File


A csv file is a text file in which the values in the columns are separated by commas. The examples in this section
use the data in a file named input.csv.
The user can create this file in Windows Notepad by copying and pasting the data below. Save the file
as input.csv using the "Save as type: All Files (*.*)" option in Notepad.
id,name,salary,start_date,dept
1,Rick,623.3,2012-01-01,IT
2,Dan,515.2,2013-09-23,Operations
3,Michelle,611,2014-11-15,IT
4,Ryan,729,2014-05-11,HR
5,Gary,843.25,2015-03-27,Finance
6,Nina,578,2013-05-21,IT
7,Simon,632.8,2013-07-30,Operations
8,Guru,722.5,2014-06-17,Finance

Reading a CSV File


Following is a simple example of read.csv() function to read a CSV file available in your current working
directory −
data <- read.csv("input.csv")
print(data)

The above code will produce the following result −
id name salary start_date dept
1 1 Rick 623.30 2012-01-01 IT
2 2 Dan 515.20 2013-09-23 Operations
3 3 Michelle 611.00 2014-11-15 IT
4 4 Ryan 729.00 2014-05-11 HR
5 5 Gary 843.25 2015-03-27 Finance
6 6 Nina 578.00 2013-05-21 IT
7 7 Simon 632.80 2013-07-30 Operations
8 8 Guru 722.50 2014-06-17 Finance

Analyzing the CSV File


By default the read.csv() function returns its result as a data frame. This can be checked as follows, along with
the number of columns and rows.
data <- read.csv("input.csv")

print(is.data.frame(data))
print(ncol(data))
print(nrow(data))
The above code will produce the following result −
[1] TRUE
[1] 5
[1] 8
Once the data is read into a data frame, all the functions applicable to data frames can be applied to it.
Get the maximum salary
# Create a data frame.
data <- read.csv("input.csv")

# Get the max salary from data frame.


sal <- max(data$salary)
print(sal)
The above code will produce the following result −
[1] 843.25

Get the details of the person with max salary


User can fetch rows meeting specific filter criteria similar to a SQL where clause.
# Create a data frame.
data <- read.csv("input.csv")

# Get the max salary from data frame.


sal <- max(data$salary)

# Get the person detail having max salary.

retval <- subset(data, salary == max(salary))
print(retval)
The above code will produce the following result −
id name salary start_date dept
5 5 Gary 843.25 2015-03-27 Finance

Get all the people working in IT department


# Create a data frame.
data <- read.csv("input.csv")

retval <- subset( data, dept == "IT")


print(retval)
The above code will produce the following result −
id name salary start_date dept
1 1 Rick 623.3 2012-01-01 IT
3 3 Michelle 611.0 2014-11-15 IT
6 6 Nina 578.0 2013-05-21 IT

Get the persons in IT department whose salary is greater than 600


# Create a data frame.
data <- read.csv("input.csv")

info <- subset(data, salary > 600 & dept == "IT")


print(info)
The above code will produce the following result −
id name salary start_date dept
1 1 Rick 623.3 2012-01-01 IT
3 3 Michelle 611.0 2014-11-15 IT

Get the people who joined in or after 2014


# Create a data frame.
data <- read.csv("input.csv")

retval <- subset(data, as.Date(start_date) >= as.Date("2014-01-01"))


print(retval)
The above code will produce the following result −
id name salary start_date dept
3 3 Michelle 611.00 2014-11-15 IT
4 4 Ryan 729.00 2014-05-11 HR
5 5 Gary 843.25 2015-03-27 Finance
8 8 Guru 722.50 2014-06-17 Finance


Writing into a CSV File


R can create a csv file from an existing data frame. The write.csv() function is used to create the csv file. This file
is created in the working directory.
# Create a data frame.
data <- read.csv("input.csv")
retval <- subset(data, as.Date(start_date) >= as.Date("2014-01-01"))

# Write filtered data into a new file.


write.csv(retval,"output.csv")
newdata <- read.csv("output.csv")
print(newdata)
The above code will produce the following result −
X id name salary start_date dept
1 3 3 Michelle 611.00 2014-11-15 IT
2 4 4 Ryan 729.00 2014-05-11 HR
3 5 5 Gary 843.25 2015-03-27 Finance
4 8 8 Guru 722.50 2014-06-17 Finance
Here the column X holds the row names of the filtered data frame, which write.csv() writes to the file by default.
It can be dropped by passing row.names = FALSE when writing the file.
# Create a data frame.
data <- read.csv("input.csv")
retval <- subset(data, as.Date(start_date) >= as.Date("2014-01-01"))

# Write filtered data into a new file.


write.csv(retval,"output.csv", row.names = FALSE)
newdata <- read.csv("output.csv")
print(newdata)
The above code will produce the following result −
id name salary start_date dept
1 3 Michelle 611.00 2014-11-15 IT
2 4 Ryan 729.00 2014-05-11 HR
3 5 Gary 843.25 2015-03-27 Finance
4 8 Guru 722.50 2014-06-17 Finance
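Returning to the aim of this experiment, the statistical functions can be applied directly to the columns of the data frame read from the CSV file. The sketch below is self-contained under one assumption: it writes the sample data to a temporary file (csv_path is an illustrative name) rather than relying on input.csv being in the working directory.

```r
# Write the sample data to a temporary file so the example runs anywhere.
csv_path <- file.path(tempdir(), "input.csv")
writeLines(c(
  "id,name,salary,start_date,dept",
  "1,Rick,623.3,2012-01-01,IT",
  "2,Dan,515.2,2013-09-23,Operations",
  "3,Michelle,611,2014-11-15,IT",
  "4,Ryan,729,2014-05-11,HR",
  "5,Gary,843.25,2015-03-27,Finance",
  "6,Nina,578,2013-05-21,IT",
  "7,Simon,632.8,2013-07-30,Operations",
  "8,Guru,722.5,2014-06-17,Finance"
), csv_path)

data <- read.csv(csv_path)

# Same user-defined mode function as in the Mode section.
getmode <- function(v) {
  uniqv <- unique(v)
  uniqv[which.max(tabulate(match(v, uniqv)))]
}

salary_mean   <- mean(data$salary)     # average salary
salary_median <- median(data$salary)   # middle salary
dept_mode     <- getmode(data$dept)    # most frequent department

print(salary_mean)
print(salary_median)
print(dept_mode)
```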

Excel File Interface:

Microsoft Excel is the most widely used spreadsheet program, and it stores data in the .xls or .xlsx format. R
can read directly from these files using Excel-specific packages such as XLConnect,
xlsx and gdata. Here we use the xlsx package, which can also write data into Excel files.

Install xlsx Package
Use the following command in the R console to install the "xlsx" package. It may ask to install some
additional packages on which this package depends. Use the same command with the required package
name to install the additional packages.
install.packages("xlsx")
Verify and Load the "xlsx" Package
Use the following command to verify and load the "xlsx" package.
# Verify the package is installed.
"xlsx" %in% rownames(installed.packages())

# Load the library into R workspace.


library("xlsx")
When the script is run, the following output is produced.
[1] TRUE
Loading required package: rJava
Loading required package: methods
Loading required package: xlsxjars
Input as xlsx File
Open Microsoft Excel. Copy and paste the following data into the worksheet named Sheet1.
id name salary start_date dept
1 Rick 623.3 1/1/2012 IT
2 Dan 515.2 9/23/2013 Operations
3 Michelle 611 11/15/2014 IT
4 Ryan 729 5/11/2014 HR
5 Gary 843.25 3/27/2015 Finance
6 Nina 578 5/21/2013 IT
7 Simon 632.8 7/30/2013 Operations
8 Guru 722.5 6/17/2014 Finance
Also copy and paste the following data to another worksheet and rename this worksheet to "city".
name city
Rick Seattle
Dan Tampa
Michelle Chicago
Ryan Seattle
Gary Houston
Nina Boston
Simon Mumbai
Guru Dallas
Save the Excel file as "input.xlsx". It must be saved in the current working directory of the R workspace.
Reading the Excel File
The input.xlsx is read by using the read.xlsx() function as shown below. The result is stored as a data frame in
the R environment.

# Read the first worksheet in the file input.xlsx.
data <- read.xlsx("input.xlsx", sheetIndex = 1)
print(data)
The above code will produce the following result −
id name salary start_date dept
1 1 Rick 623.30 2012-01-01 IT
2 2 Dan 515.20 2013-09-23 Operations
3 3 Michelle 611.00 2014-11-15 IT
4 4 Ryan 729.00 2014-05-11 HR
5 5 Gary 843.25 2015-03-27 Finance
6 6 Nina 578.00 2013-05-21 IT
7 7 Simon 632.80 2013-07-30 Operations
8 8 Guru 722.50 2014-06-17 Finance

Viva Questions:

1. What is statistical data analysis?


2. How to calculate mean, median and mode in R?
3. How can we interface CSV file using R?
4. How can we interface Excel File using R?
5. How can we drop missing values from calculation?

EXPT. No. - 6. WORKING WITH R CHARTS AND GRAPH
Aim: Visualize information using Pie chart, Bar chart, Histograms, Line Graphs and Scatterplots.
Theory:
The R programming language has numerous libraries for creating charts and graphs. A pie chart is a representation of
values as slices of a circle with different colors. The slices are labeled, and the number corresponding to each
slice is also shown in the chart.
In R the pie chart is created using the pie() function, which takes a vector of positive numbers as input. The
additional parameters are used to control labels, color, title etc.
Syntax
The basic syntax for creating a pie-chart using the R is −
pie(x, labels, radius, main, col, clockwise)
Following is the description of the parameters used −
• x is a vector containing the numeric values used in the pie chart.
• labels is used to give a description to the slices.
• radius indicates the radius of the circle of the pie chart (value between −1 and +1).
• main indicates the title of the chart.
• col indicates the color palette.
• clockwise is a logical value indicating whether the slices are drawn clockwise or anticlockwise.

Pie Chart Title and Colors


We can expand the features of the chart by adding more parameters to the function. The
parameter main adds a title to the chart, and the parameter col can take a rainbow colour
palette for drawing the slices. The length of the palette should be the same as the number of values in the
chart, hence we use length(x).
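A minimal sketch of a titled, coloured pie chart follows. The values and city labels are hypothetical and serve only as illustration.

```r
# Hypothetical data: values and labels are illustrative only.
x      <- c(21, 62, 10, 53)
labels <- c("London", "New York", "Singapore", "Mumbai")

# One colour per slice, so the palette length matches the number of values.
slice_colours <- rainbow(length(x))

pie(x, labels = labels, main = "City pie chart", col = slice_colours)
```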
3D Pie Chart
A pie chart with 3 dimensions can be drawn using additional packages. The package plotrix has a function
Bar Charts:
A bar chart represents data in rectangular bars, with the length of each bar proportional to the value of the variable. R
uses the function barplot() to create bar charts. R can draw both vertical and horizontal bars in the bar chart.
In a bar chart each of the bars can be given a different color.

Syntax

The basic syntax to create a bar-chart in R is −


barplot(H,xlab,ylab,main, names.arg,col)

Following is the description of the parameters used −

• H is a vector or matrix containing the numeric values used in the bar chart.
• xlab is the label for the x axis.
• ylab is the label for the y axis.
• main is the title of the bar chart.
• names.arg is a vector of names appearing under each bar.
• col is used to give colors to the bars in the graph.


Bar Chart Labels, Title and Colors

The features of the bar chart can be expanded by adding more parameters. The main parameter is used to
add a title. The col parameter is used to add colors to the bars. The names.arg parameter is a vector with the same
number of values as the input vector, describing the meaning of each bar.

Group Bar Chart and Stacked Bar Chart


We can create a bar chart with groups of bars or with stacks in each bar by using a matrix as input.
Two or more variables are represented as a matrix, which is used to create the grouped bar chart and the stacked
bar chart.
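The matrix input can be sketched as follows. The revenue figures, month names and region names are hypothetical; rows of the matrix are regions and columns are months.

```r
# Hypothetical data: rows are regions, columns are months.
values <- matrix(c(2, 9, 3, 11, 9,
                   4, 8, 7, 3, 12,
                   5, 2, 8, 10, 11),
                 nrow = 3, ncol = 5, byrow = TRUE)
months  <- c("Mar", "Apr", "May", "Jun", "Jul")
regions <- c("East", "West", "North")
colours <- c("green", "orange", "brown")

# Stacked bars (the default): one bar per column, one segment per row.
barplot(values, names.arg = months, xlab = "Month", ylab = "Revenue",
        main = "Total revenue (stacked)", col = colours)
legend("topleft", legend = regions, fill = colours)

# Grouped bars: beside = TRUE draws the rows side by side instead.
mids <- barplot(values, beside = TRUE, names.arg = months, xlab = "Month",
                ylab = "Revenue", main = "Total revenue (grouped)", col = colours)
```

With beside = TRUE, barplot() returns the bar midpoints as a matrix with one row per region, which is convenient if text needs to be placed above individual bars.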

Histograms Chart:

A histogram represents the frequencies of values of a variable bucketed into ranges. A histogram is similar to a bar
chart, but the difference is that it groups the values into continuous ranges. The height of each bar in a histogram
represents the number of values present in that range.
R creates histogram using hist() function. This function takes a vector as an input and uses some more
parameters to plot histograms.

Syntax
The basic syntax for creating a histogram using R is −
hist(v,main,xlab,xlim,ylim,breaks,col,border)

Following is the description of the parameters used −

• v is a vector containing the numeric values used in the histogram.
• main indicates the title of the chart.
• col is used to set the color of the bars.
• border is used to set the border color of each bar.
• xlab is used to give a description of the x-axis.
• xlim is used to specify the range of values on the x-axis.
• ylim is used to specify the range of values on the y-axis.
• breaks controls how the values are binned (a suggested number of bins or a vector of break points), and
therefore the width of each bar.

Range of X and Y values

To specify the range of values shown on the X axis and Y axis, we can use the xlim and ylim parameters.
The width of each bar is controlled through breaks.
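A small sketch of these parameters, using a hypothetical vector v and explicit break points every 10 units:

```r
v <- c(9, 13, 21, 8, 36, 22, 12, 41, 31, 33, 19)   # hypothetical data

# Explicit break points give bins (0,10], (10,20], ..., (40,50].
h <- hist(v, main = "Histogram of v", xlab = "Weight",
          col = "green", border = "red",
          xlim = c(0, 50), ylim = c(0, 5),
          breaks = seq(0, 50, by = 10))

# hist() also returns the bin counts, which can be inspected directly.
print(h$counts)
```

Passing a single number to breaks is only a suggestion to R, while a vector of break points is used exactly as given.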

Line Graphs:

A line chart is a graph that connects a series of points by drawing line segments between them. These points
are ordered by one of their coordinates (usually the x-coordinate). Line charts are usually used to
identify trends in data.
The plot() function in R is used to create the line graph.

Syntax

The basic syntax to create a line chart in R is −

plot(v,type,col,xlab,ylab,main)
Following is the description of the parameters used −
• v is a vector containing the numeric values.
• type takes the value "p" to draw only the points, "l" to draw only the lines and "o" to draw both points
and lines.
• xlab is the label for the x axis.
• ylab is the label for the y axis.
• main is the title of the chart.
• col is used to give colors to both the points and lines.

Line Chart Title, Color and Labels

The features of the line chart can be expanded by using additional parameters. We add color to the points and
lines, give a title to the chart and add labels to the axes.

Multiple Lines in a Line Chart

More than one line can be drawn on the same chart by using the lines() function.
After the first line is plotted, the lines() function can use an additional vector as input to draw the second line in
the chart.
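A minimal sketch with two hypothetical series, v and w. The ylim argument is set from both series so that the second line is not cut off by the axis range chosen for the first.

```r
v <- c(7, 12, 28, 3, 41)    # hypothetical first series
w <- c(14, 7, 6, 19, 3)     # hypothetical second series

# Use the combined range so both lines fit inside the plot area.
y_limits <- range(c(v, w))

plot(v, type = "o", col = "red", xlab = "Month", ylab = "Rainfall",
     main = "Rainfall chart", ylim = y_limits)

# Add the second series to the existing plot.
lines(w, type = "o", col = "blue")
```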

Scatterplots

Scatterplots show many points plotted in the Cartesian plane. Each point represents the values of two variables.
One variable is chosen in the horizontal axis and another in the vertical axis.
The simple scatterplot is created using the plot() function.

Syntax
The basic syntax for creating scatterplot in R is −
plot(x, y, main, xlab, ylab, xlim, ylim, axes)

Following is the description of the parameters used −

• x is the data set whose values are the horizontal coordinates.
• y is the data set whose values are the vertical coordinates.
• main is the title of the graph.
• xlab is the label on the horizontal axis.
• ylab is the label on the vertical axis.
• xlim gives the limits of the values of x used for plotting.
• ylim gives the limits of the values of y used for plotting.
• axes indicates whether both axes should be drawn on the plot.

Scatterplot Matrices

When we have more than two variables and want to see the correlation between each variable and the
remaining ones, we use a scatterplot matrix. The pairs() function creates matrices of scatterplots.


Syntax

The basic syntax for creating scatterplot matrices in R is −


pairs(formula, data)

Following is the description of the parameters used −

• formula represents the series of variables used in pairs.
• data represents the data set from which the variables will be taken.
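Both kinds of scatterplot can be tried on mtcars, a data set that ships with R. The choice of columns and axis limits below is an assumption made for illustration.

```r
# Simple scatterplot: car weight against mileage.
plot(x = mtcars$wt, y = mtcars$mpg,
     xlab = "Weight", ylab = "Mileage",
     xlim = c(2.5, 5), ylim = c(15, 30),
     main = "Weight vs Mileage")

# Scatterplot matrix: one panel for every pair of the four chosen variables.
chosen <- c("wt", "mpg", "disp", "cyl")
pairs(~ wt + mpg + disp + cyl, data = mtcars, main = "Scatterplot Matrix")
```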

Exercise:
1. Create a dataset of temperatures in a week and plot a bar plot with labels.
temperature <- c(28, 35, 31, 40, 29, 41, 42)
days <- c("Sun", "Mon", "Tues", "Wed",
          "Thurs", "Fri", "Sat")
barplot(temperature, main = "Maximum Temperatures in a Week",
        xlab = "Days",
        ylab = "Degrees in Celsius",
        names.arg = days,
        col = "darkred")

Viva Questions:

1. What are R pie charts?


2. How to create axes in the graph?
3. What is Iplots?
4. Differentiate between histograms and bar charts.
5. What is scatterplot?

EXPT. No. - 7. LINEAR REGRESSION ANALYSIS USING R
Aim: Study of Regression Analysis and perform Linear Regression Analysis on Salary_Data_Set.
Regression analysis is a very widely used statistical tool to establish a relationship model between two
variables. One of these variables is called the predictor variable, whose value is gathered through experiments. The
other variable is called the response variable, whose value is derived from the predictor variable.
In linear regression these two variables are related through an equation in which the exponent (power) of both
variables is 1. Mathematically, a linear relationship represents a straight line when plotted as a graph. A
non-linear relationship, where the exponent of any variable is not equal to 1, creates a curve.
The general mathematical equation for a linear regression is −
y = ax + b
Following is the description of the parameters used −
• y is the response variable.
• x is the predictor variable.
• a and b are constants which are called the coefficients (a is the slope and b is the intercept).
Steps to Establish a Regression
A simple example of regression is predicting weight of a person when his height is known. To do this we need
to have the relationship between height and weight of a person.
The steps to create the relationship are −
• Carry out the experiment of gathering a sample of observed values of height and corresponding weight.
• Create a relationship model using the lm() function in R.
• Find the coefficients from the model created and create the mathematical equation using these coefficients.
• Get a summary of the relationship model to know the average error in prediction (the residuals).
• To predict the weight of new persons, use the predict() function in R.
Input Data
Below is the sample data representing the observations −
# Values of height
151, 174, 138, 186, 128, 136, 179, 163, 152, 131

# Values of weight.
63, 81, 56, 91, 47, 57, 76, 72, 62, 48

lm() Function
This function creates the relationship model between the predictor and the response variable.
Syntax
The basic syntax for lm() function in linear regression is −
lm(formula,data)


Following is the description of the parameters used −

• formula is a symbolic description of the relation between x and y (for example y ~ x).
• data is the data frame containing the variables named in the formula.
Create Relationship Model & get the Coefficients
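A minimal sketch of this step, using the height and weight observations listed under Input Data. The variable name relation is an illustrative choice.

```r
# Observations from the Input Data section.
x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)   # height
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)             # weight

# Fit weight as a linear function of height.
relation <- lm(y ~ x)

# The coefficients are the intercept and the slope for x.
print(coef(relation))
```

The two printed coefficients give the fitted line: weight = slope × height + intercept.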

Get the Summary of the Relationship
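A sketch of this step on the same height and weight observations. The names relation and model_summary are illustrative.

```r
x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)   # height
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)             # weight
relation <- lm(y ~ x)

# summary() reports residuals, coefficient estimates with standard errors,
# and goodness-of-fit measures such as R-squared.
model_summary <- summary(relation)
print(model_summary)

# Individual values can be extracted from the summary object.
print(model_summary$r.squared)
```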

predict() Function

Syntax
The basic syntax for predict() in linear regression is −
predict(object, newdata)
Following is the description of the parameters used −
• object is the model which has already been fitted using the lm() function.
• newdata is the data frame containing the new values of the predictor variable.

Predict the weight of new persons
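A sketch of this step. The height of 170 for the new person is a hypothetical value chosen for illustration; the column name in the new data frame must match the predictor name used in the formula.

```r
x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)   # height
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)             # weight
relation <- lm(y ~ x)

# New observation: the column must be called x, like the predictor.
new_person <- data.frame(x = 170)

result <- predict(relation, new_person)
print(result)
```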

Visualize the Regression Graphically
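One possible way to draw the observations together with the fitted line uses base graphics. The title, axis labels and colours are illustrative choices.

```r
x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)   # height
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)             # weight
relation <- lm(y ~ x)

# Scatterplot of the observations.
plot(x, y, col = "blue", pch = 16,
     main = "Height & Weight Regression",
     xlab = "Height in cm", ylab = "Weight in kg")

# abline() accepts a fitted lm object and draws its regression line.
abline(relation, col = "red")
```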

Exercise:
The overall idea of regression is to examine two things –
• Does a set of predictor variables do a good job in predicting an outcome (dependent) variable?
• Which variables, in particular, are significant predictors of the outcome variable, and in what way
do they (as indicated by the magnitude and sign of the beta estimates) impact the outcome variable?

Simple linear regression estimates are used to explain the relationship between one dependent variable
and one independent variable. Here we apply the linear regression approach to a salary dataset from an
organization that decides salary based on the number of years an employee has worked there. We need to find out
whether there is any relation between the number of years an employee has worked and the salary he/she gets.
We then test whether the model built on the training dataset also works well on the test dataset.
Step #1:
The first thing to do is to create the dataset (copy and paste it into an Excel sheet and save it as
Salary_Data.csv in the working directory).
YearsExperience Salary
1.1 39343
1.3 46205
1.5 37731
2 43525
2.2 39891
2.9 56642
3 60150
3.2 54445
3.2 64445
3.7 57189
3.9 63218
4 55794
4 56957
4.1 57081
4.5 61111
4.9 67938
5.1 66029
5.3 83088
5.9 81363
6 93940
6.8 91738
7.1 98273
7.9 101302
8.2 113812
8.7 109431
9 105582
9.5 116969
9.6 112635

10.3 122391
10.5 121872

Step #2:
Next, open RStudio, since we are going to implement the regression in the R environment.

Step #3: In this step we carry out the whole operation in RStudio. The commands, with brief explanations,
are as follows –

Loading the Dataset –


The first step is to set the working directory, i.e. the directory in which you are currently
working. setwd() is used to set the working directory.

setwd("C:/Users/hp/Desktop")

Now we load the dataset into RStudio. In this case we have a CSV (comma-separated values)
file, so we use read.csv() to load the Salary_Data.csv dataset into the R environment. We also
assign the dataset to a variable, here named raw_data.

raw_data <- read.csv("Salary_Data.csv")

Now, to view the dataset in RStudio, pass the name of the variable into which we loaded the dataset in the
previous step to View().
View(raw_data)

Step #4: Splitting the Dataset.


Now we are going to split the dataset into the training dataset and the test dataset.
Training data (also called the training set, training dataset or learning set) is the information used
to train an algorithm. The training data includes both the input data and the corresponding expected output. Based
on this “ground truth” data, the algorithm learns the relationship between input and output, so that it can make
accurate predictions when later presented with new data.

Testing data is held back from training. The model sees only its input values when predicting, and the predictions
are then compared with the known expected output. The testing data is used to assess how well the algorithm was
trained, and to estimate model properties.
For the splitting, we need to install the caTools package and load the caTools library.

install.packages('caTools')
library(caTools)

Now we set the seed. The split into the training dataset and the test dataset is random, and setting the seed
makes the same partition reproducible on every run.
set.seed(123)
Having set the seed, we finally split the dataset. A common choice is to split the dataset in
the ratio 3:1, so that 75% of the original dataset becomes the training dataset and 25% becomes
the test dataset. Here, however, the dataset has only 30 rows, so it is more appropriate to
allocate 20 rows (i.e. 2/3) to the training dataset and 10 rows (i.e. 1/3) to the test dataset.

split = sample.split(raw_data$Salary, SplitRatio = 2/3)

Here sample.split() is the function that splits the original dataset. Its first argument is the column on
which the split is based; here we split on the Salary column. SplitRatio specifies the fraction allocated
to the training dataset. sample.split() returns a logical vector: rows where split is TRUE are assigned to
the training dataset and rows where split is FALSE are assigned to the test dataset.

training_set <- subset(raw_data, split == TRUE)


test_set <- subset(raw_data, split == FALSE)
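If the caTools package is not available, a similar 2/3 to 1/3 split can be sketched with base R's sample(). This is a swapped-in alternative, not what sample.split() does internally, and it will not select the same rows. The data frame is typed in directly from the table in Step #1 so that the example is self-contained; train_rows is an illustrative name.

```r
# Salary data from Step #1, entered directly instead of read from a file.
raw_data <- data.frame(
  YearsExperience = c(1.1, 1.3, 1.5, 2, 2.2, 2.9, 3, 3.2, 3.2, 3.7,
                      3.9, 4, 4, 4.1, 4.5, 4.9, 5.1, 5.3, 5.9, 6,
                      6.8, 7.1, 7.9, 8.2, 8.7, 9, 9.5, 9.6, 10.3, 10.5),
  Salary = c(39343, 46205, 37731, 43525, 39891, 56642, 60150, 54445, 64445, 57189,
             63218, 55794, 56957, 57081, 61111, 67938, 66029, 83088, 81363, 93940,
             91738, 98273, 101302, 113812, 109431, 105582, 116969, 112635, 122391, 121872)
)

set.seed(123)   # makes the random choice of rows reproducible

# Pick two thirds of the row numbers at random for training.
train_rows   <- sample(nrow(raw_data), size = round(2/3 * nrow(raw_data)))
training_set <- raw_data[train_rows, ]
test_set     <- raw_data[-train_rows, ]   # all remaining rows

print(nrow(training_set))
print(nrow(test_set))
```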

Step #5: Fitting Simple Linear Regression to the Training Dataset.
Now we build a linear regression model that fits the training dataset. The lm() function is used to do
so. lm() is used to fit linear models. It can be used to carry out regression, single stratum analysis of variance
and analysis of covariance.

regressor = lm(formula = Salary ~ YearsExperience, data = training_set)

The lm() function has a number of arguments, but we do not need all of them here.
The first argument specifies the formula for the linear model. Here, we use years of
experience as the independent variable to predict the dependent variable, Salary. The second argument
specifies the dataset fed to the regressor to build the model; we use the training dataset.
After training the model on the training dataset, it is time to analyze it. To do so, run the
following command in the R console:
summary(regressor)


Step #6: Predicting the test set results


Now it is time to predict the test set results using the model built on the training
dataset. The predict() function is used to do so. The first argument passed to the function is the model;
here, the model is regressor. The second argument, newdata, specifies the dataset to which the trained
model is applied to produce predictions. Here we pass test_set.

y_pred = predict(regressor, newdata = test_set)
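To judge the predictions, they can be compared with the actual salaries in the test set, for example through the root mean squared error (RMSE). The sketch below makes two assumptions for the sake of a self-contained example: the data is typed in from Step #1, and the split is a simple deterministic one (every third row held out) instead of sample.split(), so its numbers will differ from the caTools split.

```r
# Salary data from Step #1, entered directly instead of read from a file.
raw_data <- data.frame(
  YearsExperience = c(1.1, 1.3, 1.5, 2, 2.2, 2.9, 3, 3.2, 3.2, 3.7,
                      3.9, 4, 4, 4.1, 4.5, 4.9, 5.1, 5.3, 5.9, 6,
                      6.8, 7.1, 7.9, 8.2, 8.7, 9, 9.5, 9.6, 10.3, 10.5),
  Salary = c(39343, 46205, 37731, 43525, 39891, 56642, 60150, 54445, 64445, 57189,
             63218, 55794, 56957, 57081, 61111, 67938, 66029, 83088, 81363, 93940,
             91738, 98273, 101302, 113812, 109431, 105582, 116969, 112635, 122391, 121872)
)

# Deterministic split for illustration only: every third row is held out.
test_rows    <- seq(3, nrow(raw_data), by = 3)
training_set <- raw_data[-test_rows, ]
test_set     <- raw_data[test_rows, ]

regressor <- lm(Salary ~ YearsExperience, data = training_set)
y_pred    <- predict(regressor, newdata = test_set)

# Side-by-side view of actual and predicted salaries.
comparison <- data.frame(actual = test_set$Salary, predicted = round(y_pred))
print(comparison)

# RMSE: typical size of the prediction error, in salary units.
rmse <- sqrt(mean((test_set$Salary - y_pred)^2))
print(rmse)
```

An RMSE that is small relative to the spread of the salaries suggests that the line fitted on the training data also describes the held-out rows reasonably well.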

Step #7: Visualizing the Training Set results


We are going to visualize the training set results. For doing this we are going to use the ggplot2 library. ggplot2 is a
system for declaratively creating graphics, based on The Grammar of Graphics. You provide the data,
tell ggplot2 how to map variables to aesthetics, what graphical primitives to use, and it takes care of the details.

library(ggplot2)

ggplot() +
geom_point(aes(x = training_set$YearsExperience,
y = training_set$Salary), colour = 'red') +
geom_line(aes(x = training_set$YearsExperience,
y = predict(regressor, newdata = training_set)),
colour = 'blue') +
ggtitle('Salary vs Experience (Training Set)') +
xlab('Years of Experience') +
ylab('Salary')
Output:

The blue straight line in the graph represents the regressor fitted to the training dataset.
Because this is simple linear regression, the fitted model is a straight line. The red
dots represent the actual training data.
The predictions do not match every point exactly, but the fitted line follows the data closely.

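Step #8: Visualizing the Test Set Results

The test dataset can be visualized in the same way as the training dataset. The red points now come from the
test set, while the blue line is still drawn from the training-set predictions, because the fitted regression
line is the same.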
library(ggplot2)

ggplot() +
geom_point(aes(x = test_set$YearsExperience,
y = test_set$Salary),
colour = 'red') +
geom_line(aes(x = training_set$YearsExperience,
y = predict(regressor, newdata = training_set)),
colour = 'blue') +
ggtitle('Salary vs Experience (Test Set)') +
xlab('Years of Experience') +
ylab('Salary')
Output:

The complete code in R:-


# Linear Regression

# Importing the dataset


dataset = read.csv('Salary_Data.csv')

# Splitting the dataset into the Training set and Test set
# install.packages('caTools')  # run once if the package is not installed
library(caTools)
set.seed(123)

split = sample.split(dataset$Salary, SplitRatio = 2/3)


training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)

# Fitting Simple Linear Regression to the Training set


regressor = lm(formula = Salary ~ YearsExperience,
data = training_set)

# Predicting the Test set results


y_pred = predict(regressor, newdata = test_set)

# Visualising the Training set results


library(ggplot2)
ggplot() +

geom_point(aes(x = training_set$YearsExperience,
y = training_set$Salary),
colour = 'red') +
geom_line(aes(x = training_set$YearsExperience,
y = predict(regressor, newdata = training_set)),
colour = 'blue') +
ggtitle('Salary vs Experience (Training set)') +
xlab('Years of experience') +
ylab('Salary')

# Visualising the Test set results


library(ggplot2)
ggplot() +
geom_point(aes(x = test_set$YearsExperience, y = test_set$Salary),
colour = 'red') +
geom_line(aes(x = training_set$YearsExperience,
y = predict(regressor, newdata = training_set)),
colour = 'blue') +
ggtitle('Salary vs Experience (Test set)') +
xlab('Years of experience') +
ylab('Salary')
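
The plots give only a visual impression of how close the predictions are. A minimal sketch of how the test-set error could also be measured numerically is given below. Since Salary_Data.csv is not included here, the data frame is a synthetic stand-in with made-up values, and the split uses base R sample() instead of caTools; both are assumptions made only to keep the example self-contained.

```r
# Synthetic stand-in for Salary_Data.csv (values are made up for illustration)
set.seed(123)
dataset <- data.frame(YearsExperience = round(runif(30, 1, 10), 1))
dataset$Salary <- 25000 + 9500 * dataset$YearsExperience + rnorm(30, sd = 5000)

# Base-R random split: 20 rows for training, 10 rows for testing
train_idx <- sample(seq_len(nrow(dataset)), size = 20)
training_set <- dataset[train_idx, ]
test_set <- dataset[-train_idx, ]

# Fit on the training set and predict on the test set
regressor <- lm(Salary ~ YearsExperience, data = training_set)
y_pred <- predict(regressor, newdata = test_set)

# Root mean squared error and R-squared on the held-out test set
rmse <- sqrt(mean((test_set$Salary - y_pred)^2))
r2 <- 1 - sum((test_set$Salary - y_pred)^2) /
  sum((test_set$Salary - mean(test_set$Salary))^2)
print(c(RMSE = rmse, R2 = r2))
```

A smaller RMSE and an R-squared closer to 1 indicate predictions closer to the actual test-set salaries.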

Viva Questions:
1. What is Linear Regression Analysis?
2. What is the use of the lm() function?
3. What is the use of the predict() function?
4. How did you visualize the test set results?
5. What is examined by regression analysis?


EXPT. No. - 8. MULTIPLE REGRESSION AND LOGISTIC REGRESSION ANALYSIS USING R


Aim: To study Multiple Regression and Logistic Regression Analysis and perform both analyses
on the mtcars data set.
Multiple regression extends linear regression to relationships among more than two variables. In
simple linear regression we have one predictor and one response variable, but in multiple regression we have
more than one predictor variable and one response variable.
The general mathematical equation for multiple regression is −
y = a + b1*x1 + b2*x2 + ... + bn*xn
Following is the description of the parameters used −
• y is the response variable.
• a, b1, b2, ..., bn are the coefficients.
• x1, x2, ..., xn are the predictor variables.
We create the regression model using the lm() function in R. The model determines the value of the
coefficients using the input data. Next we can predict the value of the response variable for a given set of
predictor variables using these coefficients.
lm() Function
This function creates the relationship model between the predictor and the response variable.

Syntax
The basic syntax for lm() function in multiple regression is −
lm(y ~ x1 + x2 + x3 ..., data)
Following is the description of the parameters used −
• formula (here y ~ x1 + x2 + x3 ...) is an expression giving the relation between the response variable and the predictor variables.
• data is the data frame on which the formula will be applied.

Create Relationship Model & get the Coefficients
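
The code for this step is not reproduced in the text. A minimal sketch, assuming the model predicts "mpg" from "disp", "hp" and "wt" in mtcars (the variables named in the equation that follows); the object names input, model, a, Xdisp, Xhp and Xwt are illustrative choices.

```r
# Select the response (mpg) and the three predictors from the built-in mtcars data
input <- mtcars[, c("mpg", "disp", "hp", "wt")]

# Fit the multiple regression model
model <- lm(mpg ~ disp + hp + wt, data = input)
print(model)

# Extract the intercept and the slope coefficients by name
a <- coef(model)[["(Intercept)"]]
Xdisp <- coef(model)[["disp"]]
Xhp <- coef(model)[["hp"]]
Xwt <- coef(model)[["wt"]]
print(c(a = a, Xdisp = Xdisp, Xhp = Xhp, Xwt = Xwt))
```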

Create Equation for Regression Model


Based on the intercept and coefficient values of the fitted model, we create the mathematical equation.
Y = a + Xdisp*x1 + Xhp*x2 + Xwt*x3
or
Y = 37.1055 + (-0.000937)*x1 + (-0.0311)*x2 + (-3.8008)*x3

Apply Equation for predicting New Values


We can use the regression equation created above to predict the mileage when a new set of values for
displacement, horsepower and weight is provided.
For a car with disp = 221, hp = 102 and wt = 2.91 the predicted mileage is −
Y = 37.1055 + (-0.000937)*221 + (-0.0311)*102 + (-3.8008)*2.91 = 22.6659
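
The same prediction can be obtained with predict() instead of substituting values by hand. The sketch below refits the model so that it is self-contained; new_car is an illustrative name. Because predict() uses the unrounded coefficients, its result can differ in the decimal places from a hand calculation with rounded values.

```r
# Refit the model on the built-in mtcars data
model <- lm(mpg ~ disp + hp + wt, data = mtcars)

# New observation to predict for
new_car <- data.frame(disp = 221, hp = 102, wt = 2.91)
by_predict <- predict(model, newdata = new_car)

# Same calculation written out with the unrounded coefficients
b <- coef(model)
by_hand <- b[["(Intercept)"]] + b[["disp"]] * 221 + b[["hp"]] * 102 + b[["wt"]] * 2.91

print(c(predict = unname(by_predict), by_hand = by_hand))
```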

The Logistic Regression


Logistic Regression is a regression model in which the response variable (dependent variable) has
categorical values such as True/False or 0/1. It models the probability of a binary response as a
function of the predictor variables.
The general mathematical equation for logistic regression is −
y = 1/(1 + e^-(a + b1*x1 + b2*x2 + b3*x3 + ...))
Following is the description of the parameters used −
• y is the response variable (the predicted probability that the response equals 1).
• x is the predictor variable.
• a and b are the coefficients which are numeric constants.
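
As a small illustration of the equation above, the sketch below writes the logistic function out directly and compares it with plogis(), the built-in equivalent in R. The helper name logistic and the sample values in z are illustrative.

```r
# Logistic (sigmoid) function written out as in the equation
logistic <- function(z) 1 / (1 + exp(-z))

# z stands for the linear part a + b1*x1 + b2*x2 + ...
z <- c(-4, -1, 0, 1, 4)
print(logistic(z))

# plogis() computes the same quantity
print(all.equal(logistic(z), plogis(z)))
```

Whatever the value of the linear part, the result always lies between 0 and 1, which is why it can be read as a probability.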
The function used to create the regression model is the glm() function.
Syntax
The basic syntax for glm() function in logistic regression is −
glm(formula,data,family)

Following is the description of the parameters used −
• formula is the expression giving the relationship between the variables.
• data is the data set giving the values of these variables.
• family is an R object that specifies the details of the model. Its value is binomial for logistic regression.

Create Regression Model


We use the glm() function to create the regression model and get its summary for analysis.
input <- mtcars[,c("am","cyl","hp","wt")]

am.data = glm(formula = am ~ cyl + hp + wt, data = input, family = binomial)

print(summary(am.data))
When we execute the above code, it produces the following result −
Call:
glm(formula = am ~ cyl + hp + wt, family = binomial, data = input)

Deviance Residuals:
     Min        1Q    Median        3Q       Max
-2.17272  -0.14907  -0.01464   0.14116   1.27641

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) 19.70288    8.11637   2.428   0.0152 *
cyl          0.48760    1.07162   0.455   0.6491
hp           0.03259    0.01886   1.728   0.0840 .
wt          -9.14947    4.15332  -2.203   0.0276 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 43.2297 on 31 degrees of freedom
Residual deviance:  9.8415 on 28 degrees of freedom
AIC: 17.841

Number of Fisher Scoring iterations: 8

Conclusion
In the summary, the p-values in the last column are greater than 0.05 for the variables "cyl" and "hp", so we
consider them insignificant in contributing to the value of the variable "am". Only weight (wt), with a
p-value of 0.0276, has a significant effect on the "am" value in this regression model.
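
Once the model is fitted, it can also return a predicted probability for each car. A minimal sketch follows; the 0.5 cut-off used to turn probabilities into classes is an assumption chosen for illustration, not something fixed by the model.

```r
input <- mtcars[, c("am", "cyl", "hp", "wt")]
am.data <- glm(am ~ cyl + hp + wt, data = input, family = binomial)

# type = "response" returns probabilities instead of log-odds
prob <- predict(am.data, type = "response")

# Classify with an assumed cut-off of 0.5
pred_class <- ifelse(prob > 0.5, 1, 0)

# Cross-tabulate actual against predicted transmission type
print(table(actual = input$am, predicted = pred_class))
```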

Viva Questions:

1. What is Multiple regression analysis?


2. What is Logistic regression?
3. How do you perform multiple regression analysis?
4. How do you perform logistic regression analysis?
5. What is the glm() function?

EXPT. No. - 9. NORMAL, POISSON AND BINOMIAL DISTRIBUTION USING R


Aim: Study of Normal, Poisson and Binomial Distribution in R.
NORMAL DISTRIBUTION:
In a random collection of data from independent sources, it is generally observed that the distribution of data is
normal. That is, on plotting a graph with the value of the variable on the horizontal axis and the count of
the values on the vertical axis, we get a bell-shaped curve. The center of the curve represents the mean of the data
set. In the graph, fifty percent of the values lie to the left of the mean and the other fifty percent lie to the right of
the mean. This is referred to as the normal distribution in statistics.
R has four built-in functions for working with the normal distribution. They are described below.
dnorm(x, mean, sd)
pnorm(x, mean, sd)
qnorm(p, mean, sd)
rnorm(n, mean, sd)
Following is the description of the parameters used in above functions −
• x is a vector of numbers.
• p is a vector of probabilities.
• n is the number of observations (sample size).
• mean is the mean of the distribution. Its default value is zero.
• sd is the standard deviation. Its default value is 1.
dnorm()
This function gives the height of the probability density function at each point for a given mean and standard deviation.
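
A minimal sketch of dnorm() in use; the mean of 2.5 and standard deviation of 0.5 are illustrative values.

```r
# Points from -10 to 10 in steps of 0.1
x <- seq(-10, 10, by = 0.1)

# Density of a normal distribution with mean 2.5 and sd 0.5
y <- dnorm(x, mean = 2.5, sd = 0.5)

# The curve is bell-shaped and peaks at the mean
plot(x, y, type = "l", main = "dnorm(x, mean = 2.5, sd = 0.5)")

# For the standard normal the peak height is 1/sqrt(2*pi)
print(dnorm(0))
```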

pnorm()
This function gives the probability that a normally distributed random number is less than or equal to a given
number. It is also called the "Cumulative Distribution Function".
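
A short sketch of pnorm() for the standard normal distribution (mean 0, sd 1, the defaults):

```r
# Half of the probability lies below the mean
print(pnorm(0))

# Probability of a value below 1.96 (about 0.975)
print(pnorm(1.96))

# Probability of falling within one standard deviation of the mean (about 0.683)
print(pnorm(1) - pnorm(-1))

# Upper-tail probability: P(X > 1.96)
print(pnorm(1.96, lower.tail = FALSE))
```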

qnorm()
This function takes a probability value and gives the number whose cumulative probability matches it.
It is the inverse of pnorm().
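
A short sketch showing qnorm() undoing pnorm(); the probabilities in p are illustrative.

```r
# Probabilities for which we want the matching quantiles
p <- c(0.025, 0.5, 0.975)

# Quantiles of the standard normal distribution
q <- qnorm(p)
print(q)

# Feeding the quantiles back into pnorm() recovers the probabilities
print(pnorm(q))
```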

rnorm()
This function is used to generate random numbers whose distribution is normal. It takes the sample size as
input and generates that many random numbers. We draw a histogram to show the distribution of the generated
numbers.
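
A minimal sketch of rnorm() with a histogram; the sample size of 50 and the seed are illustrative choices (the seed only makes the run repeatable).

```r
# Fix the seed so the same random numbers are drawn each run
set.seed(42)

# 50 random numbers from the standard normal distribution
y <- rnorm(50)

# Histogram of the generated values
hist(y, main = "Histogram of 50 rnorm() draws")

# Sample mean and sd should be roughly 0 and 1
print(c(mean = mean(y), sd = sd(y)))
```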

BINOMIAL DISTRIBUTION
The binomial distribution model deals with finding the probability of success of an event which has only two
possible outcomes in a series of experiments. For example, tossing a coin always gives a head or a tail. The
probability of getting exactly 3 heads when a coin is tossed 10 times is computed using the binomial
distribution.
R has four built-in functions for working with the binomial distribution. They are described below.
dbinom(x, size, prob)
pbinom(x, size, prob)
qbinom(p, size, prob)
rbinom(n, size, prob)
Following is the description of the parameters used −
• x is a vector of numbers.
• p is a vector of probabilities.
• n is the number of observations.
• size is the number of trials.
• prob is the probability of success of each trial.
dbinom()
This function gives the probability of each possible number of successes (the probability mass function).
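
A short sketch using the coin example from above: the probability of exactly 3 heads in 10 tosses of a fair coin.

```r
# P(exactly 3 heads in 10 tosses), which is choose(10, 3) / 2^10
print(dbinom(3, size = 10, prob = 0.5))

# Probabilities of 0 to 10 heads; together they sum to 1
x <- 0:10
probs <- dbinom(x, size = 10, prob = 0.5)
print(sum(probs))
plot(x, probs, type = "h", main = "dbinom(x, 10, 0.5)")
```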

pbinom()
This function gives the cumulative probability of an event, that is, the probability of at most x successes. It is a single value representing the probability.
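
A short sketch: the probability of 3 or fewer heads in 10 tosses of a fair coin, computed with pbinom() and checked against the sum of the individual probabilities.

```r
# P(at most 3 heads in 10 tosses)
cum <- pbinom(3, size = 10, prob = 0.5)
print(cum)

# Same value as adding up P(0), P(1), P(2) and P(3)
print(sum(dbinom(0:3, size = 10, prob = 0.5)))
```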
qbinom()
This function takes the probability value and gives a number whose cumulative value matches the probability
value.
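
A short sketch: qbinom() returns the smallest number of successes whose cumulative probability reaches the given value. The probability 0.5 and the 10 tosses are illustrative.

```r
# Smallest number of heads x such that P(X <= x) >= 0.5 in 10 fair tosses
med <- qbinom(0.5, size = 10, prob = 0.5)
print(med)

# Cumulative probabilities just below and at that value
print(pbinom(c(med - 1, med), size = 10, prob = 0.5))
```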
rbinom()
This function generates the required number of random values from a binomial distribution with the given number of trials and probability of success.
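
A minimal sketch of rbinom(); the 8 values, 150 trials, success probability of 0.4 and the seed are all illustrative.

```r
# Fix the seed so the run is repeatable
set.seed(1)

# 8 random values, each the number of successes in 150 trials with prob 0.4
x <- rbinom(8, size = 150, prob = 0.4)
print(x)
```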

POISSON DISTRIBUTION:
Poisson regression involves regression models in which the response variable is a count and not a
fractional number. For example, the number of births or the number of wins in a football match series.
The values of the response variable are also assumed to follow a Poisson distribution.
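
The Poisson distribution itself is available in R through dpois(), ppois(), qpois() and rpois(), which follow the same d/p/q/r pattern as the normal and binomial functions. A minimal sketch, with the rate lambda = 3 chosen only for illustration:

```r
# Average number of events per interval (illustrative value)
lambda <- 3

# P(X = 2): probability of exactly 2 events
print(dpois(2, lambda))

# P(X <= 2): cumulative probability
print(ppois(2, lambda))

# Smallest count whose cumulative probability reaches 0.5
print(qpois(0.5, lambda))

# 10 random counts from this distribution
set.seed(7)
print(rpois(10, lambda))
```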
The general mathematical equation for Poisson regression is −
log(y) = a + b1*x1 + b2*x2 + ... + bn*xn
Following is the description of the parameters used −
• y is the response variable.
• a and b are the numeric coefficients.
• x is the predictor variable.
The function used to create the Poisson regression model is the glm() function.
Syntax

The basic syntax for glm() function in Poisson regression is −


glm(formula,data,family)

Following is the description of the parameters used in the above function −

• formula is the expression giving the relationship between the variables.
• data is the data set giving the values of these variables.
• family is an R object that specifies the details of the model. Its value is poisson for Poisson regression.
Example
We have the built-in data set "warpbreaks" which describes the effect of wool type (A or B) and tension (low,
medium or high) on the number of warp breaks per loom. We take "breaks", a count of the number of breaks,
as the response variable. The wool type ("wool") and "tension" are taken as predictor variables.
Input Data
input <- warpbreaks
print(head(input))
When we execute the above code, it produces the following result −
  breaks wool tension
1     26    A       L
2     30    A       L
3     54    A       L
4     25    A       L
5     70    A       L
6     52    A       L
Create Regression Model
output <-glm(formula = breaks ~ wool+tension, data = warpbreaks,
family = poisson)
print(summary(output))
When we execute the above code, it produces the following result −
Call:
glm(formula = breaks ~ wool + tension, family = poisson, data = warpbreaks)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-3.6871  -1.6503  -0.4269   1.1902   4.2616

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  3.69196    0.04541  81.302  < 2e-16 ***
woolB       -0.20599    0.05157  -3.994 6.49e-05 ***
tensionM    -0.32132    0.06027  -5.332 9.73e-08 ***
tensionH    -0.51849    0.06396  -8.107 5.21e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)


    Null deviance: 297.37 on 53 degrees of freedom
Residual deviance: 210.39 on 50 degrees of freedom
AIC: 493.06

Number of Fisher Scoring iterations: 4


In the summary we look for a p-value of less than 0.05 in the last column to conclude that a predictor
variable has an effect on the response variable. Here wool type B and tension levels M and H all have p-values
well below 0.05, so each has a significant effect on the count of breaks relative to the baseline (wool type A, tension L).
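
Because the model is linear in log(y), exponentiating a coefficient gives its multiplicative effect on the expected count. The sketch below refits the model so that it is self-contained; new_loom is an illustrative name.

```r
output <- glm(breaks ~ wool + tension, data = warpbreaks, family = poisson)

# Multiplicative effect of each term on the expected number of breaks
# (a value below 1 means fewer breaks than the baseline)
print(round(exp(coef(output)), 3))

# Expected number of breaks for wool B at high tension
new_loom <- data.frame(
  wool = factor("B", levels = levels(warpbreaks$wool)),
  tension = factor("H", levels = levels(warpbreaks$tension))
)
expected <- predict(output, newdata = new_loom, type = "response")
print(expected)
```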

Viva Questions:

1. What do you understand by Normal distribution? Which functions are used for it in R?
2. What do you understand by Poisson distribution? Which functions are used for it in R?
3. What do you understand by Binomial distribution? Which functions are used for it in R?
4. What is the difference between the pnorm() and qnorm() functions?
5. What is the glm() function?

EXPT. No. - 10. Covariance and Time Series Analysis in R


Aim: Study of Covariance and Time Series Analysis in R.

ANALYSIS OF COVARIANCE:
We use regression analysis to create models which describe the effect of variation in predictor variables on the
response variable. Sometimes we also have a categorical variable with values like Yes/No or Male/Female.
A simple regression analysis then gives a separate result for each value of the categorical variable. In such a
scenario, we can study the effect of the categorical variable by using it along with the predictor variable and
comparing the regression lines for each level of the categorical variable. Such an analysis is termed
Analysis of Covariance, also called ANCOVA.
Example
Consider the R built-in data set mtcars. In it we observe that the field "am" represents the type of transmission
(0 = automatic, 1 = manual). It is a categorical variable with values 0 and 1. The miles per gallon value ("mpg") of a car can
also depend on it, besides the value of horsepower ("hp").
We study the effect of the value of "am" on the regression between "mpg" and "hp". It is done by using
the aov() function followed by the anova() function to compare the multiple regressions.

Input Data
Create a data frame containing the fields "mpg", "hp" and "am" from the data set mtcars. Here we take "mpg"
as the response variable, "hp" as the predictor variable and "am" as the categorical variable.
input <- mtcars[,c("am","mpg","hp")]
print(head(input))
When we execute the above code, it produces the following result −
                  am  mpg  hp
Mazda RX4          1 21.0 110
Mazda RX4 Wag      1 21.0 110
Datsun 710         1 22.8  93
Hornet 4 Drive     0 21.4 110
Hornet Sportabout  0 18.7 175
Valiant            0 18.1 105

ANCOVA Analysis
We create a regression model taking "hp" as the predictor variable and "mpg" as the response variable taking
into account the interaction between "am" and "hp".
Model with interaction between categorical variable and predictor variable
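
The code for this model is not reproduced in the text. A minimal sketch, assuming the formula mpg ~ hp * am (which expands to hp + am + hp:am) and the illustrative object name result:

```r
input <- mtcars[, c("am", "mpg", "hp")]

# hp * am fits both main effects and their interaction hp:am
result <- aov(mpg ~ hp * am, data = input)
print(summary(result))
```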

            Df Sum Sq Mean Sq F value   Pr(>F)
hp           1  678.4   678.4  77.391 1.50e-09 ***
am           1  202.2   202.2  23.072 4.75e-05 ***
hp:am        1    0.0     0.0   0.001    0.981
Residuals   28  245.4     8.8
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
This result shows that both horsepower and transmission type have a significant effect on miles per gallon, as the
p-value in both cases is less than 0.05. But the interaction between these two variables is not significant, as its
p-value is more than 0.05.
Model without interaction between categorical variable and predictor variable
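
The code for this model is likewise not reproduced in the text. A minimal sketch, assuming the formula mpg ~ hp + am (main effects only) and the illustrative object name result2:

```r
input <- mtcars[, c("am", "mpg", "hp")]

# hp + am fits the two main effects without the interaction term
result2 <- aov(mpg ~ hp + am, data = input)
print(summary(result2))
```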
The summary of this model shows that both horsepower and transmission type have a significant effect on miles
per gallon, as the p-value in both cases is less than 0.05.
Comparing Two Models
Now we can compare the two models to conclude whether the interaction of the variables is truly insignificant. For
this we use the anova() function.
As the p-value is greater than 0.05, we conclude that the interaction between horsepower and transmission type
is not significant. So the mileage per gallon depends in a similar manner on the horsepower of the car in
both automatic and manual transmission modes.
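
A minimal sketch of the comparison, refitting both models so that it is self-contained; the object names result1, result2 and comparison are illustrative.

```r
input <- mtcars[, c("am", "mpg", "hp")]

# Model with and without the hp:am interaction
result1 <- aov(mpg ~ hp * am, data = input)
result2 <- aov(mpg ~ hp + am, data = input)

# F-test of whether dropping the interaction makes the fit significantly worse
comparison <- anova(result1, result2)
print(comparison)
```

A p-value above 0.05 in the second row means the simpler model without the interaction fits about as well.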
TIME SERIES ANALYSIS:
A time series is a series of data points in which each data point is associated with a timestamp. A simple
example is the price of a stock in the stock market at different points of time on a given day. Another example
is the amount of rainfall in a region in different months of the year. R provides many functions to create,
manipulate and plot time series data. The data for a time series is stored in an R object called a time-series
object. It is also an R data object like a vector or data frame.
The time series object is created by using the ts() function.
Syntax
The basic syntax for ts() function in time series analysis is −
timeseries.object.name <- ts(data, start, end, frequency)
Following is the description of the parameters used −
• data is a vector or matrix containing the values used in the time series.
• start specifies the start time for the first observation in the time series.
• end specifies the end time for the last observation in the time series.
• frequency specifies the number of observations per unit time.
All parameters except "data" are optional.
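
A minimal sketch of creating and plotting a monthly time series. The rainfall values are hypothetical numbers made up for illustration, and the start date of January 2012 is an arbitrary choice.

```r
# Hypothetical monthly rainfall values for one year (made up for illustration)
rainfall <- c(80, 117, 86, 133, 63, 92, 68, 100, 78, 98, 88, 107)

# Monthly series starting in January 2012
rainfall.timeseries <- ts(rainfall, start = c(2012, 1), frequency = 12)
print(rainfall.timeseries)

# Line plot of the series against time
plot(rainfall.timeseries, main = "Monthly rainfall (hypothetical data)")
```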

Different Time Intervals


The value of the frequency parameter in the ts() function decides the time intervals at which the data points
are measured. A value of 12 indicates monthly observations, that is, 12 data points per year. Other values and
their meanings are listed below −
• frequency = 12 pegs the data points for every month of a year.
• frequency = 4 pegs the data points for every quarter of a year.
• frequency = 6 pegs the data points for every 10 minutes of an hour.
• frequency = 24*6 pegs the data points for every 10 minutes of a day.
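
A short sketch of a quarterly series to show the effect of frequency = 4; the sales values and the start year are hypothetical.

```r
# Hypothetical quarterly values covering two years
sales <- c(10, 12, 15, 11, 13, 14, 18, 12)

# Quarterly series starting in the first quarter of 2020
quarterly <- ts(sales, start = c(2020, 1), frequency = 4)
print(quarterly)

# cycle() shows the position of each observation within its year
print(cycle(quarterly))
```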
Multiple Time Series
We can plot multiple time series in one chart by combining both the series into a matrix.
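
A minimal sketch of this step. Both rainfall vectors are hypothetical values made up for illustration, and the column names region1 and region2 are illustrative.

```r
# Hypothetical monthly rainfall for two regions (made up for illustration)
rainfall1 <- c(80, 117, 86, 133, 63, 92, 68, 100, 78, 98, 88, 107)
rainfall2 <- c(65, 130, 94, 109, 76, 90, 81, 113, 62, 79, 103, 91)

# Combine the two series column-wise into a 12 x 2 matrix
combined <- matrix(c(rainfall1, rainfall2), nrow = 12)
colnames(combined) <- c("region1", "region2")

# One multi-column time series object
rainfall.timeseries <- ts(combined, start = c(2012, 1), frequency = 12)
print(rainfall.timeseries)

# plot() draws one panel per column for a multiple time series
plot(rainfall.timeseries, main = "Multiple Time Series")
```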


Viva Questions:

1. How can we create a time series object?
2. What is the anova() function?
3. What is covariance analysis?
4. What is time series analysis?
5. How can we perform time series analysis in R?
