R Language
SYLLABUS
‘R’ Language for Data Science (ABCA/IMCA)/
Programming using ‘R’ (MSDS)
UNIT – I
Introduction to R: R overview and history, Basic features of R, Benefits of R, data types in
R, Installing R, Getting started with the RStudio IDE, Running R, Packages in R, variable
names and assignment, operators, input/output functions, reading and writing data.
UNIT-II
Preview of Some Important R Data Structures: Vectors, Character Strings, Matrices,
Lists, Data Frames, and Classes.
Control structures: Conditional statements, Loops, dates and times functions, String
manipulations.
UNIT-III
VECTORS: Scalars, Vectors, Arrays and Matrices: Adding and Deleting Vector Elements,
Obtaining the Length of a Vector- Common vector operations: Arithmetic & logical
operations, Vector Indexing, Generating vector sequences with seq(), Repeating vector
constants with rep(), using all() and any() functions, Vectorized operations, NA and NULL
values.
UNIT-IV
MATRICES AND ARRAYS: Creating Matrices, General Matrix operations- linear algebra
operations, matrix indexing, filtering on matrices, using apply() function, Add and Delete
matrix rows and columns.
LISTS: Creating Lists, General List Operations, List Indexing, Adding and Deleting List
Elements, Getting the Size of a List, Accessing List Components and Values, Using lapply()
and sapply() functions.
UNIT-V
DATA FRAMES: Creating Data Frames, Accessing Data Frames - Other Matrix-Like
Operations: Extracting sub data frames, using rbind() and cbind() functions.
FACTORS AND TABLES : Factors and Levels - Common Functions Used with Factors :
tapply(), split() and by() - Working with Tables, Matrix/Array-Like Operations on Tables,
Extracting a Sub table - Math Functions: aggregate() and cut() functions.
Text Books:
1. The Art of R Programming by Norman Matloff, No Starch Press, San Francisco, 2011.
2. An Introduction to R for Beginners by Sasha Hafner, August 2019.
Reference Books:
1. R Programming for Dummies, Andrie de Vries and Joris Meys, Wiley
2. R for Data Science, Hadley Wickham, Garrett Grolemund, O’Reilly Media
3. R Programming: A Step-By-Step Guide for Absolute Beginners, 2nd Edition, Daniel Bell
4. Learn R programming in 1 Day, Krishna Rungta, Published by Guru99
UNIT-I
Introduction to ‘R’
What is R?
R is a popular interpreted programming language that is widely used for machine
learning, statistics, and data analysis. Objects, functions, and packages can easily
be created in R.
It is a platform-independent language. This means it runs on all major
operating systems.
It is a free, open-source language. This means anyone can install it in any
organization without purchasing a license.
R is an example of FLOSS (Free/Libre and Open Source Software): anyone can
freely distribute copies of the software, read its source code, modify it, and so on.
R – Overview & History
R is an interpreted programming language and software environment for statistical
analysis, graphics representation and reporting.
R is an implementation of S language.
R was created by Ross Ihaka and Robert Gentleman at the University of Auckland,
New Zealand in 1991, and is currently developed by the R Core Team.
Its name was inspired by the first letter of its authors' first names and is also a play on
the name of S. In 1995, R was released as an open-source project under the GNU General
Public License. The first stable version of R (1.0.0) was released in 2000.
Features of R
R Packages:
One of the major features of R is the wide availability of libraries. R has
CRAN (Comprehensive R Archive Network), a repository holding more than 10,000
packages.
Distributed Computing:
Distributed computing is a model in which components of a software system are
shared among multiple computers to improve efficiency and performance. Two new
packages ddR and multidplyr used for distributed programming in R were released in
November 2015.
Statistical Features of R
Basic Statistics:
The most common basic statistics terms are the mean, mode, and median. These are
all known as “Measures of Central Tendency.” So using the R language we can measure
central tendency very easily.
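As a minimal sketch of the point above: base R provides mean() and median() directly, but it has no built-in function for the statistical mode. The helper stat_mode below is a hypothetical name introduced only for this illustration, and the marks vector is made-up sample data.

```r
# Made-up sample data for illustration
marks <- c(10, 20, 20, 30, 45)

mean(marks)     # arithmetic mean
median(marks)   # middle value of the sorted data

# Hypothetical helper: returns the most frequent value
# (base R has no built-in function for the statistical mode)
stat_mode <- function(v) {
  counts <- table(v)                            # frequency of each distinct value
  as.numeric(names(counts)[which.max(counts)])  # value with the highest count
}
stat_mode(marks)
```

Note that the built-in mode() function in R returns the storage mode of an object (such as "numeric"), not the statistical mode, which is why a separate helper is used here.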
Static graphics:
R is rich with facilities for creating and developing interesting static graphics. R
contains functionality for many plot types including graphic maps, mosaic plots, biplots,
and the list goes on.
Probability distributions:
Probability distributions play a vital role in statistics and by using R we can easily
handle various types of probability distributions such as Binomial Distribution, Normal
Distribution, Chi-squared Distribution, and many more.
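The following sketch shows how these distributions can be queried with functions from the stats package, which is loaded by default in R. The variable names and the chosen parameter values are only illustrative assumptions.

```r
# Binomial: probability of exactly 3 heads in 5 tosses of a fair coin
p_heads <- dbinom(3, size = 5, prob = 0.5)
p_heads

# Normal: probability that a standard normal value is below 0
p_below <- pnorm(0, mean = 0, sd = 1)
p_below

# Chi-squared: probability of a value below 1 with 2 degrees of freedom
p_chisq <- pchisq(1, df = 2)
p_chisq

# Random sample of 5 values from a normal distribution
set.seed(1)   # fix the seed so the sample is repeatable
sample_values <- rnorm(5, mean = 50, sd = 10)
sample_values
```

In general, the prefix d gives the density, p the cumulative probability, q the quantile, and r random values for each distribution.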
Data analysis:
It provides a large, coherent, and integrated collection of tools for data analysis.
Open-source:
R is free and open-source software, so anyone can install, use, and modify it without
purchasing a license.
Mathematical and Statistical Calculations:
R can be used to perform simple and complex mathematical and statistical calculations
on data objects of a wide variety. It can also perform such operations on large data sets.
R is widely used for developing statistical tools. It provides many statistical techniques
such as statistical tests, classification, clustering, and data reduction.
Wide Selection of Libraries:
R's massive community support has resulted in a very large collection of libraries. R
is famous for its graphical libraries. These libraries support and enhance the R development
environment. R has libraries with a huge variety of applications.
Cross-platform Support:
R is platform-independent and runs on all major operating systems.
Data Handling and Data Wrangling:
R can perform operations on vectors, arrays, matrices, and various other data objects
of varying sizes. R can collect data from the internet through web scraping and other
means. It can also perform data cleansing, which is the process of detecting and removing
or correcting inaccurate or corrupt records. R is also useful for data wrangling, which is
the process of converting raw data into the desired format for easier consumption.
Powerful Graphics:
R has extensive libraries that can produce quality graphs and visualizations. With R it is
easy to draw graphs such as pie charts, histograms, box plots, and scatter plots.
Highly Active Community:
R has large community support, and the R community is very active. There are users from
all around the world to help and support you. Many of the latest ideas and technologies
appear in the R community.
Parallel and Distributed Computing:
Using libraries like ddR or multidplyr, R can process large data sets using parallel or
distributed computing.
Interpreted Language:
R is an interpreted language. This means that it does not need a compiler to turn
the code into an executable program. Instead, R interprets the provided code into
lower-level calls and pre-compiled code.
Compatibility with Other Languages:
While most of R's functions are written in R itself, C, C++, or FORTRAN can be used for
computationally heavy tasks. Other languages such as .NET, Java, and Python can also
manipulate R objects directly.
Machine Learning:
R can be useful for machine learning as well. Facebook does a lot of its machine
learning research with R. Sentiment analysis and mood prediction are among the tasks
done using R. The best use of R when it comes to machine learning is for exploration or
when building one-off models.
Database Connectivity:
R contains several packages that enable it to interact with databases. Some of these
packages are ROracle, RODBC (Open Database Connectivity protocol), RMySQL, etc.
Web Applications:
R also has a robust package called shiny, which can produce full-fledged web
apps. R can also be useful for developing software packages.
Installation of R
Installing R-Studio
RStudio is an IDE for R programming, available as open-source and
commercial software in Desktop and Server editions.
RStudio needs R itself to be installed on the system, so install R first (from CRAN)
and then install the RStudio IDE. To download RStudio, follow the steps below:
Step 1: Go to the link- https://www.rstudio.com/
Step 2: Download the RStudio installer for your system.
Step 3: Once the download completes, run the installer (the .exe file on Windows) and
complete the installation.
Step 4: Open the RStudio App and you will see that the entire window is divided into 4
panes as below.
Source window:
We write the source code here and run the whole script by clicking the Source button.
To run selected lines, select them and press Ctrl + Enter or click the Run button. To run a
single line, place the cursor on it and press Ctrl + Enter.
R Console:
R displays error logs, warnings, executed statements with their outputs in this pane.
Environment and History:
This pane consists of 3 tabs. The Environment tab displays all variables defined and
used in the R session. The History tab displays the statements executed in the R source and
Console. The Connections tab displays database and other external connection-related
information.
Files & Package Viewer:
This pane consists of 5 tabs. The Files tab displays the files in the current working
directory. The Plots tab displays graphs, charts created using R packages.
The Packages tab lists down installed packages. It also contains 2 buttons (install and
update). The Help tab displays the documentation of any package or function in R.
The Viewer tab displays web applications and maps that are created using R.
Note: In case any of the 4 panes are closed or hidden, Go to View -> Panes -> Show All
Panes to view all panes.
R Comments
Comments can be used to explain R code and to make it more readable. They can also
be used to prevent execution when testing alternative code.
Comments start with a #. When executing R code, R ignores everything from the # to
the end of the line.
This example uses a comment before a line of code:
Example:
# This is a comment
"Hello World!"
This example uses a comment at the end of a line of code:
Example:
"Hello World!" # This is a comment
Comments do not have to be text that explains the code; they can also be used to prevent
R from executing a line of code.
Multiline Comments:
Unlike some other programming languages, R has no syntax for multiline
comments. However, we can insert a # at the start of each line to create multiline comments:
Example:
# This is a comment
# written in
# more than just one line
"Hello World!"
R package & Libraries
R packages are groups of functions bundled together. These functions are pre-compiled
and are used in R scripts by loading the package first. We can find the list of installed
packages in the Packages tab in the bottom right window.
To install a package, use:
install.packages([package-name])
Example:
install.packages(c("vioplot", "MASS"))
By default, RStudio installs the packages from CRAN Repository. We can use the
functions by loading the package into memory.
library([package-name])
To check what packages are installed on your computer, type this command:
installed.packages()
To update the installed packages, type this command:
update.packages()
To install a new package, type this command:
install.packages("PACKAGE NAME")
People often confuse packages and libraries, and libraries are frequently called
packages.
Package: A collection of R functions, data, and documentation bundled together.
Library: The place where installed packages are stored, usually a folder on our computer.
library(): The command used to load a package from the library into memory.
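The commands above can be tried without installing anything new. The sketch below assumes only packages that are distributed with R itself (tools and stats); the variable names pkg_names and has_stats are illustrative.

```r
# Load a package that is installed together with R
library(tools)

# Names of all packages installed on this computer
pkg_names <- rownames(installed.packages())
head(pkg_names)

# Check whether a package is available without attaching it
has_stats <- requireNamespace("stats", quietly = TRUE)
has_stats

# Folders (libraries) in which installed packages are stored
.libPaths()
```

Calling install.packages() is needed only once per computer for each package, whereas library() must be called in every new R session that uses the package.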
List of R packages:
1) tidyr
The word tidyr comes from the word tidy, which means clear. So the tidyr package is
used to make the data 'tidy'. This package works well with dplyr. This package is an
evolution of the reshape2 package.
2) ggplot2
R allows us to create graphics declaratively. R provides the ggplot2 package for this
purpose. This package is famous for its elegant, high-quality graphs, which set it apart from
other visualization packages.
3) ggraph
4) dplyr
R allows us to perform data wrangling and data analysis. R provides the dplyr library
for this purpose. This library provides several functions for working with data frames in R.
5) tidyquant
The tidyquant is a financial package which is used for carrying out quantitative
financial analysis. This package adds to the tidyverse universe as a financial package which
is used for importing, analyzing and visualizing the data.
6) dygraphs
The dygraphs package provides an interface to the dygraphs JavaScript library, which we
can use for charting. This package is essentially used for plotting time-series data in R.
7) leaflet
For creating interactive maps, R provides the leaflet package. This package is an R
interface to Leaflet, an open-source JavaScript library. Popular websites such as The New
York Times, GitHub, and Flickr use Leaflet. The leaflet package makes it easy to create
such maps from R.
8) ggmap
9) glue
R provides the glue package for building strings. This package evaluates R expressions
that are embedded within a string and inserts their values into the result.
10) shiny
11) plotly
The plotly package provides online interactive and quality graphs. This package
extends upon the JavaScript library -plotly.js.
12) tidytext
The tidytext package provides various functions of text mining for word processing
and carrying out analysis through ggplot, dplyr, and other miscellaneous tools.
13) stringr
The stringr package provides simple and consistent wrappers around the
'stringi' package. The stringi package facilitates common string operations.
14) reshape2
This package facilitates flexible reorganization and aggregation of data using the melt()
and dcast() functions.
15) dichromat
16) digest
The digest package is used for the creation of cryptographic hashes of R
objects.
17) MASS
18) caret
19) e1071
The e1071 library provides useful functions which are essential for data analysis like
Naive Bayes, Fourier Transforms, SVMs, Clustering, and other miscellaneous functions.
20) sentimentr
The sentimentr package provides functions for carrying out sentiment analysis. It is used
to calculate text polarity at the sentence level and to perform aggregation by rows or
grouping variables.
Variables
Variables are containers for storing data values. A variable is the name of a memory
location where data is stored; in other words, we access the data in memory through the
variable. A variable in R can store numeric values, complex values, words, matrices, and
even a table.
Creating Variables in R :
In R, we can assign variables using any of the following syntaxes.
Name = "ABC"
Age <- 30
"VSU" -> Name
Example:
name <- "Raja"
age <- 20
Multiple Variables :
R allows you to assign the same value to multiple variables in one line.
Example:
# Assign the same value to multiple variables in one line
var1 <- var2 <- var3 <- "Mango"
# Print variable values
var1
var2
var3
Data Types
The kind of data a variable holds is known as its data type. Variables can store data of
different types, and different types can do different things. R has a variety of data types
and object classes.
In R, variables do not need to be declared with any particular type, and can even
change type after they have been set:
Example:
my_var <- 30 # my_var is of type numeric
my_var <- "Madhumika" # my_var is now of type character (String)
Basic Data Types
Basic data types in R can be divided into the following types:
numeric - (10.5, 55, 787)
integer - (1L, 55L, 100L, where the letter "L" declares this as an integer)
complex - (9 + 3i, where "i" is the imaginary part)
character (string) - ("k", "R is exciting", "FALSE", "11.5")
logical (boolean) - (TRUE or FALSE)
We can use the class() function to check the data type of a variable:
Example:
# numeric
x <- 10.5
class(x)
# integer
x <- 1000L
class(x)
# complex
x <- 9i + 3
class(x)
# character/string
x <- "R is exciting"
class(x)
# logical/boolean
x <- TRUE
class(x)
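For reference, the sketch below repeats the class() checks with the value each call returns written as a comment. It also shows that a whole number written without the L suffix is numeric rather than integer. The variable names are illustrative only.

```r
num_value <- 10.5
class(num_value)        # "numeric"

whole_value <- 55
class(whole_value)      # "numeric" (no L suffix, so not an integer)
is.integer(whole_value) # FALSE

int_value <- 1000L
class(int_value)        # "integer"

cplx_value <- 9i + 3
class(cplx_value)       # "complex"

text_value <- "R is exciting"
class(text_value)       # "character"

flag_value <- TRUE
class(flag_value)       # "logical"
```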
Numbers :
There are three number types in R:
numeric
integer
complex
Variables of number types are created when you assign a value to them:
Example:
x <- 10.5 # numeric
y <- 10L # integer
z <- 1i # complex
Numeric :
A numeric data type is the most common type in R, and contains any number with or
without a decimal, like: 10.5, 55, 787:
Example:
x <- 10.5
y <- 55
# Print values of x and y
x
y
x <- 1L # integer
y <- 2 # numeric
Operators
Operators are symbols that direct the interpreter to perform various kinds of
operations on operands. Operators carry out the mathematical, logical, and
decision operations performed on operands such as complex numbers, integers, and
numeric values.
There are mainly 4 types of operators in R, as seen below:
Arithmetic Operators:
These operators help us perform the basic arithmetic operations like addition,
subtraction, multiplication, etc.
Name            Operator   Description                                               Example
Addition        +          Returns the sum of the operands                           a = 1; b = 2; c = a+b; c = 3
Subtraction     -          Returns the difference of the operands                    a = 5; b = 2; c = a-b; c = 3
Multiplication  *          Returns the product of the operands                       a = 3; b = 2; c = a*b; c = 6
Division        /          Divides the left operand by the right operand             a = 10; b = 2; c = a/b; c = 5
Modulo          %%         Remainder from division of the first operand by the second   a = 11; b = 3; c = a %% b; c = 2
Exponent        ^ or **    Performs exponential (power) calculation on the operands  a = 3; b = 2; c = a**b; c = 9
# R Arithmetic Operators Example for numeric values
a <- 7.5
b <- 2
print ( a+b ) #addition
print ( a-b ) #subtraction
print ( a*b ) #multiplication
print ( a/b ) #Division
print ( a%%b ) #Remainder
print ( a^b ) #Power of
Output:
[1] 9.5
[1] 5.5
[1] 15
[1] 3.75
[1] 1.5
[1] 56.25
Relational Operators:
These operators help us perform the relational operations like checking if a variable is
greater than, lesser than or equal to another variable. The output of a relational operation is
always a logical value.
Name                      Operator   Description                                                   Example
Equal to                  ==         Returns TRUE if both operands are equal                       a = 1; b = 2; a==b; FALSE
Not equal to              !=         Returns TRUE if both operands are not equal                   a = 5; b = 2; a!=b; TRUE
Greater / Lesser than     > and <    Returns TRUE if the left operand is greater than the right    a = 3; b = 2; a>b; TRUE
                                     operand, and vice versa
Greater than or equal to  >=         Returns TRUE if the left operand is greater than or equal     a = 3; b = 2; a>=b; TRUE
                                     to the right operand
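The relational operators can be tried directly in the console. The short sketch below uses the same values as the table and adds one vector comparison; the names v and res are illustrative.

```r
a <- 3
b <- 2
a == b   # FALSE
a != b   # TRUE
a > b    # TRUE
a < b    # FALSE
a >= b   # TRUE
a <= b   # FALSE

# Relational operators also work element by element on vectors
v <- c(1, 5, 10)
res <- v > 4
res      # FALSE TRUE TRUE
```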
Miscellaneous Operators:
These operators do not fall into any of the categories mentioned above. They
are used for specific purposes, not for logical computation.
> write.table(xc,"xcnew",row.names=F,col.names=F)
UNIT-II
Data Structures
Data structures are used to store and organize information. In R, we do not need to
declare a variable as some data type. Variables are assigned R-objects, and the data type
of the R-object becomes the data type of the variable. There are mainly six data structures
in R:
Vector:
A vector is a sequence of data elements of the same basic type.
Example:
vtr = c(1, 3, 5, 7, 9) or
vtr <- c(1, 3, 5, 7, 9)
The five commonly used classes of atomic vectors are logical, integer, numeric, complex,
and character.
List:
Lists are the R objects which contain elements of different types like − numbers,
strings, vectors and another list inside it.
>n = c(2, 3, 5)
>s = c("aa", "bb", "cc", "dd", "ee")
>x = list(n, s, TRUE)
>x
Output:
[[1]]
[1] 2 3 5
[[2]]
[1] "aa" "bb" "cc" "dd" "ee"
[[3]]
[1] TRUE
Arrays:
Arrays are the R data objects which can store data in more than two dimensions. It
takes vectors as input and uses the values in the dim parameter to create an array.
vector1 <- c(5,9,3)
vector2 <- c(10,11,12,13,14,15)
result <- array(c(vector1,vector2),dim = c(3,3,2))
print(result)
Output –
,,1
[,1] [,2] [,3]
[1,] 5 10 13
[2,] 9 11 14
[3,] 3 12 15
,,2
[,1] [,2] [,3]
[1,] 5 10 13
[2,] 9 11 14
[3,] 3 12 15
Matrices:
Matrices are the R objects in which the elements are arranged in a two-dimensional
rectangular layout. A Matrix is created using the matrix() function.
Syntax:
matrix(data, nrow, ncol, byrow, dimnames) where,
data is the input vector which becomes the data elements of the matrix.
nrow is the number of rows to be created.
ncol is the number of columns to be created.
byrow is a logical value. If TRUE, the input vector elements are arranged by row.
dimnames is the names assigned to the rows and columns.
>Mat <- matrix(c(1:16), nrow = 4, ncol = 4 )
>Mat
Output :
[,1] [,2] [,3] [,4]
[1,] 1 5 9 13
[2,] 2 6 10 14
[3,] 3 7 11 15
[4,] 4 8 12 16
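The matrix above is filled column by column because byrow defaults to FALSE. A small sketch of the byrow and dimnames parameters follows; the matrix name Mat2 and the row and column labels are made up for illustration.

```r
# Fill the matrix row by row and label the rows and columns
Mat2 <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE,
               dimnames = list(c("r1", "r2"), c("c1", "c2", "c3")))
Mat2

# With byrow = TRUE the first row is 1 2 3 and the second row is 4 5 6
Mat2["r2", "c1"]   # 4

# Number of rows and columns
dim(Mat2)          # 2 3
```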
Factors:
Factors are the data objects which are used to categorize the data and store it as
levels. They can store both strings and integers. They are useful in data analysis for
statistical modeling.
>data <- c("East","West","East","North","North","East","West","West","East")
>factor_data <- factor(data)
>factor_data
Output :
[1] East West East North North East West West East
Levels: East North West
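To see how the levels are stored, the following sketch inspects the same factor with a few base R functions. It is only an illustration of the idea that a factor keeps its distinct values as levels and stores each element as an integer code.

```r
data <- c("East","West","East","North","North","East","West","West","East")
factor_data <- factor(data)

levels(factor_data)       # "East" "North" "West"
nlevels(factor_data)      # 3

# Number of observations in each level: East 4, North 2, West 3
level_counts <- table(factor_data)
level_counts

# Internal integer codes (East = 1, North = 2, West = 3)
as.integer(factor_data)
```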
Data Frames:
A data frame is a table or a two-dimensional array-like structure in which each column
contains values of one variable and each row contains one set of values from each column.
>std_id = c (1:5)
>std_name = c("Raja","Maneesh","ABC","Jessica","Sumaya")
>marks = c(623.3,515.2,611.0,729.0,843.25)
>std.data <- data.frame(std_id, std_name, marks)
>std.data
Output :
std_id std_name marks
1 1 Raja 623.30
2 2 Maneesh 515.20
3 3 ABC 611.00
4 4 Jessica 729.00
5 5 Sumaya 843.25
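Once a data frame exists, its columns and rows can be accessed in several ways. The sketch below rebuilds the same data frame and shows a few common operations; the cut-off of 620 marks and the name top_students are arbitrary choices for illustration.

```r
std.data <- data.frame(
  std_id   = 1:5,
  std_name = c("Raja", "Maneesh", "ABC", "Jessica", "Sumaya"),
  marks    = c(623.3, 515.2, 611.0, 729.0, 843.25)
)

std.data$marks    # one column as a vector
std.data[2, ]     # the second row
nrow(std.data)    # number of rows: 5
ncol(std.data)    # number of columns: 3

# Rows of students with marks above 620
top_students <- std.data[std.data$marks > 620, ]
top_students

str(std.data)     # compact summary of the structure
```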
String Literals
Characters, or strings, are used for storing text. A string is surrounded by either
single quotation marks or double quotation marks:
"hello" is the same as 'hello':
Example:
"hello"
'hello'
Assign a String to a Variable:
Assigning a string to a variable is done with the variable followed by the <- operator
and the string:
Example:
str <- "Hello"
str # print the value of str
Multiline Strings:
You can assign a multiline string to a variable like this:
Example:
str <- "R is wonderful language,
R used for graphics,
R used for Statistical Representations."
nchar(str)
Check a String:
Use the grepl() function to check if a character or a sequence of characters is
present in a string:
Example:
str <- "Hello World!"
grepl("H", str)
grepl("Hello", str)
grepl("X", str)
Combine Two Strings:
Use the paste() function to merge/concatenate two strings:
Example:
str1 <- "Hello"
str2 <- "World"
> paste(str1, str2)
[1] "Hello World"
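By default, paste() places a single space between the strings it joins. The sketch below shows the related options in base R: the sep argument, the paste0() function, and the collapse argument. The result variable names are illustrative.

```r
str1 <- "Hello"
str2 <- "World"

with_space <- paste(str1, str2)              # "Hello World"
with_dash  <- paste(str1, str2, sep = "-")   # "Hello-World"
no_space   <- paste0(str1, str2)             # "HelloWorld"

# collapse joins the elements of one vector into a single string
joined <- paste(c("a", "b", "c"), collapse = "+")   # "a+b+c"

with_space
with_dash
no_space
joined
```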
Escape Characters:
To insert characters that are illegal in a string, you must use an escape character.
An escape character is a backslash \ followed by the character you want to insert.
An example of an illegal character is a double quote inside a string that is surrounded
by double quotes:
Example:
str <- "We are the Data Science "Professionals", from VSU."
> str
Result:
Error: unexpected symbol in "str <- "We are the Data Science "Professionals"
To fix this problem, use the escape character \":
Example:
str <- "We are the Data Science \"Professionals\", from VSU."
> str
> cat(str)
Note that auto-printing the str variable will print the backslash in the output. You can use
the cat() function to print it without backslash.
Other escape characters in R:
\\ Backslash
\n Newline
\r Carriage Return
\t Tab
\b Back Space
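A short sketch of these escape characters in use follows. The string stored in msg is made up for illustration. Each escape sequence counts as a single character, which nchar() confirms.

```r
# \t inserts a tab, \n starts a new line, \" inserts a double quote
msg <- "Name:\tRaja\nCourse:\t\"R\" Language"

print(msg)   # auto-printing style: escape sequences are shown with backslashes
cat(msg)     # cat() interprets the escape sequences when writing the text

nchar("\t")  # 1
nchar("\\")  # 1 (a single backslash)
```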
Control Structures
Flow control statements play a very important role, as they allow you to control the
flow of execution of a script or a function. The most commonly used flow control
statements are described below.
Conditional statements:
R supports four conditional statements:
If
If Else
If Else If
Switch
If Statement:
If the condition is true, the code in the if body is executed; otherwise the if body is
skipped. In both cases, execution continues with the statements that come after the if body.
Syntax:
if (condition) {
statements
}
Example:
a <- 20
b <- 30
if (b > a) {
print ("b is greater than a")
}
Output:
[1] "b is greater than a"
If Else Statement:
If the condition is true, the If code is executed; otherwise the Else code is executed.
Execution then continues with the statements that come after the if-else body.
Syntax:
if(condition) {
statements
} else {
statements
}
Example:
a <- 20
b <- 13
if (b > a) {
print ("b is greater than a")
} else {
print("b is not greater than a")
}
Output:
[1] "b is not greater than a"
If Else If Statement:
If the first condition is true, the If code is executed; otherwise the second condition
is checked. If the second condition is true, the Else If code is executed; otherwise the
Else code is executed. Execution then continues with the statements that come after the
if-else-if body.
Syntax:
if(condition) {
statements
} else if (condition){
statements
}else {
statements
}
Example:
a <- 20
b <- 13
if (b == a) {
print("b is equal to a")
} else if (b > a) {
print ("b is greater than a")
} else {
print("b is not greater than a")
}
Output:
[1] "b is not greater than a"
Switch statement:
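A switch is another conditional statement used in R. It selects one item from a list of
alternatives based on the value of an expression. If the expression is a number, switch
returns the item at that position in the list. If statements are generally preferred over
switch statements. The basic syntax of the switch statement is:
Syntax:
switch (expression, list)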
Example:
y <- 3
x <- switch(
y,
"Good Morning",
"Good Afternoon",
"Good Evening",
"Good Night"
)
print(x)
Output:
[1] "Good Evening"
Loops
Looping statements save the user from writing the same task multiple times. These
statements execute a segment of code repeatedly until a stopping condition is met. R
supports 3 looping statements:
For
While
Repeat
For Loop:
For loop is the most common looping statement used for repeating a task. A for loop
executes statements for a known number of times. Define a for loop using the following
syntax:
Syntax:
for(var in range){
statements
}
Example:
for(x in 1:10){
print(x)
}
Output:
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
[1] 6
[1] 7
[1] 8
[1] 9
[1] 10
While Loop:
A while loop repeats a statement or group of statements as long as the condition is
true. It tests the condition before executing the loop body. A while loop is created using
the following syntax:
Syntax:
while(condition) {
statements
}
Example:
a=5
while(a>0){
a = a-1
print(a)
}
Output:
[1] 4
[1] 3
[1] 2
[1] 1
[1] 0
Repeat:
A repeat loop in R executes the same code again and again until a break statement is
found.
A repeat loop does not use a condition to exit the loop. Instead, it looks for
a break statement; without one, the loop runs forever. Create a repeat loop using
the following syntax:
Syntax:
repeat
{
statements
}
Here, the statements are executed repeatedly. To terminate the loop, we must use a
break statement within the loop.
Example:
m=5
repeat {
m = m+2
print(m)
if(m>15) {
break
}
}
Output:
[1] 7
[1] 9
[1] 11
[1] 13
[1] 15
[1] 17
Control statements
R – Language supports the following control statements,
Break:
A break statement is used to stop or terminate the execution of a loop. When
the break statement is encountered inside a loop, the loop is immediately terminated and
program control resumes at the next statement following the loop. In R, break is used
inside for, while, and repeat loops. The syntax of the break statement is:
Syntax:
break
Example:
m=5
repeat {
m = m+2
print(m)
if(m>15) {
break
}
}
Output:
[1] 7
[1] 9
[1] 11
[1] 13
[1] 15
[1] 17
Next:
The next statement is used to skip the current iteration of a loop without terminating
or ending it. The syntax of the next statement is –
Syntax: next
Example:
for(i in c(1:6)) {
if(i == 3) {
next
}
print(i)
}
Output:
[1] 1
[1] 2
[1] 4
[1] 5
[1] 6
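The two control statements can also be combined in one loop. The sketch below is an illustration with made-up conditions: next skips the even numbers, and break stops the loop at the first odd number above 7. The vector name collected is hypothetical.

```r
collected <- c()   # values that pass both checks

for (i in 1:10) {
  if (i %% 2 == 0) {
    next           # skip even numbers and continue with the next i
  }
  if (i > 7) {
    break          # leave the loop entirely
  }
  collected <- c(collected, i)
}

collected   # 1 3 5 7
```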
Functions
A function is a set of statements that performs a specific task. R has built-in functions
and also allows users to create their own functions. A function performs a task and
returns a result into a variable or prints the output in the console. There are mainly two
types of functions in R: built-in functions and user-defined functions.
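As a minimal sketch of built-in and user-defined functions: sum() is a function that ships with R, while circle_area and greet are hypothetical names written here only for illustration.

```r
# Built-in function: already available in R
total <- sum(c(1, 2, 3))
total

# User-defined function: created with the function keyword
circle_area <- function(radius) {
  pi * radius^2   # the value of the last expression is returned
}
area <- circle_area(2)
area

# A function argument can have a default value
greet <- function(name = "World") {
  paste("Hello", name)
}
greet()         # "Hello World"
greet("Raja")   # "Hello Raja"
```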
UNIT-III
VECTORS
Scalars in R:
A scalar data structure is the most basic data type that holds only a single atomic
value at a time. Using scalars, more complex data types can be constructed. The most
commonly used scalar types in R are:
Numeric
Character
Integer
Logical
Complex
Vectors:
A vector is simply a list of items that are of the same type.
To combine the list of items to a vector, use the c() function and separate the items
by a comma.
In the example below, we create a vector variable called fruits, that combines strings:
Example:
# Vector of strings
> fruits <- c("banana", "apple", "orange")
# Print fruits
> fruits
[1] "banana" "apple" "orange"
In this example, we create a vector that combines numerical values:
Example:
# Vector of numerical values
> numbers <- c(1, 2, 3)
# Print numbers
> numbers
[1] 1 2 3
To create a vector with numerical values in a sequence, use the : operator:
Example:
# Vector with numerical values in a sequence
> numbers <- 1:10
> numbers
[1] 1 2 3 4 5 6 7 8 9 10
You can also create numerical values with decimals in a sequence, but note that if the
last element does not belong to the sequence, it is not used:
Example:
# Vector with numerical decimals in a sequence
> numbers1 <- 1.5:6.5
> numbers1
[1] 1.5 2.5 3.5 4.5 5.5 6.5
# Vector with numerical decimals in a sequence where the last element is not used
> numbers2 <- 1.5:6.3
> numbers2
[1] 1.5 2.5 3.5 4.5 5.5
In the example below, we create a vector of logical values:
Example:
# Vector of logical values
> log_values <- c(TRUE, FALSE, TRUE, FALSE)
> log_values
[1] TRUE FALSE TRUE FALSE
Deleting a Vector:
Deletion of a Vector is the process of deleting all of the elements of the vector. This
can be done by assigning it to a NULL value.
# R program to delete a Vector
# Creating a Vector
M <- c(8, 10, 2, 5)
# set NULL to the vector
M <- NULL
print(M)
Output:
NULL
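Individual elements can also be added to or removed from a vector without deleting the whole vector. The sketch below uses c(), append(), and negative indexing from base R; the values are arbitrary.

```r
v <- c(8, 10, 2, 5)

v <- c(v, 7)                   # add at the end:            8 10 2 5 7
v <- append(v, 99, after = 2)  # insert after 2nd element:  8 10 99 2 5 7
v <- v[-1]                     # delete the first element:  10 99 2 5 7
v <- v[v != 99]                # delete by value:           10 2 5 7

v
length(v)   # 4
```

Vectors in R have a fixed size once created, so each of these steps actually builds a new vector and assigns it back to v.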
Obtaining the Length of a Vector
To find out how many items a vector has, use the length() function:
Example:
>fruits <- c("banana", "apple", "orange")
> length(fruits)
[1] 3
Example:
> x <- c(1,2,4,6)
> length(x)
[1] 4
Sort a Vector:
To sort items in a vector alphabetically or numerically, use the sort() function:
Example:
>fruits <- c("banana", "apple", "orange", "mango", "lemon")
> sort(fruits) # Sort a string
[1] "apple" "banana" "lemon" "mango" "orange"
> numbers <- c(13, 3, 5, 7, 20, 2)
> sort(numbers) # Sort numbers
[1] 2 3 5 7 13 20
Common Vector Operations
Some common operations related to vectors are arithmetic and logical operations,
vector indexing, and some useful ways to create vectors.
Vector Arithmetic and Logical Operations:
Remember that R is a functional language. Every operator, including + in the following
example, is actually a function.
> 2+3
[1] 5
> "+"(2,3)
[1] 5
Recall further that scalars are actually one-element vectors. So, we can add vectors,
and the + operation will be applied element-wise.
> x <- c(1,2,4)
> x + c(5,0,-1)
[1] 6 2 3
If you are familiar with linear algebra, you may be surprised at what happens when
we multiply two vectors.
> x * c(5,0,-1)
[1] 5 0 -4
But remember, because of the way the * function is applied, the multiplication is done
element by element. The first element of the product (5) is the result of the first element of
x (1) being multiplied by the first element of c(5,0,-1) (5), and so on.
The same principle applies to other numeric operators. Here’s an example:
> x <- c(1,2,4)
> x / c(5,4,-1)
[1] 0.2 0.5 -4.0
> x %% c(5,4,-1)
[1] 1 2 0
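Returning to the multiplication example: readers who expected a linear algebra product can obtain the dot (inner) product either by summing the element-wise products or with the %*% matrix multiplication operator. This is a brief sketch; the name y for the second vector is introduced only here.

```r
x <- c(1, 2, 4)
y <- c(5, 0, -1)

x * y             # element-wise product: 5 0 -4

dot1 <- sum(x * y)   # dot product as a plain number: 1
dot1

dot2 <- x %*% y      # dot product as a 1 x 1 matrix
dot2
```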
Vector Indexing
Access Vectors:
You can access the vector items by referring to its index number inside brackets [].
The first item has index 1, the second item has index 2, and so on:
Example:
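>fruits <- c("banana", "apple", "orange")
> fruits[1] # Access the first item
[1] "banana"
Repeating Vectors:
To repeat vectors, use the rep() function:
Example:
Repeat each value:
>repeat_each <- rep(c(1,2,3), each = 3)
> repeat_each
[1] 1 1 1 2 2 2 3 3 3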
Example:
Repeat the sequence of the vector:
>repeat_times <- rep(c(1,2,3), times = 3)
> repeat_times
[1] 1 2 3 1 2 3 1 2 3
Example:
Repeat each value independently:
>repeat_indepent <- rep(c(1,2,3), times = c(5,2,1))
> repeat_indepent
[1] 1 1 1 1 1 2 2 3
Using all() and any()
The any() and all() functions are handy shortcuts. They report whether any or all of
their arguments are TRUE.
> x <- 1:10
> any(x > 8)
[1] TRUE
> any(x > 88)
[1] FALSE
> all(x > 88)
[1] FALSE
> all(x > 0)
[1] TRUE
UNIT-IV
MATRICES AND ARRAYS
Matrices:
A matrix is a vector with two additional attributes: the number of rows and the
number of columns. Since matrices are vectors, they also have modes, such as numeric and
character.
Creating Matrices:
A matrix is a two dimensional data set with columns and rows.
A column is a vertical representation of data, while a row is a horizontal
representation of data.
A matrix can be created with the matrix() function. Use
the nrow and ncol parameters to set the number of rows and columns.
Example:
# Create a matrix
> a <- matrix(c(1,2,3,4,5,6), nrow = 3, ncol = 2)
# Print the matrix
> a
Note: Remember the c() function is used to concatenate items together.
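The example above does not show how the six values are laid out. As a small sketch: matrix() fills column by column unless byrow = TRUE is given. The name b is introduced here for comparison only.

```r
a <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 3, ncol = 2)                # filled down the columns
b <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 3, ncol = 2, byrow = TRUE)  # filled along the rows

print(a)   # first column holds 1, 2, 3
print(b)   # first row holds 1, 2
```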
You can also create a matrix with strings:
Example:
> fruits <- matrix(c("apple", "banana", "cherry", "orange"), nrow = 2, ncol = 2)
> fruits
Access Matrix Items:
You can access the items by using [ ] brackets. The first number "1" in the bracket
specifies the row-position, while the second number "2" specifies the column-position:
Example:
thismatrix <- matrix(c("apple", "banana", "cherry", "orange"), nrow = 2, ncol = 2)
thismatrix[1, 2]
The whole row can be accessed if you specify a comma after the number in the
bracket:
Example:
thismatrix <- matrix(c("apple", "banana", "cherry", "orange"), nrow = 2, ncol = 2)
thismatrix[2,]
The whole column can be accessed if you specify a comma before the number in the
bracket:
Example:
thismatrix <- matrix(c("apple", "banana", "cherry", "orange"), nrow = 2, ncol = 2)
thismatrix[,2]
Access More Than One Row:
More than one row can be accessed if you use the c() function:
Example:
thismatrix <-
matrix(c("apple", "banana", "cherry", "orange","grape", "pineapple", "pear", "melon", "fig"),
nrow = 3, ncol = 3)
> thismatrix[c(1,2),]
Access More Than One Column:
More than one column can be accessed if you use the c() function:
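Example:
thismatrix <-
matrix(c("apple", "banana", "cherry", "orange","grape", "pineapple", "pear", "melon", "fig"),
nrow = 3, ncol = 3)
> thismatrix[, c(1,2)]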
Matrix Indexing:
>z
[,1] [,2] [,3]
[1,] 1 1 1
[2,] 2 1 0
[3,] 3 0 1
[4,] 4 0 0
> z[,2:3]
[,1] [,2]
[1,] 1 1
[2,] 1 0
[3,] 0 1
[4,] 0 0
Here, we requested the submatrix of z consisting of all elements with column numbers
2 and 3 and any row number. This extracts the second and third columns. Here’s an example
of extracting rows instead of columns:
> y
[,1] [,2]
[1,] 11 12
[2,] 21 22
[3,] 31 32
> y[2:3,]
[,1] [,2]
[1,] 21 22
[2,] 31 32
> y[2:3,2]
[1] 22 32
You can also assign values to submatrices:
>y
[,1] [,2]
[1,] 1 4
[2,] 2 5
[3,] 3 6
> y[c(1,3),] <- matrix(c(1,1,8,12),nrow=2)
>y
[,1] [,2]
[1,] 1 8
[2,] 2 5
[3,] 1 12
Here, we assigned new values to the first and third rows of y. And here’s another
example of assignment to submatrices:
> x <- matrix(nrow=3,ncol=3)
> y <- matrix(c(4,5,2,3),nrow=2)
>y
[,1] [,2]
[1,] 4 2
[2,] 5 3
> x[2:3,2:3] <- y
>x
[,1] [,2] [,3]
[1,] NA NA NA
[2,] NA 4 2
[3,] NA 5 3
Negative subscripts, used with vectors to exclude certain elements, work the same
way with matrices:
>y
[,1] [,2]
[1,] 1 4
[2,] 2 5
[3,] 3 6
> y[-2,]
[,1] [,2]
[1,] 1 4
[2,] 3 6
In the second command, we requested all rows of y except the second.
Filtering on Matrices:
Filtering can be done with matrices, just as with vectors. You must be careful with the
syntax, though. Let’s start with a simple example:
>x
x
[1,] 1 2
[2,] 2 3
[3,] 3 4
> x[x[,2] >= 3,]
x
[1,] 2 3
[2,] 3 4
> j <- x[,2] >= 3
>j
[1] FALSE TRUE TRUE
Here, we look at the vector x[,2], which is the second column of x, and determine
which of its elements are greater than or equal to 3. The result, assigned to j, is a Boolean
vector. Now, use j in x:
> x[j,]
x
[1,] 2 3
[2,] 3 4
• The object x[,2] is a vector.
• The operator >= compares two vectors.
• The number 3 was recycled to a vector of 3s.
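One reason to be careful with the syntax is that a filter matching a single row returns a plain vector rather than a matrix. The following sketch rebuilds a matrix with the same values as x above and shows the drop = FALSE argument, which keeps the matrix shape.

```r
x <- matrix(c(1, 2, 3, 2, 3, 4), nrow = 3)   # same values as the matrix x above

two_rows <- x[x[, 2] >= 3, ]                 # two rows match: still a matrix
one_row  <- x[x[, 2] >= 4, ]                 # one row matches: collapses to a vector
kept     <- x[x[, 2] >= 4, , drop = FALSE]   # drop = FALSE keeps the matrix shape

print(two_rows)
print(one_row)
print(kept)
```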
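Using the apply() Function:
The apply() function calls a given function on each row (second argument 1) or each
column (second argument 2) of a matrix. Here, f is applied to each row of z:
> z
[,1] [,2]
[1,] 1 4
[2,] 2 5
[3,] 3 6
> f <- function(x) x/c(2,8)
> y <- apply(z,1,f)
> y
[,1] [,2] [,3]
[1,] 0.5 1.000 1.50
[2,] 0.5 0.625 0.75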
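Note that the result has 2 rows and 3 columns, although f was applied to the three rows of z. When the applied function returns a vector, apply() stores each result as a column. A minimal sketch, assuming z holds the values 1 to 6 in three rows, shows how t() restores the row layout; y_rows is a name introduced here.

```r
z <- matrix(1:6, nrow = 3)          # rows are (1,4), (2,5), (3,6)
f <- function(x) x / c(2, 8)

y <- apply(z, 1, f)                 # one column of y per row of z
y_rows <- t(y)                      # transpose so rows line up with z again

print(dim(y))
print(y_rows)
```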
Adding and Deleting Matrix Rows and Columns
Technically, matrices are of fixed length and dimensions, so we cannot add or delete
rows or columns. However, matrices can be reassigned, and thus we can achieve the same
effect as if we had directly done additions or deletions.
Changing the Size of a Matrix:
Analogous operations can be used to change the size of a matrix. For instance, the
rbind() (row bind) and cbind() (column bind) functions let you add rows or columns to a
matrix.
Add Rows and Columns:
Use the cbind() function to add additional columns in a Matrix:
Example:
thismatrix <-
matrix(c("apple", "banana", "cherry", "orange","grape", "pineapple", "pear", "melon", "fig"),
nrow = 3, ncol = 3)
newmatrix <- cbind(thismatrix, c("strawberry", "blueberry", "raspberry"))
# Print the new matrix
> newmatrix
Note: The new column must contain as many items as the matrix has rows.
Use the rbind() function to add additional rows in a Matrix:
Example:
thismatrix <- matrix(c("apple", "banana", "cherry", "orange","grape", "pineapple", "pear",
"melon", "fig"), nrow = 3, ncol = 3)
> newmatrix <- rbind(thismatrix, c("strawberry", "blueberry", "raspberry"))
# Print the new matrix
> newmatrix
Note: The new row must contain as many items as the matrix has columns.
Remove Rows and Columns:
Use negative indices, here written with the c() function, to remove rows and columns from a matrix:
Example:
thismatrix <- matrix(c("apple", "banana", "cherry", "orange", "mango", "pineapple"), nrow
= 3, ncol =2)
#Remove the first row and the first column
thismatrix <- thismatrix[-c(1), -c(1)]
> thismatrix
Check if an Item Exists:
To find out if a specified item is present in a matrix, use the %in% operator:
Example:
Check if "apple" is present in the matrix:
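thismatrix <- matrix(c("apple", "banana", "cherry", "orange"), nrow = 2, ncol = 2)
> "apple" %in% thismatrix
Number of Rows and Columns:
Use the dim() function to find the number of rows and columns in a matrix:
Example:
thismatrix <- matrix(c("apple", "banana", "cherry", "orange"), nrow = 2, ncol = 2)
> dim(thismatrix)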
Matrix Length:
Use the length() function to find the total number of cells in a matrix:
Example:
thismatrix <- matrix(c("apple", "banana", "cherry", "orange"), nrow = 2, ncol = 2)
> length(thismatrix)
The total number of cells in the matrix is the number of rows multiplied by the number of columns.
In the example above: length = 2 * 2 = 4.
Combine two Matrices:
Again, you can use the rbind() or cbind() function to combine two or more matrices
together:
Example:
# Combine matrices
Matrix1 <- matrix(c("apple", "banana", "cherry", "grape"), nrow = 2, ncol = 2)
Matrix2 <- matrix(c("orange", "mango", "pineapple", "watermelon"), nrow = 2, ncol = 2)
# Adding it as a rows
Matrix_Combined <- rbind(Matrix1, Matrix2)
Matrix_Combined
# Adding it as a columns
Matrix_Combined <- cbind(Matrix1, Matrix2)
Matrix_Combined
ARRAYS
Arrays:
Compared to matrices, arrays can have more than two dimensions.
We can use the array() function to create an array, and the dim parameter to specify
the dimensions:
Example:
# An array with one dimension with values ranging from 1 to 24
thisarray <- c(1:24)
> thisarray
# An array with more than one dimension
multiarray <- array(thisarray, dim = c(4, 3, 2))
> multiarray
Example Explained:
In the example above we create an array with the values 1 to 24.
How does dim=c(4,3,2) work?
The first and second numbers in the parentheses specify the number of rows and columns.
The last number specifies how many such matrices (layers) the array contains.
Note: Arrays can only have one data type.
Access Array Items:
You can access the array elements by referring to the index position. You can use
the [] brackets to access the desired elements from an array:
Example:
thisarray <- c(1:24)
multiarray <- array(thisarray, dim = c(4, 3, 2))
> multiarray[2, 3, 2]
The syntax is as follows: array[row position, column position, matrix level]
You can also access the whole row or column from a matrix in an array, by using
the c() function:
Example:
thisarray <- c(1:24)
# Access all the items from the first row from matrix one
multiarray <- array(thisarray, dim = c(4, 3, 2))
> multiarray[c(1),,1]
# Access all the items from the first column from matrix one
multiarray <- array(thisarray, dim = c(4, 3, 2))
> multiarray[,c(1),1]
A comma (,) before c() means that we want to access the column.
A comma (,) after c() means that we want to access the row.
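The examples above print no results, so here is a short sketch of what they return. Values are stored column by column within each matrix, and matrix one comes before matrix two, so element [2, 3, 2] sits at position 2 + (3 - 1) * 4 + (2 - 1) * 12 = 22. The names one_item, first_row and first_col are introduced here.

```r
multiarray <- array(1:24, dim = c(4, 3, 2))

one_item  <- multiarray[2, 3, 2]   # row 2, column 3, second matrix
first_row <- multiarray[1, , 1]    # whole first row of matrix one
first_col <- multiarray[, 1, 1]    # whole first column of matrix one

print(one_item)
print(first_row)
print(first_col)
```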
Check if an Item Exists:
To find out if a specified item is present in an array, use the %in% operator:
Example: Check if the value "2" is present in the array:
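thisarray <- c(1:24)
multiarray <- array(thisarray, dim = c(4, 3, 2))
> 2 %in% multiarray
Number of Rows and Columns:
Use the dim() function to find the number of rows, columns and matrix levels in an array:
Example:
thisarray <- c(1:24)
multiarray <- array(thisarray, dim = c(4, 3, 2))
> dim(multiarray)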
Array Length:
Use the length() function to find the total number of elements in an array:
Example:
thisarray <- c(1:24)
multiarray <- array(thisarray, dim = c(4, 3, 2))
> length(multiarray)
LISTS
Lists:
A list in R can contain many different data types inside it. A list is a collection of data
which is ordered and changeable.
Creating Lists:
To create a list, use the list() function:
Example:
# List of strings
thislist <- list("apple", "banana", "cherry")
# Print the list
> thislist
Example:
thislist <- list("apple", "banana", "cherry", "mango", "orange")
> thislist[1]
> thislist[-3]
Range of Indexes:
You can specify a range of indexes by specifying where to start and where to end the
range, by using the : operator:
Example:
Return the second, third, fourth and fifth item:
thislist <- list("apple", "banana", "cherry", "orange", "kiwi", "melon", "mango")
> thislist[2:5]
Note: The search will start at index 2 (included) and end at index 5 (included).
Remember that the first item has index 1.
Example:
thislist <- list("apple", "banana", "cherry", "mango", "orange")
> thislist[1:3]
> thislist[-3:-1]
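The difference between single and double brackets is easy to miss in the examples above. A minimal sketch, with the names sub_list, item and rest introduced here for illustration:

```r
thislist <- list("apple", "banana", "cherry", "mango", "orange")

sub_list <- thislist[1]      # single brackets: a list of length 1
item     <- thislist[[1]]    # double brackets: the element itself
rest     <- thislist[-3:-1]  # drops items 1 to 3, leaving "mango" and "orange"

print(class(sub_list))
print(class(item))
print(rest)
```

Single brackets always return a list, so use double brackets when the value itself is needed, for instance inside a computation.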
Change Item Value:
To change the value of a specific item, refer to the index number:
Example:
thislist <- list("apple", "banana", "cherry")
thislist[1] <- "mango"
for (x in thislist) {
print(x)
}
Join Two Lists:
There are several ways to join, or concatenate, two or more lists in R. The most
common way is to use the c() function, which combines two elements together:
Example:
list1 <- list("a", "b", "c")
list2 <- list(1,2,3)
list3 <- c(list1,list2)
# print output
> list3
Accessing List Components and Values
If the components in a list do have tags, as is the case with name, salary, and union
for j, you can obtain them via names():
> names(j)
[1] "name" "salary" "union"
To obtain the values, use unlist():
> v <- unlist(j)
>v
name salary union
"Joe" "55000" "TRUE"
> class(v)
[1] "character"
The return value of unlist() is a vector—in this case, a vector of character strings. Note
that the element names in this vector come from the components in the original list.
On the other hand, if we were to start with numbers, we would get numbers.
> z <- list(a=5,b=12,c=13)
> y <- unlist(z)
> class(y)
[1] "numeric"
> y
a b c
5 12 13
So the output of unlist() in this case was a numeric vector. What about a mixed case?
> w <- list(a=5,b="xyz")
> wu <- unlist(w)
> class(wu)
[1] "character"
> wu
a b
"5" "xyz"
Here, R chose the least common denominator: character strings. This sounds like
some kind of precedence structure, and it is. As R’s help for unlist() states:
Where possible the list components are coerced to a common mode during the
unlisting, and so the result often ends up as a character vector. Vectors will be coerced to
the highest type of the components in the hierarchy NULL < raw < logical < integer < real <
complex < character < list < expression: pairlists are treated as lists.
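The hierarchy quoted above can be checked on small lists. This is only a sketch; the variable names are introduced here for illustration.

```r
lgl_int  <- unlist(list(TRUE, 2L))      # logical and integer
int_real <- unlist(list(2L, 2.5))       # integer and real
real_chr <- unlist(list(2.5, "xyz"))    # real and character

print(class(lgl_int))
print(class(int_real))
print(class(real_chr))
```

In each case the result takes the type that sits higher in the hierarchy.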
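Using the lapply() and sapply() Functions:
The lapply() function (for list apply) calls the specified function on each component of a
list and returns another list. Here’s an example:
> lapply(list(1:3,25:29),median)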
[[1]]
[1] 2
[[2]]
[1] 27
R applied median() to 1:3 and to 25:29, returning a list consisting of 2 and 27. In
some cases, such as the example here, the list returned by lapply() could be simplified to a
vector or matrix. This is exactly what sapply() (for simplified [l]apply) does.
> sapply(list(1:3,25:29),median)
[1] 2 27
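The text mentions that the result may be simplified to a matrix as well as a vector. A small sketch of that case follows, together with vapply(), a stricter relative of sapply() that also checks the type and length of each result. The names parts, r and m are introduced here.

```r
parts <- list(1:3, 25:29)   # parts is an illustrative name

r <- sapply(parts, range)              # each call returns 2 values, so r is a 2 x 2 matrix
m <- vapply(parts, mean, numeric(1))   # each result must be a single number

print(r)
print(m)
```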
UNIT-V
DATA FRAMES
Data Frames:
A data frame stores data in the form of a table.
A data frame can hold different types of data. While the first column can
be character, the second and third can be numeric or logical. However, all values within
a single column must have the same type.
Creating Data Frames:
Use the data.frame() function to create a data frame:
Example:
# Create a data frame
Data_Frame <- data.frame (
Training = c("Strength", "Stamina", "Other"),
Pulse = c(100, 150, 120),
Duration = c(60, 30, 45)
)
Names ages
Raja 12
Bhavana 10
The first two arguments in the call to data.frame() are clear: We wish to produce a
data frame from our two vectors: Names and ages. However, that third argument,
stringsAsFactors=FALSE, requires more comment. In versions of R before 4.0.0, if the named
argument stringsAsFactors was not specified, it defaulted to TRUE. This meant that a data
frame created from a character vector (in this case, Names) had that vector converted to a
factor. Since R 4.0.0 the default is FALSE, so character vectors stay character unless you
pass stringsAsFactors=TRUE.
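The effect of the stringsAsFactors argument can be seen directly. The sketch below passes the argument explicitly both ways, so it does not depend on the default of the installed R version; the data frame names d_chr and d_fac are hypothetical.

```r
Names <- c("Raja", "Bhavana")
ages  <- c(12, 10)

d_chr <- data.frame(Names, ages, stringsAsFactors = FALSE)  # Names stays character
d_fac <- data.frame(Names, ages, stringsAsFactors = TRUE)   # Names becomes a factor

print(class(d_chr$Names))
print(class(d_fac$Names))
print(levels(d_fac$Names))
```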
Summarize the Data:
Use the summary() function to summarize the data from a Data Frame:
Example:
Data_Frame <- data.frame (
Training = c("Strength", "Stamina", "Other"),
Pulse = c(100, 150, 120),
Duration = c(60, 30, 45) )
# Print the data frame and summary
> Data_Frame
> summary(Data_Frame)
Accessing Data Frames:
We can use single brackets [ ], double brackets [[ ]] or $ to access columns from a
data frame:
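Example:
Data_Frame <- data.frame (
Training = c("Strength", "Stamina", "Other"),
Pulse = c(100, 150, 120),
Duration = c(60, 30, 45) )
> Data_Frame[1]
> Data_Frame[["Training"]]
> Data_Frame$Training
FACTORS AND TABLES
Factors and Levels:
To see the levels of a factor, use the levels() function:
Example:
music_genre <- factor(c("Jazz", "Rock", "Classic", "Classic", "Pop", "Jazz", "Rock", "Jazz"))
> levels(music_genre)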
Result:
[1] "Classic" "Jazz" "Pop" "Rock"
You can also set the levels, by adding the levels argument inside the factor() function:
Example:
music_genre <- factor(c("Jazz", "Rock", "Classic", "Classic", "Pop", "Jazz", "Rock", "Jazz"),
levels = c("Classic", "Jazz", "Pop", "Rock", "Other"))
> levels(music_genre)
Result:
[1] "Classic" "Jazz" "Pop" "Rock" "Other"
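A level that is declared but never used still belongs to the factor. As a brief sketch, table() reports it with a count of zero and droplevels() removes it; the names counts and trimmed are introduced here.

```r
music_genre <- factor(c("Jazz", "Rock", "Classic", "Classic", "Pop", "Jazz", "Rock", "Jazz"),
                      levels = c("Classic", "Jazz", "Pop", "Rock", "Other"))

counts  <- table(music_genre)        # "Other" appears with a count of 0
trimmed <- droplevels(music_genre)   # removes levels that never occur

print(counts)
print(levels(trimmed))
```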
Factor Length:
Use the length() function to find out how many items there are in the factor:
Example:
music_genre <- factor(c("Jazz", "Rock", "Classic", "Classic", "Pop", "Jazz", "Rock", "Jazz"))
> length(music_genre)
Result:
[1] 8
Access Factors:
To access the items in a factor, refer to the index number, using [] brackets:
Example:
Access the third item:
music_genre <- factor(c("Jazz", "Rock", "Classic", "Classic", "Pop", "Jazz", "Rock", "Jazz"))
> music_genre[3]
Result:
[1] Classic
Levels: Classic Jazz Pop Rock
F 33 123000 1
F 24 45650 0
> split(d$income,list(d$gender,d$over25))
$F.0
[1] 32450 45650
$M.0
numeric(0)
$F.1
[1] 123000
$M.1
[1] 55000 88000 76500
The output of split() is a list, and recall that list components are denoted by dollar
signs. So the last vector, for example, was named "M.1" to indicate that it was the result of
combining "M" in the first factor and 1 in the second.
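For readers who want to try this, the following is a self-contained sketch. The data frame d2 is hypothetical: it holds only the three columns that split() needs, with values chosen to agree with the output above. The drop = TRUE argument is shown as one way to leave out empty groups such as M.0.

```r
# d2 is an illustrative data frame, not the full data set from the text
d2 <- data.frame(
  gender = c("M", "M", "F", "M", "F", "F"),
  income = c(55000, 88000, 32450, 76500, 123000, 45650),
  over25 = c(1, 1, 0, 1, 1, 0)
)

groups   <- split(d2$income, list(d2$gender, d2$over25))
nonempty <- split(d2$income, list(d2$gender, d2$over25), drop = TRUE)

print(names(groups))
print(groups$M.1)
print(names(nonempty))
```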
The by() Function:
With tapply(), the function to be applied can be multivariate (for example, range()), but the
input must be a vector. Yet the input for regression is a matrix (or data frame) with at least two
columns: one for the predicted variable and one or more for predictor variables. In our
abalone data application, the matrix would consist of a column for the diameter data and a
column for length. The by() function can be used here. It works like tapply() (which it calls
internally, in fact), but it is applied to objects rather than vectors. Here’s how to use it for
the desired regression analyses:
> aba <- read.csv("abalone.data",header=TRUE)
> by(aba,aba$Gender,function(m) lm(m[,2]~m[,3]))
aba$Gender: F
Call:
lm(formula = m[, 2] ~ m[, 3])
Coefficients:
(Intercept) m[, 3]
0.04288 1.17918
Calls to by() look very similar to calls to tapply(), with the first argument specifying
our data, the second the grouping factor, and the third the function to be applied to each
group.
Just as tapply() forms groups of indices of a vector according to levels of a factor, this
by() call finds groups of row numbers of the data frame aba. That creates three sub data
frames: one for each gender level of M, F, and I. The anonymous function we defined
regresses the second column of its matrix argument m against the third column.
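Since the abalone file may not be at hand, the same pattern can be tried on iris, a data set that ships with R. This is a stand-in chosen for illustration, not the data used in the text: the grouping factor is Species, and the first column is regressed against the second.

```r
# iris stands in for the abalone data here; it is built into R
fits <- by(iris, iris$Species, function(m) lm(m[, 1] ~ m[, 2]))

n_groups   <- length(fits)       # one fitted model per level of Species
first_coef <- coef(fits[[1]])    # intercept and slope for the first group

print(n_groups)
print(first_coef)
```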
Tables
To begin exploring R tables, consider this example:
> u <- c(22,8,33,6,8,29,-2)
> fl <- list(c(5,12,13,12,13,5,13),c("a","bc","a","a","bc","a","a"))
> tapply(u,fl,length)
a bc
5 2 NA
12 1 1
13 2 1
Here, tapply() again temporarily breaks u into sub vectors, as you saw earlier, and
then applies the length() function to each sub vector. (Note that this is independent of
what’s in u. Our focus now is purely on the factors.) Those sub vector lengths are the counts
of the occurrences of each of the 3 × 2 = 6 combinations of the two factors. For instance, 5
occurred twice with "a" and not at all with "bc"; hence the entries 2 and NA in the first row of
the output. In statistics, this is called a contingency table. There is one problem in this
example: the NA value. It really should be 0, meaning that in no cases did the first factor
have level 5 and the second have level "bc". The table() function creates contingency tables
correctly.
> table(fl)
fl.2
fl.1 a bc
5 2 0
12 1 1
13 2 1
The first argument in a call to table() is either a factor or a list of factors. The two
factors here were (5,12,13,12,13,5,13) and ("a","bc","a","a","bc", "a","a"). In this case, an
object that is interpretable as a factor is counted as one.
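The difference between the two approaches can be checked cell by cell. A minimal sketch using the same u and fl as above; the names by_tapply and by_table are introduced here.

```r
u  <- c(22, 8, 33, 6, 8, 29, -2)
fl <- list(c(5, 12, 13, 12, 13, 5, 13), c("a", "bc", "a", "a", "bc", "a", "a"))

by_tapply <- tapply(u, fl, length)   # empty combination comes out as NA
by_table  <- table(fl)               # empty combination is counted as 0

print(by_tapply["5", "bc"])
print(by_table["5", "bc"])
print(sum(by_table))                 # total number of observations
```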
Typically a data frame serves as the table() data argument. Suppose for instance the
file ct.dat consists of election-polling data, in which candidate X is running for reelection. The
ct.dat file looks like this:
"Vote for X" "Voted For X Last Time"
"Yes" "Yes"
"Yes" "No"
"No" "No"
"Not Sure" "Yes"
"No" "No"
In the usual statistical fashion, each row in this file represents one subject under
study. In this case, we have asked five people the following two questions:
• Do you plan to vote for candidate X?
• Did you vote for X in the last election?
This gives us five rows in the data file.
Let’s read in the file:
> ct <- read.table("ct.dat",header=T)
> ct
Vote.for.X Voted.for.X.Last.Time
1 Yes Yes
2 Yes No
3 No No
4 Not Sure Yes
5 No No
We can use the table() function to compute the contingency table for this data:
> cttab <- table(ct)
> cttab
Voted.for.X.Last.Time
Vote.for.X No Yes
No 2 0
Not Sure 0 1
Yes 1 1
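Extracting a Subtable:
Suppose we wish to extract from cttab the subtable that leaves out the "Not Sure" level. A
function subtable(tbl,subnames) can do this, where tbl is the table of interest and subnames
is a list naming, for each dimension, the levels to keep. In this case, subnames is:
list(Vote.for.X=c("No","Yes"),Voted.for.X.Last.Time=c("No","Yes"))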
We can now call the function.
> subtable(cttab,list(Vote.for.X=c("No","Yes"),
+ Voted.for.X.Last.Time=c("No","Yes")))
Voted.for.X.Last.Time
Vote.for.X No Yes
No 2 0
Yes 1 1
Now that we have a feel for what the function does, let’s take a look at its innards.
1 subtable <- function(tbl,subnames) {
2 # get array of cell counts in tbl
3 tblarray <- unclass(tbl)
4 # we'll get the subarray of cell counts corresponding to subnames by
5 # calling do.call() on the "[" function; we need to build up a list
6 # of arguments first
7 dcargs <- list(tblarray)
8 ndims <- length(subnames) # number of dimensions
9 for (i in 1:ndims) {
10 dcargs[[i+1]] <- subnames[[i]]
11 }
12 subarray <- do.call("[",dcargs)
13 # now we'll build the new table, consisting of the subarray, the
14 # numbers of levels in each dimension, and the dimnames() value, plus
15 # the "table" class attribute
16 dims <- lapply(subnames,length)
17 subtbl <- array(subarray,dims,dimnames=subnames)
18 class(subtbl) <- "table"
19 return(subtbl)
20 }
The new table can be built with R’s array() function, which has the following
arguments:
• data: The data to be placed into the new array. In our case, this is subarray.
• dim: The dimension lengths (number of rows, number of columns, number of layers, and
so on). In our case, this is the value dims, computed in line 16.
• dimnames: The dimension names and the names of their levels, already given to us by the
user as the argument subnames.
This was a somewhat conceptually complex function to write, but it gets easier once you’ve
mastered the inner structures of the "table" class.
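The central step of the function, calling "[" through do.call(), can be tried on its own. The sketch below builds the polling table from an in-memory data frame instead of the ct.dat file, which is an assumption made so the code runs without any file. It uses unname() so that the index vectors are passed to "[" by position.

```r
# Illustrative stand-in for the polling data, so no file is needed
ct <- data.frame(
  Vote.for.X = c("Yes", "Yes", "No", "Not Sure", "No"),
  Voted.for.X.Last.Time = c("Yes", "No", "No", "Yes", "No")
)
cttab <- table(ct)
subnames <- list(Vote.for.X = c("No", "Yes"),
                 Voted.for.X.Last.Time = c("No", "Yes"))

# Argument list for "[": the bare array first, then one index vector per dimension
dcargs <- c(list(unclass(cttab)), unname(subnames))
subarray <- do.call("[", dcargs)   # same as unclass(cttab)[c("No","Yes"), c("No","Yes")]

print(subarray)
```

Building the argument list first is what lets the same code handle tables with any number of dimensions.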
Maths Functions
R includes a number of other functions that are handy for working with tables and
factors. We discuss two of them here: aggregate() and cut().
The aggregate() Function:
The aggregate() function calls tapply() once for each variable in a group. For example,
in the abalone data, we could find the median of each variable, broken down by gender, as
follows:
> aggregate(aba[,-1],list(aba$Gender),median)
Group.1 Length Diameter Height WholeWt ShuckedWt ViscWt ShellWt Rings
F 0.590 0.465 0.160 1.03850 0.44050 0.2240 0.295 10
I 0.435 0.335 0.110 0.38400 0.16975 0.0805 0.113 8
M 0.580 0.455 0.155 0.97575 0.42175 0.2100 0.276 10
The first argument, aba[,-1], is the entire data frame except for the first column,
which is Gender itself. The second argument, which must be a list, is our Gender factor as
before. Finally, the third argument tells R to compute the median on each column in each of
the data frames generated by the sub grouping corresponding to our factors. There are three
such subgroups in our example here and thus three rows in the output of aggregate().
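The same call pattern can be tried without the abalone file by using iris, a data set built into R. This is a stand-in for illustration only: the fifth column, Species, plays the role of Gender, and med is a name introduced here.

```r
# iris stands in for the abalone data; column 5 (Species) is the grouping factor
med <- aggregate(iris[, -5], list(iris$Species), median)

print(med)
print(dim(med))   # one row per species, one column per variable plus Group.1
```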
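The cut() Function:
A common way to generate factors, especially for tables, is the cut() function. You give it a
data vector x and a set of bins defined by a vector b, and it determines which bin each
element of x falls into. The form of the call used here is:
y <- cut(x,b,labels=FALSE)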
where the bins are defined to be the semi-open intervals (b[1],b[2]],
(b[2],b[3]],.... Here’s an example:
>z
[1] 0.88114802 0.28532689 0.58647376 0.42851862 0.46881514 0.24226859 0.05289197
[8] 0.88035617
> seq(from=0.0,to=1.0,by=0.1)
[1] 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
> binmarks <- seq(from=0.0,to=1.0,by=0.1)
> cut(z,binmarks,labels=F)
[1] 9 3 6 5 5 3 1 9
This says that z[1], 0.88114802, fell into bin 9, which was (0.8,0.9]; z[2],
0.28532689, fell into bin 3; and so on.
This returns a vector, as seen in the example’s result. But we can convert it into a
factor and possibly then use it to build a table. For instance, you can imagine using this to
write your own specialized histogram function. (The R function findInterval() would be useful
for this, too.)
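The conversion to a factor and then to a table can be sketched as follows, using the same z and binmarks as above. Declaring the levels 1 to 10 explicitly is a choice made here so that empty bins appear with a count of zero; the names bins and counts are introduced for illustration.

```r
z <- c(0.88114802, 0.28532689, 0.58647376, 0.42851862,
       0.46881514, 0.24226859, 0.05289197, 0.88035617)
binmarks <- seq(from = 0.0, to = 1.0, by = 0.1)

bins   <- cut(z, binmarks, labels = FALSE)    # bin number for each element of z
counts <- table(factor(bins, levels = 1:10))  # counts per bin, including empty bins

print(bins)
print(counts)
```

The counts vector is the raw material for a simple histogram of z.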