
02b Data Structures Datasets

This document provides tutorials and code examples on data structures in R. It begins by listing various tutorial links on data structures like vectors, matrices, and data frames. It then demonstrates how to create vectors using functions like c(), seq(), and rep(). Examples are given for numeric, character, date, and logical vectors. The document also shows how to reference, subset, filter, sort, and perform vectorized operations on data structures in R.

Al. I. Cuza University of Iași
Faculty of Economics and Business Administration
Department of Accounting, Information Systems and Statistics

Data Analysis & Data Science with R
Data structures in R. Built-in Datasets
By Marin Fotache

Data structures in R

Tutorials (and code) on Data Structures

Data structures (Advanced R by Hadley Wickham)

http://adv-r.had.co.nz/Data-structures.html

1.2 Variables (Variables and Data Structures)

https://www.youtube.com/watch?v=DG7YNf8kb3w

2 - Introduction to R : Atomic Classes

https://www.youtube.com/watch?v=271FKAYavYE
http://repidemiology.wordpress.com/introduction-to-r-code/

1.3 Vectors (Variables and Data Structures)

https://www.youtube.com/watch?v=QygSZw77Hs8

3- Introduction to R : Vectors

https://www.youtube.com/watch?v=MGphwmXCCgM#t=12
http://repidemiology.wordpress.com/introduction-to-r-code/

1.4 Matrices (Variables and Data Structures)

https://www.youtube.com/watch?v=UakyyZSyuZU

Tutorials on Data Structures (cont.)

1.5 Lists and Data Frames (Variables and Data Structures)
https://www.youtube.com/watch?v=U6vbR4el3kQ
1.6 Logical Vectors and Operators (Variables and Data Structures)
https://www.youtube.com/watch?v=GQb735O2qjc
4- Introduction to R : Matrix, List and Data Frame
https://www.youtube.com/watch?v=cEX4iXUPqoo
http://repidemiology.wordpress.com/introduction-to-r-code/
Common Data Structures in R
https://www.youtube.com/watch?v=q5YJUGTYUvI
Introduction to R Statistical Computing: Data Structures
https://www.youtube.com/watch?v=OZD4oLobjWM
Lecture 2b: Subsetting
https://www.youtube.com/watch?v=hWbgqzsQJF0&index=7&list=PLjTlxb-wKvXNSDfcKPFH2gzHGyjpeCZmJ

R script associated with this presentation
02b_data_structures__datasets.R

http://1drv.ms/1sYllLB

Vectors with c() function


Vectors are one-dimensional arrays that can hold numeric, character, logical, or date/time/timestamp data
The c() function is the most common way to create a vector
> x = c(1, 3, 5, 7, 25, -13, 47)
> x
[1]   1   3   5   7  25 -13  47
> y = c("one", "two", "three", "eight")
> y
[1] "one"   "two"   "three" "eight"
> z = c(TRUE, FALSE, TRUE, TRUE, FALSE, TRUE)
> z
[1]  TRUE FALSE  TRUE  TRUE FALSE  TRUE
All the elements of a vector must be of the same type (numeric, character, or logical)
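A short sketch of what the single-type rule means in practice: when values of different types are passed to c(), R does not raise an error but silently coerces them to the most general type. The object names mixed and num_log are chosen here only for illustration.

```r
# Mixing types in c() does not fail: R coerces to the most general type.
mixed <- c(1, "two", TRUE)
class(mixed)      # "character"
mixed             # "1" "two" "TRUE"

# Logical values mixed with numbers become 1 (TRUE) and 0 (FALSE).
num_log <- c(1, TRUE, FALSE)
class(num_log)    # "numeric"
num_log           # 1 1 0
```

Checking class() after building a vector is a cheap way to notice an unintended coercion.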

Vectors of numbers with sequences
Vectors can also be created with a sequence
> ten_integers.1 <- 5:14
> ten_integers.1
 [1]  5  6  7  8  9 10 11 12 13 14
or
> ten_integers.2 <- seq(from=5, to=14, by=1)
> ten_integers.2
 [1]  5  6  7  8  9 10 11 12 13 14
Declare a vector of descending numbers
> seq(from=5, to=-5, by=-1)
 [1]  5  4  3  2  1  0 -1 -2 -3 -4 -5
Combine sequences and the c function
> a_vector <- c(2:4, 8:14)
> a_vector
 [1]  2  3  4  8  9 10 11 12 13 14
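Besides from/to/by, seq() can also be told how many values to produce, and base R offers two helpers for generating index sequences. The sketch below uses illustrative object names (quarters, v) that are not part of the original script.

```r
# Ask for a given number of equally spaced values instead of a step.
quarters <- seq(0, 1, length.out = 5)
quarters            # 0.00 0.25 0.50 0.75 1.00

# seq_len(n) gives 1, 2, ..., n (and an empty vector when n is 0).
seq_len(4)          # 1 2 3 4

# seq_along(v) gives the valid indices of an existing vector.
v <- c(10, 20, 30)
seq_along(v)        # 1 2 3
```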

Vectors containing a range of dates
Generating a vector with the dates between September 29th and October 2nd, 2014, as "pure" dates
First solution:
> seq(as.Date("2014/09/29"), by = "day", length.out = 4)
Second solution:
> seq(as.Date("2014/09/29"), as.Date("2014/10/02"), "days")
In both cases the result is:
[1] "2014-09-29" "2014-09-30" "2014-10-01" "2014-10-02"

Vectors containing a range of timestamps
Generating a vector with the dates between September 29th and October 2nd, 2014, as timestamps
First solution
> seq(c(ISOdate(2014,9,29)), by = "DSTday", length.out = 4)
Second solution
> x <- as.POSIXct("2014-09-25 23:59:59", tz="Turkey")
> format(seq(x, by="day", length.out=8), "%Y-%m-%d %Z")
Third solution
> d1 <- ISOdate(year=2014, month=9, day=25, tz="GMT")
> seq(from=d1, by="day", length.out=8)

Vectors generated from the normal distribution
Vector object named x contains five random values drawn from the standard normal distribution; the values are not ordered
> x <- rnorm(5)
> x
[1] -0.2766566  0.7262000 -0.3409396 -0.5192846  0.5508588
The numbers are drawn randomly, so calling the same function again produces five different numbers:
> x <- rnorm(5)
> x
[1]  1.9030714 -1.7139177 -0.2287666  0.8369275  0.4203014
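Because each call to rnorm() draws new values, results cannot be reproduced unless the random number generator is seeded first. A minimal sketch, assuming an arbitrary seed value of 42 (any integer works):

```r
# Fixing the seed makes the "random" draws repeatable.
set.seed(42)
x1 <- rnorm(5)

set.seed(42)       # same seed again ...
x2 <- rnorm(5)     # ... same five values

identical(x1, x2)  # TRUE
```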

Vectors created with function rep (repeat)
Vector x.rep contains a sequence of numbers (5, 7, 11) repeated three times
> x.rep <- rep(c(5, 7, 11), 3)
> x.rep
[1]  5  7 11  5  7 11  5  7 11
Compare with the version that also uses the each argument:
> x.rep.2 <- rep(c(5, 7, 11), each=2, times=3)
> x.rep.2
 [1]  5  5  7  7 11 11  5  5  7  7 11 11  5  5  7  7 11 11
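To separate the effect of the individual arguments of rep(), the sketch below uses each alone, a vector of per-element counts in times, and length.out. The object names are illustrative only.

```r
# each alone: every element is repeated in place.
r_each <- rep(c(5, 7, 11), each = 2)
r_each     # 5 5 7 7 11 11

# times as a vector: one repetition count per element.
r_times <- rep(c("a", "b"), times = c(3, 1))
r_times    # "a" "a" "a" "b"

# length.out: recycle the pattern until the requested length is reached.
r_len <- rep(c(5, 7, 11), length.out = 5)
r_len      # 5 7 11 5 7
```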

Example of built-in (system defined) vectors
> letters
 [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n"
[15] "o" "p" "q" "r" "s" "t" "u" "v" "w" "x" "y" "z"
> LETTERS
 [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N"
[15] "O" "P" "Q" "R" "S" "T" "U" "V" "W" "X" "Y" "Z"
> month.name
 [1] "January"   "February"  "March"     "April"     "May"       "June"
 [7] "July"      "August"    "September" "October"   "November"  "December"
> state.name
[1] "Alabama"  "Alaska"   "Arizona"  "Arkansas" ...
> state.area
[1]  51609 589757 113909  53104 ...

Vectors of factors
Factors represent categorical variables whose values are drawn from a fixed set of levels
Very important in data analysis and visualization
Ex: two vectors:
student names (vector names)
student genders (vector genre)
Both vectors initially contain characters
> names <- c("Popescu I. Valeria", "Ionescu V. Viorel",
+     "Genete I. Aurelia", "Lazar T. Ionut",
+     "Sadovschi V. Iuliana", "Dominte I. Nicoleta")
> genre <- c("Female", "Male", "Female", "Male",
+     "Female", "Female")
> class(names)
[1] "character"
> class(genre)
[1] "character"

Vectors of factors (cont.)
> unclass(genre)
[1] "Female" "Male"   "Female" "Male"   "Female" "Female"
genre can take only two values, so it is converted into a factor
> genre <- as.factor(genre)
> class(genre)
[1] "factor"
> unclass(genre)
[1] 1 2 1 2 1 1
attr(,"levels")
[1] "Female" "Male"
If a value that is not among the levels is appended to genre with c(), the result is no longer a factor but a character vector
> genre <- c(genre, "Boy")
> class(genre)
[1] "character"
> unclass(genre)
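Appending a string to a factor with c() is fragile: in the R versions I am aware of, the character result holds the factor's integer codes ("1", "2", ...) rather than its labels. A safer sketch, using illustrative names genre_chr and genre_new, converts the factor to its labels first and then rebuilds the factor so the new value becomes a proper level.

```r
genre <- factor(c("Female", "Male", "Female"))

# Work on the labels, not on the internal integer codes.
genre_chr <- c(as.character(genre), "Boy")
genre_chr            # "Female" "Male" "Female" "Boy"

# Rebuild the factor; "Boy" is now one of its levels.
genre_new <- factor(genre_chr)
levels(genre_new)
class(genre_new)     # "factor"
```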

Functions for getting vector type and length
class() returns the class of an object; unclass() returns the underlying values with the class attribute removed
> class(ten_integers.1)
[1] "integer"
> unclass(ten_integers.1)
 [1]  5  6  7  8  9 10 11 12 13 14
Internally, factor values are stored as integer codes that index the levels
> class(genre)
[1] "factor"
> unclass(genre)
[1] 1 2 1 2 1 1
attr(,"levels")
[1] "Female" "Male"
> typeof(genre)
[1] "integer"
Function length() returns the number of elements in a vector
> length(ten_integers.1)
[1] 10

Referencing vector elements
First element in vector ten_integers.1
> ten_integers.1[1]
[1] 5
Last element in vector ten_integers.1
> ten_integers.1[length(ten_integers.1)]
[1] 14
First three elements in vector ten_integers.1
> ten_integers.1[1:3]
[1] 5 6 7
Last three elements in vector
> ten_integers.1[(length(ten_integers.1)-2) : length(ten_integers.1)]
[1] 12 13 14
First, third, fifth and sixth elements
> ten_integers.1[c(1, 3, 5, 6)]
[1]  5  7  9 10

Referencing vector elements (cont.)
Indices can also be supplied through another vector
Display the first, third, fifth and sixth elements of vector ten_integers.1
Vector ind contains the indices of the elements of interest from vector ten_integers.1
> ind <- c(1, 3, 5, 6)
> ind
[1] 1 3 5 6
> ten_integers.1
 [1]  5  6  7  8  9 10 11 12 13 14
Now the result:
> ten_integers.1[ind]
[1]  5  7  9 10

Excluding elements from a vector
Basic idea: R will exclude from a vector the elements whose indices are negative (prefixed by minus)
Excluding first element:
> ten_integers.1[-1]
[1]  6  7  8  9 10 11 12 13 14
Excluding first three elements:
> ten_integers.1[-(1:3)]
[1]  8  9 10 11 12 13 14
Excluding first, third, and fourth elements:
> ten_integers.1[-(c(1,3,4))]
[1]  6  9 10 11 12 13 14

Excluding elements from a vector (cont.)
Excluding the first three elements, the 6th element and the 8th element
> ten_integers.1[-(c(1:3,6,8))]
[1]  8  9 11 13 14
Excluding the first two elements and the last two elements of the vector:
> ten_integers.1[-c((1:2),
+     (length(ten_integers.1)-1) : length(ten_integers.1))]
[1]  7  8  9 10 11 12
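For the common case of dropping elements from either end, base R's head() and tail() accept a negative n, which avoids the index arithmetic with length(). A minimal sketch on the same values (the names x, no_last_two, no_first_two and trimmed are illustrative):

```r
x <- 5:14

# Negative n in head(): everything except the last n elements.
no_last_two <- head(x, -2)
no_last_two     # 5 6 7 8 9 10 11 12

# Negative n in tail(): everything except the first n elements.
no_first_two <- tail(x, -2)
no_first_two    # 7 8 9 10 11 12 13 14

# Combined: drop the first two and the last two elements.
trimmed <- tail(head(x, -2), -2)
trimmed         # 7 8 9 10 11 12
```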

Vector filtering
Filter vector elements - select only elements greater than 10
> ten_integers.1[ten_integers.1 > 10]
[1] 11 12 13 14
How many elements are greater than 10?
> length(ten_integers.1[ten_integers.1 > 10])
[1] 4
Display INDICES of elements greater than 10
> which(ten_integers.1 > 10)
[1]  7  8  9 10
Filter vector elements - select only elements greater than 10, ver. 2
> ind <- which(ten_integers.1 > 10)
> ten_integers.1[ind]
[1] 11 12 13 14
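Two small variations on the filtering idea, sketched with illustrative names: since TRUE counts as 1, sum() on a logical vector counts the matches directly, and several conditions can be combined element by element with & (and) or | (or).

```r
x <- 5:14

# Count matches without building the filtered vector first.
n_big <- sum(x > 10)
n_big            # 4

# Combine conditions: greater than 6 AND even.
big_even <- x[x > 6 & x %% 2 == 0]
big_even         # 8 10 12 14

# Either condition: smaller than 6 OR greater than 13.
extremes <- x[x < 6 | x > 13]
extremes         # 5 14
```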

Sorting/ordering a vector
Initial vector
> names <- c("Popescu I. Valeria", "Ionescu V. Viorel",
+     "Genete I. Aurelia", "Lazar T. Ionut",
+     "Sadovschi V. Iuliana", "Dominte I. Nicoleta")
Sort the vector elements in ascending (default) order
> names <- sort(names)
> names
[1] "Dominte I. Nicoleta"  "Genete I. Aurelia"    "Ionescu V. Viorel"    "Lazar T. Ionut"
[5] "Popescu I. Valeria"   "Sadovschi V. Iuliana"
Sorting the vector in descending order
> names.desc <- rev(sort(names))
> names.desc
[1] "Sadovschi V. Iuliana" "Popescu I. Valeria"   "Lazar T. Ionut"       "Ionescu V. Viorel"
[5] "Genete I. Aurelia"    "Dominte I. Nicoleta"

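sort() returns the sorted values, while order() returns the positions that would sort the vector, which is what is needed to reorder a second, parallel vector in the same way. The sketch below reuses the names and genders from the earlier slides under the illustrative names student_names and student_genre (to avoid masking the names() function).

```r
student_names <- c("Popescu I. Valeria", "Ionescu V. Viorel",
                   "Genete I. Aurelia", "Lazar T. Ionut",
                   "Sadovschi V. Iuliana", "Dominte I. Nicoleta")
student_genre <- c("Female", "Male", "Female", "Male", "Female", "Female")

# Positions of the elements in ascending alphabetical order.
ord <- order(student_names)
ord                          # 6 3 2 4 1 5

# Reorder both vectors consistently.
sorted_names <- student_names[ord]
sorted_genre <- student_genre[ord]

# Descending order in one call, instead of rev(sort(...)).
desc_names <- sort(student_names, decreasing = TRUE)
```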
R as a vectorized language
Lecture 2c: Vectorized Operations
https://www.youtube.com/watch?v=Fm8SORJQjPY&list=PLjTlxb-wKvXNSDfcKPFH2gzHGyjpeCZmJ&index=8
Operations are automatically applied to each element of the vector, without an explicit loop over the elements
> num.vec.1 <- c(1, 3, 5, 7, 25, -13, 47)
> num.vec.2 <- num.vec.1 + 100
> num.vec.2
[1] 101 103 105 107 125  87 147
> date.vec.1 <- c("2013-10-01", "2013-10-03", "2013-10-10")
For the moment, the elements are strings
> class(date.vec.1)
[1] "character"
as.Date() converts all of the vector elements into dates
> date.vec.1 <- as.Date(date.vec.1)
> class(date.vec.1)
[1] "Date"

R as a vectorized language (cont.)
Operations can be applied on two or more vectors
> num.vec.3 <- num.vec.1 + num.vec.2
> num.vec.3
[1] 102 106 110 114 150  74 194
Compare a vector with a value
> x
[1] -0.56757455 -0.90079348  0.24397156 -0.51325283  0.03209287
> x >= 0
[1] FALSE FALSE  TRUE FALSE  TRUE
> x.1 <- x >= 0
> x.1
[1] FALSE FALSE  TRUE FALSE  TRUE
Testing if at least one of the vector elements fulfils the predicate (function any)
> x
[1] -0.56757455 -0.90079348  0.24397156 -0.51325283  0.03209287
> any(x > 0)
[1] TRUE
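When two vectors of different lengths are combined, R recycles the shorter one; this is also what happens when a vector is compared with a single value. A minimal sketch with illustrative names, including the vectorized ifelse() for turning a logical test into labels:

```r
# The shorter vector (10, 20) is reused: 1+10, 2+20, 3+10, 4+20.
recycled <- c(1, 2, 3, 4) + c(10, 20)
recycled         # 11 22 13 24

# ifelse() applies the test to each element and picks a value per element.
v <- c(-0.5, 0.2, 0)
sign_label <- ifelse(v >= 0, "non-negative", "negative")
sign_label       # "negative" "non-negative" "non-negative"
```

If the longer length is not a multiple of the shorter one, R still recycles but issues a warning, which is usually a sign of a mistake.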

R as a vectorized language (cont.)
Testing if all the vector elements fulfil the predicate (function all)
> all(x > 0)
[1] FALSE
> all(x > -25)
[1] TRUE
For a character vector, display the number of characters for each element
> y
[1] "one"   "two"   "three" "eight"
> nchar(y)
[1] 3 3 5 5

Naming vector elements
Provide a name for each vector element
> num_ro = c(one = "unu", two="doi", three="trei", four="patru")
> num_ro
    one     two   three    four
  "unu"   "doi"  "trei" "patru"
The same result can be accomplished with:
> num_ro = c("unu", "doi", "trei", "patru")
> num_ro
[1] "unu"   "doi"   "trei"  "patru"
> names(num_ro) = c("one", "two", "three", "four")
> num_ro
    one     two   three    four
  "unu"   "doi"  "trei" "patru"

Descriptive statistics on vectors
A vector (age) containing the ages of 10 persons (Kabacoff, 2011)
> age = c(1,3,5,2,11,9,3,9,12,3)
Another vector containing the weights of the same persons
> weight = c(4.4,5.3,7.2,5.2,8.5,7.3,6.0,10.4,10.2,6.1)
Suppose the weights above are expressed in pounds (lb); convert them into kilograms (kg)
> weight.kg <- weight * 0.454
Compute the mean of the weights
> mean(weight)
[1] 7.06
Compute the standard deviation of the weights
> sd(weight)
[1] 2.077498
Compute the correlation between age and weight
> cor(age,weight)
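A few more base R summaries on the same two vectors, as a sketch rather than part of the original script: median() and range() for a single vector, summary() for the five-number summary plus the mean, and cor() stored in an illustrative object r.

```r
age    <- c(1, 3, 5, 2, 11, 9, 3, 9, 12, 3)
weight <- c(4.4, 5.3, 7.2, 5.2, 8.5, 7.3, 6.0, 10.4, 10.2, 6.1)

med_w <- median(weight)   # middle value: 6.65
rng_w <- range(weight)    # smallest and largest: 4.4 10.4
summary(weight)           # min, quartiles, median, mean, max

# Pearson correlation; always between -1 and 1, and symmetric in its arguments.
r <- cor(age, weight)
```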

Matrices
Two-dimensional arrays where each element has the same type (numeric, character, or logical)
Created with the matrix function. Format:
> mymatrix <- matrix(vector,
+     nrow=number_of_rows,
+     ncol=number_of_columns, byrow=logical_value,
+     dimnames=list(char_vector_rownames,
+     char_vector_colnames))
vector contains the elements for the matrix
nrow and ncol specify the row and column dimensions
dimnames contains optional row and column labels stored in character vectors.
byrow indicates whether the matrix should be filled in by row (byrow=TRUE) or by column (byrow=FALSE); the default is by column.

Matrices (cont.)
m.1 is a 5 x 4 matrix
> m.1 <- matrix(1:20, nrow=5, ncol=4)
> m.1
     [,1] [,2] [,3] [,4]
[1,]    1    6   11   16
[2,]    2    7   12   17
[3,]    3    8   13   18
[4,]    4    9   14   19
[5,]    5   10   15   20
m.2 is a 2 x 2 matrix, filled by rows
> cells <- c(1,26,24,68)
> rownames <- c("Row1", "Row2")
> colnames <- c("Col1", "Col2")
> m.2 <- matrix(cells, nrow=2, ncol=2, byrow=TRUE,
+     dimnames=list(rownames, colnames))

Matrices (cont.)
Display m.2
> m.2
     Col1 Col2
Row1    1   26
Row2   24   68
m.3 is a 2 x 2 matrix, filled by columns
list() creates a list, a data structure presented after data frames
> m.3 <- matrix(cells, nrow=2, ncol=2,
+     byrow=FALSE,
+     dimnames=list(rownames, colnames))
> m.3
     Col1 Col2
Row1    1   24
Row2   26   68

Matrices (cont.)
m.4 is a 4 x 3 matrix, filled by rows
> m.4 <- matrix(1:12, nrow=4, ncol=3, byrow=TRUE)
> m.4
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9
[4,]   10   11   12
Naming rows: row.1, row.2, ... and columns: col.1, col.2, ...
> dimnames(m.4) = list(paste("row.", 1:nrow(m.4), sep=""),
+     paste("col.", 1:ncol(m.4), sep=""))
> m.4
      col.1 col.2 col.3
row.1     1     2     3
row.2     4     5     6
row.3     7     8     9
row.4    10    11    12

Accessing matrix elements
> m.1
     [,1] [,2] [,3] [,4]
[1,]    1    6   11   16
[2,]    2    7   12   17
[3,]    3    8   13   18
[4,]    4    9   14   19
[5,]    5   10   15   20
Display the 3rd row
> m.1[3,]
[1]  3  8 13 18
Display the 3rd column
> m.1[,3]
[1] 11 12 13 14 15
Display the element at the intersection of the 2nd row and the 3rd column
> m.1[2,3]
[1] 12

Accessing matrix elements (cont.)
Display two elements from the same row: m.1[2,3] and m.1[2,4]
> m.1[2, c(3,4)]
[1] 12 17
Display three elements from the same column: m.1[1,2], m.1[2,2] and m.1[3,2]
> m.1[c(1,2,3), 2]
[1] 6 7 8
Display a "submatrix", from m.1[2,2] to m.1[4,4]
> m.1[c(2,3,4), c(2,3,4)]
     [,1] [,2] [,3]
[1,]    7   12   17
[2,]    8   13   18
[3,]    9   14   19
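Selecting a single row or column returns a plain vector, as the outputs above show. When the result must stay a matrix (for example, to feed it to code that expects two dimensions), the drop=FALSE argument of the subsetting operator can be used. A minimal sketch with illustrative names:

```r
m.1 <- matrix(1:20, nrow = 5, ncol = 4)

# Default: a single row collapses to a vector (no dim attribute).
row3_vec <- m.1[3, ]
is.null(dim(row3_vec))     # TRUE

# drop = FALSE keeps the result as a 1 x 4 matrix.
row3_mat <- m.1[3, , drop = FALSE]
dim(row3_mat)              # 1 4
```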

Basic statistics on matrix
> m.4
      col.1 col.2 col.3
row.1     1     2     3
row.2     4     5     6
row.3     7     8     9
row.4    10    11    12
Compute mean of all the cells in matrix m.4
> mean(m.4)
[1] 6.5
Compute mean of all the cells on the third column
> mean(m.4[,3])
[1] 7.5
Compute mean of all the cells on the third row
> mean(m.4[3,])
[1] 8

Basic statistics on matrix (cont.)
Compute sum of all the cells in matrix m.4
> sum(m.4)
[1] 78
Compute sum of all the cells on the third column
> sum(m.4[,3])
[1] 30
Compute sum of all the cells on the third row
> sum(m.4[3,])
[1] 24

rowSums/colSums
rowSums calculates the sum of the cells for each row of a matrix
> rowSums(m.4)
row.1 row.2 row.3 row.4
    6    15    24    33
colSums calculates the sum of the cells for each column of a matrix
> colSums(m.4)
col.1 col.2 col.3
   22    26    30
rowMeans/colMeans calculate the mean of every row/column
> rowMeans(m.4)
row.1 row.2 row.3 row.4
    2     5     8    11
> colMeans(m.4)
col.1 col.2 col.3
  5.5   6.5   7.5
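rowSums() and friends cover sums and means only. For any other per-row or per-column summary, base R's apply() takes the matrix, the margin (1 for rows, 2 for columns) and a function. A sketch on the same 4 x 3 matrix, with illustrative object names:

```r
m.4 <- matrix(1:12, nrow = 4, ncol = 3, byrow = TRUE)

# Largest value in each row (margin 1) and each column (margin 2).
row_max <- apply(m.4, 1, max)
row_max        # 3 6 9 12
col_max <- apply(m.4, 2, max)
col_max        # 10 11 12

# apply() with sum gives the same numbers as rowSums().
row_sum <- apply(m.4, 1, sum)
row_sum        # 6 15 24 33
```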

Adding total rows and columns to a matrix
> m.4
      col.1 col.2 col.3
row.1     1     2     3
row.2     4     5     6
row.3     7     8     9
row.4    10    11    12
Add total column
> m.4 <- cbind(m.4, rowSums(m.4))
Setting the name for the total column
> column.names <- colnames(m.4)
> column.names
[1] "col.1" "col.2" "col.3" ""
> column.names[length(column.names)] <- "col.total"
> colnames(m.4) <- column.names

Adding total rows and columns to a matrix (cont.)
Check the operation
> m.4
      col.1 col.2 col.3 col.total
row.1     1     2     3         6
row.2     4     5     6        15
row.3     7     8     9        24
row.4    10    11    12        33
Add total row
> m.4 <- rbind(m.4, colSums(m.4))
Setting the name for the total row
> row.names <- rownames(m.4)
> row.names
[1] "row.1" "row.2" "row.3" "row.4" ""
> row.names[length(row.names)] <- "row.total"
> rownames(m.4) <- row.names

Adding total rows and columns to a matrix (cont.)
Check the operation; notice the names of rows and columns and the content of the last row and column
> m.4
          col.1 col.2 col.3 col.total
row.1         1     2     3         6
row.2         4     5     6        15
row.3         7     8     9        24
row.4        10    11    12        33
row.total    22    26    30        78
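The same totals can be added more compactly, because cbind() and rbind() use the name of a named argument as the label of the new column or row. This is a sketch of an alternative, not the approach used in the slides; the object names m and m_tot are illustrative.

```r
m <- matrix(1:12, nrow = 4, ncol = 3, byrow = TRUE,
            dimnames = list(paste0("row.", 1:4), paste0("col.", 1:3)))

# The argument name becomes the column name of the appended vector.
m_tot <- cbind(m, col.total = rowSums(m))

# Same idea for the total row; it also sums the new total column.
m_tot <- rbind(m_tot, row.total = colSums(m_tot))

m_tot["row.total", "col.total"]   # grand total: 78
```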

Arrays
Similar to matrices but can have more than two dimensions
Elements must be of the same type
Created with the array function:
> myarray <- array(vector,
+     dimensions, dimnames)
vector contains the data for the array
dimensions is a numeric vector giving the maximal index for each dimension
dimnames - optional list of dimension labels.
Elements in arrays are accessed similarly to those in matrices

Create and access arrays
> dim1 <- c("A1", "A2")
> dim2 <- c("B1", "B2", "B3")
> dim3 <- c("C1", "C2", "C3", "C4")
> a1 <- array(1:24, c(2, 3, 4),
+     dimnames=list(dim1, dim2, dim3))
> a1
, , C1
   B1 B2 B3
A1  1  3  5
A2  2  4  6
, , C2
   B1 B2 B3
A1  7  9 11
A2  8 10 12
, , C3
   B1 B2 B3
A1 13 15 17
A2 14 16 18
, , C4
   B1 B2 B3
A1 19 21 23
A2 20 22 24
Display element [2,2,3]
> a1[2,2,3]
[1] 16

Create and access arrays (cont.)
Display a matrix with the elements of A and B for the first level of C
> a1[,,1]
   B1 B2 B3
A1  1  3  5
A2  2  4  6
Display the elements of A for the 3rd level of B and the 2nd level of C
> a1[,3,2]
A1 A2
11 12
Display a subarray containing the elements from the first two levels of A, B and C
> a1[c(1,2),c(1,2),c(1,2)]
, , C1
   B1 B2
A1  1  3
A2  2  4
, , C2
   B1 B2
A1  7  9
A2  8 10
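Since the array has dimension names, elements can also be addressed by label instead of position, and apply() can summarise along any dimension. A minimal sketch rebuilding the same array (the names el and slice_sums are illustrative):

```r
a1 <- array(1:24, c(2, 3, 4),
            dimnames = list(c("A1", "A2"),
                            c("B1", "B2", "B3"),
                            c("C1", "C2", "C3", "C4")))

dim(a1)                        # 2 3 4

# Same element as a1[2, 2, 3], addressed by labels.
el <- a1["A2", "B2", "C3"]
el                             # 16

# Sum of each 2 x 3 slice along the third dimension.
slice_sums <- apply(a1, 3, sum)
slice_sums                     # C1 = 21, C2 = 57, C3 = 93, C4 = 129
```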

Data Frames
Most important data structure in R (at least for us)
A data frame is a structure in R that holds data and is similar to the datasets found in standard statistical packages (for example, SAS, SPSS, and Stata) and databases
The columns are variables and the rows are observations
Variables can have different types (for example, numeric, character) in the same data frame

Create an empty data frame
> student_gi <- data.frame(studentID = numeric(),
+     name = character(), age = numeric(),
+     scholarship = character(),
+     lab_assessment = character(),
+     final_grade = numeric())
> class(student_gi)
[1] "data.frame"
> str(student_gi)
'data.frame':   0 obs. of  6 variables:
 $ studentID     : num
 $ name          : Factor w/ 0 levels:
 $ age           : num
 $ scholarship   : Factor w/ 0 levels:
 $ lab_assessment: Factor w/ 0 levels:
 $ final_grade   : num

Create a data frame from vectors
Create the vectors
> studentID <- c(1, 2, 3, 4, 5)
> name <- c("Popescu I. Vasile", "Ianos W. Adriana",
+     "Kovacz V. Iosef", "Babadag I. Maria",
+     "Pop P. Ion")
> age <- c(23, 19, 21, 22, 31)
> scholarship <- c("Social", "Studiu1", "Studiu2",
+     "Merit", "Studiu1")
> lab_assessment <- c("Bine", "Foarte bine",
+     "Excelent", "Bine", "Slab")
> final_grade <- c(9, 9.45, 9.75, 9, 6)
Create the data frame using the above vectors
> student_gi <- data.frame(studentID, name, age,
+     scholarship, lab_assessment, final_grade)
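The outputs in these slides show character columns stored as factors, which was the default of data.frame() before R 4.0.0; from R 4.0.0 on, strings stay character unless asked otherwise. The sketch below sets the stringsAsFactors argument explicitly so the behaviour does not depend on the R version; the data frame names df_chr and df_fac are illustrative.

```r
name <- c("Popescu I. Vasile", "Ianos W. Adriana")
age  <- c(23, 19)

# Strings kept as character (the default since R 4.0.0).
df_chr <- data.frame(name, age, stringsAsFactors = FALSE)
class(df_chr$name)    # "character"

# Strings converted to factors (the older default, as in the slide outputs).
df_fac <- data.frame(name, age, stringsAsFactors = TRUE)
class(df_fac$name)    # "factor"
```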

Display data frame content
Display data frame (content)
> student_gi
  studentID              name age scholarship lab_assessment final_grade
1         1 Popescu I. Vasile  23      Social           Bine        9.00
2         2  Ianos W. Adriana  19     Studiu1    Foarte bine        9.45
3         3   Kovacz V. Iosef  21     Studiu2       Excelent        9.75
4         4  Babadag I. Maria  22       Merit           Bine        9.00
5         5        Pop P. Ion  31     Studiu1           Slab        6.00
Display one column of the data frame as a vector
> student_gi$name
[1] Popescu I. Vasile Ianos W. Adriana Kovacz V. Iosef Babadag I. Maria Pop P. Ion
Levels: Babadag I. Maria Ianos W. Adriana Kovacz V. Iosef Pop P. Ion Popescu I. Vasile
Display one column of the data frame as a... column
> student_gi["name"]
               name
1 Popescu I. Vasile
2  Ianos W. Adriana
3   Kovacz V. Iosef
4  Babadag I. Maria
5        Pop P. Ion

Display data frame structure
Confirm student_gi is indeed a data frame
> class(student_gi)
[1] "data.frame"
Display structure of the data frame
> str(student_gi)
'data.frame':   5 obs. of  6 variables:
 $ studentID     : num  1 2 3 4 5
 $ name          : Factor w/ 5 levels "Babadag I. Maria",..: 5 2 3 1 4
 $ age           : num  23 19 21 22 31
 $ scholarship   : Factor w/ 4 levels "Merit","Social",..: 2 3 4 1 3
 $ lab_assessment: Factor w/ 4 levels "Bine","Excelent",..: 1 3 2 1 4
 $ final_grade   : num  9 9.45 9.75 9 6
Display type of individual variables within the data frame
> class(student_gi$studentID)
[1] "numeric"
> class(student_gi$name)
[1] "factor"

Useful functions for displaying some data frame properties
Number of observations (rows)
> nrow(student_gi)
[1] 5
Number of variables (columns)
> ncol(student_gi)
[1] 6
Both the number of observations (rows) and variables (columns)
> dim(student_gi)
[1] 5 6
Display the names of all the variables (columns)
> names(student_gi)
[1] "studentID"      "name"           "age"            "scholarship"
[5] "lab_assessment" "final_grade"
Display the names of the second, third and fourth variable
> names(student_gi[2:4])

Selecting columns
Select/display first two columns (studentID and name)
> student_gi[1:2]
  studentID              name
1         1 Popescu I. Vasile
2         2  Ianos W. Adriana
3         3   Kovacz V. Iosef
4         4  Babadag I. Maria
5         5        Pop P. Ion
or
> student_gi[, 1:2]
or
> student_gi[c("studentID", "name")]
or (see on next slide)

Selecting columns (cont.)
Select/display first two columns (studentID and name) - other solutions
> student_gi[, c("studentID", "name")]
Using a vector for storing the names of the first two columns
> cols <- c("studentID", "name")
> student_gi[cols]
or
> student_gi[, names(student_gi) %in% cols]
Return "final_grade" variable (column) as a vector
> student_gi$final_grade
[1] 9.00 9.45 9.75 9.00 6.00
or ... (see on the next slide)

Selecting columns (cont.)
Return "final_grade" variable (column) as a vector (cont.)
> student_gi[ , 6]
or
> student_gi[ , "final_grade"]
Return "final_grade" variable (column) as a one-column data frame
> student_gi[ , "final_grade", drop=FALSE]
  final_grade
1        9.00
2        9.45
3        9.75
4        9.00
5        6.00

Selecting rows
Display first two observations (rows)
> student_gi[1:2,]
  studentID              name age scholarship lab_assessment final_grade
1         1 Popescu I. Vasile  23      Social           Bine        9.00
2         2  Ianos W. Adriana  19     Studiu1    Foarte bine        9.45
Display observations 1, 2 and 5
> student_gi[c(1:2, 5),]
  studentID              name age scholarship lab_assessment final_grade
1         1 Popescu I. Vasile  23      Social           Bine        9.00
2         2  Ianos W. Adriana  19     Studiu1    Foarte bine        9.45
5         5        Pop P. Ion  31     Studiu1           Slab        6.00
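Rows can also be selected by a condition rather than by position, in the same way vectors were filtered earlier. The sketch below builds a reduced, illustrative version of the data frame (named student, with only three of the columns) and shows both a logical index and base R's subset().

```r
student <- data.frame(
  name = c("Popescu I. Vasile", "Ianos W. Adriana", "Kovacz V. Iosef",
           "Babadag I. Maria", "Pop P. Ion"),
  age = c(23, 19, 21, 22, 31),
  final_grade = c(9, 9.45, 9.75, 9, 6),
  stringsAsFactors = FALSE)

# Logical vector in the row position: keep rows with a grade above 9.
high <- student[student$final_grade > 9, ]
high$name      # "Ianos W. Adriana" "Kovacz V. Iosef"

# subset() evaluates the condition inside the data frame and can pick columns.
young_high <- subset(student, age < 23 & final_grade >= 9,
                     select = c(name, final_grade))
nrow(young_high)   # 3
```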

attach function
attach adds the data frame to the R search path
> search()
 [1] ".GlobalEnv"        "tools:rstudio"
 [3] "package:stats"     "package:graphics"
 [5] "package:grDevices" "package:utils"
 [7] "package:datasets"  "package:methods"
 [9] "Autoloads"         "package:base"
When a variable name is encountered, data frames in the search path are checked in order to locate the variable.
Commands without attach
> student_gi$final_grade
> table(student_gi$lab_assessment,
+     student_gi$final_grade)
> summary(student_gi$final_grade)

attach vs. with
The same commands using attach
> attach(student_gi)
> final_grade
> table(lab_assessment, final_grade)
> summary(final_grade)
> plot(age, final_grade)
detach removes an object from the search path
> detach(student_gi)
It is advisable to use with instead of attach:
> with(student_gi, final_grade)
> with(student_gi, table(lab_assessment, final_grade))
> with(student_gi, plot(lab_assessment, final_grade))

Case (row) identifiers
Act like primary/unique keys in relational tables
Can be specified by the row.names option within the data.frame function
We allocate new values for studentID (to avoid confusion with row numbers); the remaining vectors are identical
> studentID <- c(1001, 1002, 1003, 1004, 1005)
> name <- c("Popescu I. Vasile",
+     "Ianos W. Adriana", "Kovacz V. Iosef",
+     "Babadag I. Maria", "Pop P. Ion")
> age <- c(23, 19, 21, 22, 31)
> scholarship <- c("Social", "Studiu1",
+     "Studiu2", "Merit", "Studiu1")
> lab_assessment <- c("Bine", "Foarte bine",
+     "Excelent", "Bine", "Slab")

Case (row) identifiers (cont.)
A (slightly) new version of the data frame:
> student_gi <- data.frame(studentID, name, age,
+     scholarship, lab_assessment,
+     final_grade, row.names = studentID)
studentID is the variable used to label cases on the printouts and graphics produced with R.
Display the names of the rows (observations)
> rownames(student_gi)
[1] "1001" "1002" "1003" "1004" "1005"
> student_gi
     studentID              name age scholarship lab_assessment
1001      1001 Popescu I. Vasile  23      Social           Bine
1002      1002  Ianos W. Adriana  19     Studiu1    Foarte bine
1003      1003   Kovacz V. Iosef  21     Studiu2       Excelent
...

Case (row) identifiers (cont.)
Display the names of the rows (observations)
> rownames(student_gi)
[1] "1001" "1002" "1003" "1004" "1005"
Notice the leftmost column of the data frame display
> student_gi
     studentID              name age scholarship lab_assessment
1001      1001 Popescu I. Vasile  23      Social           Bine
1002      1002  Ianos W. Adriana  19     Studiu1    Foarte bine
1003      1003   Kovacz V. Iosef  21     Studiu2       Excelent
1004      1004  Babadag I. Maria  22       Merit           Bine
1005      1005        Pop P. Ion  31     Studiu1           Slab
     final_grade
1001        9.00
1002        9.45
1003        9.75
1004        9.00
1005        6.00

Case (row) identifiers (cont.)
Display the observation (row) corresponding to student Ianos W. Adriana using her case identifier ("1002")
> student_gi["1002",]
     studentID             name age scholarship lab_assessment
1002      1002 Ianos W. Adriana  19     Studiu1    Foarte bine
     final_grade
1002        9.45
Display the observations corresponding to students Ianos W. Adriana and Pop P. Ion using their case identifiers ("1002" and "1005")
> student_gi[c("1002", "1005"),]
     studentID             name age scholarship lab_assessment
1002      1002 Ianos W. Adriana  19     Studiu1    Foarte bine
1005      1005       Pop P. Ion  31     Studiu1           Slab
     final_grade
1002        9.45
1005        6.00

Factors (reprise)
In presentation 02a, variables were described as nominal, ordinal, interval, and ratio
Nominal variables are categorical, without an implied order. Examples: MaritalStatus, Sex, Job, MasterProgramme
Ordinal variables imply order but not amount. Examples: Status (poor, improved, excellent), LabAssessment (slab, bine, foarteBine, excelent)
Interval and Ratio variables can take on any value within some range, and both order and amount are implied. Examples: LitersPer100Km, Height, Weight, FinalGrade (with decimals)
Categorical (nominal) and ordered categorical (ordinal) variables are called factors.

Function factor
Factors determine how data will be analyzed and presented visually
The function factor() stores the categorical values as a vector of integers in the range [1... k] (where k is the number of unique values in the nominal variable), and an internal vector of character strings (the original values) mapped to these integers
Initially, vector scholarship holds the values of a nominal variable as plain character strings
> scholarship <- c("Social", "Studiu1", "Studiu2",
+     "Merit", "Studiu1")
Now it will be converted into a factor:
> scholarship_f <- factor(scholarship)
> scholarship_f
[1] Social  Studiu1 Studiu2 Merit   Studiu1
Levels: Merit Social Studiu1 Studiu2

Ordered factors
An ordinal variable
> lab_assessment <- c("Bine", "Foarte bine",
+     "Excelent", "Bine", "Slab")
Notice the way it is displayed
> lab_assessment
[1] "Bine"        "Foarte bine" "Excelent"    "Bine"
[5] "Slab"
Now declare the vector as an ordered factor
> lab_assessment <- factor(lab_assessment,
+     order=TRUE, levels=c("Slab", "Bine",
+     "Foarte bine", "Excelent"))
Notice the new way of displaying the vector
> lab_assessment
[1] Bine        Foarte bine Excelent    Bine        Slab
Levels: Slab < Bine < Foarte bine < Excelent
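Declaring the order of the levels is what makes comparisons meaningful: an ordered factor can be compared with one of its levels, and functions such as max() respect the declared order rather than the alphabetical one. A minimal sketch, using the illustrative name lab:

```r
lab <- factor(c("Bine", "Foarte bine", "Excelent", "Bine", "Slab"),
              ordered = TRUE,
              levels = c("Slab", "Bine", "Foarte bine", "Excelent"))

# Which assessments are strictly better than "Bine"?
better <- lab > "Bine"
better                 # FALSE TRUE TRUE FALSE FALSE

# Integer codes follow the declared level order.
as.integer(lab)        # 2 3 4 2 1

# Best assessment according to the level order.
best <- as.character(max(lab))
best                   # "Excelent"
```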

Factors in data frames
Re-create the data frame using factors
> studentID <- c(1001, 1002, 1003, 1004, 1005)
> name <- c("Popescu I. Vasile", "Ianos W. Adriana",
+     "Kovacz V. Iosef", "Babadag I. Maria",
+     "Pop P. Ion")
> age <- c(23, 19, 21, 22, 31)
> scholarship <- c("Social", "Studiu1", "Studiu2",
+     "Merit", "Studiu1")
> scholarship <- factor(scholarship)
> lab_assessment <- c("Bine", "Foarte bine",
+     "Excelent", "Bine", "Slab")
> lab_assessment <- factor(lab_assessment,
+     order=TRUE, levels=c("Slab", "Bine",
+     "Foarte bine", "Excelent"))
> final_grade <- c(9, 9.45, 9.75, 9, 6)

Factors in data frames (cont.)
Another version of the data frame
> student_gi <- data.frame(name, age, scholarship,
+     lab_assessment, final_grade,
+     row.names = studentID)
Display the structure of the data frame
> str(student_gi)
'data.frame':   5 obs. of  5 variables:
 $ name          : Factor w/ 5 levels "Babadag I. Maria",..: 5 2 3 1 4
 $ age           : num  23 19 21 22 31
 $ scholarship   : Factor w/ 4 levels "Merit","Social",..: 2 3 4 1 3
 $ lab_assessment: Ord.factor w/ 4 levels "Slab"<"Bine"<..: 2 3 4 2 1
 $ final_grade   : num  9 9.45 9.75 9 6

Factors in data frames (cont.)
Basic statistics about variables in data frame
> summary(student_gi)
                name        age        scholarship
 Babadag I. Maria :1   Min.   :19.0   Merit  :1
 Ianos W. Adriana :1   1st Qu.:21.0   Social :1
 Kovacz V. Iosef  :1   Median :22.0   Studiu1:2
 Pop P. Ion       :1   Mean   :23.2   Studiu2:1
 Popescu I. Vasile:1   3rd Qu.:23.0
                       Max.   :31.0
    lab_assessment  final_grade
 Slab       :1     Min.   :6.00
 Bine       :2     1st Qu.:9.00
 Foarte bine:1     Median :9.00
 Excelent   :1     Mean   :8.64
                   3rd Qu.:9.45
                   Max.   :9.75

Factors and value labels
> patientID <- c(1, 2, 3, 4)
> age <- c(25, 34, 28, 52)
> diabetes <- c("Type1", "Type2", "Type1", "Type1")
> status <- c("Poor", "Improved", "Excellent",
+     "Poor")
> diabetes <- factor(diabetes)
> status <- factor(status, order=TRUE)
> gender <- c(1, 2, 2, 1)
> patientdata <- data.frame(patientID, age,
+     diabetes, status, gender)
For variable gender (coded 1 for males and 2 for females) the value labels are declared with options levels (indicating the values) and labels (indicating the labels):
> patientdata$gender <-
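The statement above is cut off in the source. As a sketch consistent with the description (levels giving the stored codes, labels giving the text shown for each code) and with the male/female values visible in the later output, the conversion could look as follows. The exact call used in the original script is an assumption here; gender_f is an illustrative name.

```r
gender <- c(1, 2, 2, 1)

# levels: the codes present in the data; labels: the text for each code,
# given in the same order as the levels.
gender_f <- factor(gender, levels = c(1, 2), labels = c("male", "female"))

as.character(gender_f)   # "male" "female" "female" "male"
levels(gender_f)         # "male" "female"
```

Codes that are not listed in levels would become NA, which makes this form a simple check against unexpected values.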

Factors and value labels (cont.)
For gender, labels (instead of values) are displayed
> patientdata
  patientID age diabetes    status gender
1         1  25    Type1      Poor   male
2         2  34    Type2  Improved female
3         3  28    Type1 Excellent female
4         4  52    Type1      Poor   male
Data frame structure (see information about gender):
> str(patientdata)
'data.frame':   4 obs. of  5 variables:
 $ patientID: num  1 2 3 4
 $ age      : num  25 34 28 52
 $ diabetes : Factor w/ 2 levels "Type1","Type2": 1 2 1 1
 $ status   : Ord.factor w/ 3 levels "Excellent"<"Improved"<..: 3 2 1 3
 $ gender   : Factor w/ 2 levels "male","female": 1 2 2 1

Lists
Lists are the most complex of the R data types
A list is an ordered collection of objects (components).
A list allows gathering a large variety of (possibly unrelated) objects under one name.
A list can contain a combination of vectors, matrices, data frames, and even other lists
Created using the list() function:
mylist <- list(object1, object2, ...)
where the objects are any of the structures seen so far
Optionally, the objects in a list can be named:
mylist <- list(name1=object1, name2=object2, ...)
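A concrete instance of the named form, with made-up component names (title, ids, flags) chosen only for illustration. It also shows the three ways of reaching a component: $ and [[ ]] return the component itself, while [ ] returns a shorter list.

```r
mylist <- list(title = "demo", ids = 3:7, flags = c(TRUE, FALSE))

names(mylist)             # "title" "ids" "flags"

# $ and [[ ]] return the component itself (here an integer vector).
by_dollar  <- mylist$ids
by_bracket <- mylist[["ids"]]

# Single brackets return a list of length 1 that contains the component.
sub_list <- mylist["ids"]
class(sub_list)           # "list"
```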

First example of list: POSIXlt variables
Variable t gets the current system timestamp:
> t = Sys.time()
POSIXlt objects are actually lists
> l.1 <- as.POSIXlt(t)
> l.1
[1] "2014-09-25 08:37:24 EEST"
> typeof(l.1)
[1] "list"
> names(l.1)
NULL
> unclass(l.1)
$sec
[1] 24.19267
$min
[1] 37
$hour
[1] 8
$mday
[1] 25
...

First example of list: POSIXlt variables (cont.)


Extract the values of the list components (seconds, minutes, hours, ...), equivalent to l.1$sec, l.1$min, ...:

> l.1[[1]]
[1] 24.19267
> l.1[[2]]
[1] 37
> l.1[[3]]
[1] 8
> l.1[[4]]
[1] 25
...

Display (horizontally) the components of the timestamp object

> unlist(l.1)
      sec       min      hour      mday       mon      year      wday      yday     isdst
 24.19267  37.00000   8.00000  25.00000   8.00000 114.00000       ...       ...       ...
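
The same idea can be sketched with a fixed timestamp so that the values are reproducible. The timestamp and the UTC time zone are assumptions chosen to mirror the slide. Components are read by name after unclass(), which does not depend on how [[ ]] is dispatched for the POSIXlt class.

```r
# Assumed fixed timestamp (UTC) so the output does not depend on the clock
lt <- as.POSIXlt("2014-09-25 08:37:24", tz = "UTC")

print(typeof(lt))        # POSIXlt is stored as a list
parts <- unclass(lt)     # plain list of components
print(parts$hour)
print(parts[["min"]])
print(lt$mday)           # $ also works on the POSIXlt object itself
print(lt$mon + 1)        # months are counted from 0
print(lt$year + 1900)    # years are counted from 1900
```

This also explains the values 8 and 114 shown for mon and year on the slide: September is month 8 when counting from 0, and 2014 is 114 years after 1900.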

Matrices and lists


The matrix dimension names (dimnames) object is a list

> m.3 <- matrix(cells, nrow=2, ncol=2,
+     byrow=FALSE,
+     dimnames=list(rownames, colnames))
> m.3
     Col1 Col2
Row1    1   24
Row2   26   68
> dimnames(m.3)
[[1]]
[1] "Row1" "Row2"
[[2]]
[1] "Col1" "Col2"
> unlist(dimnames(m.3))
[1] "Row1" "Row2" "Col1" "Col2"
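
The objects cells, rownames and colnames are defined earlier in the deck. A self-contained sketch follows, with their values assumed so that the result matches the matrix printed above.

```r
# Assumed inputs, chosen so the result matches the matrix printed on the slide
cells    <- c(1, 26, 24, 68)
rownames <- c("Row1", "Row2")
colnames <- c("Col1", "Col2")

m.3 <- matrix(cells, nrow = 2, ncol = 2, byrow = FALSE,
              dimnames = list(rownames, colnames))

print(is.list(dimnames(m.3)))   # dimnames is a list of two character vectors
print(dimnames(m.3)[[2]])       # second component: the column names
print(m.3["Row2", "Col1"])      # cells can be addressed by name
```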

Creating and displaying simple lists


Create two simple lists

> list.1 = list("unu", "doi", "trei")
> list.2 = list(c("doi", "trei", "patru"))

Visualizing lists

> list.1
[[1]]
[1] "unu"

[[2]]
[1] "doi"

[[3]]
[1] "trei"

> list.2
[[1]]
[1] "doi"   "trei"  "patru"
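
The two lists look similar but are built differently, which a quick length check makes visible. This is a minimal sketch using the same values as above.

```r
list.1 <- list("unu", "doi", "trei")        # three components, one string each
list.2 <- list(c("doi", "trei", "patru"))   # one component: a vector of three strings

print(length(list.1))
print(length(list.2))
print(length(list.2[[1]]))
```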

Create a more complex list


list.3 contains the two previous lists, a vector (sequence) and a data frame:

> list.3 = list(list.1, list.2, 3:7, patientdata)
> list.3
[[1]]
[[1]][[1]]
[1] "unu"

[[1]][[2]]
[1] "doi"

[[1]][[3]]
[1] "trei"


[[2]]
[[2]][[1]]
[1] "doi"   "trei"  "patru"


[[3]]
[1] 3 4 5 6 7

[[4]]
  patientID age diabetes    status gender
1         1  25    Type1      Poor   male
2         2  34    Type2  Improved female
3         3  28    Type1 Excellent female
4         4  52    Type1      Poor   male

Create a more complex list (cont.)


Display the structure of list.3:

> str(list.3)
List of 4
 $ :List of 3
  ..$ : chr "unu"
  ..$ : chr "doi"
  ..$ : chr "trei"
 $ :List of 1
  ..$ : chr [1:3] "doi" "trei" "patru"
 $ : int [1:5] 3 4 5 6 7
 $ :'data.frame': 4 obs. of 5 variables:
  ..$ patientID: num [1:4] 1 2 3 4
  ..$ age      : num [1:4] 25 34 28 52
  ..$ diabetes : Factor w/ 2 levels "Type1","Type2": 1 2 1 1
  ..$ status   : Ord.factor w/ 3 levels "Excellent"<"Improved"<..: 3 2 1 3
  ..$ gender   : Factor w/ 2 levels "male","female": 1 2 2 1

Accessing list components


Display the number of objects in a list

> length(list.3)
[1] 4

Access the first object of the list

> list.3[[1]]
[[1]]
[1] "unu"

[[2]]
[1] "doi"
[[3]]
[1] "trei"
> class(list.3[[1]])
[1] "list"
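
The double brackets matter here. As a brief sketch with a hypothetical list (lst), single brackets return a shorter list, while double brackets return the component itself:

```r
# Hypothetical list with two components of different types
lst <- list(letters[1:3], 3:7)

one_bracket <- lst[2]     # a sub-list of length 1
two_bracket <- lst[[2]]   # the component itself (an integer vector)

print(class(one_bracket))
print(class(two_bracket))
print(sum(two_bracket))   # arithmetic works on the extracted vector
```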

Accessing list components (cont)


Access the second component of the list

> list.3[[2]]
[[1]]
[1] "doi"   "trei"  "patru"

> class(list.3[[2]])
[1] "list"

... and the fourth component

> list.3[[4]]
  patientID age diabetes    status gender
1         1  25    Type1      Poor   male
2         2  34    Type2  Improved female
3         3  28    Type1 Excellent female
4         4  52    Type1      Poor   male
> class(list.3[[4]])
[1] "data.frame"

List component attributes/names


Function names displays the names of the components of a list

The first object of list.3 is a list whose components have no names:

> names(list.3[[1]])
NULL

The fourth object of list.3 is the data frame patientdata; this data frame has five variables (columns) whose names can be displayed with function names:

> names(list.3[[4]])
[1] "patientID" "age"       "diabetes"  "status"    "gender"

Accessing components within components


Display the third object within the first component of list.3

> list.3[[1]][[3]]
[1] "trei"

Display, in the data frame patientdata (the data frame is the 4th component of the list), the values of column age (this column is the 2nd of the data frame)

> list.3[[4]][, 2]
[1] 25 34 28 52

or

> list.3[[4]][, "age"]

Display age as a column (not a vector)

> list.3[[4]][, "age", drop=FALSE]
  age
1  25
2  34
3  28
4  52

Display the age of the third patient

> list.3[[4]][, 2][3]
> list.3[[4]][, "age", drop=FALSE]$age[3]
[1] 28
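
A few shorter spellings reach the same value. The sketch below rebuilds a reduced stand-in for list.3 (only the patientID and age columns of patientdata are kept, which is an assumption made for brevity).

```r
# Reduced stand-in for the objects used above
patientdata <- data.frame(patientID = 1:4, age = c(25, 34, 28, 52))
list.3 <- list(list("unu", "doi", "trei"),
               list(c("doi", "trei", "patru")),
               3:7,
               patientdata)

print(list.3[[1]][[3]])        # third object of the first component
print(list.3[[4]]$age[3])      # $ on the data frame, then a vector index
print(list.3[[4]][3, "age"])   # row and column in a single subscript
print(list.3[[c(4, 2)]])       # recursive index: component 4, then its 2nd column
```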

Tables in R
Not a full-fledged data structure, but a sort of labeled (named) array
Some functions (e.g. graphic functions, categorical data analysis functions) accept only tables as arguments
More about tables in script 06c

Two main types of tables:

tables of frequencies, which count the number of occurrences of each value of a (usually) categorical variable
tables of proportions, which divide the number of occurrences of each value by the total number of occurrences of a (usually) categorical variable
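
Both kinds of table can be sketched with base R's table() and prop.table(). The vector below is a hypothetical categorical variable, not the course data.

```r
# Hypothetical categorical variable
scholarship <- c("Merit", "Social", "Studiu1", "Studiu1", "Studiu2")

freq <- table(scholarship)   # table of frequencies
prop <- prop.table(freq)     # table of proportions (each count / total)

print(freq)
print(prop)
print(sum(prop))             # proportions add up to 1
```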

Uni-dimensional tables
Create a table with the frequencies of scholarship in data frame student_gi

> table.1 <- with(student_gi, table(scholarship))
> table.1
scholarship
  Merit  Social Studiu1 Studiu2
      1       1       2       1

Display the structure of table.1

> str(table.1)
 'table' int [1:4(1d)] 1 1 2 1
 - attr(*, "dimnames")=List of 1
  ..$ scholarship: chr [1:4] "Merit" "Social" "Studiu1" "Studiu2"
> class(table.1)
[1] "table"

Unidimensional tables are vectors with labeled elements (each element's label is a value of the attribute used in function table)

> names(table.1)
[1] "Merit"   "Social"  "Studiu1" "Studiu2"

Access/display uni-dimensional tables


table.1 is not a data frame, so we cannot qualify the variable using $...

> table.1$Merit
Error in table.1$Merit : $ operator is invalid for atomic vectors

... but we can access it with vector indices

> table.1[1]
Merit
    1

... or with double-bracket (list-style) indices

> table.1[[1]]
[1] 1

Display both the label and the value of the 3rd element in table.1:

> table.1[3]
Studiu1
      2

... or

> unlist(table.1)[3]
Studiu1
      2

Access/display uni-dimensional tables (cont.)


Display only the label of the 3rd element of table.1:

> names(table.1)[3]
[1] "Studiu1"

Display only the value of the 3rd element in table.1:

> unlist(table.1)[[3]]
[1] 2

Display both the name and the value of the 3rd element, selected by name:

> table.1["Studiu1"]
Studiu1
      2

Display both the names and the values of two elements, selected by their names:

> table.1[c("Merit", "Studiu1")]
scholarship
  Merit Studiu1
      1       2
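
The data frame student_gi is not defined in this part of the deck. The sketch below uses a stand-in vector reconstructed from the counts shown above (an assumption, not the original data) to run the same access patterns end to end.

```r
# Stand-in for student_gi$scholarship, reconstructed from the counts shown above
scholarship <- c("Studiu1", "Merit", "Social", "Studiu1", "Studiu2")
table.1 <- table(scholarship)

print(names(table.1)[3])                # label only
print(table.1[[3]])                     # value only
print(table.1["Studiu1"])               # label and value, selected by name
print(table.1[c("Merit", "Studiu1")])   # several elements, selected by name
```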

Bi-dimensional tables
Similar to pivot tables in Excel

Create a contingency (pivot) table with the frequencies of scholarship by lab_assessment

> table.2 <- with(student_gi, table(scholarship, lab_assessment))
> table.2
           lab_assessment
scholarship Slab Bine Foarte bine Excelent
    Merit      0    1           0        0
    Social     0    1           0        0
    Studiu1    1    0           1        0
    Studiu2    0    0           0        1

Structure of table.2

> str(table.2)
 'table' int [1:4, 1:4] 0 0 1 0 1 1 0 0 0 0 ...
 - attr(*, "dimnames")=List of 2
  ..$ scholarship   : chr [1:4] "Merit" "Social" "Studiu1" "Studiu2"
  ..$ lab_assessment: chr [1:4] "Slab" "Bine" "Foarte bine" "Excelent"
> class(table.2)
[1] "table"

Accessing bi-dimensional tables


Any cell can be accessed using the indices of row and column...

> table.2[1,2]
[1] 1

... or the names/labels

> table.2["Merit", "Bine"]
[1] 1

Display the second column (associated with value Bine of lab_assessment) as a vector using the index (2)...

> table.2[, 2]
  Merit  Social Studiu1 Studiu2
      1       1       0       0

... or the name of the column (Bine)

> table.2[, "Bine"]
  Merit  Social Studiu1 Studiu2
      1       1       0       0

Accessing bi-dimensional tables (cont.)

Similarly, one can access individual rows (or groups of rows)

Access particular rows and columns in a table

> table.2[c("Merit", "Studiu1"), c("Slab", "Excelent")]
           lab_assessment
scholarship Slab Excelent
    Merit      0        0
    Studiu1    1        0
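
A runnable sketch of the same accesses follows. The data frame is a stand-in reconstructed from the frequencies printed above (an assumption, since the original student_gi is defined elsewhere). lab_assessment is declared as a factor with explicit levels so that its columns appear in the order Slab, Bine, Foarte bine, Excelent rather than alphabetically.

```r
# Stand-in for student_gi, reconstructed from the frequencies printed above
student_gi <- data.frame(
  scholarship = c("Studiu1", "Merit", "Social", "Studiu1", "Studiu2"),
  lab_assessment = factor(
    c("Slab", "Bine", "Bine", "Foarte bine", "Excelent"),
    levels = c("Slab", "Bine", "Foarte bine", "Excelent")
  )
)
table.2 <- with(student_gi, table(scholarship, lab_assessment))

print(table.2[1, 2] == table.2["Merit", "Bine"])   # index and label reach the same cell
print(table.2["Studiu1", ])                        # one row, selected by label
print(rowSums(table.2))                            # students per scholarship type
print(colSums(table.2))                            # students per lab assessment
```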

Tri-dimensional tables
Create a three-dimensional table with the frequencies of scholarship by lab_assessment by final_grade

> table.3 <- with(student_gi, table(scholarship, lab_assessment,
+     final_grade))

Display table.3

> table.3
, , final_grade = 6

           lab_assessment
scholarship Slab Bine Foarte bine Excelent
    Merit      0    0           0        0
    Social     0    0           0        0
    Studiu1    1    0           0        0
    Studiu2    0    0           0        0

, , final_grade = 9

           lab_assessment
scholarship Slab Bine Foarte bine Excelent
    Merit      0    1           0        0
    Social     0    1           0        0
    Studiu1    0    0           0        0
    Studiu2    0    0           0        0

Tri-dimensional tables (cont.)


Display table.3 (cont.)

, , final_grade = 9.45

           lab_assessment
scholarship Slab Bine Foarte bine Excelent
    Merit      0    0           0        0
    Social     0    0           0        0
    Studiu1    0    0           1        0
    Studiu2    0    0           0        0

, , final_grade = 9.75

           lab_assessment
scholarship Slab Bine Foarte bine Excelent
    Merit      0    0           0        0
    Social     0    0           0        0
    Studiu1    0    0           0        0
    Studiu2    0    0           0        1

ftable
ftable improves the display of three-dimensional tables

> ftable(table.3)
                           final_grade 6 9 9.45 9.75
scholarship lab_assessment
Merit       Slab                       0 0    0    0
            Bine                       0 1    0    0
            Foarte bine                0 0    0    0
            Excelent                   0 0    0    0
Social      Slab                       0 0    0    0
            Bine                       0 1    0    0
            Foarte bine                0 0    0    0
            Excelent                   0 0    0    0
Studiu1     Slab                       1 0    0    0
            Bine                       0 0    0    0
            Foarte bine                0 0    1    0
            Excelent                   0 0    0    0
Studiu2     Slab                       0 0    0    0
            Bine                       0 0    0    0
            Foarte bine                0 0    0    0
            Excelent                   0 0    0    1

Accessing three-dimensional tables


Any cell can be accessed using the indices of the three axes...

> table.3[3, 3, 3]
[1] 1

... or the names/labels

> table.3["Studiu2", "Excelent", "9.75"]
[1] 1

To display, as a one-dimensional table, the values of lab_assessment which correspond to value Studiu2 (4th) of scholarship and value 9.75 (4th) of final_grade, one can use the indexes...

> table.3[4, , 4]
       Slab        Bine Foarte bine    Excelent
          0           0           0           1

... or the labels/names

> table.3["Studiu2", , "9.75"]
       Slab        Bine Foarte bine    Excelent
          0           0           0           1

Accessing three-dimensional tables (cont.)


To display, as a bi-dimensional table, the values of the first (scholarship) and the third (final_grade) axes associated with the 4th value (Excelent) of the second axis (lab_assessment), one can use the index...

> table.3[, 4, ]
           final_grade
scholarship 6 9 9.45 9.75
    Merit   0 0    0    0
    Social  0 0    0    0
    Studiu1 0 0    0    0
    Studiu2 0 0    0    1

... or the label/name

> table.3[, "Excelent", ]
           final_grade
scholarship 6 9 9.45 9.75
    Merit   0 0    0    0
    Social  0 0    0    0
    Studiu1 0 0    0    0
    Studiu2 0 0    0    1

Accessing three-dimensional tables (cont.)

One can access particular ranges on each axis

> table.3[c("Merit", "Studiu1"), c("Slab",
+     "Excelent"), c("9.45", "9.75")]
, , final_grade = 9.45

           lab_assessment
scholarship Slab Excelent
    Merit      0        0
    Studiu1    0        0

, , final_grade = 9.75

           lab_assessment
scholarship Slab Excelent
    Merit      0        0
    Studiu1    0        0
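
The three-dimensional accesses can be tried on a stand-in data frame reconstructed from the frequencies printed above (an assumption; the original student_gi is defined elsewhere in the deck). Note that the labels on the final_grade axis are character strings such as "9.75", which is why they are quoted when used as subscripts.

```r
# Stand-in for student_gi, reconstructed from the frequencies printed above
student_gi <- data.frame(
  scholarship = c("Studiu1", "Merit", "Social", "Studiu1", "Studiu2"),
  lab_assessment = factor(
    c("Slab", "Bine", "Bine", "Foarte bine", "Excelent"),
    levels = c("Slab", "Bine", "Foarte bine", "Excelent")
  ),
  final_grade = c(6, 9, 9, 9.45, 9.75)
)
table.3 <- with(student_gi, table(scholarship, lab_assessment, final_grade))

print(dim(table.3))                             # one axis per variable
print(table.3["Studiu2", "Excelent", "9.75"])   # a single cell, by label
print(table.3["Studiu2", , "9.75"])             # fix two axes, keep the third
print(apply(table.3, 3, sum))                   # students per final grade
print(ftable(table.3))                          # flat, two-dimensional display
```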

Built-in datasets

Some datasets are available in base (core) R (e.g. faithful)

> head(faithful, 3)
  eruptions waiting
1     3.600      79
2     1.800      54
3     3.333      74

Most data sets are available in packages (e.g. ggplot2, vcd, ...)

In most cases, data sets are stored as data frames, e.g. the dataset movies from package ggplot2 (since ggplot2 2.0.0 this dataset ships in the separate package ggplot2movies)

Every package must be installed (once per computer)

> install.packages("ggplot2")

After installation, a package must be loaded (once for every R session)

> library(ggplot2)

Built-in datasets (cont.)

Display the structure of dataset movies

> str(movies)
'data.frame': 58788 obs. of 24 variables:
 $ title : chr "$" "$1000 a Touchdown" "$21 a Day Once a Month" "$40,000" ...
 $ year  : int 1971 1939 1941 1996 1975 2000 2002 2002 1987 1917 ...
 $ length: int 121 71 7 70 71 91 93 25 97 61 ...
 $ budget: int NA NA NA NA NA NA NA NA NA NA ...
 $ rating: num 6.4 6 8.2 8.2 3.4 4.3 5.3 6.7 6.6 6 ...
 $ votes : int 348 20 5 6 17 45 200 24 18 51 ...
 $ r1    : num 4.5 0 0 14.5 24.5 4.5 4.5 4.5 4.5 4.5 ...
 $ r2    : num 4.5 14.5 0 0 4.5 4.5 0 4.5 4.5 0 ...
 $ r3    : num 4.5 4.5 0 0 0 4.5 4.5 4.5 4.5 4.5 ...
 ...

Built-in dataset stored as table

Data set HairEyeColor (part of the base package datasets and used throughout the vcd tutorial, http://cran.us.r-project.org/web/packages/vcdExtra/vignettes/vcd-tutorial.pdf) is stored as a three-dimensional table

> install.packages("vcd")
> library(vcd)
> head(HairEyeColor)
[1] 32 53 10  3 11 50
> str(HairEyeColor)
 table [1:4, 1:4, 1:2] 32 53 10 3 11 50 10 30 10 25 ...
 - attr(*, "dimnames")=List of 3
  ..$ Hair: chr [1:4] "Black" "Brown" "Red" "Blond"
  ..$ Eye : chr [1:4] "Brown" "Blue" "Hazel" "Green"
  ..$ Sex : chr [1:2] "Male" "Female"
> class(HairEyeColor)
[1] "table"

Package datasets

R has a special package called datasets

> library(datasets)

Function data lists the datasets in the currently attached packages (including datasets)

> data()

Visualize all the data sets available in all installed packages:

> data(package = .packages(all.available = TRUE))

Display the datasets available in package ggplot2

> try(data(package = "ggplot2"))

... or

> data(package = "ggplot2")$results

A list (made in 2012) of all datasets in R is available at
http://www.public.iastate.edu/~hofmann/data_in_r_sor
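
The $results component used above is a character matrix, so it can be filtered like any other matrix. The sketch below queries package datasets (rather than ggplot2) only because it ships with every R installation.

```r
# data() returns an object whose $results component is a character matrix
res <- data(package = "datasets")$results

print(colnames(res))                        # columns describing each dataset
print(head(res[, c("Item", "Title")], 3))   # first few dataset names and titles
print("faithful" %in% res[, "Item"])        # is a given dataset listed?
```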

Data structures conversion

Not all conversions from an object (of one data type) into another object (of another data type) are possible

Generally, function as.data.frame converts an object of another data type into a data frame

Ex: convert a vector into a data frame

> a_vector
 [1]  2  3  4  8  9 10 11 12 13 14
> v_to_df.1 <- as.data.frame(a_vector)
> v_to_df.1
  a_vector
1        2
2        3
3        4
...

Data structures conversion (cont.)

Convert matrix m.4 into a data frame

> m_to_df.1 <- as.data.frame(m.4)
> m_to_df.1
          col.1 col.2 col.3 col.total
row.1         1     2     3         6
row.2         4     5     6        15
row.3         7     8     9        24
row.4        10    11    12        33
row.total    22    26    30        78
> str(m_to_df.1)
'data.frame': 5 obs. of 4 variables:
 $ col.1    : num 1 4 7 10 22
 $ col.2    : num 2 5 8 11 26
 $ col.3    : num 3 6 9 12 30
 $ col.total: num 6 15 24 33 78
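
Matrix m.4 is built earlier in the deck. A self-contained sketch follows; the way the matrix and its totals are constructed here is an assumption that happens to reproduce the values shown above.

```r
# Assumed construction of a matrix like m.4: values 1..12 filled by row
m.4 <- matrix(1:12, nrow = 4, byrow = TRUE,
              dimnames = list(paste0("row.", 1:4), paste0("col.", 1:3)))
m.4 <- cbind(m.4, col.total = rowSums(m.4))   # add a column with row totals
m.4 <- rbind(m.4, row.total = colSums(m.4))   # add a row with column totals

m_to_df.1 <- as.data.frame(m.4)               # row and column names are kept

print(m_to_df.1)
print(m_to_df.1["row.total", "col.total"])    # grand total
```

Each matrix column becomes one variable of the data frame, and the matrix row names become the data frame's row names.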

Data structures conversion (cont.)

Convert a table into a data frame

> table_to_dataframe = data.frame(unlist(HairEyeColor))
> head(table_to_dataframe, 3)
   Hair   Eye  Sex Freq
1 Black Brown Male   32
2 Brown Brown Male   53
3   Red Brown Male   10

Convert a list into a data frame

> df <- data.frame(matrix(unlist(list.1), nrow=3, byrow=TRUE))
> head(df, 3)
  matrix.unlist.list.1...nrow...3..byrow...TRUE.
1                                            unu
2                                            doi
3                                           trei