02b Data Structures Datasets
Data structures in R
http://adv-r.had.co.nz/Data-structures.html
https://www.youtube.com/watch?v=DG7YNf8kb3w
https://www.youtube.com/watch?v=271FKAYavYE
http://repidemiology.wordpress.com/introduction-to-r-code/
https://www.youtube.com/watch?v=QygSZw77Hs8
3- Introduction to R : Vectors
https://www.youtube.com/watch?v=MGphwmXCCgM#t=12
http://repidemiology.wordpress.com/introduction-to-r-code/
https://www.youtube.com/watch?v=UakyyZSyuZU
http://1drv.ms/1sYllLB
First solution:
Second solution:
0.5508588
Numbers
> LETTERS
[1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S" "T" "U" "V" "W" "X" "Y" "Z"
> month.name
 [1] "January"   "February"  "March"     "April"     "May"
 [6] "June"      "July"      "August"    "September"
[10] "October"   "November"  "December"
> state.name
[1] "Alabama" "Alaska" "Arizona" "Arkansas"
...
> state.area
[1] ... 53104
...
Vectors of factors
Factors
levels
Very important in data analysis and visualization
Ex: two vectors:
student names
student genders
Both
Class
> unclass(ten_integers.1)
[1] 5 6 7 8 9 10 11 12 13 14
Internally, factor levels are stored as integers
> class(genre)
[1] "factor"
> unclass(genre)
[1] 1 2 1 2 1 1
attr(,"levels")
[1] "Female" "Male"
> typeof(genre)
[1] "integer"
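The transcript above can be reproduced with a short sketch. The six values given to genre here are an assumption, chosen only so that the integer codes match the 1 2 1 2 1 1 shown above; the vector used on the original slide is not recoverable.

```r
# Hypothetical data: values chosen to reproduce the codes 1 2 1 2 1 1
genre <- factor(c("Female", "Male", "Female", "Male", "Female", "Female"))

class(genre)        # the class attribute: "factor"
typeof(genre)       # the underlying storage type: "integer"
levels(genre)       # the labels, sorted alphabetically by default
as.integer(genre)   # the integer codes that index into levels(genre)
table(genre)        # number of observations per level
```

The labels live in the levels attribute and the vector itself only holds positions into it, which is why typeof() reports integer.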
Function length
vectors
Display the first, third, fifth and sixth elements of vector ten_integers.1
Vector ind contains the indices of the elements of interest from vector ten_integers.1
> ind <- c(1, 3, 5, 6)
> ind
[1] 1 3 5 6
> ten_integers.1
[1] 5 6 7 8 9 10 11 12 13 14
Now the result:
Excluding
first element:
Excluding
9 10 11 12 13 14
9 10 11 12 13 14
Excluding
9 10 11 12 13 14
Excluding
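A minimal sketch of positional indexing and exclusion, assuming ten_integers.1 holds the integers 5 to 14 as displayed earlier. The exact exclusion examples from the slides are not recoverable, so the ones below are illustrative.

```r
ten_integers.1 <- 5:14          # assumed definition, consistent with the output above
ind <- c(1, 3, 5, 6)

selected <- ten_integers.1[ind] # elements at positions 1, 3, 5 and 6
selected

ten_integers.1[-1]                        # everything except the first element
ten_integers.1[-c(1, 2)]                  # everything except the first two
ten_integers.1[-length(ten_integers.1)]   # everything except the last element
ten_integers.1[-ind]                      # everything except the positions in ind
```

A negative index never reorders the vector; it only drops the listed positions.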
Vector filtering
Filter
[1] 11 12 13 14
How
[1] 4
Display
[1]
9 10
Filter
[1] 11 12 13 14
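Filtering relies on a logical vector used as an index. The sketch below assumes ten_integers.1 is 5:14 and uses the threshold 10, which is consistent with the outputs shown above but is still an assumption about the original commands.

```r
ten_integers.1 <- 5:14               # assumed definition

keep <- ten_integers.1 > 10          # logical vector, one value per element
big <- ten_integers.1[keep]          # keeps only the elements where keep is TRUE
big

how_many <- sum(keep)                # TRUE counts as 1, so this counts the matches
how_many

which(keep)                          # positions (not values) of the matches
```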
Sorting/ordering a vector
Initial
vector
"Lazar
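The difference between sort() and order() can be sketched as follows. The names are hypothetical and are not the vector used on the slide.

```r
# Hypothetical vector of names
student_names <- c("Popescu", "Ianos", "Kovacz", "Babadag")

sort(student_names)                     # sorted copy, ascending
sort(student_names, decreasing = TRUE)  # sorted copy, descending

ord <- order(student_names)             # positions that would sort the vector
ord
student_names[ord]                      # same result as sort(student_names)
```

order() is the more general tool: the same index vector can reorder other vectors, or the rows of a data frame, in parallel.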
R as a vectorized language
Lecture
> class(date.vec.1)
[1] "character"
as.Date()
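A small sketch of converting a character vector to dates with as.Date(). The values in date.vec.1 are assumptions; only the variable name and its class come from the transcript above.

```r
# Hypothetical dates in the default ISO format (year-month-day)
date.vec.1 <- c("2014-09-25", "2014-10-01", "2014-12-31")
class(date.vec.1)                  # "character"

date.vec.2 <- as.Date(date.vec.1)  # the conversion is applied to every element at once
class(date.vec.2)                  # "Date"

gaps <- diff(date.vec.2)           # differences between consecutive dates, in days
gaps
date.vec.2 + 7                     # arithmetic is vectorized too: one week later
```

For other layouts, as.Date() accepts a format argument, for example format = "%d.%m.%Y".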
R as a vectorized language
(cont.)
Operations
> x
[1] -0.56757455 -0.90079348  0.24397156 -0.51325283  0.03209287
> x >= 0
[1] FALSE FALSE  TRUE FALSE  TRUE
> x.1 <- x >= 0
> x.1
[1] FALSE FALSE  TRUE FALSE  TRUE
Testing
> x
[1] -0.56757455 -0.90079348  0.24397156 -0.51325283  0.03209287
> any(x > 0)
[1] TRUE
R as a vectorized language
(cont.)
Testing
> y
[1] "one"   "two"   "three" "eight"
> nchar(y)
[1] 3 3 5 5
Another
Suppose
> mean(weight)
[1] 7.06
Compute
> sd(weight)
[1] 2.077498
Compute
> cor(age,weight)
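The summary statistics above can be tried with the sketch below. The age and weight values are assumptions: they are one data set that reproduces the mean (7.06) and standard deviation (2.077498) shown in the transcript, but the slide's own definitions are not recoverable.

```r
# Assumed data: ages in months and weights in kg for 10 infants
age    <- c(1, 3, 5, 2, 11, 9, 3, 9, 12, 3)
weight <- c(4.4, 5.3, 7.2, 5.2, 8.5, 7.3, 6.0, 10.4, 10.2, 6.1)

mean(weight)       # arithmetic mean of the whole vector
sd(weight)         # sample standard deviation (divides by n - 1)
cor(age, weight)   # Pearson correlation between the two vectors
```

Each function consumes the whole vector in one call, with no explicit loop.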
Matrices
Two-dimensional
Matrices (cont.)
m.1 is a 5 x 4 matrix
> m.1 <- matrix(1:20, nrow=5, ncol=4)
> m.1
     [,1] [,2] [,3] [,4]
[1,]    1    6   11   16
[2,]    2    7   12   17
[3,]    3    8   13   18
[4,]    4    9   14   19
[5,]    5   10   15   20
m.2
>
>
>
>
+
Matrices (cont.)
Display
m.2
> m.2
Col1 Col2
Row 1 1 26
Row 2 24 68
m.3 is a 2 x 2 matrix, filled by columns
list is a data structure presented after data frame
> m.3 <- matrix(cells, nrow=2, ncol=2, byrow=FALSE,
+ dimnames=list(rownames, colnames))
> m.3
Col1 Col2
Row 1 1 24
Row 2 26 68
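The effect of byrow can be seen by building both matrices from the same input. The definitions of cells, rownames and colnames below are assumptions inferred from the two printed matrices; the original assignment lines are missing from the slides.

```r
# Assumed inputs, inferred from the printed output of m.2 and m.3
cells    <- c(1, 26, 24, 68)
rownames <- c("Row 1", "Row 2")
colnames <- c("Col1", "Col2")

# Filled row by row: 1 26 / 24 68
m.2 <- matrix(cells, nrow = 2, ncol = 2, byrow = TRUE,
              dimnames = list(rownames, colnames))

# Filled column by column (the default): 1 24 / 26 68
m.3 <- matrix(cells, nrow = 2, ncol = 2, byrow = FALSE,
              dimnames = list(rownames, colnames))

m.2
m.3
```

Naming variables rownames and colnames follows the slides; it works because R still finds the functions of the same name when they are called, but distinct names would be less confusing.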
Matrices (cont.)
m.4
Display the
3rd column
> m.1[,3]
[1] 11 12 13 14 15
sum of
> sum(m.4)
[1] 78
Compute sum of
> sum(m.4[,3])
[1] 30
Compute sum of
> sum(m.4[3,])
[1] 24
Compute sum of
> sum(m.4)
[1] 78
rowSums/colSums
rowSums
matrix
> rowSums(m.4)
row.1 row.2 row.3 row.4
    6    15    24    33
colSums
a matrix
> colSums(m.4)
col.1 col.2 col.3
22 26 30
rowMeans/colMeans
> rowMeans(m.4)
row.1 row.2 row.3 row.4
    2     5     8    11
> colMeans(m.4)
col.1 col.2 col.3
5.5 6.5 7.5
total column
> m.4 <- cbind(m.4, rowSums(m.4))
Setting the name for the total column
> column.names <- colnames(m.4)
> column.names
[1] "col.1" "col.2" "col.3" ""
Check
the operation
> m.4
      col.1 col.2 col.3 col.total
row.1     1     2     3         6
row.2     4     5     6        15
row.3     7     8     9        24
row.4    10    11    12        33
Add
total row
Setting
> m.4
          col.1 col.2 col.3 col.total
row.1         1     2     3         6
row.2         4     5     6        15
row.3         7     8     9        24
row.4        10    11    12        33
row.total    22    26    30        78
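The whole sequence of adding a total column and a total row can be sketched in a few lines. The definition of m.4 is an assumption that matches the sums printed above (a 4 x 3 matrix holding 1 to 12, filled by rows), and the totals are named directly inside cbind() and rbind(), which is a shortcut rather than the slides' approach of editing the vector of column names.

```r
# Assumed definition of m.4, consistent with the printed sums
m.4 <- matrix(1:12, nrow = 4, ncol = 3, byrow = TRUE,
              dimnames = list(paste0("row.", 1:4), paste0("col.", 1:3)))

# Append the row sums as a new, named column
m.4 <- cbind(m.4, col.total = rowSums(m.4))

# Append the column sums (including the new total column) as a new, named row
m.4 <- rbind(m.4, row.total = colSums(m.4))

m.4
```

Because the total column is added first, the bottom-right cell ends up holding the grand total.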
Arrays
Similar
Elements
, , C3
B1 B2 B3
A1 13 15 17
A2 14 16 18
, , C4
B1 B2 B3
A1 19 21 23
A2 20 22 24
[1] 16
B1 B2 B3
A1 1 3 5
, , C1
A2 2 4 6
B1 B2
Display the elements of A for the 3rd "row" of B and the 2nd row/column of C
> a1[,3,2]
A1 A2
11 12
A1 1 3
A2 2 4
, , C2
B1 B2
A1 7 9
A2 8 10
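A sketch of how an array like a1 can be created and indexed. The call to array() is an assumption reconstructed from the printed slices (dimensions 2 x 3 x 4, filled with 1 to 24, with A, B and C labels); the original definition is not in the extracted text.

```r
# Assumed definition, consistent with the slices printed above
a1 <- array(1:24, dim = c(2, 3, 4),
            dimnames = list(c("A1", "A2"),
                            c("B1", "B2", "B3"),
                            c("C1", "C2", "C3", "C4")))

dim(a1)          # 2 3 4

a1[2, 2, 3]      # a single element: A2, B2, C3
a1[, 3, 2]       # all A values for B3 and C2, returned as a named vector
a1[, 1:2, 1:2]   # a 2 x 2 x 2 sub-array
a1["A1", "B2", "C4"]   # indexing by labels instead of positions
```

An empty position in the brackets means "keep everything along that dimension", exactly as with matrices.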
Data Frames
Most
Create
the vectors
Create
> student_gi
  studentID             name age scholarship lab_assessment final_grade
1         1 Popescu I.Vasile  23      Social           Bine        9.00
2         2  Ianos W.Adriana  19     Studiu1    Foarte bine        9.45
3         3   Kovacz V.Iosef  21     Studiu2       Excelent        9.75
4         4  Babadag I.Maria  22       Merit           Bine        9.00
5         5        Pop P.Ion  31     Studiu1           Slab        6.00
Display one column of the data frame as a vector
> student_gi$name
[1] Popescu I.Vasile Ianos W.Adriana Kovacz V.Iosef Babadag I.Maria Pop P.Ion
Levels: Babadag I.Maria Ianos W.Adriana Kovacz V.Iosef Pop P.Ion Popescu I.Vasile
Display one column of the data frame as a... column
> student_gi["name"]
              name
1 Popescu I.Vasile
2  Ianos W.Adriana
3   Kovacz V.Iosef
4  Babadag I.Maria
5        Pop P.Ion
Confirm
> class(student_gi)
[1] "data.frame"
Display
> str(student_gi)
'data.frame': 5 obs. of 6 variables:
 $ studentID     : num 1 2 3 4 5
 $ name          : Factor w/ 5 levels "Babadag I.Maria",..: 5 2 3 1 4
 $ age           : num 23 19 21 22 31
 $ scholarship   : Factor w/ 4 levels "Merit","Social",..: 2 3 4 1 3
 $ lab_assessment: Factor w/ 4 levels "Bine","Excelent",..: 1 3 2 1 4
 $ final_grade   : num 9 9.45 9.75 9 6
Display
> class(student_gi$studentID)
[1] "numeric"
> class(student_gi$name)
[1] "factor"
Number
of observations (rows)
> nrow(student_gi)
[1] 5
Number
of variables (columns)
> ncol(student_gi)
[1] 6
Both
> dim(student_gi)
[1] 5 6
Display
> names(student_gi)
Display
"scholarship"
> names(student_gi[2:4])
Selecting columns
Select/display
name )
> student_gi[1:2]
  studentID             name
1         1 Popescu I.Vasile
2         2  Ianos W.Adriana
3         3   Kovacz V.Iosef
4         4  Babadag I.Maria
5         5        Pop P.Ion
or
> student_gi[, 1:2]
or
or
Select/display
or
Return
> student_gi$final_grade
[1] 9.00 9.45 9.75 9.00 6.00
Return
(cont.)
> student_gi[ , 6]
or
> student_gi[ , "final_grade"]
Return
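The column-selection forms shown above differ in what they return: some give a data frame, others a plain vector. The sketch below rebuilds a reduced student_gi (three of the six columns, values taken from the printout above) to make the difference visible.

```r
# Reduced version of student_gi, using values from the printed data frame
student_gi <- data.frame(
  studentID   = 1:5,
  name        = c("Popescu I.Vasile", "Ianos W.Adriana", "Kovacz V.Iosef",
                  "Babadag I.Maria", "Pop P.Ion"),
  final_grade = c(9.00, 9.45, 9.75, 9.00, 6.00)
)

first_two <- student_gi[1:2]               # list-style indexing: still a data frame
class(first_two)

grades <- student_gi[, "final_grade"]      # matrix-style indexing of one column: a vector
class(grades)

student_gi$final_grade                     # same vector, via $
student_gi[["final_grade"]]                # same vector, via [[ ]]

student_gi[, "final_grade", drop = FALSE]  # keeps the result as a one-column data frame
```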
Selecting rows
Display
Display
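Row selection uses the position before the comma. The original examples for this slide did not survive extraction, so the sketch below is only an illustration on a reduced student_gi built from the values printed earlier.

```r
# Reduced version of student_gi, using values from the printed data frame
student_gi <- data.frame(
  studentID   = 1:5,
  age         = c(23, 19, 21, 22, 31),
  final_grade = c(9.00, 9.45, 9.75, 9.00, 6.00)
)

student_gi[2, ]           # the second observation, all columns
student_gi[c(1, 3), ]     # the first and third observations

# Rows that satisfy a condition
high <- student_gi[student_gi$final_grade >= 9, ]
high

# subset() combines a row condition with a column selection
young <- subset(student_gi, age < 22, select = c(studentID, final_grade))
young
```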
attach function
attach
without attach
> student_gi$final_grade
> table(student_gi$lab_assessment, student_gi$final_grade)
> summary(student_gi$final_grade)
>
>
>
>
>
detach
is advisable to use
> with (student_gi,
> with (student_gi,
final_grade))
> with (student_gi,
final_grade) )
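A sketch of with(), which evaluates an expression with the data frame's columns visible as variables, without the side effects of attach(). The data frame below is a reduced student_gi using values from the printout earlier; the exact expressions passed to with() on the slide are truncated, so these are illustrative.

```r
# Reduced version of student_gi, using values from the printed data frame
student_gi <- data.frame(
  lab_assessment = c("Bine", "Foarte bine", "Excelent", "Bine", "Slab"),
  final_grade    = c(9.00, 9.45, 9.75, 9.00, 6.00)
)

# Column names are resolved inside the data frame; nothing is added to the search path
with(student_gi, summary(final_grade))
with(student_gi, table(lab_assessment, final_grade))

avg <- with(student_gi, mean(final_grade))
avg
```

Unlike attach(), there is nothing to undo afterwards, and a variable of the same name in the workspace cannot silently mask a column outside the call.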
     final_grade
1001        9.00
1002        9.45
1003        9.75
1004        9.00
Display
Factors (reprise)
In
Function factor
Factors
Ordered factors
Another
ordinal variable
> lab_assessment <- c("Bine", "Foarte bine",
+ "Excelent", "Bine", "Slab")
Notice the way of displaying
> lab_assessment
[1] "Bine"        "Foarte bine" "Excelent"    "Bine"
[5] "Slab"
Now declare the vector as an ordered factor
> lab_assessment <- factor(lab_assessment,
+ order=TRUE, levels=c("Slab", "Bine",
+ "Foarte bine", "Excelent"))
Notice the new way of displaying the vector
> lab_assessment
[1] Bine        Foarte bine Excelent    Bine        Slab
Levels: Slab < Bine < Foarte bine < Excelent
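Once the levels are ordered, comparisons between values become meaningful. The sketch below repeats the definition from the transcript, spelling the argument out as ordered (the slides' order=TRUE works through R's partial argument matching), and then shows a few operations that only make sense for ordered factors.

```r
lab_assessment <- factor(
  c("Bine", "Foarte bine", "Excelent", "Bine", "Slab"),
  ordered = TRUE,
  levels  = c("Slab", "Bine", "Foarte bine", "Excelent")
)

as.integer(lab_assessment)           # codes follow the declared order, not the alphabet

at_least_bine <- lab_assessment >= "Bine"   # comparison against a level
at_least_bine

best <- max(lab_assessment)          # the highest level present
best

table(lab_assessment)                # counts are listed in level order
```

Without ordered = TRUE the levels would be sorted alphabetically and the comparison would return NA with a warning.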
> str(student_gi)
'data.frame': 5 obs. of 5 variables:
 $ name          : Factor w/ 5 levels "Babadag I.Maria",..: 5 2 3 1 4
 $ age           : num 23 19 21 22 31
 $ scholarship   : Factor w/ 4 levels "Merit","Social",..: 2 3 4 1 3
 $ lab_assessment: Ord.factor w/ 4 levels "Slab"<"Bine"<..: 2 3 4 2 1
 $ final_grade   : num 9 9.45 9.75 9 6
> summary(student_gi)
               name        age        scholarship
 Babadag I.Maria :1   Min.   :19.0   Merit  :1
 Ianos W.Adriana :1   1st Qu.:21.0   Social :1
 Kovacz V.Iosef  :1   Median :22.0   Studiu1:2
 Pop P.Ion       :1   Mean   :23.2   Studiu2:1
 Popescu I.Vasile:1   3rd Qu.:23.0
                      Max.   :31.0
     lab_assessment  final_grade
 Slab       :1      Min.   :6.00
 Bine       :2      1st Qu.:9.00
 Foarte bine:1      Median :9.00
 Excelent   :1      Mean   :8.64
                    3rd Qu.:9.45
                    Max.   :9.75
> patientdata
  patientID age diabetes    status gender
1         1  25    Type1      Poor   male
2         2  34    Type2  Improved female
3         3  28    Type1 Excellent female
4         4  52    Type1      Poor   male
Data
> str(patientdata)
'data.frame': 4 obs. of 5 variables:
 $ patientID: num 1 2 3 4
 $ age      : num 25 34 28 52
 $ diabetes : Factor w/ 2 levels "Type1","Type2": 1 2 1 1
 $ status   : Ord.factor w/ 3 levels "Excellent"<"Improved"<..: 3 2 1 3
 $ gender   : Factor w/ 2 levels "male","female": 1 2 2 1
Lists
> t = Sys.time()
POSIXlt
> l.1[[1]]
[1] 24.19267
> l.1[[2]]
[1] 37
> l.1[[3]]
[1] 8
> l.1[[4]]
[1] 25
...
Display
object
> unlist(l.1)
      sec       min      hour      mday       mon      year      wday      yday     isdst
 24.19267  37.00000   8.00000  25.00000   8.00000 114.00000       ...       ...       ...
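A date-time in POSIXlt form is a convenient first example of a list, because each component (seconds, minutes, hour, and so on) is a separate element. The sketch below uses a fixed timestamp instead of Sys.time() so that the result is reproducible, and it assumes l.1 was obtained with unclass(as.POSIXlt(t)); the slide's own assignment is missing.

```r
# Fixed timestamp (an assumption; the slides use Sys.time())
t <- as.POSIXct("2014-09-25 08:37:24", tz = "UTC")

# Assumed definition of l.1: the POSIXlt object with its class removed
l.1 <- unclass(as.POSIXlt(t))

is.list(l.1)        # TRUE: a POSIXlt object is a list underneath
names(l.1)          # component names: sec, min, hour, mday, mon, year, ...

l.1[[1]]            # by position: seconds
l.1[["hour"]]       # by name: hour of the day
l.1$mon + 1         # months are counted from 0, so September is 8
l.1$year + 1900     # years are counted from 1900, so 2014 is 114
```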
> list.1
[[1]]
[1] "unu"
[[2]]
[1] "doi"
[[3]]
[1] "trei"
> list.2
[[1]]
lists
> str(list.3)
List of 4
$ :List of 3
..$ : chr "unu"
..$ : chr "doi"
..$ : chr "trei"
$ :List of 1
..$ : chr [1:3] "doi" "trei" "patru"
$ : int [1:5] 3 4 5 6 7
$ :'data.frame': 4 obs. of 5 variables:
..$ patientID: num [1:4] 1 2 3 4
..$ age      : num [1:4] 25 34 28 52
..$ diabetes : Factor w/ 2 levels "Type1","Type2": 1 2 1 1
..$ status   : Ord.factor w/ 3 levels "Excellent"<"Improved"<..: 3 2 1 3
..$ gender   : Factor w/ 2 levels "male","female": 1 2 2 1
> length(list.3)
[1] 4
Access
> list.3[[1]]
[[1]]
[1] "unu"
[[2]]
[1] "doi"
[[3]]
[1] "trei"
> class(list.3[[1]])
[1] "list"
> list.3[[2]]
[[1]]
[1] "doi"   "trei"  "patru"
> class(list.3[[2]])
[1] "list"
...
> list.3[[4]]
  patientID age diabetes    status gender
1         1  25    Type1      Poor   male
2         2  34    Type2  Improved female
3         3  28    Type1 Excellent female
4         4  52    Type1      Poor   male
> class(list.3[[4]])
[1] "data.frame"
The
> names(list.3[[1]])
NULL
The
> names(list.3[[4]])
[1] "patientID" "age"       "diabetes"  "status"    "gender"
> list.3[[1]][[3]]
[1] "trei"
Display, in the data
> list.3[[4]][, 2]
[1] 25 34 28 52
Display
or
Display
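The nested structure of list.3 can be rebuilt and explored with the sketch below. The component definitions are assumptions inferred from the str() output above, and patientdata is reduced to two columns to keep the example short.

```r
# Assumed definitions, inferred from str(list.3)
list.1 <- list("unu", "doi", "trei")          # a list of three strings
list.2 <- list(c("doi", "trei", "patru"))     # a list holding one character vector
patientdata <- data.frame(patientID = 1:4, age = c(25, 34, 28, 52))

list.3 <- list(list.1, list.2, 3:7, patientdata)
length(list.3)              # 4 top-level elements

class(list.3[1])            # single brackets return a list with one element
class(list.3[[1]])          # double brackets return the element itself

list.3[[1]][[3]]            # third element of the first component: "trei"
list.3[[2]][[1]][2]         # second string inside the vector held by list.2
list.3[[4]][, 2]            # second column of the data frame
list.3[[4]]$age             # the same column, by name
```

Indexing operators chain from left to right, so each bracket applies to whatever the previous one returned.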
Tables in R
Not
Uni-dimensional tables
Create
$...
> table.1$Merit
Error in table.1$Merit : $ operator is invalid for atomic vectors
...
> table.1[1]
Merit
1
...
or list indices
> table.1[[1]]
[1] 1
Display
> table.1[3]
Studiu1
2
...
or
> unlist(table.1)[3]
Studiu1
2
names:
> table.1[c("Merit", "Studiu1")]
scholarship
  Merit Studiu1
      1       2
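A one-dimensional table can be sketched as follows. The definition of table.1 is an assumption (a frequency count of the scholarship column, with values taken from the student_gi printout); the creation step is missing from the extracted slide.

```r
# Scholarship values as printed in student_gi
scholarship <- c("Social", "Studiu1", "Studiu2", "Merit", "Studiu1")

table.1 <- table(scholarship)   # assumed definition: frequency of each value
table.1

names(table.1)                  # the categories
table.1[3]                      # by position, keeps the label
table.1[["Studiu1"]]            # by name, returns the bare count
table.1[c("Merit", "Studiu1")]  # several categories at once

prop.table(table.1)             # relative frequencies
```

The $ operator fails here because a table is an integer array with names, not a list.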
Bi-dimensional tables
Similar
Create
of table.2
> str(table.2)
'table' int [1:4, 1:4] 0 0 1 0 1 1 0 0 0 0 ...
- attr(*, "dimnames")=List of 2
..$ scholarship   : chr [1:4] "Merit" "Social" "Studiu1" "Studiu2"
..$ lab_assessment: chr [1:4] "Slab" "Bine" "Foarte bine" "Excelent"
> class(table.2)
[1] "table"
> table.2[1,2]
[1] 1
...
or the names/labels
> table.2["Merit", "Bine"]
[1] 1
Display
...
Similarly,
Access
           lab_assessment
scholarship Slab Excelent
    Merit      0        0
    Studiu1    1        0
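A cross-tabulation of two variables can be sketched as below. The definition of table.2 is an assumption reconstructed from its str() output, with the input values taken from the student_gi printout.

```r
# Values as printed in student_gi
scholarship <- c("Social", "Studiu1", "Studiu2", "Merit", "Studiu1")
lab_assessment <- factor(
  c("Bine", "Foarte bine", "Excelent", "Bine", "Slab"),
  levels = c("Slab", "Bine", "Foarte bine", "Excelent")
)

table.2 <- table(scholarship, lab_assessment)   # assumed definition
table.2

table.2[1, 2]                    # by position: first row, second column
table.2["Merit", "Bine"]         # by labels
table.2[c("Merit", "Studiu1"), c("Slab", "Excelent")]   # a sub-table

margin.table(table.2, 1)         # row totals (per scholarship)
addmargins(table.2)              # the table with a Sum row and column added
```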
Tri-dimensional tables
Create
table.3
> table.3
, , final_grade = 6

           lab_assessment
scholarship Slab Bine Foarte bine Excelent
    Merit      0    0           0        0
    Social     0    0           0        0
    Studiu1    1    0           0        0
    Studiu2    0    0           0        0

, , final_grade = 9

           lab_assessment
scholarship Slab Bine Foarte bine Excelent
    Merit      0    1           0        0
    Social     0    1           0        0
    Studiu1    0    0           0        0
    Studiu2    0    0           0        0
table.3 (cont.)
, , final_grade = 9.45

           lab_assessment
scholarship Slab Bine Foarte bine Excelent
    Merit      0    0           0        0
    Social     0    0           0        0
    Studiu1    0    0           1        0
    Studiu2    0    0           0        0

, , final_grade = 9.75

           lab_assessment
scholarship Slab Bine Foarte bine Excelent
    Merit      0    0           0        0
    Social     0    0           0        0
    Studiu1    0    0           0        0
    Studiu2    0    0           0        1
ftable
ftable
> ftable(table.3)
                           final_grade 6 9 9.45 9.75
scholarship lab_assessment
Merit       Slab                       0 0    0    0
            Bine                       0 1    0    0
            Foarte bine                0 0    0    0
            Excelent                   0 0    0    0
Social      Slab                       0 0    0    0
            Bine                       0 1    0    0
            Foarte bine                0 0    0    0
            Excelent                   0 0    0    0
Studiu1     Slab                       1 0    0    0
            Bine                       0 0    0    0
            Foarte bine                0 0    1    0
            Excelent                   0 0    0    0
Studiu2     Slab                       0 0    0    0
            Bine                       0 0    0    0
            Foarte bine                0 0    0    0
            Excelent                   0 0    0    1
> table.3[3, 3, 3]
[1] 1
...
or the names/labels
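A three-way table and its flattened view can be sketched as follows. The definition of table.3 is an assumption consistent with the printed slices, using the values from the student_gi printout.

```r
# Values as printed in student_gi
scholarship <- c("Social", "Studiu1", "Studiu2", "Merit", "Studiu1")
lab_assessment <- factor(
  c("Bine", "Foarte bine", "Excelent", "Bine", "Slab"),
  levels = c("Slab", "Bine", "Foarte bine", "Excelent")
)
final_grade <- c(9.00, 9.45, 9.75, 9.00, 6.00)

table.3 <- table(scholarship, lab_assessment, final_grade)   # assumed definition
dim(table.3)                                # 4 x 4 x 4

table.3[3, 3, 3]                            # by position
table.3["Studiu1", "Foarte bine", "9.45"]   # the same cell, by labels

ftable(table.3)                             # one flat, two-dimensional layout
```

The labels of the third dimension are character strings ("6", "9", "9.45", "9.75"), so they must be quoted when indexing by name.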
One
Built-in datasets
Some
Most
...)
In
Every
After
Display
> str(movies)
'data.frame': 58788 obs. of 24 variables:
 $ title : chr "$" "$1000 a Touchdown" "$21 a Day Once a Month" "$40,000" ...
 $ year  : int 1971 1939 1941 1996 1975 2000 2002 2002 1987 1917 ...
 $ length: int 121 71 7 70 71 91 93 25 97 61 ...
 $ budget: int NA NA NA NA NA NA NA NA NA NA ...
 $ rating: num 6.4 6 8.2 8.2 3.4 4.3 5.3 6.7 6.6 6 ...
 $ votes : int 348 20 5 6 17 45 200 24 18 51 ...
 $ r1    : num 4.5 0 0 14.5 24.5 4.5 4.5 4.5 4.5 4.5 ...
 $ r2    : num 4.5 14.5 0 0 4.5 4.5 0 4.5 4.5 0 ...
 $ r3    : num 4.5 4.5 0 0 0 4.5 4.5 4.5 4.5 4.5 ...
...
Data
Package datasets
function
> data()
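A short sketch of listing and loading built-in datasets. The choice of mtcars is only an illustration; it is one of the data frames shipped in the datasets package.

```r
# The listing returned by data() can be captured instead of being displayed
available <- data(package = "datasets")$results
head(available[, c("Item", "Title")])   # dataset names and short descriptions

# Load one dataset into the workspace and inspect it
data(mtcars)
class(mtcars)
str(mtcars)
head(mtcars, 3)
```

Datasets from the datasets package are also available without calling data(), but the explicit call makes it clear where the object comes from.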
Visualize
...or
Not
Generally,
Ex:
Convert
Convert
Convert
>
+
>
1
2
3