Stat405   Subsetting & shortcuts


                            Hadley Wickham
Tuesday, 7 September 2010
Roadmap
                   • Lectures 1-3: basic graphics
                   • Lectures 4-6: basic data handling
                   • Lectures 7-9: basic functions


                   • These are the most essential tools. The
                     rest of the course is about building your
                     vocabulary and learning how to use the
                     tools together.


               1. Character subsetting
               2. Sorting
               3. Shortcuts
               4. Iteration
               5. (Optional extra: command line tips)



Subsetting

Your turn

                   In pairs, try to recall the five types of
                   subsetting we talked about last week.
                   You have one minute!




                   blank       include all

                   integer     +ve: include
                               -ve: exclude

                   logical     include TRUEs

                   character   lookup by name
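
As a quick reminder, the five index types can be tried on a small named vector. This is only a sketch: the vector `x` and its names are made up for illustration and do not come from the course data.

```r
# Hypothetical named vector, used only for illustration
x <- c(a = 10, b = 20, c = 30, d = 40)

all_values <- x[]              # blank: include everything
first_two  <- x[c(1, 2)]       # positive integers: include these positions
drop_first <- x[-1]            # negative integers: exclude these positions
big        <- x[x > 25]        # logical: include where TRUE
by_name    <- x[c("d", "a")]   # character: look up by name
```

The same five index types work for the rows and columns of a data frame, as in `diamonds[1:5, c("carat", "cut")]`.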


     # Matches by names
     diamonds[1:5, c("carat", "cut", "color")]

     # Useful technique: change labelling
     # (cut is a factor, so convert it to character first:
     # a factor index is matched by integer code, not by name)
     c("Fair" = "C", "Good" = "B", "Very Good" = "B+",
       "Premium" = "A", "Ideal" = "A+")[as.character(diamonds$cut)]

     # Can also be used to collapse levels
     table(c("Fair" = "C", "Good" = "B", "Very Good" = "B",
       "Premium" = "A", "Ideal" = "A")[as.character(diamonds$cut)])

     # (see ?cut for continuous to discrete equivalent)
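
For the continuous case, a minimal sketch of `cut()` might look like the following. The carat values, the break points and the labels are all invented for illustration.

```r
# Hypothetical carat values and size classes, for illustration only
carat <- c(0.23, 0.71, 1.01, 2.5)

# Intervals are (0, 0.5], (0.5, 1] and (1, Inf] because right = TRUE by default
size <- cut(carat, breaks = c(0, 0.5, 1, Inf),
  labels = c("small", "medium", "large"))

table(size)
```

`cut()` returns a factor, so the result can be passed straight to `table()` or used as a grouping variable in a plot.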



Sorting a data frame
               x <- c(2, 4, 3, 1)
               order(x)
               # means: to get x in order, put 4th in
               # 1st, 1st in 2nd, 3rd in 3rd and 2nd in 4th
               x[order(x)]

               # What does this do?
               diamonds[order(diamonds$price), ]


     # Order by x, then y, then z
     order(diamonds$x, diamonds$y, diamonds$z)

     # Put in order of quality
     order(diamonds$color, desc(diamonds$cut),
       desc(diamonds$clarity))

     # desc() (from the plyr package) sorts in descending order
     x[order(x)]
     x[order(desc(x))]
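
The way later arguments to `order()` break ties is easiest to see on a tiny data frame. The sketch below uses base R only and made-up data. For numeric columns, negating the values has the same effect as `desc()`.

```r
# Hypothetical toy data, for illustration only
df <- data.frame(
  price = c(400, 350, 500, 350),
  carat = c(0.3, 0.4, 0.5, 0.2)
)

# Sort by price (ascending); break ties by carat, largest first
idx <- order(df$price, -df$carat)
sorted <- df[idx, ]
```

The two rows with price 350 are tied on the first key, so the second key decides: the 0.4 carat row comes before the 0.2 carat row.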



Your turn
                   Reorder the mpg dataset from most to
                   least efficient.
                   The fl variable gives the type of fuel (r =
                   regular, d = diesel, p = premium, c = cng,
                   e = ethanol). Modify fl to spell out the
                   fuel type explicitly, collapsing c, d, and e
                   into a single "other" category.


Short cuts

Short cuts
                   You’ve been typing diamonds many, many
                   times. The following shortcuts save
                   typing, but may be a little harder to
                   understand, and will not work in some
                   situations. (Don’t forget the basics!)
                   Four are specific to data frames; one is
                   more general.


                   Function     Package

                   subset       base
                   summarise    plyr
                   transform    base
                   arrange      plyr

                   plyr is loaded automatically with ggplot2, or
                   load it explicitly with library(plyr).
                   base is always loaded automatically.

     # subset: shortcut for subsetting
     zero_dim <- diamonds$x == 0 | diamonds$y == 0 |
       diamonds$z == 0
     diamonds[zero_dim & !is.na(zero_dim), ]

     # subset() drops rows where the condition is NA for you
     subset(diamonds, x == 0 | y == 0 | z == 0)

     # summarise/summarize: shortcut for creating a summary
     biggest <- data.frame(
       price.max = max(diamonds$price),
       carat.max = max(diamonds$carat))

     biggest <- summarise(diamonds,
       price.max = max(price),
       carat.max = max(carat))
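
The `!is.na()` guard in the long form matters as soon as the data contain missing values. A small sketch with an invented data frame shows the difference between plain logical indexing, guarded indexing and `subset()`.

```r
# Hypothetical data with one missing value, for illustration only
df <- data.frame(x = c(0, 1, NA, 2), y = c(5, 0, 3, 4))

is_zero <- df$x == 0 | df$y == 0   # TRUE, TRUE, NA, FALSE

by_index  <- df[is_zero, ]                     # NA in the index adds an all-NA row
by_guard  <- df[is_zero & !is.na(is_zero), ]   # the guard removes it
by_subset <- subset(df, x == 0 | y == 0)       # subset() treats NA as FALSE
```

Here `by_index` has three rows (one of them all NA), while `by_guard` and `by_subset` both contain just the two rows with a genuine zero.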

     # transform: shortcut for adding new variables
     diamonds$volume <- diamonds$x * diamonds$y * diamonds$z
     diamonds$density <- diamonds$volume / diamonds$carat

     # Two calls, because transform() cannot use a variable
     # created earlier in the same call
     diamonds <- transform(diamonds, volume = x * y * z)
     diamonds <- transform(diamonds,
       density = volume / carat)

     # arrange: shortcut for reordering
     diamonds <- diamonds[order(diamonds$price,
       desc(diamonds$carat)), ]

     diamonds <- arrange(diamonds, price, desc(carat))
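
Why the two separate `transform()` calls? A sketch with a made-up data frame suggests the reason: all new variables in one call are computed from the original columns, so a variable created in the call is not yet visible to the others. The column names `box_vol` and `per_carat` are hypothetical.

```r
# Hypothetical toy data, for illustration only
df <- data.frame(x = c(1, 2), y = c(3, 4), carat = c(0.5, 1))

# One call: box_vol does not exist yet when per_carat is evaluated
one_call <- tryCatch(
  transform(df, box_vol = x * y, per_carat = box_vol / carat),
  error = function(e) "error"
)

# Two calls: the second call sees the column added by the first
df <- transform(df, box_vol = x * y)
df <- transform(df, per_carat = box_vol / carat)
```

The single call fails with an "object not found" error (assuming no other object called `box_vol` exists in the workspace), while the two-step version works.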

     # They all have similar syntax. The first argument
     # is a data frame, and all other arguments are
     # interpreted in the context of that data frame
     # (so you don't need to use data$ all the time)

     subset(df, subset)
     transform(df, var1 = expr1, ...)
     summarise(df, var1 = expr1, ...)
     arrange(df, var1, ...)

     # They all return a modified data frame. You still
     # have to save that to a variable if you want to
     # keep it
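
The point about saving the result is easy to check on a tiny, made-up data frame:

```r
# Hypothetical toy data, for illustration only
df <- data.frame(x = 1:3)

transform(df, y = x * 2)          # result is printed, then thrown away
names_before <- names(df)         # still just "x"

df <- transform(df, y = x * 2)    # assign the result to keep it
names_after <- names(df)          # now "x" and "y"
```

The same applies to `subset()`, `summarise()` and `arrange()`: none of them changes its input in place.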


Your turn
                   Use summarise, transform, subset and arrange
                   to:
                   • Find all diamonds bigger than 3 carats and
                     order from most expensive to cheapest.
                   • Add a new variable that estimates the
                     diameter of the diamond (average of x and y).
                   • Compute depth (z / diameter * 100) yourself.
                     How does it compare to the depth in the data?


Aside: never use attach!
                   attach() has non-local effects, is not
                   symmetric, and is implicit rather than
                   explicit. That makes it very easy to make
                   mistakes.
                   Use with() instead:
                   with(bnames, table(year, length))



     # with is more general. Use in concert with other
     # functions, particularly those that don't have a data
     # argument

     diamonds$volume <- with(diamonds, x * y * z)

     # This won't work:
     with(diamonds, volume <- x * y * z)
     # with() only changes where names are looked up; the
     # assignment happens in a temporary environment and is lost
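
A small sketch, using an invented data frame, confirms that assigning inside `with()` leaves the data frame untouched:

```r
# Hypothetical toy data, for illustration only
df <- data.frame(x = c(1, 2), y = c(3, 4), z = c(5, 6))

with(df, volume <- x * y * z)            # assignment is local to with() and lost
has_volume <- "volume" %in% names(df)    # FALSE: df is unchanged

df$volume <- with(df, x * y * z)         # assign outside with() instead
```

After the last line, `df$volume` holds 15 and 48.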




Iteration

Stories
                   The best data analyses tell a story, with a
                   natural flow from beginning to end.
                   For homework, try to come up with
                   three plots that tell a story.
                   Stories about a small sample of the data
                   can work well.



     qplot(x, y, data = diamonds)
     qplot(x, z, data = diamonds)

     # Start by fixing incorrect values

     y_big <- diamonds$y > 10
     z_big <- diamonds$z > 6

     x_zero <- diamonds$x == 0
     y_zero <- diamonds$y == 0
     z_zero <- diamonds$z == 0

     diamonds$x[x_zero] <- NA
     diamonds$y[y_zero | y_big] <- NA
     diamonds$z[z_zero | z_big] <- NA

     qplot(x, y, data = diamonds)
     # How can I get rid of those outliers?

     qplot(x, x - y, data = diamonds)
     qplot(x - y, data = diamonds, binwidth = 0.01)
     last_plot() + xlim(-0.5, 0.5)
     last_plot() + xlim(-0.2, 0.2)

     # x and y can be NA here, so guard the logical index
     asym <- abs(diamonds$x - diamonds$y) > 0.2
     diamonds_sym <- diamonds[!asym & !is.na(asym), ]

     # Did it work?
     qplot(x, y, data = diamonds_sym)
     qplot(x, x - y, data = diamonds_sym)
     # Something interesting is going on there!
     qplot(x, x - y, data = diamonds_sym,
       geom = "bin2d", binwidth = c(0.1, 0.01))

     # What about x and z?
     qplot(x, z, data = diamonds_sym)
     qplot(x, x - z, data = diamonds_sym)
     # Subtracting doesn't work - z smaller than x and y
     qplot(x, x / z, data = diamonds_sym)
     # But better to log transform to make symmetrical
     qplot(x, log10(x / z), data = diamonds_sym)

     # and so on...




     # How does symmetry relate to price?
     qplot(abs(x - y), price, data = diamonds_sym) +
       geom_smooth()

     qplot(abs(x - y), price, data = diamonds_sym,
       geom = "boxplot", group = round(abs(x - y) * 10))

     diamonds_sym$sym <- zapsmall(abs(diamonds_sym$x -
       diamonds_sym$y))
     qplot(sym, price, data = diamonds_sym,
       geom = "boxplot", group = sym)
     # Are asymmetric diamonds worth more?

     qplot(carat, price, data = diamonds_sym, colour = sym)
     qplot(log10(carat), log10(price), data = diamonds_sym,
       colour = sym, group = sym) +
       geom_smooth(method = lm, se = FALSE)


     # Modelling

     summary(lm(log10(price) ~ log10(carat) + sym,
       data = diamonds_sym))
     # But statistical significance != practical
     # significance

     sd(diamonds_sym$sym, na.rm = TRUE)
     # [1] 0.02368828

     # So a 1 sd increase in sym changes log10(price)
     # by about -0.0104 (= 0.0237 * -0.44)
     # 10 ^ -0.0104 = 0.976
     # So a 1 sd increase in sym decreases price by ~2%
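
The back-of-the-envelope calculation can be written out in R. The two inputs are taken from the slide (the reported standard deviation and the quoted coefficient of -0.44); nothing is re-estimated here, and the variable names are only for illustration.

```r
sd_sym   <- 0.02368828   # sd of sym, as reported above
coef_sym <- -0.44        # coefficient of sym, as quoted above

# Effect of a 1 sd increase in sym on log10(price)
delta_log10 <- sd_sym * coef_sym

# Back-transform to a multiplicative effect on price
price_ratio <- 10 ^ delta_log10
pct_change  <- (price_ratio - 1) * 100
```

`price_ratio` comes out at about 0.976, i.e. a price drop of a little over 2% per standard deviation of asymmetry: statistically detectable, but small in practical terms.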


Command line

Why?

                   Provenance & reproducibility.
                   Working with remote servers.
                   Automation & scripting.
                   Common tools.




Basics
                   pwd: the location of the current directory
                   ls: the files in the current directory
                   cd: change to another directory
                   cd ..: change to parent directory
                   cd ~: change to home directory
                   mkdir: create a new directory
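
For comparison, R has rough equivalents of these commands, which can be handy inside a script. The sketch below works in a temporary directory so it leaves nothing behind in your home directory. The directory names `demo` and `sub` are made up.

```r
start <- getwd()                          # like pwd

demo <- file.path(tempdir(), "demo")      # hypothetical location
dir.create(demo)                          # like mkdir
setwd(demo)                               # like cd
dir.create("sub")

files_in_sub <- list.files("sub")         # like ls: empty for a new directory
files_in_demo <- list.files()             # contains "sub"

setwd("..")                               # like cd ..
setwd(start)                              # back to where we started
```

On the command line the same steps would use `mkdir`, `cd` and `ls` directly.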


Your turn
                   Create a directory for stat405.
                   Inside that directory, create a directory for
                   homework 2.
                   Confirm that there are no files in that
                   directory.
                   Navigate back to your home directory.
                   What other files are there?

