Skip to content

Rootograms to assess model fit of count data models #28

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
paul-buerkner opened this issue Sep 4, 2016 · 25 comments
Closed

Rootograms to assess model fit of count data models #28

paul-buerkner opened this issue Sep 4, 2016 · 25 comments

Comments

@paul-buerkner
Copy link
Collaborator

I recently stumbled upon a blog post about rootograms (http://www.fromthebottomoftheheap.net/2016/06/07/rootograms/), which help in assessing model fit of count data models. Given that ppc_dens_overlay is not ideal in vizualizing discrete distributions, rootograms may be a nice alternative. What do you think?

@jgabry
Copy link
Member

jgabry commented Sep 4, 2016

Hadn't seen those before, looks interesting. I think even for count
models some combination of the available PPC plots could probably diagnose
the same type of misfit, but rootograms seem like a more convenient way to
do it. So yeah, probably a good addition to the package.

On Sunday, September 4, 2016, Paul-Christian Bürkner <
[email protected]> wrote:

I recently stumbled upon a blog post about rootograms (http://www.
fromthebottomoftheheap.net/2016/06/07/rootograms/), which help in
assessing model fit of count data models. Given that ppc_dens_overlay is
not ideal in vizualizing discrete distributions, rootograms may be a nice
alternative. What do you think?


You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
#28, or mute the thread
https://github.com/notifications/unsubscribe-auth/AHb4Q0Xcy82jGdkJbqLX7DYrfHNaxgiEks5qmrD-gaJpZM4J0hJp
.

@jgabry jgabry added the feature label Sep 5, 2016
@jgabry
Copy link
Member

jgabry commented Sep 6, 2016

I started playing around with this, but I'm going to stop working on it until after we get the first release out. I do think it will be a nice addition though.

@jgabry
Copy link
Member

jgabry commented Dec 16, 2016

@paul-buerkner Any interest in trying to write this function? If not no worries.

@paul-buerkner
Copy link
Collaborator Author

Yeah, I have actually some code lying around somewhere, but it is not yet "bayesplot ready". I will try to get this done in the near future :-)

@jgabry
Copy link
Member

jgabry commented Dec 16, 2016

Ok cool thanks Paul!

@paul-buerkner
Copy link
Collaborator Author

So the problem I was facing, were those "hanging" rootograms. "Suspended" and "standing" rootograms won't be much of an issue. Did you ever try to making histograms "hanging" from a line with ggplot? If yes, I would be glad for a pointer in the right direction ;-)

@andrewgelman
Copy link

andrewgelman commented Dec 16, 2016 via email

@paul-buerkner
Copy link
Collaborator Author

But you think rootograms in general can make sense, right?

@andrewgelman
Copy link

andrewgelman commented Dec 16, 2016 via email

@jgabry
Copy link
Member

jgabry commented Dec 16, 2016

@andrewgelman Is the gimmick that it's "hanging" or do you not like rootograms? If the former we don't necessarily need to offer that. If the latter maybe there's a different type of plot we should be doing for count data?

@jgabry
Copy link
Member

jgabry commented Dec 16, 2016

@andrewgelman Nevermind, I didn't see your second comment

@jgabry
Copy link
Member

jgabry commented Dec 16, 2016

I don't know much about rootograms, but the square root scale is somewhat appealing. Could be useful because it adjusts to some extent for the fact that there can be very small and very large counts in the same outcome vector y in some cases (and thus both very small and very large deviations between y and yrep that need to be visualized in the same plot).

@jgabry
Copy link
Member

jgabry commented Dec 16, 2016

@paul-buerkner Maybe don't worry about "hanging" ones, but the rootogram is probably worth at least prototyping. At a minimum we'll be able to compare to residual plots and see if we find any cases where we can diagnose a problem using the rootogram better than we can using the residual plot.

@paul-buerkner
Copy link
Collaborator Author

Sounds good.

@paul-buerkner
Copy link
Collaborator Author

Here is an initial prototype function implementing rootograms.

# TODO: document
ppc_rootogram <- function(y, yrep, style = c("standing", "suspended"), ...) {
  require(ggplot2)
  style <- match.arg(style)
  # TODO: check validity of y and yrep
  ymax <- max(y, yrep)
  x <- 0L:ymax
  
  # prepare a table for yrep
  tyrep <- apply(yrep, 1, table)
  for (i in seq_along(tyrep)) {
    tyrep[[i]] <- as.numeric(tyrep[[i]][match(x, rownames(tyrep[[i]]))])
  }
  tyrep <- do.call(rbind, tyrep)
  tyrep[is.na(tyrep)] <- 0
  tyexp <- sqrt(colMeans(tyrep))
  
  # prepare a table for y
  ty <- table(y)
  ty <- sqrt(as.numeric(ty[match(x, rownames(ty))]))
  if (style == "suspended") {
    ty <- ty - tyexp
  }
  ty[is.na(ty)] <- 0
  
  # TODO: enforce bayesplot color schemes
  ggplot(data.frame(x, tyexp, ty)) + 
    geom_histogram(aes_string("x", "ty"), stat = "identity",
                   fill = "lightblue", color = "blue") +
    geom_line(aes_string("x", "tyexp"), size = 1.3) +
    bayesplot::theme_default() +
    xlab("y") +
    ylab("sqrt(Frequency)")
}

This should help us in discussing whether it is worth implementing rootograms in bayesplot. Attached are some figures generated with the above function using the data from the blog post and brms to fit the models.

Interestingly, the plots shown in the blog post only display counts from 0 to 30, but the actually response variable goes over 100. I have tried it with the latest version of the countreg package from RForge and it is still the case. I don't know whether it is a bug or a feature but at least it appears strange to omit 2/3 of the actually response range.

ppc_rootogram_standing_1.pdf
ppc_rootogram_standing_2.pdf
ppc_rootogram_suspended_1.pdf

@jgabry
Copy link
Member

jgabry commented Dec 19, 2016

Thanks @paul-buerkner. A few questions/comments

  • Can you show uncertainty in the plot in some way?
  • I started a branch feature/ppc_rootogram and put your function there with a few edits (bayesplot colors, verifying y and yrep, etc). So feel free to work off that branch going forward (assuming you like the edits).
  • I used light colors for the bars and dark for the line, which seemed to be most natural. But typically for the PPC plots I use light for yrep and dark for y, so this goes against that tendency. But maybe that will be too confusing?

@paul-buerkner
Copy link
Collaborator Author

Thanks for adding the branch. Your changes look great! I think making the line dark is intuitive and much better than making a light line and a dark histogram even if is goes against consistency.

I have added uncertainty intervals based on quantiles, but they look pretty bad for few predicted counts. Maybe we can display uncertainty in a way that is somewhat "smoother" than quantiles of counts. I have added color "m" for the intervals for now, but I think this is probably not the best idea...

Shell I make a pull request so we can work on it together? (I don't have the permission to push to the branch directly)

@jgabry
Copy link
Member

jgabry commented Dec 19, 2016 via email

@paul-buerkner
Copy link
Collaborator Author

I am definitely interested. I will have to learn some of the inner structure of bayesplot to contribute to certain parts of the package, but that shouldn't be too difficult.

@paul-buerkner
Copy link
Collaborator Author

I have also tried mean +- sd to visualize uncertainty, which is much smoother, but of course isn't very reasonable for counts. I think we should stick to quantiles, but we may want to explicitely document the option to turn them off completely by setting prob = 0.

@jgabry
Copy link
Member

jgabry commented Dec 20, 2016 via email

@paul-buerkner
Copy link
Collaborator Author

I have now also added the "hanging" histogram gimmick just to see how it looks like. Feel free to remove it if you don't like it.

@paul-buerkner
Copy link
Collaborator Author

@jgabry I think the rootogram branch is ready for a PR. Shell I open one, or do you want to have a look first?

@jgabry
Copy link
Member

jgabry commented Jan 2, 2017

Cool. Go for it and then I'll take a look and make comments on the PR.

@jgabry
Copy link
Member

jgabry commented Jan 10, 2017

closed by #62

@jgabry jgabry closed this as completed Jan 10, 2017
@jgabry jgabry added the new plot label Aug 7, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants