-
-
Notifications
You must be signed in to change notification settings - Fork 40
Description
tl;dr
When applying parameters::parameters()
to a model of class svycoxph
, the reported SE is taken as the model-based SE instead of the robust SE, 95% CIs are incorrectly estimated, p-values are incorrectly estimated, and degrees of freedom may be incorrectly reported.
The problem in-depth
Take the following data, which is adapted from the survey
package vignette for svycoxph
# Loading packages
library(parameters)
library(survival)
library(survey)
library(broom)
# Importing data
df <- survival::pbc
# Defining radomization
df$randomized <- with(df, !is.na(trt) & trt > 0)
# Defining randomization probability weights
df$randprob <- fitted(glm(randomized ~ age * edema, data = df, family = binomial))
# Generating survey design object
d.svy <- svydesign(id = ~1,
prob = ~randprob,
strata = ~edema,
data = subset(df, randomized))
# The design-based cox proportional hazards model
m.index <- svycoxph(Surv(time, status > 0) ~ log(bili) + protime + albumin, design = d.svy)
I first use summary()
to generate a coefficient summary table that will be taken as the correct summary of the model.
# A model summary produced via `summary()`
m.index |> summary()
#> Stratified Independent Sampling design (with replacement)
#> svydesign(id = ~1, prob = ~randprob, strata = ~edema, data = subset(df,
#> randomized))
#> Call:
#> svycoxph(formula = Surv(time, status > 0) ~ log(bili) + protime +
#> albumin, design = d.svy)
#>
#> n= 312, number of events= 144
#>
#> coef exp(coef) se(coef) robust se z Pr(>|z|)
#> log(bili) 0.88592 2.42522 0.09140 0.09048 9.791 < 2e-16 ***
#> protime 0.24487 1.27745 0.07825 0.08122 3.015 0.00257 **
#> albumin -1.04298 0.35240 0.21211 0.20454 -5.099 3.41e-07 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> exp(coef) exp(-coef) lower .95 upper .95
#> log(bili) 2.4252 0.4123 2.031 2.8958
#> protime 1.2775 0.7828 1.089 1.4979
#> albumin 0.3524 2.8377 0.236 0.5262
#>
#> Concordance= 0.808 (se = 0.019 )
#> Likelihood ratio test= NA on 3 df, p=NA
#> Wald test = 178.3 on 3 df, p=<2e-16
#> Score (logrank) test = NA on 3 df, p=NA
#>
#> (Note: the likelihood ratio and score tests assume independence of
#> observations within a cluster, the Wald and robust score tests do not).
When I deploy broom::tidy()
I can obtain estimates of the HRs, SEs, Robust SEs, and p-values that match the content of the first coefficient summary table from summary()
and 95% CIs that match those from the second coefficients summary table from summary()
.
# A model summary produced via `tidy()`
m.index |> tidy(exponentiate = TRUE, conf.int = TRUE)
#> Stratified Independent Sampling design (with replacement)
#> svydesign(id = ~1, prob = ~randprob, strata = ~edema, data = subset(df,
#> randomized))
#> # A tibble: 3 × 8
#> term estimate std.error robust.se statistic p.value conf.low conf.high
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 log(bili) 2.43 0.0914 0.0905 9.79 1.23e-22 2.03 2.90
#> 2 protime 1.28 0.0782 0.0812 3.01 2.57e- 3 1.09 1.50
#> 3 albumin 0.352 0.212 0.205 -5.10 3.41e- 7 0.236 0.526
However, when I deploy parameters::parameters()
expecting to obtain the same estimates, the reported output departs from expectation in a few notable ways:
- Information on the survey design object is iteratively printed, a departure from
parameter
summaries ofsvyglm
models. - The SEs are consistent with the model-based SEs, but I believe the Robust SEs are recommended here.
- The CIs are inconsistent with those from
summary()
orbroom::tidy()
, I believe owing to the fact that they are estimated using the model-based SEs instead of the robust SEs (though in this example, they are also 95% CIs for the log-HR and not the HR). - The p-values are inconsistent with those from
summary()
orbroom::tidy()
. - In this example, the degrees of freedom reported by
parameters::parameters()
are consistent with the degrees of freedom estimated bysurvey::degf()
which accounts for the complex sampling architecture. However, for a separate project using prepared NHANES data set, when I summarized asvycoxph
model usingparameters::parameters()
the degrees of freedom reported byparameters::parameters()
were inconsistent with the DoF calculated bysurvey::degf()
. While the later example is not included here, I believe the issue might stem from the fact that the design object produced for the current example contains no cluster information (e.g.,svydesign(id = ~1, ...)
is specified instead of setting to a vector (or vectors) reflecting a different psu).
# A model summary produced via `parameters()`
m.index |> parameters(digits = 5, exponentiate = FALSE)
#> Stratified Independent Sampling design (with replacement)
#> svydesign(id = ~1, prob = ~randprob, strata = ~edema, data = subset(df,
#> randomized))
#> Stratified Independent Sampling design (with replacement)
#> svydesign(id = ~1, prob = ~randprob, strata = ~edema, data = subset(df,
#> randomized))
#> Stratified Independent Sampling design (with replacement)
#> svydesign(id = ~1, prob = ~randprob, strata = ~edema, data = subset(df,
#> randomized))
#> Stratified Independent Sampling design (with replacement)
#> svydesign(id = ~1, prob = ~randprob, strata = ~edema, data = subset(df,
#> randomized))
#> Stratified Independent Sampling design (with replacement)
#> svydesign(id = ~1, prob = ~randprob, strata = ~edema, data = subset(df,
#> randomized))
#> Stratified Independent Sampling design (with replacement)
#> svydesign(id = ~1, prob = ~randprob, strata = ~edema, data = subset(df,
#> randomized))
#> Stratified Independent Sampling design (with replacement)
#> svydesign(id = ~1, prob = ~randprob, strata = ~edema, data = subset(df,
#> randomized))
#> Stratified Independent Sampling design (with replacement)
#> svydesign(id = ~1, prob = ~randprob, strata = ~edema, data = subset(df,
#> randomized))
#> Stratified Independent Sampling design (with replacement)
#> svydesign(id = ~1, prob = ~randprob, strata = ~edema, data = subset(df,
#> randomized))
#> Stratified Independent Sampling design (with replacement)
#> svydesign(id = ~1, prob = ~randprob, strata = ~edema, data = subset(df,
#> randomized))
#> Stratified Independent Sampling design (with replacement)
#> svydesign(id = ~1, prob = ~randprob, strata = ~edema, data = subset(df,
#> randomized))
#> Stratified Independent Sampling design (with replacement)
#> svydesign(id = ~1, prob = ~randprob, strata = ~edema, data = subset(df,
#> randomized))
#> Stratified Independent Sampling design (with replacement)
#> svydesign(id = ~1, prob = ~randprob, strata = ~edema, data = subset(df,
#> randomized))
#> Stratified Independent Sampling design (with replacement)
#> svydesign(id = ~1, prob = ~randprob, strata = ~edema, data = subset(df,
#> randomized))
#> Stratified Independent Sampling design (with replacement)
#> svydesign(id = ~1, prob = ~randprob, strata = ~edema, data = subset(df,
#> randomized))
#> Parameter | Coefficient | SE | 95% CI | Statistic | df | p
#> ------------------------------------------------------------------------------------
#> bili [log] | 0.88592 | 0.09140 | [ 0.70608, 1.06576] | 9.79099 | 309 | 0.002
#> protime | 0.24487 | 0.07825 | [ 0.09091, 0.39883] | 3.01490 | 309 | 0.083
#> albumin | -1.04298 | 0.21211 | [-1.46034, -0.62562] | -5.09909 | 309 | > .999
#>
#> Uncertainty intervals (equal-tailed) and p-values (two-tailed) computed
#> using a Wald distribution approximation.
I do not have immediate recommendations on how to resolve this issue, but I thought to bring it to your attention in case it has gone overlooked.