Join implementation #15

ppaxisa · 2025-04-08T15:26:24Z

Following up on issue #13.

I changed how groups are handled: the group variables are extracted before the merge, and group_by is rerun after it. I'm not sure I follow your comment here:

if a group column is no longer present in the result, it should no longer be a group column.

A join should not drop columns, should it? The only edge cases I can imagine are a grouping column being all NA in a right_join where there is no match, or being constant if only one category matches.
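To make those two edge cases concrete, here is a minimal sketch using base R's merge on plain data.frames rather than DFplyr's joins (the toy columns `grp` and `home` are made up for illustration):

```r
# Toy inputs; `grp` stands in for a grouping column on the left table.
x <- data.frame(name = c("a", "b"), grp = c("g1", "g2"))
y_nomatch <- data.frame(name = c("c", "d"), home = c("h1", "h2"))
y_onematch <- data.frame(name = "a", home = "h1")

# Right-join-like merge with no matching keys: every row comes from y,
# so the grouping column from x is filled with NA.
res_right <- merge(x, y_nomatch, by = "name", all.y = TRUE)
all(is.na(res_right$grp))

# Inner-join-like merge where only one category matches: the grouping
# column survives but holds a single value.
res_inner <- merge(x, y_onematch, by = "name")
unique(res_inner$grp)
```

In both cases the column is still present, so regrouping on it does not error; it just yields a degenerate grouping.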

I dropped the arrange from the tests, except for right_join.

One last thought regarding rownames: I think dplyr's behavior is to drop rownames in join operations, and right now this implementation does the same. I think that's fine but wanted to point it out.

jonocarroll · 2025-04-08T23:19:00Z

Looking good so far. I think the happy path is solid.

The edge case for groups is when there is a common column but it's not used for joining, in which case the result has a .x and .y column, and so regrouping fails.

da <- starwars[, c("name", "eye_color", "height", "mass")][1:10, ]
da <- group_by(da, eye_color)
db <- starwars[, c("name", "eye_color", "homeworld")]

Da <- as(da, "DataFrame")
Da <- group_by(Da, eye_color)
Db <- as(db, "DataFrame")

# these all work, preserving group
left_join(da, db)
right_join(da, db)
inner_join(da, db[1:3, ])

# these all give the right answers, group preserved
left_join(Da, Db)
right_join(Da, Db)
inner_join(Da, Db[1:3, ])

# none of these retain the group because it's no longer a column, but they don't error
left_join(da, db, by = "name")
right_join(da, db, by = "name")
inner_join(da, db[1:3, ], by = "name")

# these currently fail because they try to group by a non-existent column
left_join(Da, Db, by = "name")
right_join(Da, Db, by = "name")
inner_join(Da, Db[1:3, ], by = "name")

I think join_internal just needs a

  grps <- intersect(grps, colnames(x)) # <-- preserve remaining groups

  # if no grouping return object
  if(length(grps) == 0)
    return(x)
  # else rebuild groups with new DF
  group_by(x, !!!rlang::syms(grps))

but please add some tests for this case (and confirm that it works).
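As a rough illustration of why the intersect step is needed, the following sketch uses base R's merge on plain data.frames (not the DataFrame method, and with made-up toy data) to show the grouping column being replaced by suffixed copies when it is shared but not joined on:

```r
# Both tables share `eye_color`, which is also the grouping variable.
x <- data.frame(name = c("a", "b"), eye_color = c("blue", "red"), height = c(1, 2))
y <- data.frame(name = c("a", "b"), eye_color = c("blue", "red"), home = c("h1", "h2"))
grps <- "eye_color"

# Joining on `name` only: the shared column is split into .x and .y copies,
# so the original group name no longer exists in the result.
by_name <- merge(x, y, by = "name", sort = FALSE)
colnames(by_name)
remaining_by_name <- intersect(grps, colnames(by_name))  # empty: nothing to regroup on

# Joining on both common columns keeps `eye_color`, so the group survives.
by_both <- merge(x, y, by = c("name", "eye_color"), sort = FALSE)
remaining_by_both <- intersect(grps, colnames(by_both))
```

Regrouping only on the surviving names avoids calling group_by with a column that is no longer there.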

jonocarroll · 2025-04-09T00:01:52Z

For the rownames issue, DFplyr doesn't need to follow the dplyr approach because we like rownames 😝. I suspect the desired result will be to preserve rownames wherever possible. merge doesn't seem to preserve them, but would there be a reason not to re-attach them in our case? One of the big gotchas is rearranged rows. I think we could solve this by attaching a temporary ...rownames column, performing the merge, then extracting the rownames back out.

is_leftish <- function(...) {
  # does this look like a non-right join?
  args <- list(...)
  if (utils::hasName(args, "all.y") && args$all.y) return(FALSE)
  TRUE
}

join_internal <- function(x, y, by = NULL, ...) {
  use_rownames <- is_leftish(...)
  if (use_rownames) x$...rownames <- rownames(x)
  if (is.null(by))
    by <- check_common_columns(names(x), names(y))

  grps <- group_vars(x)
  x <- S4Vectors::merge(x, y, by = by, sort = FALSE, ...)
  if (use_rownames) rownames(x) <- x$...rownames
  if (use_rownames) x$...rownames <- NULL

  grps <- intersect(grps, colnames(x))
  # if no grouping return object
  if(length(grps) == 0)
    return(x)
  # else rebuild groups with new DF
  group_by(x, !!!rlang::syms(grps))
}

Produces this:

library(S4Vectors)
library(DFplyr)

da <- starwars[, c("name", "eye_color", "height", "mass")][1:10, ]
da <- group_by(da, eye_color)
db <- starwars[, c("name", "eye_color", "homeworld")]

Da <- as(da, "DataFrame")
Da <- group_by(Da, eye_color)
rownames(Da) <- paste0("row", 1:10)
Db <- as(db, "DataFrame")

left_join(Da, Db)
#> Joining with `by = c("name", "eye_color")`
#> DataFrame with 10 rows and 5 columns
#> Groups:  eye_color 
#>                     name   eye_color    height      mass   homeworld
#>              <character> <character> <integer> <numeric> <character>
#> row1      Luke Skywalker        blue       172        77    Tatooine
#> row2               C-3PO      yellow       167        75    Tatooine
#> row3               R2-D2         red        96        32       Naboo
#> row4         Darth Vader      yellow       202       136    Tatooine
#> row5         Leia Organa       brown       150        49    Alderaan
#> row6           Owen Lars        blue       178       120    Tatooine
#> row7  Beru Whitesun Lars        blue       165        75    Tatooine
#> row8               R5-D4         red        97        32    Tatooine
#> row9   Biggs Darklighter       brown       183        84    Tatooine
#> row10     Obi-Wan Kenobi   blue-gray       182        77     Stewjon

right_join(Da, Db)
#> Joining with `by = c("name", "eye_color")`
#> DataFrame with 87 rows and 5 columns
#> Groups:  eye_color 
#>               name     eye_color    height      mass   homeworld
#>        <character>   <character> <integer> <numeric> <character>
#> 1   Luke Skywalker          blue       172        77    Tatooine
#> 2            C-3PO        yellow       167        75    Tatooine
#> 3            R2-D2           red        96        32       Naboo
#> 4      Darth Vader        yellow       202       136    Tatooine
#> 5      Leia Organa         brown       150        49    Alderaan
#> ...            ...           ...       ...       ...         ...
#> 83             BB8         black        NA        NA          NA
#> 84  Captain Phasma       unknown        NA        NA          NA
#> 85        San Hill          gold        NA        NA  Muunilinst
#> 86        Shaak Ti         black        NA        NA       Shili
#> 87        Grievous green, yellow        NA        NA       Kalee

inner_join(Da, Db[1:3, ])
#> Joining with `by = c("name", "eye_color")`
#> DataFrame with 3 rows and 5 columns
#> Groups:  eye_color 
#>                name   eye_color    height      mass   homeworld
#>         <character> <character> <integer> <numeric> <character>
#> row1 Luke Skywalker        blue       172        77    Tatooine
#> row2          C-3PO      yellow       167        75    Tatooine
#> row3          R2-D2         red        96        32       Naboo

full_join(Da, Db[1:3, ])
#> Joining with `by = c("name", "eye_color")`
#> DataFrame with 10 rows and 5 columns
#> Groups:  eye_color 
#>                     name   eye_color    height      mass   homeworld
#>              <character> <character> <integer> <numeric> <character>
#> row1      Luke Skywalker        blue       172        77    Tatooine
#> row2               C-3PO      yellow       167        75    Tatooine
#> row3               R2-D2         red        96        32       Naboo
#> row5         Leia Organa       brown       150        49          NA
#> row6           Owen Lars        blue       178       120          NA
#> row7  Beru Whitesun Lars        blue       165        75          NA
#> row4         Darth Vader      yellow       202       136          NA
#> row9   Biggs Darklighter       brown       183        84          NA
#> row10     Obi-Wan Kenobi   blue-gray       182        77          NA
#> row8               R5-D4         red        97        32          NA

This seems to work - for left, inner, and full joins the rownames are preserved (note the order change for full_join!) but for right_join, where we're going to end up with rows that don't exist in x, it's not clear that we should use the rownames from y - this would typically be data we're appending, and may not have useful names.
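The order change is the reason the names travel through the merge in a temporary column instead of being re-attached by position. A small sketch with base R data.frames (toy data, not DFplyr objects) shows the difference:

```r
x <- data.frame(name = c("c", "a", "b"), height = c(3, 1, 2),
                row.names = c("row1", "row2", "row3"))
y <- data.frame(name = c("a", "b", "c"), home = c("h1", "h2", "h3"))

# Naive approach: merge, then paste the old row names back by position.
# merge() sorts on the key by default, so the rows have moved and the
# labels end up on the wrong rows.
naive <- merge(x, y, by = "name")
rownames(naive) <- rownames(x)

# Temporary-column approach: the names are ordinary data during the merge,
# so they stay attached to their rows however the rows are rearranged.
x[["...rownames"]] <- rownames(x)
carried <- merge(x, y, by = "name")
rownames(carried) <- carried[["...rownames"]]
carried[["...rownames"]] <- NULL

naive["row1", "name"]    # mislabelled row
carried["row1", "name"]  # original row1
```

This only holds while each left-hand row appears exactly once in the result; duplicated matches would produce duplicated names.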

ppaxisa · 2025-04-09T07:29:07Z

Thanks for the feedback! All good points; I'll work on implementing this. I agree on rownames, especially when thinking about using this for GenomicRanges and SummarizedExperiment.

One extra edge case I see regarding rownames is when the join generates multiple matches. That leads to duplicated rows, and in that case the rownames are no longer unique. One solution would be to test that nrow(input) == nrow(output): if TRUE we restore the rownames, if FALSE we drop them. We could go a bit further and attempt to repair the rownames by adding suffixes in the event of duplicates too...
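A minimal sketch of that idea, written against base R data.frames: `restore_rownames` is a hypothetical helper (not part of DFplyr), and the suffix repair is assumed here to be base R's make.unique, which is only one possible choice.

```r
# Hypothetical helper: decide what to do with row names that were carried
# through a merge in a temporary column.
restore_rownames <- function(merged, n_input, tmp_col = "...rownames", repair = FALSE) {
  rn <- merged[[tmp_col]]
  merged[[tmp_col]] <- NULL
  if (nrow(merged) == n_input && !anyDuplicated(rn) && !anyNA(rn)) {
    rownames(merged) <- rn               # one-to-one: safe to restore
  } else if (repair && !anyNA(rn)) {
    rownames(merged) <- make.unique(rn)  # duplicates get .1, .2, ... suffixes
  } else {
    rownames(merged) <- NULL             # fall back to dropping them
  }
  merged
}

x <- data.frame(name = c("a", "b"), height = c(1, 2), row.names = c("row1", "row2"))
y <- data.frame(name = c("a", "a", "b"), film = c("f1", "f2", "f3"))

# "a" matches twice, so the merge returns 3 rows for 2 input rows.
x[["...rownames"]] <- rownames(x)
merged <- merge(x, y, by = "name", sort = FALSE)

dropped <- restore_rownames(merged, n_input = nrow(x))
repaired <- restore_rownames(merged, n_input = nrow(x), repair = TRUE)
```

The row-count check alone can miss a case where one row is dropped and another duplicated, which is why the sketch also checks for duplicates and NA directly.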

jonocarroll · 2025-04-25T11:27:10Z

Just checking in - any progress?

ppaxisa · 2025-04-25T12:17:36Z

Limited bandwidth right now, but I'll have more time in May.

jonocarroll · 2025-04-25T12:19:00Z

No rush on my side - let's target the next Bioconductor release (October?).

ppaxisa · 2025-06-23T14:37:24Z

Hey, I've been stuck for a while on devising unit tests for grouped DF:

test_that("join preserves groups when possible", {
  da <- starwars[, c("name", "eye_color", "height", "mass")][1:10, ]
  db <- starwars[, c("name", "eye_color", "homeworld")]

  Da <- as(da, "DataFrame")
  Db <- as(db, "DataFrame")

  da <- group_by(da, eye_color)
  Da <- group_by(Da, eye_color)

  # groups preserved
  res_left <- left_join(da, db)
  res_right <- right_join(da, db)
  res_inner <- inner_join(da, db[1:3, ])
  res_full <- full_join(da, db[1:3, ])

  Res_left <- left_join(Da, Db)
  Res_right <- right_join(Da, Db)
  Res_inner <- inner_join(Da, Db[1:3, ])
  Res_full <- full_join(Da, Db[1:3, ])

  expect_s4_class(Res_left, "DataFrame")
  expect_s4_class(Res_right, "DataFrame")
  expect_s4_class(Res_inner, "DataFrame")
  expect_s4_class(Res_full, "DataFrame")

  expect_identical(Res_left, as(res_left, "DataFrame") %>% group_by(eye_color))
  expect_identical(arrange(Res_right, name), arrange(as(res_right, "DataFrame"), name) %>% group_by(eye_color))
  expect_identical(Res_inner, as(res_inner, "DataFrame") %>% group_by(eye_color))
  expect_identical(arrange(Res_full, name), arrange(as(res_full, "DataFrame"), name))

  # groups altered
  res_left <- left_join(da, db, by = "name")
  res_right <- right_join(da, db, by = "name")
  res_inner <- inner_join(da, db[1:3, ], by = "name")
  res_full <- full_join(da, db[1:3, ], by = "name")

  # these currently fail because they try to group by a non-existent column
  Res_left <- left_join(Da, Db, by = "name")
  Res_right <- right_join(Da, Db, by = "name")
  Res_inner <- inner_join(Da, Db[1:3, ], by = "name")
  Res_full <- full_join(Da, Db[1:3, ], by = "name")

  expect_s4_class(Res_left, "DataFrame")
  expect_s4_class(Res_right, "DataFrame")
  expect_s4_class(Res_inner, "DataFrame")
  expect_s4_class(Res_full, "DataFrame")

  expect_identical(Res_left, as(res_left, "DataFrame") %>% group_by(eye_color))
  expect_identical(arrange(Res_right, name), arrange(as(res_right, "DataFrame"), name) %>% group_by(eye_color))
  expect_identical(Res_inner, as(res_inner, "DataFrame") %>% group_by(eye_color))
  expect_identical(arrange(Res_full, name), arrange(as(res_full, "DataFrame"), name))
})

which keeps failing, but I realize now that group info is structured differently in vanilla tidyverse and DFplyr. How would you advise running those unit tests to check how *_join handles groups?

jonocarroll · 2025-06-24T04:39:04Z

I've linked it from elsewhere in #16 but I might need to rewrite the way groups are handled internally, anyway - it appears my implementation isn't exactly the same as what dplyr uses, but I should be able to realign to make it more consistent, and hopefully make joins even easier.

I'll see what needs to change and whether that makes this branch's tests pass.

jonocarroll · 2025-06-30T05:16:47Z

I've rolled these changes into the group_subset branch which has all of these additions vs master:

master...group_subset

because this overlaps a lot with the other changes I was making. I needed to modify how rbind, [, and some other pieces fit together, and I also needed an actual asMethod("grouped_df", "DataFrame"), which I hadn't considered: the regular as(x, "DataFrame"), when x is a grouped tibble, was squeezing the group information into listData, breaking the expect_identical stuff.

Now all the tests pass and joins appear to work. Have a look and feel free to target that branch with any additional suggestions.

ppaxisa added 2 commits April 5, 2025 22:14

joins with updated doc and tests

7d3861c

update groups before returning object

540eb17

join respects groups and rownames

70610d3
