Skip to content

Join implementation #15

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 2 commits into
base: master
Choose a base branch
from
Open

Join implementation #15

wants to merge 2 commits into from

Conversation

ppaxisa
Copy link

@ppaxisa ppaxisa commented Apr 8, 2025

Following up on issue #13.

I modified how to deal with groups: extracting group variables before merge and rerunning group_by after merge. I'm not sure I follow your comment here:

if a group column is no longer present in the result, it should no longer be a group column.

join should not drop columns? The only edge case I can imagine is ia grouping column would being all NA in case of right_join where there is no match, or maybe being constant if only one category matches.

I dropped the arrange from the tests excepted for right_join.

One last thought I have is regarding rownames. I think dplyr behavior behavior is to drop rownames with join operations and right now that would be the same with this implementation. I think that's fine but wanted to point that out.

@jonocarroll
Copy link
Owner

Looking good so far. I think the happy path is solid.

The edge case for groups is when there is a common column but it's not used for joining, in which case the result has a .x and .y column, and so regrouping fails.

da <- starwars[, c("name", "eye_color", "height", "mass")][1:10, ]
da <- group_by(da, eye_color)
db <- starwars[, c("name", "eye_color", "homeworld")]

Da <- as(da, "DataFrame")
Da <- group_by(Da, eye_color)
Db <- as(db, "DataFrame")

# these all work, preserving group
left_join(da, db)
right_join(da, db)
inner_join(da, db[1:3, ])

# these all give the right answers, group preserved
left_join(Da, Db)
right_join(Da, Db)
inner_join(Da, Db[1:3, ])

# none of these retain the group because it's no longer a column, but they don't error
left_join(da, db, by = "name")
right_join(da, db, by = "name")
inner_join(da, db[1:3, ], by = "name")

# these currently fail because they try to group by a non-existent column
left_join(Da, Db, by = "name")
right_join(Da, Db, by = "name")
inner_join(Da, Db[1:3, ], by = "name")

I think join_internal just needs a

  grps <- intersect(grps, colnames(x)) # <-- preserve remaining groups

  # if no grouping return object
  if(length(grps) == 0)
    return(x)
  # else rebuild groups with new DF
  group_by(x, !!!rlang::syms(grps))

but please add some tests for this case (and confirm that it works).

@jonocarroll
Copy link
Owner

For the rownames issue, DFplyr doesn't need to follow the dplyr approach because we like rownames 😝. I suspect the desired result will be to preserve rownames wherever possible. merge doesn't seem to preserve them, but would there be a reason not to re-attach them in our case? One of the big gotchas is rearranged rows. I think we could solve this by attaching a temporary ...rownames column, performing the merge, then extracting the rownames back out.

is_leftish <- function(...) {
  # does this look like a non-right join?
  args <- list(...)
  if (utils::hasName(args, "all.y") && args$all.y) return(FALSE)
  TRUE
}

join_internal <- function(x, y, by = NULL, ...) {
  use_rownames <- is_leftish(...)
  if (use_rownames) x$...rownames <- rownames(x)
  if (is.null(by))
    by <- check_common_columns(names(x), names(y))

  grps <- group_vars(x)
  x <- S4Vectors::merge(x, y, by = by, sort = FALSE, ...)
  if (use_rownames) rownames(x) <- x$...rownames
  if (use_rownames) x$...rownames <- NULL

  grps <- intersect(grps, colnames(x))
  # if no grouping return object
  if(length(grps) == 0)
    return(x)
  # else rebuild groups with new DF
  group_by(x, !!!rlang::syms(grps))
}

Produces this:

library(S4Vectors)
library(DFplyr)

da <- starwars[, c("name", "eye_color", "height", "mass")][1:10, ]
da <- group_by(da, eye_color)
db <- starwars[, c("name", "eye_color", "homeworld")]

Da <- as(da, "DataFrame")
Da <- group_by(Da, eye_color)
rownames(Da) <- paste0("row", 1:10)
Db <- as(db, "DataFrame")

left_join(Da, Db)
#> Joining with `by = c("name", "eye_color")`
#> DataFrame with 10 rows and 5 columns
#> Groups:  eye_color 
#>                     name   eye_color    height      mass   homeworld
#>              <character> <character> <integer> <numeric> <character>
#> row1      Luke Skywalker        blue       172        77    Tatooine
#> row2               C-3PO      yellow       167        75    Tatooine
#> row3               R2-D2         red        96        32       Naboo
#> row4         Darth Vader      yellow       202       136    Tatooine
#> row5         Leia Organa       brown       150        49    Alderaan
#> row6           Owen Lars        blue       178       120    Tatooine
#> row7  Beru Whitesun Lars        blue       165        75    Tatooine
#> row8               R5-D4         red        97        32    Tatooine
#> row9   Biggs Darklighter       brown       183        84    Tatooine
#> row10     Obi-Wan Kenobi   blue-gray       182        77     Stewjon

right_join(Da, Db)
#> Joining with `by = c("name", "eye_color")`
#> DataFrame with 87 rows and 5 columns
#> Groups:  eye_color 
#>               name     eye_color    height      mass   homeworld
#>        <character>   <character> <integer> <numeric> <character>
#> 1   Luke Skywalker          blue       172        77    Tatooine
#> 2            C-3PO        yellow       167        75    Tatooine
#> 3            R2-D2           red        96        32       Naboo
#> 4      Darth Vader        yellow       202       136    Tatooine
#> 5      Leia Organa         brown       150        49    Alderaan
#> ...            ...           ...       ...       ...         ...
#> 83             BB8         black        NA        NA          NA
#> 84  Captain Phasma       unknown        NA        NA          NA
#> 85        San Hill          gold        NA        NA  Muunilinst
#> 86        Shaak Ti         black        NA        NA       Shili
#> 87        Grievous green, yellow        NA        NA       Kalee

inner_join(Da, Db[1:3, ])
#> Joining with `by = c("name", "eye_color")`
#> DataFrame with 3 rows and 5 columns
#> Groups:  eye_color 
#>                name   eye_color    height      mass   homeworld
#>         <character> <character> <integer> <numeric> <character>
#> row1 Luke Skywalker        blue       172        77    Tatooine
#> row2          C-3PO      yellow       167        75    Tatooine
#> row3          R2-D2         red        96        32       Naboo

full_join(Da, Db[1:3, ])
#> Joining with `by = c("name", "eye_color")`
#> DataFrame with 10 rows and 5 columns
#> Groups:  eye_color 
#>                     name   eye_color    height      mass   homeworld
#>              <character> <character> <integer> <numeric> <character>
#> row1      Luke Skywalker        blue       172        77    Tatooine
#> row2               C-3PO      yellow       167        75    Tatooine
#> row3               R2-D2         red        96        32       Naboo
#> row5         Leia Organa       brown       150        49          NA
#> row6           Owen Lars        blue       178       120          NA
#> row7  Beru Whitesun Lars        blue       165        75          NA
#> row4         Darth Vader      yellow       202       136          NA
#> row9   Biggs Darklighter       brown       183        84          NA
#> row10     Obi-Wan Kenobi   blue-gray       182        77          NA
#> row8               R5-D4         red        97        32          NA

This seems to work - for left, inner, and full joins the rownames are preserved (note the order change for full_join!) but for right_join, where we're going to end up with rows that don't exist in x, it's not clear that we should use the rownames from y - this would typically be data we're appending, and may not have useful names.

@ppaxisa
Copy link
Author

ppaxisa commented Apr 9, 2025

Thanks for the feedback! all good points, I'll work on implementing this. I agree on rownames especially when thinking about using this for GenomicRanges and SummarizedExperiment.

One extra edge case I see regarding rownames is when you have multiple matches generated by join. Which will lead to row duplications, and in this case the rownames will not be unique anymore. One solution would be to test that now(input) == nrow(output), if TRUE we restore rownames, if FALSE we drop it. We could go a bit further by attempting to repair rownames by adding suffixes in the event of duplicates too...

@jonocarroll
Copy link
Owner

Just checking in - any progress?

@ppaxisa
Copy link
Author

ppaxisa commented Apr 25, 2025

limited bandwidth right now, but I'll have more time in May

@jonocarroll
Copy link
Owner

No rush on my side - let's target the next Bioconductor release (October?).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants