Skip to content

feat: add DefaultIndexKind.NULL to use as index_col in read_gbq*, creating an indexless DataFrame/Series #662

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 24 commits into from
May 20, 2024

Conversation

TrevorBergeron
Copy link
Contributor

Thank you for opening a Pull Request! Before submitting your PR, there are a few things you can do to make sure it goes smoothly:

  • Make sure to open an issue as a bug/issue before writing your code! That way we can discuss the change, evaluate designs, and agree on the general idea
  • Ensure the tests and linter pass
  • Code coverage does not decrease (if any source code was changed)
  • Appropriate docs were updated (if necessary)

Fixes #<issue_number_goes_here> 🦕

@product-auto-label product-auto-label bot added size: l Pull request size is large. api: bigquery Issues related to the googleapis/python-bigquery-dataframes API. labels May 6, 2024
@TrevorBergeron TrevorBergeron marked this pull request as ready for review May 10, 2024 16:22
@TrevorBergeron TrevorBergeron requested review from a team as code owners May 10, 2024 16:22
@TrevorBergeron TrevorBergeron requested a review from tswast May 10, 2024 16:22
@@ -2113,6 +2119,10 @@ def __repr__(self) -> str:

def to_pandas(self) -> pd.Index:
"""Executes deferred operations and downloads the results."""
if len(self.column_ids) == 0:
raise bigframes.exceptions.NullIndexError(
"Cannot perform this operation without an index. Set an index using set_index."
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's be more specfic about the operation. As a user I would appreciate knowing why can't do this. Seems a bit obvious that we can't get a pandas index because there isn't an index, but let's spell it out.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rewrote message

# ----------------------------------------------------

if not index_col and len(index_cols) == 0:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This reads odd to me: len(index_cols) == 0 implies not index_col. Could you rephrase what you're trying to check here? I assume you're trying to exclude the DefaultIndexKind enum from this check. Let's be explicit about that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, basically, trying to fall back to sequential index if don't have null index, user provided index columns, or metadata-derived index columns. Rewrote the condition though with the code structure, its still a bit weird.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are your overall thoughts on testing? How can we be confident that empty/null index works? As we add operations that should support null/empty index, we add tests here and in the usual location?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I think it will be tricky, as null index tests will require manually configured expectations for many tests, as we cannot always compare against pandas (which doesn't have null index). So yeah, I think we will be stuck with a parallel test suite, which will be a bit burdensome to maintain. Being pandas-equivalent has been a huge boon for testing thus far.

@TrevorBergeron TrevorBergeron requested a review from tswast May 16, 2024 17:29
Comment on lines 1935 to 1936
and (sort is False)
and (block_identity_join is False)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since these are only bool per type checking, we don't have to worry about any false-y values to mess up "not"

Suggested change
and (sort is False)
and (block_identity_join is False)
and not sort
and not block_identity_join

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question: Doesn't block_identity_join imply indexless join is allowed? What's this checking for?

Edit: found it

allow_row_identity_join=(not block_identity_join),

Apparently block_identity_join is the opposite of what I thought it is. Please add a docstring to this function explaining what these args mean.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added docstring

TrevorBergeron and others added 3 commits May 17, 2024 11:15
Co-authored-by: Tim Sweña (Swast) <[email protected]>
Co-authored-by: Tim Sweña (Swast) <[email protected]>
Co-authored-by: Tim Sweña (Swast) <[email protected]>
Copy link
Collaborator

@tswast tswast left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@tswast
Copy link
Collaborator

tswast commented May 20, 2024

Test failure FAILED tests/system/large/test_remote_function.py::test_remote_function_via_session_vpc_invalid appears to be unrelated.

@tswast tswast changed the title feat: Support indexless dataframe/series feat: add DefaultIndexKind.NULL to use as index_col in read_gbq*, creating an indexless DataFrame/Series May 20, 2024
@tswast tswast merged commit 29e4886 into main May 20, 2024
20 of 21 checks passed
@tswast tswast deleted the null_index branch May 20, 2024 20:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api: bigquery Issues related to the googleapis/python-bigquery-dataframes API. size: l Pull request size is large.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants