fix: ml.model_selection.train_test_split index to match in unordered mode #2283

GarrettWu · 2025-11-20T00:20:19Z

Thank you for opening a Pull Request! Before submitting your PR, there are a few things you can do to make sure it goes smoothly:

Make sure to open an issue as a bug/issue before writing your code! That way we can discuss the change, evaluate designs, and agree on the general idea
Ensure the tests and linter pass
Code coverage does not decrease (if any source code was changed)
Appropriate docs were updated (if necessary)

Fixes b/462105877


TrevorBergeron · 2025-11-24T20:39:31Z

bigframes/ml/model_selection.py

-        results.append(joined_df_train[columns])
-        results.append(joined_df_test[columns])
+        results.append(joined_df_train[columns].cache())
+        results.append(joined_df_test[columns].cache())


This is a lot of .cache() calls. I think the caching should ideally happen inside the block.split method. That way the ordering is locked in, but only a single table is cached in total, which should be a lot faster.
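To illustrate what "locking in" the ordering means, here is a minimal toy sketch in plain Python. It does not use bigframes; `UnorderedTable`, `CachedTable`, and `train_split` are hypothetical names invented for this illustration, and the shuffle only simulates a table whose row order may change between evaluations. The point it shows is that reading two column subsets from one materialized snapshot yields matching index labels, while two separate evaluations of an unordered table give no such guarantee.

```python
import random


class UnorderedTable:
    """Hypothetical stand-in for a lazy table with no stable row order."""

    def __init__(self, rows, seed=0):
        self._rows = list(rows)
        self._rng = random.Random(seed)
        self.evaluations = 0  # number of simulated query executions

    def evaluate(self):
        # Every evaluation may hand the rows back in a different order.
        self.evaluations += 1
        rows = list(self._rows)
        self._rng.shuffle(rows)
        return rows

    def cache(self):
        # Evaluate once and keep that result as a fixed snapshot.
        return CachedTable(self.evaluate())


class CachedTable:
    """Snapshot whose row order no longer changes."""

    def __init__(self, rows):
        self._rows = list(rows)

    def evaluate(self):
        return list(self._rows)


def train_split(table, column, n_train):
    """Return (index, value) pairs for the first n_train rows of one column."""
    return [(row["index"], row[column]) for row in table.evaluate()[:n_train]]


rows = [{"index": i, "x": i * 10, "y": i % 2} for i in range(8)]

# Without a snapshot, X and y are read from two independent evaluations.
lazy = UnorderedTable(rows, seed=42)
x_lazy = train_split(lazy, "x", 6)
y_lazy = train_split(lazy, "y", 6)
print("uncached index match:", [i for i, _ in x_lazy] == [i for i, _ in y_lazy])

# With one snapshot, both reads see the same row order.
source = UnorderedTable(rows, seed=42)
snapshot = source.cache()
x_index = [i for i, _ in train_split(snapshot, "x", 6)]
y_index = [i for i, _ in train_split(snapshot, "y", 6)]
print("cached index match:", x_index == y_index)
```

In this toy model the snapshot costs a single evaluation no matter how many column subsets are read from it, which is the property the comment above is asking for.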

GarrettWu · Nov 24, 2025

I can move the caches outside of the loop, which removes some queries. But caching anywhere inside block.split doesn't work. Do you have any insight into why?
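As a rough sketch of why moving the caches out of the loop removes queries, the following plain-Python model counts simulated materializations. `LazyFrame` is a hypothetical stand-in and not the bigframes API; its `cache()` only increments a counter, and the two functions mirror the shape of the diff above under the assumption that each result is a column selection of a joined train or test frame.

```python
class LazyFrame:
    """Hypothetical lazy frame that counts how often cache() is called."""

    materializations = 0  # class-wide counter of simulated cache queries

    def __init__(self, columns):
        self.columns = list(columns)

    def __getitem__(self, columns):
        # Column selection stays lazy and triggers no query.
        return LazyFrame(columns)

    def cache(self):
        LazyFrame.materializations += 1
        return self


def split_cache_per_result(joined_train, joined_test, column_groups):
    """Cache every selected result: two caches per input frame."""
    results = []
    for columns in column_groups:
        results.append(joined_train[columns].cache())
        results.append(joined_test[columns].cache())
    return results


def split_cache_hoisted(joined_train, joined_test, column_groups):
    """Cache the joined frames once, before the loop: two caches in total."""
    joined_train = joined_train.cache()
    joined_test = joined_test.cache()
    results = []
    for columns in column_groups:
        results.append(joined_train[columns])
        results.append(joined_test[columns])
    return results


column_groups = [["x1", "x2"], ["y"]]  # e.g. feature columns and label column
train = LazyFrame(["x1", "x2", "y"])
test = LazyFrame(["x1", "x2", "y"])

LazyFrame.materializations = 0
split_cache_per_result(train, test, column_groups)
per_result_count = LazyFrame.materializations

LazyFrame.materializations = 0
split_cache_hoisted(train, test, column_groups)
hoisted_count = LazyFrame.materializations

print(per_result_count, hoisted_count)
```

The per-result variant grows with the number of inputs passed to the split, while the hoisted variant stays at two caches. It is still one more cache than the single table that caching inside block.split would need.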

TrevorBergeron · Nov 25, 2025

Hmm, really? I would expect caching anywhere around this area:

python-bigquery-dataframes/bigframes/core/blocks.py

Lines 901 to 913 in b487cf1

    block, string_ordering_col = block.apply_unary_op(
        ordering_col, ops.AsTypeOp(to_type=bigframes.dtypes.STRING_DTYPE)
    )

    # Apply hash method to sum col and order by it.
    block, string_sum_col = block.apply_binary_op(
        string_ordering_col, random_state_col, ops.strconcat_op
    )
    block, hash_string_sum_col = block.apply_unary_op(string_sum_col, ops.hash_op)
    block = block.order_by(
        [ordering.OrderingExpression(ex.deref(hash_string_sum_col))]
    )

to work OK (ideally at the end of this block).
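For readers unfamiliar with the snippet above, its shuffle can be sketched in plain Python: cast the ordering key to a string, concatenate it with the random state, hash the result, and order by the hash. This is only an illustration under stated assumptions. SHA-256 from `hashlib` is swapped in for `ops.hash_op`, whose actual hash function is not shown in this thread, and `hash_order_key` and `shuffled_split` are hypothetical helper names.

```python
import hashlib


def hash_order_key(ordering_value, random_state):
    """Hash of str(ordering value) + str(random state), used as the sort key."""
    text = str(ordering_value) + str(random_state)
    # Assumption: SHA-256 stands in for the engine's hash operation.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def shuffled_split(ordering_values, test_fraction, random_state):
    """Order rows by their hash key, then cut the ordered rows into train/test."""
    ordered = sorted(
        ordering_values, key=lambda value: hash_order_key(value, random_state)
    )
    n_test = int(len(ordered) * test_fraction)
    return ordered[n_test:], ordered[:n_test]


row_ids = list(range(20))
train_ids, test_ids = shuffled_split(row_ids, test_fraction=0.25, random_state=7)
print(len(train_ids), len(test_ids))
```

Because the sort key depends only on the row's ordering value and the random state, the resulting order is reproducible as long as the ordering values themselves are stable. If the underlying ordering column is regenerated on each evaluation, the hash order can change with it, which may be relevant to the behavior discussed in this thread.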

GarrettWu · Nov 25, 2025 (edited)

No. No matter where I put it within block.split, it doesn't work. Only calling cache() on the end results helps.

screen/6A2RFRNf9m96Qvo

Could this be a bug in some deeper code?

GarrettWu added 2 commits November 19, 2025 23:14

fix: ml.model_selection.train_test_split index to match in unordered …

2b5f122

…mode

fix

9e2cd71

GarrettWu requested review from TrevorBergeron and tswast November 20, 2025 00:20

GarrettWu self-assigned this Nov 20, 2025

GarrettWu requested review from a team as code owners November 20, 2025 00:20

product-auto-label bot added the size: s and api: bigquery labels Nov 20, 2025

TrevorBergeron reviewed Nov 24, 2025


GarrettWu added 2 commits November 24, 2025 23:53

move cache()

b29c6f3

Merge remote-tracking branch 'github/main' into garrettwu-split

b2815b0


2 participants
