fix: ml.model_selection.train_test_split index to match in unordered mode #2283
base: main
bigframes/ml/model_selection.py
Outdated
- results.append(joined_df_train[columns])
- results.append(joined_df_test[columns])
+ results.append(joined_df_train[columns].cache())
+ results.append(joined_df_test[columns].cache())
This is a lot of .cache() calls. Ideally, I think the caching should happen inside the block.split method. That way the ordering is locked in, but only a single table is cached in total, which should be a lot faster.
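To illustrate the idea of caching once before splitting, here is a minimal toy sketch in plain Python. It does not use bigframes, and it does not claim to reproduce how block.split or .cache() behave internally. `LazyUnorderedTable` and `split_with_single_snapshot` are hypothetical names invented for this illustration: the table returns its rows in a different order on each evaluation, and the split materializes it exactly once so that every output is cut from the same ordering.

```python
import random


class LazyUnorderedTable:
    """Hypothetical stand-in for a lazily evaluated table with no stable row order."""

    def __init__(self, rows, seed):
        self._rows = list(rows)
        self._rng = random.Random(seed)
        self.evaluations = 0

    def evaluate(self):
        # Each evaluation re-runs the "query" and may return rows in a new order.
        self.evaluations += 1
        rows = list(self._rows)
        self._rng.shuffle(rows)
        return rows


def split_with_single_snapshot(table, n_train):
    # Materialize once, so every output below is cut from the same row order.
    snapshot = table.evaluate()
    train, test = snapshot[:n_train], snapshot[n_train:]
    x_train = [(idx, x) for idx, x, _ in train]
    x_test = [(idx, x) for idx, x, _ in test]
    y_train = [(idx, y) for idx, _, y in train]
    y_test = [(idx, y) for idx, _, y in test]
    return x_train, x_test, y_train, y_test


rows = [(i, i * 10, i % 2) for i in range(10)]  # (index, feature, label)
table = LazyUnorderedTable(rows, seed=0)
x_train, x_test, y_train, y_test = split_with_single_snapshot(table, n_train=8)
```

In this toy model the X and y outputs share the same index order because they come from one snapshot. Whether the same holds when caching inside block.split is the open question in this thread.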
I can move the caches outside of the loop, which removes some queries. But otherwise (caching anywhere in block.split) it doesn't work. Do you have any insight into why?
Hmm, really? I would expect caching anywhere around this area to work:
python-bigquery-dataframes/bigframes/core/blocks.py
Lines 901 to 913 in b487cf1
block, string_ordering_col = block.apply_unary_op(
    ordering_col, ops.AsTypeOp(to_type=bigframes.dtypes.STRING_DTYPE)
)
# Apply hash method to sum col and order by it.
block, string_sum_col = block.apply_binary_op(
    string_ordering_col, random_state_col, ops.strconcat_op
)
block, hash_string_sum_col = block.apply_unary_op(string_sum_col, ops.hash_op)
block = block.order_by(
    [ordering.OrderingExpression(ex.deref(hash_string_sum_col))]
)
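The snippet above orders rows by a hash of the stringified ordering column concatenated with the random state. A rough plain-Python sketch of that technique follows. It is an illustration only: `hash_order` is a hypothetical helper, and `hashlib.sha256` is an assumed stand-in for whatever `ops.hash_op` computes in BigQuery.

```python
import hashlib


def hash_order(keys, random_state):
    """Return keys sorted by a hash of str(key) + str(random_state)."""

    def sort_key(key):
        # Mirror the quoted steps: cast to string, concatenate the seed, hash.
        text = str(key) + str(random_state)
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    return sorted(keys, key=sort_key)


keys = list(range(10))
first = hash_order(keys, random_state=42)
# The result depends only on the key values and the seed, not on input order.
second = hash_order(reversed(keys), random_state=42)
other_seed = hash_order(keys, random_state=7)
```

Because the sort key depends only on each row's own key and the seed, the resulting order is reproducible for a given seed as long as the ordering column values themselves are stable across evaluations.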
No. No matter where I put the cache within block.split, it doesn't work. Only calling cache() on the end results helps.
screen/6A2RFRNf9m96Qvo
Could this be a bug somewhere in deeper code?
Fixes b/462105877