
Conversation


@PGijsbers (Collaborator) commented Jun 10, 2025

In the ideal scenario, there are only two formats left in the end:

  • parquet for tabular data
  • json for everything else

but we might need to support sparse data through scipy.
In any case, I hope to avoid situations where pickle is used, since that leaks dependencies between the libraries used by the benchmark software and the integration scripts.
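
As a rough illustration of that policy (not code from this PR; the helper name save_result is made up, and parquet output assumes pyarrow or fastparquet is available), the dispatch could look like:

```python
# Illustrative sketch only: parquet for tabular data, scipy's .npz for sparse
# matrices, JSON for everything else, and never pickle.
from pathlib import Path
import json

import numpy as np
import pandas as pd
import scipy.sparse


def save_result(data, path: Path) -> Path:
    if isinstance(data, pd.DataFrame):
        out = path.with_suffix(".parquet")
        data.to_parquet(out)                  # tabular data -> parquet
    elif scipy.sparse.issparse(data):
        out = path.with_suffix(".npz")
        scipy.sparse.save_npz(out, data)      # sparse data -> scipy's native format
    else:
        out = path.with_suffix(".json")
        with open(out, "w") as f:
            json.dump(np.asarray(data).tolist(), f)  # everything else -> json
    return out
```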

Summary by CodeRabbit

  • Refactor

    • The output file path is no longer passed as an argument to result functions across multiple frameworks; instead, it is conditionally included in the result data.
    • Serialization of NumPy arrays now avoids using pickle, improving security by serializing object arrays as JSON lists and restricting pandas serializers to "parquet" and "json" formats only.
  • Chores

    • Removed the configuration option for NumPy pickle allowance, streamlining serialization settings.
    • Updated package installation script to conditionally install requirements based on import checks, reducing unnecessary installs.
  • Tests

    • Removed multiple tests for pickle serialization and sparse DataFrame deserialization to reflect updated serialization methods.
  • CI

    • Added diagnostic steps in the GitHub Actions workflow to verify virtual environment contents and coverage tool availability.


coderabbitai bot commented Jun 10, 2025

Warning

Rate limit exceeded

@PGijsbers has exceeded the limit for the number of commits or files that can be reviewed per hour. Please wait 12 minutes and 24 seconds before requesting another review.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

📥 Commits

Reviewing files that changed from the base of the PR and between 89ff448 and af99117.

📒 Files selected for processing (1)
  • .github/workflows/run_all_frameworks.yml (3 hunks)

Walkthrough

This update removes the output_file parameter from various result and prediction-saving function calls across multiple framework and extension scripts. The result function in frameworks/shared/callee.py is updated to always include the output_file key in the result dictionary using the configuration value, and its signature no longer accepts an output_file argument. Additionally, serialization logic is updated to remove the numpy_allow_pickle configuration and enforce safer NumPy array serialization by avoiding pickle for object dtype arrays. Several tests related to pickle serialization were removed. The GitHub Actions workflow was enhanced with diagnostic steps for environment and coverage tool inspection. The shared setup script now installs requirements conditionally based on import checks rather than bulk installation.
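
A minimal sketch of the described result() change (the real function in frameworks/shared/callee.py may differ in signature and return type; the config parameter stands in for the framework configuration namespace):

```python
# Callers no longer pass output_file; result() fills it in from the config.
def result(predictions=None, config=None, **others):
    res = dict(predictions=predictions, **others)
    # "output_file" now comes from the configuration, not from an argument
    res.setdefault("output_file", config.output_predictions_file)
    return res

# A framework script would then simply call, e.g.:
# return result(output_predictions, models_count=len(models))  # illustrative
```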

Changes

| File(s) | Change Summary |
| --- | --- |
| amlb/utils/serialization.py | Removed numpy_allow_pickle config; updated NumPy serialization to avoid pickle by serializing object dtype arrays as JSON lists; pandas serialization restricted to "parquet" and "json"; deserialization no longer uses allow_pickle. |
| frameworks/shared/callee.py | Removed output_file parameter from result function signature; result now ensures "output_file" key is set in result dict using config value; added error log on uncaught exceptions in call_run. |
| examples/custom/extensions/GradientBoosting/exec.py | Removed explicit output_file argument from save_predictions call. |
| examples/custom/extensions/Stacking/exec.py | Removed explicit output_file argument from result() call. |
| frameworks/AutoGluon/exec.py, frameworks/AutoGluon/exec_ts.py, frameworks/FEDOT/exec.py, frameworks/FEDOT/exec_ts.py, frameworks/GAMA/exec.py, frameworks/H2OAutoML/exec.py, frameworks/MLPlan/exec.py, frameworks/NaiveAutoML/exec.py, frameworks/RandomForest/exec.py, frameworks/SapientML/exec.py, frameworks/TPOT/exec.py, frameworks/TunedRandomForest/exec.py, frameworks/autosklearn/exec.py, frameworks/flaml/exec.py, frameworks/hyperoptsklearn/exec.py, frameworks/lightautoml/exec.py, frameworks/mljarsupervised/exec.py, frameworks/oboe/exec.py | Removed explicit output_file=config.output_predictions_file argument from result() function calls; output file handling is now internal or implicit. |
| tests/unit/amlb/utils/serialization/test_serializers.py | Removed tests related to pickle-based serialization and deserialization, including sparse pandas DataFrame pickle tests. |
| .github/workflows/run_all_frameworks.yml | Added a new "Check Things" step to list contents and permissions of the virtual environment directories and print the current working directory; added diagnostic commands for coverage tool availability. |
| frameworks/shared/setup.sh | Changed pip install from bulk to per-package conditional install based on import checks; skips installation if a package is already importable (see the sketch after this table). |
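
The setup.sh change in the last row amounts to an "install only if not importable" check. A hedged Python rendering of the same idea (the actual script does this in bash with pip; ensure_installed is an illustrative name, and, as in the shell version, it assumes the distribution name matches the import name):

```python
import importlib.util
import re
import subprocess
import sys


def ensure_installed(requirement: str) -> None:
    # "numpy==1.26.0" -> "numpy", mirroring the sed expression in setup.sh
    package = re.split(r"[=><~!]", requirement, maxsplit=1)[0].strip()
    if importlib.util.find_spec(package) is None:
        print(f"{package} not found. Installing from requirements.txt...")
        subprocess.run(
            [sys.executable, "-m", "pip", "install", "--no-cache-dir", requirement],
            check=True,
        )
    else:
        print(f"{package} is already installed by the framework, using that instead.")
```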

Sequence Diagram(s)

sequenceDiagram
    participant Framework as Framework Script
    participant Callee as callee.result()
    participant Config as Config Namespace
    participant File as Result File

    Framework->>Callee: result(predictions, ... )
    Callee->>Config: Access config.output_predictions_file
    Callee->>Callee: Add "output_file" key to result dict if missing
    Callee->>File: Dump result dict as JSON
    Callee-->>Framework: Return result dict
sequenceDiagram
    participant serialize_data
    participant np as numpy
    participant json as json_dump

    serialize_data->>np: np.save(path, data, allow_pickle=False)
    np-->>serialize_data: Success or ValueError (if object dtype)
    alt object dtype array
        serialize_data->>serialize_data: squeeze array
        serialize_data->>json: json_dump(data.tolist(), path)
    else non-object dtype array
        serialize_data->>np: np.save(path, data, allow_pickle=False)
    end
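
Translated into a standalone sketch (the real helper is serialize_data in amlb/utils/serialization.py; the function name and path handling below are illustrative assumptions, not the actual implementation):

```python
import json
import numpy as np


def serialize_array(arr: np.ndarray, path: str) -> str:
    try:
        np.save(path, arr, allow_pickle=False)  # np.save appends .npy if missing
        return path + ".npy"
    except ValueError:
        # raised for object-dtype arrays, which would otherwise require pickle;
        # fall back to writing the (squeezed) array as a JSON list instead
        json_path = path + ".json"
        with open(json_path, "w") as f:
            json.dump(np.squeeze(arr).tolist(), f)
        return json_path
```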

Poem

A bunny hopped through code today,
Tidied up how results find their way.
No more output files to pass around,
Now config keeps them safe and sound.
Serialization’s safer too—
Pickles gone, just clean and new!
🐇✨




codecov-commenter commented Jun 10, 2025

Codecov Report

Attention: Patch coverage is 0% with 13 lines in your changes missing coverage. Please review.

Project coverage is 51.10%. Comparing base (74d03cb) to head (0f93348).
Report is 1 commit behind head on master.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| amlb/utils/serialization.py | 0.00% | 11 Missing ⚠️ |
| frameworks/shared/callee.py | 0.00% | 2 Missing ⚠️ |
Additional details and impacted files
@@             Coverage Diff             @@
##           master     #710       +/-   ##
===========================================
- Coverage   69.46%   51.10%   -18.36%     
===========================================
  Files          57       36       -21     
  Lines        6821     5262     -1559     
===========================================
- Hits         4738     2689     -2049     
- Misses       2083     2573      +490     

☔ View full report in Codecov by Sentry.


coderabbitai bot left a comment


Actionable comments posted: 2

🔭 Outside diff range comments (1)
amlb/utils/serialization.py (1)

219-235: ⚠️ Potential issue

Reference to removed config.numpy_pickle breaks .npz loading

numpy_allow_pickle was removed, yet deserialize_data() still uses it:

with np.load(path, allow_pickle=config.numpy_pickle) as loaded:

This raises AttributeError.
Given the new policy (“no pickle”), simply drop the arg:

-with np.load(path, allow_pickle=config.numpy_pickle) as loaded:
-    return loaded
+with np.load(path, allow_pickle=False) as loaded:
+    return loaded

Optionally expose an explicit allow_pickle flag in ser_config if future flexibility is required.
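
For context, a quick standalone illustration of what the allow_pickle=False flag guards against (not project code; the file name is arbitrary):

```python
import numpy as np

arr = np.array([{"a": 1}], dtype=object)    # object dtype can only be saved via pickle
np.save("obj.npy", arr, allow_pickle=True)

try:
    np.load("obj.npy", allow_pickle=False)
except ValueError as err:
    # numpy refuses the pickled payload, roughly:
    # "Object arrays cannot be loaded when allow_pickle=False"
    print(err)
```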

🧰 Tools
🪛 Pylint (3.3.7)

[refactor] 219-269: Unnecessary "elif" after "return", remove the leading "el" from "elif"

(R1705)


[refactor] 225-234: Unnecessary "else" after "return", remove the "else" and de-indent the code inside it

(R1705)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between dfe8d21 and 526da41.

📒 Files selected for processing (22)
  • amlb/utils/serialization.py (3 hunks)
  • examples/custom/extensions/GradientBoosting/exec.py (0 hunks)
  • examples/custom/extensions/Stacking/exec.py (0 hunks)
  • frameworks/AutoGluon/exec.py (0 hunks)
  • frameworks/AutoGluon/exec_ts.py (0 hunks)
  • frameworks/FEDOT/exec.py (0 hunks)
  • frameworks/FEDOT/exec_ts.py (0 hunks)
  • frameworks/GAMA/exec.py (0 hunks)
  • frameworks/H2OAutoML/exec.py (0 hunks)
  • frameworks/MLPlan/exec.py (0 hunks)
  • frameworks/NaiveAutoML/exec.py (0 hunks)
  • frameworks/RandomForest/exec.py (0 hunks)
  • frameworks/SapientML/exec.py (0 hunks)
  • frameworks/TPOT/exec.py (0 hunks)
  • frameworks/TunedRandomForest/exec.py (0 hunks)
  • frameworks/autosklearn/exec.py (0 hunks)
  • frameworks/flaml/exec.py (0 hunks)
  • frameworks/hyperoptsklearn/exec.py (0 hunks)
  • frameworks/lightautoml/exec.py (0 hunks)
  • frameworks/mljarsupervised/exec.py (0 hunks)
  • frameworks/oboe/exec.py (0 hunks)
  • frameworks/shared/callee.py (1 hunks)
💤 Files with no reviewable changes (20)
  • examples/custom/extensions/Stacking/exec.py
  • frameworks/GAMA/exec.py
  • frameworks/lightautoml/exec.py
  • frameworks/oboe/exec.py
  • frameworks/RandomForest/exec.py
  • frameworks/mljarsupervised/exec.py
  • frameworks/TunedRandomForest/exec.py
  • frameworks/SapientML/exec.py
  • frameworks/MLPlan/exec.py
  • frameworks/TPOT/exec.py
  • frameworks/hyperoptsklearn/exec.py
  • frameworks/autosklearn/exec.py
  • frameworks/flaml/exec.py
  • frameworks/FEDOT/exec_ts.py
  • frameworks/AutoGluon/exec_ts.py
  • frameworks/H2OAutoML/exec.py
  • frameworks/AutoGluon/exec.py
  • examples/custom/extensions/GradientBoosting/exec.py
  • frameworks/NaiveAutoML/exec.py
  • frameworks/FEDOT/exec.py
⏰ Context from checks skipped due to timeout of 90000ms (5)
  • GitHub Check: randomforest/iris
  • GitHub Check: randomforest/cholesterol
  • GitHub Check: randomforest/kc2
  • GitHub Check: pytest
  • GitHub Check: run_benchmark


coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
.github/workflows/run_all_frameworks.yml (1)

167-172: Remove temporary "Check Things" debug step.

This step listing venv directories is useful only during troubleshooting and clutters CI logs. Consider removing it or gating it behind a debug flag once venv setup issues are resolved.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7953856 and a08b155.

📒 Files selected for processing (1)
  • .github/workflows/run_all_frameworks.yml (1 hunks)


coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
.github/workflows/run_all_frameworks.yml (1)

167-173: Eliminate Useless echo and Quote Command Substitution
In the diagnostic step, echo $(pwd) is redundant—pwd alone suffices—and the unquoted command substitution could trigger word splitting. Replace:

echo $(pwd)

with:

pwd

This removes the unnecessary echo and avoids potential splitting issues.

🧰 Tools
🪛 actionlint (1.7.7)

168-168: shellcheck reported issue in this script: SC2046:warning:4:6: Quote this to prevent word splitting

(shellcheck)


168-168: shellcheck reported issue in this script: SC2005:style:4:6: Useless echo? Instead of 'echo $(cmd)', just use 'cmd'

(shellcheck)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a08b155 and cf8ca91.

📒 Files selected for processing (2)
  • .github/workflows/run_all_frameworks.yml (1 hunks)
  • frameworks/shared/setup.sh (1 hunks)
🧰 Additional context used
🪛 actionlint (1.7.7)
.github/workflows/run_all_frameworks.yml

168-168: shellcheck reported issue in this script: SC2046:warning:4:6: Quote this to prevent word splitting

(shellcheck)


168-168: shellcheck reported issue in this script: SC2005:style:4:6: Useless echo? Instead of 'echo $(cmd)', just use 'cmd'

(shellcheck)

⏰ Context from checks skipped due to timeout of 90000ms (3)
  • GitHub Check: randomforest/kc2
  • GitHub Check: randomforest/cholesterol
  • GitHub Check: randomforest/iris

Comment on lines 50 to 62
REQ_FILE="$SHARED_DIR/requirements.txt"

for line in $(grep -vE '^\s*#' "$REQ_FILE" | grep -vE '^\s*$'); do
    pkg=$(echo "$line" | sed -E 's/[=><~!].*$//')
    # In a line like "numpy==1.12.0", pkg=numpy and line is the whole line

    if ! PY -c "import $pkg" &> /dev/null; then
        echo "$pkg not found. Installing from requirements.txt..."
        PIP install --no-cache-dir "$line"
    else
        echo "$pkg is already installed by the framework, using that instead."
    fi
done


🛠️ Refactor suggestion

Avoid word-splitting in the requirements loop
The for line in $(grep ...) construct will split on whitespace and may mis-handle requirements with spaces or special characters. Switch to a while IFS= read -r line pattern to preserve each line intact and eliminate the extra subshell:

- for line in $(grep -vE '^\s*#' "$REQ_FILE" | grep -vE '^\s*$'); do
+ while IFS= read -r line; do
-     pkg=$(echo "$line" | sed -E 's/[=><~!].*$//')
+     pkg=${line%%[=><~!]*}
      if ! PY -c "import $pkg" &> /dev/null; then
          echo "$pkg not found. Installing from requirements.txt..."
          PIP install --no-cache-dir "$line"
      else
          echo "$pkg is already installed by the framework, using that instead."
      fi
- done
+ done < <(grep -vE '^\s*#' "$REQ_FILE" | grep -vE '^\s*$')

This approach avoids splitting issues, reduces forking (sed can be replaced with parameter expansion), and is more robust.

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In frameworks/shared/setup.sh around lines 50 to 62, the loop uses `for line in $(grep ...)`, which causes word-splitting and can mishandle requirements with spaces or special characters. Replace this loop with a `while IFS= read -r line` construct to read each line intact without splitting. Also, remove the external `sed` command by using shell parameter expansion to extract the package name from each line. This will make the script more robust and efficient.
