
Conversation


@PGijsbers (Collaborator) commented Jun 10, 2025

In the ideal scenario, there are only two formats left in the end:

  • parquet for tabular data
  • json for everything else

but we might need to support sparse data through scipy.
In any case, I hope to avoid situations where pickle is used, since that leaks dependencies between the libraries used by the benchmark software and the integration scripts.
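
As a rough illustration of that policy (not code from this PR; the helper name save_result is made up, and parquet output assumes pyarrow or fastparquet is available), the dispatch could look like:

```python
# Illustrative sketch only: parquet for tabular data, scipy's .npz for sparse
# matrices, JSON for everything else, and never pickle.
from pathlib import Path
import json

import numpy as np
import pandas as pd
import scipy.sparse


def save_result(data, path: Path) -> Path:
    if isinstance(data, pd.DataFrame):
        out = path.with_suffix(".parquet")
        data.to_parquet(out)                  # tabular data -> parquet
    elif scipy.sparse.issparse(data):
        out = path.with_suffix(".npz")
        scipy.sparse.save_npz(out, data)      # sparse data -> scipy's native format
    else:
        out = path.with_suffix(".json")
        with open(out, "w") as f:
            json.dump(np.asarray(data).tolist(), f)  # everything else -> json
    return out
```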

Summary by CodeRabbit

  • Refactor

    • The output file path is no longer passed as an argument to result functions across multiple frameworks; instead, it is conditionally included in the result data.
    • Serialization of NumPy arrays now avoids using pickle, improving security by serializing object arrays as JSON lists and restricting pandas serializers to "parquet" and "json" formats only.
  • Chores

    • Removed the configuration option for NumPy pickle allowance, streamlining serialization settings.
    • Updated package installation script to conditionally install requirements based on import checks, reducing unnecessary installs.
  • Tests

    • Removed multiple tests for pickle serialization and sparse DataFrame deserialization to reflect updated serialization methods.
  • CI

    • Added diagnostic steps in the GitHub Actions workflow to verify virtual environment contents and coverage tool availability.


coderabbitai bot commented Jun 10, 2025

Warning

Rate limit exceeded

@PGijsbers has exceeded the limit for the number of commits or files that can be reviewed per hour. Please wait 12 minutes and 24 seconds before requesting another review.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

📥 Commits

Reviewing files that changed from the base of the PR and between 89ff448 and af99117.

📒 Files selected for processing (1)
  • .github/workflows/run_all_frameworks.yml (3 hunks)

Walkthrough

This update removes the output_file parameter from various result and prediction-saving function calls across multiple framework and extension scripts. The result function in frameworks/shared/callee.py is updated to always include the output_file key in the result dictionary using the configuration value, and its signature no longer accepts an output_file argument. Additionally, serialization logic is updated to remove the numpy_allow_pickle configuration and enforce safer NumPy array serialization by avoiding pickle for object dtype arrays. Several tests related to pickle serialization were removed. The GitHub Actions workflow was enhanced with diagnostic steps for environment and coverage tool inspection. The shared setup script now installs requirements conditionally based on import checks rather than bulk installation.
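
A minimal sketch of the described result() change (the real function in frameworks/shared/callee.py may differ in signature and return type; the config parameter stands in for the framework configuration namespace):

```python
# Callers no longer pass output_file; result() fills it in from the config.
def result(predictions=None, config=None, **others):
    res = dict(predictions=predictions, **others)
    # "output_file" now comes from the configuration, not from an argument
    res.setdefault("output_file", config.output_predictions_file)
    return res

# A framework script would then simply call, e.g.:
# return result(output_predictions, models_count=len(models))  # illustrative
```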

Changes

| File(s) | Change Summary |
| --- | --- |
| amlb/utils/serialization.py | Removed numpy_allow_pickle config; updated NumPy serialization to avoid pickle by serializing object dtype arrays as JSON lists; pandas serialization restricted to "parquet" and "json"; deserialization no longer uses allow_pickle. |
| frameworks/shared/callee.py | Removed output_file parameter from result function signature; result now ensures "output_file" key is set in result dict using config value; added error log on uncaught exceptions in call_run. |
| examples/custom/extensions/GradientBoosting/exec.py | Removed explicit output_file argument from save_predictions call. |
| examples/custom/extensions/Stacking/exec.py | Removed explicit output_file argument from result() call. |
| frameworks/AutoGluon/exec.py, frameworks/AutoGluon/exec_ts.py, frameworks/FEDOT/exec.py, frameworks/FEDOT/exec_ts.py, frameworks/GAMA/exec.py, frameworks/H2OAutoML/exec.py, frameworks/MLPlan/exec.py, frameworks/NaiveAutoML/exec.py, frameworks/RandomForest/exec.py, frameworks/SapientML/exec.py, frameworks/TPOT/exec.py, frameworks/TunedRandomForest/exec.py, frameworks/autosklearn/exec.py, frameworks/flaml/exec.py, frameworks/hyperoptsklearn/exec.py, frameworks/lightautoml/exec.py, frameworks/mljarsupervised/exec.py, frameworks/oboe/exec.py | Removed explicit output_file=config.output_predictions_file argument from result() function calls; output file handling is now internal or implicit. |
| tests/unit/amlb/utils/serialization/test_serializers.py | Removed tests related to pickle-based serialization and deserialization, including sparse pandas DataFrame pickle tests. |
| .github/workflows/run_all_frameworks.yml | Added a new "Check Things" step to list contents and permissions of the virtual environment directories and print the current working directory; added diagnostic commands for coverage tool availability. |
| frameworks/shared/setup.sh | Changed pip install from bulk to per-package conditional install based on import checks; skips installation if a package is already importable (see the sketch after this table). |
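
The setup.sh change in the last row amounts to an "install only if not importable" check. A hedged Python rendering of the same idea (the actual script does this in bash with pip; ensure_installed is an illustrative name, and, as in the shell version, it assumes the distribution name matches the import name):

```python
import importlib.util
import re
import subprocess
import sys


def ensure_installed(requirement: str) -> None:
    # "numpy==1.26.0" -> "numpy", mirroring the sed expression in setup.sh
    package = re.split(r"[=><~!]", requirement, maxsplit=1)[0].strip()
    if importlib.util.find_spec(package) is None:
        print(f"{package} not found. Installing from requirements.txt...")
        subprocess.run(
            [sys.executable, "-m", "pip", "install", "--no-cache-dir", requirement],
            check=True,
        )
    else:
        print(f"{package} is already installed by the framework, using that instead.")
```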

Sequence Diagram(s)

sequenceDiagram
    participant Framework as Framework Script
    participant Callee as callee.result()
    participant Config as Config Namespace
    participant File as Result File

    Framework->>Callee: result(predictions, ... )
    Callee->>Config: Access config.output_predictions_file
    Callee->>Callee: Add "output_file" key to result dict if missing
    Callee->>File: Dump result dict as JSON
    Callee-->>Framework: Return result dict
sequenceDiagram
    participant serialize_data
    participant np as numpy
    participant json as json_dump

    serialize_data->>np: np.save(path, data, allow_pickle=False)
    np-->>serialize_data: Success or ValueError (if object dtype)
    alt object dtype array
        serialize_data->>serialize_data: squeeze array
        serialize_data->>json: json_dump(data.tolist(), path)
    else non-object dtype array
        serialize_data->>np: np.save(path, data, allow_pickle=False)
    end
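
Translated into a standalone sketch (the real helper is serialize_data in amlb/utils/serialization.py; the function name and path handling below are illustrative assumptions, not the actual implementation):

```python
import json
import numpy as np


def serialize_array(arr: np.ndarray, path: str) -> str:
    try:
        np.save(path, arr, allow_pickle=False)  # np.save appends .npy if missing
        return path + ".npy"
    except ValueError:
        # raised for object-dtype arrays, which would otherwise require pickle;
        # fall back to writing the (squeezed) array as a JSON list instead
        json_path = path + ".json"
        with open(json_path, "w") as f:
            json.dump(np.squeeze(arr).tolist(), f)
        return json_path
```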

Poem

A bunny hopped through code today,
Tidied up how results find their way.
No more output files to pass around,
Now config keeps them safe and sound.
Serialization’s safer too—
Pickles gone, just clean and new!
🐇✨




codecov-commenter commented Jun 10, 2025

Codecov Report

Attention: Patch coverage is 0% with 13 lines in your changes missing coverage. Please review.

Project coverage is 51.10%. Comparing base (74d03cb) to head (0f93348).
Report is 1 commit behind head on master.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| amlb/utils/serialization.py | 0.00% | 11 Missing ⚠️ |
| frameworks/shared/callee.py | 0.00% | 2 Missing ⚠️ |
Additional details and impacted files
@@             Coverage Diff             @@
##           master     #710       +/-   ##
===========================================
- Coverage   69.46%   51.10%   -18.36%     
===========================================
  Files          57       36       -21     
  Lines        6821     5262     -1559     
===========================================
- Hits         4738     2689     -2049     
- Misses       2083     2573      +490     

☔ View full report in Codecov by Sentry.


coderabbitai bot left a comment


Actionable comments posted: 2

🔭 Outside diff range comments (1)
amlb/utils/serialization.py (1)

219-235: ⚠️ Potential issue

Reference to removed config.numpy_pickle breaks .npz loading

numpy_allow_pickle was removed, yet deserialize_data() still uses it:

with np.load(path, allow_pickle=config.numpy_pickle) as loaded:

This raises AttributeError.
Given the new policy (“no pickle”), simply drop the arg:

-with np.load(path, allow_pickle=config.numpy_pickle) as loaded:
-    return loaded
+with np.load(path, allow_pickle=False) as loaded:
+    return loaded

Optionally expose an explicit allow_pickle flag in ser_config if future flexibility is required.
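
For context, a quick standalone illustration of what the allow_pickle=False flag guards against (not project code; the file name is arbitrary):

```python
import numpy as np

arr = np.array([{"a": 1}], dtype=object)    # object dtype can only be saved via pickle
np.save("obj.npy", arr, allow_pickle=True)

try:
    np.load("obj.npy", allow_pickle=False)
except ValueError as err:
    # numpy refuses the pickled payload, roughly:
    # "Object arrays cannot be loaded when allow_pickle=False"
    print(err)
```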

🧰 Tools
🪛 Pylint (3.3.7)

[refactor] 219-269: Unnecessary "elif" after "return", remove the leading "el" from "elif"

(R1705)


[refactor] 225-234: Unnecessary "else" after "return", remove the "else" and de-indent the code inside it

(R1705)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between dfe8d21 and 526da41.

📒 Files selected for processing (22)
  • amlb/utils/serialization.py (3 hunks)
  • examples/custom/extensions/GradientBoosting/exec.py (0 hunks)
  • examples/custom/extensions/Stacking/exec.py (0 hunks)
  • frameworks/AutoGluon/exec.py (0 hunks)
  • frameworks/AutoGluon/exec_ts.py (0 hunks)
  • frameworks/FEDOT/exec.py (0 hunks)
  • frameworks/FEDOT/exec_ts.py (0 hunks)
  • frameworks/GAMA/exec.py (0 hunks)
  • frameworks/H2OAutoML/exec.py (0 hunks)
  • frameworks/MLPlan/exec.py (0 hunks)
  • frameworks/NaiveAutoML/exec.py (0 hunks)
  • frameworks/RandomForest/exec.py (0 hunks)
  • frameworks/SapientML/exec.py (0 hunks)
  • frameworks/TPOT/exec.py (0 hunks)
  • frameworks/TunedRandomForest/exec.py (0 hunks)
  • frameworks/autosklearn/exec.py (0 hunks)
  • frameworks/flaml/exec.py (0 hunks)
  • frameworks/hyperoptsklearn/exec.py (0 hunks)
  • frameworks/lightautoml/exec.py (0 hunks)
  • frameworks/mljarsupervised/exec.py (0 hunks)
  • frameworks/oboe/exec.py (0 hunks)
  • frameworks/shared/callee.py (1 hunks)
💤 Files with no reviewable changes (20)
  • examples/custom/extensions/Stacking/exec.py
  • frameworks/GAMA/exec.py
  • frameworks/lightautoml/exec.py
  • frameworks/oboe/exec.py
  • frameworks/RandomForest/exec.py
  • frameworks/mljarsupervised/exec.py
  • frameworks/TunedRandomForest/exec.py
  • frameworks/SapientML/exec.py
  • frameworks/MLPlan/exec.py
  • frameworks/TPOT/exec.py
  • frameworks/hyperoptsklearn/exec.py
  • frameworks/autosklearn/exec.py
  • frameworks/flaml/exec.py
  • frameworks/FEDOT/exec_ts.py
  • frameworks/AutoGluon/exec_ts.py
  • frameworks/H2OAutoML/exec.py
  • frameworks/AutoGluon/exec.py
  • examples/custom/extensions/GradientBoosting/exec.py
  • frameworks/NaiveAutoML/exec.py
  • frameworks/FEDOT/exec.py
⏰ Context from checks skipped due to timeout of 90000ms (5)
  • GitHub Check: randomforest/iris
  • GitHub Check: randomforest/cholesterol
  • GitHub Check: randomforest/kc2
  • GitHub Check: pytest
  • GitHub Check: run_benchmark


coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
.github/workflows/run_all_frameworks.yml (1)

167-172: Remove temporary "Check Things" debug step.

This step listing venv directories is useful only during troubleshooting and clutters CI logs. Consider removing it or gating it behind a debug flag once venv setup issues are resolved.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7953856 and a08b155.

📒 Files selected for processing (1)
  • .github/workflows/run_all_frameworks.yml (1 hunks)


coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
.github/workflows/run_all_frameworks.yml (1)

167-173: Eliminate Useless echo and Quote Command Substitution
In the diagnostic step, echo $(pwd) is redundant—pwd alone suffices—and the unquoted command substitution could trigger word splitting. Replace:

echo $(pwd)

with:

pwd

This removes the unnecessary echo and avoids potential splitting issues.

🧰 Tools
🪛 actionlint (1.7.7)

168-168: shellcheck reported issue in this script: SC2046:warning:4:6: Quote this to prevent word splitting

(shellcheck)


168-168: shellcheck reported issue in this script: SC2005:style:4:6: Useless echo? Instead of 'echo $(cmd)', just use 'cmd'

(shellcheck)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a08b155 and cf8ca91.

📒 Files selected for processing (2)
  • .github/workflows/run_all_frameworks.yml (1 hunks)
  • frameworks/shared/setup.sh (1 hunks)
🧰 Additional context used
🪛 actionlint (1.7.7)
.github/workflows/run_all_frameworks.yml

168-168: shellcheck reported issue in this script: SC2046:warning:4:6: Quote this to prevent word splitting

(shellcheck)


168-168: shellcheck reported issue in this script: SC2005:style:4:6: Useless echo? Instead of 'echo $(cmd)', just use 'cmd'

(shellcheck)

⏰ Context from checks skipped due to timeout of 90000ms (3)
  • GitHub Check: randomforest/kc2
  • GitHub Check: randomforest/cholesterol
  • GitHub Check: randomforest/iris

Comment on lines 50 to 62
REQ_FILE="$SHARED_DIR/requirements.txt"

for line in $(grep -vE '^\s*#' "$REQ_FILE" | grep -vE '^\s*$'); do
    pkg=$(echo "$line" | sed -E 's/[=><~!].*$//')
    # In a line like "numpy==1.12.0", pkg=numpy and line is the whole line

    if ! PY -c "import $pkg" &> /dev/null; then
        echo "$pkg not found. Installing from requirements.txt..."
        PIP install --no-cache-dir "$line"
    else
        echo "$pkg is already installed by the framework, using that instead."
    fi
done


🛠️ Refactor suggestion

Avoid word-splitting in the requirements loop
The for line in $(grep ...) construct will split on whitespace and may mis-handle requirements with spaces or special characters. Switch to a while IFS= read -r line pattern to preserve each line intact and eliminate the extra subshell:

- for line in $(grep -vE '^\s*#' "$REQ_FILE" | grep -vE '^\s*$'); do
+ while IFS= read -r line; do
-     pkg=$(echo "$line" | sed -E 's/[=><~!].*$//')
+     pkg=${line%%[=><~!]*}
      if ! PY -c "import $pkg" &> /dev/null; then
          echo "$pkg not found. Installing from requirements.txt..."
          PIP install --no-cache-dir "$line"
      else
          echo "$pkg is already installed by the framework, using that instead."
      fi
- done
+ done < <(grep -vE '^\s*#' "$REQ_FILE" | grep -vE '^\s*$')

This approach avoids splitting issues, reduces forking (sed can be replaced with parameter expansion), and is more robust.

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In frameworks/shared/setup.sh around lines 50 to 62, the loop uses `for line in $(grep ...)`, which causes word-splitting and can mishandle requirements with spaces or special characters. Replace this loop with a `while IFS= read -r line` construct to read each line intact without splitting. Also, remove the external `sed` command by using shell parameter expansion to extract the package name from each line. This will make the script more robust and efficient.
