emilie-wang
Contributor

Parallelizes manifest processing to improve performance for large tables with many manifest files. After parallel processing, merges the resulting partition maps to produce the final aggregated result.
Previous example ref: e937f6a
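For readers unfamiliar with the pattern, here is a minimal sketch of the approach described above: each manifest is aggregated into its own local partition map on a worker thread, and the local maps are merged afterwards in the calling thread. It uses only the standard library. The names `process_manifest`, `merge_partition_maps`, `aggregate_partitions`, and the tuple layout of a manifest entry are illustrative assumptions, not the actual PyIceberg code in this PR.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Dict, Iterable, List, Tuple

# Hypothetical shapes, chosen only for this sketch.
PartitionKey = Tuple[str, Any]
PartitionMap = Dict[PartitionKey, Dict[str, int]]
ManifestEntry = Tuple[PartitionKey, int, int]  # (partition key, record count, file count)


def process_manifest(entries: Iterable[ManifestEntry]) -> PartitionMap:
    """Aggregate the entries of one manifest into a local partition map."""
    local: PartitionMap = {}
    for key, record_count, file_count in entries:
        row = local.setdefault(key, {"record_count": 0, "file_count": 0})
        row["record_count"] += record_count
        row["file_count"] += file_count
    return local


def merge_partition_maps(maps: Iterable[PartitionMap]) -> PartitionMap:
    """Combine per-manifest maps by summing the counters for each partition."""
    merged: PartitionMap = {}
    for local in maps:
        for key, row in local.items():
            target = merged.setdefault(key, {"record_count": 0, "file_count": 0})
            target["record_count"] += row["record_count"]
            target["file_count"] += row["file_count"]
    return merged


def aggregate_partitions(manifests: List[List[ManifestEntry]], max_workers: int = 4) -> PartitionMap:
    """Process manifests in parallel, then merge the results in the calling thread."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        per_manifest = list(executor.map(process_manifest, manifests))
    return merge_partition_maps(per_manifest)
```

Because every worker writes only to its own dictionary, no locking is needed; the only shared state is built during the single-threaded merge step.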

Rationale for this change

Performance improvement.
We experienced slowness when calling table.inspect.partitions() on large tables.

Are these changes tested?

Yes.

Are there any user-facing changes?

No.

@emilie-wang
Contributor Author

Hey @jayceslesar, could you please take a look at this PR? I used this PR as a reference and wanted to apply the same approach to inspect.partitions to speed up the query. Thanks.

Contributor

@Fokko Fokko left a comment

Hey @emilie-wang, thanks for speeding this up. While we're at it, I think we should also do a minor refactor to keep everything readable.

partitions_map: Dict[Tuple[str, Any], Any] = {}
snapshot = self._get_snapshot(snapshot_id)
for manifest in snapshot.manifests(self.tbl.io):
def process_manifest(manifest: ManifestFile) -> Dict[Tuple[str, Any], Any]:
Contributor

Since we're at it, I would suggest two things:

  • Move the inline function to the class level and prefix its name with an underscore (_process_manifest) to indicate that it is private.
  • Merge this function with update_partitions_map, since that function isn't used anywhere else.
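A rough sketch of what the two suggestions amount to, assuming a simplified stand-in class: the worker becomes a private method at class level, and the map-update logic lives inside it instead of in a separate helper. `PartitionInspector`, its constructor, and the `(partition, record_count)` entry layout are hypothetical and only illustrate the shape of the refactor, not the real class in the codebase.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Tuple

# Hypothetical entry layout: (partition name, record count).
Entry = Tuple[str, int]


class PartitionInspector:
    """Hypothetical stand-in for the class under review."""

    def __init__(self, manifests: List[List[Entry]]) -> None:
        self._manifests = manifests

    def _process_manifest(self, manifest: List[Entry]) -> Dict[str, int]:
        """Private, class-level worker; the map update is folded into it."""
        partitions_map: Dict[str, int] = {}
        for partition, record_count in manifest:
            partitions_map[partition] = partitions_map.get(partition, 0) + record_count
        return partitions_map

    def partitions(self) -> Dict[str, int]:
        """Run the worker per manifest in parallel and merge the local maps."""
        merged: Dict[str, int] = {}
        with ThreadPoolExecutor() as executor:
            for local in executor.map(self._process_manifest, self._manifests):
                for partition, count in local.items():
                    merged[partition] = merged.get(partition, 0) + count
        return merged
```

Keeping the worker as a method rather than a nested function makes it easier to test in isolation, and the leading underscore signals that callers should not rely on it.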

Contributor Author

Hi @Fokko, thank you for the review. I've updated the PR with the suggested refactoring.

@Fokko Fokko merged commit 8db086d into apache:main Aug 20, 2025
10 checks passed
@Fokko
Contributor

Fokko commented Aug 20, 2025

Thanks for fixing this @emilie-wang 🙌

@emilie-wang emilie-wang deleted the hanzhi/inspect-partitions branch August 21, 2025 17:15