Skip to content

fix(ingest/rest): out-of-date structured report being sent #13866

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 17 commits into from
Jun 27, 2025

Conversation

anshbansal
Copy link
Collaborator

@anshbansal anshbansal commented Jun 25, 2025

  • Fix a bug which was causing structured report to have outdated information. Fix is explained here https://www.loom.com/share/122e6a79cd574f7baae3003a982bd378 As this is a multi-threading bug in the underlying structures being used not sure how to add tests for this.

    • the fix is in metadata-ingestion/src/datahub/ingestion/api/sink.py and metadata-ingestion/src/datahub/ingestion/sink/datahub_rest.py which is quite small.
  • Attempt to add an initial CLAUDE.MD so AI agents can work with the repository effectively

  • Add datahub-mock-data source to be able to produce mock data which helped in reproducing this bug reliably. This will be used later on for more testing of ingestion sources. Heavily unit tested

@github-actions github-actions bot added the ingestion PR or Issue related to the ingestion of metadata label Jun 25, 2025
Copy link

codecov bot commented Jun 25, 2025

Codecov Report

Attention: Patch coverage is 92.85714% with 11 lines in your changes missing coverage. Please review.

❌ Unsupported file format

Upload processing failed due to unsupported file format. Please review the parser error message:
Error deserializing json

Caused by:
expected value at line 1 column 1

For more help, visit our troubleshooting guide.

Files with missing lines Patch % Lines
...gestion/src/datahub/ingestion/sink/datahub_rest.py 25.00% 6 Missing ⚠️
...ub/ingestion/source/mock_data/datahub_mock_data.py 95.76% 5 Missing ⚠️

📢 Thoughts on this report? Let us know!

@datahub-cyborg datahub-cyborg bot added pending-submitter-response Issue/request has been reviewed but requires a response from the submitter and removed needs-review Label for PRs that need review from a maintainer. labels Jun 26, 2025
@@ -805,6 +805,7 @@
"datahub-gc = datahub.ingestion.source.gc.datahub_gc:DataHubGcSource",
"datahub-debug = datahub.ingestion.source.debug.datahub_debug:DataHubDebugSource",
"datahub-apply = datahub.ingestion.source.apply.datahub_apply:DataHubApplySource",
"datahub-mock-data = datahub.ingestion.source.mock_data.datahub_mock_data:DataHubMockDataSource",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having mechanisms to generate mock data is definitely valuable.
What concerns me is coupling that logic directly with an ingestion source.

Would it make more sense to have a general-purpose DataHub source that ingests arbitrary MCE file(s)? That feels more flexible and reusable.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • Ingesting arbitrary MCE can already be done using datahub ingest mcps FILE_NAME
  • There is already a datahub source which is for migration.

Using the ingestion framework for this allows us to test for various bugs in here and iterate on ingestion framework items which isn't really possible otherwise. This is a separate isolated source which is clearly marked as being there for testing. Unless there are very strong opinions would request this be allowed in.

Co-authored-by: Sergio Gómez Villamor <sgomezvillamor@gmail.com>
@datahub-cyborg datahub-cyborg bot added needs-review Label for PRs that need review from a maintainer. and removed pending-submitter-response Issue/request has been reviewed but requires a response from the submitter labels Jun 27, 2025
Copy link
Contributor

@sgomezvillamor sgomezvillamor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, thats what I meant, that we should have something as generic as datahub ingest mcps FILE_NAME` in the ingestion framework.
While I think that would be a more generic approach, I'm approving to unblock.

@datahub-cyborg datahub-cyborg bot added pending-submitter-merge and removed needs-review Label for PRs that need review from a maintainer. labels Jun 27, 2025
@anshbansal anshbansal merged commit 5759711 into master Jun 27, 2025
73 of 74 checks passed
@anshbansal anshbansal deleted the ab-2025-jun-26-mock-data-source branch June 27, 2025 09:37
anshbansal added a commit that referenced this pull request Jun 27, 2025
Co-authored-by: Sergio Gómez Villamor <sgomezvillamor@gmail.com>
kartikey-visa pushed a commit to kartikey-visa/datahub that referenced this pull request Jul 23, 2025
…roject#13866)

Co-authored-by: Sergio Gómez Villamor <sgomezvillamor@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ingestion PR or Issue related to the ingestion of metadata pending-submitter-merge
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants