
Refactor and optimize the Export metadata framework, especially for the data/variable-level metadata #11405

Open
@landreev

Historically, Dataverse has supported one format that encodes all the available DataVariable-level metadata from DataTable objects: the "full DDI". Our Exporters write the metadata into an OutputStream, rather than returning it as an object in memory, precisely to accommodate potentially very large formats like the DDI. Even so, this export is still prohibitively expensive in terms of memory use, since it receives the entire dataset's worth of variable-level metadata from ExportDataProvider.getDatasetFileDetails() as one big JsonArray.

I would like to address this similarly to how we added offset-length parameters to the /versions and /versions//files APIs for the SPA, and make it possible for getDatasetFileDetails() to page through the files/datatables in smaller batches.
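To make the paging idea concrete, here is a minimal sketch in plain Java. Everything in it is an assumption for illustration: the PagedFileDetailsProvider interface, its getFileDetails(offset, length) method, and the use of plain strings in place of per-file JSON objects are hypothetical and are not part of the existing ExportDataProvider API. The point is only that the exporter holds one batch in memory at a time while writing to the OutputStream.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class PagedExportSketch {

    /** Hypothetical paged provider; NOT the real ExportDataProvider interface. */
    interface PagedFileDetailsProvider {
        /** Returns at most `length` entries starting at `offset`; empty when exhausted. */
        List<String> getFileDetails(int offset, int length);
    }

    /** Pulls file details batch by batch and writes each entry as it arrives. */
    static int exportInBatches(PagedFileDetailsProvider provider, int batchSize, OutputStream out) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        int offset = 0;
        int written = 0;
        try {
            while (true) {
                List<String> batch = provider.getFileDetails(offset, batchSize);
                if (batch.isEmpty()) {
                    break; // nothing left to export
                }
                for (String details : batch) {
                    out.write((details + "\n").getBytes(StandardCharsets.UTF_8));
                    written++;
                }
                offset += batch.size();
                if (batch.size() < batchSize) {
                    break; // short batch means this was the last page
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return written;
    }

    /** In-memory stand-in for a database-backed provider, used only to exercise the loop. */
    static PagedFileDetailsProvider inMemoryProvider(List<String> all) {
        return (offset, length) -> {
            if (offset >= all.size()) {
                return new ArrayList<>();
            }
            return new ArrayList<>(all.subList(offset, Math.min(all.size(), offset + length)));
        };
    }
}
```

A real implementation would presumably return JSON per file and run one query per batch; the batch size would then bound peak memory rather than the size of the dataset.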

Once the DataExporter is refactored, the croissant exporter in the gdcc repo could be refactored as well. Just like the DDI, it encodes the datavariable information. It's a little bit worse, because it doesn't stream its output either; instead it accumulates the entire JSON object in memory, then writes it all at once with outputStream.write(job.build().toString().getBytes("UTF8")). But it should be very doable to make it stream instead.
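As a rough illustration of the difference between building the whole object and streaming it, the sketch below writes a JSON document piece by piece as the variables are iterated. It is deliberately dependency-free, so it hand-rolls the output with a Writer; in the actual exporter a streaming JSON API (for example the JsonGenerator that ships with the same JSON-P API as JsonObjectBuilder) would be the more natural choice. The class name, method names, and the "name"/"variables" document shape are placeholders, not the Croissant schema.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.Iterator;

class StreamingJsonSketch {

    /** Minimal JSON string escaping, sufficient for this illustration. */
    static String quote(String s) {
        StringBuilder sb = new StringBuilder("\"");
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) {
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.append('"').toString();
    }

    /**
     * Writes each variable to the stream as soon as the iterator yields it,
     * so only one entry needs to be held in memory at a time.
     */
    static void writeVariables(String datasetName, Iterator<String> variableNames, OutputStream out) {
        try {
            Writer w = new BufferedWriter(new OutputStreamWriter(out, StandardCharsets.UTF_8));
            w.write("{\"name\":" + quote(datasetName) + ",\"variables\":[");
            boolean first = true;
            while (variableNames.hasNext()) {
                if (!first) {
                    w.write(",");
                }
                w.write("{\"name\":" + quote(variableNames.next()) + "}");
                first = false;
            }
            w.write("]}");
            w.flush(); // flush, but leave closing the stream to the caller
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Because the method takes an Iterator rather than a collection, it could in principle be fed directly from a paged source, so neither the input nor the output is ever fully materialized.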

My secondary, less pressing concern is that I would like the main getDatasetJson() method to have an option to skip the files info. (From what I can tell, InternalExportDataProvider.getDatasetJson() always calls JsonPrinter.jsonWithCitation(DatasetVersion dsv, boolean includeFiles) with includeFiles=true.) If we are going to be exporting formats individually, it is wasteful to pack thousands of files, or more, into the JSON when exporting a cheap format like oai_dc that does not need them.
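One possible shape for such an option, sketched under assumptions: the file list is supplied lazily and only loaded when the caller asks for it. The class, the datasetJson method, and the use of a Map in place of a real JSON object are all hypothetical; the existing getDatasetJson() takes no such parameter.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

class DatasetJsonOptionsSketch {

    /**
     * Builds a dataset-level structure; the (potentially huge) file list is
     * only loaded when includeFiles is true. Hypothetical signature.
     */
    static Map<String, Object> datasetJson(String title,
                                           Supplier<List<String>> fileLoader,
                                           boolean includeFiles) {
        Map<String, Object> json = new LinkedHashMap<>();
        json.put("title", title);
        if (includeFiles) {
            // The supplier is invoked only on this branch, so a cheap format
            // such as oai_dc never pays for loading the files.
            json.put("files", fileLoader.get());
        }
        return json;
    }
}
```

An exporter that does not need file-level information would call this with includeFiles set to false and the loader would never run.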

Metadata

Labels

FY25 Sprint 24 (2025-05-21 - 2025-06-04)
FY25 Sprint 25 (2025-06-04 - 2025-06-18)
FY25 Sprint 26 (2025-06-18 - 2025-07-02)
Size: 80 (a percentage of a sprint; 56 hours)

Projects

Status

In Progress 💻