Refactor and optimize the Export metadata framework, especially for the data/variable-level metadata

Historically, Dataverse has supported one format that encodes all the available DataVariable-level metadata from DataTable objects: the "full DDI". Our Exporters write the metadata into an OutputStream, rather than return it as an object in memory, precisely to accommodate potentially very large formats like the DDI. Even so, this export is still prohibitively expensive in terms of memory use, because it receives the entire dataset's worth of variable-level metadata from `ExportDataProvider.getDatasetFileDetails()` as one big JsonArray.

I would like to address this similarly to how we added offset/length parameters to the /versions and /versions//files APIs for the SPA, and make it possible for `getDatasetFileDetails()` to page through the files/datatables in smaller batches.
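To make the paging idea concrete, here is a minimal sketch of what a batched variant could look like. Everything in it is an assumption for illustration: the `PagedFileDetailsProvider` interface, the `(offset, limit)` signature, and the use of plain strings in place of the per-file JSON are hypothetical and are not the existing `ExportDataProvider` API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

public class PagedFileDetailsSketch {

    /** Hypothetical paged variant of the data provider; NOT the existing Dataverse API. */
    interface PagedFileDetailsProvider {
        /** Returns at most {@code limit} entries starting at {@code offset}; empty once exhausted. */
        List<String> getDatasetFileDetails(int offset, int limit);
    }

    /** In-memory stand-in for a dataset's per-file metadata, used only for illustration. */
    static PagedFileDetailsProvider inMemory(List<String> files) {
        return (offset, limit) -> {
            if (offset >= files.size()) {
                return List.of();
            }
            int end = Math.min(files.size(), offset + limit);
            return new ArrayList<>(files.subList(offset, end));
        };
    }

    /**
     * Feeds every entry to the consumer while holding only one batch at a time.
     * Returns the number of non-empty batches that were fetched.
     */
    static int forEachFile(PagedFileDetailsProvider provider, int batchSize, Consumer<String> consumer) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        int offset = 0;
        int batches = 0;
        while (true) {
            List<String> batch = provider.getDatasetFileDetails(offset, batchSize);
            if (batch.isEmpty()) {
                break; // no more files to export
            }
            batch.forEach(consumer); // e.g. write this batch's variables to the OutputStream
            offset += batch.size();
            batches++;
        }
        return batches;
    }

    public static void main(String[] args) {
        PagedFileDetailsProvider provider = inMemory(List.of("f1", "f2", "f3", "f4", "f5"));
        int batches = forEachFile(provider, 2, f -> System.out.println("exporting " + f));
        System.out.println("batches fetched: " + batches);
    }
}
```

With this shape, an exporter would only ever hold `batchSize` files' worth of variable metadata, rather than the whole dataset's.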

Once the DataExporter is refactored, the Croissant exporter in the gdcc repo could be refactored as well. Like the DDI, it encodes the DataVariable information. It is actually a little worse, because it doesn't stream its output either: it accumulates the entire JSON object in memory, then writes it all at once with `outputStream.write(job.build().toString().getBytes("UTF8"));`. But it should be very doable to make it stream instead.
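As a rough illustration of the difference, the sketch below writes a JSON array element by element to the `OutputStream` instead of building the whole object first. It is stdlib-only so it can run on its own; in an actual exporter, a streaming writer such as `jakarta.json.stream.JsonGenerator` (obtained from `Json.createGenerator(OutputStream)`) would be the more natural tool. The `variables` and `name` keys and the `writeVariables` helper are made up for the example and do not reflect the real Croissant structure.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.Iterator;
import java.util.List;

public class StreamingJsonSketch {

    /** Minimal JSON string quoting: escapes quotes, backslashes and control characters. */
    static String quote(String s) {
        StringBuilder sb = new StringBuilder("\"");
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:
                    if (c < 0x20) {
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.append('"').toString();
    }

    /**
     * Writes {"variables":[{"name":...},...]} one element at a time.
     * Only the current element is held in memory; the key names are hypothetical.
     */
    static void writeVariables(Iterator<String> variableNames, OutputStream out) throws IOException {
        Writer w = new OutputStreamWriter(out, StandardCharsets.UTF_8);
        w.write("{\"variables\":[");
        boolean first = true;
        while (variableNames.hasNext()) {
            if (!first) {
                w.write(",");
            }
            w.write("{\"name\":" + quote(variableNames.next()) + "}");
            first = false;
        }
        w.write("]}");
        w.flush(); // flush but do not close: the caller owns the stream
    }

    /** Convenience wrapper for the demo: renders into a byte buffer and returns the text. */
    static String render(List<String> variableNames) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try {
            writeVariables(variableNames.iterator(), buffer);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(render(List.of("age", "income")));
    }
}
```

Because `writeVariables` takes an `Iterator`, it could be fed from a batched source rather than from a fully materialized list.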

My secondary, less pressing concern is that I would like an option for the main `getDatasetJson()` method to skip the files info. (From what I can tell, `InternalExportDataProvider.getDatasetJson()` always calls `JsonPrinter.jsonWithCitation(DatasetVersion dsv, boolean includeFiles)` with `includeFiles=true`.) If we are going to be exporting formats individually, it is wasteful to pack thousands of files, or more, into the JSON when exporting a cheap format like oai_dc that does not need them.

Refactor and optimize the Export metadata framework, especially for the data/variable-level metadata #11405

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Refactor and optimize the Export metadata framework, especially for the data/variable-level metadata #11405

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions