Description
Historically, Dataverse has supported one format that encodes all the available DataVariable-level metadata from DataTable objects: the "full DDI". The reason our Exporters write the metadata into an OutputStream, rather than return it as an object in memory, was to accommodate potentially very large formats like the DDI. This export is still prohibitively expensive in terms of memory use, however, since it receives the entire dataset's worth of variable-level metadata from ExportDataProvider.getDatasetFileDetails() as one big JsonArray.
I would like to address this the same way we added offset-length parameters to the /versions and /versions//files APIs for the SPA, and make it possible for getDatasetFileDetails() to page through the files/datatables in smaller batches.
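To make the paging idea concrete, here is a minimal sketch of what a caller could look like. Everything in it is an assumption for illustration: the two-argument getDatasetFileDetails(int offset, int limit) does not exist today, PagedProvider is a hypothetical interface, and plain strings stand in for the per-file JSON entries so that the example runs with the standard library only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

public class PagedFileDetailsSketch {

    /** Hypothetical paged variant of getDatasetFileDetails(); not an existing Dataverse API. */
    public interface PagedProvider {
        // Returns at most `limit` per-file entries starting at `offset`; empty when exhausted.
        List<String> getDatasetFileDetails(int offset, int limit);
    }

    /** Walks all pages, passing each entry to the consumer; returns the number of entries seen. */
    public static int forEachFileDetail(PagedProvider provider, int pageSize, Consumer<String> consumer) {
        if (pageSize <= 0) {
            throw new IllegalArgumentException("pageSize must be positive");
        }
        int offset = 0;
        while (true) {
            List<String> page = provider.getDatasetFileDetails(offset, pageSize);
            if (page.isEmpty()) {
                break; // nothing left
            }
            page.forEach(consumer); // only one page is held in memory at a time
            offset += page.size();
            if (page.size() < pageSize) {
                break; // a short page means this was the last one
            }
        }
        return offset;
    }

    /** In-memory stand-in for a real provider, used only to exercise the loop. */
    public static PagedProvider inMemory(List<String> all) {
        return (offset, limit) -> {
            int from = Math.min(offset, all.size());
            int to = Math.min(from + limit, all.size());
            return new ArrayList<>(all.subList(from, to));
        };
    }

    public static void main(String[] args) {
        List<String> seen = new ArrayList<>();
        int n = forEachFileDetail(inMemory(List.of("f1", "f2", "f3", "f4", "f5")), 2, seen::add);
        System.out.println(n + " " + seen); // prints "5 [f1, f2, f3, f4, f5]"
    }
}
```

With this shape an exporter would write each page to its OutputStream before asking for the next one, so peak memory is bounded by the page size rather than by the size of the dataset.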
Once the DataExporter is refactored, the croissant exporter in the gdcc repo could be refactored as well. Just like the DDI, it encodes the datavariable information. It's a little bit worse, because it doesn't stream its output either; instead it accumulates the entire json object in memory, then writes it all at once as outputStream.write(job.build().toString().getBytes("UTF8")); but it should be very doable to make it stream instead.
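As a rough illustration of the difference between building the whole object and streaming it, the sketch below writes a JSON array to an OutputStream one element at a time. It is not the croissant exporter's code: the class and method names are made up, and the elements are assumed to be already-serialized JSON values so that the example needs nothing beyond the standard library. In the real exporter, the streaming JsonGenerator from the JSON-P API would probably be the more natural tool.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Iterator;
import java.util.List;

public class StreamingArraySketch {

    /**
     * Writes a JSON array element by element. Each item produced by `elements`
     * must already be a serialized JSON value; nothing is buffered here beyond
     * the element currently being written.
     */
    public static void writeJsonArray(Iterator<String> elements, OutputStream out) throws IOException {
        out.write('[');
        boolean first = true;
        while (elements.hasNext()) {
            if (!first) {
                out.write(',');
            }
            out.write(elements.next().getBytes(StandardCharsets.UTF_8));
            first = false;
        }
        out.write(']');
        out.flush();
    }

    /** Convenience wrapper for demos and tests: renders the array into a String. */
    public static String renderToString(List<String> elements) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
            writeJsonArray(elements.iterator(), out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // prints [{"name":"var1"},{"name":"var2"}]
        System.out.println(renderToString(List.of("{\"name\":\"var1\"}", "{\"name\":\"var2\"}")));
    }
}
```

Because writeJsonArray takes an Iterator, the elements can be produced lazily, for example from a paged source, instead of being collected into one in-memory structure first.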
My secondary, less pressing concern is that I want an option for the main getDatasetJson() method to skip the files info. (From what I can tell, InternalExportDataProvider.getDatasetJson() always calls JsonPrinter.jsonWithCitation(DatasetVersion dsv, boolean includeFiles) with includeFiles=true.) If we are going to be exporting formats individually, this may be wasteful too: it packs thousands of files, or worse, into the json when exporting a cheap format like oai_dc that does not need them.
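One backward-compatible way to offer that option might be an overload with a default method, sketched below. This is only an assumption about the shape of the change: DataProvider here is a hypothetical stand-in rather than the real ExportDataProvider interface, and the JSON is faked with strings.

```java
public class DatasetJsonOptionSketch {

    /** Hypothetical stand-in for the provider interface; not the real Dataverse SPI. */
    public interface DataProvider {
        // New variant: lets the caller decide whether the files are serialized.
        String getDatasetJson(boolean includeFiles);

        // Existing no-argument call keeps today's behavior (files included).
        default String getDatasetJson() {
            return getDatasetJson(true);
        }
    }

    /** Toy implementation that fakes the dataset JSON with or without a "files" member. */
    public static DataProvider toyProvider(int fileCount) {
        return includeFiles -> includeFiles
                ? "{\"title\":\"demo\",\"files\":" + fileCount + "}"
                : "{\"title\":\"demo\"}";
    }

    public static void main(String[] args) {
        DataProvider provider = toyProvider(3);
        System.out.println(provider.getDatasetJson());      // prints {"title":"demo","files":3}
        System.out.println(provider.getDatasetJson(false)); // prints {"title":"demo"}
    }
}
```

An exporter for a cheap format such as oai_dc could then call getDatasetJson(false), while existing exporters keep working unchanged through the default method.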