Conversation

@bouweandela
Member

@bouweandela bouweandela commented Jul 3, 2025

Description

Add an interface for adding new data sources. Documentation of the new interface is available here: esmvalcore.io.

The existing esmvalcore.local and esmvalcore.esgf modules have been modified to make use of the new interface and as an example use case, support for using intake-esgf to find input data has been added.

Several commands have been added:

  • esmvaltool config show: print the current configuration
  • esmvaltool config list: list available example configuration files
  • esmvaltool config copy: copy an example configuration file to your configuration directory, i.e. ~/.config/esmvaltool or the path defined by the ESMVALTOOL_CONFIG_DIR environment variable.

To try the new intake-esgf data source, configure esmvaltool to use it by running the command esmvaltool config copy intake-esgf-data.yml.
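
As a rough illustration of the lookup rule stated in the list above, the sketch below resolves the configuration directory from the ESMVALTOOL_CONFIG_DIR environment variable and falls back to ~/.config/esmvaltool. The helper name resolve_config_dir is hypothetical and not part of esmvalcore, and the real tool may apply further rules.

```python
import os
from pathlib import Path


def resolve_config_dir(environ=None):
    """Return the directory where configuration files are expected.

    Hypothetical helper, not part of esmvalcore: it only mirrors the rule
    stated in the pull request description.
    """
    environ = os.environ if environ is None else environ
    custom = environ.get("ESMVALTOOL_CONFIG_DIR")
    if custom:
        # An explicitly set environment variable takes precedence.
        return Path(custom).expanduser()
    return Path.home() / ".config" / "esmvaltool"


if __name__ == "__main__":
    config_dir = resolve_config_dir()
    print(f"Configuration directory: {config_dir}")
    # Show which YAML files are present there, if the directory exists.
    if config_dir.is_dir():
        for path in sorted(config_dir.glob("*.yml")):
            print(" ", path.name)
```

Running this before and after esmvaltool config copy intake-esgf-data.yml is one way to check that the copied file ended up in the directory you expect.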

Related to #2584

Contains changes to esmvalcore.local.DataSource that are not backwards compatible.

Link to documentation:

Follow up ideas:

  • Add descriptions to the example configuration files for displaying in the command esmvaltool config list
  • Improve validation of the data source configuration
  • Move the modules esmvalcore.esgf and esmvalcore.local into esmvalcore.io. To avoid introducing even more changes in the pull request, I will do this in a follow up pull request.
  • Make the fixes module configurable per data source
  • Add a site configuration setting that selects defaults appropriate to that site, e.g. site: levante would select data sources and dask settings appropriate to Levante, site: jasmin for Jasmin, to simplify configuration of the tool (see "Add a site option to the get_config_user command", #1706)

Before you get started

Checklist

It is the responsibility of the author to make sure the pull request is ready to review. The icons indicate whether the item will be subject to the 🛠 Technical or 🧪 Scientific review.


To help with the number of pull requests:

@valeriupredoi
Contributor

I'll work with you on this one @bouweandela 🍺

@bouweandela bouweandela force-pushed the add-intake-esgf-support branch from e91e383 to 9d67ed5 on July 22, 2025 13:56
@bouweandela bouweandela added the enhancement (New feature or request) label Jul 23, 2025
Contributor

@valeriupredoi valeriupredoi left a comment

having a dive in this, bud - let me know how I can help!

f"but your configuration for project '{project}' contains "
f"'{data_source}' of type '{type(data_source)}'."
)
raise TypeError(msg)
Contributor

I wonder if we want to see if we can first convert it to a DataSource before we toss it out the window
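
One way to picture this suggestion is the sketch below, which attempts a conversion before raising the TypeError. Everything in it is an assumption made for illustration: DataSourceLike, DictDataSource and ensure_data_source are hypothetical names, and the actual interface in esmvalcore may look quite different.

```python
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class DataSourceLike(Protocol):
    """Hypothetical stand-in for the data source interface."""

    def find_data(self, **facets: Any) -> list: ...


class DictDataSource:
    """Toy data source built from a plain mapping (illustration only)."""

    def __init__(self, settings: dict) -> None:
        self.settings = settings

    def find_data(self, **facets: Any) -> list:
        # A real implementation would search for data here.
        return []


def ensure_data_source(project: str, data_source: Any) -> DataSourceLike:
    """Return a data source, converting plain mappings when possible."""
    if isinstance(data_source, DataSourceLike):
        return data_source
    if isinstance(data_source, dict):
        # Attempt the conversion before giving up.
        return DictDataSource(data_source)
    msg = (
        "Expected a data source, "
        f"but your configuration for project '{project}' contains "
        f"'{data_source}' of type '{type(data_source)}'."
    )
    raise TypeError(msg)
```

With this shape, only values that cannot be converted at all reach the error path.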

-------
:obj:`typing.Iterable` of :obj:`esmvalcore.io.base.DataElement`
The data elements that have been found.
"""
Contributor

this is an excellent addition: we are finally abstracting the data object that gets ingested by esmvalcore, and generalizing it. Let's be careful how we implement this so that it can be reused with little fuss in the future. I'd argue that "data that can be loaded" can be anything, i.e. the most generic file object (it need not be on disk, nor need it be downloaded), so that we can operate with object stores too
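
To make the idea of a generic loadable object more concrete, here is a minimal sketch in which a consumer reads bytes without knowing whether they come from disk or from memory. The names LoadableElement, LocalFileElement and InMemoryElement are hypothetical; this is not the esmvalcore.io.base.DataElement interface, and the in-memory class merely stands in for something like an object store.

```python
import io
from pathlib import Path
from typing import BinaryIO, Protocol


class LoadableElement(Protocol):
    """Hypothetical minimal interface: a name plus a way to open bytes."""

    name: str

    def open(self) -> BinaryIO: ...


class LocalFileElement:
    """Data that lives in a file on disk."""

    def __init__(self, path: Path) -> None:
        self.path = path
        self.name = path.name

    def open(self) -> BinaryIO:
        return self.path.open("rb")


class InMemoryElement:
    """Data that never touches disk, e.g. bytes fetched from an object store."""

    def __init__(self, name: str, payload: bytes) -> None:
        self.name = name
        self._payload = payload

    def open(self) -> BinaryIO:
        return io.BytesIO(self._payload)


def read_header(element: LoadableElement, nbytes: int = 4) -> bytes:
    """Read the first bytes without caring where the data lives."""
    with element.open() as stream:
        return stream.read(nbytes)
```

The consumer (read_header) depends only on the small interface, which is the property the comment above asks for.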

@valeriupredoi
Contributor

valeriupredoi commented Jul 24, 2025

this one here ties in very well with this PR, bud #2785 - enjoy your time off 🏖️

@valeriupredoi
Contributor

hey @bouweandela hope you're enjoying your holiday time! I kept myself busy and we now have Zarr support (in _io.load) and have done other improvements, hence the conflicts with main; let me fix those for you now. Also, you can now pass an Intake catalog via this PR, and if that has Zarr files in S3 buckets, then we can load them and test this one 😃

@codecov

codecov bot commented Aug 19, 2025

Codecov Report

❌ Patch coverage is 93.76147% with 34 lines in your changes missing coverage. Please review.
✅ Project coverage is 95.34%. Comparing base (e8bc6e0) to head (0a946d3).

Files with missing lines              Patch %   Lines
esmvalcore/_main.py                   20.00%    28 Missing ⚠️
esmvalcore/local.py                   96.34%    3 Missing ⚠️
esmvalcore/config/_data_sources.py    92.30%    2 Missing ⚠️
esmvalcore/io/intake_esgf.py          98.95%    1 Missing ⚠️

❌ Your patch check has failed because the patch coverage (93.76%) is below the target coverage (100.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2765      +/-   ##
==========================================
- Coverage   95.43%   95.34%   -0.09%     
==========================================
  Files         260      264       +4     
  Lines       15528    15869     +341     
==========================================
+ Hits        14819    15131     +312     
- Misses        709      738      +29     

@bouweandela bouweandela force-pushed the add-intake-esgf-support branch 2 times, most recently from 3bf06ad to ef2e7cd on September 17, 2025 09:15
@bouweandela bouweandela added this to the v2.14.0 milestone Oct 3, 2025
@bouweandela bouweandela force-pushed the add-intake-esgf-support branch 4 times, most recently from bea9cf8 to bce7c5a on October 17, 2025 10:26
Move timerange extraction to DataElement

Move tests/unit/test_provenance.py to tests/unit/provenance and add more tests
@bouweandela bouweandela force-pushed the add-intake-esgf-support branch from 0b12c7b to 1794742 on October 17, 2025 14:36
@bouweandela bouweandela force-pushed the add-intake-esgf-support branch from ca867c6 to 94287ab on October 22, 2025 16:00
@bouweandela bouweandela changed the title from "Add support for intake-esgf" to "Add an interface for adding new data sources and add support for intake-esgf as a first example" Oct 22, 2025
@valeriupredoi
Contributor

am finally able to start looking at this in great detail, bud, sorry, got hijacked by other things until now 🍺

encounter any issues using this module, please report them at
https://github.com/ESMValGroup/ESMValCore/issues.

Run the command ``esmvalcore config copy intake-esgf-data.yml`` to update
Contributor

Suggested change
- Run the command ``esmvalcore config copy intake-esgf-data.yml`` to update
+ Run the command ``esmvaltool config copy intake-esgf-data.yml`` to update

Contributor

this seems to not work at first try - the runs keep picking up the old or "legacy" esgf-pyclient way of finding the data

Contributor

OK! I made it work, only by declaring the environment variable export ESMVALTOOL_CONFIG_DIR=~/.esmvaltool (and subsequently removing an old config-developer file from there); so be careful with the default .config/esmvaltool - even if I popped the intake yml config there, it's still not found, and the old esgf way is used

Contributor

@valeriupredoi valeriupredoi Oct 30, 2025

Oh and by Lord Loki, the data transfer (download) is SLOWWW - but it WORKS 🥳

Member Author

I expect esmvaltool is still using the rootpath, drs, and search_esgf settings from your config-user.yml file in addition to the intake-esgf configuration and then finding the data there first and not using intake-esgf at all. Could you try removing any rootpath and drs entries for projects where you want to use intake-esgf? I'll see if I can make the defaults more intuitive.

Member Author

@bouweandela bouweandela Oct 30, 2025

Regarding the slowness, I would highly recommend configuring intake-esgf so it can find data on your system, e.g. set local_cache to the value you have for download_dir (e.g. ~/climate_data by default) and esg_dataroot to the correct path for Jasmin (assuming you're testing there). You can also configure which indices it searches, maybe there is an issue with the default ones today. I'll add a note to our documentation about doing that.

Contributor

no Bouwe, I specifically removed all data from climate_data - nothing to be found there. There is a yet-to-be-understood configuration issue that prevents the use of the intake setup unless I declare and set that env variable;

as for slowness - ha! - it's our job to provide the user with a ready-set config that maximizes performance, we are not gonna tell them to use this or that intake-esm configuration; I am running on my local machine BTW, not on JASMIN, the UREAD JANET wired network is a LOT faster than anything JASMIN (except the CEDA node of course, which is - there)

@valeriupredoi
Contributor

valeriupredoi commented Oct 30, 2025

many thanks @bouweandela - as promised, I have now started to stress-test this baby - please see my very initial query #2765 (comment)

The first type of test is a basic run (what they do with any aircraft prototype - they just taxi it at the very beginning):

  • esgf intake gets the data but that takes about 4 to 5 times longer than before with the old esgf-pyclient
  • reruns are fine, data is cached and no downloads happen (as expected)

Am looking through the debug log and am seeing

2025-10-30 15:20:42,767 UTC [3125417] DEBUG   globus_sdk.config.env_vars:59 on lookup, default setting: GLOBUS_SDK_ENVIRONMENT=production
[many many debug lines later]
2025-10-30 15:23:12,505 UTC [3125417] DEBUG   globus_sdk.client:518 request completed with response code: 200

-> that's about 3 minutes of Globus SDK going around over requests and fetching data that takes 30s with the old esgf pyclient - we need to sort this out somehow or we'll be toast!
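
For reference, the elapsed time between the two log lines quoted above can be computed directly from their timestamps. This is a small sketch using only the standard library; the helper names are made up for illustration.

```python
from datetime import datetime

# Timestamp layout used at the start of each log line quoted above.
LOG_TIME_FORMAT = "%Y-%m-%d %H:%M:%S,%f"
TIMESTAMP_LENGTH = len("2025-10-30 15:20:42,767")


def parse_log_time(line: str) -> datetime:
    """Parse the leading timestamp of a log line."""
    return datetime.strptime(line[:TIMESTAMP_LENGTH], LOG_TIME_FORMAT)


def elapsed_seconds(first_line: str, last_line: str) -> float:
    """Return the number of seconds between two log lines."""
    delta = parse_log_time(last_line) - parse_log_time(first_line)
    return delta.total_seconds()


first = (
    "2025-10-30 15:20:42,767 UTC [3125417] DEBUG   "
    "globus_sdk.config.env_vars:59 on lookup, default setting: "
    "GLOBUS_SDK_ENVIRONMENT=production"
)
last = (
    "2025-10-30 15:23:12,505 UTC [3125417] DEBUG   "
    "globus_sdk.client:518 request completed with response code: 200"
)
print(f"{elapsed_seconds(first, last):.1f} s")  # prints 149.7 s
```

That is roughly two and a half minutes between the first and the last Globus SDK message.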

@valeriupredoi
Contributor

another thing from the first test: we really shouldn't dump the intake config yamls in the debug log file - that poor thing is now 33k lines 😆

@bouweandela
Member Author

Which recipe are you testing with? That seems like a lot of lines indeed.

@valeriupredoi
Contributor

Which recipe are you testing with? That seems like a lot of lines indeed.

examples/recipe_python.yml - it's the various yamls, I counted {'*' appears in 10k lines 😁

@bouweandela
Member Author

bouweandela commented Oct 30, 2025

Which recipe are you testing with? That seems like a lot of lines indeed.

examples/recipe_python.yml - it's the various yamls, I counted {'*' appears in 10k lines 😁

It looks like we are printing our own configuration many times, which grew a lot in #2747. It's unrelated to the changes in this pull request. This should be fixed by #2869.

@valeriupredoi
Contributor

@bouweandela good news! The debug log is now only 12k lines vs yesterday's 33k lines 🥳 - we can still improve it though, all that stuff from globus is prob not needed I reckon.

Also good news, I managed to get a 36s time for the recipe (examples/recipe_python.yml), and that's good, but I am worried about the ESGFIndex nodes it's looking at - this, with absolutely no tweaking of the configuration:

2025-10-31 12:47:07,464 UTC [3223234] INFO    intake-esgf:44 └─GlobusESGFIndex('ESGF2-US-1.5-Catalog') results=0 response_time=0.02
2025-10-31 12:47:07,483 UTC [3223234] INFO    intake-esgf:44 └─GlobusESGFIndex('ESGF2-US-1.5-Catalog') results=0 response_time=0.00
2025-10-31 12:47:07,497 UTC [3223234] INFO    intake-esgf:44 └─GlobusESGFIndex('ESGF2-US-1.5-Catalog') results=4 response_time=0.00
2025-10-31 12:47:07,530 UTC [3223234] INFO    intake-esgf:44 └─GlobusESGFIndex('ESGF2-US-1.5-Catalog') results=2 response_time=0.01
2025-10-31 12:47:07,571 UTC [3223234] INFO    intake-esgf:44 └─GlobusESGFIndex('ESGF2-US-1.5-Catalog') results=4 response_time=0.01
2025-10-31 12:47:07,607 UTC [3223234] INFO    intake-esgf:44 └─GlobusESGFIndex('ESGF2-US-1.5-Catalog') results=2 response_time=0.01
2025-10-31 12:47:07,639 UTC [3223234] INFO    intake-esgf:44 └─GlobusESGFIndex('ESGF2-US-1.5-Catalog') results=4 response_time=0.00
2025-10-31 12:47:07,684 UTC [3223234] INFO    intake-esgf:44 └─GlobusESGFIndex('ESGF2-US-1.5-Catalog') results=2 response_time=0.01
2025-10-31 12:47:07,769 UTC [3223234] INFO    intake-esgf:44 └─GlobusESGFIndex('ESGF2-US-1.5-Catalog') results=4 response_time=0.01
2025-10-31 12:47:10,858 UTC [3223234] INFO    intake-esgf:44 └─GlobusESGFIndex('ESGF2-US-1.5-Catalog') results=4 response_time=0.00
2025-10-31 12:47:10,874 UTC [3223234] INFO    intake-esgf:44 └─GlobusESGFIndex('ESGF2-US-1.5-Catalog') results=4 response_time=0.01
2025-10-31 12:47:10,894 UTC [3223234] INFO    intake-esgf:44 └─GlobusESGFIndex('ESGF2-US-1.5-Catalog') results=2 response_time=0.01
2025-10-31 12:47:14,050 UTC [3223234] INFO    intake-esgf:44 └─GlobusESGFIndex('ESGF2-US-1.5-Catalog') results=2 response_time=0.00
2025-10-31 12:47:14,063 UTC [3223234] INFO    intake-esgf:44 └─GlobusESGFIndex('ESGF2-US-1.5-Catalog') results=2 response_time=0.00
2025-10-31 12:47:14,122 UTC [3223234] INFO    intake-esgf:44 └─GlobusESGFIndex('ESGF2-US-1.5-Catalog') results=4 response_time=0.00
2025-10-31 12:47:14,131 UTC [3223234] INFO    intake-esgf:44 └─GlobusESGFIndex('ESGF2-US-1.5-Catalog') results=4 response_time=0.00
2025-10-31 12:47:16,005 UTC [3223234] INFO    intake-esgf:44 └─GlobusESGFIndex('ESGF2-US-1.5-Catalog') results=2 response_time=0.00
2025-10-31 12:47:16,016 UTC [3223234] INFO    intake-esgf:44 └─GlobusESGFIndex('ESGF2-US-1.5-Catalog') results=2 response_time=0.00
2025-10-31 12:47:24,456 UTC [3223234] INFO    intake-esgf:44 └─GlobusESGFIndex('ESGF2-US-1.5-Catalog') results=4 response_time=0.00
2025-10-31 12:47:24,467 UTC [3223234] INFO    intake-esgf:44 └─GlobusESGFIndex('ESGF2-US-1.5-Catalog') results=4 response_time=0.00
2025-10-31 12:47:26,576 UTC [3223234] INFO    intake-esgf:44 └─GlobusESGFIndex('ESGF2-US-1.5-Catalog') results=2 response_time=0.00
2025-10-31 12:47:26,585 UTC [3223234] INFO    intake-esgf:44 └─GlobusESGFIndex('ESGF2-US-1.5-Catalog') results=2 response_time=0.00 
2025-10-31 12:47:31,453 UTC [3223234] INFO    intake-esgf:44 └─GlobusESGFIndex('ESGF2-US-1.5-Catalog') results=4 response_time=0.00
2025-10-31 12:47:31,462 UTC [3223234] INFO    intake-esgf:44 └─GlobusESGFIndex('ESGF2-US-1.5-Catalog') results=4 response_time=0.00
2025-10-31 12:47:33,483 UTC [3223234] INFO    intake-esgf:44 └─GlobusESGFIndex('ESGF2-US-1.5-Catalog') results=2 response_time=0.00
2025-10-31 12:47:33,492 UTC [3223234] INFO    intake-esgf:44 └─GlobusESGFIndex('ESGF2-US-1.5-Catalog') results=2 response_time=0.00

I'll rerun a few times, and look where it's looking for data, and how long it takes.
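
A quick way to look at where it is searching is to summarize the index lines from the log. The sketch below assumes the line layout shown in the excerpt above (index class, index name, results=, response_time=); the function name and the shape of the summary are illustrative choices, not something provided by intake-esgf.

```python
import re
from collections import defaultdict

# Matches e.g. "GlobusESGFIndex('ESGF2-US-1.5-Catalog') results=4 response_time=0.01"
INDEX_LINE = re.compile(
    r"(?P<kind>\w+)\('(?P<index>[^']+)'\)\s+"
    r"results=(?P<results>\d+)\s+response_time=(?P<time>[\d.]+)"
)


def summarize_index_queries(lines):
    """Count queries, empty results and total response time per index."""
    summary = defaultdict(lambda: {"queries": 0, "empty": 0, "total_time": 0.0})
    for line in lines:
        match = INDEX_LINE.search(line)
        if match is None:
            continue  # Not an index query line.
        entry = summary[match["index"]]
        entry["queries"] += 1
        entry["empty"] += int(match["results"]) == 0
        entry["total_time"] += float(match["time"])
    return dict(summary)


sample = [
    "2025-10-31 12:47:07,464 UTC [3223234] INFO    intake-esgf:44 "
    "└─GlobusESGFIndex('ESGF2-US-1.5-Catalog') results=0 response_time=0.02",
    "2025-10-31 12:47:07,483 UTC [3223234] INFO    intake-esgf:44 "
    "└─GlobusESGFIndex('ESGF2-US-1.5-Catalog') results=0 response_time=0.00",
    "2025-10-31 12:47:07,497 UTC [3223234] INFO    intake-esgf:44 "
    "└─GlobusESGFIndex('ESGF2-US-1.5-Catalog') results=4 response_time=0.00",
]
for index, stats in summarize_index_queries(sample).items():
    print(index, stats)
```

Feeding the whole main log through this would show at a glance whether any index other than ESGF2-US-1.5-Catalog is being queried.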

@valeriupredoi
Contributor

oh and BTW we should make clear that, with no prior configuration, the downloaded data goes to an $HOME/.esgf dir, and not climate_data, as specified in the config-user

@valeriupredoi
Contributor

we need to output the url where the data is downloaded from! I would be very happy to see that in the main log. but most definitely in the main log debug - at this point in time I have no idea where the data is http-fetched from! And I am seeing variability in terms of download speeds, which is normal, but I need to understand what's the physical distance to the data.

Also, we need to have data fetched preferentially from the closest node to the place where the run happens - this is not easy to implement since we need to first figure out where the tool is run, but perhaps, at the very least, we should have pre-configured setups if one runs in Europe, or the US etc

@valeriupredoi
Contributor

also, if that US node is the only place where we are looking for data, why am I getting this:

2025-10-31 13:49:39,354 UTC [3272234] INFO    intake-esgf:44 └─GlobusESGFIndex('ESGF2-US-1.5-Catalog') results=0 response_time=0.44

and wasting 0.44s only to be repeated afterwards?

@valeriupredoi
Contributor

run_tests on CircleCI: FAILED tests/unit/io/test_intake_esgf.py::test_to_iris_online - intake_esgf.exceptions.LocalCacheNotWritable: You do not have write permission in the cache directories specified: ['~/.esgf/']

Just rerunning to see if that was a fluke - though I think the runner is not able to create dirs in ~
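
To check the "cannot create dirs in ~" hypothesis outside of intake-esgf, something like the following sketch could be run on the CI runner. It uses only the standard library; writable_cache_dir is a hypothetical helper and the fallback directory in the example is an assumption, not something intake-esgf uses.

```python
import os
import tempfile
from pathlib import Path


def writable_cache_dir(candidates):
    """Return the first candidate that can be created and written to, else None."""
    for candidate in candidates:
        path = Path(candidate).expanduser()
        try:
            path.mkdir(parents=True, exist_ok=True)
        except OSError:
            # The directory cannot be created, e.g. read-only home directory.
            continue
        if os.access(path, os.W_OK):
            return path
    return None


if __name__ == "__main__":
    # Note: this creates ~/.esgf if it does not exist yet.
    # The fallback below is an assumed location for illustration only.
    fallback = Path(tempfile.gettempdir()) / "esgf-cache"
    print(writable_cache_dir(["~/.esgf", fallback]))
```

If this prints the fallback path rather than the ~/.esgf directory, the runner indeed cannot write to the home directory.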

@bouweandela
Member Author

Download speeds should get better over time, as intake-esgf keeps track of which hosts are fastest, similar to how we do it. Here is how to see the intake-esgf download speeds:

python -c 'import intake_esgf; from pathlib import Path; p = Path.home() / ".config" / "intake-esgf" / "download.db"; print(intake_esgf.database.get_download_rate_dataframe(p))'

and here is how to see the esmvalcore.esgf download speeds:

python -c 'import yaml; import pandas as pd; from pathlib import Path; p = Path.home() / ".esmvaltool" / "cache" / "esgf-hosts.yml"; print(pd.DataFrame.from_dict(yaml.safe_load(p.read_text()), orient="index").sort_values("speed (MB/s)"))'

@bouweandela
Member Author

we need to output the url where the data is downloaded from! I would be very happy to see that in the main log. but most definitely in the main log debug - at this point in time I have no idea where the data is http-fetched from! And I am seeing variability in terms of download speeds, which is normal, but I need to understand what's the physical distance to the data.

This is already in the log:

2025-10-30 15:22:31,672 UTC [3125417] INFO    intake-esgf:44 transfer_time=106.03 [s] at 0.61 [Mb s-1] http://esgf-node.ornl.gov/thredds/fileServer/cmip5_css01_data/cmip5/output1/BCC/bcc-csm1-1/historical/mon/atmos/Amon/r1i1p1/v1/tas/tas_Amon_bcc-csm1-1_historical_r1i1p1_185001-201212.nc
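
As a small convenience, the host, duration and rate can be pulled out of such transfer lines with a regular expression. This sketch assumes the line layout of the example above and is not part of intake-esgf or esmvalcore; parse_transfer is a made-up name.

```python
import re
from urllib.parse import urlparse

# Matches e.g. "transfer_time=106.03 [s] at 0.61 [Mb s-1] http://host/path"
TRANSFER_LINE = re.compile(
    r"transfer_time=(?P<seconds>[\d.]+) \[s\] "
    r"at (?P<rate>[\d.]+) \[Mb s-1\] (?P<url>\S+)"
)


def parse_transfer(line):
    """Return host, duration and rate of a transfer log line, or None."""
    match = TRANSFER_LINE.search(line)
    if match is None:
        return None
    return {
        "host": urlparse(match["url"]).hostname,
        "seconds": float(match["seconds"]),
        "rate": float(match["rate"]),
    }


line = (
    "2025-10-30 15:22:31,672 UTC [3125417] INFO    intake-esgf:44 "
    "transfer_time=106.03 [s] at 0.61 [Mb s-1] "
    "http://esgf-node.ornl.gov/thredds/fileServer/cmip5_css01_data/cmip5/"
    "output1/BCC/bcc-csm1-1/historical/mon/atmos/Amon/r1i1p1/v1/tas/"
    "tas_Amon_bcc-csm1-1_historical_r1i1p1_185001-201212.nc"
)
print(parse_transfer(line))
```

Grouping the parsed records by host would show which nodes the data was actually fetched from and how fast each one was.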


3 participants