-
Notifications
You must be signed in to change notification settings - Fork 44
Add an interface for adding new data sources and add support for intake-esgf as a first example #2765
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
I'll work with you on this one @bouweandela 🍺 |
e91e383 to
9d67ed5
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
having a dive in this, bud - let me know how I can help!
esmvalcore/config/_data_sources.py
Outdated
| f"but your configuration for project '{project}' contains " | ||
| f"'{data_source}' of type '{type(data_source)}'." | ||
| ) | ||
| raise TypeError(msg) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we want to see if we can first convert it to a DataSource before we toss it out the window
| ------- | ||
| :obj:`typing.Iterable` of :obj:`esmvalcore.io.base.DataElement` | ||
| The data elements that have been found. | ||
| """ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is an excellent addition - we are finally abstracting a data object that gets ingested by esmvalcore, and we generalize it: let's be careful how we implement this so it can be reused with little fuss for the future: I'd argue that "data that can be loaded" can be anything ie the most generic file object (not needing to be on disk, nor it needing it to be downloaded), so we can operate with object stores too
|
this one here ties in very well with this PR, bud #2785 - enjoy your time off 🏖️ |
|
hey @bouweandela hope you're enjoying your holiday time! I kept myself busy and we now have Zarr support (in |
196c6e4 to
8a7a935
Compare
5283ecb to
b98ab5d
Compare
Codecov Report❌ Patch coverage is
❌ Your patch check has failed because the patch coverage (93.76%) is below the target coverage (100.00%). You can increase the patch coverage or adjust the target coverage. Additional details and impacted files@@ Coverage Diff @@
## main #2765 +/- ##
==========================================
- Coverage 95.43% 95.34% -0.09%
==========================================
Files 260 264 +4
Lines 15528 15869 +341
==========================================
+ Hits 14819 15131 +312
- Misses 709 738 +29 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
3bf06ad to
ef2e7cd
Compare
bea9cf8 to
bce7c5a
Compare
Move timerange extraction to DataElement Move tests/unit/test_provenance.py to tests/unit/provenance and add more tests
0b12c7b to
1794742
Compare
ca867c6 to
94287ab
Compare
|
am finally able to start looking at this in great detail, bud, sorry, got hijacked by other things until now 🍺 |
| encounter any issues using this module, please report them at | ||
| https://github.com/ESMValGroup/ESMValCore/issues. | ||
|
|
||
| Run the command ``esmvalcore config copy intake-esgf-data.yml`` to update |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| Run the command ``esmvalcore config copy intake-esgf-data.yml`` to update | |
| Run the command ``esmvaltool config copy intake-esgf-data.yml`` to update |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this seems to not work at first try - the runs keep picking up the old or "legacy" esgf-pyclient way of finding the data
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK! I made it work, only by declaring the environment variable export ESMVALTOOL_CONFIG_DIR=~/.esmvaltool (and subsequently removing an old config-developer file from there); so be careful with the default .config/esmvaltool - even if I popped the intake yml config there, it's still not found, and the old esgf way is used
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh and by Lord Loki, the data transfer (download) is SLOWWW - but it WORKS 🥳
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I expect esmvaltool is still using the rootpath, drs, and search_esgf settings from your config-user.yml file in addition to the intake-esgf configuration and then finding the data there first and not using intake-esgf at all. Could you try removing any rootpath and drs entries for projects where you want to use intake-esgf? I'll see if I can make the defaults more intuitive.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Regarding the slowness, I would highly recommend configuring intake-esgf so it can find data on your system, e.g. set local_cache to the value you have for download_dir (e.g. ~/climate_data by default) and esg_dataroot to the correct path for Jasmin (assuming you're testing there). You can also configure which indices it searches, maybe there is an issue with the default ones today. I'll add a note to our documentation about doing that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no Bouwe, I specifically removed all data from climate_data - nothing to be found there, there is a yet to understand configuration issue, that prevents the use of intake setup unless I declare and set that env variable;
as for slowness - ha! -it's our job to provide the user with a ready-set config that maximaizes performance, we are not gonna tell them to use this or that intake-esm configuration; I am running on my local machine BTW, not on JASMIN, the UREAD JANET wired network is a LOT faster than anything JASMIN (except the CEDA node of course, which is - there)
|
many thanks @bouweandela - as promised, I have now started to stress-test this baby - please see my very initial query #2765 (comment) The first type of test is a basic run (what they do with any aircraft prototype - they just taxi it at the very beginning):
Am looking through the debug log and am seeing -> that's about 3 minutes of Globus SDK going around over requests and fetching data that takes 30s with the old esgf pyclient - we need to sort this out somehow or we'll be toast! |
|
another thing from the first test: we really shouldn't dump the intake config yamls in the debug log file - that poor thing is now 33k lines 😆 |
|
Which recipe are you testing with? That seems like a lot of lines indeed. |
|
It looks like we are printing our own configuration many times, which grew a lot in #2747. It's unrelated to the changes in this pull request. This should be fixed by #2869. |
|
@bouweandela good news! The debug log is now only 12k lines vs yesterday's 33k lines 🥳 - we can still improve it though, all that stuff from globus is prob not needed I reckon. Also good news, I managed to get a 36s time for the recipe (examples/recipe_python.yml), and that's good, but I am worried about the ESGFIndex nodes it's looking at - this, with absolutely no tweaking of the configuration: I'll rerun a few times, and look where it's looking for data, and how long it takes. |
|
oh and BTW we should make clear that, with no prior configuration, the downloaded data goes to an |
|
we need to output the url where the data is downloaded from! I would be very happy to see that in the main log. but most definitely in the main log debug - at this point in time I have no idea where the data is http-fetched from! And I am seeing variability in terms of download speeds, which is normal, but I need to understand what's the physical distance to the data. Also, we need to have data fetched preferentially from the closes node to the place where the run happens - this is no easyimplement since we need to first figure out where the tool is run, but perhaps, at the very least, we should have pre-configured setups if one runs in Europe, or the US etc |
|
also, if that US node is the only place where we are looking for data, why am I getting this: and wasting 0.44s only to be repeated afterwards? |
|
run_tests on CircleCI: FAILED tests/unit/io/test_intake_esgf.py::test_to_iris_online - intake_esgf.exceptions.LocalCacheNotWritable: You do not have write permission in the cache directories specified: ['~/.esgf/'] Just rerunning see if that was a fluke - though I think the runner is not able to create dirs in |
|
Download speeds should get better over time, as intake-esgf keeps track of which hosts are fastest, similar to how we do it. Here is how to see the intake-esgf download speeds: and here is how to see the |
This is already in the log: |
Description
Add an interface for adding new data sources. Documentation of the new interface is available here:
esmvalcore.io.The existing
esmvalcore.localandesmvalcore.esgfmodules have been modified to make use of the new interface and as an example use case, support for using intake-esgf to find input data has been added.Several commands have been added:
esmvaltool config show: print the current configurationesmvaltool config list: list available example configuration filesesmvaltool config copy: copy an example configuration file to your configuration directory, i.e.~/.config/esmvaltoolor the path defined by theESMVALTOOL_CONFIG_DIRenvironment variable.To try the new intake-esgf data source, configure
esmvaltoolto use it by running the commandesmvalcore config copy intake-esgf-data.yml.Related to #2584
Contains changes to
esmvalcore.local.DataSourcethat are not backwards compatible.Link to documentation:
Follow up ideas:
esmvaltool config listesmvalcore.esgfandesmvalcore.localintoesmvalcore.io. To avoid introducing even more changes in the pull request, I will do this in a follow up pull request.siteconfiguration setting that selects defaults appropriate to that site, e.g.site: levantewould select data sources and dask settings appropriate to Levante,site: jasminfor Jasmin, to simplify configuration of the tool Add asiteoption to theget_config_usercommand #1706Before you get started
Checklist
It is the responsibility of the author to make sure the pull request is ready to review. The icons indicate whether the item will be subject to the 🛠 Technical or 🧪 Scientific review.
Changes are backward compatibleTo help with the number pull requests: