ENH: Add a script to query fMRI compliant data from OpenNeuro #89

jhlegarreta · 2025-04-23T01:30:40Z

Add a script to query fMRI compliant data from OpenNeuro.

codecov · 2025-04-23T02:08:28Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 71.35%. Comparing base (fb8f3f3) to head (726a6ea).

Additional details and impacted files

@@           Coverage Diff           @@
##             main      #89   +/-   ##
=======================================
  Coverage   71.35%   71.35%           
=======================================
  Files          23       23           
  Lines        1138     1138           
  Branches      139      139           
=======================================
  Hits          812      812           
  Misses        283      283           
  Partials       43       43

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

jhlegarreta · 2025-06-19T21:25:42Z

I revisited this.

In 736e276 I split the OpenNeuro querying and BOLD feature extraction/analysis/selection. Querying OpenNeuro dataset information and querying the files of each dataset is also split into two files: scripts/query_datasets_openneuro.py and scripts/query_mri_dataset_files_openneuro.py in order to separate querying OpenNeuro (datasets and files). The rationale is to minimize the number of times we need to query OpenNeuro. For now, 736e276 only does the querying job.

I've only tested scripts/query_mri_dataset_files_openneuro.py with the first ten datasets that are listed by the first script's output.

Next is to develop/refactor the necessary parts for extracting the relevant features and selecting those datasets/files complying with the criteria into different scripts.

jhlegarreta · 2025-06-21T17:02:17Z

Re #89 (comment)

Next is to develop/refactor the necessary parts for extracting the relevant features and selecting those datasets/files complying with the criteria into different scripts.

Done in subsequent commits.

I think I am done with this. Thoughts:

Some datasets are no longer BIDS-compliant (visible as errors/warnings when browsing OpenNeuro) or seem to have some issues. Do we want to identify those? How can we identify these programmatically?
I am only keeping those datasets that have human in their species field. Most datasets have the species field empty in the response received from GraphQl. Not sure if this field is the same as in the JSON sidecar. Should we fetch those where the tag is empty? Should we double-check with the JSON sidecar? Looking for e.g. "mouse", "mice", "rat", macaque", "non-human primate" in the description sounds brittle. If we decide to include any of these a visual QA (if an automatic one does not exist) should be done.
As opposed to the initial approach in ea8ee4c#diff-c8a4bb6425ad95a7dd63f31feb2f1ef0a676b7fd2659e6cb16d4d91891cd7e40R184 and ea8ee4c#diff-c8a4bb6425ad95a7dd63f31feb2f1ef0a676b7fd2659e6cb16d4d91891cd7e40R245, I am exclusively relying on the file list and shape of the NIfTI header instead of the JSON sidecar to get the number of runs/timepoints.
I am not checking whether there is some meaningful fMRI task, e.g. ea8ee4c#diff-c8a4bb6425ad95a7dd63f31feb2f1ef0a676b7fd2659e6cb16d4d91891cd7e40R270

Once these have been addressed and the scripts do not need updates, I'll squash all commits.

Additionally, as I've split all this into 4 different scripts, I think it has become apparent that most of the scripts in the scripts folder do not belong here, and should be put into a separate repository where all scripts to produce context results should dwell, keeping here only those scripts that are necessary for the package to run, if any. We may want to do this now to avoid polluting the folder.

jhlegarreta · 2025-06-24T13:01:45Z

Re: the commit message 4f9b1e9, re-run earlier commits again; 9f66a50 still incurs a deadlock; 2567252 brings query_mri_dataset_files_openneuro.py to completion (without handling errors), so I will try to bring changes one by one to query_mri_dataset_files_openneuro.py and see where the deadlock starts to happen. Thanks for the patience.

jhlegarreta · 2025-07-01T01:34:39Z

OK, this is working. Takes like 11 hours, but it is working, as opposed to less than two hours with a headlong rush that only does

return response.json()["data"]["snapshot"]["files"]

or

return response.json().get("data", {}).get("snapshot").get("files", []) or []

(the latter only solving part of the NoneType object is not iterable exceptions)

There is probably things to improve, but I think it serves the purpose and I've spent enough time on this. The retries are nice as they demonstrate that we've made efforts to get the widest possible set of cases: e.g. in the last run, all datasets were queried successfully (then getting the number of bold runs in the selection script fails for reasons related to malformed NIfTI files or other reasons, i.e. unrelated to this).

I will go ahead and squash all commits into one and then fix the style issues. At that point this will be ready to be merged. Apologies the noise generated by all the commits here.

The decision to transfer all these scripts to another repository (x-ref #89 (comment)) can be made once this is merged and be applied to all scripts.

Edit: I see that there is some issue when returning the failed queries in scripts/query_mri_dataset_files_openneuro.py depending on the exception: no failed queries may be reported, but the written dataset_id.tsv files may not account for all datasets. Anyways, the offset is only a few datasets: 4 in two different runs I did. So I am inclined to go ahead with this as is, and open an issue; I've already spent a lot of time with this. Unless we come up with a way to simulate responses from a server in a consistent manner, debugging this will be hard.

jhlegarreta · 2025-07-03T13:39:36Z

Edit: I see that there is some issue when returning the failed queries in scripts/query_mri_dataset_files_openneuro.py depending on the exception: no failed queries may be reported, but the written dataset_id.tsv files may not account for all datasets. Anyways, the offset is only a few datasets: 4 in two different runs I did. So I am inclined to go ahead with this as is, and open an issue; I've already spent a lot of time with this. Unless we come up with a way to simulate responses from a server in a consistent manner, debugging this will be hard.

Last force push (eeeed67) solves this.

oesteban

Looks good overall. I missed having examples in the initial docstring of every file.

Considering that all these scripts are heavily related, and the shared static variables, would it make sense to aggregate all of them into a single script, and select the behavior with click commands? Something like:

$ openneuro_query.py profile
$ openneuro_query.py datasets
$ openneuro_query.py files
$ openneuro_query.py select

scripts/characterize_bold_runs_openneuro.py

scripts/query_datasets_openneuro.py

scripts/query_mri_dataset_files_openneuro.py

scripts/select_bold_runs_openneuro.py

jhlegarreta · 2025-07-08T22:09:53Z

Considering that all these scripts are heavily related, and the shared static variables, would it make sense to aggregate all of them into a single script, and select the behavior with click commands?

I like the idea, but I'd propose to do it in a separate PR in order to have a working version at least. I know merging this clutters the repository, and in other circumstances, I'd be for adopting the approach here, but in this case I believe that I've spent too much time on this PR, so I'd rather have something merged and work from there. Also, as said, I think these scripts, together with the rest of the scripts that are not essential for the package, should dwell in a separate repository. So I'd vote for merging this as is to have a record of something that works, then transferring to another repository and converting to a click-based script in that repository.

jhlegarreta · 2025-07-08T22:31:31Z

I missed having examples in the initial docstring of every file.

Examples added.

Add a series of scripts to query fMRI compliant data from OpenNeuro: - A script that queries the existing datasets. - A script that, based on the results of the previous one, queries the human (f)MRI files. - A script that, based on the results of the previous one, identifies files constituting BOLD runs and characterizes them in terms of the number of timepoints they contain. - A script that, based on the results of the previous one, selects the relevant BOLD runs for the study according to some criteria. Notes: - Querying the human (f)MRI files takes a few hours. If a 502 response is obtained from the server, the request is posted again after a delay. If unsuccessful with yet another 502 error, this is repeated a fixed number of times. - When characterizing the BOLD runs, S3 is used instead of rely on URLs. It is observed that some `urls` fields contain multiple items, and rather than picking one of them (and even if the assumption that all URLs are the same or point to the same resource), prefer to use S3. It is experimentally observed that it is twice as fast for the queries done within the context of the script.

Improve descriptions of OpenNeuro query scripts: - Improve the wording. - Add examples on how to call the scripts. - Add empty lines for easier reading. Co-authored-by: Oscar Esteban <[email protected]>

jhlegarreta · 2025-07-09T14:15:53Z

So, feel free to open a new repository to specifically host only these scripts. I will push the scripts to there when that is done, then work on the click re-implementation, then (mid- to long-term) to generalizing to any modality.

jhlegarreta force-pushed the AddfMRIDataOpenNeuroQueryScript branch 7 times, most recently from 806c6c5 to 736e276 Compare June 19, 2025 21:20

jhlegarreta force-pushed the AddfMRIDataOpenNeuroQueryScript branch 7 times, most recently from 063bf1a to 0bcbf15 Compare June 21, 2025 15:42

jhlegarreta force-pushed the AddfMRIDataOpenNeuroQueryScript branch 8 times, most recently from e7c394b to 4f9b1e9 Compare June 23, 2025 22:52

jhlegarreta force-pushed the AddfMRIDataOpenNeuroQueryScript branch from b1fc859 to 6d580fc Compare July 1, 2025 02:11

jhlegarreta marked this pull request as ready for review July 1, 2025 02:15

jhlegarreta force-pushed the AddfMRIDataOpenNeuroQueryScript branch from 6d580fc to abf9856 Compare July 2, 2025 01:05

jhlegarreta force-pushed the AddfMRIDataOpenNeuroQueryScript branch from abf9856 to eeeed67 Compare July 3, 2025 13:38

jhlegarreta force-pushed the AddfMRIDataOpenNeuroQueryScript branch 3 times, most recently from 05a22af to 401b43c Compare July 3, 2025 18:56

oesteban reviewed Jul 8, 2025

View reviewed changes

jhlegarreta force-pushed the AddfMRIDataOpenNeuroQueryScript branch from f8ef0ca to e0bb868 Compare July 8, 2025 22:31

jhlegarreta and others added 2 commits July 8, 2025 18:32

DOC: Improve descriptions of OpenNeuro query scripts

726a6ea

Improve descriptions of OpenNeuro query scripts: - Improve the wording. - Add examples on how to call the scripts. - Add empty lines for easier reading. Co-authored-by: Oscar Esteban <[email protected]>

jhlegarreta force-pushed the AddfMRIDataOpenNeuroQueryScript branch from e0bb868 to 726a6ea Compare July 8, 2025 22:33

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

ENH: Add a script to query fMRI compliant data from OpenNeuro #89

ENH: Add a script to query fMRI compliant data from OpenNeuro #89

Uh oh!

jhlegarreta commented Apr 23, 2025

Uh oh!

codecov bot commented Apr 23, 2025 •

edited

Loading

Uh oh!

jhlegarreta commented Jun 19, 2025 •

edited

Loading

Uh oh!

jhlegarreta commented Jun 21, 2025 •

edited

Loading

Uh oh!

jhlegarreta commented Jun 24, 2025 •

edited

Loading

Uh oh!

jhlegarreta commented Jul 1, 2025 •

edited

Loading

Uh oh!

jhlegarreta commented Jul 3, 2025

Uh oh!

oesteban left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

jhlegarreta commented Jul 8, 2025

Uh oh!

jhlegarreta commented Jul 8, 2025

Uh oh!

jhlegarreta commented Jul 9, 2025

Uh oh!

Uh oh!

ENH: Add a script to query fMRI compliant data from OpenNeuro #89

Are you sure you want to change the base?

ENH: Add a script to query fMRI compliant data from OpenNeuro #89

Uh oh!

Conversation

jhlegarreta commented Apr 23, 2025

Uh oh!

codecov bot commented Apr 23, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

jhlegarreta commented Jun 19, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jhlegarreta commented Jun 21, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jhlegarreta commented Jun 24, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jhlegarreta commented Jul 1, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jhlegarreta commented Jul 3, 2025

Uh oh!

oesteban left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

jhlegarreta commented Jul 8, 2025

Uh oh!

jhlegarreta commented Jul 8, 2025

Uh oh!

jhlegarreta commented Jul 9, 2025

Uh oh!

Uh oh!

codecov bot commented Apr 23, 2025 •

edited

Loading

jhlegarreta commented Jun 19, 2025 •

edited

Loading

jhlegarreta commented Jun 21, 2025 •

edited

Loading

jhlegarreta commented Jun 24, 2025 •

edited

Loading

jhlegarreta commented Jul 1, 2025 •

edited

Loading