Skip to content

ENH: Add a script to query fMRI compliant data from OpenNeuro #89

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

jhlegarreta
Copy link
Contributor

Add a script to query fMRI compliant data from OpenNeuro.

Copy link

codecov bot commented Apr 23, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 71.35%. Comparing base (fb8f3f3) to head (726a6ea).

Additional details and impacted files
@@           Coverage Diff           @@
##             main      #89   +/-   ##
=======================================
  Coverage   71.35%   71.35%           
=======================================
  Files          23       23           
  Lines        1138     1138           
  Branches      139      139           
=======================================
  Hits          812      812           
  Misses        283      283           
  Partials       43       43           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@jhlegarreta jhlegarreta force-pushed the AddfMRIDataOpenNeuroQueryScript branch 7 times, most recently from 806c6c5 to 736e276 Compare June 19, 2025 21:20
@jhlegarreta
Copy link
Contributor Author

jhlegarreta commented Jun 19, 2025

I revisited this.

In 736e276 I split the OpenNeuro querying and BOLD feature extraction/analysis/selection. Querying OpenNeuro dataset information and querying the files of each dataset is also split into two files: scripts/query_datasets_openneuro.py and scripts/query_mri_dataset_files_openneuro.py in order to separate querying OpenNeuro (datasets and files). The rationale is to minimize the number of times we need to query OpenNeuro. For now, 736e276 only does the querying job.

I've only tested scripts/query_mri_dataset_files_openneuro.py with the first ten datasets that are listed by the first script's output.

Next is to develop/refactor the necessary parts for extracting the relevant features and selecting those datasets/files complying with the criteria into different scripts.

@jhlegarreta jhlegarreta force-pushed the AddfMRIDataOpenNeuroQueryScript branch 7 times, most recently from 063bf1a to 0bcbf15 Compare June 21, 2025 15:42
@jhlegarreta
Copy link
Contributor Author

jhlegarreta commented Jun 21, 2025

Re #89 (comment)

Next is to develop/refactor the necessary parts for extracting the relevant features and selecting those datasets/files complying with the criteria into different scripts.

Done in subsequent commits.

I think I am done with this. Thoughts:

Once these have been addressed and the scripts do not need updates, I'll squash all commits.

Additionally, as I've split all this into 4 different scripts, I think it has become apparent that most of the scripts in the scripts folder do not belong here, and should be put into a separate repository where all scripts to produce context results should dwell, keeping here only those scripts that are necessary for the package to run, if any. We may want to do this now to avoid polluting the folder.

@jhlegarreta jhlegarreta force-pushed the AddfMRIDataOpenNeuroQueryScript branch 8 times, most recently from e7c394b to 4f9b1e9 Compare June 23, 2025 22:52
@jhlegarreta
Copy link
Contributor Author

jhlegarreta commented Jun 24, 2025

Re: the commit message 4f9b1e9, re-run earlier commits again; 9f66a50 still incurs a deadlock; 2567252 brings query_mri_dataset_files_openneuro.py to completion (without handling errors), so I will try to bring changes one by one to query_mri_dataset_files_openneuro.py and see where the deadlock starts to happen. Thanks for the patience.

@jhlegarreta
Copy link
Contributor Author

jhlegarreta commented Jul 1, 2025

OK, this is working. Takes like 11 hours, but it is working, as opposed to less than two hours with a headlong rush that only does

return response.json()["data"]["snapshot"]["files"]

or

return response.json().get("data", {}).get("snapshot").get("files", []) or []

(the latter only solving part of the NoneType object is not iterable exceptions)

There is probably things to improve, but I think it serves the purpose and I've spent enough time on this. The retries are nice as they demonstrate that we've made efforts to get the widest possible set of cases: e.g. in the last run, all datasets were queried successfully (then getting the number of bold runs in the selection script fails for reasons related to malformed NIfTI files or other reasons, i.e. unrelated to this).

I will go ahead and squash all commits into one and then fix the style issues. At that point this will be ready to be merged. Apologies the noise generated by all the commits here.

The decision to transfer all these scripts to another repository (x-ref #89 (comment)) can be made once this is merged and be applied to all scripts.

Edit: I see that there is some issue when returning the failed queries in scripts/query_mri_dataset_files_openneuro.py depending on the exception: no failed queries may be reported, but the written dataset_id.tsv files may not account for all datasets. Anyways, the offset is only a few datasets: 4 in two different runs I did. So I am inclined to go ahead with this as is, and open an issue; I've already spent a lot of time with this. Unless we come up with a way to simulate responses from a server in a consistent manner, debugging this will be hard.

@jhlegarreta jhlegarreta force-pushed the AddfMRIDataOpenNeuroQueryScript branch from b1fc859 to 6d580fc Compare July 1, 2025 02:11
@jhlegarreta jhlegarreta marked this pull request as ready for review July 1, 2025 02:15
@jhlegarreta jhlegarreta force-pushed the AddfMRIDataOpenNeuroQueryScript branch from 6d580fc to abf9856 Compare July 2, 2025 01:05
@jhlegarreta jhlegarreta force-pushed the AddfMRIDataOpenNeuroQueryScript branch from abf9856 to eeeed67 Compare July 3, 2025 13:38
@jhlegarreta
Copy link
Contributor Author

Edit: I see that there is some issue when returning the failed queries in scripts/query_mri_dataset_files_openneuro.py depending on the exception: no failed queries may be reported, but the written dataset_id.tsv files may not account for all datasets. Anyways, the offset is only a few datasets: 4 in two different runs I did. So I am inclined to go ahead with this as is, and open an issue; I've already spent a lot of time with this. Unless we come up with a way to simulate responses from a server in a consistent manner, debugging this will be hard.

Last force push (eeeed67) solves this.

@jhlegarreta jhlegarreta force-pushed the AddfMRIDataOpenNeuroQueryScript branch 3 times, most recently from 05a22af to 401b43c Compare July 3, 2025 18:56
Copy link
Member

@oesteban oesteban left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good overall. I missed having examples in the initial docstring of every file.

Considering that all these scripts are heavily related, and the shared static variables, would it make sense to aggregate all of them into a single script, and select the behavior with click commands? Something like:

$ openneuro_query.py profile
$ openneuro_query.py datasets
$ openneuro_query.py files
$ openneuro_query.py select

@jhlegarreta
Copy link
Contributor Author

Considering that all these scripts are heavily related, and the shared static variables, would it make sense to aggregate all of them into a single script, and select the behavior with click commands?

I like the idea, but I'd propose to do it in a separate PR in order to have a working version at least. I know merging this clutters the repository, and in other circumstances, I'd be for adopting the approach here, but in this case I believe that I've spent too much time on this PR, so I'd rather have something merged and work from there. Also, as said, I think these scripts, together with the rest of the scripts that are not essential for the package, should dwell in a separate repository. So I'd vote for merging this as is to have a record of something that works, then transferring to another repository and converting to a click-based script in that repository.

@jhlegarreta jhlegarreta force-pushed the AddfMRIDataOpenNeuroQueryScript branch from f8ef0ca to e0bb868 Compare July 8, 2025 22:31
@jhlegarreta
Copy link
Contributor Author

I missed having examples in the initial docstring of every file.

Examples added.

jhlegarreta and others added 2 commits July 8, 2025 18:32
Add a series of scripts to query fMRI compliant data from OpenNeuro:
- A script that queries the existing datasets.
- A script that, based on the results of the previous one, queries the
  human (f)MRI files.
- A script that, based on the results of the previous one, identifies
  files constituting BOLD runs and characterizes them in terms of the
  number of timepoints they contain.
- A script that, based on the results of the previous one, selects the
  relevant BOLD runs for the study according to some criteria.

Notes:
- Querying the human (f)MRI files takes a few hours. If a 502 response
  is obtained from the server, the request is posted again after a
  delay. If unsuccessful with yet another 502 error, this is repeated
  a fixed number of times.
- When characterizing the BOLD runs, S3 is used instead of rely on URLs.
  It is observed that some `urls` fields contain multiple items, and
  rather than picking one of them (and even if the assumption that all
  URLs are the same or point to the same resource), prefer to use S3. It
  is experimentally observed that it is twice as fast for the queries
  done within the context of the script.
Improve descriptions of OpenNeuro query scripts:
- Improve the wording.
- Add examples on how to call the scripts.
- Add empty lines for easier reading.

Co-authored-by: Oscar Esteban <[email protected]>
@jhlegarreta jhlegarreta force-pushed the AddfMRIDataOpenNeuroQueryScript branch from e0bb868 to 726a6ea Compare July 8, 2025 22:33
@jhlegarreta
Copy link
Contributor Author

So, feel free to open a new repository to specifically host only these scripts. I will push the scripts to there when that is done, then work on the click re-implementation, then (mid- to long-term) to generalizing to any modality.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants