cohort_creator.data package

Submodules

Utilities for data handling.

cohort_creator.data.utils.booleanify(value: bool | str | list[str] | None) → bool | str | list[str] | None

cohort_creator.data.utils.filter_data(df: pd.DataFrame, config: Any = None) → pd.DataFrame

Filter the listing of datasets based on a configuration with the following keys:

"fmriprep" : None | bool
"mriqc" : None | bool
"physio" : None | bool
"task" : str
"datatypes" : list[str] of any of the BIDS datatypes
"datatypes_and_or" : "OR" | "AND", whether any or all of the datatypes must be present
"sources" : list[str] source of the dataset (openneuro, abide…)
"sources_and_or" : "OR" | "AND", whether any or all of the sources must be present

If config is None, the DEFAULT_CONFIG is used.
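
As a rough illustration of how a few of these keys could combine, the sketch below applies a configuration dictionary to a toy dataframe using plain pandas. It is not the package's implementation: the helper `sketch_filter`, the toy rows, the column names (`has_fmriprep` plus one boolean column per datatype), and the choice of "OR" as the fallback are all assumptions made for this example. The "mriqc", "physio", "task" and "sources" keys are left out for brevity.

```python
import pandas as pd

# Toy listing of datasets; names and columns are assumptions for illustration.
toy = pd.DataFrame(
    {
        "name": ["ds01", "ds02", "ds03"],
        "has_fmriprep": [True, False, True],
        "anat": [True, True, True],
        "func": [True, False, True],
        "eeg": [False, True, False],
    }
)


def sketch_filter(df: pd.DataFrame, config: dict) -> pd.DataFrame:
    """Hypothetical filter mimicking some of the documented config keys."""
    mask = pd.Series(True, index=df.index)

    # None (or a missing key) means "do not filter on this criterion".
    if config.get("fmriprep") is not None:
        mask &= df["has_fmriprep"] == config["fmriprep"]

    datatypes = config.get("datatypes") or []
    if datatypes:
        present = df[datatypes]
        # "AND": every listed datatype must be present; "OR": at least one.
        if config.get("datatypes_and_or", "OR") == "AND":
            mask &= present.all(axis=1)
        else:
            mask &= present.any(axis=1)

    return df[mask]


config = {"fmriprep": True, "datatypes": ["func", "eeg"], "datatypes_and_or": "OR"}
selected = sketch_filter(toy, config)
print(selected["name"].tolist())  # ['ds01', 'ds03']
```

Switching "datatypes_and_or" to "AND" in this toy example returns no rows, since no dataset has both func and eeg data.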

cohort_creator.data.utils.is_known_dataset(dataset_name: str) → bool

Check if a dataset is known to the cohort creator.

cohort_creator.data.utils.known_datasets_df() → DataFrame

Return dataframe of all datasets known to the cohort creator.

Returns: pd.DataFrame containing the list of datasets known to the cohort creator.

A data dictionary can be found in: cohort_creator/data/columns_description.json

cohort_creator.data.utils.wrangle_data(df: DataFrame) → DataFrame

Do general wrangling of the known datasets.

nb_datatypes: int Number of unique BIDS supported datatypes in the dataset.
nb_sessions: int Total number of unique sessions in the dataset.
nb_authors: int Number of authors as reported in the description of the dataset.
nb_tasks: int Total number of unique tasks in the dataset.
useful_participants_tsv: bool
has_physio: bool True if the dataset contains any *_physio.tsv.gz files.
has_fmriprep: bool True if the dataset has known fmriprep preprocessed derivatives.
has_freesurfer: bool True if the dataset has known freesurfer preprocessed derivatives.
has_mriqc: bool True if the dataset has known mriqc derivatives.
source: Specifies the source of the dataset.
mean_size: Size per subject in kilobytes.
datatype: One column for each known BIDS datatype, True if the dataset contains that datatype.
total_duration: Total “scanned” duration per participant (combines runs from: func, eeg, ieeg)
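
To make a few of these columns concrete, here is a minimal pandas sketch of how counts and per-datatype flags could be derived. It is an illustration only, not the code behind `wrangle_data`: the input columns (`datatypes`, `sessions`, `tasks`), their list-valued cells, and the toy rows are assumptions.

```python
import pandas as pd

# Toy input; the raw column names and list-valued cells are assumptions.
raw = pd.DataFrame(
    {
        "name": ["ds01", "ds02"],
        "datatypes": [["anat", "func"], ["eeg"]],
        "sessions": [["01", "02"], []],
        "tasks": [["rest", "nback"], ["rest"]],
    }
)

# Count unique entries per dataset for each list-valued column.
derived = raw.assign(
    nb_datatypes=raw["datatypes"].apply(lambda x: len(set(x))),
    nb_sessions=raw["sessions"].apply(lambda x: len(set(x))),
    nb_tasks=raw["tasks"].apply(lambda x: len(set(x))),
)

# One boolean column per datatype: True if the dataset contains it.
for datatype in ["anat", "func", "eeg"]:
    derived[datatype] = raw["datatypes"].apply(lambda x, d=datatype: d in x)

print(derived[["name", "nb_datatypes", "nb_sessions", "nb_tasks", "func"]])
```

Boolean columns of this shape are convenient because they can be combined directly with `&` and `|` when filtering.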