cohort_creator.data package#

Submodules#

cohort_creator.data.utils module#

Utilities for data handling.

cohort_creator.data.utils.booleanify(value: bool | str | list[str] | None) → bool | str | list[str] | None#
cohort_creator.data.utils.filter_data(df: pd.DataFrame, config: Any = None) → pd.DataFrame#

Filter the listing of datasets based on some configuration.

Parameters#

df : pd.DataFrame

Listing of datasets to filter.

config : Any, default=None

Should be a dict with any of the following keys.

  • "fmriprep" : None | bool

  • "mriqc" : None | bool

  • "physio" : None | bool

  • "task" : str

  • "datatypes" : list[str] of any of the BIDS datatypes

  • "datatypes_and_or" : "OR" | "AND", whether any (OR) or all (AND) of the datatypes must be present

  • "sources" : list[str], sources of the dataset (openneuro, abide…)

  • "sources_and_or" : "OR" | "AND", whether any (OR) or all (AND) of the sources must be present

If None is passed, DEFAULT_CONFIG is used.
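As an illustration of the keys listed above, here is a minimal sketch of a config dict together with the AND/OR rule for datatypes. The helpers check_config and matches are hypothetical and written only for this example; they are not part of cohort_creator, and the task name "rest" is an assumed value. The sketch does not reproduce how filter_data is implemented.

```python
from typing import Any

# Keys documented for the ``config`` argument of ``filter_data``.
DOCUMENTED_KEYS = {
    "fmriprep", "mriqc", "physio", "task",
    "datatypes", "datatypes_and_or", "sources", "sources_and_or",
}


def check_config(config: dict[str, Any]) -> None:
    """Hypothetical helper: reject keys or AND/OR values not listed in the docs."""
    unknown = set(config) - DOCUMENTED_KEYS
    if unknown:
        raise KeyError(f"Unknown config keys: {sorted(unknown)}")
    for key in ("datatypes_and_or", "sources_and_or"):
        if key in config and config[key] not in ("OR", "AND"):
            raise ValueError(f"{key} must be 'OR' or 'AND', got {config[key]!r}")


def matches(present: set[str], wanted: list[str], and_or: str) -> bool:
    """Hypothetical helper: 'AND' needs every wanted item, 'OR' needs at least one."""
    if and_or == "AND":
        return all(item in present for item in wanted)
    return any(item in present for item in wanted)


# Example config using only documented keys; the values are assumptions.
config = {
    "fmriprep": True,
    "task": "rest",
    "datatypes": ["anat", "func"],
    "datatypes_and_or": "AND",
}
check_config(config)

# A dataset with anat, func and dwi satisfies "AND"; one with only anat does not.
print(matches({"anat", "func", "dwi"}, config["datatypes"], "AND"))  # True
print(matches({"anat"}, config["datatypes"], "AND"))  # False
print(matches({"anat"}, config["datatypes"], "OR"))  # True
```

A dict like this would then be passed as the second argument, following the documented signature filter_data(df, config).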

Returns#

pd.DataFrame

Filtered listing of datasets.

cohort_creator.data.utils.is_known_dataset(dataset_name: str) → bool#

Check if a dataset is known to the cohort creator.

Parameters#

dataset_name : str

Name of the dataset to check, for example ds000117.

cohort_creator.data.utils.known_datasets_df() → DataFrame#

Return dataframe of all datasets known to the cohort creator.

Returns#

pd.DataFrame

Dataframe containing list of datasets known to the cohort creator.

A data dictionary can be found in: cohort_creator/data/columns_description.json

cohort_creator.data.utils.save_dataset_listing(df: DataFrame) → None#
cohort_creator.data.utils.wrangle_data(df: DataFrame) → DataFrame#

Do general wrangling of the known datasets.

Parameters#

df : pd.DataFrame

Dataframe of known datasets.

Returns#

df : pd.DataFrame

Dataframe of known datasets with extra columns:

  • nb_datatypes: int Number of unique BIDS supported datatypes in the dataset.

  • nb_sessions: int Total number of unique sessions in the dataset.

  • nb_authors: int Number of authors as reported in the description of the dataset.

  • nb_tasks: int Total number of unique tasks in the dataset.

  • useful_participants_tsv: bool

  • has_physio: bool True if the dataset contains any *_physio.tsv.gz files.

  • has_fmriprep: bool True if the dataset has known fmriprep preprocessed derivatives.

  • has_freesurfer: bool True if the dataset has known freesurfer preprocessed derivatives.

  • has_mriqc: bool True if the dataset has known mriqc derivatives.

  • source: Source of the dataset.

  • mean_size: Size per subject in kilobytes.

  • datatype: One column for each known BIDS datatype, True if the dataset contains that datatype.

  • total_duration: Total “scanned” duration per participant (combines runs from: func, eeg, ieeg)
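To make a few of the derived columns concrete, the sketch below computes nb_tasks, nb_datatypes and has_physio for a single made-up record using only the standard library. The record layout (keys "tasks", "datatypes", "files") and the helper derive_columns are assumptions for illustration; the actual input columns and the logic inside wrangle_data may differ.

```python
from fnmatch import fnmatch

# Hypothetical raw record for one dataset; the real listing's columns may differ.
record = {
    "name": "example_dataset",
    "tasks": ["facerecognition", "rest", "rest"],
    "datatypes": ["anat", "func", "meg"],
    "files": [
        "sub-01/func/sub-01_task-rest_bold.nii.gz",
        "sub-01/func/sub-01_task-rest_physio.tsv.gz",
    ],
}


def derive_columns(rec: dict) -> dict:
    """Hypothetical helper mirroring three of the documented extra columns."""
    return {
        # Count each task and datatype once, as the column descriptions say "unique".
        "nb_tasks": len(set(rec["tasks"])),
        "nb_datatypes": len(set(rec["datatypes"])),
        # True as soon as one file name matches the *_physio.tsv.gz pattern.
        "has_physio": any(fnmatch(f, "*_physio.tsv.gz") for f in rec["files"]),
    }


print(derive_columns(record))
# {'nb_tasks': 2, 'nb_datatypes': 3, 'has_physio': True}
```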

Module contents#