cohort_creator.data package#
Submodules#
cohort_creator.data.utils module#
Utilities for data handling.
- cohort_creator.data.utils.booleanify(value: bool | str | list[str] | None) → bool | str | list[str] | None#
- cohort_creator.data.utils.filter_data(df: pd.DataFrame, config: Any = None) → pd.DataFrame#
Filter the listing of datasets based on some configuration.
Parameters#
- df : pd.DataFrame
Listing of datasets to filter.
- config : Any, default=None
Should be a dict with any of the following keys.
  - "fmriprep": None | bool
  - "mriqc": None | bool
  - "physio": None | bool
  - "task": str
  - "datatypes": list[str] of any of the BIDS datatypes
  - "datatypes_and_or": "OR" | "AND" if any or all of the datatypes must be present
  - "sources": list[str] source of the dataset (openneuro, abide…)
  - "sources_and_or": "OR" | "AND" if any or all of the sources must be present
If None is passed, the DEFAULT_CONFIG is used.
Returns#
- pd.DataFrame
Filtered listing of datasets.
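The `datatypes_and_or` and `sources_and_or` keys are described as choosing whether any or all of the requested values must be present. The sketch below illustrates that semantics with the standard library only. The `matches` helper and the example values (such as the task name `"rest"`) are hypothetical and not part of `cohort_creator`; the package's actual filtering may be implemented differently.

```python
from typing import Iterable

# Example config using the keys documented above (values are illustrative).
config = {
    "fmriprep": True,
    "mriqc": None,
    "physio": None,
    "task": "rest",
    "datatypes": ["anat", "func"],
    "datatypes_and_or": "AND",
    "sources": ["openneuro"],
    "sources_and_or": "OR",
}


def matches(present: Iterable[str], wanted: Iterable[str], and_or: str) -> bool:
    """Hypothetical helper: True if any ("OR") or all ("AND") wanted values are present."""
    present_set = set(present)
    wanted_list = list(wanted)
    if and_or == "AND":
        return all(value in present_set for value in wanted_list)
    if and_or == "OR":
        return any(value in present_set for value in wanted_list)
    raise ValueError(f"and_or must be 'AND' or 'OR', got {and_or!r}")


# A dataset with only anatomical data fails "AND" but would pass "OR".
only_anat = ["anat"]
print(matches(only_anat, config["datatypes"], config["datatypes_and_or"]))
print(matches(only_anat, config["datatypes"], "OR"))
```

With "AND", every requested datatype must be present, so stricter configs return fewer datasets; "OR" keeps any dataset that has at least one of them.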
- cohort_creator.data.utils.is_known_dataset(dataset_name: str) → bool#
Check if a dataset is known to the cohort creator.
Parameters#
- dataset_name : str
Name of the dataset to check, for example ds000117.
- cohort_creator.data.utils.known_datasets_df() → DataFrame#
Return dataframe of all datasets known to the cohort creator.
Returns#
- pd.DataFrame
Dataframe containing list of datasets known to the cohort creator.
A data dictionary can be found in:
cohort_creator/data/columns_description.json
- cohort_creator.data.utils.save_dataset_listing(df: DataFrame) → None#
- cohort_creator.data.utils.wrangle_data(df: DataFrame) → DataFrame#
Do general wrangling of the known datasets.
Parameters#
- df : pd.DataFrame
dataframe of known datasets
Returns#
- df : pd.DataFrame
Dataframe of known datasets with extra columns:
  - nb_datatypes: int. Number of unique BIDS supported datatypes in the dataset.
  - nb_sessions: int. Total number of unique sessions in the dataset.
  - nb_authors: int. Number of authors as reported in the description of the dataset.
  - nb_tasks: int. Total number of unique tasks in the dataset.
  - useful_participants_tsv: bool
  - has_physio: bool. True if the dataset contains any *_physio.tsv.gz files.
  - has_fmriprep: bool. True if the dataset has known fmriprep preprocessed derivatives.
  - has_freesurfer: bool. True if the dataset has known freesurfer preprocessed derivatives.
  - has_mriqc: bool. True if the dataset has known mriqc derivatives.
  - source: Specifies the source of the dataset.
  - mean_size: Size per subject in kilobytes.
  - datatype: One column for each BIDS known datatype, with True if this dataset contains that datatype.
  - total_duration: Total "scanned" duration per participant (combines runs from: func, eeg, ieeg).
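The `has_physio` column is described as being True when a dataset contains any `*_physio.tsv.gz` files. As a rough illustration of that rule, the sketch below checks a list of file names with the standard library. The `has_physio` function and the file names are hypothetical; `wrangle_data` may derive the column in a different way.

```python
from fnmatch import fnmatch


def has_physio(filenames: list[str]) -> bool:
    """Hypothetical check: True if any file name ends with _physio.tsv.gz."""
    return any(fnmatch(name, "*_physio.tsv.gz") for name in filenames)


# Illustrative BIDS-like file listings (not taken from a real dataset).
with_physio = [
    "sub-01/func/sub-01_task-rest_bold.nii.gz",
    "sub-01/func/sub-01_task-rest_physio.tsv.gz",
]
without_physio = ["sub-01/anat/sub-01_T1w.nii.gz"]

print(has_physio(with_physio))
print(has_physio(without_physio))
```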