cohort_creator.data package#
Submodules#
cohort_creator.data.utils module#
Utilities for data handling.
- cohort_creator.data.utils.booleanify(value: bool | str | list[str] | None) -> bool | str | list[str] | None #
- cohort_creator.data.utils.filter_data(df: pd.DataFrame, config: Any = None) -> pd.DataFrame #
Filter the listing of datasets based on some configuration.
Parameters#
- df : pd.DataFrame
Listing of datasets to filter.
- config : Any, default=None
Should be a dict with any of the following keys.
  - "fmriprep" : None | bool
  - "mriqc" : None | bool
  - "physio" : None | bool
  - "task" : str
  - "datatypes" : list[str] of any of the BIDS datatypes
  - "datatypes_and_or" : "OR" | "AND", whether any or all of the datatypes must be present
  - "sources" : list[str], source of the dataset (openneuro, abide…)
  - "sources_and_or" : "OR" | "AND", whether any or all of the sources must be present
If None is passed, the DEFAULT_CONFIG is used.
Returns#
- pd.DataFrame
Filtered listing of datasets.
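As an illustration, a config using the keys listed above might look like the following. Only the key names come from this page; the values are made up, and the small `matches` helper is hypothetical. It sketches one plausible reading of the "OR" / "AND" options and is not the package's implementation.

```python
from __future__ import annotations

# Hypothetical example values; only the key names come from the documentation.
config = {
    "fmriprep": True,
    "mriqc": None,
    "physio": None,
    "task": "rest",
    "datatypes": ["anat", "func"],
    "datatypes_and_or": "AND",
    "sources": ["openneuro"],
    "sources_and_or": "OR",
}


def matches(present: set[str], wanted: list[str], and_or: str) -> bool:
    """Assumed semantics: "AND" needs every wanted item, "OR" needs at least one."""
    if and_or == "AND":
        return all(item in present for item in wanted)
    if and_or == "OR":
        return any(item in present for item in wanted)
    raise ValueError(f"and_or must be 'OR' or 'AND', got {and_or!r}")


# A dataset with anat, func and fmap data satisfies the "AND" requirement.
print(matches({"anat", "func", "fmap"}, config["datatypes"], config["datatypes_and_or"]))
# A dataset with only anat data does not.
print(matches({"anat"}, config["datatypes"], config["datatypes_and_or"]))
```

With the package installed, a dict of this shape would be passed as the second argument, as in `filter_data(df, config)`.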
- cohort_creator.data.utils.is_known_dataset(dataset_name: str) -> bool #
Check if a dataset is known to the cohort creator.
Parameters#
- dataset_name : str
Name of the dataset to check, for example ds000117.
- cohort_creator.data.utils.known_datasets_df() -> DataFrame #
Return dataframe of all datasets known to the cohort creator.
Returns#
- pd.DataFrame
Dataframe containing list of datasets known to the cohort creator.
A data dictionary can be found in:
cohort_creator/data/columns_description.json
- cohort_creator.data.utils.save_dataset_listing(df: DataFrame) -> None #
- cohort_creator.data.utils.wrangle_data(df: DataFrame) -> DataFrame #
Do general wrangling of the known datasets.
Parameters#
- df : pd.DataFrame
dataframe of known datasets
Returns#
- df : pd.DataFrame
dataframe of known datasets with extra columns
  - nb_datatypes : int
    Number of unique BIDS supported datatypes in the dataset.
  - nb_sessions : int
    Total number of unique sessions in the dataset.
  - nb_authors : int
    Number of authors as reported in the description of the dataset.
  - nb_tasks : int
    Total number of unique tasks in the dataset.
  - useful_participants_tsv : bool
  - has_physio : bool
    True if the dataset contains any *_physio.tsv.gz files.
  - has_fmriprep : bool
    True if the dataset has known fmriprep preprocessed derivatives.
  - has_freesurfer : bool
    True if the dataset has known freesurfer preprocessed derivatives.
  - has_mriqc : bool
    True if the dataset has known mriqc derivatives.
  - source : Specifies the source of the dataset.
  - mean_size : Size per subject in kilobytes.
  - datatype : One column for each known BIDS datatype, with True if this dataset contains that datatype.
  - total_duration : Total "scanned" duration per participant (combines runs from: func, eeg, ieeg).
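To make two of these columns concrete, here is a minimal sketch of how has_physio and nb_datatypes could be derived from a list of file paths. The file listing, the helper functions, and the assumption that a file's parent folder names its datatype are all illustrative; this page does not describe how wrangle_data actually computes these columns.

```python
from fnmatch import fnmatch
from pathlib import PurePosixPath

# Hypothetical file listing for a single dataset.
files = [
    "sub-01/anat/sub-01_T1w.nii.gz",
    "sub-01/func/sub-01_task-rest_bold.nii.gz",
    "sub-01/func/sub-01_task-rest_physio.tsv.gz",
]


def has_physio(paths):
    """True if any file name matches the *_physio.tsv.gz pattern."""
    return any(fnmatch(PurePosixPath(p).name, "*_physio.tsv.gz") for p in paths)


def nb_datatypes(paths):
    """Count distinct parent folders, assumed here to be datatype folders."""
    return len({PurePosixPath(p).parent.name for p in paths})


print(has_physio(files))    # True
print(nb_datatypes(files))  # 2
```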