preprocessing and feature extraction for ds000201
pulling the datalad container into envs to version control content
git clone https://github.com/edickie/ds000201_preproc.git
location of this repo on the CAMH Scientific Computing Cluster (with the data folders) is:
/external/rprshnas01/netdata_kcni/edlab/ds000201_preproc
General Overview and Objectives for this project (given by the data knights)
Neuroimaging Learning Resources
We will be working with data from The Stockholm Sleepy Brain Study open dataset.
- [Primary Deliverable] Multiclass Classification of the specific "task" being performed by the subject, based on the fMRI images in the dataset.
- [Stretch Goal #1] Unsupervised machine learning feature generation on the full dataset, looking for correlations between the "tasks" being performed by the subjects, with the aim of learning more about how the tasks are represented in the fMRI data.
- [Stretch Goal #2] Identify any correlation of the generated/existing features with the numerical "sleepiness" scale as ranked in the sleepiness task. The end goal is to predict the "sleepiness" rating based on the analyzed fMRI data.
The data is available on openneuro.org and is organized according to the Brain Imaging Data Structure (BIDS), meaning that certain important pieces of information are implied by the naming convention of the files and/or encoded in specific places (e.g. participant information is in a participants.tsv file). The BIDS specification is available here.
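For example, in BIDS each scan is identified entirely by its folder and file names; an illustrative slice of this dataset's layout (the subject label here is hypothetical, the task names are the ones discussed in the QA notes below; check the actual tree for the exact entities):
data/input/bids/
├── dataset_description.json
├── participants.tsv
└── sub-9001/
    ├── ses-1/
    │   ├── anat/sub-9001_ses-1_T1w.nii.gz
    │   └── func/sub-9001_ses-1_task-rest_bold.nii.gz
    └── ses-2/
        └── func/sub-9001_ses-2_task-sleepiness_bold.nii.gz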
- Our recent course for BIDS data organization and preprocessing here
- Our recent course on python neuroimaging analysis - with nilearn here
- There are also some amazing resources at the ongoing neurohackademy course.
Note: the full raw data download is about 176G. I wanted to pull the data using datalad, but the datalad repo had broken download links. So instead this data was pulled using the AWS CLI (in a Singularity container).
On the SSC this was run 2020-07-30, Singularity version 3.5.3.
# module load singularity
# note: run this from the repository root, i.e. the ds000201_preproc folder
singularity build envs/aws-cli.sif docker://amazon/aws-cli
The data download
singularity run envs/aws-cli.sif s3 sync --no-sign-request s3://openneuro.org/ds000201 data/input/bids/
For reasons I do not understand, dataset_description.json and participants.tsv did not get pulled down from the s3 bucket, so I will pull those separately:
cd data/input/bids
wget https://openneuro.org/crn/datasets/ds000201/snapshots/1.0.3/files/participants.tsv
wget https://openneuro.org/crn/datasets/ds000201/snapshots/1.0.3/files/dataset_description.json
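A quick sanity check that the sync is complete: the number of sub-* folders should match the number of rows in participants.tsv, minus the header.
cd ../../..  # back to the repository root
ls -1d data/input/bids/sub-* | wc -l
echo $(( $( wc -l < data/input/bids/participants.tsv ) - 1 ))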
To preprocess, we will use the current stable release of fmriprep (20.1.1).
singularity build envs/fmriprep-20.1.1.simg docker://poldracklab/fmriprep:20.1.1
Testing and setting up for the singularity run.
We need a copy of the FreeSurfer license to be in:
$SCRATCH/fmriprep_home/.freesurfer.license
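A sketch of staging that license (the source path of your license file is an assumption here):
# put the FreeSurfer license where the container's home will be mounted
mkdir -p ${SCRATCH}/fmriprep_home
cp /path/to/your/freesurfer/license.txt ${SCRATCH}/fmriprep_home/.freesurfer.license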
Testing the singularity binds:
cd $SCRATCH/ds000201_preproc
singularity shell --cleanenv \
-B ${SCRATCH}/fmriprep_home:/home/fmriprep --home /home/fmriprep \
envs/fmriprep-20.1.1.simg
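Once at the Singularity prompt, a quick check that the home bind took effect ($HOME should resolve to /home/fmriprep and the license file should be visible):
echo $HOME
ls -a /home/fmriprep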
From inside the container, set up templateflow (note: do this before submitting a job):
python -c "from templateflow.api import get; get(['MNI152NLin2009cAsym', 'MNI152NLin6Asym'])"
python -c "from templateflow.api import get; get(['fsaverage', 'fsLR'])"
python -c "from templateflow.api import get; get(['OASIS30ANTs'])"
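The templates land in templateflow's default home under the mounted home directory (i.e. /home/fmriprep/.cache/templateflow, unless TEMPLATEFLOW_HOME is set); a quick check that all five are there:
ls /home/fmriprep/.cache/templateflow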
## go to the repo and pull new changes
cd $SCRATCH/ds000201_preproc
git pull
## calculate the length of the array job (each task will process SUB_SIZE participants)
SUB_SIZE=5
N_SUBJECTS=$(( $( wc -l data/input/bids/participants.tsv | cut -f1 -d' ' ) - 1 ))
# note: bc truncates here, so with the inclusive 0-${array_job_length} range below
# every subject is covered (the final task may be short or empty)
array_job_length=$(echo "${N_SUBJECTS}/${SUB_SIZE}" | bc)
echo "number of array tasks is: ${array_job_length}"
## submit the array job to the queue
sbatch --array=0-${array_job_length} $SCRATCH/ds000201_preproc/code/00_fmriprep_anat.sh
## go to the repo and pull new changes
cd $SCRATCH/ds000201_preproc
git pull
## figuring out appropriate array-job size
SUB_SIZE=1 # for func the sub size moves to 1 participant because there are two runs and 8 tasks per run
N_SUBJECTS=$(( $( wc -l data/input/bids/participants.tsv | cut -f1 -d' ' ) - 1 ))
array_job_length=$(echo "$N_SUBJECTS/${SUB_SIZE}" | bc)
echo "number of array is: ${array_job_length}"
## submit the array-job
sbatch --array=0-${array_job_length} $SCRATCH/ds000201_preproc/code/01_fmriprep_func.sh
Update - when looking at the QA pages, the fieldmap-corrected runs look worse than the SyN-corrected (fieldmap-free) runs, so we will rerun everything with fieldmaps ignored (fmriprep's --ignore fieldmaps flag). Urg.
## go to the repo and pull new changes
cd $SCRATCH/ds000201_preproc
git pull
## figuring out appropriate array-job size
SUB_SIZE=1 # for func the sub size moves to 1 participant because there are two runs and 8 tasks per run
N_SUBJECTS=$(( $( wc -l data/input/bids/participants.tsv | cut -f1 -d' ' ) - 1 ))
array_job_length=$(echo "$N_SUBJECTS/${SUB_SIZE}" | bc)
echo "number of array is: ${array_job_length}"
## submit the array-job
sbatch --array=0-${array_job_length} $SCRATCH/ds000201_preproc/code/01_fmriprep_func_allsynthSDC.sh
Syncing the QA pages and outputs over to the CAMH cluster (from a datamover node):
ssh nia-dm2
screen
rsync -av ${SCRATCH}/ds000201_QA_sdc.zip [email protected]:/external/rprshnas01/netdata_kcni/edlab/
rsync -av ${SCRATCH}/ds000201_QA_sdc.zip [email protected]:/external/rprshnas01/tigrlab/scratch/edickie/
rsync -av $SCRATCH/ds000201_preproc/data/derived/fmriprep [email protected]:/external/rprshnas01/tigrlab/scratch/edickie/tmp_sleep_proj/
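Slurm's accounting database is one way to pull the per-task exit codes afterwards (a sketch; the job id is the fieldmap-run one mentioned below):
sacct -j 3837049 --format=JobID,JobName%20,State,ExitCode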
From the exit codes, only three subjects report a non-zero exit status.
sub-9078: actually there is no functional data in the input folders, so this person we will just exclude.
- sub-9070 (task 57, exit 1): weird, just no functional processing?
- sub-9078 (task 65, exit 1): weird, just no functional processing? are the scans missing from the folder?
- sub-9100 (task 85, exit 1): lost connection with the burst buffer? just re-submit?
QA_dir=${SCRATCH}/ds000201_QA
FMRIPREP_dir=${SCRATCH}/ds000201_preproc/data/derived/fmriprep
mkdir -p ${QA_dir}
subjects=$(cd ${FMRIPREP_dir}; ls -1d sub-* | grep -v html)
cd ${SCRATCH}
for subject in ${subjects}; do
cp ${FMRIPREP_dir}/${subject}.html ${QA_dir}
mkdir -p ${QA_dir}/${subject}
rsync -av ${FMRIPREP_dir}/${subject}/figures ${QA_dir}/${subject}
done
Update - these QA images can be viewed from JupyterHub, but only using a Firefox browser! (Also make sure to add a symlink from your home to the drive where the fmriprep outputs are, so that you can view the images via JupyterHub.)
However, it might be a lot easier to go through everything quickly if the desc-sdc_bold QA images (the most important ones to look at) are all in one local folder, so I will write a script to do that:
QA_dir=${SCRATCH}/ds000201_QA_sdc
FMRIPREP_dir=${SCRATCH}/ds000201_preproc/data/derived/fmriprep
mkdir -p ${QA_dir}
cd ${SCRATCH}
rsync -av ${FMRIPREP_dir}/*/figures/*_desc-sdc_bold.svg ${QA_dir}/
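The folder can then be zipped up for the rsync transfer shown earlier:
cd ${SCRATCH}
zip -r ds000201_QA_sdc.zip ds000201_QA_sdc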
Major issue - it is clear that:
- the arrows, faces, and hands tasks used different acquisition parameters than rest and sleepiness
- the arrows, faces, and hands tasks do not have full brain coverage; there are cases where both the top and the bottom/cerebellum of the brain are omitted
- the rest and sleepiness runs appear to have full brain coverage for the most part
- one subject (the last one) needs to be rerun
- registration for sub-9038 (only one of the sessions) is very bad (exclude)
- sub-9094 ses-2 anatomical looks odd sometimes.
For most fmriprep runs, when I check the resource usage, they hang at full RAM and no CPU usage for hours, and then something happens and they all continue to run normally.
A couple of participants completely time out in this state.
For the fieldmap run it was 3837049_86, or sub-9100; interestingly, this job was submitted twice due to an array indexing error.
For many of the most recent runs, the hanging processes all finished between 1-2am; this seems to correspond to the writing of the config-*.toml to the workdir.
- things we could try:
- one would be to not share the home, and mount templateflow separately? (see the sketch after this list)
- another would be to explicitly point fmriprep at the existing freesurfer anat derivatives.
- mount a different $BBUFFER for each process?
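A sketch of what the first two ideas could look like (the bind paths and a pre-seeded ${SCRATCH}/templateflow directory are assumptions; --fs-subjects-dir and --fs-license-file are fmriprep's flags for reusing existing FreeSurfer outputs and for the license; ${subject} would come from the array mapping below):
# hypothetical sketch: separate templateflow mount + explicit freesurfer derivatives
export SINGULARITYENV_TEMPLATEFLOW_HOME=/templateflow
singularity run --cleanenv \
  -B ${SCRATCH}/templateflow:/templateflow \
  -B ${SCRATCH}/fmriprep_home:/li \
  -B ${SCRATCH}/ds000201_preproc/data:/data \
  envs/fmriprep-20.1.1.simg \
  /data/input/bids /data/derived participant \
  --participant-label ${subject} \
  --fs-license-file /li/.freesurfer.license \
  --fs-subjects-dir /data/derived/freesurfer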
Mapping the array id to the subject id:
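A sketch of how the job scripts can turn the SLURM array task id into a block of participant labels (the chunking matches the submission snippets above; the exact parsing of participants.tsv is an assumption):
# pick this task's block of SUB_SIZE participants from participants.tsv
SUB_SIZE=5
N=${SLURM_ARRAY_TASK_ID}
subjects=$( sed 1d data/input/bids/participants.tsv \
    | cut -f1 \
    | sed -n "$(( N * SUB_SIZE + 1 )),$(( (N + 1) * SUB_SIZE ))p" \
    | sed 's/sub-//' )
echo "task ${N} will process: ${subjects}"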
We wish the datalad route would work (it would let git track exactly which version of the input data we used), but it did not, because of the broken download links mentioned above.
Clone the input data source as a submodule into this repo:
singularity pull envs/datalad_fullmaster.sif shub://datalad/datalad:fullmaster
singularity run envs/datalad_fullmaster.sif clone --dataset . ///openneuro/ds000201 data/input/bids
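If the download links had been intact, individual files could then be fetched on demand with standard datalad usage (sketch):
singularity run envs/datalad_fullmaster.sif get data/input/bids/participants.tsv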