This package provides a framework and examples for running machine
learning experiments in sleep classification. Pisces offers automated
data set and subject/feature discovery based on a light folder
structure, loading CSVs into pandas DataFrame objects. A number of
tools are also provided for plotting, scoring, and debugging sleep
research pipelines.
Start by making a Python or conda environment with Python 3.11 and
installing the requirements from file. For example, you can create a
conda environment called pisces by running:
conda create -n pisces python=3.11
conda activate pisces
In the same terminal (so that your new conda environment is active),
navigate to the directory where you’d like to clone the package. Then run
the following commands to clone it and install it with pip in editable
mode (the -e flag):
git clone https://github.com/Arcascope/pisces.git
cd pisces
pip install -e .
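To confirm that the editable install is visible from the active environment, a quick check along these lines may help. It is a minimal sketch using only the standard library; the helper name is_importable is hypothetical and is not part of Pisces.

```python
import importlib.util


def is_importable(package_name: str) -> bool:
    """Return True if the active environment can locate the top-level package."""
    return importlib.util.find_spec(package_name) is not None


if __name__ == "__main__":
    # Expected to report True once `pip install -e .` has succeeded
    # in the currently active environment.
    print("pisces importable:", is_importable("pisces"))
```

If this reports False, the most likely cause is that the environment used for the install is not the one currently active.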
You may end up with a version of Keras incompatible with the marshalled
data in pisces/cached_models. In that case, run pisces_setup in a
terminal; pisces_setup is on your path as long as a Python environment
with pisces installed is active.
The pipeline is intended to be flexible and can be easily extended to include new models, datasets, and evaluation metrics. In version 2.0, we have streamlined the library to prioritize nimbleness and easy debugging.
The examples/NHRC folder shows how to use pisces with other packages,
like sklearn and tensorflow, that provide machine learning frameworks.
Pisces automatically discovers data sets that match a simple, flexible
format inside a given directory. The analysis in examples/NHRC/src
finds data contained in the data folder of the Pisces repository. The
code is simple:
from pisces.data_sets import DataSetObject
sets = DataSetObject.find_data_sets("../data")
walch = sets['walch_et_al']
hybrid = sets['hybrid_motion']
Now we have two DataSetObjects, walch and hybrid, that can be queried
for their subjects and features. These were discovered because they are
folders inside of data that have a compatible structure.
These two sets were discovered because of the presence of at least one
subdirectory matching the glob expression cleaned_*. Every subdirectory
that matches this pattern is considered a feature, so based on the
example below, Pisces discovers that hybrid_motion and walch_et_al both
have psg, accelerometer, and activity features, in addition to any other
cleaned_* folders they may contain that are not listed here.
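The discovery rule can be illustrated with a small, self-contained sketch. This is not the Pisces implementation: the function names find_feature_dirs, build_demo_tree, and demo are hypothetical, and the code only assumes the cleaned_* convention described above. It builds a throwaway directory tree and reports which folders would count as data sets and which features each one has.

```python
import tempfile
from pathlib import Path


def find_feature_dirs(root: Path, prefix: str = "cleaned_") -> dict[str, list[str]]:
    """Map each data set folder under root to its sorted feature names."""
    found = {}
    for set_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        # A feature is any subdirectory whose name matches cleaned_*;
        # the feature name is the part matched by the wildcard.
        features = sorted(
            d.name[len(prefix):]
            for d in set_dir.glob(prefix + "*")
            if d.is_dir()
        )
        if features:  # folders with no cleaned_* subdirectory are skipped
            found[set_dir.name] = features
    return found


def build_demo_tree(root: Path) -> None:
    """Create an empty folder layout mirroring the two example data sets."""
    for set_name in ("walch_et_al", "hybrid_motion"):
        for feature in ("accelerometer", "activity", "psg"):
            (root / set_name / f"cleaned_{feature}").mkdir(parents=True)
    (root / "walch_et_al" / "raw").mkdir()  # does not match cleaned_*, ignored
    (root / "notes").mkdir()                # has no cleaned_* folder, not a data set


def demo() -> dict[str, list[str]]:
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        build_demo_tree(root)
        return find_feature_dirs(root)


if __name__ == "__main__":
    print(demo())
```

In this sketch both walch_et_al and hybrid_motion are reported with the features accelerometer, activity, and psg, while the raw and notes folders are skipped.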
The data directory looks like:
data
├── walch_et_al
│ ├── cleaned_accelerometer
│ │ ├── 46343_cleaned_motion.out
│ │ ├── 759667_cleaned_motion.out
│ │ ├── ...
│ ├── cleaned_activity
│ │ ├── 46343_cleaned_counts.out
│ │ ├── 759667_cleaned_counts.out
│ │ ├── ...
│ ├── cleaned_psg
│ │ ├── 46343_cleaned_psg.out
│ │ ├── 759667_cleaned_psg.out
│ │ ├── ...
├── hybrid_motion
│ ├── cleaned_accelerometer
│ │ ├── 46343.csv
│ │ ├── 759667.csv
│ │ ├── ...
│ ├── cleaned_activity
│ │ ├── 46343.csv
│ │ ├── 759667.csv
│ │ ├── ...
│ ├── cleaned_psg
│ │ ├── 46343_labeled_sleep.txt
│ │ ├── 759667_labeled_sleep.txt
│ │ ├── ...
- The data set is discovered based on the presence of a subdirectory
  matching the glob expression cleaned_*.
- Every subdirectory that matches this pattern is considered a
  feature; these features are named after the part matching *.
- Subjects within a feature are computed per-feature, based on the
  varying and constant parts of the filenames within each feature
  directory. Said in a less fancy way, because the walch_et_al
  accelerometer folder contains the files 46343_cleaned_motion.out
  and 759667_cleaned_motion.out, which have _cleaned_motion.out in
  common, Pisces identifies 46343 and 759667 as subject IDs that have
  accelerometer feature data for walch_et_al.
  - It is no problem if some subjects are missing a certain feature.
    When the feature data for an existing subject, without that
    feature in their data, is requested, the feature will return
    None for that subject.
  - The naming scheme can vary greatly between features. However,
    the subject ID MUST be the prefix on the filenames. For example,
    46343_cleaned_motion.out and 46343_labeled_sleep.txt are both
    for the same subject, 46343. If instead we named those
    final_46343_cleaned_motion.out and 46343_labeled_sleep.txt, then
    the subject’s data would be broken into two subjects, 46343 and
    final_46343.
- There is no a-priori rule about what features in a data set give the labels and which are model inputs. This allows you to call the label feature whatever you want, or use a mixture of features (psg + …) as labels for complex models supporting rich outputs.
- You can have other folders inside data set directories that do NOT
  match cleaned_*, and these are totally ignored. This allows you to
  store other data, like raw data or metadata, in the same directory as
  the cleaned data.
- You can have other folders whose sub-structure does not match the
  subject/feature structure, and these are totally ignored.
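The subject ID rule in the list above can be sketched as follows. This is an illustration under stated assumptions, not the algorithm Pisces uses: it treats the longest suffix shared by every filename in a feature folder as the constant part and whatever precedes it as the subject ID. The names common_suffix and subject_ids are hypothetical, and the ID 999 is made up to show a missing feature.

```python
import os


def common_suffix(names: list[str]) -> str:
    """Return the longest string that every name ends with."""
    reversed_names = [name[::-1] for name in names]
    return os.path.commonprefix(reversed_names)[::-1]


def subject_ids(filenames: list[str]) -> dict[str, str]:
    """Map each inferred subject ID (the varying prefix) to its filename."""
    suffix = common_suffix(filenames)
    return {name[: len(name) - len(suffix)]: name for name in filenames}


accelerometer = subject_ids(
    ["46343_cleaned_motion.out", "759667_cleaned_motion.out"]
)
print(sorted(accelerometer))  # ['46343', '759667']

# A subject with no file for a feature simply has no entry, so a lookup
# yields None. The ID 999 is a made-up placeholder for this illustration.
activity = subject_ids(["46343_cleaned_counts.out", "999_cleaned_counts.out"])
print(activity.get("759667"))  # None

# A leading "final_" becomes part of the inferred ID, which is why the
# subject ID has to be the filename prefix.
renamed = subject_ids(
    ["final_46343_cleaned_motion.out", "final_759667_cleaned_motion.out"]
)
print(sorted(renamed))  # ['final_46343', 'final_759667']
```

This naive version has limits the sketch does not handle: with a single file the whole name counts as the suffix, and IDs that happen to share trailing characters would have those characters absorbed into the suffix.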