# ML for apy dashboard

## Objective

- Predict pool "apy runway"

## Machine learning aspects

- target framing: instead of trying to estimate the number of days a pool can keep up its current apy (a regression problem), I think binary classification makes more sense. Reason: less noise, therefore likely better results, and we get a probability estimate on top. So this is our target: what is the probability that a pool can keep up its apy (within a defined range) for the next 4 weeks?

- inference frequency: we call the model every hour

- metric: ROC AUC

- data: use the full historical snapshot but reduce it to daily granularity, either by using only a specific data point per day (eg midnight) or by aggregating the daily values

- X: everything I've got so far and more; backward looking stuff such as rolling/expanding stats is probably very useful

- y: it is important to calculate the target on a sorted, per-pool level; training itself will then be iid on the full batch. some first ideas on how to design the target in more detail:

encode target:

version A) [very simple]
y = {
    0: if ((apyFuture / apy) - 1) < 0
    1: else
}

version B) [not as strict and imo more useful]
y = {
    0: if ((apyFuture / apy) - 1) < -0.2 (ie the apy dropped by more than 20%)
    1: else
}

where apyFuture can be multiple things, a few which come to mind:
- a) simply the apy 30 days in the future (could be a baseline)
- b) the average of the 30-day apy values
- c) the median of the 30-day apy values to correct for potential outliers in the apy series
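
A minimal pandas sketch of how the target could be built, using version B) with the median variant c) and one example rolling feature. The column names (`pool`, `timestamp`, `apy`) and the daily reduction via the last snapshot of each day are assumptions for illustration, not the actual schema of the datasets linked below:

```python
import pandas as pd

HORIZON_DAYS = 28       # "keep up the apy for the next 4 weeks"
DROP_THRESHOLD = -0.2   # version B): more than a 20% relative drop => label 0

def build_target(df: pd.DataFrame) -> pd.DataFrame:
    """df: long format with assumed columns [pool, timestamp, apy]."""
    df = df.copy()
    df["timestamp"] = pd.to_datetime(df["timestamp"])

    # reduce the hourly snapshots to daily granularity (here: last value of each day)
    daily = (
        df.set_index("timestamp")
        .groupby("pool")["apy"]
        .resample("1D")
        .last()
        .reset_index()
        .sort_values(["pool", "timestamp"])  # target must be computed per sorted pool
    )

    # example backward looking feature: 7 day rolling mean of the apy
    daily["apy_roll_mean_7d"] = daily.groupby("pool")["apy"].transform(
        lambda s: s.rolling(7, min_periods=1).mean()
    )

    # apyFuture, variant c): median of the next HORIZON_DAYS daily apy values
    def future_median(s: pd.Series) -> pd.Series:
        # trailing rolling median on the reversed series == forward looking window,
        # shifted so that the window starts the day after the observation
        return s[::-1].rolling(HORIZON_DAYS, min_periods=HORIZON_DAYS).median()[::-1].shift(-1)

    daily["apy_future"] = daily.groupby("pool")["apy"].transform(future_median)

    # version B) encoding: 0 if the apy dropped by more than 20%, else 1
    rel_change = daily["apy_future"] / daily["apy"] - 1
    daily["y"] = (rel_change >= DROP_THRESHOLD).astype(int)

    # the last HORIZON_DAYS rows of each pool have no complete future window
    return daily.dropna(subset=["apy_future"])
```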
- models: use a baseline, logistic regression and a random forest (oob) with minimal settings; don't waste much time on tuning until we're happy (see the model sketch after this list)

- you can find the datasets used for training here:

https://defillama-datasets.s3.eu-central-1.amazonaws.com/yield-ml/dataEnriched.json
https://defillama-datasets.s3.eu-central-1.amazonaws.com/yield-ml/pools.csv

EDIT: datasets used for retraining the model on the latest available data:
https://defillama-datasets.s3.eu-central-1.amazonaws.com/yield-ml/dataEnriched_2022_05_20.json
https://defillama-datasets.s3.eu-central-1.amazonaws.com/yield-ml/pools_2022_05_20.csv (>1gb)
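
A minimal sketch of the three untuned models and the ROC AUC evaluation mentioned above, assuming `X` and `y` have already been built from the daily frame (the simple split below stands in for the iid-on-full-batch training described earlier):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def evaluate_baselines(X: np.ndarray, y: np.ndarray) -> dict:
    """Fit baseline, logistic regression and random forest with minimal settings."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y
    )

    models = {
        # predicts the class prior, useful as a sanity-check baseline (ROC AUC ~ 0.5)
        "baseline": DummyClassifier(strategy="prior"),
        "logistic_regression": LogisticRegression(max_iter=1000),
        # oob_score=True gives an out-of-bag estimate without a separate validation set
        "random_forest": RandomForestClassifier(n_estimators=200, oob_score=True, n_jobs=-1),
    }

    scores = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        proba = model.predict_proba(X_test)[:, 1]
        scores[name] = roc_auc_score(y_test, proba)
    return scores
```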

## notes regarding lambda layer/serverless

the repo includes a simple lambda handler function for ML inference. the steps (see the sketch after this list):

- loads the saved model artefact from s3
- casts the incoming data into a numpy array
- calls the predict method on the model
- returns the required prediction arrays
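
A hedged sketch of what such a handler can look like; the bucket name, object key and event shape below are placeholders, not the actual values used in this repo:

```python
import json

import boto3
import joblib
import numpy as np

s3 = boto3.client("s3")
_model = None  # cached between warm invocations


def _load_model():
    """Load the saved model artefact from s3 (placeholder bucket/key) and cache it."""
    global _model
    if _model is None:
        s3.download_file("my-model-bucket", "model.joblib", "/tmp/model.joblib")
        _model = joblib.load("/tmp/model.joblib")
    return _model


def handler(event, context):
    model = _load_model()

    # cast the incoming data into a numpy array
    X = np.array(event["data"], dtype=float)

    # call the predict methods on the model and return the required prediction arrays
    preds = model.predict(X)
    probas = model.predict_proba(X)[:, 1]

    return {
        "statusCode": 200,
        "body": json.dumps({"predictions": preds.tolist(), "probabilities": probas.tolist()}),
    }
```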

## notes for the installation process

python runtime lambdas are a bit of a pain when using scientific computing libraries as dependencies (numpy, scikit-learn, scipy etc). there are multiple difficulties:

- the dependencies are very large (~100mb together) and can only be compressed to ~80mb
- they must be built on linux or via a docker container, otherwise the python lambda runtime environment can't import the necessary libs
- the package size is so large that testing and changing the handler function gets very slow

hence I chose the following process instead:

- building a lambda layer with all required dependencies using the createlayer.sh script
- uploading the created zip file to s3 (aws s3 cp layer.zip bucket)
- creating the layer manually via the aws console
- adding that layer to the handler function inside serverless.yml (see the snippet below)
- and only then calling sls deploy
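
For reference, the last two steps roughly correspond to a serverless.yml entry like this; the function name and layer ARN are placeholders for whatever was created in the console:

```yaml
functions:
  inference:
    handler: handler.handler
    runtime: python3.8
    layers:
      # ARN of the manually created layer (placeholder account id / version)
      - arn:aws:lambda:eu-central-1:000000000000:layer:sklearn-layer:1
```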

The result will be a very small zip file for the lambda as the layer is already built and completely isolated.

The layer consists of sklearn only (which will install numpy, scipy, joblib and threadpoolctl). All versions are the latest except scipy, which I had to downgrade from 1.8.0 to 1.4.1 to get rid of:

[ERROR] Runtime.ImportModuleError: Unable to import module 'handler': /opt/python/lib/python3.8/site-packages/scipy/linalg/_fblas.cpython-38-x86_64-linux-gnu.so: ELF load command address/offset not properly aligned