This is a personal project to create easily accessible and sanitized datasets related to the Buddhist canon.
You can find all the data, along with the scripts I used to create it, in the ./data folder.
Some datasets are just "raw data", like the translations of the Sutta Pitaka, while others involve some modelling, such as text embeddings.
If you'd like to collaborate (for example, by adding data to this collection), just get in touch.
This project draws mostly on the open-source data of SuttaCentral.