Releases: mitmedialab/sherlock-project
Feature extraction speedup, bugfixes and model code.
This release provides:
- a significant speedup and memory reduction of the feature extraction phase,
- bugfixes in the feature extraction pipeline,
- the code of the original model architecture (TensorFlow Keras),
- alignment of the `SherlockModel` class with the scikit-learn API (i.e., with `fit`, `predict`, and `predict_proba` methods),
- improved notebooks demonstrating 1) full reproduction of the feature extraction and model training/evaluation pipelines, 2) out-of-the-box usage of the Sherlock model for a given table, and 3) how performance can be improved with additional classifiers.
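To illustrate what a scikit-learn-style interface with `fit`, `predict`, and `predict_proba` looks like, here is a minimal sketch. `ToyColumnTypeClassifier` is a hypothetical stand-in written for this example; it is not the real `SherlockModel`, whose constructor arguments and method signatures are not shown in these notes. The nearest-centroid logic and the example labels are assumptions chosen only to keep the sketch self-contained.

```python
import math
from collections import defaultdict


class ToyColumnTypeClassifier:
    """Hypothetical stand-in, NOT the real SherlockModel: a nearest-centroid
    classifier exposing fit / predict / predict_proba in scikit-learn style."""

    def fit(self, X, y):
        # Group feature vectors by label and average them into one centroid per class.
        grouped = defaultdict(list)
        for features, label in zip(X, y):
            grouped[label].append(features)
        self.classes_ = sorted(grouped)
        self.centroids_ = [
            [sum(col) / len(rows) for col in zip(*rows)]
            for rows in (grouped[c] for c in self.classes_)
        ]
        return self  # returning self allows chaining, as scikit-learn estimators do

    def predict_proba(self, X):
        # Softmax over negative Euclidean distances to each class centroid.
        probabilities = []
        for features in X:
            scores = [-math.dist(features, c) for c in self.centroids_]
            top = max(scores)
            exps = [math.exp(s - top) for s in scores]
            total = sum(exps)
            probabilities.append([e / total for e in exps])
        return probabilities

    def predict(self, X):
        # Pick the class with the highest probability for each row.
        return [
            self.classes_[max(range(len(row)), key=row.__getitem__)]
            for row in self.predict_proba(X)
        ]


# Usage with made-up two-dimensional feature vectors and made-up labels.
X_train = [[0.0, 1.0], [0.1, 0.9], [1.0, 0.0], [0.9, 0.1]]
y_train = ["address", "address", "year", "year"]
model = ToyColumnTypeClassifier().fit(X_train, y_train)
print(model.predict([[0.05, 0.95]]))
```

Following this convention means that code written against generic scikit-learn estimators can, in principle, call the same three methods on the model.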
Contributions by:
@lowecg
@madelonhulsebos
Original code
This release reflects the code used for the experiments in the paper "Sherlock: a deep learning approach to semantic data type detection" (available on arXiv). It provides code for:
- Download of the original train and test data used for the experiment results as reported in the paper.
- Feature extraction to numerically represent new columns.
- Evaluating a trained Sherlock model on unseen table columns.
- Retraining the original Sherlock model.
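As a rough illustration of what "numerically representing a column" means, the sketch below maps a column of raw cell values to a fixed-length list of numbers. The function `toy_column_features` and its four statistics are hypothetical and chosen for brevity; they are not Sherlock's actual feature set or extraction code.

```python
from statistics import mean


def toy_column_features(values):
    """Hypothetical illustration, NOT Sherlock's feature set: turn a column of
    raw cell values into a fixed-length numeric vector."""
    cells = [str(v) for v in values]
    n = len(cells)
    if n == 0:
        # An empty column still yields a vector of the same length.
        return [0.0, 0.0, 0.0, 0.0]
    return [
        float(mean(len(c) for c in cells)),                      # average cell length
        sum(c.isdigit() for c in cells) / n,                     # fraction of all-digit cells
        sum(any(ch.isalpha() for ch in c) for c in cells) / n,   # fraction of cells with letters
        len(set(cells)) / n,                                     # fraction of distinct values
    ]


# Usage with two made-up columns.
print(toy_column_features(["1999", "2004", "2004"]))
print(toy_column_features(["Amsterdam", "Boston"]))
```

Because every column yields a vector of the same length, the vectors can be stacked into a matrix and passed to a classifier regardless of how many cells each column contains.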
This release contains inefficiencies and bugs, so the latest release of this project is recommended for production settings and new research projects. More about this project can be found on the project website.