sortinghat.jemdoc

# jemdoc: menu{MENU2}{sortinghat.html}
= ADA Lab @ UCSD

~~~
{}{img_left}{images/sortinghat.jpg}{}{}{80px}{}
== Project SortingHat
~~~ 

=== Overview

Data preparation (prep) time still remains a major roadblock for real-world ML 
applications, since it is ususally handled fully manually by data scientists. 
The grunt work of data prep involves diverse tasks such as identifying features 
types for ML, standardizing and cleaning feature values, and feature transformations. 
This reduces data scientists' productivity and raises costs. It is also a major
impediment to the effectiveness of emerging industrial end-to-end AutoML platforms 
that build ML models on millions of datasets for enterprises, Web companies, and more.

To tackle the above challenge, we envision a new line of research to dramatically 
reduce the human effort needed in data prep for ML, as well as to accurately benchmark
the automation of data prep in AutoML platforms: /create benchmarks and labeled datasets/ 
and /use ML to automate ML data prep/. We abstract a series of specific and ubiquitous 
ML data prep tasks and formalize them as prediction tasks. 
We present the /ML Data Prep Zoo/, a community-led repository of our benchmark 
labeled datasets and pre-trained ML models for such ML data prep tasks.

So far, we have applied the above philosophy to a gateway step in ML data prep for 
tabular data: /ML Schema Extraction/. Datasets are typically exported from DBMSs into 
tools such as Python as CSV files. This creates a semantic gap between attribute types 
in a DB schema (such as strings or integers) and feature types in a ML schema (such as 
numeric or categorical). Hence, data scientists or AutoML platform users are forced to 
manually extract the ML schema, which is tedious, slow, and error-prone. 
We cast this task as a multi-class ML classification problem for the first time and 
allow users to quickly dispose of easy features, repriortize their effort towards 
features that need more human attention, and enable AutoML platforms to more accurately
and robustly identify feature types, which in turn boosts their downstream model building.

~~~
All of our labeled datasets, pre-trained models, code, and documentation are available here: [https://github.com/pvn25/ML-Data-Prep-Zoo *ML Data Prep Zoo*]
~~~

~~~
Recent long talk on vision and first task: [https://www.youtube.com/watch?v=NVrYZQ3YCuw *Youtube video*].
~~~

=== Overview Paper

- The ML Data Prep Zoo: Towards Semi-Automatic Data Preparation for ML\n
Vraj Shah and Arun Kumar\n
ACM SIGMOD 2019 DEEM Workshop | [papers/2019_DataPrepZoo_DEEM.pdf Paper PDF] and [papers/2019_SortingHat_SIGMOD.txt BibTeX] | [papers/TR_2019_DataPrepZoo.pdf TechReport]
 | [https://adalabucsd.github.io/research-blog/research/2019/06/21/mldataprepzoo.html Blog post]
 | [https://github.com/pvn25/ML-Data-Prep-Zoo Data Prep Zoo Repository on GitHub]

=== Category Deduplication

- How do Categorical Duplicates Affect ML? A New Benchmark and Empirical Analyses\n
Vraj Shah, Thomas Parashos, and Arun Kumar\n 
VLDB 2024 | [papers/2024_CategDedup_VLDB.pdf Paper PDF] | [papers/TR_2023_CategDedup.pdf TechReport] | Code and Data coming soon

- Categorical Data Deduplication\n
Soham Pachpande and Gehan Chopade\n
[papers/TR_2022_CSE234_CategDedup.pdf CSE 234 Project TechReport] |
[https://github.com/sohampachpande/data-deduplication Code, Data, and Pre-trained Models on GitHub]

=== Feature Type Inference

- Towards Benchmarking Feature Type Inference for AutoML Platforms\n
Vraj Shah, Jonathan Lacanlale, Premanand Kumar, Kevin Yang, and Arun Kumar\n
ACM SIGMOD 2021 | [papers/2021_SortingHat_SIGMOD.pdf Paper PDF] | [papers/TR_2021_SortingHat.pdf TechReport] | Talk Videos: [https://youtu.be/KAs-uU59AEM Short Talk] [https://youtu.be/dpx74zQyU3k Long Talk] | [https://github.com/pvn25/ML-Data-Prep-Zoo/tree/master/MLFeatureTypeInference Data, Code, and Pre-trained Models on GitHub] | [https://github.com/pvn25/ML-Data-Prep-Zoo/tree/master/MLFeatureTypeInference/Library Python library]

- Bringing ML-based Feature Type Inference to OpenML\n
Ryan Tran and Victor Zhu\n
[papers/TR_2022_CSE234_OpenML.pdf CSE 234 Project TechReport] | 
[https://github.com/bobotran/SortingHatLib Code Release on GitHub] |
[https://pypi.org/project/sortinghatinf/ Package on PyPi]

- Improving Feature Type Inference Accuracy of TFDV with SortingHat\n
Vraj Shah, Kevin Yang, and Arun Kumar\n
[papers/TR_2020_TFDV.pdf TechReport]

- Towards Semi-Automatic Embedded Data Type Inference\n
Jonathan Lacanlale, Vraj Shah, and Arun Kumar\n
[papers/TR_2019_Jonathan.pdf TechReport]

=== Student Contact

Vraj Shah: vps002 \[at\] eng \[dot\] ucsd \[dot\] edu

=== Acknowledgments

This project is supported in part by a Faculty Research Award from Google Research, an Amazon Research Award, and the NSF Convergence Accelerator under award OIA-2040727.