# The Generic and Semantic Profiling of Big Datasets


We call ourselves CreaTAK:

- T is for Theodore
- A is for Ankush
- K is for Kat

...and we all like to create stuff.

In this study, we ran Apache Spark on NYU's 48-node Hadoop cluster (Cloudera CDH 5.15.0) to profile 1,159 datasets from NYC Open Data, both generically and semantically. We refer to these two profiling methods as Task 1 and Task 2, respectively.

Across the 1,159 files we profiled in Task 1, we found 11,674 integer columns, 13,646 text columns, 1,137 date/time columns, and 4,527 real-number columns. For Task 2, we analyzed 260 columns and identified the semantic types of 210 of them, with a precision of 72.40%.


## Instructions

- Log into NYU's DUMBO cluster.
- Load the correct versions of Python and Spark:
  - `module load python/gnu/3.6.5`
  - `module load spark/2.4.0`

## Task 1: Generic Profiling

- Navigate to `task1/src/`.
- Run:
  - `spark-submit --conf spark.pyspark.python=$PYSPARK_PYTHON task1.py`
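
For orientation, the sketch below shows the general idea behind this kind of generic profiling: cast each column's values to candidate types and count how many succeed. The input path and the casting rules here are illustrative assumptions, not the actual implementation in `task1.py`.

```python
# Minimal sketch of generic column-type profiling with PySpark.
# The dataset path and type-inference rules are illustrative assumptions,
# not the actual logic in task1.py.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("generic-profiling-sketch").getOrCreate()

# Hypothetical NYC Open Data file, loaded with every column as a string.
df = spark.read.csv("data/sample_dataset.tsv", sep="\t", header=True)

for col_name in df.columns:
    vals = df.select(F.col(col_name).alias("v")).where(F.col("v").isNotNull())
    non_null = vals.count()
    # Count how many values successfully cast to each candidate type.
    as_int = vals.where(F.col("v").cast("long").isNotNull()).count()
    as_real = vals.where(F.col("v").cast("double").isNotNull()).count()
    as_datetime = vals.where(F.to_timestamp("v").isNotNull()).count()
    print(col_name, {
        "non_null": non_null,
        "integer": as_int,
        "real": as_real - as_int,                       # numeric but not integral
        "date/time": as_datetime,
        "text": non_null - max(as_real, as_datetime),   # rough remainder
    })
```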

## Task 2: Semantic Profiling

- Navigate to `task2/src/`.
- Run:
  - `spark-submit --conf spark.pyspark.python=$PYSPARK_PYTHON task2.py`
- Use `task2_md.py` to run with the `en_core_web_md` NLP model; this model was used to produce the final results.
- This model is not available on DUMBO by default. Use the custom environment from `env`, or install the package with the following command:

```
python -m spacy download en_core_web_md
```
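
For context, here is a minimal sketch of how a spaCy model such as `en_core_web_md` can be used to guess a column's semantic type from a sample of its values. The sample values and the label handling are illustrative assumptions, not the actual rules in `task2.py`.

```python
# Minimal sketch of semantic labeling with spaCy's en_core_web_md model.
# The sample values and the entity-label handling are illustrative assumptions,
# not the actual implementation in task2.py.
from collections import Counter

import spacy

nlp = spacy.load("en_core_web_md")

# Hypothetical sample of values drawn from one column.
sample_values = ["Brooklyn", "Queens", "Manhattan", "Bronx", "Staten Island"]

counts = Counter()
for value in sample_values:
    doc = nlp(value)
    for ent in doc.ents:
        counts[ent.label_] += 1

# Take the most frequent named-entity label as the column's semantic type.
if counts:
    label, _ = counts.most_common(1)[0]
    print("predicted semantic type:", label)   # e.g. GPE for place names
else:
    print("no entities recognized; fall back to other heuristics")
```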