Create Upfront Index

Anil Shanbhag edited this page May 13, 2016 · 2 revisions

Fabric scripts are already configured to do this with 3 simple commands. Before you can use them, you need to complete some configuration.

  • First, run jps and make sure Hadoop, Spark, and ZooKeeper are up and running.
  • Go to scripts/fabfile/confs.py. Change the appropriate settings of local_ or create a new conf entry to match your development environment.
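
The jps check in the first step can be automated. The following is a minimal sketch, not part of the project's scripts. The `EXPECTED` process names are assumptions based on what jps typically reports for HDFS daemons, a standalone Spark cluster, and ZooKeeper; adjust them to match your deployment (for example, a YARN-based setup reports different names). The helper names `running_services` and `check_local` are hypothetical.

```python
import subprocess

# Assumed JVM process names as reported by jps; adjust to your deployment.
EXPECTED = {
    "Hadoop": {"NameNode", "DataNode"},
    "Spark": {"Master", "Worker"},
    "ZooKeeper": {"QuorumPeerMain"},
}


def running_services(jps_output):
    """Return {service: True/False} depending on whether any of the
    service's expected processes appears in the given jps output."""
    names = set()
    for line in jps_output.splitlines():
        parts = line.split()
        # Each jps line looks like "<pid> <MainClassName>".
        if len(parts) >= 2:
            names.add(parts[1])
    return {svc: bool(procs & names) for svc, procs in EXPECTED.items()}


def check_local():
    """Run jps on this machine and report which services look alive."""
    result = subprocess.run(
        ["jps"], capture_output=True, text=True, check=True
    )
    return running_services(result.stdout)
```

Calling `check_local()` on a node returns a dictionary such as `{"Hadoop": True, "Spark": False, "ZooKeeper": True}`, which makes it easy to see which service still needs to be started before running the fab tasks.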

Then run:

fab setup:<your conf entry> create_table_info bulk_sample_gen create_robust_tree write_partitions

Here is what each of the commands in the fab invocation does:

  • bulk_sample_gen runs on each of the machines and samples the input data files based on the sampling percentage specified in the conf.
  • create_robust_tree runs only on the master and creates an upfront partitioning tree based on the samples.
  • write_partitions runs on each of the machines, takes the index as input, and writes out the input data partitioned into HDFS.
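
To make the data flow between these three steps concrete, here is a small in-memory sketch. It is an illustration under stated assumptions, not the project's implementation: the page does not describe how the robust tree is actually built, so a plain median-split binary tree stands in for it; records are assumed to be numeric tuples; and a Python dictionary stands in for the partition files written to HDFS. All function names are hypothetical.

```python
import random


def sample_records(records, fraction, seed=0):
    """Keep each record with probability `fraction` (stand-in for sampling)."""
    rng = random.Random(seed)
    return [r for r in records if rng.random() < fraction]


def build_tree(samples, depth, dim=0):
    """Build a median-split tree from the samples.

    A node is (dim, cut, left, right); a leaf is None.
    This is a simplified stand-in, not the project's robust tree.
    """
    if depth == 0 or len(samples) < 2:
        return None
    values = sorted(r[dim] for r in samples)
    cut = values[len(values) // 2]
    left = [r for r in samples if r[dim] < cut]
    right = [r for r in samples if r[dim] >= cut]
    next_dim = (dim + 1) % len(samples[0])  # cycle through attributes
    return (
        dim,
        cut,
        build_tree(left, depth - 1, next_dim),
        build_tree(right, depth - 1, next_dim),
    )


def partition_id(tree, record):
    """Walk the tree and return a heap-style id for the leaf reached."""
    pid, node = 1, tree
    while node is not None:
        dim, cut, left, right = node
        if record[dim] < cut:
            pid, node = 2 * pid, left
        else:
            pid, node = 2 * pid + 1, right
    return pid


def assign_partitions(tree, records):
    """Group the full data set by partition id (dict stands in for HDFS)."""
    parts = {}
    for r in records:
        parts.setdefault(partition_id(tree, r), []).append(r)
    return parts
```

A typical run would call `sample_records` on each machine's data, pass the combined samples to `build_tree` once, and then call `assign_partitions` on each machine with the resulting tree. The point of the design is that the tree is built from a small sample but applied to the full data, so only the last step touches every record.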