GitHub - mganta/spark-parquet

Branches Tags

Name		Name	Last commit message	Last commit date
Latest commit History 12 Commits
.settings		.settings
hive		hive
input		input
src/main		src/main
.classpath		.classpath
.project		.project
README		README
pom.xml		pom.xml

Repository files navigation

-------
README 
-------
This is an example to learn spark and show how to read a csv & convert it into a parquet file from spark. This shows how to store Avro data in Parquet files. 
This example uses a cdh5-beta2 cluster


Here are the steps after you do a mvn build

1. Load the csv file into hdfs. A sample file is in the input folder

2. Run the spark program to convert it to parquet

   You can use spark-class but due to a bug in spark 0.9.0, you will get this error. Has been reported as a blocker (hopefully, will be fixed soon)
    Jar url 'hdfs:///namenode:8020/user/foo/<filename>.jar' is not a valid URL.
    Jar must be in URL format (e.g. hdfs://XX, file://XX)
  
  Meanwhile, From eclipse or mvn, you can directly invoke the class com.cloudera.sa.simple.SparkParquetCSVExample with the following parameters
    spark://spark-makster.mycluster.net:7077 \
    hdfs://namenode.mycluster.net:8020/user/foo/input/hotl*.csv \
    hdfs://namenode.mycluster.net:8020/user/foo/output
    
3. Use the sample create table script in hive folder to create an external table

4. Log into impala or hive or shark to run sql on the dataset

Sample csv files from http://www.window.state.tx.us/taxinfo/taxfiles.html
A big thanks to the blog at http://zenfractal.com/2013/08/21/a-powerful-big-data-trio/