apache-parquet-avro

Aim

Experiment with Apache Parquet and Apache Avro

Dependency

Using the Spark 2.1.0 binary, prebuilt with Hadoop 2.7

Outcome

Learnt how to use third-party / external Spark packages such as spark-csv and spark-avro (Parquet support is built in)

Steps

  1. Call pyspark with the Avro package

[root@chai chai]# pyspark --packages com.databricks:spark-avro_2.11:3.2.0

Without the Avro package it throws an error -

pyspark.sql.utils.AnalysisException: u'Failed to find data source: com.databricks.spark.avro. Please find an Avro package at http://spark.apache.org/third-party-projects.html;'

Download the package, unzip it, put it in the directory, and rename the folder to spark-avro_2.11:3.2.0.
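
The --packages flag also works with spark-submit for a standalone script instead of the interactive shell. A minimal sketch, assuming the same package coordinate and sample path (the script name read_avro.py is illustrative, not from this repo):

# read_avro.py -- run with:
#   spark-submit --packages com.databricks:spark-avro_2.11:3.2.0 read_avro.py
from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession; spark-submit has already resolved the Avro package.
spark = SparkSession.builder.appName("avro-demo").getOrCreate()

# Read the sample Avro file shipped with Spark and print a few rows.
df = spark.read.format("com.databricks.spark.avro").load("/usr/local/spark/examples/src/main/resources/users.avro")
df.show()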

  2. Reading an Avro file

>>> df1 = sqlContext.read.load("/usr/local/spark/examples/src/main/resources/users.avro","com.databricks.spark.avro")
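
Equivalently, the data source can be named with format() instead of a second positional argument - a sketch using the same shell's sqlContext:

>>> df1 = sqlContext.read.format("com.databricks.spark.avro").load("/usr/local/spark/examples/src/main/resources/users.avro")
>>> df1.show()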

  3. Reading a Parquet file

>>> df = sqlContext.read.load("/usr/local/spark/examples/src/main/resources/users.parquet")
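
Parquet is Spark's default data source (spark.sql.sources.default), which is why no format argument is needed above. The dedicated reader method is equivalent - a sketch in the same shell:

>>> df = sqlContext.read.parquet("/usr/local/spark/examples/src/main/resources/users.parquet")
>>> df.printSchema()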

  1. Storing as Parquet file
>>> df.select("name", "favorite_color").write.save("namesAndFavColors.parquet")
  1. Storing as Avro file
>>> df.select("name", "favorite_color").write.save("namesAndFavColors.avro","com.databricks.spark.avro")
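
Re-running the save calls above fails once the output path already exists, because the default save mode errors out. A sketch of the same writes with an explicit format and overwrite mode:

>>> df.select("name", "favorite_color").write.format("parquet").mode("overwrite").save("namesAndFavColors.parquet")
>>> df.select("name", "favorite_color").write.format("com.databricks.spark.avro").mode("overwrite").save("namesAndFavColors.avro")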

Resolved Error

 java.sql.SQLException: Unable to open a test connection to the given database. 
 JDBC url = jdbc:derby:;databaseName=metastore_db;create=true, username = APP. 
 Terminating connection pool (set lazyInit to true if you expect to start your database after your app). 

 Original Exception: 
 ------
java.sql.SQLException: 
Failed to start database 'metastore_db' with class loader org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@223bc09d, 
see the next exception for details.

The same exception contained the following nested exception -

Caused by: ERROR XSDB6: Another instance of Derby may have already booted the database /usr/local/spark/metastore_db.

All three of these exceptions were resolved thanks to

http://stackoverflow.com/questions/34465516/caused-by-error-xsdb6-another-instance-of-derby-may-have-already-booted-the-da

Solution -

$ ps -ef | grep spark-shell

$ kill -9 Spark-Shell-processID

Doubt -

[root@chai chai]# ps -ef | grep spark-shell
root     18382 18362  1 20:10 pts/1    00:00:48 /usr/lib/jvm/jdk1.8.0_121/bin/java -cp /usr/local/spark/conf/:/usr/local/spark/jars/* -Xmx1g org.apache.spark.deploy.SparkSubmit --name PySparkShell --packages com.databricks:spark-avro_2.11:3.2.0 pyspark-shell
root     20312 16067  0 21:08 pts/1    00:00:00 grep --color=auto spark-shell

The first process (PID 18382) was killed -

[root@chai chai]#kill -9 18382

Process 20312 could not be killed - kill reported that there is no such process. Why?

[root@chai chai]# kill -9 20312
bash: kill: (20312) - No such process

Most likely because 20312 was the grep --color=auto spark-shell process from the ps pipeline itself, which had already exited by the time kill was run.

Unresolved

1. Warning -

WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

Possible Solution -

2. Warning -

WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException

3. Error

SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.

Possible Solutions -

  1. https://www.slf4j.org/codes.html#StaticLoggerBinder
  2. https://teamtreehouse.com/community/i-got-an-error-slf4j-failed-to-load-class-orgslf4jimplstaticloggerbinder
  3. http://sparkjava.com/documentation.html#examples

Sources -

  1. To read/write Parquet files - http://www.kdnuggets.com/2015/10/ibm-seti-spark-sql-parquets.html

  2. Spark Packages - https://spark-packages.org/

  3. PySpark tutorial - https://github.com/mahmoudparsian/pyspark-tutorial/tree/master/howto

  4. Apache Parquet Documentation - https://parquet.apache.org/documentation/latest/

  5. SQL Programming - https://spark.apache.org/docs/1.6.1/sql-programming-guide.html

  6. Parquet + S3 - https://www.appsflyer.com/blog/the-bleeding-edge-spark-parquet-and-s3/

  7. https://www.svds.com/how-to-choose-a-data-format/

  8. https://groups.google.com/forum/#!topic/parquet-dev/lfWanFOc040

  9. http://apache-spark-user-list.1001560.n3.nabble.com/Write-to-Parquet-File-in-Python-td22186.html

  10. Parquet for Spark SQL - https://developer.ibm.com/hadoop/2015/12/03/parquet-for-spark-sql/

  11. Parquet - http://www.kdnuggets.com/?s=parquet

  12. Spark SQL Parquet - http://www.kdnuggets.com/2015/10/ibm-seti-spark-sql-parquets.html

  13. SparkSQL Real-time analytics - http://www.kdnuggets.com/2015/09/spark-sql-real-time-analytics.html

  14. http://spark.apache.org/docs/2.0.0/api/python/pyspark.sql.html

  15. Spark Dataframe Operations - https://www.analyticsvidhya.com/blog/2016/10/spark-dataframe-and-operations/

  16. Intro to RDDs - https://www.analyticsvidhya.com/blog/2016/09/comprehensive-introduction-to-apache-spark-rdds-dataframes-using-pyspark/

  17. PySpark Tutorial - https://www.dezyre.com/apache-spark-tutorial/pyspark-tutorial
