Spark SQL row-oriented indexed file format
Riff is a row-oriented file format for Spark SQL, designed for faster writes and point queries, and reasonable range queries, compared to Parquet. It is built to work natively with Spark SQL, eliminating row value conversions and supporting predicate pushdown and indexing. You can see benchmark results in this gist.
Spark version | riff latest version |
---|---|
2.0.x | 0.2.0 |
2.1.x | 0.2.0 |
The riff package can be added to Spark by using the `--packages` command line option. For example, run this to include it when starting `spark-shell` (Scala 2.11.x):
$SPARK_HOME/bin/spark-shell --packages sadikovi:riff:0.2.0-s_2.11
See the build instructions below to create a JAR for Scala 2.10.x.
Currently supported options are listed below. Use `--conf key=value` on the command line, as with other Spark configuration, add them to the `spark-defaults.conf` file, or set them in code with `spark.conf.set("option", "value")` (see the sketch after the table below).
Name | Description | Default |
---|---|---|
`spark.sql.riff.compression.codec` | Compression codec to use for riff (`none`, `snappy`, `gzip`, `deflate`) | `deflate` |
`spark.sql.riff.stripe.rows` | Number of rows to keep per stripe | `10000` |
`spark.sql.riff.column.filter.enabled` | When enabled, write column filters in addition to min/max/null statistics (`true`, `false`) | `true` |
`spark.sql.riff.buffer.size` | Buffer size in bytes for output/input streams | `256 * 1024` |
`spark.sql.riff.filterPushdown` | When enabled, propagate filters to the riff format; otherwise filter data in Spark only | `true` |
`spark.sql.riff.metadata.count.enabled` | When enabled, use metadata information for count queries; otherwise read table data | `true` |
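For example, a minimal sketch of setting a couple of these options in code before reading or writing Riff tables (the values here are arbitrary, not recommendations):

```scala
// Assumes an existing SparkSession named `spark` (e.g. in spark-shell).
// These settings apply to subsequent riff reads and writes in this session.
spark.conf.set("spark.sql.riff.compression.codec", "gzip")
spark.conf.set("spark.sql.riff.stripe.rows", "20000")
spark.conf.set("spark.sql.riff.filterPushdown", "true")
```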
The following options can be specified when writing a DataFrame by calling `df.write.option("key", "value").save(...)` (see the examples below).
Name | Description | Default |
---|---|---|
`index` | Optional setting to specify columns to index by Riff (comma-separated); if no columns are provided, the default row layout is used | `<empty string>` |
Supported Spark SQL types for index columns:
- `IntegerType`
- `LongType`
- `StringType`
- `DateType`
- `TimestampType`
- `BooleanType`
- `ShortType`
- `ByteType`
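For illustration, a minimal sketch (hypothetical column names and data) of writing a table whose index columns use types from the list above:

```scala
// Assumes an existing SparkSession named `spark` (e.g. in spark-shell).
import spark.implicits._

// `id` is LongType and `code` is StringType, both supported for indexing.
val events = Seq((1L, "a", true), (2L, "b", false)).toDF("id", "code", "flag")
events.write
  .format("com.github.sadikovi.spark.riff")
  .option("index", "id,code")
  .save("/path/to/table")
```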
Usage is very similar to other Spark datasources, e.g. Parquet, ORC, or JSON. Riff allows setting some datasource options in addition to the ones listed in the table above.
// Write a DataFrame in Riff format using the standard format specification
val df: DataFrame = ...
df.write.format("com.github.sadikovi.spark.riff").save("/path/to/table")
// Read a DataFrame from Riff format using the standard format specification
val df = spark.read.format("com.github.sadikovi.spark.riff").load("/path/to/table")
Alternatively, you can use shortcuts to write and read Riff files.
// You can also import implicit conversions to make it similar to Parquet read/write
import com.github.sadikovi.spark.riff._
val df: DataFrame = ...
df.write.riff("/path/to/table")
val df = spark.read.riff("/path/to/table")
// You can specify fields to index; Riff will create column filters for those fields and
// restructure records to optimize filtering by them. This is optional and specified on write;
// when reading data, Riff automatically uses those fields - no settings required.
val df: DataFrame = ...
df.write.option("index", "col1,col2,col3").riff("/path/to/table")
val df = spark.read.riff("/path/to/table").filter("col1 = 'abc'")
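Since `spark.sql.riff.metadata.count.enabled` is `true` by default, a plain count over a Riff table should be answered from file metadata rather than a full scan; a small sketch:

```scala
import com.github.sadikovi.spark.riff._

// With spark.sql.riff.metadata.count.enabled=true (the default), this count
// should be served from metadata instead of reading table data.
val cnt = spark.read.riff("/path/to/table").count()
println(s"rows: $cnt")
```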
The same works from Python:
# You can also specify index columns for the table
df.write.format("com.github.sadikovi.spark.riff").option("index", "col1,col2,col3").save("/path/to/table")
...
df = spark.read.format("com.github.sadikovi.spark.riff").load("/path/to/table")
You can also create a temporary view backed by Riff and query it with SQL:
CREATE TEMPORARY VIEW test
USING com.github.sadikovi.spark.riff
OPTIONS (path "/path/to/table");
SELECT * FROM test LIMIT 10;
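The same view can also be queried from Scala via `spark.sql`; a minimal sketch (the filter column `col1` is hypothetical):

```scala
// Filters issued through SQL should be pushed down to Riff when
// spark.sql.riff.filterPushdown is enabled (the default).
spark.sql("SELECT * FROM test WHERE col1 = 'abc'").show()
```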
This library is built using `sbt`. To build a JAR file, run `sbt package` from the project root. To build JARs for both Scala 2.10.x and 2.11.x, run `sbt +package`.
Run `sbt test` from the project root. See `.travis.yml` for the CI build matrix.
Run `sbt package` to package the project, then run `spark-submit` for the following benchmarks (the JAR path/name might differ from the examples below). All data created during benchmarks is stored in the `./temp` folder.
- Write benchmark
spark-submit --class com.github.sadikovi.benchmark.WriteBenchmark \
target/scala-2.11/riff_2.11-0.1.0-SNAPSHOT.jar
- Scan benchmark
spark-submit --class com.github.sadikovi.benchmark.ScanBenchmark \
target/scala-2.11/riff_2.11-0.1.0-SNAPSHOT.jar
- Query benchmark
spark-submit --class com.github.sadikovi.benchmark.QueryBenchmark \
target/scala-2.11/riff_2.11-0.1.0-SNAPSHOT.jar
- Project benchmark
spark-submit --class com.github.sadikovi.benchmark.ProjectBenchmark \
target/scala-2.11/riff_2.11-0.1.0-SNAPSHOT.jar