Sameer Agarwal edited this page Aug 18, 2013 · 6 revisions

BlinkDB is a large-scale data warehouse system built on Shark and Spark and designed to be compatible with Apache Hive. It can answer HiveQL queries up to 200-300 times faster than Hive by executing them on user-specified samples of the data and returning approximate answers augmented with meaningful error bars. BlinkDB 0.1.0 is an alpha developer release. It supports creating and deleting samples on any input table or materialized view, and executing approximate HiveQL queries that use aggregates whose error has a statistical closed form (i.e., AVG, SUM, COUNT, VAR and STDEV).
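To make "approximate answers with error bars" concrete, here is a minimal sketch of a closed-form error bar for AVG computed from a uniform random sample. It is not BlinkDB's implementation: the function name `approx_avg`, the 95% confidence level (`z=1.96`), and the synthetic data are assumptions for illustration, and the interval is the textbook normal approximation (sample standard deviation divided by the square root of the sample size), ignoring any finite-population correction.

```python
import math
import random
import statistics


def approx_avg(sample, z=1.96):
    """Estimate the mean of a column from a uniform random sample.

    Returns (estimate, half_width), where half_width is the half-width
    of a normal-approximation confidence interval (z=1.96 is about 95%).
    """
    n = len(sample)
    if n < 2:
        raise ValueError("need at least two sampled rows")
    mean = statistics.fmean(sample)
    # Standard error of the mean: sample standard deviation / sqrt(n).
    std_err = statistics.stdev(sample) / math.sqrt(n)
    return mean, z * std_err


if __name__ == "__main__":
    random.seed(0)
    # Hypothetical "full table" column and a 5% uniform sample of it.
    table = [random.gauss(100.0, 15.0) for _ in range(100_000)]
    sample = random.sample(table, 5_000)
    est, err = approx_avg(sample)
    print(f"AVG ~= {est:.2f} +/- {err:.2f} (95% confidence)")
```

The error bar shrinks in proportion to the square root of the sample size, which is why a small sample can often give a usable answer far sooner than a full scan.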

Developer Alpha 0.1.0 Highlights

  • Create/Delete Samples on Native Table and/or Materialized View
  • Approximate Answers with Error Bars for closed-form aggregates such as AVG, SUM and COUNT
  • Complete support for GROUP BYs and FILTERs
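
As an illustration of how a filtered, grouped COUNT can be estimated from a sample, the sketch below scales per-group hit counts from a uniform random sample up to the full table size and attaches a closed-form error bar to each group. This is an assumption-laden example rather than BlinkDB's actual code: `approx_group_count`, its parameters, and the row layout are hypothetical, and the interval uses the standard binomial-proportion approximation.

```python
import math
from collections import Counter


def approx_group_count(sample_rows, total_rows, key,
                       predicate=lambda row: True, z=1.96):
    """Estimate COUNT(*) ... WHERE predicate GROUP BY key from a sample.

    sample_rows: rows drawn uniformly at random from the full table
    total_rows:  number of rows in the full table
    Returns {group: (estimated_count, half_width)}.
    """
    n = len(sample_rows)
    if n == 0:
        raise ValueError("sample is empty")
    # Count sampled rows that pass the filter, per group.
    hits = Counter(key(row) for row in sample_rows if predicate(row))
    result = {}
    for group, k in hits.items():
        p_hat = k / n  # fraction of sampled rows in this group
        estimate = total_rows * p_hat
        half_width = z * total_rows * math.sqrt(p_hat * (1.0 - p_hat) / n)
        result[group] = (estimate, half_width)
    return result


if __name__ == "__main__":
    # Hypothetical sample of 4 rows from a 400-row table.
    sample = [
        {"city": "NY", "secs": 30},
        {"city": "NY", "secs": 45},
        {"city": "SF", "secs": 12},
        {"city": "LA", "secs": 3},
    ]
    counts = approx_group_count(sample, 400, key=lambda r: r["city"],
                                predicate=lambda r: r["secs"] > 10)
    for city, (est, err) in sorted(counts.items()):
        print(f"{city}: COUNT ~= {est:.0f} +/- {err:.0f}")
```

Groups that never appear in the sample are absent from the result, which is one reason rare groups are hard to answer from a purely uniform sample.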

Requirements

  • Scala 2.9.3
  • Spark 0.8.x
  • OpenJDK 7 or Oracle HotSpot JDK 7 or Oracle HotSpot JDK 6u23+

Setup Instructions

Because BlinkDB is built on Shark and Spark, it shares a large portion of its setup instructions with these two codebases. Below are the specific instructions for running BlinkDB locally, on a cluster, or on EC2. If you run into any problems, please open a GitHub issue or send us an email at sameerag [AT] cs.berkeley.edu.

Running BlinkDB Locally: Get BlinkDB up and running on a single node for a quick spin in about 5 minutes.

Running BlinkDB on a Cluster: Get BlinkDB up and running on your own cluster.

BlinkDB User Guide

BlinkDB User Guide: An introduction to running BlinkDB and its API.

Acknowledgements

BlinkDB is being developed in the UC Berkeley AMP Lab. This research and development is supported in part by NSF CISE Expeditions award CCF-1139158 and DARPA XData Award FA8750-12-2-0331, and gifts from Amazon Web Services, Google, SAP, Blue Goji, Cisco, Clearstory Data, Cloudera, Ericsson, Facebook, General Electric, Hortonworks, Huawei, Intel, Microsoft, NetApp, Oracle, Quanta, Samsung, Splunk, VMware and Yahoo!. Sameer Agarwal is supported by the Qualcomm Innovation Fellowship during 2012-13 and the Facebook Graduate Fellowship during 2013-14.

This wiki is closely modeled on the Shark Wiki.

Related Projects

Shark: Hive on Spark.

Spark: The in-memory cluster computing framework that powers Shark.

Apache Hive: Apache Hive data warehouse system.

Apache Mesos: A cluster manager that provides efficient resource isolation and sharing across distributed applications.