http://www.eecs.berkeley.edu/Pubs/TechRpts/2012/EECS-2012-214.pdf

Shark: SQL and Rich Analytics at Scale

ABSTRACT

Shark is a new data analysis system that marries query processing with complex analytics on large clusters. It leverages a novel distributed memory abstraction to provide a unified engine that can run SQL queries and sophisticated analytics functions (e.g., iterative machine learning) at scale, and efficiently recovers from failures mid-query. This allows Shark to run SQL queries up to 100× faster than Apache Hive, and machine learning programs up to 100× faster than Hadoop. Unlike previous systems, Shark shows that it is possible to achieve these speedups while retaining a MapReduce-like execution engine, and the fine-grained fault tolerance properties that such engines provide. It extends such an engine in several ways, including column-oriented in-memory storage and dynamic mid-query replanning, to effectively execute SQL. The result is a system that matches the speedups reported for MPP analytic databases over MapReduce, while offering fault tolerance properties and complex analytics capabilities that they lack.

1 Introduction

Modern data analysis faces a confluence of growing challenges. First, data volumes are expanding dramatically, creating the need to scale out across clusters of hundreds of commodity machines. Second, this new scale increases the incidence of faults and stragglers (slow tasks), complicating parallel database design. Third, the complexity of data analysis has also grown: modern data analysis employs sophisticated statistical methods, such as machine learning algorithms, that go well beyond the roll-up and drill-down capabilities of traditional enterprise data warehouse systems. Finally, despite these increases in scale and complexity, users still expect to be able to query data at interactive speeds.

To tackle the “big data” problem, two major lines of systems have recently been explored. The first, composed of MapReduce [13] and various generalizations [17, 9], offers a fine-grained fault tolerance model suitable for large clusters, where tasks on failed or slow nodes can be deterministically re-executed on other nodes. MapReduce is also fairly general: it has been shown to be able to express many statistical and learning algorithms [11]. It also easily supports unstructured data and “schema-on-read.” However, MapReduce engines lack many of the features that make databases efficient, and have high latencies of tens of seconds to hours. Even systems that have significantly optimized MapReduce for SQL queries, such as Google’s Tenzing [9], or that combine it with a traditional database on each node, such as HadoopDB [3], report a minimum latency of 10 seconds. As such, MapReduce approaches have largely been dismissed for interactive-speed queries [25], and even Google is developing new engines for such workloads [24].

Instead, most MPP analytic databases (e.g., Vertica, Greenplum, Teradata) and several of the new low-latency engines proposed for MapReduce environments (e.g., Google Dremel [24], Cloudera Impala [1]) employ a coarser-grained recovery model, where an entire query has to be resubmitted if a machine fails. This works well for short queries where a retry is inexpensive, but faces significant challenges in long queries as clusters scale up [3]. In addition, these systems often lack the rich analytics functions that are easy to implement in MapReduce, such as machine learning and graph algorithms. Furthermore, while it may be possible to implement some of these functions using UDFs, these algorithms are often expensive, furthering the need for fault and straggler recovery for long queries. Thus, most organizations tend to use other systems alongside MPP databases to perform complex analytics.

To provide an effective environment for big data analysis, we believe that processing systems will need to support both SQL and complex analytics efficiently, and to provide fine-grained fault recovery across both types of operations. This paper describes a new system that meets these goals, called Shark. Shark is open source and compatible with Apache Hive, and has already been used at web companies to speed up queries by 40–100×.
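
To make the Hive compatibility concrete, the sketch below runs HiveQL through Shark's Scala entry point. It is a minimal sketch, not code from the paper: the `SharkEnv`/`SharkContext`/`sql2rdd` names follow the Shark project's public README of that era, the `"shark.cache"` table property follows the Shark documentation, and all table and column names are hypothetical.

```scala
// Minimal sketch of Shark's Hive compatibility, assuming the SharkContext
// API from the Shark project's README of that era (SharkEnv, sql2rdd);
// table and column names are hypothetical.
import shark.{SharkContext, SharkEnv}

object SharkQuerySketch {
  def main(args: Array[String]): Unit = {
    val sc: SharkContext = SharkEnv.initWithSharkContext("SharkQuerySketch")

    // Unmodified HiveQL against existing Hive tables; the "shark.cache"
    // property (per the Shark docs) materializes the result in memory.
    sc.sql("""CREATE TABLE logs_cached TBLPROPERTIES ("shark.cache" = "true")
              AS SELECT * FROM logs WHERE dt = '2012-11-01'""")

    // Query results come back as an RDD, so SQL output can feed directly
    // into machine learning code running on the same engine.
    val hits = sc.sql2rdd("SELECT page, COUNT(*) FROM logs_cached GROUP BY page")
    println(hits.count())
  }
}
```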

Shark builds on a recently-proposed distributed shared memory abstraction called Resilient Distributed Datasets (RDDs) [33] to perform most computations in memory while offering fine-grained fault tolerance. In-memory computing is increasingly important in large-scale analytics for two reasons. First, many complex analytics functions, such as machine learning and graph algorithms, are iterative, going over the data multiple times; thus, the fastest systems deployed for these applications are in-memory [23, 22, 33]. Second, even traditional SQL warehouse workloads exhibit strong temporal and spatial locality, because more-recent fact table data and small dimension tables are read disproportionately often. A study of Facebook’s Hive warehouse and Microsoft’s Bing analytics cluster showed that over 95% of queries in both systems could be served out of memory using just 64 GB/node as a cache, even though each system manages more than 100 PB of total data [5].
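
To illustrate why an in-memory abstraction with fine-grained fault tolerance fits these workloads, here is a minimal sketch in plain Spark (the engine Shark builds on): the data is parsed and cached once, then an iterative computation scans it repeatedly from memory, and a lost partition would be recomputed from lineage rather than by restarting the whole job. The file path and the toy update rule are placeholders, not code from the paper.

```scala
// Minimal sketch of the RDD pattern the paper builds on: cache a dataset
// in memory, then make many passes over it, as iterative algorithms like
// logistic regression do. File path and update rule are placeholders.
import org.apache.spark.{SparkConf, SparkContext}

object RddIterationSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("RddIterationSketch"))

    // Parse once, keep the working set in memory across iterations.
    // If a node is lost, only its partitions are recomputed from lineage.
    val points = sc.textFile("hdfs://namenode:8020/data/points.txt")
      .map(_.split(" ").map(_.toDouble))
      .cache()

    var w = 0.0
    for (_ <- 1 to 10) {
      // Each iteration is a full scan, served from memory after the first.
      val grad = points.map(p => p(0) * (p(1) - w)).reduce(_ + _)
      w += 0.01 * grad
    }
    println(s"final w = $w")

    sc.stop()
  }
}
```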