Schedoscope

Introduction

Schedoscope is a scheduling framework for pain-free agile development, testing, (re)loading, and monitoring of your datahub, lake, or whatever you choose to call your Hadoop data warehouse these days.

With Schedoscope,

  • you never have to create DDL and schema migration scripts;
  • you do not have to manually determine which data must be deleted and recomputed in the face of retroactive changes to logic or data structures;
  • you specify Hive table structures (called "views"), partitioning schemes, storage formats, dependent views, as well as transformation logic in a concise Scala DSL;
  • you have a wide range of options for expressing data transformations - from file operations and MapReduce jobs to Pig scripts, Hive queries, and Oozie workflows;
  • you benefit from Scala's static type system and your IDE's code completion to make fewer typos that hit you late during deployment or runtime;
  • you can easily write unit tests for your transformation logic and run them quickly right out of your IDE;
  • you schedule jobs by expressing the views you need: Schedoscope ensures that all required dependencies (and only those) are computed as well;
  • you achieve a higher utilization of your YARN cluster's resources because job launchers are not themselves YARN applications that consume cluster capacity.
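
The idea of scheduling by naming the views you need can be pictured with a small, self-contained toy model. This is only a sketch of the principle and not the Schedoscope API: `ViewSketch`, `View`, `materializationOrder`, and all view names are hypothetical, and the dependency graph is assumed to be acyclic.

```scala
// Toy model only: nothing here is part of the Schedoscope API.
object ViewSketch {
  // A view is identified by its name and lists the views it depends on.
  final case class View(name: String, dependsOn: List[View] = Nil)

  // Names of the requested view and its transitive dependencies,
  // dependencies first, each name at most once.
  // Assumes the dependency graph has no cycles.
  def materializationOrder(target: View): List[String] = {
    def visit(done: List[String], v: View): List[String] =
      if (done.contains(v.name)) done
      else v.dependsOn.foldLeft(done)(visit) :+ v.name
    visit(Nil, target)
  }

  // Hypothetical example views.
  val nodes = View("nodes")
  val ways = View("ways") // not needed by shopProfiles
  val restaurants = View("restaurants", List(nodes))
  val shops = View("shops", List(nodes))
  val shopProfiles = View("shop_profiles", List(shops, restaurants))
}
```

Requesting `ViewSketch.materializationOrder(ViewSketch.shopProfiles)` lists `nodes` once and before the views that need it, and leaves out `ways` because nothing requested depends on it.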

Getting Started

Get a glance at

Follow the Open Street Map tutorial to install, compile, and run Schedoscope in a standard Hadoop distribution image within minutes:

Take a look at the View DSL Primer to get more information about the capabilities of the Schedoscope DSL:

More documentation can be found here:

When is Schedoscope not for you?

Schedoscope is based on the following assumptions:

  • data are largely relational and meaningfully representable as Hive tables;
  • there is enough cluster time and capacity to actually allow for retroactive recomputation of data;
  • it is acceptable to compile table structures, dependencies, and transformation logic into what is effectively a project-specific scheduler.

Should any of those assumptions not hold in your context, you should probably look for a different scheduler.

Origins

Schedoscope was conceived at the Business Intelligence department of Otto Group.

Contributions

The following people have contributed to the various parts of Schedoscope so far:

Utz Westermann (maintainer), Hans-Peter Zorn, Dominik Benz, Annika Leveringhaus

We would love to get contributions from you as well. We haven't got a formalized submission process yet. If you have an idea for a contribution or even coded one already, get in touch with Utz or just send us your pull request. We will work it out from there.

Please help make Schedoscope better!

News

01/22/2016 - Release 0.3.5

We have released Version 0.3.5 as a Maven artifact to our Bintray repository (see Setting Up A Schedoscope Project for an example pom).

This release migrates Schedoscope's Hadoop dependencies to CDH-5.5.1. Furthermore, the test framework has been ported to Hive 1.1.0. Finally, Schedoscope's resilience against Metastore failures has been improved: it can now reconnect and resume work in more of the error cases in which the Metastore becomes unavailable.

11/21/2015 - Release 0.3.4

We have released Version 0.3.4 as a Maven artifact to our Bintray repository (see Setting Up A Schedoscope Project for an example pom).

This release fixes a bug in Schedoscope which led to not correctly instantiating ViewActors for newly appearing dependencies such as date changes. Moreover, checksum versioning code has been cleaned up. Note that checksumming is not backwards compatible; you might want to execute your next materializations with the -m RESET_TRANSFORMATION_CHECKSUMS option.
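
As a rough illustration of what checksum-based versioning of transformation logic means in general, the sketch below hashes a transformation's text and compares it with a previously stored value. It is not Schedoscope's implementation: `ChecksumSketch`, `checksum`, `needsRecomputation`, and the choice of MD5 are assumptions made for this example.

```scala
import java.nio.charset.StandardCharsets
import java.security.MessageDigest

// Toy model only: nothing here is part of the Schedoscope API.
object ChecksumSketch {
  // Hex-encoded MD5 digest of the transformation text (MD5 is an assumption).
  def checksum(transformation: String): String =
    MessageDigest.getInstance("MD5")
      .digest(transformation.getBytes(StandardCharsets.UTF_8))
      .map(b => "%02x".format(b & 0xff))
      .mkString

  // Recompute when no checksum was stored or when the stored one
  // no longer matches the current transformation text.
  def needsRecomputation(stored: Option[String], transformation: String): Boolean =
    !stored.contains(checksum(transformation))
}
```

In this model, changing the way a checksum is computed makes every stored checksum look stale, which gives an intuition for why a change to the checksumming code is not backwards compatible.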

11/13/2015 - Release 0.3.3

We have released Version 0.3.3 as a Maven artifact to our Bintray repository (see Setting Up A Schedoscope Project for an example pom).

This release brings some order to the logging framework mess inherited from the various libraries used. It does so by routing Java util logging and Apache commons logging through SLF4J and SLF4J to logback. By muting log4j and setting an appropriate logback-test.xml, test outputs are now a lot less chatty.

11/10/2015 - Release 0.3.2

We have released Version 0.3.2 as a Maven artifact to our Bintray repository (see Setting Up A Schedoscope Project for an example pom).

This release fixes a nasty resource leak in the Touch FileSystemTransformation.

11/09/2015 - Release 0.3.1

We have released Version 0.3.1 as a Maven artifact to our Bintray repository (see Setting Up A Schedoscope Project for an example pom).

Fields can now be given comments as well: val id = fieldOf[String]("An ID.")

11/06/2015 - Release 0.3.0

We have released Version 0.3.0 as a Maven artifact to our Bintray repository (see Setting Up A Schedoscope Project for an example pom).

This is a big release, with the following major changes:

  • Migration to Scala 2.11 and Akka 2.3.14
  • Support of Hive 1.1.0 in test framework
  • Significant code cleanup
  • Significant round of Scaladoc documentation
  • Significant performance improvements when dealing with many views / partitions

Please note that the cleanup incurred some breaking of the API. In particular, the storage format classes have been moved to a separate package org.schedoscope.dsl.storageformats. Moreover, the various path builders for views have been renamed in a more systematic way. See Storage Paths.

Community / Forums

Build Status


License

Licensed under the Apache License 2.0