
Developing With EclairJS

Doron Rosenberg edited this page Aug 10, 2016 · 8 revisions

EclairJS Node provides the Apache Spark API with some minor differences. As it's a work in progress, please check our API docs to see which APIs are implemented.

Let's look at the following example of a simple Apache Spark program, where we parallelize an array of numbers and then collect them.

var eclairjs = require('eclairjs');

var sc = new eclairjs.SparkContext("local[*]", "Simple Spark Program");

var rdd = sc.parallelize([1.10, 2.2, 3.3, 4.4]);

rdd.collect().then(function(results) {
  console.log("results: ", results);
  sc.stop();
}).catch(function(err) {
  console.error(err);
  sc.stop();
});

The eclairjs object provides the Apache Spark API (such as SparkContext).

Method Calls

A method that returns a single Spark object returns its result immediately. For example, parallelize returns the RDD object directly.

var rdd = sc.parallelize([1.10, 2.2, 3.3, 4.4]);

A method that returns an array of Spark objects (such as RDD.randomSplit) or a native JavaScript object (such as collect) returns a Promise instead.

rdd.collect().then(function(results) {
   ...
}).catch(function(err) {
   ...
});

This is done because these calls can take a while to produce a result, and blocking Node.js's event loop while waiting would stall the rest of the application.
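The two return styles can be illustrated without a Spark backend. The following is a minimal sketch that uses a stand-in object, FakeRDD, which is hypothetical and not part of EclairJS. It only mimics the pattern described above: one method hands back an object right away, the other hands back a Promise for a native JavaScript value.

```javascript
// Stand-in for an RDD. This is NOT EclairJS, just a hypothetical stub that
// mimics the two return styles so the snippet runs without Spark.
function FakeRDD(values) {
  this.values = values;
}

// Returns another "Spark-like" object immediately, as parallelize does.
FakeRDD.prototype.map = function (fn) {
  return new FakeRDD(this.values.map(fn));
};

// Produces a native JavaScript array, so the result is wrapped in a Promise.
FakeRDD.prototype.collect = function () {
  var values = this.values;
  return new Promise(function (resolve) {
    // Resolve on a later tick to imitate a slow remote call.
    setImmediate(function () {
      resolve(values.slice());
    });
  });
};

var rdd = new FakeRDD([1, 2, 3]);

// Usable right away: no Promise involved.
var doubled = rdd.map(function (x) { return x * 2; });

// A Promise: the array is only available inside then().
var collected = doubled.collect();

collected.then(function (results) {
  console.log("results: ", results);
}).catch(function (err) {
  console.error(err);
});
```

With real EclairJS objects the calling code would look the same. Only the stub would be replaced by objects created through the eclairjs module.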

Stopping the SparkContext

It is important to stop the SparkContext when your application is done. The actual Apache Spark program runs remotely, and EclairJS will not stop the SparkContext if your Node application exits. The same applies to the StreamingContext.
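One way to make sure the context is stopped on every path is to route the work through a small wrapper. The sketch below is only an illustration: runAndStop is a hypothetical helper, not an EclairJS API, and fakeSc is a stub that stands in for a real SparkContext so the snippet runs without Spark. It assumes only that the context exposes a stop() method, as in the example at the top of this page.

```javascript
// Hypothetical helper (not part of EclairJS): runs work(sc) and calls
// sc.stop() whether the work succeeds or fails.
function runAndStop(sc, work) {
  return Promise.resolve()
    .then(function () {
      return work(sc);
    })
    .then(function (result) {
      sc.stop();
      return result;
    }, function (err) {
      sc.stop();
      throw err; // keep the failure visible to the caller
    });
}

// Stub context that counts stop() calls, standing in for a SparkContext.
var fakeSc = {
  stopped: 0,
  stop: function () {
    this.stopped += 1;
  }
};

// Successful job: the result is passed through after stop() is called.
var okRun = runAndStop(fakeSc, function () {
  return Promise.resolve([1, 2, 3]);
});

// Failing job: stop() is still called before the error reaches catch().
var failedRun = runAndStop(fakeSc, function () {
  return Promise.reject(new Error("job failed"));
}).catch(function (err) {
  return err.message;
});
```

In a real program, sc would be the SparkContext created with new eclairjs.SparkContext(...), and work would return the Promise from a call such as collect.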

Lambda Functions

We have a separate page about Lambda functions.

Error Handling and Debugging

We have a separate page devoted to debugging.
