Developing With EclairJS
EclairJS Node provides the Apache Spark API with some minor differences. As it's a work in progress, please check our API docs to see which APIs are implemented.
Let's look at the following example of a simple Apache Spark program, where we parallelize an array of numbers and then collect them.
var eclairjs = require('eclairjs');
var sc = new eclairjs.SparkContext("local[*]", "Simple Spark Program");
var rdd = sc.parallelize([1.10, 2.2, 3.3, 4.4]);
rdd.collect().then(function(results) {
  console.log("results: ", results);
  sc.stop();
}).catch(function(err) {
  console.error(err);
  sc.stop();
});
The eclairjs object provides the Apache Spark API (such as SparkContext).
Methods that return a single Spark object return their result immediately. For example, parallelize returns the RDD object directly.
var rdd = sc.parallelize([1.10, 2.2, 3.3, 4.4]);
Methods that return an array of Spark objects (such as RDD.randomSplit) or a native JavaScript object (such as collect) return a Promise instead.
rdd.collect().then(function(results) {
...
}).catch(function(err) {
...
});
This is done because these calls can take a while to return a result, and blocking on them would stall the Node.js event loop.
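Because these calls return Promises, several of them can be awaited together without blocking. The following is a minimal sketch of that idea, not part of EclairJS: collectAll and fakeRdd are hypothetical names, and the sketch assumes only that each object has a collect() method returning a Promise, as described above. The stand-in object exists so the sketch runs without a Spark backend.

```javascript
// Hypothetical helper (not an EclairJS API): collects several RDD-like
// objects concurrently. Assumes each object exposes collect() and that
// collect() returns a Promise.
function collectAll(rdds) {
  return Promise.all(rdds.map(function(rdd) {
    return rdd.collect();
  }));
}

// Stand-in for an RDD, used only so this sketch runs without Spark.
function fakeRdd(values) {
  return {
    collect: function() {
      // Resolve with a copy so callers cannot mutate the source array.
      return Promise.resolve(values.slice());
    }
  };
}

collectAll([fakeRdd([1.1, 2.2]), fakeRdd([3.3, 4.4])]).then(function(results) {
  // results holds one array per input, in the same order as the inputs.
  console.log("results: ", results);
}).catch(function(err) {
  console.error(err);
});
```

With real RDDs created by sc.parallelize, the same helper should apply unchanged, since it relies only on collect() returning a Promise.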
It is important to stop the SparkContext when your application is done, because the actual Apache Spark program runs remotely and EclairJS will not stop the SparkContext if your Node application exits. The same applies to the StreamingContext.
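One way to make sure the context is always stopped is to wrap the work in a small helper. This is a sketch under stated assumptions, not an EclairJS feature: withContext and demoContext are hypothetical names, and the helper assumes only that the context object has a stop() method. It calls stop() whether the job succeeds or fails.

```javascript
// Hypothetical helper (not an EclairJS API): runs job(sc) and always calls
// sc.stop() afterwards, on success and on failure alike.
function withContext(sc, job) {
  return Promise.resolve()
    .then(function() { return job(sc); })
    // finally() passes the job's result or error through after stop() runs.
    .finally(function() { return sc.stop(); });
}

// Stand-in context, used only so this sketch runs without Spark.
var demoContext = {
  stop: function() { console.log("context stopped"); }
};

withContext(demoContext, function(sc) {
  // With a real SparkContext this might be:
  //   return sc.parallelize([1.10, 2.2, 3.3, 4.4]).collect();
  return Promise.resolve([1.10, 2.2, 3.3, 4.4]);
}).then(function(results) {
  console.log("results: ", results);
}).catch(function(err) {
  console.error(err);
});
```

Keeping the stop() call in one place avoids repeating it in both the then and catch branches, as the first example does.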
We have a separate page about Lambda functions.
We have a separate page devoted to debugging.