
Let's PySpark the Universe to unveil cosmic mysteries!

See each sub-wiki page on the right.

The pages describe:

[1] how to build a stand-alone Spark cluster, using an OS X desktop as the master with Linux slaves (see the first sketch after this list);

[2] a larger Spark/Hadoop cluster at our institute. It turns out I do not use this much, since Google Cloud Platform is a better fit for large calculations beyond my 4-node cluster. Yes, paying money makes our lives easier;

[3] how to set up and run Spark code on Google Cloud Platform. GCP is great for heavy Spark users, except for its price (though AWS will charge you even more). In my experience, building your own Spark/Hadoop cluster is much cheaper than paying for cloud services, but the capability (and scalability) of custom hardware is inevitably limited. Hence, even if you have your own cluster, GCP is still worth using for many reasons (see the second sketch below).
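
For item [1], here is a minimal PySpark sketch (not taken from the sub-pages) of how a driver might connect to such a stand-alone cluster. The master hostname `macdesktop.local` is a placeholder; 7077 is the default port for a stand-alone Spark master.

```python
from pyspark.sql import SparkSession

# Point the driver at the stand-alone master running on the OS X desktop.
# "macdesktop.local" is a hypothetical hostname; replace it with your own.
spark = (SparkSession.builder
         .master("spark://macdesktop.local:7077")
         .appName("pyspark-universe")
         .getOrCreate())

# Trivial job to confirm that the Linux slaves pick up work.
print(spark.range(1000000).count())

spark.stop()
```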
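
For item [3], a common way to run Spark code on GCP is a Dataproc cluster. Below is a minimal sketch of a PySpark job and, in the comments, how it could be submitted with the `gcloud` CLI; the cluster name, region, and `gs://` path are all placeholders.

```python
# count_catalog.py -- a minimal PySpark job for Google Cloud Dataproc.
# One way to submit it (cluster name and region are placeholders):
#   gcloud dataproc jobs submit pyspark count_catalog.py \
#       --cluster=my-cluster --region=us-central1
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("count-catalog").getOrCreate()

# Dataproc clusters can read Cloud Storage directly via gs:// URIs;
# the bucket and file below are hypothetical.
df = spark.read.csv("gs://my-bucket/catalog.csv", header=True, inferSchema=True)
print(df.count())

spark.stop()
```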