GitHub - shongscience/PySpark-the-Universe: By utilizing python APIs and PySpark, we measure graph statistics from various cosmological data sets.

1. Why PySpark for Astronomy?

Apache Spark has many wonderful features that we (python programmers?! astronomers?!) even do not have to know during our life time.

One little feature of python UDF is quite enough why we need to learn PySpark, since we can apply our hand-made python function (User Defined Function; UDF) to data in parallel ways by utilizing hundreds or thousands of cpu cores simultaneously.

I was amused by using the multi-thread capability of python. I could run 4 or 6 threads in parallel, which excelled my computation! But, now, my little 4 nodes' Spark Cluster can run 48 threads simultaneously for the same python code (python UDF). Yeh!

Surely, this is only one of what PySpark can do. My major task is to measure graph statistics from big astronomical data sets. Currently, I am dealing with 18 million galaxies. For even a small linking length of 0.8 Mpc, my friends-of-friends network calculations already use up 80 Gb memory. Yes, this cannot be done without Spark Cluster and I need hundreds of Gb memory to complete my analyses.
Spark can handle even 1000Gb memory in calculations easily?! Another Yeh!

I am a pro-bigdata astronomer, strongly suggesting that we need to adopt the python/PySpark platform more actively in astronomical field.

Try, then you will love it!

For details, check out the wiki

Name		Name	Last commit message	Last commit date
Latest commit History 12 Commits
JupyterNotebooks		JupyterNotebooks
misc		misc
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

1. Why PySpark for Astronomy?

About

Releases

Packages

Languages

shongscience/PySpark-the-Universe

Folders and files

Latest commit

History

Repository files navigation

1. Why PySpark for Astronomy?

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages