
Using Python APIs and PySpark, we measure graph statistics from various cosmological data sets.


shongscience/PySpark-the-Universe


1. Why PySpark for Astronomy?

Apache Spark has many wonderful features that we (Python programmers?! astronomers?!) may never even need to learn about in our lifetime.

One little feature, the Python UDF (User Defined Function), is reason enough to learn PySpark: it lets us apply our own hand-made Python functions to data in parallel, utilizing hundreds or thousands of CPU cores simultaneously.
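As a sketch of the idea, here is an ordinary hand-made Python function, with the few PySpark lines that would register it as a UDF shown in comments (they need a running SparkSession). The function, column names, and DataFrame here are hypothetical illustrations, not code from this repository.

```python
import math

# A hand-made Python function: angular separation (in degrees) between
# two sky positions -- purely illustrative, not a function from this repo.
def angular_separation(ra1, dec1, ra2, dec2):
    r1, d1, r2, d2 = (math.radians(v) for v in (ra1, dec1, ra2, dec2))
    # Spherical law of cosines; adequate for non-tiny separations.
    cos_sep = (math.sin(d1) * math.sin(d2)
               + math.cos(d1) * math.cos(d2) * math.cos(r1 - r2))
    # Clamp against floating-point drift before acos.
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_sep))))

# With PySpark, the same function becomes a column-wise UDF that the
# cluster applies to every row in parallel (sketch; needs a SparkSession
# and a DataFrame `df` with these hypothetical columns):
#
#   from pyspark.sql.functions import udf
#   from pyspark.sql.types import DoubleType
#
#   sep_udf = udf(angular_separation, DoubleType())
#   df = df.withColumn("sep_deg",
#                      sep_udf("ra1", "dec1", "ra2", "dec2"))
```

The point is that the function body stays plain Python; Spark handles the distribution.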

I was amused by Python's multi-threading capability: running 4 or 6 threads in parallel really sped up my computations! But now my little 4-node Spark cluster can run 48 threads simultaneously for the same Python code (a Python UDF). Yeh!

Surely, this is only one of the things PySpark can do. My major task is measuring graph statistics from big astronomical data sets; currently I am dealing with 18 million galaxies. Even for a small linking length of 0.8 Mpc, my friends-of-friends network calculations already use up 80 GB of memory. Yes, this cannot be done without a Spark cluster, and I need hundreds of GB of memory to complete my analyses.
Spark can easily handle even 1000 GB of memory in a calculation?! Another Yeh!
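The friends-of-friends idea itself is simple: link every pair of galaxies closer than the linking length, then take the connected components of that graph as groups. A minimal single-machine sketch using union-find (with a toy O(N²) pair search; the data and function name are hypothetical, and scaling this beyond one node's memory is exactly where Spark comes in):

```python
import math

def fof_groups(points, linking_length):
    """Friends-of-friends grouping: connected components of the
    'separation <= linking_length' graph, via union-find."""
    n = len(points)
    parent = list(range(n))

    def find(i):
        # Find the root of i's group, with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[rj] = ri

    # Toy all-pairs search; real codes use spatial indexing instead.
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(points[i], points[j]) <= linking_length:
                union(i, j)

    # Collect members of each root into a group.
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())
```

For example, with a linking length of 0.8, the points `(0,0,0)`, `(0.5,0,0)`, and `(1.0,0,0)` end up in one group: the first and last are 1.0 apart, but they are linked through their mutual friend in the middle.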

I am a pro-big-data astronomer, and I strongly suggest that we adopt the Python/PySpark platform more actively in the astronomical field.

Try it, then you will love it!

For details, check out the wiki
