PySpark Tutorial

Description

Spark is a cluster-computing engine, and PySpark is the Python library used to work with Spark.
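
As a first taste, here is a minimal sketch of a PySpark program (assuming the pyspark package is installed; the app name is arbitrary). It creates a SparkSession, the entry point to PySpark, builds a small DataFrame, and prints it:

```python
from pyspark.sql import SparkSession

# Create (or reuse) a local SparkSession -- the entry point to PySpark.
spark = SparkSession.builder.appName("hello-world").getOrCreate()

# Build a small DataFrame and print it.
df = spark.createDataFrame([("Alice", 34), ("Bob", 29)], ["name", "age"])
df.show()

spark.stop()
```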

Getting Started

Dependencies

  • Jupyter Notebook
  • PySpark package
  • Apache Spark

Installation

We assume that you have already installed Anaconda and have some basic knowledge of Python and SQL to follow along with the tutorial.

  1. Go to the Apache Spark website.
  2. Choose a Spark release. In our case we used 2.4.3 (May 07, 2019).
  3. Choose a package type. It is selected by default.
  4. Click the Download Spark link.

You can follow the link to install Spark; a quick check that the setup works is sketched below.
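
Once Spark is downloaded and unpacked, the following minimal smoke test (an illustrative sketch, assuming the optional findspark package is installed and SPARK_HOME points at the unpacked download; skip the findspark lines if you installed PySpark with pip) confirms that a session can start:

```python
import findspark
findspark.init()  # makes the downloaded Spark visible to Python via SPARK_HOME

from pyspark.sql import SparkSession

# Start a local session and print the Spark version.
spark = SparkSession.builder.master("local[*]").appName("smoke-test").getOrCreate()
print(spark.version)  # should print the release you chose, e.g. 2.4.3
spark.stop()
```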

Credits: Michael Galarnyk

Authors

  • Karan
  • Gunnika
  • Ravinder

Dataset

References

  • To learn terms such as SparkContext, RDDs, and transformations/actions, and to use methods like show(), groupBy(), etc., we used the guru99 website; a short sketch of these concepts follows below.
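
The snippet below is a small illustrative sketch (the app name and data are made up) tying those terms together: the SparkContext behind a session, lazy RDD transformations versus actions, and the DataFrame methods show() and groupBy():

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("concepts-demo").getOrCreate()
sc = spark.sparkContext  # the SparkContext behind the session

# RDDs: transformations (map, filter) are lazy; actions (collect) run the job.
rdd = sc.parallelize([1, 2, 3, 4, 5])
squares = rdd.map(lambda x: x * x).filter(lambda x: x > 4)  # transformations
print(squares.collect())  # action -> [9, 16, 25]

# DataFrames: groupBy() aggregates, show() prints the result.
df = spark.createDataFrame([("a", 1), ("a", 2), ("b", 3)], ["key", "value"])
df.groupBy("key").sum("value").show()

spark.stop()
```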

Please read the report to better understand the tutorial. Also, the datasets used are large, so please contact me on LinkedIn and I will give you access to the Drive link.
