This repository contains two sets of example labs for distributed data processing: a Hadoop/ folder with MapReduce Java examples and a Spark/ folder with PySpark example scripts. Each folder contains multiple experiments with small input datasets and a short README where applicable.
- `Hadoop/`: a set of Java MapReduce experiments (Exp-1 .. Exp-5).
  - Each experiment folder typically contains Java source files, small sample input files, and a README describing the individual experiment.
  - Files included (high-level):
    - `Exp-1/`: Word Count example (`WC_Mapper.java`, `WC_Reducer.java`, `WC_Driver.java`, `input.txt`, `Readme.md`)
    - `Exp-2/`: Max Temperature example (`MaxTemperature.java`, `MaxTempMapper.java`, `MaxTempReducer.java`, `weather.csv`, `README.md`)
    - `Exp-3/`: Students / data-processing example (`students.csv`, `README.md`)
    - `Exp-4/`: SequenceFile example (`SequenceFileWriterExample.java`, `README.md`)
    - `Exp-5/`: Map-side join example (`MapSideJoinDriver.java`, `MapSideJoinMapper.java`, `customers.txt`, `orders.txt`, `README.md`)
- `Spark/`: PySpark example scripts (EXP-1 .. EXP-7) and a `pyproject.toml` at the root of the Spark folder.
  - Each experiment folder contains `expN.py` and usually a `README.md` and small sample data where needed.
  - Files included (high-level):
    - `EXP-1/`: `exp1.py`
    - `EXP-2/`: `exp2.py`, `sample.txt`
    - `EXP-3/`: `exp3.py`
    - `EXP-4/`: `exp4.py`
    - `EXP-5/`: `exp5.py`, `people.csv`
    - `EXP-6/`: `exp6.py`
    - `EXP-7/`: `exp7.py`
These labs are educational examples demonstrating common big-data patterns:
- MapReduce programming with Hadoop in Java (mappers, reducers, drivers, joins, SequenceFile usage); a minimal word-count sketch is shown below.
- PySpark scripts demonstrating RDD/DataFrame operations and small data analysis tasks.
Each experiment is intentionally small and self-contained so you can run it locally (in standalone or pseudo-distributed mode) or on a cluster for learning.
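As a quick illustration of the mapper/reducer pattern behind the Word Count experiment, here is a minimal sketch. The class and method names are illustrative; the actual sources in `Hadoop/Exp-1` (`WC_Mapper.java`, `WC_Reducer.java`) may differ in naming and detail.

```java
// Minimal word-count mapper/reducer sketch (illustrative; the Exp-1 sources may differ).
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: emits (word, 1) for every token in each input line.
class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);
        }
    }
}

// Reducer: sums the counts emitted for each word.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
```

Because the reduce step is a simple sum, the same reducer class can also be registered as a combiner, a common optimization for word count.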
- For Hadoop experiments:
  - JDK 8+ and Apache Hadoop (configured locally or an accessible cluster), with `javac`/`jar` available.
- For Spark experiments:
  - Python 3.x and Apache Spark (or a Spark distribution that provides `spark-submit`).
  - Optionally, a Python virtual environment with dependencies managed by the `pyproject.toml` in the `Spark/` folder.
Below are minimal examples to run the experiments. Adjust class names, paths, and Hadoop/Spark configuration as appropriate for your environment.
Hadoop (from a machine with `hadoop` available):

```bash
# compile and build a jar (example for Exp-1 Word Count)
cd Hadoop/Exp-1
mkdir -p classes
javac -classpath "$(hadoop classpath)" -d classes WC_*.java
jar -cvf wc.jar -C classes .

# run the MapReduce job (input and output paths are examples)
hadoop jar wc.jar WC_Driver input.txt output-wc
```

Notes:

- On Windows, run these commands in a WSL shell or in an environment where `hadoop` is available.
- Replace `WC_Driver` with the fully qualified driver class name if package statements are used.
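For reference, a driver that declares a package must be invoked by its fully qualified name (for example `hadoop jar wc.jar com.example.wc.WC_Driver input.txt output-wc`). The sketch below is illustrative only; the package name is hypothetical and the real `WC_Driver.java` may be structured differently.

```java
// Illustrative driver sketch; package name and job wiring are assumptions, not the repo's exact code.
package com.example.wc; // hypothetical package; adjust to whatever the sources declare

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WC_Driver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WC_Driver.class);
        job.setMapperClass(WC_Mapper.class);     // mapper/reducer classes from Exp-1 (assumed to be in the same package)
        job.setReducerClass(WC_Reducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input path
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output path (must not exist yet)
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```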
Spark (use `spark-submit` for each `expN.py`):

```bash
# run an example Spark script locally
cd Spark/EXP-1
spark-submit --master local[*] exp1.py
```

If you prefer to run a Python file directly (without Spark cluster features), some simple scripts may run with plain Python, but `spark-submit` is the recommended way.
- Each experiment folder usually contains its own `README.md` with experiment-specific notes and sample data. Check the folder for additional details.
- To add an experiment, create a new `Exp-<n>/` (Hadoop) or `EXP-<n>/` (Spark) folder, and include the source, a small sample input, and a short README explaining the objective and how to run it.
This repository contains educational examples. No license file is included by default — add one if you want to set explicit reuse terms.
Possible next additions:

- A short top-level table of contents linking directly to each experiment folder.
- Example build scripts (Makefile / build.sh) for Hadoop compilation and jar creation.
- A simple requirements file or `pyproject.toml` adjustments for the `Spark/` folder.