This project developed a data pipeline for trending movies using Airflow and Python. The pipeline worked as follows:
- The data pipeline ingested trending movies' and distributors' data from IMDb and Box Office.
- The ingested data was cleansed, formatted, processed, and indexed in Elasticsearch.
- Finally, a dashboard was created from the enriched data using Kibana.
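
The steps above could be chained in Airflow roughly as follows. This is only a minimal sketch, not the project's actual DAG: the stage functions are hypothetical placeholders, the DAG id, start date, and daily schedule are assumptions, and `build_dag` assumes Airflow 2.4 or later is installed.

```python
from datetime import datetime


# Hypothetical placeholder callables. The real scraping, cleansing,
# processing, and indexing logic is not shown in this README.
def ingest():
    return "ingest"


def cleanse():
    return "cleanse"


def process():
    return "process"


def index():
    return "index"


# Stage order taken from the description above.
STAGES = [
    ("ingest", ingest),
    ("cleanse", cleanse),
    ("process", process),
    ("index", index),
]


def build_dag():
    """Chain the stages as sequential Airflow tasks (assumes Airflow 2.4+)."""
    # Imported here so the stage functions stay usable without Airflow.
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    with DAG(
        dag_id="trending_movies_pipeline",  # assumed name
        start_date=datetime(2024, 1, 1),    # assumed start date
        schedule="@daily",                  # assumed cadence
        catchup=False,
    ) as dag:
        previous = None
        for task_id, fn in STAGES:
            task = PythonOperator(task_id=task_id, python_callable=fn)
            if previous is not None:
                previous >> task  # run each stage after the one before it
            previous = task
    return dag
```

In a real DAG file, `dag = build_dag()` would be assigned at module level so that the Airflow scheduler can discover it.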
The tools and libraries used in this project included:
- Airflow for automating the pipeline
- Selenium for data ingestion through web scraping
- Pandas for data cleansing and formatting
- PySpark for data processing
- Elasticsearch and Kibana for indexing and the dashboard
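
To illustrate how cleansing and indexing might fit together, here is a small sketch that turns scraped rows into documents ready for bulk indexing. Plain Python stands in for Pandas here, and the field names (`title`, `distributor`, `revenue`), the index name, and the whole-dollar revenue format are all assumptions rather than details from the project.

```python
def clean_revenue(raw):
    """Parse a scraped string such as '$1,234,567' into an int.

    Assumes whole-dollar amounts; returns None when no digits are found.
    """
    digits = "".join(ch for ch in str(raw) if ch.isdigit())
    return int(digits) if digits else None


def to_bulk_actions(rows, index_name="trending_movies"):
    """Yield one bulk-index action per row that has a parseable revenue."""
    for row in rows:
        revenue = clean_revenue(row.get("revenue"))
        if revenue is None:
            continue  # drop rows whose revenue could not be parsed
        title = row["title"].strip()
        yield {
            "_index": index_name,  # assumed index name
            "_id": title,          # assumes titles are unique
            "_source": {
                "title": title,
                "distributor": row["distributor"].strip(),
                "revenue": revenue,
            },
        }
```

Each yielded dictionary has the action shape accepted by `elasticsearch.helpers.bulk(client, actions)` in the official Python client, so the generator could be passed to it directly once a client is configured.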
The dashboard showed:
- Total revenue of all the trending movies
- Top 5 trending movies by user rating
- Distributors' revenue share
- Top 5 distributors by revenue
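
The dashboard figures listed above are simple aggregations over the movie records. The sketch below computes them in plain Python as a stand-in for the PySpark processing and Kibana visualizations; the record fields (`title`, `distributor`, `revenue`, `rating`) are assumed for illustration.

```python
from collections import defaultdict


def summarize(movies, top_n=5):
    """Compute the dashboard aggregations from a list of movie dicts."""
    total = sum(m["revenue"] for m in movies)

    # Highest user rating first.
    top_rated = sorted(movies, key=lambda m: m["rating"], reverse=True)[:top_n]

    # Sum revenue per distributor, then rank by that sum.
    by_distributor = defaultdict(int)
    for m in movies:
        by_distributor[m["distributor"]] += m["revenue"]
    ranked = sorted(by_distributor.items(), key=lambda kv: kv[1], reverse=True)

    # Each distributor's fraction of total revenue (empty if total is zero).
    share = {name: rev / total for name, rev in ranked} if total else {}

    return {
        "total_revenue": total,
        "top_rated": [m["title"] for m in top_rated],
        "revenue_share": share,
        "top_distributors": [name for name, _ in ranked[:top_n]],
    }
```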