helpfulengineering/project-hab-prediction

 
 


🌊 Harmful Algal Bloom (HAB) Prediction using Machine Learning

This repository presents a machine learning pipeline to predict the occurrence of Karenia brevis harmful algal blooms (HABs) along the coast of Florida. It combines in-situ data from the Florida Fish and Wildlife Conservation Commission (FWC) with satellite-derived environmental variables (SST and Chlorophyll-a) using scalable geospatial processing and interpretable modeling.

🧠 Project Summary

  • Goal: Predict the presence of Karenia brevis blooms using environmental variables from remote sensing and in-situ reports.
  • Scope: January 2019 – December 2023 (5 years of daily data).
  • Model Used: XGBoost
  • Outcome: Achieved 95% accuracy, with bloom-class precision of 0.89 and recall of 0.84.

📁 Repository Structure

├── hab.ipynb                           # Main model building & evaluation notebook
├── sstdatapreprocessing.ipynb          # SST data loading and cleaning
├── chlorophylldatapreprocessing.ipynb  # Chlorophyll-a data loading and cleaning
├── README.md                           # Project documentation
└── /data                               # Datasets

📊 Data Sources

| Type | Source | Description |
|------|--------|-------------|
| Ground Truth | Florida Fish and Wildlife Conservation Commission (FWC) | Daily Karenia brevis concentrations (2019–2023) |
| SST | NOAA Coral Reef Watch SST v3.1 | Daily sea surface temperature (~5 km resolution) |
| Chlorophyll | NOAA SNPP Chl-a | Chlorophyll-a (750 m resolution) |

🔗 Raw Data Access

The raw Sea Surface Temperature (SST) and Chlorophyll-a datasets used in this project were obtained from NOAA's ERDDAP servers.
You can access and download the data directly from:

For reproducibility, refer to the sstdatapreprocessing.ipynb and chlorophylldatapreprocessing.ipynb notebooks to follow the exact cleaning and formatting steps.

Data Access

You can download the processed SST and Chlorophyll datasets from the following Google Drive link:

Download Processed Data (Google Drive)

🛠️ Tools & Technologies

  • Data Processing: Apache Sedona, PySpark, Pandas, NumPy
  • Modeling: XGBoost, scikit-learn
  • Visualization & Interpretation: SHAP, QGIS
  • Notebook Environment: Jupyter

🔁 Workflow Overview

📍 Preprocessing

  • Performed spatial-temporal joins of FWC, SST, and Chl-a data using Apache Sedona.
  • Aligned time zones and filtered data using bounding boxes.
  • Engineered seasonality features (sin/cos of day-of-year).
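The preprocessing steps above can be sketched in plain Python. This is only an illustration under stated assumptions, not the notebooks' code: the project performs its joins with Apache Sedona, whereas this sketch swaps in a simple nearest-grid-cell match on the same date. The bounding box values, the record layout, and the helper names (`in_bbox`, `nearest_same_day`, `seasonality`) are all hypothetical, and time zone alignment is left out.

```python
import calendar
import math
from datetime import date

# Hypothetical bounding box around Florida coastal waters (illustrative values,
# not taken from the project).
BBOX = {"min_lon": -88.0, "max_lon": -79.0, "min_lat": 24.0, "max_lat": 31.0}


def in_bbox(lon, lat, bbox=BBOX):
    """Return True if the point lies inside the bounding box."""
    return (bbox["min_lon"] <= lon <= bbox["max_lon"]
            and bbox["min_lat"] <= lat <= bbox["max_lat"])


def nearest_same_day(sample, grid_rows):
    """Return the grid row from the sample's date that is closest to it.

    Distance is squared lon/lat difference, a simplification that ignores
    Earth curvature. Returns None when no grid row shares the sample's date.
    """
    same_day = [g for g in grid_rows if g["date"] == sample["date"]]
    if not same_day:
        return None
    return min(
        same_day,
        key=lambda g: (g["lon"] - sample["lon"]) ** 2 + (g["lat"] - sample["lat"]) ** 2,
    )


def seasonality(day):
    """Encode day-of-year as (sin, cos) so 31 Dec and 1 Jan end up close."""
    days_in_year = 366 if calendar.isleap(day.year) else 365
    angle = 2 * math.pi * day.timetuple().tm_yday / days_in_year
    return math.sin(angle), math.cos(angle)


# Made-up records for illustration only.
samples = [
    {"date": date(2021, 7, 2), "lon": -82.6, "lat": 27.3},
    {"date": date(2021, 7, 2), "lon": -95.0, "lat": 27.3},  # outside the box
]
sst_grid = [
    {"date": date(2021, 7, 2), "lon": -82.55, "lat": 27.30, "sst": 30.1},
    {"date": date(2021, 7, 2), "lon": -81.00, "lat": 25.00, "sst": 30.8},
    {"date": date(2021, 7, 3), "lon": -82.60, "lat": 27.30, "sst": 29.9},
]

features = []
for s in samples:
    if not in_bbox(s["lon"], s["lat"]):
        continue  # drop points outside the study area
    match = nearest_same_day(s, sst_grid)
    if match is None:
        continue  # no satellite value available for that date
    doy_sin, doy_cos = seasonality(s["date"])
    features.append({**s, "sst": match["sst"], "doy_sin": doy_sin, "doy_cos": doy_cos})

print(features)
```

Encoding day-of-year as a sine/cosine pair rather than a raw integer keeps late December and early January adjacent in feature space, which a plain 1–365 value would not.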

📊 Results

| Metric | No Bloom | Bloom |
|--------|----------|-------|
| Precision | 0.97 | 0.89 |
| Recall | 0.98 | 0.84 |
| F1-Score | 0.97 | 0.86 |
  • Accuracy: 95%
  • Macro Avg F1-Score: 0.92
  • Weighted Avg Precision/Recall: 0.95
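As a quick consistency check on the figures above, the per-class F1 scores and the macro average can be recomputed from the reported precision and recall. This is a minimal sketch using only the rounded values in the table; the dictionary layout and the helper name `f1` are illustrative, and the notebook itself presumably obtains these metrics from its evaluation code.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall) per class, copied from the results table.
reported = {"no_bloom": (0.97, 0.98), "bloom": (0.89, 0.84)}

f1_scores = {label: f1(p, r) for label, (p, r) in reported.items()}
macro_f1 = sum(f1_scores.values()) / len(f1_scores)  # unweighted mean over classes

for label, score in f1_scores.items():
    print(f"{label}: F1 = {score:.2f}")
print(f"macro F1 = {macro_f1:.2f}")
# no_bloom: F1 = 0.97
# bloom: F1 = 0.86
# macro F1 = 0.92
```

The macro average weights both classes equally, so it sits below the weighted average when the minority Bloom class scores lower than No Bloom.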

🧭 Visual Validation with QGIS

To ensure geographic accuracy, QGIS was used to visualize the datasets, including SST, Chlorophyll-a, and HAB occurrence points. This helped validate that the data aligned with the Florida coastal region.

Below are sample plots exported from QGIS:

SST Points Visualization

[Image: SST Data Points]

Chlorophyll-a Points Visualization

[Image: Chlorophyll Data Points]

HAB Occurrence Overlay

[Image: HAB Points]

Getting Started

To run this project locally, follow these steps:

1. Clone the Repository

git clone https://github.com/helpfulengineering/project-hab-prediction.git
cd project-hab-prediction

2. Create a Virtual Environment (Optional but Recommended)

Create and activate a virtual environment:

python -m venv venv
source venv/bin/activate      # Linux/macOS
venv\Scripts\activate         # Windows

3. Install Dependencies

Install all required Python packages:

pip install -r requirements.txt

Then, install Apache Sedona (required for spatial joins):

pip install apache-sedona==1.4.1

4. Launch the Notebooks

Start Jupyter Notebook:

jupyter notebook

Open and run the notebooks in the following order:

  • sstdatapreprocessing.ipynb
  • chlorophylldatapreprocessing.ipynb
  • hab.ipynb

👥 Acknowledgements

  • Florida Fish and Wildlife Conservation Commission (FWC)
  • NOAA ERDDAP
  • Helpful Engineering (project sponsor)

This project was developed with the assistance of OpenAI's ChatGPT-4o model.
