This repository presents a machine learning pipeline to predict the occurrence of Karenia brevis harmful algal blooms (HABs) along the coast of Florida. It combines in-situ data from the Florida Fish and Wildlife Conservation Commission (FWC) with satellite-derived environmental variables (SST and Chlorophyll-a) using scalable geospatial processing and interpretable modeling.
- Goal: Predict the presence of Karenia brevis blooms using environmental variables from remote sensing and in-situ reports.
- Scope: January 2019 – December 2023 (5 years of daily data).
- Model Used: XGBoost
- Outcome: Achieved 95% accuracy, with F1-scores of 0.97 for the no-bloom class and 0.86 for the bloom class.
├── hab.ipynb # Main model building & evaluation notebook
├── sstdatapreprocessing.ipynb # SST data loading and cleaning
├── chlorophylldatapreprocessing.ipynb # Chlorophyll-a data loading and cleaning
├── README.md # Project documentation
└── /data # Datasets
Type | Source | Description |
---|---|---|
Ground Truth | Florida Fish and Wildlife Conservation Commission (FWC) | Daily Karenia brevis concentrations (2019–2023) |
SST | NOAA Coral Reef Watch SST v3.1 | Daily sea surface temperature (~5 km resolution) |
Chlorophyll | NOAA SNPP Chl-a | Chlorophyll-a (750 m resolution) |
The raw Sea Surface Temperature (SST) and Chlorophyll-a datasets used in this project were obtained from NOAA's ERDDAP servers.
You can access and download the data directly from:
For reproducibility, refer to the sstdatapreprocessing.ipynb and chlorophylldatapreprocessing.ipynb notebooks, which contain the exact cleaning and formatting steps.
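As a rough illustration of how a subset request to an ERDDAP server is structured, the sketch below builds a griddap URL for one variable over a time, latitude, and longitude range. The `[(start):1:(stop)]` constraint syntax is ERDDAP's own; everything else here is an assumption made for the example: the server address, the dataset ID, the variable name, and the bounding box are placeholders, and the sketch assumes a grid with exactly three dimensions (time, latitude, longitude). The actual dataset IDs and dimension order should be taken from the dataset's ERDDAP page.

```python
from urllib.parse import quote


def griddap_url(server, dataset_id, variable, time_range, lat_range, lon_range, fmt="nc"):
    """Build an ERDDAP griddap URL for one variable over a time/lat/lon box."""

    def dim(start, stop):
        # ERDDAP constraint syntax: [(start):stride:(stop)], stride 1 = every value
        return f"[({start}):1:({stop})]"

    query = variable + dim(*time_range) + dim(*lat_range) + dim(*lon_range)
    # Percent-encode the query so brackets, parentheses and colons survive transport
    return f"{server}/griddap/{dataset_id}.{fmt}?{quote(query, safe='')}"


# All values below are placeholders (assumptions), not the project's real settings.
url = griddap_url(
    server="https://example.org/erddap",       # hypothetical server
    dataset_id="HYPOTHETICAL_SST_DATASET",     # hypothetical dataset ID
    variable="sea_surface_temperature",        # hypothetical variable name
    time_range=("2019-01-01T12:00:00Z", "2023-12-31T12:00:00Z"),
    lat_range=(24.0, 31.0),                    # illustrative Florida-area box
    lon_range=(-88.0, -80.0),
)
print(url)
```

Some gridded datasets store latitude in descending order or include an extra dimension such as altitude, in which case the constraint list has to be adjusted to match.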
You can download the processed SST and Chlorophyll datasets from the following Google Drive link:
Download Processed Data (Google Drive)
- Data Processing: Apache Sedona, PySpark, Pandas, NumPy
- Modeling: XGBoost, scikit-learn
- Visualization & Interpretation: SHAP, QGIS
- Notebook Environment: Jupyter
- Performed spatial-temporal joins of FWC, SST, and Chl-a data using Apache Sedona.
- Aligned time zones and filtered data using bounding boxes.
- Engineered seasonality features (sin/cos of day-of-year).
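The bounding-box filter and the seasonality encoding can be sketched in plain Python as follows. This is a minimal illustration rather than the notebooks' actual code, which uses Apache Sedona and PySpark: the bounding-box coordinates, the sample records, and the function names are all assumptions made for the example. Encoding day-of-year as a sine/cosine pair places 31 December next to 1 January, which a raw day number would not.

```python
import calendar
import math
from datetime import date

# Illustrative bounding box around Florida (assumed values, not the project's)
FLORIDA_BBOX = {"lat_min": 24.0, "lat_max": 31.0, "lon_min": -88.0, "lon_max": -80.0}


def in_bbox(lat, lon, bbox=FLORIDA_BBOX):
    """Return True if the point lies inside the bounding box (edges included)."""
    return (bbox["lat_min"] <= lat <= bbox["lat_max"]
            and bbox["lon_min"] <= lon <= bbox["lon_max"])


def seasonality_features(d):
    """Map a date to (sin, cos) of its position in the year; 1 January maps to angle 0."""
    day_of_year = d.timetuple().tm_yday
    days_in_year = 366 if calendar.isleap(d.year) else 365
    angle = 2 * math.pi * (day_of_year - 1) / days_in_year
    return math.sin(angle), math.cos(angle)


# Hypothetical sample records standing in for in-situ observations
records = [
    {"date": date(2019, 1, 1), "lat": 27.5, "lon": -82.7},    # inside the box
    {"date": date(2021, 7, 2), "lat": 40.0, "lon": -74.0},    # outside the box
    {"date": date(2023, 12, 31), "lat": 26.1, "lon": -81.8},  # inside the box
]

features = []
for rec in records:
    if not in_bbox(rec["lat"], rec["lon"]):
        continue  # drop points outside the study area
    doy_sin, doy_cos = seasonality_features(rec["date"])
    features.append({**rec, "doy_sin": doy_sin, "doy_cos": doy_cos})

for row in features:
    print(row["date"], round(row["doy_sin"], 3), round(row["doy_cos"], 3))
```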
Metric | No Bloom | Bloom |
---|---|---|
Precision | 0.97 | 0.89 |
Recall | 0.98 | 0.84 |
F1-Score | 0.97 | 0.86 |
- Accuracy: 95%
- Macro Avg F1-Score: 0.92
- Weighted Avg Precision/Recall: 0.95
To check geographic alignment, the datasets (SST, Chlorophyll-a, and HAB occurrence points) were visualized in QGIS. This confirmed that the data fall within the Florida coastal region.
Below are sample plots exported from QGIS:
To run this project locally, follow these steps:
git clone https://github.com/your-username/hab-prediction.git
cd hab-prediction
Create and activate a virtual environment:
python -m venv venv
source venv/bin/activate # For Linux/macOS
venv\Scripts\activate # For Windows
Install all required Python packages:
pip install -r requirements.txt
Then, install Apache Sedona (required for spatial joins):
pip install apache-sedona==1.4.1
Start Jupyter Notebook:
jupyter notebook
Open and run the notebooks in the following order:
sstdatapreprocessing.ipynb
chlorophylldatapreprocessing.ipynb
hab.ipynb
- Florida Fish and Wildlife Conservation Commission (FWC)
- NOAA ERDDAP
- Helpful Engineering (project sponsor)