For more information on this project, please visit the project website.
This is the pipeline for processing the image data, tiling the images, preparing the training, validation, and test data, and training the model in TensorFlow. There are separate processes for DigitalGlobe data and for NOAA data. More details on the data used for this project can be found here.
Scrape the image files from the source websites and save them in a folder. For DigitalGlobe, the image files must be sorted into 3-band and 1-band folders.
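The sorting step could be automated by checking each file's band count, for example with rasterio. This is a minimal sketch; the directory names and the use of rasterio are assumptions for illustration, not part of the original pipeline.

```python
import shutil
from pathlib import Path

import rasterio

# Hypothetical locations; adjust to wherever the scraped files live.
SRC_DIR = Path("raw_images")
THREE_BAND_DIR = Path("3band")
ONE_BAND_DIR = Path("1band")

THREE_BAND_DIR.mkdir(exist_ok=True)
ONE_BAND_DIR.mkdir(exist_ok=True)

for tif in SRC_DIR.glob("*.tif"):
    with rasterio.open(tif) as src:
        band_count = src.count  # number of raster bands in the file
    dest = THREE_BAND_DIR if band_count == 3 else ONE_BAND_DIR
    shutil.move(str(tif), str(dest / tif.name))
```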
Compress the image files. For DigitalGlobe this reduces roughly 3 TB of raw imagery to about 60 GB.
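The source does not name the compression tool; one lossless option is rewriting each GeoTIFF with LZW compression via rasterio, as in this hypothetical sketch:

```python
import rasterio

def compress_tif(src_path, dst_path):
    """Rewrite a GeoTIFF with LZW compression (lossless)."""
    with rasterio.open(src_path) as src:
        profile = src.profile.copy()
        profile.update(compress="lzw")
        with rasterio.open(dst_path, "w", **profile) as dst:
            dst.write(src.read())
```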
Apply the appropriate utility scripts as needed, based on observations of the data.
Clip the large TIFF images into smaller 2048 x 2048 tiles, working left to right and top to bottom, and write a CSV of the lat/long range of each tile.
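A minimal tiling sketch using rasterio windows, assuming north-up GeoTIFFs whose CRS is lat/long; the function and file names are illustrative, not the project's actual script:

```python
import csv
from pathlib import Path

import rasterio
from rasterio.windows import Window, bounds

TILE = 2048

def tile_image(tif_path, out_dir, csv_writer):
    """Clip one large GeoTIFF into 2048 x 2048 tiles, left to right,
    top to bottom, recording each tile's lat/long bounds."""
    with rasterio.open(tif_path) as src:
        for row_off in range(0, src.height, TILE):
            for col_off in range(0, src.width, TILE):
                window = Window(col_off, row_off,
                                min(TILE, src.width - col_off),
                                min(TILE, src.height - row_off))
                profile = src.profile.copy()
                profile.update(width=window.width, height=window.height,
                               transform=src.window_transform(window))
                tile_id = f"{Path(tif_path).stem}_{row_off}_{col_off}"
                out_path = Path(out_dir) / f"{tile_id}.tif"
                with rasterio.open(out_path, "w", **profile) as dst:
                    dst.write(src.read(window=window))
                # Bounds of this window in the source CRS (lon/lat here).
                left, bottom, right, top = bounds(window, src.transform)
                csv_writer.writerow([tile_id, left, bottom, right, top])

with open("tile_bounds.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["tile_id", "left", "bottom", "right", "top"])
    tile_image("big_image.tif", "tiles", writer)
```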
From the CSV of lat/long ranges per tile and the GeoJSON file of bounding-box lat/longs tagged with the source TIFF id, produce a GeoJSON of pixel ranges per bounding box, tagged with the small-tile TIFF id.
SSD requires the training-data bounding boxes to be given in pixel coordinates.
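Assuming north-up tiles with known lat/long bounds, the conversion reduces to a linear mapping. This helper is an illustrative sketch; its name and signature are hypothetical:

```python
def latlon_box_to_pixels(bbox, tile_bounds, tile_size=2048):
    """Convert a (lon_min, lat_min, lon_max, lat_max) bounding box into
    pixel coordinates within a tile, given the tile's geographic bounds.

    Assumes an unrotated, north-up tile so a linear mapping suffices.
    """
    lon_min, lat_min, lon_max, lat_max = bbox
    left, bottom, right, top = tile_bounds
    x_scale = tile_size / (right - left)
    y_scale = tile_size / (top - bottom)
    # Pixel origin is the top-left corner; latitude decreases downward.
    x_min = (lon_min - left) * x_scale
    x_max = (lon_max - left) * x_scale
    y_min = (top - lat_max) * y_scale
    y_max = (top - lat_min) * y_scale
    return x_min, y_min, x_max, y_max
```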
Split the images and the GeoJSON file into training, validation, and test subsets (8:1:1).
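One simple way to implement the 8:1:1 split is to shuffle the tile ids with a fixed seed and slice. This sketch assumes splitting by tile id; the helper name is hypothetical:

```python
import random

def split_ids(tile_ids, seed=0):
    """Shuffle tile ids deterministically and split 8:1:1
    into train/validation/test."""
    rng = random.Random(seed)
    ids = sorted(tile_ids)
    rng.shuffle(ids)
    n_train = int(0.8 * len(ids))
    n_val = int(0.1 * len(ids))
    return (ids[:n_train],
            ids[n_train:n_train + n_val],
            ids[n_train + n_val:])
```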
Use an IPython notebook to render the bounding boxes over the TIFF tiles and manually inspect them for accuracy; record bad labels and remove those bounding boxes from the GeoJSON file.
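The inspection plot could look like this sketch, which uses matplotlib to draw pixel-coordinate boxes over one tile; the function name and figure size are illustrative:

```python
import matplotlib.pyplot as plt
import matplotlib.patches as patches
import rasterio

def show_tile_with_boxes(tif_path, boxes):
    """Render pixel-coordinate bounding boxes over a tile for inspection.

    `boxes` is an iterable of (x_min, y_min, x_max, y_max) tuples.
    """
    with rasterio.open(tif_path) as src:
        img = src.read().transpose(1, 2, 0)  # (bands, H, W) -> (H, W, bands)
    fig, ax = plt.subplots(figsize=(10, 10))
    ax.imshow(img.squeeze())
    for x_min, y_min, x_max, y_max in boxes:
        ax.add_patch(patches.Rectangle((x_min, y_min),
                                       x_max - x_min, y_max - y_min,
                                       fill=False, edgecolor="red"))
    plt.show()
```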
Shift, flip, and rotate the images to augment the training data.
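For example, flips and 180-degree rotations can be expressed as array slicing, with the box coordinates mirrored to match. This sketch is illustrative, not the project's actual augmentation code:

```python
import numpy as np

def augment(image, boxes):
    """Yield flipped/rotated copies of an image with boxes adjusted.

    `image` is (H, W, C); `boxes` is an (N, 4) array of
    (x_min, y_min, x_max, y_max) pixel coordinates.
    """
    h, w = image.shape[:2]
    yield image, boxes  # original
    # Horizontal flip: x coordinates mirror about the image width.
    fboxes = boxes.copy()
    fboxes[:, [0, 2]] = w - boxes[:, [2, 0]]
    yield image[:, ::-1], fboxes
    # 180-degree rotation: both axes mirror.
    rboxes = boxes.copy()
    rboxes[:, [0, 2]] = w - boxes[:, [2, 0]]
    rboxes[:, [1, 3]] = h - boxes[:, [3, 1]]
    yield image[::-1, ::-1], rboxes
```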
Prepare input for the network.
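The exact input format depends on the SSD implementation; many TensorFlow detection pipelines consume TFRecords of tf.train.Example protos. The feature keys below are illustrative assumptions, not confirmed by the source:

```python
import tensorflow as tf

def make_example(image_bytes, boxes, tile_id):
    """Pack one tile and its boxes into a tf.train.Example.

    Feature names are illustrative; match them to whatever
    the chosen SSD implementation expects.
    """
    feature = {
        "image/encoded": tf.train.Feature(
            bytes_list=tf.train.BytesList(value=[image_bytes])),
        "image/source_id": tf.train.Feature(
            bytes_list=tf.train.BytesList(value=[tile_id.encode()])),
        "image/object/bbox/xmin": tf.train.Feature(
            float_list=tf.train.FloatList(value=[b[0] for b in boxes])),
        "image/object/bbox/ymin": tf.train.Feature(
            float_list=tf.train.FloatList(value=[b[1] for b in boxes])),
        "image/object/bbox/xmax": tf.train.Feature(
            float_list=tf.train.FloatList(value=[b[2] for b in boxes])),
        "image/object/bbox/ymax": tf.train.Feature(
            float_list=tf.train.FloatList(value=[b[3] for b in boxes])),
    }
    return tf.train.Example(features=tf.train.Features(feature=feature))
```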