For more information on this project, please visit the project website.
This is the pipeline for processing the image data: tiling the images, preparing the training, validation and test data, and training the model in TensorFlow. There are separate processes for DigitalGlobe data and for NOAA data. More details on the data used for this project can be found here.
Scrape the image files from the source websites and save them in a folder. For DigitalGlobe, the image files must also be sorted into 3-band and 1-band folders.
The following instructions are for NOAA only:
Run `sudo bash downloadTiffs.sh` on Ubuntu to download the image files after downloading the provided script.
All the files combined will be around 60 GB, so it is recommended to use an external hard drive to ensure that you have enough storage. If you have not used Ubuntu on your device before, you will need to run `sudo apt-get install wget` before you can run the script.
You may need to delete carriage return characters if the files are not downloading properly (e.g. the files are downloaded instantly). To do this, run `sed "s/$(printf '\r')\$//" downloadTiffs.sh > downloadTiffs2.sh && mv downloadTiffs2.sh downloadTiffs.sh` on Ubuntu.
Each file takes roughly 5-7 minutes to download; please be patient!
Takes the image files and compresses them. For DigitalGlobe the raw imagery takes up about 3 TB and compresses to roughly 60 GB.
Please download both files and ensure that they are in the same directory. You will only need to run `compressTiffs.sh`, as it will call `compressTiffs.py`.
You will need Miniconda in your Ubuntu terminal in order to run the Python files (we chose Miniconda over Anaconda because Miniconda ships without pre-installed Python packages, saving the disk space that unnecessary packages would otherwise take up). Please go here for instructions on how to install Miniconda on Ubuntu. We recommend downloading the latest version.
Once you have installed Miniconda, you must take the following steps to activate it in Ubuntu:
- Run `sudo -s`.
- Enter the password associated with your account in Ubuntu.
- First-time only: run `export PATH="/root/miniconda3/bin:$PATH"` in the terminal. This ensures that Ubuntu points to your Miniconda directory.
You then must install the GDAL package (preferably in a virtual environment). To set up your virtual environment, run `conda create -n [env_name] python=[version]` (we recommend Python 3.9). You can then activate it at any time by running `source activate [env_name]` and deactivate it with `source deactivate`.
Install the GDAL package by running `conda install -c conda-forge gdal` while your virtual environment is active.
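To confirm the installation worked, a quick sanity check like the one below (run with the environment active) should print a GDAL version. This snippet is only illustrative and is not part of the pipeline scripts.

```python
# Optional sanity check: confirm GDAL is importable inside the activated environment.
from osgeo import gdal

print(gdal.__version__)                             # installed GDAL version
print(gdal.GetDriverByName("GTiff") is not None)    # GeoTIFF driver should be available
```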
You may get a syntax error when you run `compressTiffs.sh`. To fix this, open the script with `vi compressTiffs.sh`, run `:set ff=unix`, and save with `:wq!`.
`compressTiffs.sh` will automatically go to the folder where the tar files are located, so ensure that the shell and Python files are placed in the directory directly above the `noaa_images` folder, which should have been created when you ran `downloadTiffs.sh`.
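As a rough illustration of the kind of compression involved (the actual `compressTiffs.py` logic may differ, and the file names below are placeholders), a GeoTIFF can be rewritten with GDAL compression options:

```python
# Hypothetical example: rewrite a single GeoTIFF with compression using GDAL.
# compressTiffs.py may use different options; this only illustrates the idea.
from osgeo import gdal

src_path = "noaa_images/example.tif"        # assumed input path
dst_path = "noaa_images/example_lzw.tif"    # assumed output path

gdal.Translate(
    dst_path,
    src_path,
    creationOptions=["COMPRESS=LZW", "TILED=YES"],  # lossless LZW compression, tiled layout
)
```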
Apply the appropriate utility scripts as necessary, based on observations of the data.
Clip the big TIFF images into smaller tiles (2048 x 2048 pixels), working from left to right and top to bottom, and produce a CSV of the lat/long ranges for each TIFF tile.
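A minimal sketch of this tiling step, assuming GDAL is available, the raster is in a geographic (lon/lat) CRS with no rotation, and using hypothetical file names and CSV columns (the project's own tiling script may differ):

```python
# Hypothetical sketch: cut a large GeoTIFF into 2048 x 2048 tiles, left-to-right and
# top-to-bottom, and record each tile's lon/lat range in a CSV.
import csv
from pathlib import Path
from osgeo import gdal

TILE = 2048
Path("tiles").mkdir(exist_ok=True)
src = gdal.Open("noaa_images/big_image.tif")     # assumed input path
gt = src.GetGeoTransform()                        # (x_origin, px_w, 0, y_origin, 0, px_h)
width, height = src.RasterXSize, src.RasterYSize

with open("tile_ranges.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["tile_id", "lon_min", "lon_max", "lat_min", "lat_max"])
    tile_id = 0
    for y_off in range(0, height, TILE):
        for x_off in range(0, width, TILE):
            w = min(TILE, width - x_off)
            h = min(TILE, height - y_off)
            gdal.Translate(f"tiles/tile_{tile_id}.tif", src, srcWin=[x_off, y_off, w, h])
            # Convert the pixel window to geographic coordinates via the geotransform
            # (assumes gt[2] == gt[4] == 0, i.e. a north-up image).
            lon_min = gt[0] + x_off * gt[1]
            lon_max = gt[0] + (x_off + w) * gt[1]
            lat_max = gt[3] + y_off * gt[5]          # gt[5] is negative for north-up images
            lat_min = gt[3] + (y_off + h) * gt[5]
            writer.writerow([tile_id, lon_min, lon_max, lat_min, lat_max])
            tile_id += 1
```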
From the CSV of lat/long ranges per TIFF tile and the GeoJSON file of bounding-box lat/longs with an attached TIFF id, produce a GeoJSON of pixel ranges per bounding box, tagged with the small TIFF id. SSD requires the training data to be supplied as pixel coordinates.
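The conversion amounts to rescaling a lon/lat box into the pixel grid of its tile. A sketch under assumed names and the 2048 x 2048 tile size (the project's conversion script may differ):

```python
# Hypothetical sketch: convert a lon/lat bounding box to pixel coordinates within its
# tile, using the tile's lon/lat range from the CSV produced above.
def lonlat_box_to_pixels(box, tile, tile_size=2048):
    """box = (lon_min, lat_min, lon_max, lat_max); tile = dict with the tile's ranges."""
    x_scale = tile_size / (tile["lon_max"] - tile["lon_min"])
    y_scale = tile_size / (tile["lat_max"] - tile["lat_min"])
    x_min = (box[0] - tile["lon_min"]) * x_scale
    x_max = (box[2] - tile["lon_min"]) * x_scale
    # Pixel rows increase downward while latitude increases upward, so flip the y axis.
    y_min = (tile["lat_max"] - box[3]) * y_scale
    y_max = (tile["lat_max"] - box[1]) * y_scale
    return [round(x_min), round(y_min), round(x_max), round(y_max)]

# Example: a box in the middle of a tile spanning lon 10.0-10.1, lat 50.0-50.1.
tile = {"lon_min": 10.0, "lon_max": 10.1, "lat_min": 50.0, "lat_max": 50.1}
print(lonlat_box_to_pixels((10.04, 50.04, 10.06, 50.06), tile))  # -> [819, 819, 1229, 1229]
```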
You can verify that both the geospatial and pixel coordinates result in the correct bounding boxes being plotted in `BuildingMarker.ipynb`. You can manually select a tile of your choice, or test a random tile from your tiles folder.
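The check boils down to drawing the pixel-coordinate boxes over the tile image. A sketch of that idea, with assumed paths and box format rather than the notebook's exact code:

```python
# Hypothetical sketch: read one tile (assumed 1- or 3-band) and draw pixel-coordinate
# bounding boxes ([x_min, y_min, x_max, y_max]) on top of it for manual inspection.
import matplotlib.pyplot as plt
import matplotlib.patches as patches
from osgeo import gdal

tile = gdal.Open("tiles/tile_0.tif").ReadAsArray()   # shape: (bands, rows, cols) or (rows, cols)
boxes = [[819, 819, 1229, 1229]]                      # example pixel boxes for this tile

fig, ax = plt.subplots(figsize=(8, 8))
ax.imshow(tile.transpose(1, 2, 0) if tile.ndim == 3 else tile, cmap="gray")
for x_min, y_min, x_max, y_max in boxes:
    ax.add_patch(patches.Rectangle((x_min, y_min), x_max - x_min, y_max - y_min,
                                   edgecolor="red", facecolor="none", linewidth=1.5))
plt.show()
```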
Convert the TIFF files to chips: smaller images that can be used for the train/test sets.
Split the images and the GeoJSON file into training, validation and test subsets (8:1:1).
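A minimal sketch of an 8:1:1 split over chip file names; the actual split script may instead key the split off the GeoJSON, and the paths here are assumptions:

```python
# Hypothetical sketch: shuffle the chip files and split them 80/10/10.
import random
from pathlib import Path

chips = sorted(Path("chips").glob("*.tif"))
random.seed(42)               # fixed seed so the split is reproducible
random.shuffle(chips)

n = len(chips)
n_train = int(0.8 * n)
n_val = int(0.1 * n)
splits = {
    "train": chips[:n_train],
    "val": chips[n_train:n_train + n_val],
    "test": chips[n_train + n_val:],
}
for name, files in splits.items():
    print(name, len(files))
```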
Use an IPython notebook to plot the bounding boxes over the images (TIFF files) and check them for accuracy: render the bounding boxes over the TIFF files, inspect them manually, record bad labels, and remove those bounding boxes from the GeoJSON file.
Shift, flip and rotate the images to generate additional training data.
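A sketch of what these augmentations look like on a chip and its pixel boxes; the names and box format are assumptions, and only the horizontal-flip case shows the matching box transform for brevity:

```python
# Hypothetical sketch of simple augmentations applied to an image chip (rows, cols, bands)
# and its pixel-coordinate boxes [x_min, y_min, x_max, y_max].
import numpy as np

def hflip(image, boxes):
    """Flip the chip left-right and mirror the x coordinates of its boxes."""
    width = image.shape[1]
    flipped = image[:, ::-1, :]
    new_boxes = [[width - x_max, y_min, width - x_min, y_max]
                 for x_min, y_min, x_max, y_max in boxes]
    return flipped, new_boxes

def rotate90(image):
    """Rotate the chip 90 degrees counter-clockwise (boxes must be rotated separately)."""
    return np.rot90(image)

def shift(image, dx, dy):
    """Shift the chip by (dx, dy) pixels with wrap-around (np.roll); a real pipeline
    would typically pad the exposed border instead."""
    return np.roll(np.roll(image, dy, axis=0), dx, axis=1)
```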
Prepare input for the network.