Building detection on mosaics using deep learning

Picture being able to select an arbitrary region of the world in your browser and rapidly locate all the objects of interest within it. The advancement of deep learning, combined with easy access to high-resolution satellite imagery enabled by platforms such as GBDX, is making accurate object detection at scale an attainable goal. However, the application of deep learning to satellite imagery is still in its infancy, and many questions remain open with regard to its efficacy at a global scale.

In a recent experiment, we used the VGG-16 convolutional neural network (CNN) to detect settlements in north-eastern Nigeria. We trained the CNN using a few thousand labeled chips from a small collection of WorldView-2 strips. We then deployed the model on these same strips, as well as a different set of strips that the model ‘had never seen’. We discovered that the model often performed much worse on the latter set, yielding many false positives. Simply put, the model could not generalize based on the data that it was trained on. The experiment raised many questions: How much training data is enough? How does terrain variability affect the performance of a model? How about differences in sensor, sun angle and capture time? Which type of image preprocessing is more conducive to deep learning?


Variability in accuracy when deploying VGG-16 on an image that it was trained on (left) vs an image that it was not trained on (right).


In an attempt to address some of these questions, we decided to conduct a larger experiment in the same part of the world. The region of interest is a square at the border of Nigeria and Cameroon, spanning an area just short of 20000 km². We picked out 31 recent images captured by WorldView-2, WorldView-3 and GeoEye-1 and ran them through our image preprocessor for orthorectification, atmospheric compensation and pansharpening. We then constructed a mosaic using our proprietary Flexible Large Area Mosaic Engine (FLAME) technology.

FLAME solves a rather complicated problem: it turns a hodgepodge of individual images into a seamless mosaic. It is designed and implemented to take advantage of parallel computation, such that it can process millions of km² in a matter of hours. FLAME constructs the mosaic in two main steps:

  • it adjusts the pixel intensities in the R,G,B bands to color-match a global base layer, a procedure that is called Base Layer Matching (BLM);
  • it intelligently weaves the images across optimized boundaries called cutlines in order to maximize blending.

The end result is a collection of color-consistent tif image tiles that comprise the mosaic, delivered in an S3 bucket. You can explore the mosaic in the map below or, for a full page view, you can click here. This visualization was created by setting up a Web Map Service (WMS) with MapServer.

We collected a reference data set by creating a giant grid of step size 125m over the mosaic, and having our crowd label 845000 chips in this grid as either ‘Buildings’ or ‘No Buildings’. The number of chips with buildings was just short of 13000, i.e., about 1.5% of the reference data set, which illustrates how sparsely populated this region is.
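As a rough illustration of how such a grid could be laid out, here is a minimal sketch that tiles a bounding box with square chips and emits GeoJSON-like features. It assumes coordinates in a projected CRS with units of meters; the function name `grid_chips` and the property names `chip_id` and `class_name` are hypothetical and not taken from our actual tooling.

```python
import math


def grid_chips(xmin, ymin, xmax, ymax, step=125.0):
    """Yield GeoJSON-like features for square chips tiling a bounding box.

    Assumption: coordinates are in a projected CRS with units of meters.
    Partial chips at the right and top edges are dropped.
    """
    n_cols = math.floor((xmax - xmin) / step)
    n_rows = math.floor((ymax - ymin) / step)
    for row in range(n_rows):
        for col in range(n_cols):
            x0 = xmin + col * step
            y0 = ymin + row * step
            x1, y1 = x0 + step, y0 + step
            yield {
                "type": "Feature",
                # class_name stays empty until the chip has been labeled.
                "properties": {"chip_id": f"{row}_{col}", "class_name": None},
                "geometry": {
                    "type": "Polygon",
                    # Closed counterclockwise ring: first and last vertex coincide.
                    "coordinates": [[[x0, y0], [x1, y0], [x1, y1], [x0, y1], [x0, y0]]],
                },
            }


# Toy extent of 500 m x 250 m, for illustration only.
chips = list(grid_chips(0.0, 0.0, 500.0, 250.0))
print(len(chips))  # 4 columns x 2 rows = 8
```

Applied to the full region of interest, the same loop produces on the order of a million chips, which is why the features are generated lazily.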


5000 samples from the reference data set with ‘Buildings’ and ‘No Buildings’ in equal parts.

Training and deploying the CNN

Here is the workflow that we implemented for training and deploying the CNN on the mosaic.


The inputs to the workflow are:

  • the mosaic tiles;
  • train.geojson, which includes the geometry and class name of each training chip;
  • a collection of target.geojson files, each including the geometries of the chips covering part of the grid.

The task chip-from-vrt extracts chips from the mosaic for training and deploying. It does this by creating a vrt file which specifies the full path to the S3 location of each tile in the mosaic using the GDAL virtual filesystem /vsis3/. It then extracts the chips defined by the geometries in the input geojson by calling gdal_translate on the vrt, and saves them to a user-defined bucket. This process allows the task to pull chips remotely from the mosaic tiles without actually mounting the tiles onto the GBDX worker executing the task.
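The two GDAL calls involved can be sketched as follows. This is not the source of chip-from-vrt, only an illustration of the idea: the helper names, the bucket, the tile keys and the output paths are hypothetical, and the functions only build the command lines so that they can be inspected before being run.

```python
import subprocess


def build_vrt_command(vrt_path, bucket, tile_keys):
    """Return the gdalbuildvrt argv that indexes all mosaic tiles in one VRT."""
    # /vsis3/ lets GDAL read the tiles directly from S3 without downloading them.
    tile_paths = [f"/vsis3/{bucket}/{key}" for key in tile_keys]
    return ["gdalbuildvrt", vrt_path, *tile_paths]


def extract_chip_command(vrt_path, bounds, out_path):
    """Return the gdal_translate argv that cuts one chip out of the VRT.

    bounds is (xmin, ymin, xmax, ymax) in the CRS of the mosaic.
    -projwin expects upper-left x, upper-left y, lower-right x, lower-right y.
    """
    xmin, ymin, xmax, ymax = bounds
    return [
        "gdal_translate",
        "-projwin", str(xmin), str(ymax), str(xmax), str(ymin),
        vrt_path, out_path,
    ]


def run(command):
    """Execute a GDAL command; requires GDAL on PATH and AWS credentials."""
    subprocess.run(command, check=True)


# Hypothetical bucket and tile names, printed rather than executed.
vrt_cmd = build_vrt_command("mosaic.vrt", "my-bucket", ["tiles/a.tif", "tiles/b.tif"])
chip_cmd = extract_chip_command("mosaic.vrt", (0, 0, 125, 125), "chip_0_0.tif")
print(" ".join(vrt_cmd))
print(" ".join(chip_cmd))
```

Because the VRT only stores references to the tiles, each gdal_translate call reads just the pixels that fall inside the requested window.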

chip-from-vrt extracts chips from the collection of image tiles comprising the mosaic, by treating it as a single image.

The training chips are the input to train-cnn-chip-classifier which produces a Keras model. For each part of the grid, the corresponding target chips and the model are passed to a separate deploy-chip-classifier task, which produces classified.json. classified.json contains all the chip names, each appended with the model decision and a confidence score.

You can run the entire workflow with gbdxtools in this Jupyter notebook.


We ran a series of experiments to determine the effect of different factors on the precision/recall performance of the model. The performance was evaluated using the reference dataset that we collected previously.
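For readers who want to reproduce this kind of evaluation, here is a minimal sketch of how a precision/recall curve can be computed from reference labels and model confidences. It assumes that each score is the model's confidence that the chip contains buildings; the function name and the toy data are illustrative only.

```python
def precision_recall_curve(labels, scores, thresholds):
    """Compute (threshold, precision, recall) for a binary classifier.

    labels: 1 for 'Buildings', 0 for 'No Buildings' (reference data)
    scores: assumed model confidence that the chip contains buildings
    """
    n_pos = sum(labels)
    curve = []
    for t in thresholds:
        predicted = [s >= t for s in scores]
        tp = sum(1 for p, y in zip(predicted, labels) if p and y == 1)
        fp = sum(1 for p, y in zip(predicted, labels) if p and y == 0)
        # With no detections at all, precision is taken to be 1.0 by convention.
        precision = tp / (tp + fp) if (tp + fp) else 1.0
        recall = tp / n_pos if n_pos else 0.0
        curve.append((t, precision, recall))
    return curve


# Toy data, for illustration only.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.1]
curve = precision_recall_curve(labels, scores, thresholds=[0.5, 0.75])
for threshold, precision, recall in curve:
    print(threshold, precision, recall)
```

Raising the threshold trades recall for precision, which is the trade-off traced out by the curves in the plots that follow.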

For each data point in the plots that follow, we trained the CNN 10 different times and computed the mean and standard deviation of the precision in order to derive a confidence interval. Unless mentioned otherwise, the training set was 5000 labeled chips selected randomly from the reference data set. Each chip was downsampled from 260x260 to 150x150 in order to fit the CNN architecture into memory. For each training cycle, we trained the CNN on balanced classes for 50 epochs, and selected the model that resulted in the lowest validation loss.
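One common way to turn repeated runs into an interval is a normal approximation around the mean, sketched below. The exact interval construction is an assumption on our part for this illustration, and the ten precision values are made up.

```python
import math
import statistics


def precision_interval(precisions, z=1.96):
    """Return (mean, std, (low, high)) for precision values from repeated runs.

    Assumption: a normal-approximation interval, mean +/- z * std / sqrt(n).
    """
    mean = statistics.mean(precisions)
    std = statistics.stdev(precisions)  # sample standard deviation
    half_width = z * std / math.sqrt(len(precisions))
    return mean, std, (mean - half_width, mean + half_width)


# Hypothetical precision values from 10 training runs at a fixed recall.
runs = [0.30, 0.32, 0.28, 0.31, 0.29, 0.30, 0.33, 0.27, 0.30, 0.30]
mean, std, (low, high) = precision_interval(runs)
print(f"mean={mean:.3f} std={std:.3f} interval=({low:.3f}, {high:.3f})")
```

With only 10 runs per data point, a Student's t multiplier would give a slightly wider interval than z = 1.96.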

Training with train-cnn-chip-classifier on 5000 chips took approximately 9 hours on a g2.2xlarge instance. For the deployment phase, we divided the grid into 13 roughly equal parts, each containing about 100000 chips, and ran deploy-chip-classifier on each part in parallel on 13 different g2.2xlarge instances. The deployment phase took about 4 hours, i.e., 0.72 sec/km².
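The throughput figure follows directly from the wall-clock time and the size of the region. The short calculation below uses the rounded numbers quoted above, and assumes that every instance ran for roughly the full 4 hours when estimating total instance time.

```python
AREA_KM2 = 20000       # region of interest, rounded up from "just short of 20000"
WALL_CLOCK_HOURS = 4   # deployment time with all parts running in parallel
N_INSTANCES = 13       # one instance per part of the grid

# Wall-clock seconds spent per square kilometer of mosaic.
seconds_per_km2 = WALL_CLOCK_HOURS * 3600 / AREA_KM2

# Total compute, assuming each instance was busy for the whole deployment.
instance_hours = WALL_CLOCK_HOURS * N_INSTANCES

print(seconds_per_km2)  # 0.72
print(instance_hours)   # 52
```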

Dynamic Range Adjustment (DRA)

Dynamic Range Adjustment (DRA) converts pixel values of an orthorectified, atmospherically compensated, pansharpened image from 16 bits to 8 bits, so that the image is viewable on a computer screen. When constructing a FLAME mosaic, DRA is performed together with BLM to achieve color consistency across the mosaic.

Our goal was to assess the effect of DRA on the model accuracy. Using the 16-bit imagery, we created two pseudo-mosaics by:

  • Clipping the lowest 0.5% and highest 0.05% pixel intensities for each image strip individually, setting the limits to 0 and 255, respectively, and stitching them using the FLAME cutlines; we refer to this mosaic as CLIP.

  • Not performing any DRA at all, i.e., directly stitching the 16-bit images using the cutlines; we refer to this mosaic as ACOMP to emphasize the fact that the imagery was not DRA’d.

Clipping the pixel intensities to create 8-bit imagery is a naive form of DRA. The CLIP mosaic is compared to the actual mosaic below. Not surprisingly, the colors are different across tiles.
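To make the clipping step concrete, here is a minimal sketch of percentile clipping followed by a linear stretch to 8 bits. It operates on a flat list of intensities for one band; the function name, the nearest-rank percentile rule and the synthetic input are assumptions made for this illustration, not a description of our production code.

```python
def clip_to_8bit(pixels, low_pct=0.5, high_pct=0.05):
    """Naive DRA: clip the tails of the histogram and rescale to 0..255.

    pixels: flat list of 16-bit intensities for one band of one image.
    low_pct / high_pct: percentage of darkest / brightest pixels to clip.
    """
    ordered = sorted(pixels)
    n = len(ordered)
    # Nearest-rank percentiles of the intensity distribution.
    lo = ordered[round(low_pct / 100 * (n - 1))]
    hi = ordered[round((1 - high_pct / 100) * (n - 1))]
    if hi == lo:
        return [0] * n  # flat band: nothing to stretch
    scale = 255 / (hi - lo)
    # Values below lo map to 0 and values above hi map to 255.
    return [min(255, max(0, round((p - lo) * scale))) for p in pixels]


band = list(range(2001))  # synthetic intensity ramp, for illustration only
stretched = clip_to_8bit(band)
print(min(stretched), max(stretched))  # 0 255
```

Since the limits are computed separately for each image, two images of the same ground can end up with different 8-bit values, which is the source of the color differences in the CLIP mosaic.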

CLIP vs BLM. CLIP is performed on a per-tile basis. The FLAME cutlines outlining the tile boundaries are shown in green. BLM is the result of adjusting the colors to match an underlying global base layer.

16-bit imagery cannot be displayed on 8-bit monitors, so we can’t really view the ACOMP mosaic unless DRA is applied. Below, we’ve plotted the pixel intensity histograms of a single chip for BLM, CLIP and ACOMP. The histograms follow the same pattern; however, the range of the horizontal axis is much larger in the ACOMP case.

Pixel intensity histograms for a BLM, CLIP and ACOMP chip. The colors of the ACOMP chip are not ‘true’; the chip has been DRA’d in QGIS so that it can be displayed on the monitor.

We trained and deployed the CNN on the BLM, CLIP and ACOMP mosaics using 5000 and 7500 training samples selected randomly from the reference data set. The results are shown below.

The performance on the BLM and CLIP mosaics is close to identical. Using the ACOMP mosaic incurs a performance loss which decreases with training sample size.

Surprisingly, DRA results in a performance enhancement which is smaller for the larger training set. Our tentative interpretation of this result is that the model requires more time and/or more data to learn on the ACOMP mosaic, given that 16-bit imagery contains more information than its DRA’d counterpart. Moreover, there is no notable performance difference between CLIP and BLM. This is an indication that using training data across the entire mosaic enables the model to understand the differences in color across the CLIP tiles.

Training data spatial distribution

In order to assess the effect of the training data spatial distribution on the model accuracy, we restricted training data selection to a small part of the mosaic.


5000 training samples selected from a small part of the mosaic.

The PR curves for BLM and CLIP are shown in the following figure.


The model can generalize better on the BLM mosaic.

The BLM curve is (mostly) to the right of the CLIP curve. The results confirm our intuition that the model can generalize better on the BLM mosaic since the colors are more consistent compared to the CLIP mosaic.

We also compared the model trained on the restricted training set to the model trained on the mosaic-wide training set on the CLIP mosaic.


Lack of mosaic-wide training data leads to decreased accuracy.

The PR curves demonstrate that the performance penalty of collecting training data only from one location is significant, e.g., at a recall of 90%, the precision drops from 35% to about 15%. The implication is that the cost required to validate the building detections with a crowdsourcing campaign more than doubles.
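The "more than doubles" claim can be checked with a few lines of arithmetic. The sketch below assumes that the validation cost is proportional to the number of detections the crowd has to review, and uses the rounded count of about 13000 building chips from the reference data set.

```python
def detections_to_validate(n_positive, recall, precision):
    """Number of chips flagged by the model at a given operating point.

    True positives = recall * n_positive, and precision = TP / detections,
    so detections = recall * n_positive / precision.
    """
    return recall * n_positive / precision


N_BUILDING_CHIPS = 13000  # approximate count from the reference data set

wide = detections_to_validate(N_BUILDING_CHIPS, recall=0.90, precision=0.35)
restricted = detections_to_validate(N_BUILDING_CHIPS, recall=0.90, precision=0.15)

# The ratio depends only on the two precision values: 0.35 / 0.15.
print(round(restricted / wide, 2))  # 2.33
```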


JPEG compression

JPEG compression typically achieves a ratio of about 10:1, and is therefore an attractive solution for storing big image files. Since it is lossy, we wanted to assess its impact on the performance. We trained and deployed the CNN on the CLIP mosaic using both JPEG-compressed and uncompressed tiles. The results are shown below.


JPEG compression does not incur an accuracy penalty.

JPEG compression does not appear to affect the performance. If anything, it leads to tighter confidence intervals than uncompressed imagery. In other words, there is considerably larger variance in accuracy for the models we trained on the uncompressed imagery.


In summary, we found that:

  • DRA of the 16-bit ACOMP’d imagery increases accuracy. This is rather surprising since DRA is an operation which reduces the image information content. Along this line of thought, we suspect that more training time and/or more training data are required to harness the full potential of 16-bit imagery.

  • The spatial distribution of the training data has a crucial impact on accuracy. If training data collection has to be restricted to one area, using a color-matched (BLM) mosaic can help the model generalize to other areas.

  • JPEG compression has no measurable impact on the performance in our experiments. This is good news; compressed imagery takes up much less space than uncompressed imagery and the cost savings could be very significant at a global scale.

These observations were derived for a particular use case in a specific part of the world, and are not meant to be used as guidelines. More investigation for diverse use cases and geographical locations is required to reach meaningful conclusions.

Perhaps more important than the accuracy viewpoint is the fact that the framework presented here can be used to achieve tremendous search area reduction when looking for sparse settlements over large areas. For a model trained on 5000 samples, the precision at 90% recall is about 30%. That sounds quite bad; in reality, what this means is that the model can single out 27300 locations out of 845000 candidates, a 96.8% search area reduction. It is much faster and more economical to run a crowdsourcing campaign to remove the false positives from 27300 detections than to collect labels for 845000 chips.

We deployed our best model on the entire mosaic and kept the detections with confidence higher than 97.5%. You can explore the results, shown in green, below (full page view here). To make this map, we uploaded the geojson with all the detections to Mapbox and used the Mapbox GL JavaScript library to reference the vector tile set.
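Filtering detections by confidence is a one-liner once the classification results are loaded. The sketch below assumes a hypothetical structure in which each chip name maps to a pair of model decision and confidence score; the actual layout of classified.json may differ, and the chip names are made up.

```python
def confident_detections(classified, threshold=0.975, positive_class="Buildings"):
    """Return the names of chips classified as buildings above a confidence threshold.

    classified: hypothetical mapping of chip name -> (decision, confidence).
    """
    return [
        name
        for name, (decision, confidence) in classified.items()
        if decision == positive_class and confidence > threshold
    ]


# Made-up results for three chips, for illustration only.
classified = {
    "chip_0_0": ("Buildings", 0.99),
    "chip_0_1": ("Buildings", 0.80),
    "chip_0_2": ("No Buildings", 0.99),
}
kept = confident_detections(classified)
print(kept)  # ['chip_0_0']
```

A high threshold such as this one favors precision over recall, which suits a map that is meant to be browsed without a validation step.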

Future research will include more experiments with 16-bit imagery and all 8 bands of our WorldView-2 and WorldView-3 sensors to assess the impact of these unique capabilities on deep learning algorithms.

For more information on machine learning research at DigitalGlobe and on GBDX in general, get in touch.

Written on March 3, 2017