Detecting population centers in Nigeria
There are large regions of the planet which, although inhabited, remain unmapped to this day. In the past, DigitalGlobe has launched crowdsourcing campaigns to detect remote population centers in Ethiopia, Sudan and Swaziland in support of NGO vaccination and aid distribution initiatives. Beyond DigitalGlobe, there are other initiatives under way to fill in the gaps in the global map, aiding first responders in their effort to provide relief to vulnerable, yet inaccessible, people.
Crowdsourcing the detection of villages is accurate but slow. Human eyes detect buildings easily, but covering large swaths of land takes them a long time. In the past, we have combined crowdsourcing with deep learning on GBDX to detect and classify objects at scale. The approach is as follows: collect training samples from the crowd, train a neural network to identify the object of interest, then deploy the trained model on large areas. The cherry on the cake is to use the crowd to weed out the errors of the machine, in order to obtain the best of both worlds.
In the context of a recent large-scale population mapping campaign, we were faced with the usual question: find buildings with the crowd, or train a machine to do it? This led to another question: can the convolutional neural network (CNN) that we trained to find swimming pools in Adelaide be trained to detect buildings in Nigeria?
Population centers in Nigeria. Intensity of green corresponds to confidence in the presence of buildings.
The area of interest consists of 9 WorldView-2 and 2 GeoEye-1 image strips collected between January 2015 and May 2016 over northeastern Nigeria, near and along the borders with Niger and Cameroon. We picked 4 WorldView-2 strips, divided them into square chips of side 125 m (about 250 pixels at sensor resolution) and asked our crowd to label them as ‘Buildings’ or ‘No Buildings’. The output of the crowdsourcing campaign is the file train.geojson, which contains the labeled chip geometries (a small sample here).
As shown in the following diagram, train.geojson and the image GeoTIFF files are given as input to train-cnn-classifier, which produces a trained Keras model. The images are orthorectified, atmospherically compensated and pansharpened using our image preprocessor.
Training on a subset of the strips.
With a trained model at hand, we can detect buildings in the remaining 7 images. This involves dividing each image into chips of the same size as those we trained on to create target.geojson (small sample here), then passing target.geojson and the image to deploy-cnn-classifier.
Deploying on the remainder of the strips.
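The chipping step can be sketched as follows. This is a minimal illustration, not the code we ran: the function name `make_chip_grid`, the `feature_id` property and the use of projected coordinates in metres are all assumptions made for the example. A real target.geojson would most likely store longitude/latitude, which would require a reprojection step that is omitted here.

```python
import json
import math


def make_chip_grid(min_x, min_y, max_x, max_y, side=125.0):
    """Tile a bounding box into square chips and return a GeoJSON-like dict.

    Coordinates are assumed to be projected and expressed in metres
    (hypothetical simplification). Partial chips at the edges are dropped.
    """
    n_cols = math.floor((max_x - min_x) / side)
    n_rows = math.floor((max_y - min_y) / side)
    features = []
    for row in range(n_rows):
        for col in range(n_cols):
            x0 = min_x + col * side
            y0 = min_y + row * side
            x1, y1 = x0 + side, y0 + side
            # A polygon ring must be closed: the first and last vertex coincide.
            ring = [[x0, y0], [x1, y0], [x1, y1], [x0, y1], [x0, y0]]
            features.append({
                "type": "Feature",
                "properties": {"feature_id": len(features)},  # hypothetical key
                "geometry": {"type": "Polygon", "coordinates": [ring]},
            })
    return {"type": "FeatureCollection", "features": features}


# A 500 m x 250 m box yields a 4 x 2 grid of 125 m chips.
grid = make_chip_grid(0.0, 0.0, 500.0, 250.0)
grid_json = json.dumps(grid)
```

Dropping partial chips at the strip edges is a design choice of this sketch; it keeps every chip the same size as the training chips.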
The output of deploy-cnn-classifier is classified.geojson (small sample here), which contains all the chips in target.geojson classified as ‘Buildings’ or ‘No Buildings’ and a confidence score on each classification.
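To give a feel for how such a file might be consumed, here is a small sketch that keeps only the chips classified as ‘Buildings’ above a confidence threshold. The property names `CNN_class` and `certainty` are assumptions made for this example, as is the threshold of 0.9; check the keys in your own classified.geojson before relying on them.

```python
def confident_buildings(collection, threshold=0.9,
                        class_key="CNN_class", score_key="certainty"):
    """Return features labeled 'Buildings' with confidence >= threshold.

    class_key and score_key are assumed property names, not confirmed ones.
    """
    return [
        feature for feature in collection["features"]
        if feature["properties"].get(class_key) == "Buildings"
        and feature["properties"].get(score_key, 0.0) >= threshold
    ]


# Tiny in-memory stand-in for classified.geojson (made-up values).
classified = {
    "type": "FeatureCollection",
    "features": [
        {"type": "Feature", "geometry": None,
         "properties": {"CNN_class": "Buildings", "certainty": 0.97}},
        {"type": "Feature", "geometry": None,
         "properties": {"CNN_class": "Buildings", "certainty": 0.55}},
        {"type": "Feature", "geometry": None,
         "properties": {"CNN_class": "No Buildings", "certainty": 0.99}},
    ],
}

hits = confident_buildings(classified)
```

For a file on disk, the same function can be applied to the result of `json.load`.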
A few words about training the model. We used a training set of 5000 chips, 2500 from each class, which we randomly sampled from the 4 training strips. We down-sampled each chip from 245x245 to 150x150 pixels so that the network would fit into memory. After trial and error, we found that the optimum batch size was 32 chips; a smaller size caused validation loss to bounce around during training, while a larger size resulted in memory issues. Finally, we settled on a learning rate of 0.001 because it was the highest rate that still converged to the minimum loss.
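The balanced sampling described above can be sketched in a few lines of plain Python. The function `balanced_sample` and the `TRAINING_CONFIG` dictionary are hypothetical names introduced for illustration; the values in the dictionary are the ones quoted in this post, but the dictionary itself is not an input format of train-cnn-classifier.

```python
import random

# Hyperparameters quoted in the text, gathered in one place (hypothetical layout).
TRAINING_CONFIG = {
    "chips_per_class": 2500,
    "input_size": (150, 150),   # chips down-sampled from 245x245
    "batch_size": 32,
    "learning_rate": 0.001,
}


def balanced_sample(labeled_chips, per_class, seed=0):
    """Draw the same number of chips from every class, without replacement.

    labeled_chips is an iterable of (chip_id, label) pairs.
    """
    rng = random.Random(seed)
    by_class = {}
    for chip_id, label in labeled_chips:
        by_class.setdefault(label, []).append(chip_id)

    sample = []
    for label, chip_ids in sorted(by_class.items()):
        if len(chip_ids) < per_class:
            raise ValueError(f"class {label!r} has only {len(chip_ids)} chips")
        sample.extend((cid, label) for cid in rng.sample(chip_ids, per_class))

    rng.shuffle(sample)  # avoid feeding the network one class at a time
    return sample


# Toy pool: 10 'Buildings' chips and 20 'No Buildings' chips.
pool = [(i, "Buildings" if i % 3 == 0 else "No Buildings") for i in range(30)]
train_set = balanced_sample(pool, per_class=5)
```

With the real pool of labeled chips, `per_class=TRAINING_CONFIG["chips_per_class"]` would give the 5000-chip set.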
See here for the complete workflow in Python. Note that the deployment tasks are all launched in parallel in a single for loop.
We used Mapbox vector tiles to create a heat map of the model confidence in the presence of buildings across each strip that the model was deployed on. You can examine a subset of the results on this map, where we have overlaid the building heat map on one of the strips near Diffa.
The CNN is clearly more confident in the presence of buildings when more of them are present in the chip! As the building density decreases, the confidence decreases as well. Here are some screenshots that demonstrate this point.
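One simple way to turn a confidence score into a heat map color is to keep the hue fixed and scale the opacity. The sketch below is only an illustration of that idea; the function name, the particular shade of green and the maximum opacity of 0.8 are assumptions, not the styling used on our map.

```python
def confidence_to_rgba(confidence, max_alpha=0.8):
    """Map a confidence in [0, 1] to a green RGBA tuple.

    Higher confidence gives a more opaque (more intense) green.
    Values outside [0, 1] are clamped.
    """
    clamped = min(max(confidence, 0.0), 1.0)
    return (0, 128, 0, round(clamped * max_alpha, 3))


low = confidence_to_rgba(0.5)
high = confidence_to_rgba(1.0)
```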
Sample classifications. CNN confidence increases with building density.
What is the CNN actually learning? Below are examples of hidden layer outputs produced during classification of a chip that contains buildings. Note that as the chip is processed by successive layers, the locations of buildings become more and more illuminated, leading to a high confidence decision that the chip contains buildings.
Various hidden layers for a chip containing buildings.
In contrast, a chip which does not contain buildings becomes progressively darker as it travels through the layers. This is because the CNN is not picking up on any of the learned abstract qualities of buildings.
Various hidden layers for a chip containing no buildings.
It is fascinating that the same CNN architecture can be used successfully on WorldView-3 imagery to detect swimming pools in a suburban environment in Australia, and on WorldView-2 and GeoEye-1 imagery to detect buildings in the Nigerian desert.
It takes about 10 hours to train the model for 75 epochs on 4 strips, on a g2.2xlarge AWS instance. The trained model can classify approximately 200,000 chips, i.e., a little over 3000 km², per hour on the same instance type. For the purpose of this demo, we deployed the trained model on 7 strips. We ran another experiment with 72 strips, which amounts to roughly 3 million chips and a total area of around 40,000 km². It took a little less than 2 hours to complete this experiment. Due to the inherent parallelization offered by GBDX, this time is dictated by the size of the largest strip. Going from 7 to 72 strips is simply a matter of adding more catalog ids to the for loop in our Python script.
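The arithmetic behind these figures is easy to check. The sketch below computes the area covered per hour from the chip size and throughput quoted above, and contrasts sequential with parallel deployment time. The per-strip chip counts are made up for the example, and the model assumes one instance per strip with the same throughput on each.

```python
CHIP_SIDE_KM = 0.125       # 125 m chips
CHIPS_PER_HOUR = 200_000   # approximate throughput on one instance


def area_km2(n_chips, side_km=CHIP_SIDE_KM):
    """Total area covered by n_chips square chips."""
    return n_chips * side_km ** 2


def sequential_hours(chips_per_strip, rate=CHIPS_PER_HOUR):
    """Time to classify all strips one after another on a single instance."""
    return sum(chips_per_strip) / rate


def parallel_hours(chips_per_strip, rate=CHIPS_PER_HOUR):
    """Time when each strip runs on its own instance: the largest strip dominates."""
    return max(chips_per_strip) / rate


area_per_hour = area_km2(CHIPS_PER_HOUR)   # 3125.0 km2, "a little over 3000"

# Hypothetical strip sizes, in chips.
strips = [300_000, 150_000, 50_000]
t_seq = sequential_hours(strips)   # 2.5 hours
t_par = parallel_hours(strips)     # 1.5 hours
```

By the same reasoning, roughly 3 million chips would take about 15 hours on a single instance, which is why running the strips in parallel matters.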