Automated training data generation for object detection with CNNs

One of the main drivers of satellite imagery analytics is the rapid and accurate detection of objects of interest over broad areas. In recent years, convolutional neural networks (CNNs) have been applied to object detection on multimedia imagery, achieving remarkable success on benchmark data sets. The main idea behind the best performing architectures is to derive region proposals, i.e., bounding boxes for the objects present in the image, and then infer the object class for each bounding box.

Region proposals can be obtained by intelligently grouping pixels using traditional image segmentation methods like Selective Search [1], or computed by the CNN itself [2,3], which is trained both for the region proposal and the object classification task. The authors of [4] go beyond object detection to object instance segmentation, where the CNN also extracts an outline of the detected object within each bounding box, with remarkable results on COCO. Very cool.

Are we done then?

CNNs are well suited to the problem of object detection on satellite imagery due to (a) the diversity of objects of interest such as cars, boats and buildings which come in all kinds of shapes, colors and sizes, and (b) the variability introduced by the sensors, due to the off-nadir angle, the sun elevation, and atmospheric conditions such as clouds and haze at the time of capture. Properly trained, CNNs can learn the defining characteristics of the objects of interest and recognize them in a variety of settings. In the past, we successfully trained VGG-16 to find swimming pools in Australia and remote settlements in Nigeria.

tank-diversity.png Snapshot of circular tanks in an oil field. Note the differences in size, color, shadows and wear and tear.

However, object detection on satellite imagery with CNNs also involves unique challenges. The first is the procurement of adequate training data, often thousands or tens of thousands of examples. Identifying and outlining small, and oftentimes sparse, objects on satellite images is time consuming, expensive and prone to error. Some reference datasets are available, e.g., here and here, but the abundance of high quality training data is nowhere near that for multimedia imagery. (Incidentally, this points to transfer learning as a key ingredient in training neural networks on satellite imagery.)

The second challenge is computational complexity due to the large search areas typically associated with satellite imagery analytics. The size of multimedia images is usually less than 1 megapixel, while the size of a WorldView-3 large area collection is tens of gigapixels. A brute force approach is to divide the search area into small frames that can be processed by the detectors in [1-4], in which case the computational cost is roughly inversely proportional to the frame size. It is unclear at the moment whether this can work at a country or continental scale, the scale at which we typically seek answers from satellite imagery on GBDX.

Geometry, geometry, geometry

As argued in this very interesting blog post, semantics are defined by humans and they provide a means to teach artificial intelligence to understand the world. However, they are not grounded in geometry.


A neural network can be trained to classify the left image as ‘Oil tank’ and the right image as ‘Boat’. Semantics do not inherently capture geometrical properties.

In a previous blog post, we experimented with reducing the search area to a set of candidate objects which satisfy the geometrical properties of the object of interest, e.g., when detecting boats, the candidates should only consist of objects within the size and elongation ranges of boats. Once the candidates are established, a CNN can discriminate between boats and other objects of similar geometrical properties like waves, wave breakers or marinas. Beyond reducing the search area, the framework also facilitates the generation of training data by automating part of the process. All the annotator needs to do is flip through the candidates and provide a label (‘boat’ or ‘other’), as opposed to searching for boats on satellite images and drawing bounding boxes around them.

In this blog post, we explore this theme further. The question we ask is: Can the object's geometrical properties be used to automate the generation of positive and negative examples, which can then be used to train a CNN to distinguish the object from other, geometrically similar objects among the candidates?


Provided that the geometry of the object of interest can be described by a set of attribute constraints, referred to as the search space, we define three sets of connected pixel components: (a) in the search space; call these candidates, as they include the object of interest, as well as other objects with similar geometry; (b) well within the search space; call these positives, as they most likely only include the object of interest; (c) outside the search space; call these negatives, as they most likely don't include the object of interest.

Candidates, positives and negatives in a two-dimensional attribute space.
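This three-way partition of the attribute space can be sketched in a few lines of Python. The function below is illustrative: the thresholds match the values quoted elsewhere in this post (area 100-10000 m2, a strict compactness of 0.99 for positives, 0.5-0.7 for negatives, and a minimum of 0.65 for candidates), but the actual extraction operates on max-tree/min-tree components rather than plain dictionaries.

```python
# Sketch: partition connected components into candidates, positives and
# negatives using attribute constraints. Thresholds are the ones quoted in
# the post; the component representation is a stand-in for illustration.

def partition(components, cand_min=0.65, strict=0.99, neg_lo=0.5, neg_hi=0.7):
    """Each component is a dict with 'area' (m2) and 'compactness' attributes."""
    candidates, positives, negatives = [], [], []
    for c in components:
        if not (100 <= c["area"] <= 10000):
            continue  # outside the area range altogether
        if c["compactness"] >= cand_min:
            candidates.append(c)   # in the search space
        if c["compactness"] >= strict:
            positives.append(c)    # well within the search space
        elif neg_lo <= c["compactness"] <= neg_hi:
            negatives.append(c)    # outside the search space

    return candidates, positives, negatives

comps = [{"area": 500, "compactness": 0.995},   # near-perfect disk
         {"area": 500, "compactness": 0.6},     # moderately compact blob
         {"area": 500, "compactness": 0.8},     # candidate, neither pos nor neg
         {"area": 50,  "compactness": 0.99}]    # too small
cand, pos, neg = partition(comps)
print(len(cand), len(pos), len(neg))  # 2 1 1
```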

The main idea is to train a CNN-based classifier using the positives and negatives, and then deploy the classifier on the candidates. We posit that the CNN can learn defining features from the positive and negative examples, and, with this knowledge, can successfully discriminate the object of interest from other objects in the candidates.

approach.png Schematic of the proposed approach. Automatically generated positive and negative examples from different scenes are used to train the CNN. The model is used to classify the candidates on a new scene.

Use case: circular tank detection

To test the idea, we focus on circular tank detection, a problem we have tackled in the past from a purely geometrical standpoint.

We identified twenty WorldView-2 and WorldView-3 collections which contained circular tanks, in the United States, Central and South America, Europe, the Middle East and Asia. For each collection, we generated the orthorectified, atmospherically compensated, panchromatic and pan-sharpened RGB images in UTM projection.

Generation of positives and negatives

The procedure starts with computing the max-tree and the min-tree for the panchromatic image [5], which are structures that organize the image in connected components ordered with respect to intensity. In the case of the max-tree, bright components are nested within larger, darker components, while in the min-tree, the nesting order is inverse, i.e., dark components are nested within larger, brighter ones. Intuitively, the max-tree and min-tree are suited to selecting connected components which are brighter and darker than their surroundings, respectively.
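Building the full max-tree is beyond the scope of a snippet, but the nesting property it encodes can be illustrated with a simple threshold decomposition: a connected component at a higher intensity threshold is always contained in exactly one component at a lower threshold. A toy sketch using scipy for connected-component labeling (this is not the algorithm of [5]):

```python
import numpy as np
from scipy import ndimage

# A toy panchromatic image: a bright peak sitting on a darker plateau.
img = np.array([
    [0, 0, 0, 0, 0],
    [0, 5, 5, 5, 0],
    [0, 5, 9, 5, 0],
    [0, 5, 5, 5, 0],
    [0, 0, 0, 0, 0],
])

# Threshold decomposition: connected components of {img >= t} for increasing t.
low, _ = ndimage.label(img >= 5)   # the plateau
high, _ = ndimage.label(img >= 9)  # the bright peak

# In the max-tree, the high-level component (the 9-pixel) is nested inside
# the low-level component (the 5-plateau): every pixel of the former lies
# within the latter.
nested = np.all(low[high > 0] > 0)
print(nested)  # True: bright components nest within larger, darker ones
```

The min-tree is obtained the same way on the inverted image, which is why it selects components darker than their surroundings.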

The attribute space consists of area and compactness. The area range is set to 100-10000 m2 in order to include tanks of all sizes. Compactness is the ratio of the area to the perimeter squared, normalized (by a factor of 4π) to unity when the shape is a disk. Intuitively, the closer the compactness of a connected component is to unity, the more disk-like it is.
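With the usual normalization constant of 4π, the definition works out as follows; the disk and square values below follow directly from the area and perimeter formulas:

```python
import math

def compactness(area, perimeter):
    # 4*pi*A / P**2: equals 1 for a perfect disk and is strictly smaller for
    # any other shape (isoperimetric inequality).
    return 4 * math.pi * area / perimeter ** 2

# A disk of radius r: A = pi*r**2, P = 2*pi*r  ->  compactness 1.
r = 10.0
disk = compactness(math.pi * r ** 2, 2 * math.pi * r)

# A square of side s: A = s**2, P = 4*s  ->  pi/4, about 0.785.
s = 10.0
square = compactness(s ** 2, 4 * s)
print(round(disk, 3), round(square, 3))  # 1.0 0.785
```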

For the positives, we selected components with compactness greater than 0.99. This is a very strict threshold, which results in only close-to-perfect disks being included in the positives. For each component, the axis-aligned bounding box was computed and buffered for additional context, and then used to extract the corresponding image chip from the pan-sharpened image. In the manner described here, we obtained 4322 positives, out of which we randomly selected 1250. A small sample is shown below; note that the bottom row consists of false positives such as roundabouts and domes which also satisfy the minimum compactness threshold.

positives.png A sample of automatically generated positives. The bottom row consists of false positives like roundabouts and domes.
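The bounding-box buffering and chipping step can be sketched with plain numpy; the buffer size below is illustrative, not the value we actually used:

```python
import numpy as np

def extract_chip(image, component_mask, buffer_px=8):
    """Compute the axis-aligned bounding box of a component, buffer it for
    additional context, and cut the corresponding chip out of the image.
    buffer_px is an illustrative value."""
    rows, cols = np.nonzero(component_mask)
    r0 = max(rows.min() - buffer_px, 0)
    r1 = min(rows.max() + 1 + buffer_px, image.shape[0])
    c0 = max(cols.min() - buffer_px, 0)
    c1 = min(cols.max() + 1 + buffer_px, image.shape[1])
    return image[r0:r1, c0:c1]

# Example: a 5x5 component in a 100x100 RGB image.
image = np.zeros((100, 100, 3), dtype=np.uint8)
mask = np.zeros((100, 100), dtype=bool)
mask[40:45, 50:55] = True
chip = extract_chip(image, mask, buffer_px=8)
print(chip.shape)  # (21, 21, 3)
```

Note that the buffer is clamped to the image extents, so components near the edge yield smaller chips.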

For the negatives, we selected components with compactness between 0.5 and 0.7, and followed the same procedure as for the positives to get the corresponding chips. Why pick this range? We want to show the classifier a sufficient number of examples that look like tanks but are not. Should lower compactness values be used, the negatives would then mostly consist of very elongated features. The classifier would learn that only very elongated objects are not tanks, and would not be able to distinguish tanks from other relatively compact objects in the candidates. A small sample of automatically generated negatives is shown below.

negatives.png A sample of automatically generated negatives. The bottom row consists of false negatives.

Note that the bottom row consists of false negatives. Tanks might end up in the negatives due to impurities or shadows which reduce the compactness, or simply because they are nested in a larger component which falls within the compactness range. Since false negatives constitute a minority in the negatives, a simple way to reduce their number is to train the CNN only on a subset of the negatives. Out of the 586000 negatives that were automatically extracted, we randomly selected 3750, i.e., three times the size of the positive examples, for a training set of total size 5000.

CNN training

In order to assess the impact of the noise present in the automatically generated training set on the classifier accuracy, we manually corrected the false negatives and false positives to obtain a curated training set. We then trained VGG-16 separately on the raw and on the curated set, obtaining two models, henceforth referred to as model 1 and model 2, respectively.

Some technical details. All the image chips were resized to 150x150. VGG-16 was initialized with ImageNet weights and the soft-max layer was replaced, as ImageNet includes 1000 classes while in our case there are only two, ‘tank’ and ‘other’. We only trained the final convolutional block, the fully connected layers and the soft-max layer, using a learning rate of 0.0001 and an L2 regularization factor of 0.01. The training time for each model was approximately one hour.

Putting everything together

circular-tank-detector is a prototype GBDX task which takes as input a pan-sharpened RGB image in UTM projection and produces a geojson file with the coordinates of the bounding boxes of the detected tanks. The steps executed in the task are candidate extraction from the panchromatic image (which is computed from the input pan-sharpened image) using a specified minimum compactness and area range, chipping of the candidate bounding boxes, and, finally, classification of each chip using the provided input model. If a model is not provided as input, then model 2, which is built into the task, is used by default. The task also provides the options to specify the model decision threshold and to perform prediction-time augmentation. Finally, it uses the GBDX nvidiap2 domain which deploys AWS p2.xlarge instances on the backend.
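The geojson output can be produced with the standard library alone. A hypothetical sketch of the final step, turning detection boxes into a FeatureCollection (the property names and coordinates below are illustrative, not the task's actual schema):

```python
import json

def detections_to_geojson(boxes):
    """boxes: list of (min_lon, min_lat, max_lon, max_lat, score) tuples in
    geographic coordinates. Property names are illustrative."""
    features = []
    for min_lon, min_lat, max_lon, max_lat, score in boxes:
        # A closed polygon ring tracing the bounding box counter-clockwise.
        ring = [[min_lon, min_lat], [max_lon, min_lat],
                [max_lon, max_lat], [min_lon, max_lat],
                [min_lon, min_lat]]
        features.append({
            "type": "Feature",
            "geometry": {"type": "Polygon", "coordinates": [ring]},
            "properties": {"class": "tank", "score": score},
        })
    return json.dumps({"type": "FeatureCollection", "features": features})

print(detections_to_geojson([(56.33, 25.12, 56.34, 25.13, 0.97)]))
```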


In order to evaluate the speed of circular-tank-detector, we ran it on six different locations using the default settings. On average, circular-tank-detector takes 12 sec/km2. Speed depends on a number of factors, including the image size, the number of tanks in the scene, and the specified compactness and size ranges. In particular, computing the trees becomes slow as the image size increases, in which case it is preferable to tile the input image and feed each tile to a separate task.
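The tiling itself needs nothing more than arithmetic over the pixel extents. A hypothetical helper (the 1024-pixel tile size is illustrative):

```python
def tile_extents(height, width, tile=1024):
    """Split a large raster into tiles of at most tile x tile pixels, so that
    each tile can be fed to a separate task. Returns (r0, r1, c0, c1) tuples.
    Illustrative helper; tile size is an assumption."""
    tiles = []
    for r in range(0, height, tile):
        for c in range(0, width, tile):
            tiles.append((r, min(r + tile, height), c, min(c + tile, width)))
    return tiles

# A 3000 x 2500 image with 1024-pixel tiles -> 3 x 3 = 9 tiles.
print(len(tile_extents(3000, 2500)))  # 9
```

In practice the tiles would also need a small overlap so that objects straddling a tile boundary are not missed; that bookkeeping is omitted here.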

In order to evaluate the precision and recall, we selected two images, one of Fujairah, United Arab Emirates, and one of Gary, Indiana, and created a ground truth data set by manually drawing a tight bounding box around each circular tank within an area range of 50-12000 m2. Note that training data was not extracted from these locations. We ran circular-tank-detector with the selected area range, a minimum compactness of 0.65 (the default), and prediction-time augmentation, using both models 1 and 2.

What has each model detected? Explore the maps to find out.

Model 1 and 2 detections in Fujairah, United Arab Emirates and Gary, Indiana. Click on each tab to toggle the corresponding layer. Full page views of these maps can be found here and here.

Model 1 detects mostly round and oval objects which stand out with respect to their surroundings. These include tanks, but also patches of vegetation and dirt, trees and ponds. The false detections are a result of the presence of such objects in the positive examples (as shown in the figure above). In comparison, model 2 has learned the defining characteristics of a circular tank much better and has fewer false detections. Click on the candidates to convince yourself that both models successfully eliminate a large number of irrelevant candidates.

Certain obvious (to humans!) tanks are not detected by either model. In most of these cases, the compactness of the tank is too low for it to make it into the candidate set. There are a number of reasons for this, including impurities, the presence of ladders or pipes, or color gradients which cause the tank to blend into the background. Lowering the compactness threshold can lead to higher recall at the expense of precision and speed, as more candidates are presented to the CNN for classification. Experimenting with this value could result in a better precision/recall tradeoff. (Try it!)

To get accuracy metrics, we calculated the intersection over union (IoU) of each detection box with each box in the reference data set and vice versa. A detection box was counted as a true positive if there was at least one reference box for which the IoU was greater than 0.5, and a reference box was counted as a detection if there was at least one detection box for which the IoU was greater than 0.5. Using this criterion, the precision, recall and F1 score of models 1 and 2, for the threshold values which maximized the F1 score in each case, were calculated. The results are listed in the following table.

method     precision   recall   F1 score
model 1    0.734       0.728    0.731
model 2    0.921       0.722    0.809
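The matching criterion described above can be sketched in a few lines of Python; this is a generic implementation for illustration, not the evaluation code we actually ran:

```python
def iou(a, b):
    """Intersection over union of two axis-aligned boxes (xmin, ymin, xmax, ymax)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def match(detections, references, thresh=0.5):
    """Count true-positive detections (IoU > thresh against some reference box)
    and detected references (IoU > thresh against some detection box)."""
    tp = sum(any(iou(d, r) > thresh for r in references) for d in detections)
    found = sum(any(iou(d, r) > thresh for d in detections) for r in references)
    return tp, found

# Two 10x10 boxes with 50% horizontal overlap: IoU = 50 / (100 + 100 - 50) = 1/3.
print(round(iou((0, 0, 10, 10), (5, 0, 15, 10)), 3))  # 0.333
```

Precision and recall then follow as tp / len(detections) and found / len(references).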

You might wonder whether just using the candidates as the detection set can achieve a better F1 score. We tried different compactness thresholds and the best F1 score we obtained was 0.669 (for a compactness threshold 0.99), which is lower than the F1 score of model 1. In other words, a CNN trained on geometrically generated data outperforms a purely geometrical approach.

Where to next

We proposed an original framework for large-scale, unsupervised object detection on satellite imagery which rests on the premise that neural network training and deployment for object detection should be guided by the object geometry. We defined positives, negatives and candidates based on geometrical attributes, and showed that a CNN trained on the positives and negatives can learn to discriminate the object of interest from other objects within the candidates. Our motivation is scalability, achieved by (semi-) automating training data generation, thus cutting training times, and reducing the search area, thus cutting deployment times/costs.

There are some additional nuances in the extraction of the training data which we have glossed over for the sake of simplicity. You can find the full details in our paper in bigdatafromspace2017.

The door is open for future work. How can the framework be applied to other objects? What other types of trees and CNN architectures can be combined? Or should we just ‘brute-force it’ by directly applying cutting-edge object detection architectures such as [2-4] on satellite imagery?

We’ll be discussing our findings in upcoming blogs. This is truly an exciting time to be in the field of satellite imagery analytics.


[1] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 580-587.

[2] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. You only look once: unified, real-time object detection. 2015.

[3] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: towards real-time object detection with region proposal networks. 2016.

[4] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask R-CNN. 2017.

[5] P. Salembier, A. Oliveras, and L. Garrido. Antiextensive connected operators for image and sequence processing. IEEE Transactions on Image Processing, Vol. 7, No. 4, April 1998.

Written on October 27, 2017