Boat detection on satellite imagery is important for a number of reasons, including the identification of illegal fishing activities, wide-area surveillance of exclusive economic zones and maritime traffic monitoring. Automatic Identification System (AIS) transmissions are mandatory for vessels of a given size and type; reliable boat detection can complement AIS data and help authorities detect suspicious or illegal activity by identifying silent boats.
Due to its capacity to image day and night and through clouds, Synthetic Aperture Radar (SAR) imagery is a very attractive option for remote sensing ship detection; a variety of techniques have been developed in the past [1]. However, as discussed in [2], boat detection on SAR imagery can be challenging due to speckle effects, or when boats are small or made of materials such as wood and fiberglass. Due to these shortcomings, boat detection on optical satellite imagery provides a nice complement to existing SAR techniques.
In the spirit of [2] and this cool blog post by CosmiQ Works, we came up with our own prototype solution for detecting boats at sea on very high resolution optical satellite imagery. The proposed approach leverages DigitalGlobe multispectral and pan-sharpened imagery, and a combination of traditional image analysis, connected component filtering and deep learning. The algorithm is available as a task to GBDX subscribers.
Boat detections off the coast of Vancouver (yellow) and AIS transmissions (green).
The whole point of satellite imagery analytics is to quickly and accurately discover features and changes of interest over broad areas. At 30cm/pixel resolution, objects such as cars, airplanes and buildings are represented by tens or hundreds of pixels, while the size of an image or a mosaic of images covering an area of a few thousand square kilometers is well into the gigapixel range.
A straightforward approach to object detection is to slide a window across the entire area of interest (AOI), and, for each position, detect the object of interest within the window. The computational cost of this approach depends on the window size, the step size and the detection method.
One option is to make the window size comparable to the object size, and then come up with a classifier that determines, for each position of the window, whether it contains the object or not (a nice graphic can be found here); an accurate classification is an accurate detection, provided that the window is tight enough. In order for this to work, the step size must be a fraction of the window side, otherwise the chance of missing the object increases. This simple fact can make the number of windows very large (roughly inversely proportional to the square of the step size), in which case this solution becomes computationally costly. In addition, if the object class includes objects of variable size, e.g., boats come in many different sizes, windows of multiple sizes must be used, increasing computational requirements even further. Finally, the window overlap may result in multiple valid detections per object, in which case a method like non-maximum suppression [3] is required to obtain a single detection. This has its problems in cluttered environments, as valid detections of distinct nearby objects might be suppressed (see a nice illustration of this problem here).
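The quadratic growth of the window count as the step shrinks is easy to check directly; the AOI, window and step sizes below are arbitrary illustrative values:

```python
def num_windows(aoi_w, aoi_h, win, step):
    """Number of sliding-window positions for which a win x win
    window fits entirely inside an aoi_w x aoi_h area."""
    nx = (aoi_w - win) // step + 1
    ny = (aoi_h - win) // step + 1
    return nx * ny

# halving the step size roughly quadruples the number of windows
print(num_windows(1000, 1000, 100, 50))  # 361
print(num_windows(1000, 1000, 100, 25))  # 1369
```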
An alternative option is to make the window larger, e.g., 1 megapixel in size, which is larger than the objects we are looking for but still much smaller than the total AOI. In this case, multiple objects within the window will likely need to be detected. This can be done with convolutional neural network (CNN) based solutions targeted to object detection such as R-CNN [4] (and its variants Fast and Faster R-CNN) and YOLO [5] (and its variants such as DetectNet), or with fully convolutional neural networks (FCN) [6], which produce a semantic segmentation, i.e., a pixel-wise classification, within the window.
The main idea behind R-CNN is to use selective search [7] to come up with region proposals, and then use a CNN to assign a label to each region. Selective search works by grouping pixels similar in texture, color and intensity into components using graph-based segmentation [8], and then employing hierarchical grouping in order to come up with the region proposals. The approach requires as input a scale parameter k [8], which determines the observation scale and is usually comparable to the image dimensions. Depending on the value of k, selective search can produce hundreds of candidates; see the example below for a megapixel-sized, pan-sharpened WorldView-2 RGB chip. If the AOI comprises thousands of windows, the computational complexity increases substantially, just as in the small sliding window approach.
Region proposals with selective search on a pan-sharpened WorldView-2 RGB chip with dimensions 979x982, i.e., roughly 500m x 500m at 50cm/pixel resolution. There are 143 candidates with size greater than 1000 pixels, shown in red, generated by setting k=500. Using this implementation, and the default selective search parameter settings therein, it took 44 sec on a MacBook Air to generate these results. This means that it would take about 12 compute hours to generate candidates for an AOI consisting of 1000 chips.
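For intuition, the graph-based segmentation step [8] that underlies selective search is available in scikit-image; a minimal sketch is below (the random image is a stand-in for a real chip, the parameter values are arbitrary, and this is only the oversegmentation step, not the full hierarchical grouping):

```python
import numpy as np
from skimage.segmentation import felzenszwalb

# synthetic stand-in for an RGB chip; a real run would load imagery
img = np.random.rand(128, 128, 3)

# `scale` plays the role of the observation-scale parameter k
segments = felzenszwalb(img, scale=500, sigma=0.9, min_size=50)
print("number of components:", segments.max() + 1)
```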
At the time of writing, the applicability of architectures such as YOLO and FCN for object detection on satellite imagery is under investigation. One concern is the memory requirement as the window size increases. Another is obtaining the data necessary to train these models. FCN-like architectures require accurately segmented images as training input; this type of training data is laborious and expensive to obtain, and naturally prone to error.
The very real issues of computational complexity and high quality training data generation associated with using deep learning for object detection on satellite imagery motivate the proposed approach.
Due to the magnitude of the search area, object detection on satellite imagery should be based on search area reduction. If the objective is to detect boats at sea, the first step is to focus on the sea and ignore all land mass. Moreover, within the sea body, it makes sense to look for objects which observe the geometrical constraints characteristic of boats such as size and length-to-width ratio. Once the reduced search space is specified in the form of a set of candidate objects, a classifier such as a CNN can be trained to select boats from other objects of similar geometry.
Read on for more detail.
A straightforward way to generate a water mask with multi-spectral imagery is to compute the normalized difference water index between the first band and the farthest NIR band for every pixel, set the intensity of any pixel with index greater (smaller) than zero to white (black), and then filter out white bodies smaller than a given minimum size. The result of these operations is shown below for a WorldView-2 image of the port of Vancouver.
A simple water mask can be derived by thresholding the normalized difference water index, and then removing white bodies smaller than 0.25 km2 (Vancouver, BC).
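The thresholding and size filtering steps can be sketched as follows; the band ordering (band 0 first, band -1 the farthest NIR) and the pixel-count threshold are assumptions, and the size filter here works in pixels rather than km2:

```python
import numpy as np
from scipy import ndimage

def water_mask(ms, min_pixels=1000):
    """ms: (bands, H, W) array; band 0 is the first band and band -1
    the farthest NIR band (band ordering is an assumption)."""
    b1, nir = ms[0].astype(float), ms[-1].astype(float)
    ndwi = (b1 - nir) / (b1 + nir + 1e-9)  # normalized difference water index
    mask = ndwi > 0
    # remove white (water-labeled) bodies smaller than min_pixels
    labels, n = ndimage.label(mask)
    for i in range(1, n + 1):
        component = labels == i
        if component.sum() < min_pixels:
            mask[component] = False
    return mask
```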
Deriving accurate water masks in a variety of conditions can be surprisingly problematic. Choppy water, building shadows and water contamination are all potential sources of inaccuracy. For this version of our boat detector, we turn to the global OSM coastline dataset in order to demarcate the sea.
Sea mask derived using the OSM coastline dataset (Algeciras, Spain).
Pixel dissimilarity map
Boats are made of materials which have different spectral properties than the water that surrounds them. We can take advantage of this simple fact in order to obtain the candidate outlines. One way to do this is to compute the maximum intensity difference across all bands between every pixel and its immediate neighbors, and encode this value as an intensity. This produces a grayscale image which we call the dissimilarity map, where the darker a pixel is, the more different the corresponding material is with respect to its neighboring materials. Note that this calculation is relatively fast on the multi-spectral image, where the resolution is lower (about 2m/pixel) than the corresponding pan-sharpened image.
Pan-sharpened WorldView-2 image on the left, and pixel dissimilarity map, evaluated on the multi-spectral image, on the right.
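The neighbor comparison above can be sketched in a few lines of numpy; the inversion at the end matches the convention in the figures (darker = more dissimilar), and border wrap-around is ignored for brevity:

```python
import numpy as np

def dissimilarity_map(ms):
    """Maximum intensity difference, across all bands, between each
    pixel and its 8 immediate neighbors, inverted so that more
    dissimilar pixels are darker. ms: (bands, H, W) array.
    (np.roll wraps at the borders; ignored here for brevity.)"""
    ms = ms.astype(float)
    diff = np.zeros(ms.shape[1:])
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            if dy == 0 and dx == 0:
                continue
            shifted = np.roll(np.roll(ms, dy, axis=1), dx, axis=2)
            diff = np.maximum(diff, np.abs(ms - shifted).max(axis=0))
    return diff.max() - diff
```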
Since we are looking for boats at sea, we mask the dissimilarity map using the sea mask. Before applying the mask, we erode it using a disk element in order to ensure that the water close to the coast line is not part of the search area. You can think of this as a buffer which can be adjusted via the disk radius.
Sea mask on the left, eroded sea mask (with disk radius of 100m) in the middle, and masked pixel dissimilarity map on the right.
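The coastline buffer amounts to a morphological erosion with a disk structuring element; a sketch is below, with the radius given in pixels (converting the 100m buffer to pixels depends on the image resolution):

```python
import numpy as np
from scipy import ndimage

def erode_sea_mask(mask, radius):
    """Erode a boolean sea mask with a disk element; `radius` (in
    pixels) acts as the coastline buffer."""
    yy, xx = np.ogrid[-radius:radius + 1, -radius:radius + 1]
    disk = (yy ** 2 + xx ** 2) <= radius ** 2
    return ndimage.binary_erosion(mask, structure=disk)
```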
How do we extract the candidates from the dissimilarity map? We first build its max-tree representation [9]. The max-tree is a hierarchical representation of a grayscale image; nodes at a given level of the tree correspond to connected components of pixels with intensity greater than a given value. Once the tree is constructed, we can extract components that satisfy certain geometrical properties.
In the past, we filtered the max-tree based on compactness in order to extract oil tanks. In the case of boats, we are interested in elongated components within a given size range. Our measure of elongation is the ratio of the major axis length over the minor axis length, computed with principal component analysis. Here is the result of filtering for components with size range 500-6000 m2 and elongation between 2 and 8.
Max-tree filtering of the dissimilarity map based on size and elongation.
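A simplified stand-in for this filtering step is sketched below; it uses a plain connected-component filter on a thresholded map rather than a true max-tree, with sizes in pixels rather than m2, and elongation computed via PCA of the pixel coordinates as described above:

```python
import numpy as np
from scipy import ndimage

def filter_components(binary, min_size, max_size, min_elong, max_elong):
    """Keep connected components within a size range (in pixels) whose
    elongation (major/minor axis ratio via PCA) is within a range.
    A simplified sketch, not the actual max-tree filter."""
    labels, n = ndimage.label(binary)
    keep = np.zeros_like(binary, dtype=bool)
    for i in range(1, n + 1):
        ys, xs = np.nonzero(labels == i)
        if not (min_size <= ys.size <= max_size):
            continue
        cov = np.cov(np.stack([ys, xs]).astype(float))
        eigvals = np.sort(np.linalg.eigvalsh(cov))[::-1]
        # axis lengths scale with the square roots of the variances
        elong = np.sqrt(eigvals[0] / max(eigvals[1], 1e-9))
        if min_elong <= elong <= max_elong:
            keep[labels == i] = True
    return keep
```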
Max-tree filtering is extremely fast. The operation is close to instantaneous for the 2753x2627 image shown above. The detection bounding boxes are shown below, superimposed on the pan-sharpened image.
A quick inspection shows that almost every bounding box corresponds to a boat. However, this is not always the case. Here is the result of applying exactly the same procedure on an image off the coast of Mumbai, where the water is very choppy. We pick up all the boats but also a lot of the waves which happen to conform to the size and elongation constraints.
Pan-sharpened WorldView-2 image chip of Mumbai, dissimilarity map, filtered components, and, finally, boat candidates.
A classifier is needed to weed out the noise from the candidate set. This is where deep learning enters the picture.
The final step is to train a neural network based classifier to differentiate between boats and other objects in the candidate set. Since the generation of the candidate set is automatic, the generation of the training set is greatly facilitated. All that is required is to go through the candidate set and manually label each candidate as ‘Boat’ or ‘Other’. These labels can then be used to train the classifier.
We extracted several thousand candidate axis-aligned bounding boxes from 13 different ports around the world: Shanghai, Singapore, Hong Kong, Rotterdam, Kaohsiung, Hamburg, Jeddah, Algeciras, Mumbai, Santos, Piraeus, Istanbul and Yokohama. We made an image chip for each candidate from the corresponding pan-sharpened image, and then labeled each chip as ‘Boat’ or ‘Other’.
Examples of ‘Boat’ (top row) and ‘Other’ (bottom row). Square 224x224 chips are created by warping and zero padding the original rectangular chips.
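One plausible reading of the chip preparation step is sketched below (the resampling is nearest-neighbor for simplicity, and the function name is ours): scale the longer side to 224, then zero-pad the shorter side.

```python
import numpy as np

def square_chip(chip, size=224):
    """Resize the longer side to `size` (nearest neighbor), preserving
    aspect ratio, then zero-pad the shorter side to `size`."""
    h, w = chip.shape[:2]
    scale = size / max(h, w)
    nh, nw = max(1, round(h * scale)), max(1, round(w * scale))
    ys = (np.arange(nh) / scale).astype(int).clip(0, h - 1)
    xs = (np.arange(nw) / scale).astype(int).clip(0, w - 1)
    resized = chip[ys][:, xs]
    out = np.zeros((size, size) + chip.shape[2:], dtype=chip.dtype)
    y0, x0 = (size - nh) // 2, (size - nw) // 2
    out[y0:y0 + nh, x0:x0 + nw] = resized
    return out
```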
Out of the total set of candidates, we created a training set of 6913 chips: 1636 chips labeled ‘Boat’ and 5277 chips labeled ‘Other’. We then trained a VGG-16 neural network with ImageNet-initialized weights. Note that while candidate selection takes place on the 4- or 8-band multi-spectral image (depending on the sensor), the neural network is trained on the pan-sharpened image. This allows us to take advantage of the sub-meter resolution in the final classification of each candidate.
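A minimal Keras sketch of such a classifier, assuming chips have been prepared as 224x224 RGB arrays; the head layers and hyperparameters shown are illustrative, not necessarily those we used:

```python
from tensorflow.keras.applications import VGG16
from tensorflow.keras.layers import Dense, Flatten
from tensorflow.keras.models import Model

def build_boat_classifier(weights="imagenet"):
    """VGG-16 base with a small binary head: 'Boat' vs 'Other'."""
    base = VGG16(weights=weights, include_top=False,
                 input_shape=(224, 224, 3))
    x = Flatten()(base.output)
    x = Dense(256, activation="relu")(x)       # illustrative head size
    out = Dense(1, activation="sigmoid")(x)    # P(chip is a boat)
    model = Model(base.input, out)
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model
```

The model would then be fit on the labeled chips with, e.g., `model.fit(chips, labels)`.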
Bringing it all together
boat-detector is a GBDX task which takes as input a multi-spectral image and its pan-sharpened counterpart, both in UTM projection, and produces bounding boxes for boats at sea. The steps executed in the task are water masking, candidate extraction using max-tree filtering on the dissimilarity map, generation of the candidate image chips, and, finally, deployment of the model contained in the task to classify each chip as ‘Boat’ or ‘Other’.
The output is a geojson file with the axis-aligned bounding boxes of the detections. The task also has a number of optional tunable parameters including the minimum and maximum acceptable component sizes and elongations and the minimum required confidence threshold. Finally, the task requires a single GPU; it uses the GBDX nvidiap2 domain which deploys AWS p2.xlarge instances.
We conducted various experiments in order to test boat-detector. Note that none of the images in these experiments were used in the generation of the training set for boat-detector. For comparison purposes, we also tested the SpaceKnow ship segmentation (sss) algorithm, available as a task on GBDX. The input to sss is a pan-sharpened image in EPSG:4326 projection and the output is a geojson with the axis-aligned bounding boxes of the detections.
We first ran boat-detector and sss on five images collected over Sevastopol, Halifax, Genoa, Osaka and Dubai. We used the default settings of boat-detector, except for the minimum size which we set to 100 m2 in order to detect small vessels. The results for Osaka are shown below; click on each menu bar to toggle the corresponding layer.
boat-detector (yellow) and sss (blue) detections in Osaka (catalog id 105001000A1FFF00, GeoEye-1). boat-detector only detects boats at sea whereas sss also detects boats at the dock. The boat-detector candidates are shown in red; note the significant search area reduction. A full page view of the map can be found here.
Both algorithms appear to be quite accurate at sea. The boat-detector bounding boxes for boats with a pronounced wake also include the wake, since the wake forms part of the corresponding candidate. In this case, the sss bounding boxes provide a significantly more accurate picture of the actual boat size. The boat-detector candidates illustrate nicely the search area reduction achieved with water-masking and max-tree filtering on the dissimilarity map; only linear features at sea are included in the search space. The classifier singles out the boats (yellow) from the entire candidate set (red), which, among other things, includes docks, waves and wave breakers. With regards to docked boats, sss appears to detect most of them but suffers from a number of false positives on land and often lumps boats which are close to each other into a single detection.
As part of the same experiment, we also evaluated the speed of each task. On average, boat-detector takes about 1 sec/km2 while sss takes about 47 sec/km2. We are not aware of the implementation details of sss. We speculate that this difference in speed is due to the drastic search area reduction performed within boat-detector prior to the final classification step. An important factor to keep in mind in this comparison is that sss also targets boats at the dock, i.e., it has a larger search space than boat-detector to begin with. Note that the speed of boat-detector depends on the value of the minimum size parameter, in this case 100 m2. Increasing the minimum size reduces the number of candidates and hence increases the speed.
We conducted a different experiment using a large number of images, collected between 2010 and 2014, over three specific areas: Vancouver (60 images), San Francisco (30 images) and New York (20 images). These images were captured under different weather conditions, including calm, windy and cloudy. We ran boat-detector (using the same settings as in the previous experiment) and sss. We also collected AIS data within +/- 5 minutes of each image acquisition time and took the average in order to obtain a rough estimate of the boat location at the time of acquisition. Finally, we selected 5 images from each area and created a reference data set by drawing axis-aligned bounding boxes over all boats at sea within a given polygon, referred to as the reference polygon. You can download the reference data set for each port here, here and here.
The time lapse for Vancouver is shown below; we made similar time lapses for San Francisco and New York. A quick viewing reveals that certain AIS points do not overlap with any boat. This is to be expected as each AIS point is computed as an average of all transmission locations within +/- 5 minutes of the acquisition time. Nevertheless, it is quite straightforward to single out the boats that are not transmitting.
Vancouver time lapse. boat-detector detections are shown in yellow and sss detections in blue. AIS-estimated boat locations at acquisition time are marked with green points; note that some of these are off, depending on the speed of the boat at the acquisition time. Reference bounding boxes for select images, 5 in total, are shown in white, along with the reference polygon. Click on each menu bar to toggle the corresponding layer. A full page view of the map can be found here.
We evaluated the precision and recall of boat-detector and sss on our reference data set (which only includes boats at sea). Precision is defined as the ratio TP/(TP+FP) and recall as the ratio TP/(TP+FN), where:
- TP is the number of detection bounding boxes that intersect the reference polygon for which the maximum intersection-over-union, calculated over each of the reference bounding boxes, is >= 0.5;
- FP is the number of detection bounding boxes that intersect the reference polygon for which the maximum intersection-over-union, calculated over each of the reference bounding boxes, is < 0.5;
- FN is the number of reference bounding boxes (boat bounding boxes which intersect the reference polygon) for which the maximum intersection-over-union, calculated over each of the detection bounding boxes, is < 0.5.
Note that these definitions allow for multiple targets to be covered by the same detection bounding box (provided that the intersection-over-union is large enough). We also introduce area-weighted precision and recall. The modification introduced compared to the previous definitions is that each true positive, false positive and false negative are weighted by the corresponding bounding box area. This is a means to assign importance to a correct/faulty detection or a miss based on size. The results of our calculations are summarized in the following table.
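These definitions can be sketched in code; boxes here are (xmin, ymin, xmax, ymax) tuples, and, for brevity, the sketch assumes that every detection and reference box already intersects the reference polygon:

```python
def box_iou(a, b):
    """Intersection-over-union of two axis-aligned boxes,
    each given as (xmin, ymin, xmax, ymax)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def precision_recall(detections, references, thresh=0.5):
    """TP/FP/FN per the definitions above; assumes all boxes
    intersect the reference polygon."""
    tp = sum(1 for d in detections
             if max((box_iou(d, r) for r in references), default=0.0) >= thresh)
    fp = len(detections) - tp
    fn = sum(1 for r in references
             if max((box_iou(r, d) for d in detections), default=0.0) < thresh)
    precision = tp / (tp + fp) if detections else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall
```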
|task|precision|recall|area-weighted precision|area-weighted recall|
The main reason for the relatively low recall of both algorithms is that boats which are attached together in a group, especially barges, are usually lumped into a single detection. The intersection-over-union of this detection box with the individual bounding boxes is usually below the minimum value of 0.5, so none of the boats in the group are detected. Using a lower threshold results in better metrics for both algorithms. For boat-detector in particular, the bounding boxes of boats with significant wake include the wake. As a result, these detections fail the intersection-over-union criterion and are not counted. In addition, the minimum size threshold used in this test results in some of the very small vessels not being included in the candidate set. A smaller size threshold could be used at the expense of speed.
With regards to precision, both algorithms suffer from boat-like false positives including waves, docks and bridge segments. Collecting more training data that includes these features and re-training the neural network should help boost the precision of boat-detector. Finally, when the area weight is introduced, both algorithms do significantly better both in terms of recall and precision, which points to the fact that most errors occur over features of small size. Overall, it appears that sss slightly outperforms boat-detector albeit at a significantly lower speed.
Our prototype boat-detector is an example of bringing together different tools to efficiently solve an object detection problem on satellite imagery. Several improvements and extensions can be made to the boat-detector in order to find docked boats, to improve the quality of the bounding boxes and to boost the accuracy. The main point is that traditional image analysis and connected morphological filtering can be used to pave the way for deep learning, by facilitating the generation of training data, hence speeding up training, and by reducing the search space, hence reducing deployment times. These are crucial factors when seeking object detection solutions that work at a global scale. In the future, we will be further exploring this theme for other important object detection cases.
[1] D. J. Crisp. The state-of-the-art in ship detection in synthetic aperture radar imagery. Australian Government, Department of Defence, 2004.
[2] C. Corbane, L. Najman, E. Pecoul, L. Demagistri, and M. Petit. A complete processing chain for ship detection using optical satellite imagery. International Journal of Remote Sensing, 2010, 31 (22), pp. 5837-5854.
[3] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2010, 32 (9), pp. 1627-1645.
[4] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014, pp. 580-587.
[5] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. You only look once: unified, real-time object detection. https://arxiv.org/abs/1506.02640, 2015.
[6] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
[7] J. Uijlings, K. van de Sande, T. Gevers, and A. Smeulders. Selective search for object recognition. International Journal of Computer Vision, 2013.
[8] P. Felzenszwalb and D. Huttenlocher. Efficient graph-based image segmentation. International Journal of Computer Vision, Vol. 59, No. 2, September 2004.
[9] P. Salembier, A. Oliveras, and L. Garrido. Antiextensive connected operators for image and sequence processing. IEEE Transactions on Image Processing, Vol. 7, No. 4, April 1998.