Assessing OSM with automatic built-up detection

OpenStreetMap (OSM) is a free global map built from scratch by volunteers around the world. It is a living map of the world: online communities of mappers use satellite imagery and/or knowledge of the area they live in to continuously update the map with buildings, roads, infrastructure, etc. The OSM dataset has diverse uses, from locating remote population centers in order to administer aid, to training deep neural networks to identify roads on satellite imagery.

The completeness of OSM is notoriously hard to assess. Statistics can be obtained in areas where reference data is available but this is a rare occasion. (The existence of ubiquitous reference data would defeat the purpose of mapping in the first place!) The other side of the same coin is intelligent mapping tasking: how can mappers be pointed to unmapped or sparsely mapped locations? In the past, we trained a neural network to identify settlements in the Nigerian desert with good results. However, deep learning involves substantial overhead due to its reliance on training data and, as of today, it is still computationally expensive, especially when deployed at a large scale.

An attractive alternative to deep learning is unsupervised detection of materials commonly found in built-up areas like cement, asphalt and stone, using spectral information. Our Protogen Land Use Land Cover (LULC) classification algorithm makes use of 8-band, atmospherically compensated Worldview-2 and Worldview-3 imagery to classify each pixel as vegetation, water, soil, cloud, shadow and unclassified. The last category can serve as an approximation for built-up as shown in the following example.


LULC classification output for an area of interest east of Houston is shown on the left; blue is water, green is vegetation, brown is bare soil and grey is unclassified. The mask on the right is derived by painting the unclassified pixels in white and the rest of the image in black. Note that the mask outlines urban and suburban areas quite nicely. Also note that the big white patches are not built-up; according to Google maps, the left patch is a large area of dirt by the name of Lost Lake. The patch on the right is within the Highlands Reservoir; the reservoir possibly contains materials other than water.

The LULC classification algorithm is fast: it takes about 0.34 sec/km2 (or 0.006 cents/km2) on the GBDX default domain, which consists of r4.2xlarge nodes. The speed and flexibility of the algorithm allows easy experimentation at scale. We ran the LULC algorithm on recent imagery of a number of cities around the world in order to derive built-up masks, and overlaid OSM building footprints that we obtained via the OSM Overpass API. You can explore the results in the map below.

Automatically derived built-up shown in white over a black background. OSM building footprints obtained from the OSM Overpass API are shown in green. Chrome and Firefox are the recommended browsers for viewing the map.

The map provides an indication of how well buildings in different cities around the world are mapped. OSM coverage in New York City and Madrid is high; less so in Houston and Amman. Ulaan Baatar’s large shantytown in the north is mostly absent from OSM but the rest of the Mongolian capital is surprisingly well mapped. In Asunción, buildings have mostly been drawn along the main roads, while in Perth only downtown and the suburb of Dianella have good coverage. It is compelling to discover the underlying reasons of these trends.

You can find the GBDX workflow that we implemented in this Jupyter notebook, which you can run locally by installing the gbdx-notebook server. The workflow consists of three tasks:

  • protogen-runner runs the Protogen LULC algorithm that derives a built-up mask from the input multispectral image;
  • download-osm-buildings downloads building footprints from the OSM Overpass API and converts the osm file format to geojson;
  • upload-to-mapbox uploads the raster output of protogen-runner and the vector output of download-osm-buildings to our Mapbox account where the corresponding tilesets are created. When the input is a geojson file, the task converts it to the mbtiles format using tippecanoe prior to upload.

The map (full page view here) was created with the Mapbox GL Javascript library. Each city corresponds to a different layer/tileset in the map; ‘flying’ to a city activates that city’s raster and vector tilesets and deactivates the tilesets of the other cities (a trick that we found made for the smoothest flying experience.)

You might have noticed that the built-up masks are noisy in parts, e.g., in Los Angeles and Tehran, they include mountainous terrain which is likely not inhabited. There are ways to improve accuracy, e.g., a corner detection algorithm can probably filter out non man-made features. Regardless, the overall conclusion from this experiment is that the masks do provide a valuable bird’s eye view of the areas that need to be mapped and can potentially be used for intelligent mapping tasking. We are also working on more refined feature extraction methods like road extraction, and will be comparing their results with the corresponding features in OSM in the future.

If you are interested in using GBDX to derive geospatial intelligence from satellite imagery, you can sign up for a free account here!

Written on February 24, 2017