Creating a GBDX task

Tasks are the bread and butter of the GBDX Platform. This walkthrough takes you through the steps of creating a task that can run your Python code on GBDX. We will start with a very simple task, then cover the more advanced case of creating a machine learning task, and finally demonstrate how to set up a machine learning task to run on a GPU.


Contents

  1. Background
  2. Hello GBDX
  3. Dockerizing
  4. Registering a Task on GBDX
  5. Machine Learning on GBDX
  6. Finally

Background

A GBDX task is a process that performs a specific action on its inputs and generates a set of outputs. In the vast majority of cases, inputs and outputs consist of satellite image files (usually in tif format), vector files (shapefile, geojson), text files and various metadata files (XML, IMD and others).

Tasks can be chained together in a workflow where one task’s outputs can be the inputs to one or more different tasks. In this manner, more complicated processes can be executed than is possible within a single task. For example, you can imagine a ship detection workflow consisting of the following tasks: (a) pansharpen the raw satellite image, (b) create a sea mask, and (c) look for boats in the sea.

When a workflow is executed, tasks are scheduled appropriately by a scheduler and the system generates status indicators that are available via the GBDX API. Using the task definition in the task registry, each task is executed by a worker node within a Docker container that contains the task code and its dependencies in an encapsulated environment. You can find additional information on GBDX in the GBDX documentation.

Hello GBDX

In this section, we will write a Python script for our Hello GBDX task, hello-gbdx. The script hello-gbdx.py does the following: it obtains a list of the task input files and writes this list, along with a user-defined message, to the file out.txt. This script is executed by the Python interpreter within the task’s Docker container (more on this a bit later).

import os
from gbdx_task_interface import GbdxTaskInterface

class HelloGbdxTask(GbdxTaskInterface):

    def invoke(self):

        # Get inputs
        input_dir = self.get_input_data_port('data_in')
        message = self.get_input_string_port('message', default='No message!')

        # Get output
        output_dir = self.get_output_data_port('data_out')
        os.makedirs(output_dir)

        # Write message to file
        with open(os.path.join(output_dir, 'out.txt'), 'w') as f:
            input_contents = ','.join(os.listdir(input_dir))
            f.write(input_contents + '\n')
            f.write(message)


if __name__ == "__main__":
    with HelloGbdxTask() as task:
        task.invoke()

What exactly is going on in this script?

The HelloGbdxTask class, which inherits from the GbdxTaskInterface class, is defined:

import os
from gbdx_task_interface import GbdxTaskInterface

class HelloGbdxTask(GbdxTaskInterface):

    def invoke(self):

You can think of GbdxTaskInterface as a GBDX task Python template. It contains prebuilt functions to read the names and values of the task ports, as well as cleanup code that records the result of the task execution (‘success’ or ‘fail’). The invoke() function implements the task functionality, which is described in the following.
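
For reference, here is a minimal sketch of what gbdx_task_interface.py might look like. This is an illustration only; use the version of the module provided with this walkthrough, whose details (status file schema, error handling) may differ:

import json
import os

class GbdxTaskInterface(object):

    def __init__(self, work_path='/mnt/work'):
        self.work_path = work_path
        self.status = 'success'
        self.reason = 'Task completed successfully.'

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Record the task result so the platform can report it;
        # the real interface may use a different status schema
        if exc_type is not None:
            self.status, self.reason = 'failed', str(exc_value)
        with open(os.path.join(self.work_path, 'status.json'), 'w') as f:
            json.dump({'status': self.status, 'reason': self.reason}, f)

    def get_input_data_port(self, port_name):
        # Directory ports are mounted under <work_path>/input/<port_name>
        return os.path.join(self.work_path, 'input', port_name)

    def get_input_string_port(self, port_name, default=None):
        # String ports arrive as key-value pairs in input/ports.json
        ports_file = os.path.join(self.work_path, 'input', 'ports.json')
        if os.path.exists(ports_file):
            with open(ports_file) as f:
                return json.load(f).get(port_name, default)
        return default

    def get_output_data_port(self, port_name):
        return os.path.join(self.work_path, 'output', port_name)

    def invoke(self):
        raise NotImplementedError('Subclasses implement invoke().')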

The name of the input directory is obtained using the get_input_data_port function (inherited from GbdxTaskInterface).

# Get inputs
input_dir = self.get_input_data_port('data_in')

data_in is the task directory input port. What get_input_data_port('data_in') does behind the scenes is return the string ‘/mnt/work/input/data_in’.

The value of the input string port message is obtained using the get_input_string_port function (also inherited from GbdxTaskInterface).

message = self.get_input_string_port('message', default='No message!')

message is one of possibly many task string input ports. What get_input_string_port('message', default='No message!') does behind the scenes is read the value of message from the file ports.json, which is found under /mnt/work/input/. (Keep in mind that you don’t have to worry about these inner workings if you don’t want to!) If the value is not specified, it returns a default value.

The name of the output directory is obtained using the get_output_data_port function (inherited from GbdxTaskInterface) and the output directory is created.

# Get output
output_dir = self.get_output_data_port('data_out')
os.makedirs(output_dir)

data_out is the task directory output port. What get_output_data_port('data_out') does behind the scenes is return the string ‘/mnt/work/output/data_out’. Note that it is the responsibility of the script to create the output directory and move files into it.

out.txt is created and saved in the output directory:

# Write message to file
with open(os.path.join(output_dir, 'out.txt'), 'w') as f:
    input_contents = ','.join(os.listdir(input_dir))
    f.write(input_contents + '\n')
    f.write(message)

Finally, the script executes the invoke() function when it is run.

if __name__ == "__main__":
    with HelloGbdxTask() as task:
        task.invoke()

How is data input to and output from hello-gbdx when it is executed on GBDX? If hello-gbdx is the first in a series of tasks comprising a workflow, data_in is assigned a string value by the user, which is the S3 location containing the task input files (it will soon become apparent how to do this). These files are automatically copied to /mnt/work/input/data_in of the Docker container which runs hello-gbdx.py (we will cover Docker in the next section). When the execution of hello-gbdx.py concludes, out.txt is found in /mnt/work/output/data_out of the Docker container (our script explicitly saved it there). GBDX permits saving the contents of /mnt/work/output/data_out at a user-specified S3 location, as well as feeding them to the directory input port of another task in order to chain the two tasks together. If neither of those actions is performed, the contents of /mnt/work/output/data_out are lost when the task terminates. In this walkthrough, we will only consider single-task workflows; you can explore Platform Stories for examples of more complicated workflows involving multiple tasks.
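
To make chaining concrete, here is a hypothetical sketch using gbdxtools (covered in detail later in this walkthrough); the downstream task name and the S3 location are placeholders:

from gbdxtools import Interface

gbdx = Interface()

# First task: hello-gbdx, with a placeholder S3 location for its inputs
hello = gbdx.Task('hello-gbdx')
hello.inputs.data_in = 's3://bucket/prefix/path/to/inputs'  # placeholder
hello.inputs.message = 'Hello!'

# Hypothetical downstream task whose directory input port is hooked to
# the output port of hello-gbdx; GBDX handles the data handoff
downstream = gbdx.Task('some-downstream-task')  # placeholder task name
downstream.inputs.data_in = hello.outputs.data_out.value

# Both tasks run within a single workflow
workflow = gbdx.Workflow([hello, downstream])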

In the next section we go through the steps of creating a Docker image for hello-gbdx.

Dockerizing

Dockerizing is a crucial step to get your code to run as a task on GBDX. In this section, we will provide a high level overview of Docker and review its terminology. Next, we’ll show you how to build your own hello-gbdx Docker image and run a container locally. At the end of this section, you will have an image that can be used to execute hello-gbdx, and have all of the materials necessary to register hello-gbdx on GBDX.

About Docker

Docker is a software containerization platform that allows developers to package up an application with its dependencies, and deliver it to a user in a single, self-sufficient package (referred to as a container).

Figure 1: Docker allows you to deliver various libraries and scripts in a lightweight package. Docker is required to create a task on GBDX.

Docker and GBDX

When a task is run on GBDX, a Docker container containing the task code, all its required dependencies, and the operating system is run by a worker node. Docker provides an efficient method for delivering the task code and its dependencies to the worker in an encapsulated environment.

Docker Lingo

Because Docker can be confusing if you are not used to it, we will define some terms you will encounter in this tutorial. You can consult the Docker glossary for more information.

  • Image: Docker images are the basis of containers. An image is a static, immutable file that describes the container environment. Running an image produces a container.

  • Container: A container is a runtime instance of a Docker image (similar to an object being an instantiation of a class in Python).

  • Dockerfile: A Dockerfile is a text document that contains all the commands you would normally execute manually in order to build a Docker image. Docker can build images automatically by reading the instructions in a Dockerfile. It is only necessary to create a Dockerfile if you elect to build your image from scratch.

  • DockerHub: A repository of images. You can pull and edit these images for your own use, analogous to cloning or forking a GitHub repo.

Creating a Docker Image

Before creating your image, sign up for a DockerHub account and follow these instructions to install Docker CE on your machine. This will allow you to create, edit, and push your Docker images locally from the command line. In this section, we review two methods for creating a Docker image: pulling an existing image from DockerHub and modifying it, or building one from scratch with a Dockerfile.

Pulling an Image from DockerHub

Before choosing an image to work with, you should choose an operating system and have a list of the libraries your task requires. hello-gbdx requires Python to run, so we will look for a simple base image with Python packages installed. Note that unnecessary libraries and packages will only slow down our container and the task will consequently take longer to run. There are many publicly available images that you can search through on DockerHub, so chances are you’ll be able to find one that suits your needs. Note that you may install additional dependencies on the image once you’ve pulled it from DockerHub, so it is not necessary to find one with all of the required libraries.

Below is a list of images that may be useful to you:

  • ubuntu: A basic image with an Ubuntu OS and a good starting point for very simple tasks. This can be configured further once you have your own version tagged, which we will review below.

  • debian: Another basic image with a Debian OS.

  • geographica/gdal2: An image with Ubuntu 16.04 and GDAL v.2.x. (See the ‘tags’ tab in the repo for specific version options.)

  • platformstories/python_vim: A simple image with Python and vim installed; this is the base image we will use for hello-gbdx below.

  • nvidia/cuda: An image developed by NVIDIA for accessing the GPU on the device running it. Note that this requires nvidia-docker. More on this later.

Since hello-gbdx only requires Python we will use platformstories/python_vim as our base image. In the following steps you will pull this image from its DockerHub repository, tag your own version for editing, and push the new image up to your own DockerHub account:

Login to Docker with your username and password.

docker login -u <your_username> -p <your_password>

Download the python_vim image to your machine using the following command. Pulling images can take a few minutes depending on their size.

docker pull platformstories/python_vim

Once complete, you can confirm that the image has been downloaded by executing the following command:

docker images

You should see something that looks like this:

REPOSITORY                             TAG                 IMAGE ID            CREATED             SIZE
platformstories/python_vim             latest              ddd4c238e314        About an hour ago   461.1 MB

Finally, you can tag the image under your username. This enables you to edit the image and push it to your personal DockerHub repository. Name the image hello-gbdx-docker-image, as we will be moving our code into it shortly.

docker tag platformstories/python_vim <your_username>/hello-gbdx-docker-image
docker push <your_username>/hello-gbdx-docker-image

This step is optional. We are going to run the image interactively (using the -it flag) to produce a container. The --rm flag keeps your machine clean by deleting the container after it is stopped. Recall that a container is a running instance of the image, but the image itself remains static. This means that any changes you make to the container will not affect the image and will be lost once the container is stopped. Familiarize yourself with the structure of the container using cd and ls:

# Run the image
docker run --rm -it <your_username>/hello-gbdx-docker-image

# Now we are in the container. Explore some as you would in the command line
root@ff567ca72fa0:/ ls
>>> boot  etc   lib    media  opt   root  sbin  sys  usr  bin  dev   home  lib64  mnt    proc  run   srv   tmp  var

Congratulations! You now have your own Docker image to which you can add your code.

Adding Your Code to the Image

Figure 2: Adding code to a Docker image.

The following steps (shown in Figure 2) walk you through adding hello-gbdx.py and gbdx_task_interface.py to hello-gbdx-docker-image. Before getting started ensure that both scripts are saved to your current working directory.

First we run the image using the docker run command in detached mode (using the -d flag). This runs the container in the background, so that we can copy files into it from our local machine.

docker run --rm -itd <your_username>/hello-gbdx-docker-image
>>> ff567ca72fa0ed6cdfbe0a5c02ea3e04f88ec49239344f217ce1049651d01344

The value returned is the container id. Make note of this because we will need it.

We now use the docker cp command to copy our scripts into the container. The format of this command is as follows: docker cp <filename> <container_id>:<container_dest_path>.

# Copy hello-gbdx.py to the root directory of the container
docker cp hello-gbdx.py <container_id>:/
docker cp gbdx_task_interface.py <container_id>:/

Our scripts are now in the container. You may confirm this by attaching to the container (bringing it back to the foreground) as follows:

docker attach <container_id>

# Notice that the scripts now live in the root directory of the container
root@ff567ca72fa0:/ ls
gbdx_task_interface.py   hello-gbdx.py   boot  etc   lib    media  opt   root  sbin  sys  usr  bin  dev   home  lib64  mnt    proc  run   srv   tmp  var

You may detach from the container (sending it back to background) without stopping it using the following escape sequence: Ctrl-p + Ctrl-q

If we were to stop the container now, all of our changes would be lost and hello-gbdx-docker-image image would remain unchanged. To permanently update hello-gbdx-docker-image, we must commit our changes to it.

# Commit the changes from the container to the image
docker commit -m 'add scripts to root' <container_id> <your_username>/hello-gbdx-docker-image

Now when you run hello-gbdx-docker-image, hello-gbdx.py and gbdx_task_interface.py will be in the root directory.

You can also push your new image to DockerHub in case you need to pull it in the future.

# Push the changes up to DockerHub
docker push <your_username>/hello-gbdx-docker-image

Keep in mind that, although hello-gbdx does not require any additional libraries to run, oftentimes you will need to install a package that is not provided in the image you pulled. Let’s say our task requires numpy to run; the process of adding it to the image is similar:

# Start up a container in attached mode, do not use the --rm flag
docker run -it <your_username>/hello-gbdx-docker-image

# Install numpy (can take several minutes)
root@ff567ca72fa0:/ pip install numpy

# Send container to background
root@ff567ca72fa0:/ ctrl-p + ctrl-q

# Commit your changes to the image, push to DockerHub
docker commit -m 'Add numpy' <container_id> <your_username>/hello-gbdx-docker-image
docker push <your_username>/hello-gbdx-docker-image

Our image is ready! If you would like to learn how to build an image from scratch, continue on to the next section. Otherwise, move on to testing the image locally.

Building an Image with a Dockerfile

Feeling a little ambitious? You can build your own image from scratch with a Dockerfile. For more information on Dockerfiles see here and here.

Figure 3: Building a Docker image with a Dockerfile.

In this example, we will build hello-gbdx-docker-image.

We begin by making the directory hello-gbdx-build, which will contain our Dockerfile, and the subdirectory bin, into which we copy hello-gbdx.py and gbdx_task_interface.py.

# Make build and bin directories
mkdir hello-gbdx-build/
cd hello-gbdx-build/
mkdir bin/

# Copy both scripts for hello-gbdx into the bin/ directory
cp path/to/hello-gbdx.py bin/
cp path/to/gbdx_task_interface.py bin/

Now we can create our Dockerfile. From within the hello-gbdx-build directory, type vim Dockerfile (or use whatever code editor you prefer) to create a blank document. The first line of the Dockerfile specifies the base OS that your image uses. We will be working with Ubuntu. Type the following on the first line of the file:

FROM ubuntu:xenial

Next we add the commands that install the required dependencies preceded by the keyword RUN.

# Install Python packages
RUN apt-get update && apt-get -y install \
    python \
    vim \
    build-essential \
    python-software-properties \
    software-properties-common \
    python-pip \
    python-dev

We instruct Docker to place the contents of bin into the image root directory. Add the following line to the end of Dockerfile:

# Add all scripts in bin to the image root directory
COPY ./bin /

Our Dockerfile is now complete. Exit vim with the :wq command.

We are now ready to build the image. Make sure that you are still in the hello-gbdx-build directory and execute the following command:

docker build -t <your_username>/hello-gbdx-docker-image .

The build will take several minutes the first time through. If you change your script or the Dockerfile you will need to rebuild the image. Subsequent builds will be faster since Docker builds images in layers.

Finally, push the image to DockerHub.

docker push <your_username>/hello-gbdx-docker-image

Our Docker image is now ready to be tested with sample inputs.

Using a Private Docker Repository

Note that you can make your Docker image private on DockerHub and still use it as a task. To make an image private, simply navigate to the Settings tab in the image repository and select the Make Private button under Visibility. Before registering the associated task, however, you must add ‘tdgpdeploy’ as a collaborator. You may do so in the Collaborators tab of the image repository: simply enter ‘tdgpdeploy’ in the username prompt and select Add User.

Testing a Docker Image

At this point you should have hello-gbdx-docker-image, which includes hello-gbdx.py. In this section, we will run this image with actual input data. Successfully doing this locally ensures that hello-gbdx will run on GBDX. hello-gbdx/sample-input in this repo contains the two inputs required by hello-gbdx: (a) the directory data_in, whose contents will be written to out.txt (in this example, simply the file data_file.txt), and (b) the file ports.json, which contains the message to be written to out.txt. Keep in mind that ports.json is automatically created by GBDX based on the task definition and the values of the string input ports provided by the user when the task is executed.
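
For reference, the ports.json in sample-input looks something like the following; on the platform, GBDX generates this file automatically, so you only ever write one by hand for local testing:

{
    "message": "This is my message!"
}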

Run hello-gbdx-docker-image and mount the inputs to the container under /mnt/work/input; this is where GBDX will place the inputs when the task is executed.

docker run --rm -v ~/path/to/hello-gbdx/sample-input:/mnt/work/input -it <your_username>/hello-gbdx-docker-image

Note the important distinction between mounting data to the container and adding data to the image using the COPY command in the Dockerfile: when you exit the container, this data ‘disappears’ (i.e., it is not saved onto the image).

Confirm that the inputs are mounted by exploring the container.

# Look at the contents of the input directory. data_in should be mounted.
root@3ad24b35e32e:/ ls /mnt/work/input/
>>> data_in  ports.json

To test hello-gbdx, simply run the hello-gbdx.py script.

root@3ad24b35e32e:/ python hello-gbdx.py

If the script completes successfully you shouldn’t see anything written to STDOUT, and the file out.txt should be found under /mnt/work/output/data_out/. Here is how you can confirm this:

# List the contents of the output directory port; out.txt should live there
root@3ad24b35e32e:/ ls mnt/work/output/data_out/
>>> out.txt

You can also make sure out.txt contains the expected content by typing cat mnt/work/output/data_out/out.txt. You should see the following output:

data_file.txt
This is my message!

Congratulations, your task is working as expected! The next step is to create a task definition, which will be used to register hello-gbdx on the platform.

Registering a Task on GBDX

Now that we have hello-gbdx-docker-image working locally, we can finally define hello-gbdx and then register it to the GBDX task registry.

Defining the Task

The task definition is a json file that contains a description of the task functionality, a list of its inputs and outputs, and the Docker image that needs to be run when the task is executed.

{
    "name": "hello-gbdx",
    "version":"0.0.1",
    "description": "Writes list of the input file names and a user defined message to output file out.txt.",
    "properties": {
        "isPublic": false,
        "timeout": 7200
    },
    "inputPortDescriptors": [
        {
            "name": "message",
            "type": "string",
            "description": "User defined message.",
            "required": true
        },
        {
            "name": "data_in",
            "type": "directory",
            "description": "Input data directory.",
            "required": true
        }
    ],
    "outputPortDescriptors": [
        {
            "name": "data_out",
            "type": "directory",
            "description": "Output data directory."
        }
    ],
    "containerDescriptors": [
        {
            "type": "DOCKER",
            "properties": {
                "image": "platformstories/hello-gbdx-docker-image"
            },
            "command": "python /hello-gbdx.py",
            "isPublic": true
        }
    ]
}

We review the five parts of this definition below.

Task properties:

{
    "name": "hello-gbdx",
    "version":"0.0.1",
    "description": "Writes list of the input file names and a user defined message to output file out.txt.",
    "properties": {
        "isPublic": false,
        "timeout": 7200
  • name: The task name.
  • version: The task version number. Note that this must be incremented every time the task’s Docker image is updated for changes to take effect.
  • description: A brief, high-level description of the task.
  • isPublic: A boolean. Note that only GBDX members with admin privileges can submit a new public task. If you have questions about public tasks, contact gbdx-support. Once a task has been made public you may switch this flag to true so that future versions will also be public.
  • timeout: Amount of time (in seconds) for the task to run before it is terminated by the platform. This value defaults to 7200 and must be between 0 and 172800 (i.e., 48 hours).

Input Port Descriptors: This is where the task input ports are defined.

"inputPortDescriptors": [
    {
        "name": "message",
        "type": "string",
        "description": "User defined message.",
        "required": true
    },
    {
        "name": "data_in",
        "type": "directory",
        "description": "Input data directory.",
        "required": true
    }
  • name: The input port name.
  • type: The input port type. Currently the only options are ‘directory’ and ‘string’. A directory input port is used to define an S3 location where input files are stored or to hook to the output directory port of a previous task. A string input port is used to pass a string parameter to the task. Note that integers, floats and booleans must all be provided to a task in string format!
  • description: Description of the input port.
  • required: A boolean. ‘true’/’false’ indicate required/optional input, respectively.

Output Port Descriptors: This is where the task output ports are defined.

"outputPortDescriptors": [
    {
        "name": "data_out",
        "type": "directory",
        "description": "Output data directory."
    }
  • name: The output port name.
  • type: The output port type. Currently, the only options are ‘directory’ and ‘string’.
  • description: Description of the output port.

Container Descriptors:

"containerDescriptors": [
    {
        "type": "DOCKER",
        "domain": "default"
        "properties": {
            "image": "platformstories/hello-gbdx-docker-image"
        },
        "command": "python /hello-gbdx.py",
        "isPublic": true
  • type: The type of container. Currently only DOCKER is supported.
  • domain: The domain on which to run this task. Default is an r4.2xlarge machine. For tasks using the GPU change this to an appropriate GPU domain (more on this later).
  • image: The name of the Docker image that is pulled from DockerHub.
  • command: The command to run within the container to initiate the task.

GBDX Task Registry

We now have all the required material to register hello-gbdx: a Docker image on DockerHub and the task definition.

Open an IPython terminal, import gbdxtools and start up a GBDX Interface.

from gbdxtools import Interface

gbdx = Interface()

Call the register() method of the TaskRegistry class with the name of the definition JSON. (Make sure hello-gbdx-definition.json is in your working directory).

gbdx.task_registry.register(json_filename = 'hello-gbdx-definition.json')
>>> u'hello-gbdx:0.0.1 has been submitted for registration.'

There’s a good chance that hello-gbdx:0.0.1 already exists in the registry. You can try using a different name after appropriately modifying the definition. The task takes 10-15 minutes to register. You may check on the status of the registration as follows:

# If output is True the task has successfully registered. If False, try again in a couple minutes.
'hello-gbdx:0.0.1' in gbdx.task_registry.list()
>>> True

Congratulations, you have just registered hello-gbdx! You can run it with sample data as follows. Open an IPython terminal and copy in the following:

from gbdxtools import Interface
from os.path import join
import uuid
gbdx = Interface()

# specify S3 location of input files, this must be on S3
input_location = 's3://gbd-customer-data/32cbab7a-4307-40c8-bb31-e2de32f940c2/platform-stories/hello-gbdx/'

# create task object
hello_task = gbdx.Task('hello-gbdx')

# set the value of data_in
hello_task.inputs.data_in = join(input_location, 'data_in')

# set the value of the input string port
hello_task.inputs.message = 'This is my message!'

# define a single-task workflow
workflow = gbdx.Workflow([hello_task])

# save contents of data_out in platform-stories/trial-runs/random_str within your bucket/prefix
random_str = str(uuid.uuid4())
output_location = join('platform-stories/trial-runs', random_str)
workflow.savedata(hello_task.outputs.data_out, output_location)

Execute the workflow and monitor its status as follows:

workflow.execute()
workflow.status

You may also use the following commands to understand the STDOUT and STDERR of the workflow:

workflow.stdout
workflow.stderr

When the workflow is complete, you can download out.txt locally as follows:

gbdx.s3.download(output_location)

To delete hello-gbdx from the registry:

gbdx.task_registry.delete('hello-gbdx:0.0.1')
>>> u'hello-gbdx successfully deleted.'

You have created a basic yet fully functional GBDX task using Docker and gbdxtools. The next section covers the process of creating more complicated tasks that can run machine learning algorithms.

Machine Learning on GBDX

In the last section, we created a simple task that generates a text file with a list of the contents of the input directory and a user-defined message. Chances are you’re looking to do a bit more with your task. In this section, you will learn how to create a task that runs a standard machine learning (ML) algorithm, such as a random forest classifier, and how to set up an ML task that utilizes a Convolutional Neural Network (CNN) so that it can be executed on a worker with a GPU.

Random Forest Classifier

In this example, we will create the task rf-pool-classifier that trains a random forest classifier to classify polygons of arbitrary geometry into those that contain swimming pools and those that don’t. For more information on this algorithm see here and here.

Figure 4: Inputs and output of rf-pool-classifier.

rf-pool-classifier has two directory input ports: geojson and image. Within geojson, the task expects to find the file train.geojson, which contains labeled polygons from both classes; within image, it expects a tif image file from which it will extract the pixels corresponding to each polygon (Figure 4). The task also has the input string port n_estimators, which determines the number of trees in the random forest; specifying a value is optional and the default is ‘100’. The task produces a trained model in pickle format, which is saved in the S3 location specified by the output port trained_classifier.

The Code

The code of rf-pool-classifier.py is shown below; the structure is the same as hello-gbdx.py.

import numpy as np
import os
import pickle
import warnings
warnings.filterwarnings('ignore')   # suppress annoying warnings

from shutil import move
from mltools import features
from mltools import geojson_tools as gt
from mltools import data_extractors as de
from gbdx_task_interface import GbdxTaskInterface
from sklearn.ensemble import RandomForestClassifier

class RfPoolClassifier(GbdxTaskInterface):

    def invoke(self):

        # Get inputs
        n_estimators = int(self.get_input_string_port('n_estimators', default = '100'))
        img_dir = self.get_input_data_port('image')
        img = os.path.join(img_dir, os.listdir(img_dir)[0])

        geojson_dir = self.get_input_data_port('geojson')
        geojson = os.path.join(geojson_dir, os.listdir(geojson_dir)[0])


        # Move geojson to same dir as img
        move(geojson, img_dir)

        # Navigate to directory with input data
        os.chdir(img_dir)

        # Create output directory
        output_dir = self.get_output_data_port('trained_classifier')
        os.makedirs(output_dir)

        # Get training data from the geojson input
        train_rasters, train_labels = de.get_data('train.geojson', return_labels=True, mask=True)

        # Compute features from each training polygon
        compute_features = features.pool_basic
        X = []
        for raster in train_rasters:
            X.append(compute_features(raster))

        # Create classifier object.
        c = RandomForestClassifier(n_estimators = n_estimators)

        # Train the classifier
        X, train_labels = np.nan_to_num(np.array(X)), np.array(train_labels)
        c.fit(X, train_labels)

        # Pickle classifier and save to output dir
        with open(os.path.join(output_dir, 'classifier.pkl'), 'wb') as f:
            pickle.dump(c, f)


if __name__ == "__main__":
    with RfPoolClassifier() as task:
        task.invoke()

Here is what’s going on in the script:

We define the RfPoolClassifier class that inherits from GbdxTaskInterface, and read the input ports.

class RfPoolClassifier(GbdxTaskInterface):

    def invoke(self):

        # Get inputs
        n_estimators = int(self.get_input_string_port('n_estimators', default = '100'))
        img_dir = self.get_input_data_port('image')
        img = os.path.join(img_dir, os.listdir(img_dir)[0])

        geojson_dir = self.get_input_data_port('geojson')
        geojson = os.path.join(geojson_dir, os.listdir(geojson_dir)[0])

We move all the input files into the same directory (this particular implementation wants them in one place) and create the output directory.

# Move geojson to same dir as img
move(geojson, img_dir)

# Navigate to directory with input data
os.chdir(img_dir)

# Create output directory
output_dir = self.get_output_data_port('trained_classifier')
os.makedirs(output_dir)

Using the mltools.data_extractors module, the pixels corresponding to each polygon in train.geojson are extracted and stored in masked numpy arrays. For each array, a 4-dim feature vector is computed by the function features.pool_basic and stored in the list X.

# Get training data from the geojson input
train_rasters, train_labels = de.get_data('train.geojson', return_labels=True, mask=True)

# Compute features from each training polygon
compute_features = features.pool_basic
X = []
for raster in train_rasters:
    X.append(compute_features(raster))

We create an instance of the sklearn Random Forest Classifier class and train it using X and corresponding labels.

# Create classifier object.
c = RandomForestClassifier(n_estimators = n_estimators)

# Train the classifier
X, train_labels = np.nan_to_num(np.array(X)), np.array(train_labels)
c.fit(X, train_labels)

We save the trained classifier to the output directory port.

# Pickle classifier and save to output dir
with open(os.path.join(output_dir, 'classifier.pkl'), 'wb') as f:
    pickle.dump(c, f)

Finally, we call the invoke() function when the script is run.

if __name__ == "__main__":
    with RfPoolClassifier() as task:
        task.invoke()

The Docker Image

rf-pool-classifier requires more libraries than hello-gbdx, such as numpy and mltools. We build the Docker image rf-pool-classifier-docker-image by pulling ubuntu:xenial and installing the required libraries.

# Pull and tag the Docker image
docker pull ubuntu:xenial
docker tag ubuntu:xenial <your_username>/rf-pool-classifier-docker-image

# Run the container
docker run -it <your_username>/rf-pool-classifier-docker-image

# Install packages
root@5d4ae93d26dd:/ apt-get update && apt-get install -y git vim python ipython \
                      build-essential python-software-properties \
                      software-properties-common python-pip python-scipy \
                      gdal-bin python-gdal libgdal-dev
root@5d4ae93d26dd:/ pip install gdal numpy ephem psycopg2
root@5d4ae93d26dd:/ pip install git+https://github.com/DigitalGlobe/mltools

# Exit the container and commit changes to a new image name
root@5d4ae93d26dd:/ exit
docker commit -m 'install rf classifier packages' <container_id> <your_username>/rf-pool-classifier-docker-image

We are now ready to copy rf-pool-classifier.py and gbdx_task_interface.py to rf-pool-classifier-docker-image. Make sure both scripts are saved to your working directory and execute the following:

# Run Docker in detached mode
docker run --rm -itd <your_username>/rf-pool-classifier-docker-image

# Copy the script to the container and commit the changes
docker cp rf-pool-classifier.py <container_id>:/
docker cp gbdx_task_interface.py <container_id>:/
docker commit -m 'copy rf_pool_classifier script' <container_id> <your_username>/rf-pool-classifier-docker-image

You can also build rf-pool-classifier-docker-image from scratch using the Dockerfile below. The Dockerfile should be in your working directory and your scripts should reside in ./bin for this to work.

FROM ubuntu:xenial

# Install Python and GDAL packages
RUN apt-get update && apt-get install -y \
    git \
    vim \
    python \
    ipython \
    build-essential \
    python-software-properties \
    software-properties-common \
    python-pip \
    python-scipy \
    python-dev \
    gdal-bin \
    python-gdal \
    libgdal-dev

# install ml dependencies
RUN pip install gdal numpy ephem psycopg2
RUN pip install git+https://github.com/DigitalGlobe/mltools

# put code into image
ADD ./bin /

To build the image:

docker build -t <your_username>/rf-pool-classifier-docker-image .

The Dockerfile and scripts can be found here.

Testing the Docker Image

We can now test rf-pool-classifier-docker-image on our local machine before defining rf-pool-classifier and registering it on the platform. Just as in the case of hello-gbdx, we will mimic the platform by mounting the sample input to a container and then executing rf-pool-classifier.py.

Create the directory rf_pool_classifier_test and the subdirectories geojson and image.

# create input port directories
mkdir rf_pool_classifier_test
cd rf_pool_classifier_test
mkdir geojson
mkdir image

The training data for the classifier needs to be saved under geojson. Download the file here and save it in this directory.

You will need the multispectral image with catalog id 1040010014800C00. Order this image, run it through the advanced image preprocessor, and save it in your bucket as follows:

catid = '1040010014800C00'

order = gbdx.Task('Auto_Ordering', cat_id=catid)
order.impersonation_allowed = True

aop = gbdx.Task('AOP_Strip_Processor',
                data=order.outputs.s3_location.value,
                bands='MS',
                enable_dra=False,
                enable_pansharpen=False)      

wf = gbdx.Workflow([order, aop])
output_location = 'platform-stories/rf-pool-classifier/image'
wf.savedata(aop.outputs.data, output_location)
wf.execute()

Note that this workflow can take up to a couple of hours to run. When the workflow is complete, download the image file locally, save it in the image directory and rename it to 1040010014800C00.tif. Note that the tif file is large (~13GB), so you will need adequate disk space.
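
Here is a hedged sketch of the download-and-rename step, assuming the output location used in the workflow above and that AOP produced a single tif (the exact filename, and whether it sits in a subdirectory, may vary):

import os
from gbdxtools import Interface

gbdx = Interface()

# Download the AOP output into a local image/ directory
gbdx.s3.download('platform-stories/rf-pool-classifier/image', local_dir='image')

# Find the tif (it may sit in a subdirectory) and rename it to the catalog id
for root, dirs, files in os.walk('image'):
    for name in files:
        if name.endswith('.tif'):
            os.rename(os.path.join(root, name),
                      os.path.join('image', '1040010014800C00.tif'))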

Run rf-pool-classifier-docker-image with rf_pool_classifier_test mounted to the input port.

docker run --rm -v ~/<full/path/to/rf_pool_classifier_test>:/mnt/work/input -it <your_username>/rf-pool-classifier-docker-image

Within the container run rf-pool-classifier.py.

python /rf-pool-classifier.py

The script should run without errors. To confirm this, check the output port directory for classifier.pkl.

root@91d9d5cd9570:/ ls mnt/work/output/trained_classifier
>>> classifier.pkl

You may now exit the container and define and register rf-pool-classifier!

Task Definition

The definition for rf-pool-classifier is provided below:

{
    "name": "rf-pool-classifier",
    "version": "0.0.1",
    "description": "Train a random forest classifier to classify polygons in those that contain pools and those that do not.",
    "properties": {
        "isPublic": false,
        "timeout": 7200
    },
    "inputPortDescriptors": [
        {
            "name": "image",
            "type": "directory",
            "description": "Contains the image strip where the polygons are found.",
            "required": true
        },
        {
            "name": "geojson",
            "type": "directory",
            "description": "Contains a geojson with labeled polygons. Each polygon has the properties feature_id, image_id, and class_name (either 'No swimming pool' or 'Swimming pool')",
            "required": true
        },
        {
            "name": "n_estimators",
            "type": "string",
            "description": "Number of trees to use in the random forest classifier. Defaults to 100.",
            "required": false
        }
    ],
    "outputPortDescriptors": [
        {
            "name": "trained_classifier",
            "type": "directory",
            "description": "Contains the file 'classifier.pkl' which is the trained random forest classifier."
        }
    ],
    "containerDescriptors": [
        {
            "type": "DOCKER",
            "properties": {
                "image": "platformstories/rf-pool-classifier-docker-image"
            },
            "command": "python /rf-pool-classifier.py",
            "isPublic": true
        }
    ]
}

Put rf-pool-classifier-definition.json in your working directory and register rf-pool-classifier as follows:

from gbdxtools import Interface
gbdx = Interface()

# register the task using rf-pool-classifier-definition.json
gbdx.task_registry.register(json_filename = 'rf-pool-classifier-definition.json')

Executing the Task

We will now run through a sample execution of rf-pool-classifier using gbdxtools.

Open an IPython terminal, create a GBDX interface and specify the task input location.

from gbdxtools import Interface
from os.path import join
import uuid

gbdx = Interface()

# specify location
input_location = 's3://gbd-customer-data/32cbab7a-4307-40c8-bb31-e2de32f940c2/platform-stories/rf-pool-classifier'

Create an rf_task object and specify the inputs.

rf_task = gbdx.Task('rf-pool-classifier')
rf_task.inputs.image = join(input_location, 'image')
rf_task.inputs.geojson = join(input_location, 'geojson')
rf_task.inputs.n_estimators = "1000"

Create a single-task workflow object and define where the output data should be saved.

workflow = gbdx.Workflow([rf_task])
random_str = str(uuid.uuid4())
output_location = join('platform-stories/trial-runs', random_str)

workflow.savedata(rf_task.outputs.trained_classifier, output_location)

Execute the workflow and monitor its status as follows:

workflow.execute()
workflow.status

Once the workflow is complete, you can download classifier.pkl locally as follows:

gbdx.s3.download(output_location)

Done! At this point we have created and executed a simple ML task on the platform. In the next section, we will cover how to make use of the GPU for compute-intensive algorithms that rely on it.

Using the GPU

Until now, we have been running our tasks on a CPU device. For very compute-intensive ML applications such as deep learning, the GPU offers an order-of-magnitude performance improvement over the CPU. GBDX provides the capability of running a task on a GPU worker. This requires configuring the Docker image and defining the task appropriately.

This section will walk you through setting up a GPU instance for building and testing your Docker image, and then building a GPU-compatible Docker image, i.e., a Docker image that can access the GPU of the node on which it is run.

Setting up a GPU instance

All GBDX GPU workers use nvidia-docker to allow running containers to leverage their GPU devices. Here are the steps for setting up an AWS instance with nvidia-docker, which you will need to test your GPU-compatible Docker image.

On AWS, launch an Ubuntu 16.04 instance on a GPU device of type g2.2xlarge. At least 20GB of storage is recommended. Then ssh into your instance.

ssh -i <path/to/key_pair> ubuntu@<instance_id>

Update APT.

sudo apt-get update && sudo apt-get -y upgrade

# Clean up
sudo apt-get clean

Install NVIDIA drivers through Ubuntu.

sudo apt install ubuntu-drivers-common
sudo ubuntu-drivers autoinstall

Install CUDA.

wget http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1604/x86_64/cuda-repo-ubuntu1604_8.0.44-1_amd64.deb
sudo dpkg -i cuda-repo-ubuntu1604_8.0.44-1_amd64.deb
sudo apt-get update
sudo apt-get install cuda

Verify CUDA installation.

nvidia-smi

# You should see output similar to the following:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.66                 Driver Version: 375.66                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GRID K520           Off  | 0000:00:03.0     Off |                  N/A |
| N/A   36C    P0    45W / 125W |      0MiB /  4036MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Install Docker on your instance.

sudo apt-get install \
    apt-transport-https \
    ca-certificates \
    curl \
    software-properties-common

# Add the Docker GPG key
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -

Set up the stable Docker repository:

sudo add-apt-repository \
   "deb [arch=amd64] https://download.docker.com/linux/ubuntu \
   $(lsb_release -cs) \
   stable"

Update APT and install Docker.

sudo apt-get update
sudo apt-get install docker-ce

Instruct docker to run without sudo privileges.

sudo groupadd docker
sudo usermod -aG docker $USER

# Reboot the instance
sudo reboot

Log back in to the instance and ensure Docker was installed properly by running the ‘hello-world’ container. You should see a message indicating that the installation was successful.

docker run hello-world

Finally, install nvidia-docker on the instance.

# Install nvidia-docker and nvidia-docker-plugin
wget -P /tmp https://github.com/NVIDIA/nvidia-docker/releases/download/v1.0.1/nvidia-docker_1.0.1-1_amd64.deb
sudo dpkg -i /tmp/nvidia-docker*.deb && rm /tmp/nvidia-docker*.deb

# Test nvidia-smi
nvidia-docker run --rm nvidia/cuda nvidia-smi

# You should see output similar to the following:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.66                 Driver Version: 375.66                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GRID K520           Off  | 0000:00:03.0     Off |                  N/A |
| N/A   35C    P8    18W / 125W |      0MiB /  4036MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Building a GPU-Compatible Image

Now that we have an instance on which to test our GPU tasks, we can create a Docker image that can access the GPU. To do this we simply use the Docker image nvidia/cuda as a base and install any necessary dependencies. nvidia/cuda is set up such that it can run seamlessly on any device using the nvidia-docker plugin, saving us the headache of matching drivers between the Docker image, GBDX worker nodes, and our GPU instance.

Here we will create a Docker image ‘gbdx-gpu’ with the Tensorflow and Keras Python libraries installed, as these are required by the example task presented in the next section. While still logged into the AWS instance, execute the following steps:

Login to Docker on the running instance.

docker login -u <your_username> -p <your_password>

Pull the nvidia/cuda image created for CUDA 8.0 from DockerHub and tag it under a new name.

docker pull nvidia/cuda:8.0-cudnn6-runtime-ubuntu16.04
docker tag nvidia/cuda:8.0-cudnn6-runtime-ubuntu16.04 <your_username>/gbdx-gpu

Run a container from the image.

docker run -it <your_username>/gbdx-gpu

Install Python libraries and other machine learning dependencies.

root@a28d273819ea:/ apt-get update && apt-get -y install python build-essential python-software-properties software-properties-common ipython python-pip python-scipy python-numpy python-dev vim wget
root@a28d273819ea:/ pip install keras tensorflow-gpu h5py

Create or update the Keras config file. This will instruct Keras to use the Tensorflow backend.

root@a28d273819ea:/ mkdir /root/.keras/
root@a28d273819ea:/ vim /root/.keras/keras.json

Remove any contents of the file and paste the following:

{
    "image_dim_ordering": "tf",
    "epsilon": 1e-07,
    "floatx": "float32",
    "backend": "tensorflow"
}

Exit the container and commit your changes to gbdx-gpu.

root@a28d273819ea:# exit

docker commit -m 'install dependencies, add keras.json' <container_id> <your_username>/gbdx-gpu

We now have the Docker image gbdx-gpu that can run Tensorflow and Keras on a GPU. See here for how to build this image with a Dockerfile. In the following section, we will create a GBDX task that trains a CNN classifier using the GPU.

Convolutional Neural Network

We are going to use the tools created above to build the task ‘train-cnn’, which trains a CNN classifier using the GPU. The task uses input images and labels to produce a trained model (Figure 5).

Figure 5: Inputs and output of train-cnn.

train-cnn has a single directory input port train_data. The task expects to find the following two files within train_data:

  • X.npz: Training images as a numpy array saved in npz format. The array should have the following dimensional ordering: (num_images, num_bands, img_rows, img_cols).
  • y.npz: Class labels corresponding to training images as a numpy array saved in npz format.

train-cnn also has the optional string input ports bit_depth and nb_epoch. The former specifies the bit depth of the imagery and defaults to ‘8’; the latter defines the number of training epochs and defaults to ‘10’. The task produces a trained model saved as the single file model.h5, which contains both the model architecture and the trained weights, and is stored in the S3 location specified by the output port trained_model.
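
To illustrate the expected format, here is a sketch of how training data could be packaged from the MNIST dataset used later in this walkthrough (this assumes keras is installed locally; the sample inputs in this repo have already been prepared for you):

import numpy as np
from keras.datasets import mnist

# Load MNIST: X has shape (60000, 28, 28), y has shape (60000,)
(X, y), _ = mnist.load_data()

# Insert a band axis to get (num_images, num_bands, img_rows, img_cols)
X = X[:, np.newaxis, :, :]

# Save in npz format; train-cnn reads these back via np.load(...)['arr_0']
np.savez('X.npz', X)
np.savez('y.npz', y)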

The Code

The code of train-cnn.py is shown below.

import os
import json
import numpy as np

from gbdx_task_interface import GbdxTaskInterface
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation, Flatten
from keras.layers import Convolution2D, MaxPooling2D
from keras.utils import np_utils


class TrainCnn(GbdxTaskInterface):

    def invoke(self):

        # Get string inputs
        nb_epoch = int(self.get_input_string_port('nb_epoch', default = '10'))
        bit_depth = int(self.get_input_string_port('bit_depth', default = '8'))

        # Get training data from the input data dir
        train = self.get_input_data_port('train_data')
        X_train = np.load(os.path.join(train, 'X.npz'))['arr_0']
        y_train = np.load(os.path.join(train, 'y.npz'))['arr_0']
        nb_classes = len(np.unique(y_train))

        # Reshape for input to net, normalize based on bit_depth
        if len(X_train.shape) == 3:
            X_train = X_train.reshape(X_train.shape[0], 1, X_train.shape[1], X_train.shape[2])

        X_train = X_train.astype('float32')
        X_train /= float((2 ** bit_depth) - 1)
        X_train = np.swapaxes(X_train, 1, -1)

        # convert class vectors to binary class matrices
        Y_train = np_utils.to_categorical(y_train, nb_classes)

        # Create basic Keras model
        model = Sequential()

        model.add(Convolution2D(32, 3, 3, border_mode='valid',
                                input_shape=(X_train.shape[1:])))
        model.add(Activation('relu'))
        model.add(Convolution2D(32, 3, 3))
        model.add(Activation('relu'))
        model.add(MaxPooling2D(pool_size=(2,2)))
        model.add(Dropout(0.25))

        model.add(Flatten())
        model.add(Dense(128))
        model.add(Activation('relu'))
        model.add(Dropout(0.5))
        model.add(Dense(nb_classes))
        model.add(Activation('softmax'))

        # Compile model
        model.compile(loss='categorical_crossentropy',
                      optimizer='adadelta',
                      metrics=['accuracy'])

        # Fit model on input data
        model.fit(X_train, Y_train, batch_size=128, epochs=nb_epoch,
              verbose=1)

        # Create the output directory
        output_dir = self.get_output_data_port('trained_model')
        os.makedirs(output_dir)

        # Save the model architecture and weights to output dir
        model.save(os.path.join(output_dir, 'model.h5'))


if __name__ == '__main__':
    with TrainCnn() as task:
        task.invoke()

Here is what is happening in train-cnn.py:

Define the TrainCnn class that inherits from GbdxTaskInterface, read the input ports, and load the images and labels to X_train and y_train, respectively.

class TrainCnn(GbdxTaskInterface):

    def invoke(self):

        # Get string inputs
        nb_epoch = int(self.get_input_string_port('nb_epoch', default = '10'))
        bit_depth = int(self.get_input_string_port('bit_depth', default = '8'))

        # Get training data from the input data dir
        train = self.get_input_data_port('train_data')
        X_train = np.load(os.path.join(train, 'X.npz'))['arr_0']
        y_train = np.load(os.path.join(train, 'y.npz'))['arr_0']
        nb_classes = len(np.unique(y_train))

Put X_train and y_train into a format that the CNN will accept during training.

# Reshape for input to net, normalize based on bit_depth
if len(X_train.shape) == 3:
    X_train = X_train.reshape(X_train.shape[0], 1, X_train.shape[1], X_train.shape[2])

X_train = X_train.astype('float32')
X_train /= float((2 ** bit_depth) - 1)
X_train = np.swapaxes(X_train, 1, -1)

# convert class vectors to binary class matrices
Y_train = np_utils.to_categorical(y_train, nb_classes)

Create a Keras CNN model. Layers are added to the model to define its architecture, then training parameters are set using model.compile.

# Create basic Keras model
model = Sequential()

model.add(Convolution2D(32, 3, 3, border_mode='valid',
                        input_shape=(X_train.shape[1:])))
model.add(Activation('relu'))
model.add(Convolution2D(32, 3, 3))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2,2)))
model.add(Dropout(0.25))

model.add(Flatten())
model.add(Dense(128))
model.add(Activation('relu'))
model.add(Dropout(0.5))
model.add(Dense(nb_classes))
model.add(Activation('softmax'))

# Compile model
model.compile(loss='categorical_crossentropy',
              optimizer='adadelta',
              metrics=['accuracy'])

Train the model.

# Fit model on input data
model.fit(X_train, Y_train, batch_size=128, epochs=nb_epoch,
      verbose=1)

Save the trained model (architecture and weights together) to the output directory port as model.h5.

# Create the output directory
output_dir = self.get_output_data_port('trained_model')
os.makedirs(output_dir)

# Save the model to output dir
model.save(os.path.join(output_dir, 'model.h5'))
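
As an aside, if you ever need the architecture and weights as separate files rather than a single model.h5, Keras supports that as well; a minimal sketch, using the same output_dir as above:

# Alternative: save the architecture as JSON and the weights as HDF5
with open(os.path.join(output_dir, 'model_architecture.json'), 'w') as f:
    f.write(model.to_json())
model.save_weights(os.path.join(output_dir, 'model_weights.h5'))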

Call the invoke() function when the script is run.

if __name__ == '__main__':
    with TrainCnn() as task:
        task.invoke()

The Docker Image

train-cnn requires a Docker image that can access the GPU. We build the Docker image train-cnn-docker-image by pulling the Tensorflow/Keras gbdx-gpu image that we created above and copying in train-cnn.py and gbdx_task_interface.py.

Pull platformstories/gbdx-gpu from DockerHub if you do not already have it. Tag the image under your username and rename it to train-cnn-docker-image.

docker pull platformstories/gbdx-gpu
docker tag platformstories/gbdx-gpu <your_username>/train-cnn-docker-image

Run train-cnn-docker-image in detached mode and copy train-cnn.py and gbdx_task_interface.py.

# Run train-cnn-docker-image in detached mode
docker run -itd <your_username>/train-cnn-docker-image
>>> <container_id>

# Copy the code files into the container
docker cp train-cnn.py <container_id>:/
docker cp gbdx_task_interface.py <container_id>:/

Commit changes to train-cnn-docker-image and push it to DockerHub.

docker commit -m 'add train-cnn scripts' <container_id> <your_username>/train-cnn-docker-image
docker push <your_username>/train-cnn-docker-image

This image now has all of the libraries and scripts required by train-cnn. See here for a sample build of train-cnn-docker-image using a Dockerfile. Continue on to the next section to test the image using the GPU instance created above.

Testing the Docker Image

We will now test train-cnn-docker-image with sample input to ensure that train-cnn.py runs successfully AND that the GPU is utilized.

ssh into the AWS GPU instance and clone this repo so that the sample input is on the instance.

# ssh into the instance
ssh -i </path/to/key_pair> ubuntu@<your_instance_name>

# clone create-task repo
ubuntu@ip-00-000-00-000:~$ git clone https://github.com/PlatformStories/train-cnn

Pull train-cnn-docker-image from your DockerHub account onto the instance.

docker pull <your_username>/train-cnn-docker-image

Run a container from train-cnn-docker-image. This is where testing on a GPU differs from our previous tests: you must specify which GPU devices the container should use, via the flags returned by curl http://localhost:3476/v1.0/docker/cli.

docker run --rm `curl http://localhost:3476/v1.0/docker/cli` -v \
    ~/train-cnn/sample-input:/mnt/work/input/ -it \
    <your_username>/train-cnn-docker-image /bin/bash

Run train-cnn.py. In this step we confirm that the container is using the GPU and that the script runs without errors.

root@984b2508233b:/build# python /train-cnn.py

# You should see the following in your output
Using TensorFlow backend.
TensorFlow device (/gpu:0) -> (device: 0, name: GRID K520, pci bus id: 0000:00:03.0)
Epoch 1/2
60000/60000 [==============================] - 7s - loss: 0.3776 - acc: 0.8827
Epoch 2/2
60000/60000 [==============================] - 7s - loss: 0.1431 - acc: 0.9570

The second line of STDOUT indicates that the container is indeed using the GPU! The lines that follow indicate that the model is being trained.

To confirm that the script was run successfully, check the output directory for the file model.h5.

# Look for the trained model in the output directory
root@984b2508233b:/build# ls /mnt/work/output/trained_model
>>> model.h5

Task Definition

Defining a task that runs on a GPU is very similar to defining regular tasks. The one difference is in the containerDescriptors section: you must set the ‘domain’ property to ‘nvidiag2’. There are multiple options for GPU domains on the platform: nvidiag2 and nvidiap2 support Docker images running CUDA 8, while nvidiagpu supports CUDA 7.5. Here is the definition for train-cnn:

{
    "name": "train-cnn",
    "version": "0.0.1",
    "description": "Train a convolutional neural network classifier on the MNIST data set.",
    "properties": {
        "isPublic": false,
        "timeout": 36000
    },
    "inputPortDescriptors": [
        {
            "name": "train_data",
            "type": "directory",
            "description": "Contains training images X.npz and corresponding labels y.npz.",
            "required": true
        },
        {
            "name": "nb_epoch",
            "type": "string",
            "description": "Number of training epochs to perform during training. Defaults to 10.",
            "required": false
        },
        {
            "name": "bit_depth",
            "type": "string",
            "description": "Bit depth of the input images. This parameter is necessary for proper normalization. Defaults to 8."
        }
    ],
    "outputPortDescriptors": [
        {
            "name": "trained_model",
            "type": "directory",
            "description": "Contains the fully trained model with the architecture stored as model_arch.json and the weights as model_weights.h5."
        }
    ],
    "containerDescriptors": [
        {
            "type": "DOCKER",
            "properties": {
                "image": "platformstories/train-cnn-docker-image",
                "domain": "nvidiag2"
            },
            "command": "python /train-cnn.py",
            "isPublic": true
        }
    ]
}

Now all we have to do is register train-cnn; follow the same steps as for hello-gbdx and rf-pool-classifier.

Executing the Task

It’s time to try out train-cnn. We’ll use the publicly available MNIST dataset, which contains 60,000 images of handwritten digits, to train a model to recognize digits. Figure 6 shows some example images.

Figure 6: Sample digit images from the MNIST dataset.

Open an IPython terminal, create a GBDX interface and get the input location.

from gbdxtools import Interface
from os.path import join
import uuid

gbdx = Interface()

input_location = 's3://gbd-customer-data/32cbab7a-4307-40c8-bb31-e2de32f940c2/platform-stories/train-cnn'

Create a cnn_task object and specify the inputs.

cnn_task = gbdx.Task('train-cnn')
cnn_task.inputs.train_data = join(input_location, 'train_data')
cnn_task.inputs.bit_depth = '8'
cnn_task.inputs.nb_epoch = '15'

Create a single-task workflow object and specify where the output data should be saved.

workflow = gbdx.Workflow([cnn_task])
random_str = str(uuid.uuid4())
output_location = join('platform-stories/trial-runs', random_str)

workflow.savedata(cnn_task.outputs.trained_model, output_location)

Execute the workflow and monitor the status as follows:

workflow.execute()

# monitor workflow status
workflow.status

Once the workflow is complete, download the trained model:

gbdx.s3.download(output_location)

Congratulations, you have now created, registered, and executed a GBDX task on a GPU!

Finally

For the latest updates on GBDX check out our release notes. If you find any problems in this walkthrough, please submit an issue here.

For general GBDX support, contact gbdx-support.

Written on September 30, 2016