In my previous blog post, I introduced Dex, our service management platform, which is at the core of many of our advanced analytics solutions. Dex manages the coordination of data acquisition, ingestion, transformation, and modeling for our system via processing steps that are often deployed as Docker containers for execution on dedicated servers, EC2 instances, or as AWS Fargate tasks. One of the core applications we leverage for many of these steps is the open source KNIME analytics platform (www.knime.org).
KNIME’s easy-to-use GUI allows us to quickly develop and test ETL and modeling workflows. Once these workflows are done, we deploy them inside Dex to be run in Docker containers using KNIME’s batch execution mode. The purpose of this article is to provide a guide to building a Docker container that executes KNIME workflows in a headless environment. This article details a very basic setup for the Docker container and does not cover things like user accounts and security credentials, which may be required for a more robust, production-ready deployment.
By running a KNIME workflow in a Docker container, it becomes possible to automate the scheduling and execution of workflows within a variety of cloud environments such as AWS Fargate and Google Kubernetes Engine. This is an excellent way to automate background data processing and machine learning workflows. For shared and collaborative workflows, the KNIME Server enterprise application should be considered.
The first step is to prepare a Docker container with a Java Runtime Environment that will be used to execute KNIME. We will also install the curl utility so that we can download the KNIME distribution directly into the container and install it from there. For this blog, we will use an Ubuntu container as our base, but a lighter-weight, more Java-centric container could be used instead. Our initial Dockerfile looks like this:
FROM ubuntu:18.04
RUN apt-get update
RUN apt-get install -y default-jre curl
With the basic container prepared, the next step is to define some Docker ENV variables that we can use in the Dockerfile both while building the final container and when executing it. These variables configure the installation directory of KNIME as well as the location of the workspace KNIME will execute its workflows from. These variables are set as follows:
ENV DOWNLOAD_URL https://download.knime.org/analytics-platform/linux/knime_3.7.2.linux.gtk.x86_64.tar.gz
ENV INSTALLATION_DIR /usr/local
ENV KNIME_DIR $INSTALLATION_DIR/knime_3.7.2
ENV WORKSPACE=/root/knime-workspace
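Note that the KNIME version appears in several of these values, and again in the plugin repository URL used later in this article. The following is a minimal shell sketch, not part of the original setup, that derives them from a single version string. It assumes the URL patterns shown for 3.7.2 also hold for other releases, which you should confirm against the KNIME download site; the function names are our own.

```shell
# Sketch: derive the version-dependent values in this article from one version string.
# Assumption: the URL patterns seen for 3.7.2 also apply to the version passed in.

knime_download_url() {
    # $1 = full version, e.g. 3.7.2
    printf 'https://download.knime.org/analytics-platform/linux/knime_%s.linux.gtk.x86_64.tar.gz\n' "$1"
}

knime_install_path() {
    # $1 = installation directory, $2 = full version
    printf '%s/knime_%s\n' "$1" "$2"
}

knime_update_site() {
    # The plugin repository URL uses only major.minor (3.7.2 -> 3.7).
    printf 'http://update.knime.org/analytics-platform/%s\n' \
        "$(printf '%s' "$1" | cut -d. -f1,2)"
}
```

Calling these functions with 3.7.2 and /usr/local reproduces the values hard-coded in this article.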
Installing KNIME is straightforward: download a KNIME distribution and extract the archive into the desired installation location inside the container:
RUN curl -L "$DOWNLOAD_URL" | tar vxz -C $INSTALLATION_DIR
At this point we now have a container with the base KNIME installation and its default set of processing nodes. However, KNIME workflows often use some of the large number of optional nodes which are distributed as plugins for the platform. We can install any number of these plugins via the KNIME command line, but we first need to identify which ones we require as each plugin has a unique plugin identifier.
To find this identifier for a given plugin, use an instance of the KNIME application GUI which has the desired plugins installed. To see the list of plugins, bring up the “About KNIME” dialog and click on the “Installation Details” button at its bottom left. This will display a secondary dialog where you will select the “Plug-ins” tab which will display a list of all plugins installed in the application.
For this example, we are going to assume our workflows require the KNIME Amazon Cloud Connectors because they work with data in S3. We will use the filter text box at the top of the dialog to narrow down the list to AWS-related plugins.
From this dialog we can see that the plugin identifier for the KNIME Amazon Cloud Connectors is “org.knime.cloud.aws” so this is the identifier we want to reference in our Dockerfile command to install a plugin via the KNIME command line:
RUN $KNIME_DIR/knime -application org.eclipse.equinox.p2.director \
    -i org.knime.cloud.aws -r http://update.knime.org/analytics-platform/3.7
This command tells KNIME to install the plugin with the given identifier (-i option) by downloading it from the specified plugin repository (-r option). You may install any number of plugins using this technique by repeating this command with a different plugin identifier for each desired plugin. Plugin installation can take a bit of time and you will often see warnings about KNIME being unable to open the display. These warnings are merely a side effect of doing the plugin installation via the command line and may safely be ignored.
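As an alternative to repeating the command, the Eclipse p2 director that this command invokes accepts a comma-separated list of identifiers for its -i option. The sketch below only prints the resulting command so it can be reviewed and pasted into a RUN step. The helper names are our own, and `org.example.other.plugin` is a placeholder, not a real KNIME plugin.

```shell
# Join plugin identifiers with commas, the list format accepted by the p2 director's -i option.
join_plugins() {
    joined=""
    for id in "$@"; do
        joined="${joined:+$joined,}$id"
    done
    printf '%s\n' "$joined"
}

# Print (do not run) a single install command covering all given plugins.
# $1 = KNIME directory, $2 = plugin repository URL, remaining args = plugin identifiers
director_command() {
    install_dir="$1"
    repo="$2"
    shift 2
    printf '%s/knime -application org.eclipse.equinox.p2.director -i %s -r %s\n' \
        "$install_dir" "$(join_plugins "$@")" "$repo"
}

# Example (second identifier is a placeholder):
# director_command /usr/local/knime_3.7.2 \
#     http://update.knime.org/analytics-platform/3.7 \
#     org.knime.cloud.aws org.example.other.plugin
```

Installing everything in one step also keeps the image to a single layer for plugins, at the cost of rerunning the whole installation whenever the list changes.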
To run a workflow in the container, the workflow needs to be copied into the container. KNIME’s batch mode execution supports running workflows either in their normal form, which is a workflow directory, or as a self-contained ZIP file, which is created by exporting a workflow from the KNIME GUI. In this example, we will create a workspace in the container (referencing the previously defined ENV variable) and copy in a ZIP file containing a sample workflow:
RUN mkdir $WORKSPACE
ADD ./HelloWorld.zip $WORKSPACE
Finally, we will register a default command with the container which will run our sample workflow in KNIME’s batch mode when the container is executed without an alternative command being specified at the time of execution. This command includes several options to disable some of the features of the KNIME GUI as well as ensure that the workflow being run is reset before being executed and its state is not saved after execution.
CMD $KNIME_DIR/knime --launcher.suppressErrors -nosave -reset -nosplash \
    -application org.knime.product.KNIME_BATCH_APPLICATION \
    -workflowFile=$WORKSPACE/HelloWorld.zip
Because we are running the workflow from a ZIP file, we are using the “-workflowFile” option. If the workflow were copied into the container as a directory, we would use the “-workflowDir” option instead. KNIME supports several other options that let you pass both credentials and workflow variables via the command line. These options are documented in the KNIME FAQ at https://www.knime.com/faq#q12.
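To make the choice between the two workflow options concrete, here is a small shell sketch that picks the option from the path and formats a workflow variable. To our understanding, the FAQ describes the variable option as a comma-separated name, value, and type; treat that syntax as an assumption to confirm there. The function names and the `input_bucket` variable are hypothetical.

```shell
# Choose -workflowFile for an exported ZIP and -workflowDir for a workflow directory.
workflow_option() {
    case "$1" in
        *.zip) printf '%s=%s\n' '-workflowFile' "$1" ;;
        *)     printf '%s=%s\n' '-workflowDir' "$1" ;;
    esac
}

# Format one workflow variable option.
# Assumed syntax (confirm against the KNIME FAQ): -workflow.variable=name,value,type
workflow_variable() {
    printf '%s=%s,%s,%s\n' '-workflow.variable' "$1" "$2" "$3"
}

# Example: print the extra arguments for a batch run.
# workflow_option /root/knime-workspace/HelloWorld.zip
# workflow_variable input_bucket my-bucket String
```

The printed options can be appended to the batch command shown above, either in the CMD line or as arguments supplied when the container is started.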
At this point, we now have a complete Dockerfile for the execution of a single KNIME workflow inside a Docker container in a headless environment:
FROM ubuntu:18.04
RUN apt-get update
RUN apt-get install -y default-jre curl
ENV DOWNLOAD_URL https://download.knime.org/analytics-platform/linux/knime_3.7.2.linux.gtk.x86_64.tar.gz
ENV INSTALLATION_DIR /usr/local
ENV KNIME_DIR $INSTALLATION_DIR/knime_3.7.2
ENV HOME_DIR /home/knime
ENV WORKSPACE=/root/knime-workspace
RUN curl -L "$DOWNLOAD_URL" | tar vxz -C $INSTALLATION_DIR
RUN $KNIME_DIR/knime -application org.eclipse.equinox.p2.director \
    -i org.knime.cloud.aws -r http://update.knime.org/analytics-platform/3.7
RUN mkdir $WORKSPACE
ADD ./HelloWorld.zip $WORKSPACE
CMD $KNIME_DIR/knime --launcher.suppressErrors -nosave -reset -nosplash \
    -application org.knime.product.KNIME_BATCH_APPLICATION \
    -workflowFile=$WORKSPACE/HelloWorld.zip
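With the Dockerfile and HelloWorld.zip in the same directory, building and running the container takes two standard Docker commands. The wrapper functions below are a minimal sketch; the image tag `knime-hello:latest` and the function names are our own choices, not something the setup requires.

```shell
# Hypothetical image tag; override by setting IMAGE_TAG before sourcing this file.
IMAGE_TAG="${IMAGE_TAG:-knime-hello:latest}"

build_image() {
    # $1 = build context containing the Dockerfile and HelloWorld.zip (default: current directory)
    docker build -t "$IMAGE_TAG" "${1:-.}"
}

run_workflow() {
    # --rm removes the stopped container afterwards.
    # Any extra arguments replace the default CMD defined in the Dockerfile.
    docker run --rm "$IMAGE_TAG" "$@"
}

# Usage:
#   build_image .
#   run_workflow
```

Because `docker run` normally exits with the status of the container's command, a scheduler can use the exit status of `run_workflow` to detect a failed run.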
This is a simplistic example to show the general steps; in a production system you will want to build a more sophisticated container. For example, our Dex service management platform includes a general-purpose KNIME Docker container with a bootstrap application that downloads the workflow to execute from a Dex instance. This eliminates the need to rebuild and redeploy the container any time the workflow is revised and enables one Docker container to support any number of KNIME workflows. Furthermore, the container’s KNIME installation includes a large number of custom KNIME nodes we built, which our workflows leverage to interact with Dex to manage configuration, state, and security credentials as they execute.
By running a KNIME workflow via a Docker container, we’ve created a consistent environment that allows the workflow to execute without the need for user interaction. Furthermore, we can now leverage any number of Docker-centric cloud technologies to schedule and run the container, which can vastly reduce cost as there is no need to maintain dedicated processing resources for the workflow.
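To illustrate the general idea of fetching the workflow at start-up rather than baking it into the image, here is a deliberately simple shell sketch. It is not the Dex bootstrap application: the function names, the `KNIME_BIN` override, and the `workflow.zip` file name are all hypothetical, and a real deployment would also need authentication and error reporting.

```shell
# Download an exported workflow ZIP.
# $1 = URL of the ZIP, $2 = destination file
fetch_workflow() {
    # -f makes curl return a non-zero status on HTTP errors.
    curl -fsSL -o "$2" "$1"
}

# Fetch the workflow, then run it with the same batch options used in this article.
# KNIME_BIN is a hypothetical override; by default the launcher under $KNIME_DIR is used.
run_downloaded_workflow() {
    url="$1"
    target="${WORKSPACE:-/root/knime-workspace}/workflow.zip"
    fetch_workflow "$url" "$target" || return 1
    "${KNIME_BIN:-$KNIME_DIR/knime}" --launcher.suppressErrors -nosave -reset -nosplash \
        -application org.knime.product.KNIME_BATCH_APPLICATION \
        -workflowFile="$target"
}

# Usage inside a container entrypoint (URL is a placeholder):
#   run_downloaded_workflow "https://example.com/workflows/HelloWorld.zip"
```

With an entrypoint along these lines, the same image can run different workflows simply by changing the URL supplied when the container is started.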