Batch Execution of KNIME Workflows, Part 2


In an earlier blog posting I walked through the steps required to install KNIME in a Docker container so that it could be used to run workflows in batch mode. This is a very useful technique for automating workflow execution and leveraging cloud resources to do so. However, most workflows are not self-contained: they need access to configuration, external data, and local storage. I did not cover those aspects in my first posting, so this entry introduces ways to pass configuration into a workflow and to support the batch execution of KNIME workflows in a container.

Setting Flow Variables

The most common way to supply external configuration to a KNIME workflow is to set flow variables from the command line when KNIME is run in batch mode. This is done using the -workflow.variable option. Each instance of this option sets the value for a single variable, so it may be included any number of times on the command line. The format for the option value is as follows:

-workflow.variable=[Variable Name]:[Value]:[Type]

The [Variable Name] is simply the name of the KNIME flow variable as it is used in the workflow, and [Value] is the value to assign to it. Since all KNIME flow variables are strongly typed, you must also specify the variable's type in the final [Type] portion of the setting. The valid types for flow variables set from the command line are String, double, and int (i.e. the names of the standard Java types they are stored as internally).

As an example, say we have a workflow which will be running in AWS via a Fargate task and it needs to connect to an RDS instance.  If we’ve built the workflow to use the flow variables rds_host_name and rds_port_number in the database connector, we can specify the database host and port on the command line as follows:

$KNIME_DIR/knime --launcher.suppressErrors -nosave -reset -nosplash \
     -application org.knime.product.KNIME_BATCH_APPLICATION \
     -workflowFile=$WORKSPACE/ProcessData.zip \
     -workflow.variable=rds_host_name:some_db_id.us-east-1.rds.amazonaws.com:String \
     -workflow.variable=rds_port_number:5432:int

Note that this example assumes you have set the environment variables KNIME_DIR and WORKSPACE in your Docker environment so that they can be referenced in the default container command (my previous article on KNIME in Docker covers how to do so). One tip for facilitating development of the workflow in the KNIME UI is to set default values for the flow variables in the workflow properties.

The default values will be used when the workflow is executed in the UI, but when the workflow is run in batch mode, any values specified via the command line will override them.
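
For reference, here is a minimal sketch of how the KNIME_DIR and WORKSPACE environment variables and the default batch command might be declared in the container image. The base image name and paths are illustrative assumptions, not taken from the earlier article:

# Base image with KNIME installed; the image name and paths are assumptions.
FROM knime-batch-base:latest
ENV KNIME_DIR=/opt/knime
ENV WORKSPACE=/opt/knime-workspace
# Default command: run the workflow in batch mode using the variables above.
CMD $KNIME_DIR/knime --launcher.suppressErrors -nosave -reset -nosplash \
     -application org.knime.product.KNIME_BATCH_APPLICATION \
     -workflowFile=$WORKSPACE/ProcessData.zip \
     -workflow.variable=rds_host_name:some_db_id.us-east-1.rds.amazonaws.com:String \
     -workflow.variable=rds_port_number:5432:int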

Supplying Credentials

As with flow variables, security credentials can also be passed in via the command line:

-credential=[Credential Name]:[User]:[Password]

In order to reference credentials in a workflow node, you must first define the credential set via the workflow credentials settings for the workflow.

During development in the UI you will have to use the actual credentials, but before exporting the workflow to be run in batch mode, you should consider replacing them with dummy values so that your security secrets are not distributed in your container. As with flow variables, credentials specified via the command line will override the ones configured in the workflow settings.

Continuing the previous example of using RDS for a workflow database, we can configure the database node to use a credential set named rds_user and specify the user and password for this credential set on the command line as follows:

$KNIME_DIR/knime --launcher.suppressErrors -nosave -reset -nosplash \
     -application org.knime.product.KNIME_BATCH_APPLICATION \
     -workflowFile=$WORKSPACE/ProcessData.zip \
     -workflow.variable=rds_host_name:some_db_id.us-east-1.rds.amazonaws.com:String \
     -workflow.variable=rds_port_number:5432:int \
     -credential=rds_user:some_user:password123

Accessing Environment Variables

Container environment variables can also be used to supply values to a workflow. Inside the workflow, environment variables are read in Java Snippet-based nodes using the standard Java API.
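
As an illustration, the body of a Java Snippet might read a value from the container environment like this. This is a sketch: the output field name is a placeholder for whatever output column or flow variable you define in the snippet dialog.

// Read the RDS host name from the container environment, falling back to a
// development default when the variable is not set (e.g. when running in the UI).
String rdsHost = System.getenv("RDS_HOST_NAME");
if (rdsHost == null || rdsHost.isEmpty()) {
    rdsHost = "localhost";
}
// Assign the value to an output defined in the snippet dialog; this field
// name is illustrative.
out_rds_host_name = rdsHost;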

One of the advantages of using environment variables is that their values can be set on the development machine where the UI is being run, and they will be used during workflow development without being included in the distributed workflow. For execution inside of Docker, the values can either be baked into the container image or supplied when the container is executed. For example, in AWS Fargate you can configure them when you define the task in the Elastic Container Service.
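
For local testing, the same values can also be supplied directly on the Docker command line when the container is started. A minimal sketch, where the image name knime-batch is illustrative:

docker run \
     -e RDS_HOST_NAME=some_db_id.us-east-1.rds.amazonaws.com \
     -e RDS_PORT_NUMBER=5432 \
     knime-batch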

Local Storage

When running in a container, local storage capacity is dependent on both the container definition and the environment it is being run in.   For example, in AWS Fargate a container is provided with 15 GB of storage, but this total also includes whatever is required by the container image so the space available to your workflow will be less.

KNIME automatically manages the persisting of data between nodes when necessary. The determination of when data needs to be persisted to disk between nodes is based on several factors: the number of records, the memory policy of the nodes, and the available memory inside the Java Virtual Machine that KNIME is running in. Memory is often more abundant than local storage in a container execution environment (e.g., an AWS Fargate task can have up to 32 GB of memory as opposed to 15 GB of storage), so we want to keep all data in memory if possible. By changing the JVM's memory settings, KNIME's record limits for writing to disk, and the memory policy of our workflow nodes, and by using the streaming executor if available, we can often keep all table data in memory. This also has the added bonus of greatly enhancing the workflow's performance, since costly streaming of data to and from storage is avoided.

Rather than laying out these memory-related tuning options here, I will refer you to KNIME's excellent blog posting on them.
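
As a simple illustration of one of those knobs, the JVM heap available to a batch run can be raised with a standard -Xmx argument passed after -vmargs. The 28g value below is just an example sized for a 32 GB Fargate task, and note that -vmargs must come after all of the KNIME application arguments:

$KNIME_DIR/knime --launcher.suppressErrors -nosave -reset -nosplash \
     -application org.knime.product.KNIME_BATCH_APPLICATION \
     -workflowFile=$WORKSPACE/ProcessData.zip \
     -vmargs -Xmx28g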

In the cases where we cannot avoid persisting intermediate data to storage, an external data sink such as a temporary database or S3 storage should be considered. By using an external database or something like Athena on top of S3 storage, we can sometimes also push some of the processing into the storage layer as part of the queries that later steps use to retrieve the data (such as queries for statistics by arbitrary groupings).

Custom Logging

You can configure many container execution environments to record the execution logs for later review. For example, AWS Fargate tasks can write their container logs to CloudWatch log groups. This is extremely useful, but we've found that while KNIME's default logging configuration is great for its UI console, its output can be confusing when the logs are persisted in an environment such as CloudWatch that provides its own timestamps.

Fortunately, it is easy to override KNIME's logging configuration when you run KNIME in batch execution mode. This is done by pointing to a custom Log4J configuration file with a JVM argument when executing KNIME:

$KNIME_DIR/knime --launcher.suppressErrors -nosave -reset -nosplash \
     -application org.knime.product.KNIME_BATCH_APPLICATION \
     -workflowFile=$WORKSPACE/ProcessData.zip \
     -workflow.variable=rds_host_name:some_db_id.us-east-1.rds.amazonaws.com:String \
     -workflow.variable=rds_port_number:5432:int \
     -credential=rds_user:some_user:password123 \
     -vmargs -Dlog4j.configuration=$WORKSPACE/log4j3.xml

This example assumes that our custom configuration file is in the root directory of the KNIME workspace that we created in the container. To create a customized configuration file, the easiest approach is to copy the original configuration file (named log4j3.xml) out of the KNIME application and customize it as desired. When KNIME is running in batch execution mode, the console output format can be configured by changing the settings for the stdout and stderr appenders.
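
As an illustration, and assuming the copied file follows the standard Log4J 1.x XML layout that KNIME ships, the stdout appender could be given a message-only pattern so that CloudWatch's own timestamps are not duplicated. The fragment below is a sketch; the exact contents of log4j3.xml will vary by KNIME version:

<!-- Illustrative stdout appender with no timestamp in the pattern, since the
     container environment (e.g. CloudWatch) supplies its own timestamps. -->
<appender name="stdout" class="org.apache.log4j.ConsoleAppender">
   <layout class="org.apache.log4j.PatternLayout">
      <param name="ConversionPattern" value="%-5p %c{1} : %m%n"/>
   </layout>
</appender>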

When examining the logs from a workflow run in KNIME's batch mode, it is important to remember that KNIME only executes nodes sequentially when they are on the same control flow path in the workflow. If the workflow has multiple processing paths, they will often run in parallel and their log messages will be interleaved in the logging output. Fortunately, KNIME logs the unique node number for each node, and this number can be used to pinpoint which node was responsible for a given message in the log.

Conclusion

This blog posting covers the basic building blocks for creating a custom environment for the efficient, secure batch execution of KNIME workflows in a container environment. This is by no means a comprehensive summary of techniques. For our company's Dex automation platform, we have developed a very sophisticated KNIME Docker container that uses custom KNIME nodes to connect to our various platform services to do things like download the workflow to execute, receive temporary credentials dynamically, and provision and access cloud storage transparently. By starting simple and using these basic techniques, you too can evolve a container that works for your needs.

Paul Wisneskey
Software architect with over 25 years of experience designing and implementing large scale, reliable systems for big data search and analytics. Architect and principal developer of the Dex platform that powers NuWave Solution's machine learning and advanced analytics solutions.
