KNIME includes a wide variety of data manipulation and data science nodes, but there are cases where you may need to embed specific Python routines inside a workflow. More often, I test algorithms in Python but do not want to put the effort into orchestrating the data access, iteration, and parallel processing myself. KNIME offers several Python nodes, but first you will need to set up your Python environments.
KNIME requires you to have Python 2 and Python 3 environments running locally. If you have Anaconda 3.x, you will need to create a Python 2 environment with NumPy and pandas installed.
From the Anaconda prompt, run the following:
conda install nb_conda_kernels
conda create -n py27 python=2.7 ipykernel
conda activate py27
pip install numpy
pip install pandas
You cannot point KNIME to the Python executables directly if you are running on Windows; instead, you will need to create batch files that activate each environment.
The .bat files for the Python 3 (base) and Python 2 (py27) environments:
@REM Adapt the folder in the PATH to your system
@SET PATH=<Location of Anaconda3 install>\Scripts;%PATH%
@CALL activate <environment> || ECHO Activating python environment failed
Open the Preferences dialog within KNIME (Figure 1) and point the Python preferences to the .bat files. If your environment is configured correctly, KNIME will detect and display the Python version. You are now ready to build workflows containing Python nodes.
KNIME contains a variety of Python nodes, which are preconfigured for common uses. For example, the Python Learner node (Figure 2) has templates of several popular sklearn models including Decision Tree, Logistic Regression, and Random Forest.
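To give a feel for what such a template looks like, here is a minimal sketch of a Decision Tree learner script. It assumes, as in KNIME's Python scripting nodes, that the node supplies the training data as a pandas DataFrame named `input_table` and hands the trained model back through a variable named `output_model`; the stand-in DataFrame and the column name `target` are hypothetical, added only so the snippet is self-contained.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Stand-in for the table KNIME would inject into the node as input_table.
input_table = pd.DataFrame({
    "x1": [0, 0, 1, 1],
    "x2": [0, 1, 0, 1],
    "target": [0, 0, 1, 1],  # hypothetical label column
})

# Split features from the label column.
X = input_table.drop(columns=["target"])
y = input_table["target"]

# In the KNIME node, the trained model is returned via output_model.
output_model = DecisionTreeClassifier().fit(X, y)
```

Inside the actual node you would delete the stand-in DataFrame and keep only the fitting logic; a matching Python Predictor node can then consume the model downstream.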
The Granger causality test is useful for determining whether one time series helps predict another. In this use case, we have a large number (more than 1,000) of time series to forecast and need to understand which features are potentially driving the behavior of the others.
The Granger causality test is implemented in the Python statsmodels module (Figure 3) and returns standard statistical outputs. With this many variables, the iteration would have taken days in a single Python process, so I want to take advantage of KNIME’s parallel processing. The Python script takes the feature being treated as the response and a table of the other features as inputs, and outputs the F-statistic and p-value matrices. The code also catches errors when a time series does not contain enough data to perform the calculation.
By embedding this Python script (in the blue box) within a KNIME workflow, I can take advantage of the data access, iteration, and parallel chunk nodes, as well as explore the results quickly with additional analysis. The ‘Read in data’ section can easily be replaced by a file access approach. When I finish experimenting, I can embed this calculation in an automated production workflow by running it in headless mode.
The results of the test are the statistical significance of 24 months of lag across the combinatorial space of more than 1,000 variables. This is too much for human review, but it can easily be aggregated, filtered, and pivoted to understand the space. For example, to answer the question “How many features are statistically important in predicting a given feature?”, I aggregated the monthly impact, filtered at a significance level of alpha = 0.05, and counted the resulting rows for each response.
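That aggregation can be sketched in pandas. The column names (`response`, `feature`, `lag`, `p`) and the tiny results table are hypothetical stand-ins for the Granger output; the same steps map directly onto KNIME's GroupBy and Row Filter nodes.

```python
import pandas as pd

# Hypothetical Granger results: one row per (response, feature, lag) pair.
results = pd.DataFrame({
    "response": ["A", "A", "A", "A", "B", "B"],
    "feature":  ["X", "X", "Y", "Y", "X", "Y"],
    "lag":      [1, 2, 1, 2, 1, 1],
    "p":        [0.01, 0.20, 0.30, 0.40, 0.03, 0.04],
})

alpha = 0.05
# Aggregate across the monthly lags: keep each pair's strongest signal.
best_p = results.groupby(["response", "feature"])["p"].min().reset_index()
# Keep only statistically significant predictors.
significant = best_p[best_p["p"] < alpha]
# Count how many features are important for predicting each response.
counts = significant.groupby("response")["feature"].count()
```

Here feature X is a significant predictor of both A and B, while Y only reaches significance for B, so `counts` reports one predictor for A and two for B.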
The ability to embed Python code directly into KNIME gives me the flexibility to leverage the wide variety of Python modules with the ease of use of KNIME.