Menu Close

Using COPY command to Load CSV files from S3 to RDS Postgress Database via KNIME Workflow

The Postgres COPY command is the most efficient way to load CSV data into a Postgres database. RDS provides a small challenge to the use of this functionality since you cannot use a local filesystem in RDS. In this blog post I will walk you through the steps to connect your Postgres RDS database to an S3 filesystem and load CSV files to this database via a KNIME workflow.

First, we need to add the aws_s3 extension to the Postgres database by executing the following command from the PG Admin tool: CREATE EXTENSION aws_s3 CASCADE;

Next, we need to Create an IAM Policy to Access Amazon S3 Resources and Create an IAM Role to Allow Access AWS Services.

Now that we have a role associated with the IAM access policy we need to grant it to the database. To do this logon to the AWS console and go to the RDS page and select the database you wish to have this functionality. On that page scroll down to Manage IAM Roles (see Figure 1) and add the IAM role created in the previous step to the database. Once this is complete, restart the DB.

Figure 1

Now you have allowed the Postgres RDS database to access the S3 bucket you need to write the SQL statements to use the Postgres copy command.

SELECT aws_s3.table_import_from_s3
( ‘<schema.table_name>’
,” — Table column list if needed
,'(FORMAT CSV, HEADER true)’ – Format of the file and header declaration
, ‘<Bucket Name>’
,'<CSV File Name>’
,'<Database Zone>

Our final step is to integrate this into a KNIME workflow. To do this we only need two nodes a Postgres connector node and a DB SQL Executor Node (see Figure 2).

Figure 2

In the example below (Figure 3), I am refreshing three tables on a daily basis. To do this I truncate the table and then execute the aws_s3.table_import_from_s3 defined above and within a couple of seconds all of my data is refreshed from the .csv files a separate process placed in my S3 bucket.

Figure 3

Posted in Blog

Leave a Reply

Your email address will not be published. Required fields are marked *