Yesterday I had to import an XML file into a series of relational database tables. I started by thinking of the best way to import this data. I could write a Python parser by hand and load the data that way, but I am a lazy programmer. I’m looking for something that is quick to deploy and easy to maintain. That is when I thought of my trusty, opensource ETL tool, KNIME.
For this example, I will be using an XML version of the Shakespeare play Hamlet.
Everything started off fine. I grabbed the XML Reader node and placed it on my canvas. I opened the node configuration tool and entered in the location of the xml file. So far, so good.
My task was to flatten the XML document into a denormalized table. I started out by finding all of the relevant tags I would need to make this output meet the requirements. OK, the play title tag occurs once, so I grab an XPATH node and configure it to grab the title of the play.
Because the act title will repeat inside the play and the scene title will repeat inside the act information, I create two more XPATH nodes to capture this hierarchical information.
Currently my workflow looks like this:
Now, all I need to do is add in the dialogue and I am finished. That is when I noticed a little challenge. I need to get the speaker and their lines to match up. I see that a single speaker can have multiple lines and they are not contained in a grouping tag.
<LINE>For this relief much thanks: ’tis bitter cold,</LINE>
<LINE>And I am sick at heart.</LINE>
If I follow what I have been doing so far, the next speaker (Bernardo) will be credited for speaking the line “And I am sick at heart.” So, how do you handle this using XPATH in KNIME? To resolve this issue I needed to create a node cell of speeches that I would pass into new XPATHS to retrieve the speaker and their lines.
Finally, I need to get rid of the node cells so that I have only the columns I will be loading into the denormalized table as defined in my requirements.
There it is, a quick to deploy and easy to maintain XML to table solution using XPATH and KNIME.