XPATH & KNIME

Plotly.js: Getting Accurate Coordinates from Click Event
September 4, 2019
Late Arriving Dimensions
September 25, 2019
Show all

Yesterday I had to import an XML file into a series of relational database tables. I started by thinking of the best way to import this data. I could write a Python parser by hand and load the data that way, but I am a lazy programmer. I’m looking for something that is quick to deploy and easy to maintain. That is when I thought of my trusty, opensource ETL tool, KNIME.

For this example, I will be using an XML version of the Shakespeare play Hamlet.

Everything started off fine. I grabbed the XML Reader node and placed it on my canvas. I opened the node configuration tool and entered in the location of the xml file. So far, so good.

My task was to flatten the XML document into a denormalized table. I started out by finding all of the relevant tags I would need to make this output meet the requirements. OK, the play title tag occurs once, so I grab an XPATH node and configure it to grab the title of the play.

Because the act title will repeat inside the play and the scene title will repeat inside the act information, I create two more XPATH nodes to capture this hierarchical information.

Currently my workflow looks like this:

Now, all I need to do is add in the dialogue and I am finished. That is when I noticed a little challenge. I need to get the speaker and their lines to match up. I see that a single speaker can have multiple lines and they are not contained in a grouping tag.

<SPEECH>
     <SPEAKER>FRANCISCO</SPEAKER>
     <LINE>For this relief much thanks: ’tis bitter cold,</LINE>
     <LINE>And I am sick at heart.</LINE>
</SPEECH>

If I follow what I have been doing so far, the next speaker (Bernardo) will be credited for speaking the line “And I am sick at heart.” So, how do you handle this using XPATH in KNIME? To resolve this issue I needed to create a node cell of speeches that I would pass into new XPATHS to retrieve the speaker and their lines.

Finally, I need to get rid of the node cells so that I have only the columns I will be loading into the denormalized table as defined in my requirements.

There it is, a quick to deploy and easy to maintain XML to table solution using XPATH and KNIME.

Hamlet XML

Jim McHugh
Jim McHugh
An accomplished Sr. IT leader with over 27 years of professional experience applying technology solutions to business challenges. Mr. McHugh has expertise in data modeling, data governance, business intelligence, predictive analytics and data science. His responsibilities include establishing and executing a strategy that ensures the application of data management & analytics to enable an organization to strategically leverage and fully realize the value of their data.

Leave a Reply

Your email address will not be published. Required fields are marked *

Contact