Semi-structured data has increasingly taken over the role of a universal container for data exchange. JSON, XML and other formats now dominate an area that once belonged to fixed-format text files and CSV files. However, while CSV import and export is trivial and a standard part of any data integration tool, JSON/XML can pose a serious problem. Because the structure of a JSON/XML document is dynamic and sometimes quite complex, the traditional transformation approaches that ETL tools apply to structured data often fail here.

XDTL takes a different approach. As always, instead of trying to handle any conceivable semi-structured data of arbitrary complexity with an extremely sophisticated toolset, XDTL focuses on a handful of the most important and most frequent scenarios with simple and lightweight tools. Three different scenarios are considered:

1. The documents are small but have a complex structure. The structure is known beforehand and is more or less static. This scenario mostly applies to feeds from the JSON/XML APIs that are popular in today's web environment, and is suitable for building a connector (adapter) to a specific web application.

2. The XML data is medium/large but has a simple hierarchical structure, usually no more than 2-3 levels deep. The structure may or may not be known beforehand. This scenario fits medium-sized XML files where XML is used in place of the more traditional text (CSV) format.

3. The document is very big, possibly with a complex structure that never changes. This scenario would most probably require an XDTL extension (a library task written in XDTL). XDTL provides the means to create such an extension, which could take advantage of a variety of facilities, e.g. XSLT transformations, database platform-specific bulk load utilities and additional scripting, encapsulating and hiding all the complexity from the user.

To handle the first scenario, the xdtl:decode command is introduced. xdtl:decode takes a URI (referencing a data feed) or a document name as the source and decodes it into internal Java structures (Array, HashMap) that can then be easily shredded with for loops and .get('<name>') methods and either converted into INSERT statements or into a series of flat files following the structure of the data.

Here's a simple code fragment decoding JSON:

<xdtl:decode source="" target="root" type="JSON"/>
<xdtl:log msg="${root.get('feed').get('title').get('$t')}"/>
<xdtl:log msg="${root.get('feed').get('updated').get('$t')}"/>
<xdtl:for item="entry" rowset="${root.get('feed').get('entry')}">
    <xdtl:log msg="${entry.get('id').get('$t')}"/>
    <xdtl:log msg="${entry.get('title').get('$t')}"/>
</xdtl:for>
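For readers outside XDTL, the same shredding pattern can be sketched in plain Python. The feed below is a made-up GData-style sample (the '$t' key holding an element's text value, as in the XDTL example above); json.loads plays the role of xdtl:decode and the for loop plays the role of xdtl:for.

```python
import json

# Hypothetical GData-style feed, mirroring the structure navigated above.
raw = """
{
  "feed": {
    "title":   {"$t": "Example feed"},
    "updated": {"$t": "2011-01-01T00:00:00Z"},
    "entry": [
      {"id": {"$t": "e1"}, "title": {"$t": "First entry"}},
      {"id": {"$t": "e2"}, "title": {"$t": "Second entry"}}
    ]
  }
}
"""

root = json.loads(raw)                      # stands in for xdtl:decode
print(root["feed"]["title"]["$t"])          # stands in for xdtl:log
print(root["feed"]["updated"]["$t"])

# Nested gets plus a loop are enough to shred the entries.
rows = [(e["id"]["$t"], e["title"]["$t"]) for e in root["feed"]["entry"]]
```

The chain of .get() calls in XDTL corresponds one-to-one to the nested dictionary lookups here.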

That'll do it: one xdtl:decode command and a couple of nested loops scrape the relevant data fields. The rest is up to the designer; there's no need to create XPath patterns or XSLT transformations. The same example can be elaborated to decode XML data:

<xdtl:decode source="" target="feed" type="XML"/>
<xdtl:log msg="${feed.get('title').get('textvalue')}"/>
<xdtl:log msg="${feed.get('updated')}"/>
<xdtl:for item="entry" rowset="${feed.get('entry')}">
    <xdtl:log msg="${entry.get('id')}"/>
    <xdtl:log msg="${entry.get('title').get('textvalue')}"/>
</xdtl:for>

Internally, XML is likewise represented as Java Array and HashMap objects. The text value of an XML element is exposed as its "textvalue" property.

Finally, here's a more complete example shredding the same YouTube feed into INSERT statements. Four tables (FEED, LINK, ENTRY and CATEGORY) are filled with data, complete with the foreign keys for the parent-child relationships.

<xdtl:decode source="" target="feed" type="XML"/>

<xdtl:query source="INSERT INTO FEED VALUES ('${feed.get('id')}', '${feed.get('updated')}', '${feed.get('title').get('textvalue')}', '${feed.get('author').get('name')}');"/>

<xdtl:for item="link" rowset="${feed.get('link')}">

    <xdtl:query source="INSERT INTO LINK VALUES ('${feed.get('id')}', '${link.get('type')}', '${link.get('href')}');"/>

</xdtl:for>

<xdtl:for item="entry" rowset="${feed.get('entry')}">

    <xdtl:query source="INSERT INTO ENTRY VALUES ('${feed.get('id')}', '${entry.get('id')}', '${entry.get('published')}', '${entry.get('title').get('textvalue')}');"/>

    <xdtl:for item="category" rowset="${entry.get('category')}">

        <xdtl:query source="INSERT INTO CATEGORY VALUES ('${entry.get('id')}', '${category.get('scheme')}', '${category.get('term')}');"/>

    </xdtl:for>

</xdtl:for>
Actually, there's nothing to stop the script from shredding every single data field out of the XML feed with this method; it only takes a couple more xdtl:for loops. And last but definitely not least, the contents of the data can serve as sources for successive calls to other APIs with xdtl:decode within a loop, enabling recursive traversal of several related APIs.
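That traversal pattern can be sketched in Python as well. The URLs and responses below are entirely hypothetical and canned in a dictionary, and decode() merely stands in for xdtl:decode with a URI source; the point is the shape of the loop: a field from one decoded feed becomes the source of the next decode.

```python
import json

# Canned responses standing in for live HTTP feeds (hypothetical URLs).
RESPONSES = {
    "https://api.example.com/feed": json.dumps({
        "entry": [
            {"id": "e1", "detail": "https://api.example.com/e1"},
            {"id": "e2", "detail": "https://api.example.com/e2"},
        ]
    }),
    "https://api.example.com/e1": json.dumps({"title": "First"}),
    "https://api.example.com/e2": json.dumps({"title": "Second"}),
}

def decode(uri):
    """Stand-in for xdtl:decode: fetch a URI and parse it into structures."""
    return json.loads(RESPONSES[uri])

# Top-level decode, then a decode inside the loop: each entry's 'detail'
# field feeds the next API call.
feed = decode("https://api.example.com/feed")
titles = [decode(entry["detail"])["title"] for entry in feed["entry"]]
```

Nesting further decode calls inside the inner loop extends this to arbitrarily deep chains of related APIs.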