mmx metadata framework
...the DNA of your data
MMX metadata framework is a lightweight implementation of OMG Metadata Object Facility built on relational database technology. MMX framework
is based on three general concepts:
Metamodel | MMX Metamodel provides a storage mechanism for various knowledge models. The data model underlying the metadata framework is more abstract in nature than metadata models in general. The model consists of only a few abstract entities... see more.
Access layer | Object oriented methods can be exploited using inheritance to derive the whole data access layer from a small set of primitives created in SQL. MMX Metadata Framework provides several diverse methods of data access to fulfill different requirements... see more.
Generic transformation | A large part of relationships between different objects in metadata model are too complex to be described through simple static relations. Instead, universal data transformation concept is put to use enabling definition of transformations, mappings and transitions of any complexity... see more.

XML Schema to MMX Mapping: Essentials

August 16, 2010 17:54 by marx

XML Schema is an important and one of the most widely used modelling tools. Therefore it would make sense to have the the ability to use metamodels created as XML Schemas in MMX Metadata Framework in addition to metamodels originating from UML. As MMX architecture closely follows the ideas and architecture of MOF it would be natural to take advantage of the multitude of UML profiles for XML Schema that are already available. Most of these, however, focus on UML to XSD conversion, so there are only a few targeting XML Schema mapping to UML. As always, our approach is a pragmatic one, concentrating on elements we find more important, more widely used and easier to implement. Hence the word 'Essentials' in the title.  

Data types. Most important XML Schema built-in data types have direct equivalents in MMX Core Metamodel (string, integer, decimal, boolean, date etc.). A property (md_property_type) realizing an XSD attribute with built-in type therefore references one of these Core types in its datatype_cd. In case an XSD simple type is inherited from a built-in type with <restriction> construct the derived datatype is created in MMX M2 layer as a new class with datatype referred in <base=...> as its parent. The root data type classes have all required properties to support XML Schema restrictions and facets (length, minLength, maxLength, minInclusive, maxInclusive, minExclusive, maxExclusive, fractionDigits, totalDigits, pattern) that are appropriately inherited by their respective descendants as optional properties. In case a restriction specifes an enumeration a new Enumeration class is created.

Global vs. local. Global simple types would be realized as descendants of a built-in data type (parent_object_type_cd). Global complex types would be preferably realized as independent classes (md_object_type) with elements referring to them with <type=...> or <ref=...> implemented as their descendants. In case a complex type is defined directly inside an element (local complex type) it could be realized either globally and referenced with parent_object_type_cd, or directly inside the element, with parent_object_type_cd referencing the schema class. We assume that whenever an element references another element we can subsitute the reference with the referenced element. So, when there is a <ref=...> or a <type=...> attribute pointing to another (global) element we can replace the reference with the target element itself. The same applies to <attributeGroup>.

Naming. If possible, name attribute of an element or an attribute is used as name of a class (object_nm) or a property type (property_nm). It is very important that every element and attribute had either a name (preferred) or an ID (second best) to uniquely identify the corresponding class in MMX metamodel. In case both name and ID are missing a technical name gets generated that is less intuitive and makes the metamodel more difficult to understand. Note. origin_ds column of md_object_type is constructed based on either name or ID attribute.

Model Groups. XML Schema group (ordering) indicators <all>, <sequence> and <choice> do not have direct counterparts in UML. One way to implement this in MMX is via <group> (named model group) element. A <group> element would be realized as a class (md_object_type) having a 1:M relationship (aggregation) with the group members (classes). Ordering of the group members is indicated as a dedicated property of the group class.

XSD construct Mapping to MMX model
<schema>

<schema> element corresponds to an MMX metamodel. <schema> attributes attributeFormDefault, elementFormDefault, blockDefault and finalDefault are irrelevant in MMX metamodel context and are assumed to have their respective default values. Content of targetNamespace attribute is stored with md_object_type (class) while version and xmlns attributes are stored as class properties in md_property_type. Each xmlns attribute gets its own property type as this is a multiple property type. 

<element>

In case an <element> contains a complexType, specifies a complexType as its type or references another element that happens to be a complexType it gets realized as a class (md_object_type). In case an element is a simpleType it is realized as a property (md_property_type) of the schema class. 

<attribute>

An <attibute> generally corresponds to an MMX M2 level property (md_property_type). default and fixed attributes values are stored in default_value_ds and changeable_ind columns of md_property_type. Required attribute (use="required") is stored in mandatory_ind column.

<complexType> <complexType> element is naturally realized as a class (md_object_type) in MMX. In case a complex type contains another complex type element, the nested element gets its own class and a relationship between the two classes. Complex type attributes are realized as properties (md_property_type) owned by the complex type class. 
<simpleType>

Stand-alone (global) <simpleType> elements are realized as properties (md_property_type) of the root (schema) class.

<enumeration>

MMX provides the facilities to realize an <enumeration> of unlimited depth in form of a special built-in data type. An enumeration class is realized in M2 (class) layer of MMX metametamodel. Enumeration instances (corresponding to enumeration literals of UML) are created on M1 (instance) layer, in md_object table. An attribute taking enumeration as its type is modelled as a property type with enumeration as its data type referencing the specific enumeration class as its domain_cd. 

<group>

A <group> is realized as a named class (md_object_type) having relationships with the group members. Order indicator (<all>, <sequence> or <choice>) is stored in a dedicated property of the <group> class.

<all>, <sequence>, <choice> 

XML Schema order indicators (<all>, <sequence>, <choice>) denote 1:M relationships (aggregations) that might require a specific order and are realized via a named <group> element.

<minOccurs>, <maxOccurs>

XML Schema occurrence indicators (<minOccurs>, <maxOccurs>) are expressed as a combination of mandatory_ind and multiplicity_ind columns. Note that mandatory_ind and multiplicity_ind do not support numeric occurrence indicator values therefore only values "0", "1" and "unbounded" are allowed here. 

<attributeGroup>

An <attributeGroup> is essentially implemented as a complex type (MMX class), with attribute group name as the class name. A reference to an attribute group is realized either as inheritance between complex types or by substituting it with referenced attributes. In former case, the attribute group class would be an abstract class, with descendant class inheriting all of its attributes. 

<annotation>, <documentation>, <appInfo>

Class <annotation> is stored in object_ds column of a class (md_object_type). Attribute <annotation> is stored in property_ds column of a property (md_property_type). 

Notes. The following elements of XML Schema are not covered (yet) for various reasons:

- Identity constraints (<field>, <selector>, <key>, <keyref>, <unique>). 
- Undefined elements, attributes and content (<any>, <anyAttribute>, <notation>). 
- Relationships with external schemas (<redefine>, <import>, <include>). 
- Constructs we don't see too often or don't like (<union>, <list>). 

At least a part of this list will be covered in a subsequent release of this mapping document.

References:

[1] Grady Booch, Magnus Christerson, Matthew Fuchs, Jari Koistinen: UML for XML Schema Mapping Specification 

[2] David Carlson: UML Profile for XML Schema, http://www.xmlmodeling.com/documentation/specs/

[3] Martin Bernauer, Gerti Kappel, Gerhard Kramler: Representing XML Schema in UML - An UML Profile for XML Schema

[4] Nicholas Routledge, Linda Bird and Andrew Goodchild: UML and XML Schema

[5] M. Laura Caliusco, César Maidana, Martín Patiño, M. Rosa Galli and Omar Chiotti: A UML profile for XML Schema

[6] Martin Bernauer, Gerti Kappel, Gerhard Kramler: Representing XML Schema in UML – A Comparison of Approaches

 



XDTL Runtime: Alive and Kicking

March 22, 2010 11:58 by marx

First, a quick recap. XDTL (http://xdtl.org) is an XML based descriptional language designed for specifying data transformations from one database/storage to another. XDTL syntax is defined in an XML Schema document. XDTL Runtime is a lightweight ETL runtime environment that efficiently and with zero overhead handles most of the typical ETL needs. XDTL Runtime can generate SQL or SAS scripts (or any other executable instructions for that matter) based on templates processed by a template engine.

Now, the 'news' part. XDTL Runtime Version 1.0 (XDTL RT 1.0) is finished and running live! The runtime is written in Java (Sun JRE 1.6 required) and uses Velocity (http://velocity.apache.org/) for template processing. So here's a short primer.

There are two individually addressable units of execution in XDTL: a package is a container containing tasks, both of them can be invoked by name. A task consists of steps denoting individual commands that are executed sequentially and cannot be addressed individually. Besides tasks, a package contains three collections: parameters, variables and connections. As in XSLT, $ denotes dereferencing: during execution, everything starting with $ is dereferenced and substituted with a value. In addition, everything between { and } is treated as a JavaScript expression and evaluated.   

There are file commands and database commands in XDTL. File commands usually launch operating system processes and are designed to handle files (move, compress, transfer etc.). File commands in XDTL RT 1.0 are:

get: downloads a file with a file transfer utility (ftp, scp, sftp)
put: uploads a file with a file transfer utility (ftp, scp, sftp)
unpack: unpacks a file with an archival utility (zip, arj, tar etc)
pack: pack a file with an archival utility (zip, arj, tar etc)
strip: cleanse a text file, eg. with a stream/regex utility (sed etc)
move: move files to another directory, usually for archival with timestamp
clear: remove everything related to a file/task from a working directory
log: adds a line to standard log output
exec: executes an arbitrary operating system command line (shell) command

Database commands control database operations:

read: transfers a text file into a database table, usually in staging area
write: transfer a database table into a text file
load: configured to load a file into a database table with a bulk load utility
dump: configured to dump a database table into a file with a bulk dump utility
transaction: wrapper for transaction processing
query: executes a database command (an SQL statement, a SPARQL query etc.)

Then come some control flow commands:

call: invokes another package or another task passing required parameters
if: adds basic conditional control

Finally, while the file and database commands constitute the backbone of XDTL, it's heart and soul are mappings and render commands:

mappings: define or retrieve metadata used to instantiate a procedure (SQL, SAS etc.)
render: merges a code template with corresponding mappings to produce executable SQL statement(s), SAS procedure(s) or any other form of executable code

Mappings and templates are fully decoupled and 'know nothing about each other': it's only during the rendering step that they meet each other to merge into an executable script or set of instructions. This enables a lot of novel opportunities: a specific set of mappings can be used with different templates for different purposes, a specific template can be reused many times with different mappings to handle different data sources, a mapping representing some business logic can be ported to another platform by rendering with another template etc.  Splendid!



XDTL: Template-Based SQL Generation

November 6, 2009 22:21 by marx

SQL is and probably remains the main workforce behind any ETL (and especially ELT flavour of ETL) tool. Automating SQL generation has arguably always been the biggest obstacle in building an ideal ETL tool - ie. completely metadata-driven, with small foot-print, multiple platform support on single code base... and, naturally, capable of generating complex SQL in an easy and flexible manner, with no rocket scientists required nearby. While SQL stands for Structured Query Language, ironically the language itself is not too well 'structured', and the abundance of vendor dialects and extensions does not help either. 

Attempts to build an SQL generator supporting full feature list of SQL language have generally fallen into one of the two camps: one of them trying to create a graphical click-and-pick interface that would encompass the syntax of every single SQL construct, another one designing an even more high-level language or model to describe SQL itself, a kind of meta-SQL. The first approach would usually be limited to simple SQL statements, be appropriate mostly for SELECT statements only and struggle with UPDATEs and INSERTs, and be limited to a single vendor dialect.  

The second approach would drown in the complexity of SQL itself. In theory, one could decompose all of the SQL statements into series of binary expressions, store them away, and (re)assemble into SQL statements again as needed, driven by the syntax of a particular SQL dialect. However, usually this approach fails to produce something usable, mostly because SQL is too loosely defined (considering all those vendors and dialects) and trying to cover everything just results in a system too cumbersome for anyone to use. The result would probably be an order of magnitude more complex to use than just hand-coding SQL statements, even with several parallel code bases. And that's exactly what the developers would do: invent a method to bypass the abstract layer and hand-code SQL directly.

Enter Template-Based SQL Generation. Based on our experience (and tons of ETL code written) we have extracted a set of SQL 'patterns' common to ETL (ELT) tasks. The patterns are converted into templates for processing by a template engine (eg. Apache Velocity), each one realizing a separate SQL fragment, a full SQL statement or a complete sequence of commands implementing a complex process. Template engine merges patterns and mappings into executable SQL statements so instead of going as deep as full decomposition we only separate and extract mappings (structure) and template (process) parts of SQL. This limits us to only a set of predefined templates, but anyone can add new or customize the existing ones.

The important thing about this is: templates are generic and can be used with multiple different mappings/data structures. The mappings are generic as well and can be used in multiple different patterns/templates. Template engine 'instantiates' mappings and templates to create executable SQL statement 'instances' which brings us closer to OO mindset. The number of tables joined, the number of columns selected, the number of WHERE conditions etc. is arbitrary and is affected by and driven by the contents of the mappings only, ie. well-designed templates are transparent to the level of complexity of the mappings. The same template would produce quite different SQL statements driven by minor changes in mappings.

As an example, a 'basic' template-driven INSERT..SELECT statement might look like this:

INSERT INTO network (
caller
,receiver
,calls_no
)
SELECT
c.cust_a AS caller
,c.cust_b AS receiver
,c.calls AS calls_no
FROM
call c
LEFT OUTER JOIN
network n ON n.network_id = c.network_id
WHERE
...

Indicating that three consequtive mappings are actually to be treated as one complex statement with subqueries would change the generated SQL to:

INSERT INTO network (
caller
,receiver
,calls_no
)
SELECT
c.cust_a AS caller
,c.cust_b AS receiver
,c.calls AS calls_no
FROM
(SELECT
DISTINCT a.cust_id AS cust_a
,b.cust_id AS cust_b
,c.call_type::integer AS type
,c.call_length::integer AS length
,c.call_date::date AS date
FROM
(SELECT
DISTINCT r.call_type::integer AS call_type
,r.call_length::integer AS call_length
,r.call_date::date AS call_date
FROM
raw_cdr r
...

On the other hand, we might prefer cascading INSERTs through temporary tables for performance reasons, that would morph the SQL into:

SELECT
DISTINCT r.call_type::integer AS call_type
,r.call_length::integer AS call_length
,r.call_date::date AS call_date
INTO TEMP TABLE cdr
FROM
raw_cdr r
WHERE
... 

Selecting Oracle as the target platform would switch the same template over to Oracle syntax producing:

CREATE TABLE cdr AS
SELECT
DISTINCT r.call_type AS call_type
,r.call_length AS call_length
,r.call_date AS call_date
FROM
raw_cdr r
WHERE
... 

To accomplish all (or at least a lot) of this we have (so far) assembled two template 'libraries'. MMX XDTL Basic SQL Library covers a wide range of 'building blocks' for implementing complex data processing command chains: basic INSERT..SELECT, complex INSERT..SELECT with subqueries, cascaded (staged) INSERT..SELECT, UPDATE..FROM etc. MMX XDTL Basic ELT Library includes more complex multi-step patterns used in typical ELT scenarios focusing on single-table synchronisation: Full Replace, Incremental Load, Upsert, History Load etc. These pattern libraries serve the purpose of reference templates and are easily customizable to fit the unique characteristics of a specific use case.