Jesper Tverskov, September 15, 2010

Transform XML to CSV with XSLT pipeline

This tutorial is a showcase for the templating power of XSLT. The XML to CSV solution is a pipeline of 7 xslt stylesheets providing user-defined pre-processing and automatic flattening and equalizing. Comma, quote, linebreak, leading and trailing whitespace are supported in data.

1. xml2csv in 7 steps

I have made an XSLT pipeline of 7 transformations. The first 5 steps are for manual and automatic pre-processing, the 6th step tests if the result of the pre-processing is tabular, and the 7th step is the transformation of the pre-processed XML to CSV.

The solution has a parameter for each included stylesheet with values 0="skip", 1="include" and 2="include and serialize to file". For a CSV friendly XML document, already flattened and equalized, 6 of the 7 stylesheets can be set to "skip". Switch the steps on/off as necessary. If speed is not an issue just use the "2" parameter value for all the parameters.

The master stylesheet, pipeline_xml2csv.xsl, includes 7 "step" stylesheets. All stylesheets can be downloaded in one zip-file: xml2csv.zip.The seven included stylesheets also work on their own, outside the pipeline, without any changes, to make debugging easier.

No Stylesheet Description
1 s1_user-defined.xsl Identity transformation. The user can add additional templates of exceptions typically to delete or rename attributes and elements.
2 s2_testing.xsl Tests and warns user if XML input contains mixed content. Also warns user if "_" and "." is used in element and attribute names (not necessarily a problem).
3 s3_flatten.xsl All children deeper than three levels and all attributes deeper then two levels are converted to children of second level.
4 s4_equalizing.xsl Adds missing optional elements and attributes (now in the form of elements).
5 s5_user-defined.xsl Identity transformation. The user can add templates of exceptions typically to rename (fine-tune) the automatically generated element names of step-3.
6 s6_testing.xsl Tests if the XML document is CSV-ready. All sets of children of second level must have the same count, the same names in the same order.
7 s7_xml2csv.xsl Transform the XML document to CSV.

Remember to look at the source code of the above stylesheets. In some browsers you must look in the source code to see anything at all. Even if browsers show a nicely indented XML tree, the source code has been transformed with JavaScript and styled with CSS.

1.1 Naming conventions

Each included stylesheet has a name starting with the step number, e.g.: "s3_flatten.xsl". In the main stylesheet the parameter for that stylesheet is param_s3_flatten, the variable holding the transformation of the step is "s3_flatten", the mode for the templates is "s3_flatten", and the generated output file is "temp_s3_flatten.xml".

2. Manual fine-tuning

The pipeline_xml2scv.xsl works automatically for any XML document of the form "items/item", including attributes, no matter the element and attribute names, and no matter how many levels the hierarchy. Flattening and equalizing (adding missing optional elements and attributes) are done automatically.

User intervention is only needed, if you want some attributes and elements to be ignored (deleted) or renamed, or if you want to improve the automatically generated element names. Step 1 and Step 5 are user-defined consisting only of an identity transformation. The user can add templates of exceptions typically to delete or rename.

3. pipeline_xml2csv.xsl

The main stylesheet, pipeline_xml2csv.xsl is a good model for how to make a pure XSLT pipeline, for how to make it easy to include and exclude transformations, for how to serialize each step to file for easy debugging.

3.1 Step-1: User-defined

The included stylesheet only contains one template, the famous identity template, recreating the input XML document node for node. The user can make templates of exceptions to match some attributes or elements with higher priority than the identity template, typically to delete or rename attributes or elements. The following template will delete the second item element:

  1. <xsl:template mode="s1_userdefined" match="/*/*[2]"/>.

The template above matches the second child of the top-element and does nothing. That is: the element is ignored, it will not make it to the regenerated output.

The next example shows a template that will rename the second child of item from "name" to product-name":

  1. <xsl:template mode="s1_userdefined" match="/*/*/name">
  2.  <product-name>
  3.   <xsl:apply-templates mode="s1_userdefined" match="@*|node()">
  4.  </product-name>
  5. </xsl:template>

3.2 Step-2: Testing

If the input XML document has mixed content, the transformation from XML to CSV will fail. For that reason the user is warned if mixed content exists. The user can use step-1 to make the necessary changes manually by inserting additional templates to fix the problem.

The user is also warned if attributes or element names contain "." and "_". This is most often not a problem. But the flattening step uses "." and "_" in the automatically created element names. The end result, the first line of field names in the CSV file, might look confusing with many "." and "_" in the names. One of the user-defined steps can be used for renaming.

If the use of "." and "_" in the original attribute and element names is not a problem, the param_s2_testing parameter can be set to "0" in the main stylesheet to skip testing.

3.3 Step-3: Flatten

All attributes at second level or deeper and all elements being children of children of second level, or children's children, are converted to children of second level. Let us give some hints to how to read the stylesheet. There are templates matching first level and second level elements. The following template will match all other elements:

  1. <xsl:template match="*" mode="s3_flatten">
  2.  <xsl:choose>
  3.   <xsl:when test="*">
  4.    <xsl:apply-templates select="@*|node()" mode="s3_flatten"/>
  5.   </xsl:when>
  6.   <xsl:otherwise>
  7.    <xsl:variable name="name" select="if(count(ancestor::*) > 2) then for $a in ancestor::*[count(ancestor::*) > 1] return for $b in $a return local-name($b) else ''"/>
  8.    <xsl:element name="{concat(string-join($name, '_'), if(count(ancestor::*) > 2)then '_' else '', local-name(.))}">
  9.     <xsl:value-of select="."/>
  10.    </xsl:element>
  11.    <xsl:apply-templates select="@*" mode="s3_flatten"/>
  12.   </xsl:otherwise>
  13.  </xsl:choose>
  14. </xsl:template>

The "when" tests if the element has children. If that is the case the children are processed instead, getting to them with apply-templates, finding the same template. If the element hasn't children, it is a child of second level or should be recreated as a child of second level. For element name, all the skiped ancestor element names are used.

In the "name" variable all ancestor element names, except the first two levels, are stored in a sequence. The element is then recreated using the names in the sequence, adding a "_" between the names. In another template, not shown, the attributes are treated more or less the same except that "." is used as the first delimiter to indicate "attribute".

3.4 Step-4: Equalizing

The stylesheet is not easy to understand but basically it adds all missing optional attributes and elements. Since attributes are already converted to elements, we only need to add missing optional elements. The stylesheet works well if one of the item element contains all the children elements. The stylesheet's output is less likely to be exactly as desired, if the item element with most children is far from complete.

The stylesheet always works but if all item elements are missing some of the children elements, we can not expect an automatic proces to quess the ideal order of the combined children in all situations. Look at the following example:

  1. <items>
  2.  <item>
  3.   <b/>
  4.   <c/>
  5.  </item>
  6.  <item>
  7.   <a/>
  8.   <z/>
  9.  </item>
  10. </items>

If we equalize the above children, should the order of the complete set of children be: b-c-a-z or a-b-c-d or what? If the output is less than desired, one can simply add the missing elements to the most complete item element either directly in the input XML file, or by adding templates that can do the trick to the first user-defined step.

The "equalizing" stylesheet works as follows:

  1. The input XML document is transformed to a temporary tree in order to add a count-children-of-item attribute to all item elements with the count of children elements.
  2. The input document is sorted by the children count to get the "item" element with most children first.
  3. A list is made of distinct children names.
  4. The input file is transformed to a temporary tree, using the list of distinctive element names to create the children. If a child exists more than once, it is recreated as often as required with distinctive names like "member_1", "member_2", etc.
  5. A new temporary tree is created and sorted to get the "item" element with most children first.
  6. A new list of distinctive element names is created, and a new temporary tree is generated based on the list, and the "_1", "_2", etc, are now deleted from the element names.

3.5 Step-5: User-defined

The included stylesheet contains only one template, the famous identity template, recreating the input XML document node for node. The user can make templates of exceptions to match some attributes or elements with higher priority than the identity template, typically to rename (fine-tune) some of the automatically generated element names of step-3: Flatten.

3.6 Step-6: Testing

This step tests if the data of the input or modified XML document is tabular, that is: ready to be transformed to CSV. All "item" elements, the second element level, must have the same set of children elements: same count of children, same names and in same order. If that is not the case the final transformation to CSV is aborted and the user is warned.

  1. <xsl:variable name="name-item-children-sequence" select="for $b in 1 to count(./*/*) return count(distinct-values(for $a in ./*/*/*[$b] return local-name($a)))"/>
  2. <!-- etc -->
  3. <xsl:when test="count(distinct-values($name-item-children-sequence)) > 1">
  4.  <xsl:text>ALERT-2&#xA;&#xA;</xsl:text>
  5.  <xsl:text>Children of item elements are not ordered exactly the same.</xsl:text>
  6.  <xsl:text>&#xA;&#xA;END</xsl:text>
  7. </xsl:when>

The first test (not shown above) simply tests if all "item" elements have the same count of children elements. The second test, shown above, is a little more complex. It tests if the count of distinct values of all "first" child is one, of all second child is one, etc. If all the counts are one, the set of children must be exactly the same (if the first test is also true).

3.7 Step-7: xml2csv

The stylesheet has parameters for delimiter, ",|;|:", and for how to handle "no" data. The most correct scheme is probably to use quots like: "some data,"",more data", but "some data,,more data" is also common. Note how easy it would be to add support for more delimiters simply by adding additional templates.

  1. <!-- matching data with comma -->
  2. <xsl:template match="text()[contains(., ',')]" mode="s7_xml2csv">
  3.  <xsl:choose>
  4.   <xsl:when test="$param_delimiter = ','">
  5.    <xsl:value-of select="concat('&quot;', ., '&quot;')"/>
  6.   </xsl:when>
  7.   <xsl:otherwise>
  8.    <xsl:value-of select="."/>
  9.   </xsl:otherwise>
  10.  </xsl:choose>
  11. </xsl:template>

Templates can be based on a conditon with the mode attribute or the use-when attribute, or with a good old switch. Above we combine "mode" and "switch". The template is only active during step 7, and the switch says that data containing comma should only be quoted if comma is the delimiter.

If you take a closer look at the complete stylesheet, note that it could have been simplified. We don't need that many templates. But I prefer to spell things out with a template for each match to make a more robust solution that is easier to understand and maintain.

4. Testing the pipeline

I have made 9 files for testing. They are included in the zip-file above. By looking at the comments in the files, it should be a little easier to make sense of the different pipeline steps.

1 xml2csv-input-01.xml The file is or is almost CSV ready.
2 xml2csv-input-02.xml Fine-tuning of output field-names.
3 xml2csv-input-03.xml A missing element and attribute.
4 xml2csv-input-04.xml Mixed content.
5 xml2csv-input-05.xml More than two levels deep.
6 xml2csv-input-06.xml Different order of attributes.
7 xml2csv-input-07.xml One item has three children with same name.
8 xml2csv-input-08.xml Namespace in input is not a problem.
9 xml2csv-input-09.xml Very nasty.

5. The power of XSLT

One of the benefits of a well-made XSLT solution is that you can make drastic changes to the code without ever having to modify the original code. You simply add new templates matching the input with higher priority. It is relatively easy to document and roll back what has been added in a template of its own.

Updated: 2010-09-18