Thursday 23 October 2014

CSV to XML with a Quick and Dirty XSLT

Issue

A CSV file has to be converted into XML.

Resolution

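The following XSLT uses simple tokenization to generate XML from plain separated text. The separator is defined by the parameter 'seperator' (spelled this way in the stylesheet) and is passed to tokenize() as a regular expression, so a separator such as '|' has to be escaped. Quoted fields that contain the separator are not handled. The example below uses a tab character.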

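Other parameters define whether the first line is treated as a header row (header-row: any non-empty value means yes, an empty string means no) and the names of the elements that make up the table, row and cell structure (tableName, rowName, cellName). When a header row is present, each cell carries the matching header value in a title attribute.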
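
As an illustration, assume a hypothetical myfile.csv holding two tab-separated lines: a header line with the values id and title, then a data line with the values 1 and Example Act. With the default parameter values declared in the stylesheet (legislation, item and data), the result should be shaped roughly as follows. The XML declaration is left out here and the exact indentation depends on the serializer:

```xml
<!-- Sketch of the expected result for the hypothetical two-line input described above -->
<legislation>
   <item>
      <!-- title comes from the header line, the content from the data line -->
      <data title="id">1</data>
      <data title="title">Example Act</data>
   </item>
</legislation>
```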

The transformation is XSLT 2.0 and can be run with Saxon using the following command line, where thisXSLT.xsl contains the code below:

java -jar saxon.jar -it:main -xsl:thisXSLT.xsl -o:result.xml "csvFile=myfile.csv"

XSLT

<xsl:stylesheet 
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform" 
  xmlns:fn="http://www.w3.org/2005/xpath-functions" 
  xmlns:local="http://www.griffmonster.org" 
  xmlns:xs="http://www.w3.org/2001/XMLSchema" 
 
  version="2.0"
  exclude-result-prefixes="xsl xs fn local">
 <xsl:output indent="yes" encoding="UTF-8" method="xml"/>
 
 <!--
 
 a more complex routine is available at http://rosettacode.org/wiki/Csv-to-xml.xslt
 
 -->

 <xsl:param name="csvFile" as="xs:string" />
 <xsl:param name="header-row" as="xs:string" select="'true'" />
 <xsl:param name="seperator" as="xs:string"  select="'&#9;'"/>
 <xsl:param name="tableName" as="xs:string"  select="'legislation'"/>
 <xsl:param name="rowName" as="xs:string"  select="'item'"/>
 <xsl:param name="cellName" as="xs:string"  select="'data'"/>
 
 <xsl:template match="/" name="main">
  <xsl:copy-of select="local:csv-to-xml($csvFile)" />
 </xsl:template>

 <!-- if this function is available from xslt 3 then use it otherwise use the makeshift expression  -->
 <xsl:function name="local:unparsed-text-lines" as="xs:string+">
  <xsl:param name="href" as="xs:string" />
  <xsl:sequence use-when="function-available('unparsed-text-lines')" 
    select="fn:unparsed-text-lines($href)" />
  <xsl:sequence use-when="not(function-available('unparsed-text-lines'))" 
    select="tokenize(unparsed-text($href), '\r\n|\r|\n')[not(position()=last() and .='')]" />
 </xsl:function>

 <xsl:function name="local:csv-to-xml" as="node()+">
  <xsl:param name="href" as="xs:string" />
  <xsl:variable name="header-row" as="xs:string*" 
    select="if ($header-row != '') then 
       tokenize(local:unparsed-text-lines($href)[1], $seperator) 
      else ()"/>
  <xsl:element name="{$tableName}">
   <xsl:for-each select="local:unparsed-text-lines($href)">
    <xsl:choose>
     <xsl:when test="position() = 1 and exists($header-row)">
     </xsl:when>
     <xsl:otherwise>
      <xsl:element name="{$rowName}">
       <xsl:variable name="tokens"  as="xs:string+" select="tokenize(., $seperator)"/>
       <xsl:for-each select="$tokens">
        <xsl:variable name="position" as="xs:integer" 
          select="position()"/>
        <xsl:variable name="celltitle" as="xs:string?" 
          select="if (exists($header-row)) then 
             $header-row[$position]
            else ()"/>
        <xsl:element name="{$cellName}">
         <xsl:if test="exists($header-row)">
          <xsl:attribute name="title" select="$celltitle"/>
         </xsl:if>
         <xsl:value-of select="."/>
        </xsl:element>
       </xsl:for-each>
      </xsl:element>
     </xsl:otherwise>
    </xsl:choose>
    
   </xsl:for-each>
  </xsl:element>
 </xsl:function>
</xsl:stylesheet>
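
Because tokenize() treats its second argument as a regular expression, a literal separator such as | or . would be misread unless it is escaped. A minimal sketch of a helper function that could be added to the stylesheet follows. It is not part of the original code, and the name local:escape-regex is hypothetical:

```xml
<!-- Hypothetical helper: returns $literal with regex metacharacters backslash-escaped,
     so the result can be used as a literal pattern in tokenize() -->
<xsl:function name="local:escape-regex" as="xs:string">
 <xsl:param name="literal" as="xs:string" />
 <!-- the character class lists each metacharacter in escaped form;
      '\\$1' in the replacement emits a backslash followed by the matched character -->
 <xsl:sequence
   select="replace($literal, '([\\\|\.\?\*\+\(\)\{\}\[\]\^\$\-])', '\\$1')" />
</xsl:function>
```

The two tokenize() calls in local:csv-to-xml could then pass local:escape-regex($seperator) in place of $seperator. The trade-off is that a caller could no longer supply a deliberate regular expression as the separator.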

Friday 3 October 2014

Make structured xml from flat source with XSLT 1

Issue

Structured XML had to be generated from a flat XML source, with only XSLT 1.0 available for the transformation. Every paragraph element needed to be nested within the preceding subsection element, and every subparagraph element within the preceding paragraph element. In addition, textual content had to be wrapped in a <text/> element.

Resolution

Source:

<sectiontitle>sample content text</sectiontitle>
<subsection>sample content text</subsection>
<paragraph>sample content text</paragraph>
<paragraph>sample content text</paragraph>
<subparagraph>sample content text</subparagraph>
<subparagraph>sample content text</subparagraph>
<subsection>sample content text</subsection>

Required Output:

<sectiontitle><text>sample content text</text></sectiontitle>
<subsection>
    <text>sample content text</text>
    <paragraph><text>sample content text</text></paragraph>
    <paragraph>
        <text>sample content text</text>
        <subparagraph><text>sample content text</text></subparagraph>
        <subparagraph><text>sample content text</text></subparagraph>
    </paragraph>
</subsection>
<subsection><text>sample content text</text></subsection>


XSLT

<xsl:template match="node()|@*">
    <xsl:copy>
        <xsl:apply-templates select="node()|@*" />
    </xsl:copy>
</xsl:template>

<xsl:template match="subsection">
    <subsection>
        <text>
            <xsl:value-of select="." />
        </text>
        <xsl:apply-templates
        select="following-sibling::paragraph
        [generate-id(preceding-sibling::subsection[1])
        = generate-id(current())]"  mode="nest" />
    </subsection>
</xsl:template>

 <xsl:template match="paragraph" mode="nest">
    <paragraph>
        <text>
            <xsl:value-of select="." />
        </text>
        <xsl:apply-templates 
            select="following-sibling::subparagraph
            [generate-id(preceding-sibling::paragraph[1])
            = generate-id(current())]"  mode="nest" />
    </paragraph>
</xsl:template>

<xsl:template match="subparagraph" mode="nest">
    <xsl:copy>
        <text>
            <xsl:apply-templates />
        </text>
    </xsl:copy>
</xsl:template>
 
<xsl:template match="paragraph"/>
 
<xsl:template match="subparagraph"/>
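
The required output also wraps the content of sectiontitle in a text element, which the identity template on its own does not do. A minimal sketch of an extra template that would cover this, assuming sectiontitle holds only text or inline content:

```xml
<!-- Sketch, not in the original listing: wrap the content of sectiontitle in <text> -->
<xsl:template match="sectiontitle">
    <xsl:copy>
        <!-- keep any attributes on the element itself -->
        <xsl:apply-templates select="@*" />
        <text>
            <!-- child nodes are copied by the identity template -->
            <xsl:apply-templates select="node()" />
        </text>
    </xsl:copy>
</xsl:template>
```

Being a named-element match, this template takes priority over the node()|@* identity template for sectiontitle elements.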

Points to note:

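  • The xsl:value-of could be replaced by xsl:apply-templates if the content holds anything other than a text node
  • The XML source has to be consistent: each paragraph must follow a subsection and each subparagraph must follow a paragraph, otherwise elements are dropped or attached to the wrong parent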
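
The first point can be sketched as a variant of the paragraph template. It would replace the mode="nest" paragraph template above rather than sit alongside it, and it assumes inline markup (for example a hypothetical emphasis element) should be copied through unchanged:

```xml
<!-- Variant sketch: copy child nodes instead of flattening them to a string -->
<xsl:template match="paragraph" mode="nest">
    <paragraph>
        <text>
            <!-- default mode, so the identity template copies text and inline elements -->
            <xsl:apply-templates select="node()" />
        </text>
        <xsl:apply-templates
            select="following-sibling::subparagraph
            [generate-id(preceding-sibling::paragraph[1])
            = generate-id(current())]" mode="nest" />
    </paragraph>
</xsl:template>
```

The same change could be made in the subsection template if its content may contain inline markup.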