Monday, September 21, 2009

Properly Handling CDATA In .NET XSLT

Prologue:

I have a T-SQL table of data, one field of which contains escaped X/HTML: that is, XHTML where the significant characters, like < and >, are stored as &lt; and &gt;, etc. Some of this XHTML is broken, probably because the writer was using a rich-text editor which sent fragments of XHTML to the database. When I use SQLXML to pull the data out, the body of some of the elements contains this escaped HTML. So now I need to unescape it back into normal HTML, so it can be rendered on a page without hassle, but without making it part of the XML document.

Summary:
We have valid XML containing escaped XML (or XHTML). The escaped content needs to be unescaped and wrapped sensibly, so it doesn't get mixed up with the surrounding document, and we need to do this in .NET.

First problem:

Unescaping the code in XSLT is a little tricky, but if we assume every element which needs unescaping has the same name, then we can do this:

<xsl:template match="Description">
<xsl:copy>
<xsl:value-of select="node()" disable-output-escaping="yes" />
</xsl:copy>
</xsl:template>

What this does is match every element named 'Description' in the original XML and unescape its content, so code like &lt; and &gt; becomes < and >. Simple. If you want to do that for specific elements only, just move the 'value-of' XSL element to where you want it.
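
It may help to see where the escaping actually lives. As a minimal sketch (the sample markup and the class name EscapingDemo are made up for illustration), the following C# shows that an XML parser has already decoded &lt; and &gt; by the time you read an element's value. It is the serializer that escapes them again on the way out, and that output step is what disable-output-escaping switches off:

```csharp
using System;
using System.Xml.Linq;

public static class EscapingDemo
{
    public static void Main()
    {
        // Hypothetical sample: a Description element holding escaped XHTML.
        XElement description = XElement.Parse(
            "<Description>&lt;strong&gt;Hello&lt;/strong&gt;</Description>");

        // The parsed text value already contains real angle brackets.
        Console.WriteLine(description.Value); // <strong>Hello</strong>

        // Serializing the element escapes the text content again.
        Console.WriteLine(description.ToString());
    }
}
```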

Second problem:

We now have XML containing some more XML, which should really be considered plain text until it's needed. So, we want to keep this 'inner XML' as it is (i.e. readable XML, not escaped) but indicate that it's not part of the current XML document.

Second solution:

Wrap the content of the elements we want to keep as plain text in a CDATA section, like this:

<xsl:template match="Description">
<xsl:copy>
<![CDATA[<xsl:value-of select="node()" disable-output-escaping="yes" />]]>
</xsl:copy>
</xsl:template>



What the CDATA block does is tell any parser to treat its content as plain character data rather than markup. Simple.

Unfortunately, that applies to the XSLT document too: the parser reading the stylesheet does not parse the content of the CDATA block either, so what we actually end up with is:

<Description>
<![CDATA[
<xsl:value-of select="node()" disable-output-escaping="yes" />]]>
</Description>

Third problem:

If you just add the CDATA block to your XSLT, a standard parser treats the XSL element within it as literal text and never processes it.
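
The same behaviour can be seen outside XSLT. In this small sketch (the sample markup and the class name CdataIsTextDemo are invented for illustration), the parser finds no child elements inside the CDATA block, only text, which is why the xsl:value-of above is never executed:

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class CdataIsTextDemo
{
    public static void Main()
    {
        // Made-up sample: markup placed inside a CDATA section.
        XElement description = XElement.Parse(
            "<Description><![CDATA[<strong>Hello</strong>]]></Description>");

        // The CDATA content is not parsed, so there are no child elements.
        Console.WriteLine(description.Elements().Count()); // 0

        // The content is available only as literal text.
        Console.WriteLine(description.Value); // <strong>Hello</strong>
    }
}
```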

Third solution:

The answer to this is not to include the CDATA block ourselves, but to tell the XSLT processor which elements need to contain a CDATA block instead. The processor will then wrap their content itself when it writes the output. We do this by adding the 'cdata-section-elements' attribute to the output element of the XSLT document:

<xsl:output method="xml" indent="yes" cdata-section-elements="Description" omit-xml-declaration="yes" />

Now, every <Description> element, in the output XML, will contain a CDATA block which wraps the body content of the element, like this:

<Description>
<![CDATA[
<strong>Some content that was parsed from the value-of XSL element.</strong>]]>
</Description>

However, if you have been trying this as we go along and seeing something different (i.e. it's not working for you), that's because, in .NET, you have to create the output writer in a certain way.

The short answer is, here's the C# to do it:

// Requires: using System.IO; using System.Text; using System.Xml; using System.Xml.Xsl;

/// <summary>
/// Takes a string containing the XML to be transformed and a string containing the XSLT that does the transforming.
/// </summary>
public string NewTransform(string xml, string xsl)
{
    // create the xsl transformer (true enables XSLT debugging support)
    XslCompiledTransform t = new XslCompiledTransform(true);

    // load the stylesheet, disposing its reader afterwards
    using (XmlReader reader = XmlReader.Create(new StringReader(xsl)))
    {
        t.Load(reader);
    }

    StringBuilder sb = new StringBuilder();

    // create the reader for the xml and the writer for the output;
    // t.OutputSettings carries the xsl:output settings to the writer
    using (XmlReader input = XmlReader.Create(new StringReader(xml)))
    using (XmlWriter results = XmlWriter.Create(new StringWriter(sb), t.OutputSettings))
    {
        // write the transformed xml out to the stringbuilder
        t.Transform(input, null, results);
    }

    // the writer has been disposed (and so flushed) by this point
    return sb.ToString();
}

/// <summary>
/// Takes an absolute path to a file which is then loaded into a string and returned.
/// </summary>
public string LoadFileAsString(string fullpathtofile)
{
    using (var sr = new StreamReader(fullpathtofile))
        return sr.ReadToEnd();
}

The key to ensuring that the cdata-section-elements attribute, and therefore the CDATA block, is handled correctly is t.OutputSettings: passing these settings to XmlWriter.Create tells the writer what the stylesheet's output element asked for.
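
To see the difference for yourself, here is a rough, self-contained sketch that runs the same stylesheet twice: once with a writer built from OutputSettings and once with a plain writer. The sample XML, the class name CdataTransformDemo and the useOutputSettings parameter are all invented for this illustration; the stylesheet is the Description template and output element shown earlier. If everything behaves as described in this post, only the first result should have its Description content wrapped in a CDATA block.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;
using System.Xml.Xsl;

public static class CdataTransformDemo
{
    // Made-up input: a single Description element holding escaped XHTML.
    private const string SampleXml =
        "<Description>&lt;strong&gt;Some content&lt;/strong&gt;</Description>";

    // The Description template and xsl:output element from this post.
    private const string SampleXsl = @"<xsl:stylesheet version=""1.0""
    xmlns:xsl=""http://www.w3.org/1999/XSL/Transform"">
  <xsl:output method=""xml"" indent=""yes""
      cdata-section-elements=""Description"" omit-xml-declaration=""yes"" />
  <xsl:template match=""Description"">
    <xsl:copy>
      <xsl:value-of select=""node()"" disable-output-escaping=""yes"" />
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>";

    // useOutputSettings is a hypothetical switch added only for this comparison.
    public static string Transform(string xml, string xsl, bool useOutputSettings)
    {
        var transform = new XslCompiledTransform();
        using (XmlReader stylesheet = XmlReader.Create(new StringReader(xsl)))
        {
            transform.Load(stylesheet);
        }

        // Without OutputSettings the writer knows nothing about xsl:output.
        XmlWriterSettings settings = useOutputSettings
            ? transform.OutputSettings
            : new XmlWriterSettings { OmitXmlDeclaration = true };

        var sb = new StringBuilder();
        using (XmlReader input = XmlReader.Create(new StringReader(xml)))
        using (XmlWriter output = XmlWriter.Create(new StringWriter(sb), settings))
        {
            transform.Transform(input, null, output);
        }
        return sb.ToString();
    }

    public static void Main()
    {
        Console.WriteLine("With t.OutputSettings:");
        Console.WriteLine(Transform(SampleXml, SampleXsl, true));
        Console.WriteLine();
        Console.WriteLine("With default writer settings:");
        Console.WriteLine(Transform(SampleXml, SampleXsl, false));
    }
}
```

The writer is disposed before the StringBuilder is read, so nothing is left sitting in the writer's buffer when the result is returned.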

Epilogue:

I battled with all of the above until I stumbled upon this post, so thank you to everyone there:
