Friday, September 30, 2011

XSLT, entities, Java, Xalan...

As already mentioned, I have been using XSLT and Xalan-J for quite a long while. Recently I was working on an HTML-related transform that used a lot of special characters.

Previously I had blindly assumed that named character references are a part of HTML, are not defined for generic XML, and so are not available in XSLT files either. That is no big problem as long as numeric character references are available. But this time I had a good incentive to give it more thought. After all, the very name of XML promises extensibility. So what prevents us from defining named character references for XSLT files too?

In fact, nothing does. A perfectly legal way to define them is to extend the document type declaration of the XSLT file, like this:
<?xml version="1.0" standalone="yes" ?>
<!DOCTYPE transform [
  <!ENTITY % w3centities-f PUBLIC "-//W3C//ENTITIES Combined Set//EN//XML"
      "http://www.w3.org/2003/entities/2007/w3centities-f.ent">
  %w3centities-f;
]>
<xsl:transform version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
...

With this small modification the miracle happened, and the parser stopped complaining about named character references. Everything was fine, but, as one of Murphy's laws states, "If everything seems to be going well, you have obviously overlooked something".

This time the pitfall was in the reference to the entity definitions, "http://www.w3.org/2003/entities/2007/w3centities-f.ent". If we tell the XSLT engine to use an external resource, it naturally has to go and fetch it. Unless we take the necessary care, the only place to go is the Internet. Everything will work, but perhaps more slowly than expected, and it will fail if no Internet connection is available.

The standard way to deal with this is a catalog-based entity/URL resolver. This time, unexpectedly, it did not help. Nothing was wrong with the resolver or the catalog, and still the XSLT engine persistently fetched the entities from the Internet.

The cause of the issue was found in the Xalan-J sources. Perhaps nobody had seriously considered using external entities in an XSLT file before, so the code shows no trace of an entity resolver being applied to the XSLT file itself. It is worth mentioning that those who are not familiar with Xalan-J as such and simply use JAXP are likely to run into the same issue, as Xalan-J is a part of the reference JAXP implementation that is bundled with standard JDK/JRE implementations such as Oracle Java SE and OpenJDK.
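
For readers who have not used an entity resolver before, here is a minimal sketch of what a catalog-based resolver does, reduced to a single hard-coded mapping. It parses an ordinary XML document (not an XSLT file) with the plain JAXP DOM parser, so it says nothing about how Xalan-J loads stylesheets. The class name, the method name and the two-entity stand-in for a local copy of w3centities-f.ent are assumptions made for illustration only.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

public class LocalEntityResolverSketch {
    static final String PUBLIC_ID = "-//W3C//ENTITIES Combined Set//EN//XML";
    static final String SYSTEM_ID =
        "http://www.w3.org/2003/entities/2007/w3centities-f.ent";

    // Hypothetical stand-in for a local copy of the entity set (two entities only).
    static final String LOCAL_ENTITIES =
        "<!ENTITY nbsp  \"&#x000A0;\">\n"
      + "<!ENTITY ndash \"&#x02013;\">\n";

    static String parseWithLocalEntities() throws Exception {
        String xml =
            "<?xml version=\"1.0\"?>\n"
          + "<!DOCTYPE doc [\n"
          + "  <!ENTITY % w3centities-f PUBLIC \"" + PUBLIC_ID + "\"\n"
          + "      \"" + SYSTEM_ID + "\">\n"
          + "  %w3centities-f;\n"
          + "]>\n"
          + "<doc>10&nbsp;&ndash;&nbsp;20</doc>";

        // Serve the entity set locally instead of letting the parser open the URL.
        EntityResolver resolver = (publicId, systemId) -> {
            if (PUBLIC_ID.equals(publicId) || SYSTEM_ID.equals(systemId)) {
                InputSource local = new InputSource(new StringReader(LOCAL_ENTITIES));
                local.setSystemId(systemId);
                return local;
            }
            return null; // anything else: fall back to the parser's default behaviour
        };

        DocumentBuilder builder =
            DocumentBuilderFactory.newInstance().newDocumentBuilder();
        builder.setEntityResolver(resolver);
        Document doc = builder.parse(new InputSource(new StringReader(xml)));
        return doc.getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(parseWithLocalEntities());
    }
}
```

A real catalog resolver does the same lookup, only driven by a catalog file rather than an if statement.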

The Fix


The latest Xalan-J release, 2.7.1, was taken as the basis for the amendments. For those who just need things working, a patched Xalan-J binary build is available as xalan-2.7.1-usn-20110928.jar. Those interested in what is under the hood are welcome to have a look at the modifications.

The fix was submitted to the Xalan-J project as an attachment to issue XALANJ-2544.

Usage – HOW-TO


If you feel like using named character references in your XSLT file:
  1. Add appropriate entities to your XSLT transform/stylesheet, either one by one, like this:
    <?xml version="1.0" standalone="yes" ?>
    <!DOCTYPE transform [
      <!ENTITY nbsp  "&#x000A0;">
      <!ENTITY ndash "&#x02013;">
    ]>
    <xsl:transform version="1.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    ...
    
    or all at once:
    ...
    <!DOCTYPE transform [
      <!ENTITY % w3centities-f PUBLIC "-//W3C//ENTITIES Combined Set//EN//XML"
          "http://www.w3.org/2003/entities/2007/w3centities-f.ent">
      %w3centities-f;
    ]>
    ...
    
  2. If you can limit yourself to a fixed set of entities, or your XSLT engine is not Xalan-based, you are done. Have fun.
  3. If not, go ahead and download the patched Xalan-J binary, xalan-2.7.1-usn-20110928.jar.
  4. Arrange for your JVM to use the patched .jar, keeping in mind that Xalan-J is an endorsed technology. One possible approach is the -Djava.endorsed.dirs=path_to_the_folder_containing_the_new_jar JVM launcher option. See the Oracle Java Endorsed Standards Override Mechanism page for more.
  5. Make sure you use some XML Entity Resolver. You may wish to have a look at Apache XML Commons Resolver as a good candidate.
  6. Get yourself a catalog for your resolver, if you do not have one yet. If starting from scratch, it can be a plain text file in OASIS TR9401 format:
    PUBLIC "-//W3C//ENTITIES Combined Set//EN//XML"
           "file:///path_to_your_local_copy/w3centities-f.ent"
    SYSTEM "http://www.w3.org/2003/entities/2007/w3centities-f.ent"
           "file:///path_to_your_local_copy/w3centities-f.ent"
    
  7. Make your catalog known to the resolver, say, via the xml.catalog.files system property supplied to the JVM launcher: -Dxml.catalog.files=your_catalog_file_name_with_path.
  8. Enjoy.
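
As an illustration of the "one by one" variant from step 1, here is a small sketch that runs a stylesheet with inline entity declarations through plain JAXP. Since no external resource is referenced, nothing has to be fetched or resolved. The class name, the method name, the template body and the dummy input document are assumptions made for this example, and standalone="yes" is left out of the XML declaration for brevity.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class InlineEntitiesSketch {
    // Stylesheet with the two entities declared in the internal DTD subset.
    static final String XSLT =
        "<?xml version=\"1.0\"?>\n"
      + "<!DOCTYPE transform [\n"
      + "  <!ENTITY nbsp  \"&#x000A0;\">\n"
      + "  <!ENTITY ndash \"&#x02013;\">\n"
      + "]>\n"
      + "<xsl:transform version=\"1.0\"\n"
      + "    xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">\n"
      + "  <xsl:output method=\"text\"/>\n"
      + "  <xsl:template match=\"/\">10&nbsp;&ndash;&nbsp;20</xsl:template>\n"
      + "</xsl:transform>";

    static String transform() throws Exception {
        // Whatever XSLT processor JAXP picks up by default is used here.
        TransformerFactory factory = TransformerFactory.newInstance();
        Transformer transformer =
            factory.newTransformer(new StreamSource(new StringReader(XSLT)));

        // The input document is a dummy; the template ignores its content.
        StringWriter out = new StringWriter();
        transformer.transform(
            new StreamSource(new StringReader("<dummy/>")),
            new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(transform());
    }
}
```

The text output method is used only to make the resulting characters easy to inspect; an HTML or XML output method would work with the same entity declarations.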

1 comment:

  1. Thank you for this. I've been struggling to juggle XSL->HTML with some extra entities defined for some months now!