Sunday, September 4, 2011

Java, Xalan, Unicode command line arguments...

I had always taken it on faith that Java is Unicode-friendly by design, and I have been using Xalan-J happily for quite a long while.

It was only recently that I ran into a Unicode-related problem with Java, and it happened to involve Xalan-J. Strictly speaking, Xalan-J was not to blame. But it was the Xalan-J Command-Line Utility that refused to do its job when given file names containing characters missing from my local system code page.

It did not take a long search to discover that the problem is known and platform-specific, or at least specific to Windows. The bug has been known at least since the days when the yet-to-come Java 6 was still called Mustang ("Support for Unicode in Mustang ?", 2005). It was discussed later on JavaRanch (Unicode: cmd parameters (main args); exec parameters; filenames) and on the OpenJDK 'core-libs-dev' mailing list (RFE 4519026: (process) Process should support Unicode on Win NT, request for review, and Unicode support in Java JRE on Windows).

The bug seems to date back to the days when Java had to run on both Unicode and non-Unicode (e.g. Windows 9x) platforms, so using non-Unicode system calls in the Java launcher code was somehow justified. A lot of water has flowed under the bridge since then, and Windows 9x is no longer supported by new JVM versions... And still my quick experiment revealed that the new Java 7 comes with the same bug in place...

The right approach would be to get the bug fixed in the Java launcher. But unfortunately it has been a long time since I wrote anything in C/C++... So the quick-and-dirty solution turned out to be a Java wrapper.

Wrapper Implementation


The first idea was to write a wrapper just for Xalan-J. But then I remembered the other command-line utilities written in Java and came up with the idea of a general-purpose command line wrapper.

The wrapper had to make it possible to handle Unicode parameters (e.g. file names) at the command shell level, and then hand all the arguments to Java without passing them on the java command line. The solution was to pass the arguments via a temporary file, one argument per line, written by the command shell in some flavor of Unicode and then read in by the Java wrapper. There is no rich variety of convenient Unicode flavors under Windows: the cmd tool only offers easy output in UTF-16LE, via the /u switch.
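What follows is a minimal sketch of that idea, not the source of the downloadable wrapper. The class name ArgFileLauncherSketch and its methods are hypothetical. The sketch assumes the file is UTF-16LE with one argument per line, and that each line may carry a trailing space left over from "echo value >file". It reads the file, then calls the target class's main method through reflection.

```java
import java.io.IOException;
import java.lang.reflect.Method;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; the real wrapper may work differently.
class ArgFileLauncherSketch {

    // Reads one argument per line from a UTF-16LE file,
    // such as one produced by "cmd /u /c echo ...".
    static List<String> readArgs(Path argFile) throws IOException {
        String text = new String(Files.readAllBytes(argFile), StandardCharsets.UTF_16LE);
        List<String> args = new ArrayList<>();
        for (String line : text.split("\r?\n")) {
            // Drop a byte order mark in case some other tool wrote one.
            if (line.startsWith("\uFEFF")) {
                line = line.substring(1);
            }
            // Assumption: trimming is acceptable, because echo keeps the
            // space that precedes the redirection character.
            line = line.trim();
            if (!line.isEmpty()) {
                args.add(line);
            }
        }
        return args;
    }

    // Invokes the public static main(String[]) of the named class.
    static void launch(String mainClass, List<String> args) throws Exception {
        Method main = Class.forName(mainClass).getMethod("main", String[].class);
        main.invoke(null, (Object) args.toArray(new String[0]));
    }

    public static void main(String[] argv) throws Exception {
        // argv[0]: class to run; argv[1]: path of the argument file.
        launch(argv[0], readArgs(Paths.get(argv[1])));
    }
}
```

Two consequences of this design: the path of the argument file itself still travels on the ordinary command line, so it should contain only characters from the system code page, and trimming means that arguments with intentional leading or trailing spaces would be altered.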

The resulting wrapper is available for download and free use under the Apache 2.0 license, as both source and a compiled .jar.

Other command-line tools that come to mind besides Xalan-J, and that might benefit from this wrapper, include Batik, FOP, CSSToXSLFO, ...

Usage – HOW-TO


  1. Download the .jar file and save it to a location of your choice.
  2. Prepare / modify the script you use to launch your command-line tool. Say, for Xalan-J your script might contain something like:
    @cmd /u /c echo -IN >%TMP%\xalan-args.tmp
    @cmd /u /c echo %~f1 >>%TMP%\xalan-args.tmp
    @cmd /u /c echo -XSL >>%TMP%\xalan-args.tmp
    @cmd /u /c echo %~f2 >>%TMP%\xalan-args.tmp
    @cmd /u /c echo -OUT >>%TMP%\xalan-args.tmp
    @cmd /u /c echo %~f3 >>%TMP%\xalan-args.tmp
    @if not %4. == .  cmd /u /c echo %~4 >>%TMP%\xalan-args.tmp
    @if not %5. == .  cmd /u /c echo %~5 >>%TMP%\xalan-args.tmp
    ...
    java [...] usn.unicode.UnicodeLauncher org.apache.xalan.xslt.Process %TMP%\xalan-args.tmp
    
    Ensure that usn-unicode-20110903.jar is on your class path. Note that every redirection uses the doubled right angle bracket (>>, append), except on the first line, where a single > creates or overwrites the file. Refer to the "Using batch parameters" Microsoft page for syntax like %~f1. Note also the use of %~4 instead of %4, which removes surrounding quotes, if any.
  3. Have fun
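
To check whether arguments reach a Java program intact, a small diagnostic class can help. The sketch below is hypothetical (the name ArgDumpSketch is not part of the wrapper). It prints every argument with non-ASCII characters written as backslash-u escapes, so the console code page cannot hide the result. When the class is started directly with a file name outside the system code page, the lost characters typically show up as ? or as look-alike substitutes. Started through the launcher, in place of org.apache.xalan.xslt.Process in the script above, the escapes should match the original name.

```java
// Hypothetical diagnostic; prints each argument in an ASCII-only form.
class ArgDumpSketch {

    // Keeps printable ASCII as is and escapes every other UTF-16 code unit.
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c >= 0x20 && c < 0x7F) {
                sb.append(c);
            } else {
                sb.append(String.format("\\u%04X", (int) c));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        for (int i = 0; i < args.length; i++) {
            System.out.println(i + ": " + escape(args[i]));
        }
    }
}
```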
