Sablotron 0.60

Tom Kaiser (Ginger Alliance)

June 17, 2001

Abstract

This is a description of the current version of the XSLT processor called Sablotron, including an overview of its limitations as compared to the XSLT specification.

Contents

  1  This text
  2  Changes from the last release
  3  Introduction
  4  The sources
  5  Implementation. Supported instructions and functions
  6  Other implementation-related notes
  7  The C interface
  8  The command line interface
  9  References

1  This text

The HTML form of this description was compiled by Sablotron from the XML source Sablot-0-60.xml.

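The following sections cover the changes since the last release, an introduction to XSLT and Sablotron, the sources, the supported instructions and functions, other implementation-related notes, the C interface, and the command line interface.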

2  Changes from the last release

Please see the RELEASE file.

3  Introduction

3.1  XSLT

XSLT is a language for transforming given XML data (the input) according to a stylesheet. XSLT stylesheets are themselves XML documents; that is, all instructions of the language are expressed in the form of XML elements. The output, i.e. the result of the processing, is typically an XML document as well, although the syntactic requirements can be relaxed to allow the creation of an HTML document (one that contains unclosed tags and the like), or even plain text.

XSLT was designed by the World Wide Web Consortium (W3C) as a part of the XSL stylesheet language, where it is complemented by a powerful set of formatting instructions. The most precise information about XSLT can be found in the W3C Recommendation [XSLT]. In particular, Appendix B of the Recommendation contains a handy syntax table. A good tutorial is [XMLBible14].

Other W3C Recommendations one often needs to consult are [XML] (for the definition of the XML language) and [XPath] (for details on XPath, the language used to form expressions in XSLT and elsewhere).

An excellent source of information about XSLT (indeed, about anything related to XML and SGML) is [Cover]; see also [XSLINFO] and [XMLorg].

3.2  On Sablotron

Sablotron is an XSLT processor (though not yet fully conforming; see below) written in C++. Since the machines where it is meant to run include various small mobile clients, the main objectives of its design are the following:

  • portability,
  • compact code,
  • as much independence from other resources (Java etc.) as possible.

Sablotron is a single shared library (sablot.dll or libsablot.so.0.60). It can also be used from the command line via the simple interface called sabcmd. See section 8 for more information.

The only software Sablotron relies on is expat, the XML parser by James Clark. See below for information on how to get expat.

For information on the available interfaces, e.g. for Python, Perl and PHP, see www.gingerall.com.

4  The sources

Sablotron is written in C++. The source files compile under Win32 (using MS Visual C++ 6.0) and on Solaris and Linux (using g++ 2.95.2) without change.

4.1  Getting the sources

The source or binary distributions of Sablotron can be downloaded from www.gingerall.com. For instructions on how to build the sources (if any), refer to the accompanying INSTALL file.

If you have access to the Ginger Alliance CVS server, you can get the working version of Sablotron in the CVS module ga. The access rights can be obtained on request from the CVS admin.

Since version 0.50, Sablotron uses expat 1.95.1, available from SourceForge.

4.2  Joining the development

Sablotron is an open source project and all volunteers are most welcome! The documentation of the sources is still somewhat sparse but we will try to improve it. If you find the invitation to work on Sablotron with us interesting, please contact us. There is also a mailing list available, see www.gingerall.com.

5  Implementation. Supported instructions and functions

The instruction set supported by this version of Sablotron is already sufficient for many transformation tasks (e.g. the task of formatting this document). On the other hand, a comparison of it to the XSLT specification [XSLT] shows that much is still to be done. The purpose of the following sections is to describe the varying degree of support for the elements of the XSLT language.

It may be helpful to refer to the syntax table in Appendix B of [XSLT]. Any instruction or attribute that is not listed below as unsupported can be assumed to be implemented. The authors will appreciate being told about any omissions found in the following description.

For readability, I sometimes omit the xsl: prefix from the instruction names.

5.1  Templates

template, apply-templates, call-template

Fully implemented. xsl:sort has been supported since release 0.50.

5.2  Conditional processing

if, choose, when, otherwise

Fully implemented.

5.3  Loops

for-each

Fully implemented.

5.4  Variables and parameters

variable, param, with-param

Fully implemented. Top-level variables and parameters are read in the document order, so no forward references are resolved. This is a minor deviation from the spec.

5.5  Element creation

element, attribute, text, comment, processing-instruction, attribute-set

xsl:attribute-set is not implemented. For the rest, name is the only recognized attribute (where applicable). Literal result elements work.

5.6  Global definitions

stylesheet, transform, output

For stylesheet and transform, the only recognized attribute is version. xsl:output should work (see below for notes on the encoding attribute). HTML indentation has been added in 0.60.

5.7  Values and copying

value-of, copy, copy-of

copy-of and value-of are fully implemented. copy is implemented except for the use-attribute-sets attribute.

5.8  Namespace processing

namespace-alias

Namespaces should be processed correctly. The namespace-alias instruction is now supported (patch by Major).

5.9  Sorting

sort

xsl:sort has been implemented since 0.50. There are minor limitations:

  • currently, the lang attribute may only contain the values "en" or "cz".
  • case-order cannot be specified.

5.10  Whitespace stripping

strip-space, preserve-space

Only the default whitespace stripping is done. That is, all whitespace-only text nodes in any stylesheet, not appearing inside an xsl:text, are removed. The two instructions for whitespace stripping and preservation are unsupported.

5.11  Includes

include, import, apply-imports

Only xsl:include is implemented. Processing involving multiple documents works, but needs more testing, e.g. with respect to generate-id().

5.12  Other unimplemented instructions

  • xsl:key,
  • xsl:number,
  • xsl:fallback.

5.13  Output conformance

The output mechanism is much closer to the spec than in the versions prior to 0.4. The following issues remain for the html method:

  • Output the boolean attributes correctly.
  • Disable the escaping inside <SCRIPT> and <STYLE>.

5.14  XPath expressions

Almost all features of XPath are fully implemented. This means there should be no problems with expressions of any kind.

One exception relates to axes. The following and preceding axes haven't been implemented yet.

Another possible exception may be numbers; we have not yet thoroughly tested rounding, NaNs, infinity, etc.

5.15  Built-in functions

Only a few functions from the standard function library remain unimplemented:

  • id(),
  • lang() (accepted but always returns true),
  • key(),
  • format-number(),
  • unparsed-entity-uri().

As for the functions that are implemented, the following is a list of differences from the spec:

  • document() only accepts one argument, always getting the base URI from the stylesheet URI.
  • string-length() returns the byte length of the UTF-8 representation of the string. This differs from the length in characters whenever the string contains non-ASCII characters.
  • generate-id() might fail to generate unique identifiers when several input documents are present (giving the same id to nodes from different documents).
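To illustrate the string-length() deviation listed above, the following minimal C sketch counts the characters of a UTF-8 string so that the result can be compared with its byte length. The helper utf8_char_count is hypothetical and not part of Sablotron, and it assumes well-formed UTF-8 input.

```c
#include <stddef.h>

/* Hypothetical helper, not part of Sablotron: counts the characters
   (code points) in a well-formed, null-terminated UTF-8 string.
   Continuation bytes have the bit pattern 10xxxxxx and are skipped,
   so every character is counted exactly once, at its first byte. */
size_t utf8_char_count(const char *s)
{
    size_t count = 0;
    for (; *s != '\0'; ++s) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            ++count;
    }
    return count;
}
```

For the five-byte string "caf\xC3\xA9" (the word "café" encoded in UTF-8), strlen() gives 5 while utf8_char_count() gives 4. According to the description above, Sablotron's string-length() would report the byte length.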

6  Other implementation-related notes

6.1  Handlers

It is possible for the user to supply the following handlers to Sablotron:

  • message handler (to bypass the default way of displaying error and warning messages and logging),
  • scheme handler (to retrieve documents whose URIs use an unsupported scheme),
  • streaming handler (an expat-like interface to the XML document which is the result of the processing),
  • 'miscellaneous' handler (which will probably serve as a collection of odd callbacks).

The handlers are set using SablotRegHandler(). For details concerning the interface of these handlers, consult the header files sablot.h and shandler.h.

6.2  Encodings

In version 0.52, the encoding conversion capabilities of Sablotron have been much extended. The most important fact is the following: if you have the iconv library installed on your system, you can use any encoding it supports (that is, almost any encoding whatsoever) for both the input and the output documents. Iconv is available on most systems (it is a standard part of glibc2, for instance). There are implementations for Win32 as well.

If iconv is not available, the encoding may still be supported internally by Sablotron. At present, the list of such encodings is rather short: besides UTF-8, these are UTF-16, ASCII, iso-8859-1, iso-8859-2 and windows-1250 on input, none on output. However, we plan to implement a semi-independent lightweight conversion library for use on systems without iconv, extending the set of internally supported encodings considerably.

Lastly, the user has the option to implement a custom encoding conversion handler, which will be asked to perform any unsupported conversion. See the shandler.h header file for details.

The default input and output encoding is in all cases UTF-8.

6.3  Output methods

In addition to the standard output methods (xml, html and text), it is possible to output xhtml. Documents output using this method obey the XHTML 1.0 rules (in particular, all empty elements are closed). To choose the method, use <xsl:output method='xhtml'/>. Please note that the name of this method may change, since the XSLT spec requires any processor-specific methods to have qualified names, say sab:xhtml. On the other hand, the name xhtml is being considered in the XSLT 2.0 working draft.

6.4  URIs

Sablotron can handle two URI schemes natively: 'file' and 'arg' (see below). Moreover, it is possible to use the function SablotRegSchemeHandler to register an external scheme handler which will receive requests in all other schemes. See the documentation in sablot.h and shandler.h.

Relative URI references are resolved in conformance to RFC 2396. The base URI is well defined when the relative reference appears inside an XML document; when invoking sabcmd, the base URI is taken to correspond to the current working directory.

When specifying filenames, the following rules are in effect:

  • specify the "file:" scheme for any standard files, i.e. refer to stdin as file://stdin etc.
  • slashes and backslashes work equally fine, in Windows as well as Linux.
  • to include a drive letter under Windows (e.g. C:\doc.xml), it is necessary to say file://c:/doc.xml.

6.5  Named buffers

Sablotron introduces a URI scheme 'arg:' which enables one to use strings in named memory buffers. The buffer names can have a tree-like structure so that a relative reference from a document in a buffer can be resolved as pointing to another buffer.

For instance, if we invoke Sablotron specifying that a buffer named /mybuf/1 contains the string "<a>contents</a>", then the expression

document('arg:/mybuf/1')/a

has string-value "contents". If the document in arg:/mybuf/1 contained a relative URI reference "../theirbuf/2" then this would be resolved as pointing to "arg:/theirbuf/2".
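The resolution of a relative reference against an 'arg:' base can be sketched in C as follows. This is a simplified illustration and not Sablotron's actual resolver: the function resolve_arg_ref is hypothetical, and it only handles relative paths made of plain segments, "." and ".." (no absolute references, queries or fragments).

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of Sablotron: resolves the relative
   reference rel against an "arg:/..." base URI and writes the result
   to out. Returns 0 on success, nonzero on error. */
int resolve_arg_ref(const char *base, const char *rel,
                    char *out, size_t outsize)
{
    const char *path, *last_slash, *seg;
    size_t len;

    if (strncmp(base, "arg:/", 5) != 0)
        return 1;                           /* not an 'arg:' URI */
    path = base + 4;                        /* e.g. "/mybuf/1" */
    last_slash = strrchr(path, '/');        /* path starts with '/' */
    len = (size_t)(last_slash - path) + 1;  /* keep the trailing slash */
    if (4 + len + strlen(rel) + 1 > outsize)
        return 1;                           /* result might not fit */
    memcpy(out, "arg:", 4);
    memcpy(out + 4, path, len);
    len += 4;                               /* out holds e.g. "arg:/mybuf/" */

    seg = rel;
    while (*seg != '\0') {
        size_t seglen = strcspn(seg, "/");
        if (seglen == 2 && seg[0] == '.' && seg[1] == '.') {
            /* drop the last directory, but never go above "arg:/" */
            if (len > 5) {
                --len;                      /* step over the trailing '/' */
                while (out[len - 1] != '/')
                    --len;
            }
        } else if (!(seglen == 1 && seg[0] == '.')) {
            memcpy(out + len, seg, seglen); /* ordinary segment */
            len += seglen;
            if (seg[seglen] == '/')
                out[len++] = '/';
        }
        seg += seglen;
        if (*seg == '/')
            ++seg;
    }
    out[len] = '\0';
    return 0;
}
```

With the base "arg:/mybuf/1" and the reference "../theirbuf/2", the sketch produces "arg:/theirbuf/2", matching the example above.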

6.6  Error and log messages

By default, Sablotron writes error and warning messages to stderr, and does no logging. By calling SablotSetLog(), you can specify the name of the log file to be used.

Besides, you can use SablotRegHandler() to override the default message handling. The handler you register will receive all messages in a structured form that's easy to process and filter. For details, see the documentation in sablot.h and shandler.h.

7  The C interface

This section describes the functions exported from the Sablotron library. All of them have a return type of 'int' and return an error flag (nonzero signals an error). Errors are reported to the user by Sablotron itself.

7.1  Shortcuts

We'll first describe the 'shortcuts' that do the whole processing in one call.

int SablotProcess(char *sheetURI, char *inputURI, char *resultURI, char **params, char **arguments, char **resultArg);

This is the basic function. The first three of its arguments are the URIs of the XSLT stylesheet, the XML source and the resulting document, respectively. For some notes on specifying file names, see above.

params is an array of pointers to the names and contents of the top-level stylesheet parameters. Thus, params[0] is a pointer to the null-terminated name of the first parameter, params[1] points to the (null-terminated) contents of the first parameter. The following two array items do the same for the second parameter, etc. The whole array is terminated by a NULL pointer in place of the name. If no parameters are to be passed, you can specify NULL for params itself.

arguments is a similar array of named buffers to be passed to the stylesheet. (They can be referred to via the 'arg:' scheme, see above.) Again, the array is a sequence of (name, value) pairs terminated by NULL in place of a name. If no named buffers are to be passed, you can specify NULL for arguments itself.
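The (name, value) layout shared by params and arguments can be illustrated with a small C sketch. The helpers pair_lookup and pair_count are hypothetical and not part of Sablotron; they only show how such an array is walked, assuming it follows the layout described above.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of Sablotron: walks an array laid out
   as name, value, name, value, ..., NULL and returns the value stored
   under the given name, or NULL if the name is not present. */
const char *pair_lookup(char **pairs, const char *name)
{
    size_t i;
    if (pairs == NULL)              /* NULL stands for "no pairs" */
        return NULL;
    for (i = 0; pairs[i] != NULL; i += 2) {
        if (strcmp(pairs[i], name) == 0)
            return pairs[i + 1];
    }
    return NULL;
}

/* Hypothetical helper: number of (name, value) pairs in the array. */
size_t pair_count(char **pairs)
{
    size_t n = 0;
    if (pairs == NULL)
        return 0;
    while (pairs[2 * n] != NULL)    /* a NULL name ends the array */
        ++n;
    return n;
}
```

An array declared as char *params[] = { "use_defaults", "1", "title", "Report", NULL }; holds two parameters in this layout; the parameter names and values here are made up for illustration.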

resultArg enables one to access the resulting document in case the output went to a named buffer. In that situation, *resultArg points to the resulting null-terminated string, allocated by Sablotron. You can pass NULL for resultArg if the output is sure to go to a file.

Note: When you are done processing the string pointed to by *resultArg, free it using SablotFree() - never use free(). The latter is guaranteed to produce a segmentation fault under Linux.

int SablotProcessFiles(char *styleSheetName, char *inputName, char *resultName);

A wrapper for SablotProcess() working on files. The parameters are the null-terminated file names of the XSLT stylesheet, the XML input and the result, respectively. Sablotron opens these files itself and closes them after the processing is complete. Values like "file://stdin" are allowed.

int SablotProcessStrings(char *styleSheetStr, char *inputStr, char **resultStr);

Another wrapper for SablotProcess(), this time for accessing named buffers (i.e. user-allocated memory blocks) only. Thus, the first parameter is a null-terminated string containing the whole stylesheet; the second parameter is a null-terminated string containing the XML input. Sablotron allocates the buffer for the resulting string and returns a pointer to it in resultStr. Hence, invoking puts(*resultStr) after having called SablotProcessStrings sends the result to stdout. The buffer allocated must be freed by calling the function SablotFree described below.

7.2  Basic functions

The above shortcuts just call the basic, lower-level functions described below. Note that if you need to set options for logging etc., you may need to use the low-level functions.

A typical processing session may look like this:

          SablotHandle p;
          char *my_buf;
          SablotCreateProcessor(&p);
          SablotSetLog(p, ...);
          /* ...set other instance-specific options here... */
          SablotRunProcessor(p, ...);
          /* copy the result out of the named buffer */
          SablotGetResultArg(p, "arg:/somename", &my_buf);
          /* ...do something with my_buf... */
          SablotFree(my_buf);  /* never use free() on this buffer */
          /* can run the processor again if necessary */
          SablotRunProcessor(p, ...);
          SablotDestroyProcessor(p);
      

int SablotCreateProcessor(SablotHandle *processorPtr);

Creates an instance of Sablotron and returns a pointer to it in *processorPtr. This pointer is passed in all subsequent calls that use this instance.

int SablotDestroyProcessor(SablotHandle processor_);

Destroys an instance of the processor, deallocating all the memory used up by it.

int SablotRunProcessor(SablotHandle processor_, char *sheetURI, char *inputURI, char *resultURI, char **params, char **arguments);

Processes documents using the given processor instance and given params and args definitions. See SablotProcess().

int SablotGetResultArg(SablotHandle processor_, char *argURI, char **argValue);

Copies the result 'arg' buffer with the given URI, returning a pointer to the newly-allocated block in *argValue. If no such buffer exists, returns NULL in *argValue.

This function is necessary because, if the result document is output to memory, it would otherwise be lost when SablotDestroyProcessor() is called. When deallocating the copy obtained from SablotGetResultArg(), use SablotFree() (never free()).

int SablotFreeResultArgs(SablotHandle processor_);

Removes the Sablotron-internal copies of the 'arg' buffers from the last Sablotron run. Normally, there should be no reason to call this function as it is called automatically on both SablotRunProcessor() and SablotDestroyProcessor().

int SablotFree(char *resultBuf);

This function frees a result buffer allocated by Sablotron, such as the one returned by a previous call to SablotProcess(), SablotProcessStrings() or SablotGetResultArg(). Calling it with an invalid pointer will cause a crash.

int SablotRegHandler( SablotHandle processor_, HandlerType type, void *handler, void *userData);

Registers an external handler. type can be HLR_MESSAGE, HLR_SCHEME, HLR_SAX, HLR_MISC or HLR_ENC. handler points to the callback vector of the appropriate type. userData is a data item to be passed to all callbacks of this particular handler. For details, check the sablot.h and shandler.h header files.

int SablotUnregHandler( SablotHandle processor_, HandlerType type, void *handler, void *userData);

Unregisters the given external handler. For details, check the sablot.h and shandler.h header files.

int SablotSetLog( SablotHandle processor_, const char *logFilename, int logLevel);

Sets the log filename. The logLevel parameter is currently not used. Pass NULL for logFilename to turn logging off (default).

The other functions published by sablot.h have been included for experimental reasons or for compatibility, and it is better not to use them.

int SablotClearError(SablotHandle processor_);

Clears the 'pending error' flag for this instance of Sablotron.

7.3  Generalized interface functions

The implementation of the DOM interface brought the need to extend some of the functions described in the previous section. This extension enables the user to:

  • process documents created by the DOM functions, and
  • process frequently used documents in pre-parsed form.

An object called situation is used to provide a persistent context for all calls to the DOM-related functions. Functions used to manipulate the situation are described in the following section.

Note: If not specified otherwise, all these functions return an error code. A positive value indicates an error.

int SablotCreateDocument(SablotSituation S, SDOM_Document *D);

Creates an empty document. Typically followed by calls to DOM functions to populate the document.

int SablotDestroyDocument(SablotSituation S, SDOM_Document D);

Destroys a document, freeing all the nodes it has created.

int SablotParse(SablotSituation S, const char *uri, SDOM_Document *D);

Reads in a document from the given URI.

int SablotParseBuffer(SablotSituation S, const char *buffer, SDOM_Document *D);

Reads in a document from the given in-memory buffer.

These functions have variants to be used if the document is to be interpreted as an XSLT stylesheet, namely SablotParseStylesheet and SablotParseStylesheetBuffer.

The following functions generalize SablotRunProcessor in that they make it possible to utilize an extra kind of source document: a DOM tree.

int SablotRunProcessorGen(SablotSituation S, void *processor_, char *sheetURI, char *inputURI, char *resultURI);

A key ingredient of the extended interface. Only the URIs of the sources and of the result document are given to it. The rest of the information passed to SablotRunProcessor is conveyed through SablotAddArgBuffer, SablotAddArgTree and SablotAddParam. The scheme part of the stylesheet URI or the input URI may be "arg:", in which case they refer to a buffer or tree passed by these functions.

int SablotAddArgBuffer(SablotSituation S, void *processor_, const char *argName, const char *bufferValue);

Creates a named buffer for the next processor run. The buffer's name and contents are passed as arguments. The name is interpreted relative to the 'arg:/' scheme.

int SablotAddArgTree(SablotSituation S, void *processor_, const char *argName, SDOM_Document tree);

Associates the given document with a name for the next processor run. The document is not destroyed after the run is finished. The name is interpreted relative to the 'arg:/' scheme.
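How a buffer or tree name relates to the URI later passed to SablotRunProcessorGen can be sketched in C as follows. The helper arg_uri_from_name is hypothetical and not part of Sablotron; it assumes that a name without a leading slash is treated as if the slash were present, so that both forms land directly under 'arg:/'.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of Sablotron: builds the 'arg:' URI
   that corresponds to a buffer or tree name. Names are taken relative
   to 'arg:/', so a missing leading slash is supplied. Returns 0 on
   success, nonzero if out is too small. */
int arg_uri_from_name(const char *name, char *out, size_t outsize)
{
    const char *prefix = (name[0] == '/') ? "arg:" : "arg:/";
    size_t plen = strlen(prefix);
    size_t nlen = strlen(name);

    if (plen + nlen + 1 > outsize)
        return 1;
    memcpy(out, prefix, plen);
    memcpy(out + plen, name, nlen + 1);  /* includes the terminating NUL */
    return 0;
}
```

Under this assumption, the name "the_input" corresponds to "arg:/the_input" and the name "/mybuf/1" to "arg:/mybuf/1".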

int SablotAddParam(SablotSituation S, void *processor_, const char *paramName, const char *paramValue);

Adds a global stylesheet parameter for the next processor run.

7.4  The situation object

At present, the situation object primarily holds information on any pending errors. A situation is created using

int SablotCreateSituation(SablotSituation *SP);

and destroyed by

int SablotDestroySituation(SablotSituation S);

To clear the pending error flag in a situation, use

int SablotClearSituation(SablotSituation S);

The following self-explanatory functions extract parts of the error information from the situation:

const char *SablotGetErrorURI(SablotSituation S);
int SablotGetErrorLine(SablotSituation S);
const char *SablotGetErrorMsg(SablotSituation S);

7.5  Document Object Model (DOM) functions

Starting with version 0.60, Sablotron implements a major subset of the DOM Level 1 Core Specification [DOM]. A brief description of the implemented interface follows; for more details, please refer to the header file named sdom.h.

All of the names related to the DOM interface start with SDOM_ (for Sablot DOM).

Major new types are SDOM_Document (a DOM tree) and SDOM_Node (a node of the tree). A document can also be used in place of a node. This reflects the fact that, in the DOM spec, Document is a subclass of Node. When used in this way, the document represents its own root node (which is not the same as the 'root element').

Other types include:

  • SDOM_char: a DOM character type. Currently, this is just char. Note that the DOM spec requires that the DOM implementations work with UTF-16. Sablotron deviates from this by using UTF-8 instead. A separate set of functions taking UTF-16 strings will be provided.
  • SDOM_NodeType: a node type enum. Some of the values are SDOM_ELEMENT_NODE, SDOM_ATTRIBUTE_NODE and SDOM_TEXT_NODE. See sdom.h for the rest.
  • SDOM_NodeList: a node list returned by some of the functions.
  • SDOM_Exception: DOM exception codes enum, with values such as SDOM_NOT_FOUND_ERR or SDOM_INVALID_NODE_TYPE. See sdom.h for details.

The functions listed below are implemented more or less as defined in the DOM Level 1 Specification, with two exceptions: their names are prefixed with SDOM_ and the first argument is always a SablotSituation. All the functions return an SDOM_Exception.

  • createElement, createAttribute, createTextNode, createCDATASection, createComment, createProcessingInstruction
  • getNodeType, getNodeName, setNodeName, getNodeValue, setNodeValue
  • getParentNode, getFirstChild, getLastChild, getPreviousSibling, getNextSibling, getOwnerDocument
  • insertBefore, appendChild, removeChild, replaceChild
  • cloneNode
  • getAttribute, setAttribute, removeAttribute, getAttributeList

Several functions have been added:

  • disposeNode frees all memory used by the given node
  • cloneForeignNode clones a node from a different document
  • docToString serializes the document, returning the resulting string
  • xql performs an XPath query on the DOM tree, returning a list of the nodes satisfying it.

In addition, there are some functions used to manipulate the node lists returned by xql and getAttributeList. These include getNodeListLength, getNodeListItem and disposeNodeList.

Finally, there are functions to extract DOM exception-related information from the situation object, namely getExceptionCode, getExceptionMessage and getExceptionDetails.

8  The command line interface

Sablotron comes with a command-line interface to the shared library, which is a program named sabcmd. At present, sabcmd is invoked as follows:

sabcmd [options] stylesheet [input [result]] [assignments]

The arguments are the URIs of the XSLT stylesheet, the XML input document, and the resulting document, respectively. The default for input is file://stdin (meaning plain old stdin); result defaults to file://stdout. Filenames have to include the extension (if any).

You can display the list of available options by typing sabcmd --help. Among the more useful ones are --log-file (for setting the log file) and --measure (measures and outputs the total processing time).

The rules for filenames are the same as with SablotProcess().

assignments is a series of definitions of the form:

name1=value1 name2=value2 ...

assigning values to top-level stylesheet parameters and to named buffers. These two cases are distinguished by a leading '$' in the name of a stylesheet parameter. The names of the buffers do not start with "arg:". They may start with a slash; if they don't, the slash is prepended.
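The distinction between the two kinds of assignment can be sketched in C as follows. This is an illustration of the rules just described and not sabcmd's actual code; the enum and the function classify_assignment are hypothetical names.

```c
#include <stddef.h>
#include <string.h>

enum assign_kind { ASSIGN_PARAM, ASSIGN_BUFFER, ASSIGN_INVALID };

/* Hypothetical helper, not sabcmd's actual code: splits "name=value".
   A leading '$' marks a stylesheet parameter and is stripped from the
   name; any other name denotes a buffer and gets a leading slash if it
   lacks one. *value_out is set to point at the value inside arg. */
enum assign_kind classify_assignment(const char *arg, char *name_out,
                                     size_t name_size,
                                     const char **value_out)
{
    const char *eq = strchr(arg, '=');  /* first '=' ends the name */
    size_t name_len, prefix = 0;
    enum assign_kind kind;

    if (eq == NULL || eq == arg)
        return ASSIGN_INVALID;          /* no '=' or empty name */
    if (arg[0] == '$') {
        kind = ASSIGN_PARAM;
        ++arg;                          /* strip the '$' */
        if (arg == eq)
            return ASSIGN_INVALID;      /* "$=value" has no name */
    } else {
        kind = ASSIGN_BUFFER;
        if (arg[0] != '/')
            prefix = 1;                 /* slash must be prepended */
    }
    name_len = (size_t)(eq - arg);
    if (prefix + name_len + 1 > name_size)
        return ASSIGN_INVALID;          /* name does not fit */
    if (prefix)
        name_out[0] = '/';
    memcpy(name_out + prefix, arg, name_len);
    name_out[prefix + name_len] = '\0';
    *value_out = eq + 1;
    return kind;
}
```

In this sketch, "$use_defaults=1" is classified as a parameter named use_defaults with the value 1, and "the_input=<a/>" as a buffer named /the_input containing <a/>.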

Note: In most cases, it will be necessary to quote the individual assignments. Whether to use single or double quotes depends on the shell used: single quotes work in bash, double quotes work in Windows.

If the result URI refers to a named buffer, the output would normally remain buried in memory. Sabcmd dumps the buffer to standard output instead.

To sum up and give an example, the following would be a valid invocation of sabcmd:

sabcmd sheet.xsl arg:/the_input "the_input=<a/>" "$use_defaults=1"

This processes the document passed in the buffer named the_input, using a stylesheet found in file "sheet.xsl" in the working directory. We assign 1 to the top-level parameter called "use_defaults". The output goes to stdout by default.

9  References

[XSLT]
XSL Transformations (XSLT) Version 1.0
[XPath]
XML Path Language (XPath) Version 1.0
[XML]
Extensible Markup Language (XML) 1.0
[DOM]
Document Object Model Level 1 Specification, Version 1.0
[Cover]
The XML Cover Pages
[XMLorg]
XML.org
[XSLINFO]
XSLINFO.com
[XMLBible14]
Harold, E. R.: XML Bible, Chapter 14 (online presentation)

(c) 2000 Ginger Alliance s.r.o.