Tom Kaiser (Ginger Alliance)
June 17, 2001
The HTML form of this description was compiled by Sablotron from the XML source Sablot-0-60.xml.
XSLT is a language for transforming given XML data (the input) according to a stylesheet. XSLT stylesheets are themselves XML documents; that is, all instructions of the language are expressed in the form of XML elements. The output, i.e. the result of the processing, is typically an XML document as well, although the syntactic requirements can be relaxed to allow the creation of an HTML document (one that contains unclosed tags and the like), or even plain text.
XSLT was designed by the World Wide Web Consortium (W3C) as a part of the XSL stylesheet language, where it is complemented by a powerful set of formatting instructions. The most precise information about XSLT can be found in the W3C Recommendation [XSLT]. In particular, Appendix B of the Recommendation contains a handy syntax table. A good tutorial is [XMLBible14].
Other W3C Recommendations one often needs to consult are [XML] (for the definition of the XML language) and [XPath] (for details on XPath, the language used to form expressions in XSLT and elsewhere).
An excellent source of information about XSLT (indeed, about anything related to XML and SGML) is [Cover]; see also [XSLINFO] and [XMLorg].
Sablotron is an XSLT processor (though not yet fully conforming; see below) written in C++. Since the machines where it is meant to run include various small mobile clients, the main objectives of its design are the following:
Sablotron is a single shared library (sablot.dll or libsablot.so.0.60). It can also be used from the command line via the simple interface called sabcmd. See the description of sabcmd below for more information.
The only software Sablotron relies on is expat, the XML parser by James Clark. See below for information on how to get expat.
For information on the available interfaces, e.g. for Python, Perl and PHP, see www.gingerall.com.
Sablotron is written in C++. The source files compile under Win32 (using MS Visual C++ 6.0) and on Solaris and Linux (using g++ 2.95.2) without change.
The source or binary distributions of Sablotron can be downloaded from www.gingerall.com. For instructions on how to build the sources (if any), refer to the accompanying INSTALL file.
If you have access to the Ginger Alliance CVS server, you can get the working version of Sablotron in the CVS module ga. The access rights can be obtained on request from the CVS admin.
Since version 0.50, Sablotron uses expat 1.95.1, available from SourceForge.
Sablotron is an open source project and all volunteers are most welcome! The documentation of the sources is still somewhat sparse but we will try to improve it. If you find the invitation to work on Sablotron with us interesting, please contact us. There is also a mailing list available, see www.gingerall.com.
The instruction set supported by this version of Sablotron is already sufficient for many transformation tasks (e.g. the task of formatting this document). On the other hand, a comparison of it to the XSLT specification [XSLT] shows that much is still to be done. The purpose of the following sections is to describe the varying degree of support for the elements of the XSLT language.
It may be helpful to refer to the syntax table in Appendix B of [XSLT]. The instructions/attributes that are not listed as unsupported should be implemented. The authors will appreciate being told about any omissions found in the following description.
For readability, I sometimes omit the xsl: prefix from the instruction names.
template, apply-templates, call-template
Fully implemented. xsl:sort is supported since release 0.50.
variable, param, with-param
Fully implemented. Top-level variables and parameters are read in the document order, so no forward references are resolved. This is a minor deviation from the spec.
element, attribute, text, comment, processing-instruction, attribute-set
xsl:attribute-set is not implemented. For the rest, name is the only recognized attribute (where applicable). Literal result elements work.
stylesheet, transform, output
For stylesheet and transform, the only recognized attribute is version. xsl:output should work (see below for notes on the encoding attribute). HTML indentation has been added in 0.60.
value-of, copy, copy-of
copy-of and value-of are fully implemented. copy is implemented except for the use-attribute-sets attribute.
namespace-alias
Namespaces should be processed correctly. The namespace-alias instruction is now supported (patch by Major).
sort
xsl:sort is implemented since 0.50. There are minor limitations:
lang attribute may only contain the values "en" or "cz".
case-order cannot be specified.
strip-space, preserve-space
Only the default whitespace stripping is done. That is, all whitespace-only text nodes in any stylesheet that do not appear inside an xsl:text are removed. The two instructions for whitespace stripping and preservation are unsupported.
include, import, apply-imports
Only xsl:include is implemented. Processing involving multiple documents works, but needs more testing, e.g. with respect to generate-id().
The output mechanism is much closer to the spec than in the versions prior to 0.4. The following issues remain for the html method:
<SCRIPT>
and
<STYLE>
Almost all features of XPath are fully implemented. This means there should be no problems with expressions of any kind.
One exception relates to axes. The 'following' and 'preceding' axes have not been implemented yet.
Another possible exception may be numbers; we have not yet thoroughly tested rounding, NaNs, infinity, etc.
Only a few functions from the standard function library remain unimplemented: id(), lang() (accepted but always returns true), key(), format-number() and unparsed-entity-uri().
As for the functions that are implemented, the following is a list of differences from the spec:
document() only accepts one argument, always getting the base URI from the stylesheet URI.
string-length() returns the byte length of the UTF-8 representation of the string. This differs from the length in characters whenever the string contains non-ASCII characters.
generate-id() might fail to generate unique identifiers when several input documents are present (giving the same id to nodes from different documents).
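To make the string-length() difference concrete, the following is a minimal sketch in plain C that contrasts the byte length of a UTF-8 string with its length in characters. The helper names utf8_char_count and utf8_extra_bytes are made up for this illustration and are not part of Sablotron; the sketch also assumes well-formed UTF-8 input.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not a Sablotron function): counts the characters
   (code points) in a null-terminated UTF-8 string by skipping
   continuation bytes, i.e. bytes of the form 10xxxxxx.
   Assumes the input is well-formed UTF-8. */
size_t utf8_char_count(const char *s)
{
    size_t count = 0;
    assert(s != NULL);
    for (; *s != '\0'; ++s) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            ++count;
    }
    return count;
}

/* Hypothetical helper: how many bytes the byte length exceeds the
   character count by. Zero for pure ASCII strings. */
size_t utf8_extra_bytes(const char *s)
{
    return strlen(s) - utf8_char_count(s);
}
```

For a pure ASCII string both lengths agree, so the deviation only shows up for text containing multi-byte characters.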
It is possible for the user to supply external handlers to Sablotron; the handler types are HLR_MESSAGE, HLR_SCHEME, HLR_SAX, HLR_MISC and HLR_ENC.
The handlers are set using SablotRegHandler(). For details concerning the interface of these handlers, consult the header files sablot.h and shandler.h.
In version 0.52, the encoding conversion capabilities of Sablotron have been much extended. The most important fact is the following: if you have the iconv library installed on your system, you can use any encoding it supports (that is, almost any encoding whatsoever) for both the input and the output documents. Iconv is available on most systems (it is a standard part of glibc2, for instance). There are implementations for Win32 as well.
If iconv is not available, the encoding may still be supported internally by Sablotron. At present, the list of such encodings is rather short: besides UTF-8, these are UTF-16, ASCII, iso-8859-1, iso-8859-2 and windows-1250 on input, none on output. However, we plan to implement a semi-independent lightweight conversion library for use on systems without iconv, extending the set of internally supported encodings considerably.
Lastly, the user has the option to implement a custom encoding conversion handler, which will be asked to perform any unsupported conversion. See the shandler.h header file for details.
The default input and output encoding is in all cases UTF-8.
In addition to the standard output methods (xml, html and text), it is possible to output xhtml. Documents output using this method obey the XHTML 1.0 rules (in particular, all empty elements are closed). To choose the method, use <xsl:output method='xhtml'/>. Please note that the name of this method may change, since the XSLT spec requires any processor-specific methods to have qualified names, say sab:xhtml. On the other hand, the name xhtml is considered in the XSLT 2.0 working draft.
Sablotron can handle two URI schemes natively: 'file' and 'arg' (see below). Moreover, it is possible to use the function SablotRegSchemeHandler to register an external scheme handler which will receive requests in all other schemes. See the documentation in sablot.h and shandler.h.
Relative URI references are resolved in conformance to RFC 2396. The base URI is well defined when the relative reference appears inside an XML document; when invoking sabcmd, the base URI is taken to correspond to the current working directory.
When specifying filenames, the following rules are in effect:
The standard streams are referred to as file://stdin etc.
To refer to a file by a path with a drive letter (such as C:\doc.xml), it is necessary to say file://c:/doc.xml.
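The drive-letter rule can be sketched as a small conversion routine. This is only an illustration under stated assumptions: win_path_to_file_uri is a hypothetical helper, not a Sablotron function, and lower-casing the drive letter is done merely to reproduce the form shown in the text.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of Sablotron): turns a Windows path such
   as "C:\doc.xml" into the form "file://c:/doc.xml".
   Returns 0 on success, nonzero if the output buffer is too small. */
int win_path_to_file_uri(const char *path, char *out, size_t out_size)
{
    static const char prefix[] = "file://";
    const size_t plen = sizeof prefix - 1;
    size_t i, len;

    assert(path != NULL && out != NULL);
    len = strlen(path);
    if (plen + len + 1 > out_size)
        return 1;

    memcpy(out, prefix, plen);
    /* Copy the path, replacing backslashes with forward slashes. */
    for (i = 0; i < len; ++i)
        out[plen + i] = (path[i] == '\\') ? '/' : path[i];
    out[plen + len] = '\0';

    /* Lower-case the drive letter, matching the example in the text
       (an assumption of this sketch, not a documented requirement). */
    if (len >= 2 && path[1] == ':')
        out[plen] = (char)tolower((unsigned char)path[0]);
    return 0;
}
```

The function follows the convention used by the library functions described later, where a nonzero return value signals an error.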
Sablotron introduces a URI scheme 'arg:' which enables one to use strings in named memory buffers. The buffer names can have a tree-like structure so that a relative reference from a document in a buffer can be resolved as pointing to another buffer.
For instance, if we invoke Sablotron specifying that a buffer named /mybuf/1 contains the string "<a>contents</a>", then the expression document('arg:/mybuf/1')/a has string-value "contents". If the document in arg:/mybuf/1 contained a relative URI reference "../theirbuf/2" then this would be resolved as pointing to "arg:/theirbuf/2".
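The resolution of "../theirbuf/2" against /mybuf/1 can be traced with a simplified path merge. The sketch below is not Sablotron's resolver and is not a full RFC 2396 implementation (it ignores queries, fragments and authorities, and drops trailing slashes); resolve_buffer_path is a hypothetical name, and it works on the path part only, without the "arg:" prefix.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: resolves the path reference 'ref' against the
   absolute buffer path 'base', handling "." and ".." segments.
   Returns 0 on success, nonzero if the result does not fit in 'out'. */
int resolve_buffer_path(const char *base, const char *ref,
                        char *out, size_t out_size)
{
    size_t len = 0;            /* length of the path collected in out */
    const char *p;

    assert(base != NULL && ref != NULL && out != NULL);
    assert(base[0] == '/' && out_size >= 2);

    if (ref[0] != '/') {
        /* Relative reference: start from the base minus its last segment. */
        len = (size_t)(strrchr(base, '/') - base);
        if (len + 1 > out_size)
            return 1;
        memcpy(out, base, len);
    }

    p = ref;
    while (*p != '\0') {
        size_t seg_len = strcspn(p, "/");
        if (seg_len == 2 && p[0] == '.' && p[1] == '.') {
            /* ".." removes the last segment collected so far, if any. */
            while (len > 0 && out[--len] != '/')
                ;
        } else if (seg_len > 0 && !(seg_len == 1 && p[0] == '.')) {
            /* Ordinary segment: append "/segment". */
            if (len + 1 + seg_len + 1 > out_size)
                return 1;
            out[len++] = '/';
            memcpy(out + len, p, seg_len);
            len += seg_len;
        }
        p += seg_len;
        if (*p == '/')
            ++p;
    }
    if (len == 0)
        out[len++] = '/';
    out[len] = '\0';
    return 0;
}
```

With base "/mybuf/1", the last segment "1" is dropped first, ".." then removes "mybuf", and the remaining segments are appended, which is how the reference ends up naming the buffer /theirbuf/2.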
By default, Sablotron writes error and warning messages to stderr, and does no logging. By a call to SablotSetLog(), you can specify the name of the log file to be used.
Besides, you can use SablotRegHandler() to override the default message handling. The handler you register will receive all messages in a structured form that's easy to process and filter. For details, see the documentation in sablot.h and shandler.h.
This section describes the functions exported from the Sablotron library. All of them have a return type of 'int' and return an error flag (nonzero signals an error). Errors are reported to the user by Sablotron itself.
We'll first describe the 'shortcuts' that do the whole processing in one call.
int SablotProcess(char *sheetURI, char *inputURI, char *resultURI,
char **params, char **arguments, char **resultArg);
This is the basic function. The first three of its arguments are the URIs of the XSLT stylesheet, the XML source and the resulting document, respectively. For some notes on specifying file names, see above.
params is an array of pointers to the names and contents of the top-level stylesheet parameters. Thus, params[0] is a pointer to the null-terminated name of the first parameter, params[1] points to the (null-terminated) contents of the first parameter. The following two array items do the same for the second parameter, etc. The whole array is terminated by a NULL pointer in place of the name. If no parameters are to be passed, you can specify NULL for params itself.
arguments is a similar array of named buffers to be passed to the stylesheet. (They can be referred to via the 'arg:' scheme, see above.) Again, the array is a sequence of (name, value) pairs terminated by NULL in place of a name. If no named buffers are to be passed, you can specify NULL for arguments itself.
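The layout shared by params and arguments can be illustrated without calling the library at all. The following sketch builds such an array and walks it; count_pairs, find_value and the sample data are made up for the illustration and do not appear in sablot.h.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Sample data (made-up values): two (name, value) pairs followed by a
   NULL in place of the third name, as the text above describes. */
static char *example_params[] = {
    "use_defaults", "1",
    "title",        "Report",
    NULL
};

/* Hypothetical helper: counts the (name, value) pairs in such an array.
   A NULL array means that nothing is passed. */
size_t count_pairs(char **pairs)
{
    size_t n = 0;
    if (pairs == NULL)
        return 0;
    while (pairs[2 * n] != NULL)
        ++n;
    return n;
}

/* Hypothetical helper: returns the value stored under 'name',
   or NULL if there is no such name. */
const char *find_value(char **pairs, const char *name)
{
    size_t i;
    assert(name != NULL);
    if (pairs == NULL)
        return NULL;
    for (i = 0; pairs[i] != NULL; i += 2) {
        if (strcmp(pairs[i], name) == 0)
            return pairs[i + 1];
    }
    return NULL;
}
```

Note that only the name slots are checked for NULL, so the array always holds an odd number of pointers: two per pair plus the terminator.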
resultArg enables one to access the resulting document in case the output went to a named buffer. In that situation, *resultArg points to the resulting null-terminated string, allocated by Sablotron. You can pass NULL for resultArg if the output is sure to go to a file.
Note: When you are done processing the string pointed to by *resultArg, free it using SablotFree() - never use free(). The latter is guaranteed to produce a segmentation fault under Linux.
int SablotProcessFiles(char *styleSheetName,
char *inputName,
char *resultName);
A wrapper for SablotProcess() working on files. The parameters are the null-terminated file names of the XSLT stylesheet, the XML input and the result, respectively. Sablotron opens these files itself and closes them after the processing is complete. Values like "file://stdin" are allowed.
int SablotProcessStrings(char *styleSheetStr, char *inputStr,
char **resultStr);
Another wrapper for SablotProcess(), this time for accessing named buffers (i.e. user-allocated memory blocks) only. Thus, the first parameter is a null-terminated string containing the whole stylesheet; the second parameter is a null-terminated string containing the XML input. Sablotron allocates the buffer for the resulting string and returns a pointer to it in *resultStr. Hence, invoking puts(*resultStr) after having called SablotProcessStrings sends the result to stdout. The buffer allocated must be freed by calling the function SablotFree described below.
The above shortcuts just call the basic, lower-level functions described below. Note that if you need to set options for logging etc., you may need to use the low-level functions.
A typical processing session may look like this:
SablotHandle p;
char *my_buf;
SablotCreateProcessor(&p);
SablotSetLog(p, ...);
/* ...set other instance-specific options here... */
SablotRunProcessor(p, ...);
SablotGetResultArg(p, "arg:/somename", &my_buf);
/* ...do something with my_buf... */
SablotFree(my_buf); /* the copy must be released with SablotFree, never free() */
/* can run the processor again if necessary */
SablotRunProcessor(p, ...);
SablotDestroyProcessor(p);
int SablotCreateProcessor(SablotHandle *processorPtr);
Creates an instance of Sablotron and returns a pointer to it in *processorPtr. This pointer is passed on all subsequent calls to this instance.
int SablotDestroyProcessor(SablotHandle processor_);
Destroys an instance of the processor, deallocating all the memory used up by it.
int SablotRunProcessor(SablotHandle processor_,
char *sheetURI,
char *inputURI,
char *resultURI,
char **params,
char **arguments);
Processes documents using the given processor instance and given params and args definitions. See SablotProcess().
int SablotGetResultArg(SablotHandle processor_,
char *argURI,
char **argValue);
Copies the result 'arg' buffer with the given URI, returning a pointer to the newly-allocated block in *argValue. If no such buffer exists, returns NULL in *argValue.
This function is necessary, because if the result document is output to memory, it would be lost when SablotDestroyProcessor() is called. When deallocating the copy obtained from SablotGetResultArg(), use SablotFree (never free()).
int SablotFreeResultArgs(SablotHandle processor_);
Removes the Sablotron-internal copies of the 'arg' buffers from the last Sablotron run. Normally, there should be no reason to call this function as it is called automatically on both SablotRunProcessor() and SablotDestroyProcessor().
int SablotFree(char *resultBuf);
This function frees a buffer allocated by Sablotron, such as the one returned by a previous call to SablotProcessStrings. Calling it with an invalid pointer will cause a crash.
int SablotRegHandler(
SablotHandle processor_,
HandlerType type,
void *handler,
void *userData);
Registers an external handler. type can be HLR_MESSAGE, HLR_SCHEME, HLR_SAX, HLR_MISC or HLR_ENC. handler points to the callback vector of the appropriate type. userData is a data item to be passed to all callbacks of this particular handler. For details, check the sablot.h and shandler.h header files.
int SablotUnregHandler(
SablotHandle processor_,
HandlerType type,
void *handler,
void *userData);
Unregisters the given external handler. For details, check the sablot.h and shandler.h header files.
int SablotSetLog(
SablotHandle processor_,
const char *logFilename,
int logLevel);
Sets the log filename. The logLevel parameter is currently not used. Pass NULL for logFilename to turn logging off (default).
The other functions published by sablot.h have been included for experimental reasons or for compatibility, and it is better not to use them.
int SablotClearError(SablotHandle processor_);
Clears the 'pending error' flag for this instance of Sablotron.
The implementation of the DOM interface brought the need to extend some of the functions described in the previous section. This extension enables the user to:
An object called situation is used to provide a persistent context for all calls to the DOM-related functions. Functions used to manipulate the situation are described in the following section.
Note: If not specified otherwise, all these functions return an error code. A positive value indicates an error.
int SablotCreateDocument(SablotSituation S,
SDOM_Document *D);
Creates an empty document. Typically followed by calls to DOM functions to populate the document.
int SablotDestroyDocument(SablotSituation S,
SDOM_Document D);
Destroys a document, freeing all the nodes it has created.
int SablotParse(SablotSituation S,
const char *uri, SDOM_Document *D);
Reads in a document from the given URI.
int SablotParseBuffer(SablotSituation S,
const char *buffer, SDOM_Document *D);
Reads in a document from the given in-memory buffer.
These functions have variants to be used if the document is to be interpreted as an XSLT stylesheet, namely SablotParseStylesheet and SablotParseStylesheetBuffer.
The following functions generalize SablotRunProcessor in that they make it possible to utilize an extra kind of a source document: a DOM tree.
int SablotRunProcessorGen(SablotSituation S,
void *processor_,
char *sheetURI,
char *inputURI,
char *resultURI);
A key ingredient of the extended interface. Only the URIs of the sources and of the result document are given to it. The rest of the information passed to SablotRunProcessor is conveyed through SablotAddArgBuffer, SablotAddArgTree and SablotAddParam. The scheme part of the stylesheet URI or the input URI may be "arg:", in which case they refer to a buffer or tree passed by these functions.
int SablotAddArgBuffer(SablotSituation S,
void *processor_,
const char *argName,
const char *bufferValue);
Creates a named buffer for the next processor run. The buffer's name and contents are passed as arguments. The name is interpreted relative to the 'arg:/' scheme.
int SablotAddArgTree(SablotSituation S,
void *processor_,
const char *argName,
SDOM_Document tree);
Associates the given document with a name for the next processor run. The document is not destroyed after the run is finished. The name is interpreted relative to the 'arg:/' scheme.
int SablotAddParam(SablotSituation S,
void *processor_,
const char *paramName,
const char *paramValue);
Adds a global stylesheet parameter for the next processor run.
At present, the situation object primarily holds information on any pending errors. A situation is created using
int SablotCreateSituation(SablotSituation *SP);
and destroyed by
int SablotDestroySituation(SablotSituation S);
To clear the pending error flag in a situation, use
int SablotClearSituation(SablotSituation S);
The following self-explanatory functions extract parts of the error information from the situation:
const char *SablotGetErrorURI(SablotSituation S);
int SablotGetErrorLine(SablotSituation S);
const char *SablotGetErrorMsg(SablotSituation S);
Starting with version 0.60, Sablotron implements a major subset of the DOM Level 1 Core Specification [DOM]. A brief description of the implemented interface follows; for more details, please refer to the header file named sdom.h.
All of the names related to the DOM interface start with SDOM_ (for Sablot DOM).
Major new types are SDOM_Document (a DOM tree) and SDOM_Node (a node of the tree). A document can also be used in place of a node. This reflects the fact that in the DOM spec, Document is a subclass of Node. When used in this way, the document represents its own root node (which is not the same as the 'root element').
Other types include:
SDOM_char: a DOM character type. Currently, this is just char. Note that the DOM spec requires that the DOM implementations work with UTF-16. Sablotron deviates from this by using UTF-8 instead. A separate set of functions taking UTF-16 strings will be provided.
SDOM_NodeType: a node type enum. Some of the values are SDOM_ELEMENT_NODE, SDOM_ATTRIBUTE_NODE and SDOM_TEXT_NODE. See sdom.h for the rest.
SDOM_NodeList: a node list returned by some of the functions.
SDOM_Exception: DOM exception codes enum, with values such as SDOM_NOT_FOUND_ERR or SDOM_INVALID_NODE_TYPE. See sdom.h for details.
The functions listed below are implemented more or less as defined in the DOM Level 1 Specification, with two exceptions: their names are prefixed with SDOM_ and the first argument is always a SablotSituation. All the functions return an SDOM_Exception.
createElement, createAttribute, createTextNode, createCDATASection, createComment, createProcessingInstruction
getNodeType, getNodeName, setNodeName, getNodeValue, setNodeValue
getParentNode, getFirstChild, getLastChild, getPreviousSibling, getNextSibling, getOwnerDocument
insertBefore, appendChild, removeChild, replaceChild
cloneNode
getAttribute, setAttribute, removeAttribute, getAttributeList
Several functions have been added:
disposeNode frees all memory used by the given node
cloneForeignNode clones a node from a different document
docToString serializes the document, returning the resulting string
xql performs an XPath query on the DOM tree, returning a list of the nodes satisfying it.
In addition, there are some functions used to manipulate the node lists returned by xql and getAttributeList. These include getNodeListLength, getNodeListItem and disposeNodeList.
Finally, there are functions to extract DOM exception-related information from the situation object, namely getExceptionCode, getExceptionMessage and getExceptionDetails.
Sablotron comes with a command-line interface to the shared library, which is a program named sabcmd. At present, sabcmd is invoked as follows:
sabcmd [options] stylesheet [input [result]] [assignments]
The arguments are the URIs of the XSLT stylesheet, the XML input document, and the resulting document, respectively. The default for input is file://stdin (meaning plain old stdin); result defaults to file://stdout. Filenames have to include the extension (if any).
You can display the list of available options by typing sabcmd --help. Among the more useful ones are --log-file (for setting the log file) and --measure (measures and outputs the total processing time).
The rules for filenames are the same as with SablotProcess().
assignments is a series of definitions of the form:
name1=value1 name2=value2 ...
assigning values to top-level stylesheet parameters and to named buffers. These two cases are distinguished by a leading '$' in the name of a stylesheet parameter. The names of the buffers do not start with "arg:". They may start with a slash; if they don't, the slash is prepended.
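The two naming rules can be written down as a small classifier. This is a sketch of the rules as stated in the text, not the actual sabcmd code: classify_assignment and the AssignKind type are hypothetical, the '$' is dropped from the parameter name on the assumption that the stylesheet sees the bare name, and treating malformed input as ASSIGN_INVALID is likewise an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

typedef enum { ASSIGN_PARAM, ASSIGN_BUFFER, ASSIGN_INVALID } AssignKind;

/* Hypothetical helper: splits "name=value" according to the rules above.
   - "$name=value" is a stylesheet parameter; the '$' is dropped.
   - anything else is a named buffer; a '/' is prepended if missing.
   On success the normalized name is written to name_out and *value
   points at the value inside 'arg'. */
AssignKind classify_assignment(const char *arg, char *name_out,
                               size_t name_size, const char **value)
{
    const char *eq;
    const char *name = arg;
    const char *lead = "";
    AssignKind kind = ASSIGN_BUFFER;
    int n;

    assert(arg != NULL && name_out != NULL && value != NULL);
    eq = strchr(arg, '=');
    if (eq == NULL)
        return ASSIGN_INVALID;

    if (arg[0] == '$') {
        kind = ASSIGN_PARAM;
        name = arg + 1;            /* skip the '$' marker */
    } else if (arg[0] != '/') {
        lead = "/";                /* buffer names always start with '/' */
    }
    if (eq == name)
        return ASSIGN_INVALID;     /* empty name */

    n = snprintf(name_out, name_size, "%s%.*s",
                 lead, (int)(eq - name), name);
    if (n < 0 || (size_t)n >= name_size)
        return ASSIGN_INVALID;     /* name does not fit */

    *value = eq + 1;
    return kind;
}
```

The resulting buffer name is the path part only, so the buffer from an assignment such as the_input=... is addressed as arg:/the_input.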
Note: In most cases, it will be necessary to quote the individual assignments. Whether to use single or double quotes depends on the shell used: single quotes work in bash (where double quotes would let the shell expand a leading $name before sabcmd sees it), double quotes work on Windows.
If the result URI refers to a named buffer, the output would normally remain buried in memory. Sabcmd dumps the buffer to standard output instead.
To sum up and give an example, the following would be a valid invocation of sabcmd:
sabcmd sheet.xsl arg:/the_input "the_input=<a/>" "$use_defaults=1"
This processes the document passed in the buffer named the_input, using a stylesheet found in file "sheet.xsl" in the working directory. We assign 1 to the top-level parameter called "use_defaults". The output goes to stdout by default.
(c) 2000 Ginger Alliance s.r.o.