\section{\module{codecs} ---
         Codec registry and base classes}

\declaremodule{standard}{codecs}
\modulesynopsis{Encode and decode data and streams.}
\moduleauthor{Marc-Andre Lemburg}{mal@lemburg.com}
\sectionauthor{Marc-Andre Lemburg}{mal@lemburg.com}

\index{Unicode}
\index{Codecs}
\indexii{Codecs}{encode}
\indexii{Codecs}{decode}
\index{streams}
\indexii{stackable}{streams}

This module defines base classes for standard Python codecs (encoders
and decoders) and provides access to the internal Python codec
registry which manages the codec lookup process.

It defines the following functions:

\begin{funcdesc}{register}{search_function}
Register a codec search function. Search functions are expected to
take one argument, the encoding name in all lower case letters, and
return a tuple of functions \code{(\var{encoder}, \var{decoder},
\var{stream_reader}, \var{stream_writer})} that meet the following
requirements:

\var{encoder} and \var{decoder}: These must be functions or methods
which have the same interface as the \method{encode()} and
\method{decode()} methods of Codec instances (see Codec Interface).
The functions/methods are expected to work in a stateless mode.

\var{stream_reader} and \var{stream_writer}: These must be
factory functions providing the following interface:

\code{factory(\var{stream}, \var{errors}='strict')}

The factory functions must return objects providing the interfaces
defined by the base classes \class{StreamReader} and
\class{StreamWriter}, respectively. Stream codecs can maintain
state.

Possible values for \var{errors} are \code{'strict'} (raise an exception
in case of an encoding error), \code{'replace'} (replace malformed
data with a suitable replacement marker, such as \character{?}) and
\code{'ignore'} (ignore malformed data and continue without further
notice).

If a search function cannot find a given encoding, it should
return \code{None}.
\end{funcdesc}
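
The registration contract described above can be sketched as follows.
This is a minimal illustration, assuming a modern Python 3 interpreter,
where search functions commonly return a \class{codecs.CodecInfo}
object (a tuple subclass holding the same four entries). The encoding
name \code{demoascii} and the function \function{demo_search()} are
hypothetical and exist only for this example; the codec simply reuses
the functions of the built-in ASCII codec.

```python
import codecs

DEMO_NAME = "demoascii"  # hypothetical encoding name, for illustration only


def demo_search(name):
    """Return a codec tuple for DEMO_NAME, or None for any other name."""
    if name != DEMO_NAME:
        return None  # not ours: let other search functions try
    ascii_codec = codecs.lookup("ascii")
    # CodecInfo is a tuple subclass: (encode, decode, streamreader, streamwriter)
    return codecs.CodecInfo(
        name=DEMO_NAME,
        encode=ascii_codec.encode,
        decode=ascii_codec.decode,
        streamreader=ascii_codec.streamreader,
        streamwriter=ascii_codec.streamwriter,
    )


codecs.register(demo_search)

# The registered codec can now be found under its new name.
encoder, decoder, stream_reader, stream_writer = codecs.lookup(DEMO_NAME)
encoded, consumed = encoder("abc")
decoded, _ = decoder(encoded)
```

Returning \code{None} for unknown names matters because the registry
consults every registered search function in turn.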
\begin{funcdesc}{lookup}{encoding}
Look up a codec tuple in the Python codec registry and return the
function tuple as defined above.

Encodings are first looked up in the registry's cache. If not found,
the list of registered search functions is scanned. If no codec tuple
is found, a \exception{LookupError} is raised. Otherwise, the codec
tuple is stored in the cache and returned to the caller.
\end{funcdesc}
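
A short sketch of a lookup, assuming a modern Python 3 interpreter in
which the encoder accepts a string and returns a pair of encoded bytes
and the number of characters consumed. It also illustrates the three
\var{errors} values on the stateless encoder. The encoding name
\code{no-such-encoding-demo} is a deliberately invalid placeholder.

```python
import codecs

# The lookup result unpacks into the four entries of the codec tuple.
encoder, decoder, stream_reader, stream_writer = codecs.lookup("ascii")

text = "na\u00efve"  # contains one character outside ASCII

# 'replace' substitutes a marker; 'ignore' silently drops the character.
replaced, _ = encoder(text, "replace")
ignored, _ = encoder(text, "ignore")

# 'strict' raises; UnicodeEncodeError is a subclass of ValueError.
try:
    encoder(text, "strict")
    strict_failed = False
except ValueError:
    strict_failed = True

# An encoding unknown to every search function raises LookupError.
try:
    codecs.lookup("no-such-encoding-demo")
    lookup_failed = False
except LookupError:
    lookup_failed = True
```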

To simplify working with encoded files or streams, the module
also defines these utility functions:

\begin{funcdesc}{open}{filename, mode\optional{, encoding\optional{,
                       errors\optional{, buffering}}}}
Open an encoded file using the given \var{mode} and return
a wrapped version providing transparent encoding/decoding.

\strong{Note:} The wrapped version will only accept the object format
defined by the codecs, i.e.\ Unicode objects for most built-in
codecs. Output is also codec-dependent and will usually be Unicode as
well.

\var{encoding} specifies the encoding which is to be used for
the file.

\var{errors} may be given to define the error handling. It defaults
to \code{'strict'}, which causes a \exception{ValueError} to be raised
in case an encoding error occurs.

\var{buffering} has the same meaning as for the built-in
\function{open()} function. It defaults to line buffered.
\end{funcdesc}
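
The following sketch writes and re-reads a UTF-8 file through the
wrapper. It assumes a modern Python 3 interpreter; recent releases may
emit a deprecation warning for \function{codecs.open()}, which the
example suppresses so that it runs quietly. The file name
\file{demo.txt} is arbitrary.

```python
import codecs
import os
import tempfile
import warnings

text = "gr\u00fc\u00df dich\n"

with tempfile.TemporaryDirectory() as tmpdir, warnings.catch_warnings():
    # Assumption: newer Pythons may warn that codecs.open() is deprecated.
    warnings.simplefilter("ignore", DeprecationWarning)
    path = os.path.join(tmpdir, "demo.txt")

    # Unicode text goes in; the wrapper encodes it on the way to disk.
    with codecs.open(path, "w", encoding="utf-8") as f:
        f.write(text)

    # Reading decodes the stored bytes back into Unicode text.
    with codecs.open(path, "r", encoding="utf-8") as f:
        round_tripped = f.read()

    # The bytes on disk are the UTF-8 encoding of the text.
    with open(path, "rb") as f:
        raw = f.read()
```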
\begin{funcdesc}{EncodedFile}{file, input\optional{,
                              output\optional{, errors}}}
Return a wrapped version of \var{file} which provides transparent
encoding translation.

Strings written to the wrapped file are interpreted according to the
given \var{input} encoding and then written to the original file as
strings using the \var{output} encoding. The intermediate encoding will
usually be Unicode but depends on the specified codecs.

If \var{output} is not given, it defaults to \var{input}.

\var{errors} may be given to define the error handling. It defaults to
\code{'strict'}, which causes \exception{ValueError} to be raised in case
an encoding error occurs.
\end{funcdesc}
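
As an illustration of the translation, the sketch below wraps an
in-memory byte stream so that Latin-1 data written to the wrapper
arrives in the underlying stream as UTF-8. It assumes a modern
Python 3 interpreter, where the encoded data are \class{bytes} objects;
the two encodings are passed positionally because the parameter names
may differ between Python versions.

```python
import codecs
import io

raw = io.BytesIO()  # stands in for the original file

# Data written to the wrapper is Latin-1; the original file stores UTF-8.
wrapped = codecs.EncodedFile(raw, "latin-1", "utf-8")
wrapped.write(b"caf\xe9")  # "cafe" with an acute accent, in Latin-1

stored = raw.getvalue()  # what actually reached the original file

# Reading goes the other way: UTF-8 from the file, Latin-1 to the caller.
raw.seek(0)
read_back = wrapped.read()
```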

...XXX document codec base classes...

The module also provides the following constants which are useful
for reading and writing to platform-dependent files:

\begin{datadesc}{BOM}
\dataline{BOM_BE}
\dataline{BOM_LE}
\dataline{BOM32_BE}
\dataline{BOM32_LE}
\dataline{BOM64_BE}
\dataline{BOM64_LE}
These constants define the byte order marks (BOM) used in data
streams to indicate the byte order used in the stream or file.
\constant{BOM} is either \constant{BOM_BE} or \constant{BOM_LE}
depending on the platform's native byte order, while the others
represent big endian (\samp{_BE} suffix) and little endian
(\samp{_LE} suffix) byte order using 32-bit and 64-bit encodings.
\end{datadesc}
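
A small sketch of how these constants might be used, assuming a modern
Python 3 interpreter in which they are \class{bytes} objects. The
helper \function{detect_byte_order()} is hypothetical and only inspects
the two-byte marks \constant{BOM_BE} and \constant{BOM_LE}.

```python
import codecs
import sys

# BOM matches whichever of BOM_BE / BOM_LE fits the native byte order.
native_bom = codecs.BOM
expected = codecs.BOM_LE if sys.byteorder == "little" else codecs.BOM_BE


def detect_byte_order(data):
    """Hypothetical helper: return 'big', 'little', or None from a leading BOM."""
    if data.startswith(codecs.BOM_BE):
        return "big"
    if data.startswith(codecs.BOM_LE):
        return "little"
    return None  # no two-byte mark present


sample = codecs.BOM_LE + b"h\x00i\x00"  # little-endian data with its mark
order = detect_byte_order(sample)
```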