mirror of https://github.com/python/cpython.git
Marc-Andre Lemburg: Python Unicode integration proposal, version 1.2.
parent
e141fd84e9
commit
9ed0d1ef18
|
@ -0,0 +1,885 @@
|
|||
=============================================================================
|
||||
Python Unicode Integration Proposal Version: 1.2
|
||||
-----------------------------------------------------------------------------
|
||||
|
||||
|
||||
Introduction:
|
||||
-------------
|
||||
|
||||
The idea of this proposal is to add native Unicode 3.0 support to
|
||||
Python in a way that makes use of Unicode strings as simple as
|
||||
possible without introducing too many pitfalls along the way.
|
||||
|
||||
Since this goal is not easy to achieve -- strings being one of the
|
||||
most fundamental objects in Python -- we expect this proposal to
|
||||
undergo some significant refinements.
|
||||
|
||||
Note that the current version of this proposal is still a bit unsorted
|
||||
due to the many different aspects of the Unicode-Python integration.
|
||||
|
||||
The latest version of this document is always available at:
|
||||
|
||||
http://starship.skyport.net/~lemburg/unicode-proposal.txt
|
||||
|
||||
Older versions are available as:
|
||||
|
||||
http://starship.skyport.net/~lemburg/unicode-proposal-X.X.txt
|
||||
|
||||
|
||||
Conventions:
|
||||
------------
|
||||
|
||||
· In examples we use u = Unicode object and s = Python string
|
||||
|
||||
· 'XXX' markings indicate points of discussion (PODs)
|
||||
|
||||
|
||||
General Remarks:
|
||||
----------------
|
||||
|
||||
· Unicode encoding names should be lower case on output and
|
||||
case-insensitive on input (they will be converted to lower case
|
||||
by all APIs taking an encoding name as input).
|
||||
|
||||
Encoding names should follow the naming conventions used by the
|
||||
Unicode Consortium: spaces are converted to hyphens, e.g. 'utf 16' is
|
||||
written as 'utf-16'.
|
||||
|
||||
Codec modules should use the same names, but with hyphens converted
|
||||
to underscores, e.g. utf_8, utf_16, iso_8859_1.
|
||||
|
||||
· The <default encoding> should be the widely used 'utf-8' format. This
|
||||
is very close to the standard 7-bit ASCII format and thus resembles the
|
||||
standard used in programming nowadays in most aspects.
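
The naming rules above can be sketched as follows, written for present-day Python 3. The helper names canonical_name and module_name are hypothetical and not part of the proposal; only the spaces-to-hyphens and hyphens-to-underscores rules come from the text.

```python
def canonical_name(encoding):
    """Lower-case an encoding name and convert spaces to hyphens."""
    return encoding.strip().lower().replace(" ", "-")


def module_name(encoding):
    """Derive the codec module name: hyphens become underscores."""
    return canonical_name(encoding).replace("-", "_")


print(canonical_name("UTF 16"))    # utf-16
print(module_name("ISO-8859-1"))   # iso_8859_1

# UTF-8 and 7-bit ASCII agree on every ASCII character.
ascii_text = "plain ASCII text"
print(ascii_text.encode("utf-8") == ascii_text.encode("ascii"))   # True
```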
|
||||
|
||||
|
||||
Unicode Constructors:
|
||||
---------------------
|
||||
|
||||
Python should provide a built-in constructor for Unicode strings which
|
||||
is available through __builtins__:
|
||||
|
||||
u = unicode(encoded_string[,encoding=<default encoding>][,errors="strict"])
|
||||
|
||||
u = u'<unicode-escape encoded Python string>'
|
||||
|
||||
u = ur'<raw-unicode-escape encoded Python string>'
|
||||
|
||||
With the 'unicode-escape' encoding being defined as:
|
||||
|
||||
· all non-escape characters represent themselves as Unicode ordinal
|
||||
(e.g. 'a' -> U+0061).
|
||||
|
||||
· all existing defined Python escape sequences are interpreted as
|
||||
Unicode ordinals; note that \xXXXX can represent all Unicode
|
||||
ordinals, and \OOO (octal) can represent Unicode ordinals up to U+01FF.
|
||||
|
||||
· a new escape sequence, \uXXXX, represents U+XXXX; it is a syntax
|
||||
error to have fewer than 4 digits after \u.
|
||||
|
||||
For an explanation of possible values for errors see the Codec section
|
||||
below.
|
||||
|
||||
Examples:
|
||||
|
||||
u'abc' -> U+0061 U+0062 U+0063
|
||||
u'\u1234' -> U+1234
|
||||
u'abc\u1234\n' -> U+0061 U+0062 U+0063 U+1234 U+000A
|
||||
|
||||
The 'raw-unicode-escape' encoding is defined as follows:
|
||||
|
||||
· \uXXXX sequences represent the U+XXXX Unicode character if and
|
||||
only if the number of leading backslashes is odd
|
||||
|
||||
· all other characters represent themselves as Unicode ordinal
|
||||
(e.g. 'b' -> U+0062)
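
Both escape encodings can be tried out with the codecs that ship with present-day Python 3, assuming its 'unicode_escape' and 'raw_unicode_escape' codecs behave as defined above. The helper ordinals is hypothetical and only formats the result.

```python
def ordinals(text):
    """Format each character of text as a U+XXXX ordinal."""
    return " ".join("U+%04X" % ord(ch) for ch in text)


# 'unicode-escape': \uXXXX and classic escapes such as \n are interpreted.
escaped = b"abc\\u1234\\n".decode("unicode_escape")
print(ordinals(escaped))    # U+0061 U+0062 U+0063 U+1234 U+000A

# 'raw-unicode-escape': only \uXXXX is interpreted, \n stays two characters.
raw = b"\\u1234\\n".decode("raw_unicode_escape")
print(ordinals(raw))        # U+1234 U+005C U+006E

# An even number of leading backslashes leaves \uXXXX untouched.
untouched = b"\\\\u1234".decode("raw_unicode_escape")
print(len(untouched))       # 7
```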
|
||||
|
||||
|
||||
Note that you should provide some hint to the encoding you used to
|
||||
write your programs as a pragma line in one of the first few comment lines
|
||||
of the source file (e.g. '# source file encoding: latin-1'). If you
|
||||
only use 7-bit ASCII then everything is fine and no such notice is
|
||||
needed, but if you include Latin-1 characters not defined in ASCII, it
|
||||
may well be worthwhile including a hint since people in other
|
||||
countries will want to be able to read your source strings too.
|
||||
|
||||
|
||||
Unicode Type Object:
|
||||
--------------------
|
||||
|
||||
Unicode objects should have the type UnicodeType with type name
|
||||
'unicode', made available through the standard types module.
|
||||
|
||||
|
||||
Unicode Output:
|
||||
---------------
|
||||
|
||||
Unicode objects have a method .encode([encoding=<default encoding>])
|
||||
which returns a Python string encoding the Unicode string using the
|
||||
given scheme (see Codecs).
|
||||
|
||||
print u := print u.encode() # using the <default encoding>
|
||||
|
||||
str(u) := u.encode() # using the <default encoding>
|
||||
|
||||
repr(u) := "u%s" % repr(u.encode('unicode-escape'))
|
||||
|
||||
Also see Internal Argument Parsing and Buffer Interface for details on
|
||||
how other APIs written in C will treat Unicode objects.
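
As an illustration of .encode(), here is a minimal sketch in present-day Python 3, where str plays the role of the Unicode object and bytes the role of the Python string. It assumes the <default encoding> is UTF-8, as proposed in General Remarks.

```python
u = "abc\u1234"

# With no argument, .encode() uses UTF-8.
default_encoded = u.encode()
print(default_encoded)              # b'abc\xe1\x88\xb4'

# The 'unicode-escape' form used by the repr(u) definition above.
escape_encoded = u.encode("unicode_escape")
print(escape_encoded)               # b'abc\\u1234'
```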
|
||||
|
||||
|
||||
Unicode Ordinals:
|
||||
-----------------
|
||||
|
||||
Since Unicode 3.0 has a 32-bit ordinal character set, the implementation
|
||||
should provide 32-bit aware ordinal conversion APIs:
|
||||
|
||||
ord(u[:1]) (this is the standard ord() extended to work with Unicode
|
||||
objects)
|
||||
--> Unicode ordinal number (32-bit)
|
||||
|
||||
unichr(i)
|
||||
--> Unicode object for character i (provided it is 32-bit);
|
||||
ValueError otherwise
|
||||
|
||||
Both APIs should go into __builtins__ just like their string
|
||||
counterparts ord() and chr().
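
A short sketch of the two conversions, assuming present-day Python 3, where chr() has taken over the role of the proposed unichr():

```python
code = ord("\u1234")
print(hex(code))                    # 0x1234
print(chr(code) == "\u1234")        # True

# Values outside the Unicode range are rejected with ValueError.
try:
    chr(0x110000)
    rejected = False
except ValueError:
    rejected = True
print(rejected)                     # True
```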
|
||||
|
||||
Note that Unicode provides space for private encodings. Usage of these
|
||||
can cause different output representations on different machines. This
|
||||
problem is not a Python or Unicode problem, but a machine setup and
|
||||
maintenance one.
|
||||
|
||||
|
||||
Comparison & Hash Value:
|
||||
------------------------
|
||||
|
||||
Unicode objects should compare equal to other objects after these
|
||||
other objects have been coerced to Unicode. For strings this means
|
||||
that they are interpreted as Unicode string using the <default
|
||||
encoding>.
|
||||
|
||||
For the same reason, Unicode objects should return the same hash value
|
||||
as their UTF-8 equivalent strings.
|
||||
|
||||
Coercion:
|
||||
---------
|
||||
|
||||
Using Python strings and Unicode objects to form new objects should
|
||||
always coerce to the more precise format, i.e. Unicode objects.
|
||||
|
||||
u + s := u + unicode(s)
|
||||
|
||||
s + u := unicode(s) + u
|
||||
|
||||
All string methods should delegate the call to an equivalent Unicode
|
||||
object method call by converting all involved strings to Unicode and
|
||||
then applying the arguments to the Unicode method of the same name,
|
||||
e.g.
|
||||
|
||||
string.join((s,u),sep) := (s + sep) + u
|
||||
|
||||
sep.join((s,u)) := (s + sep) + u
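
Present-day Python 3 no longer coerces implicitly, so the rules above can only be emulated there. The following sketch makes the coercion explicit; to_unicode, concat and join are hypothetical helper names, and bytes stands in for the Python string s.

```python
DEFAULT_ENCODING = "utf-8"    # the proposal's <default encoding>


def to_unicode(obj, encoding=DEFAULT_ENCODING):
    """Return obj as str, decoding bytes with the default encoding."""
    if isinstance(obj, bytes):
        return obj.decode(encoding)
    return obj


def concat(left, right):
    """Emulates u + s := u + unicode(s) and s + u := unicode(s) + u."""
    return to_unicode(left) + to_unicode(right)


def join(sep, items):
    """Emulates sep.join((s,u)) with every operand coerced to Unicode."""
    return to_unicode(sep).join(to_unicode(item) for item in items)


print(concat("\u1234", b"abc") == "\u1234abc")   # True
print(join(b", ", (b"s", "u")))                  # s, u
```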
|
||||
|
||||
For a discussion of %-formatting w/r to Unicode objects, see
|
||||
Formatting Markers.
|
||||
|
||||
|
||||
Exceptions:
|
||||
-----------
|
||||
|
||||
UnicodeError is defined in the exceptions module as a subclass of
|
||||
ValueError. It is available at the C level via PyExc_UnicodeError.
|
||||
All exceptions related to Unicode encoding/decoding should be
|
||||
subclasses of UnicodeError.
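
The practical effect of this hierarchy is that a handler for ValueError also catches every Unicode conversion error. A minimal sketch, assuming present-day Python 3 (decode_or_none is a hypothetical helper):

```python
print(issubclass(UnicodeError, ValueError))           # True
print(issubclass(UnicodeDecodeError, UnicodeError))   # True


def decode_or_none(data, encoding="ascii"):
    """Decode data, returning None when the bytes are not valid."""
    try:
        return data.decode(encoding)
    except ValueError:    # also catches UnicodeError and its subclasses
        return None


print(decode_or_none(b"abc"))     # abc
print(decode_or_none(b"\xff"))    # None
```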
|
||||
|
||||
|
||||
Codecs (Coder/Decoders) Lookup:
|
||||
-------------------------------
|
||||
|
||||
A Codec (see Codec Interface Definition) search registry should be
|
||||
implemented by a module "codecs":
|
||||
|
||||
codecs.register(search_function)
|
||||
|
||||
Search functions are expected to take one argument, the encoding name
|
||||
in all lower case letters, and return a tuple of functions (encoder,
|
||||
decoder, stream_reader, stream_writer) taking the following arguments:
|
||||
|
||||
encoder and decoder:
|
||||
These must be functions or methods which have the same
|
||||
interface as the .encode/.decode methods of Codec instances
|
||||
(see Codec Interface). The functions/methods are expected to
|
||||
work in a stateless mode.
|
||||
|
||||
stream_reader and stream_writer:
|
||||
These need to be factory functions with the following
|
||||
interface:
|
||||
|
||||
factory(stream,errors='strict')
|
||||
|
||||
The factory functions must return objects providing
|
||||
the interfaces defined by StreamWriter/StreamReader resp.
|
||||
(see Codec Interface). Stream codecs can maintain state.
|
||||
|
||||
Possible values for errors are defined in the Codec
|
||||
section below.
|
||||
|
||||
In case a search function cannot find a given encoding, it should
|
||||
return None.
|
||||
|
||||
Aliasing support for encodings is left to the search functions
|
||||
to implement.
|
||||
|
||||
The codecs module will maintain an encoding cache for performance
|
||||
reasons. Encodings are first looked up in the cache. If not found, the
|
||||
list of registered search functions is scanned. If no codecs tuple is
|
||||
found, a LookupError is raised. Otherwise, the codecs tuple is stored
|
||||
in the cache and returned to the caller.
|
||||
|
||||
To query the Codec instance the following API should be used:
|
||||
|
||||
codecs.lookup(encoding)
|
||||
|
||||
This will either return the found codecs tuple or raise a LookupError.
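
The registry can be exercised with the codecs module of present-day Python 3. This is a sketch under assumptions: the encoding name 'demolatin' and the function demo_search are made up for illustration, and Python 3 expects the tuple to be wrapped in a codecs.CodecInfo object, which is a detail not found in this proposal.

```python
import codecs


def demo_search(name):
    """Hypothetical search function: serve 'demolatin' as Latin-1."""
    if name != "demolatin":
        return None                   # let other search functions try
    base = codecs.lookup("latin-1")
    return codecs.CodecInfo(
        name="demolatin",
        encode=base.encode,
        decode=base.decode,
        streamreader=base.streamreader,
        streamwriter=base.streamwriter,
    )


codecs.register(demo_search)

info = codecs.lookup("DemoLatin")     # the name arrives lower-cased
encoded, consumed = info.encode("\u00e9")
print(encoded, consumed)              # b'\xe9' 1

try:
    codecs.lookup("no-such-encoding-xyz")
    found = True
except LookupError:
    found = False
print(found)                          # False
```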
|
||||
|
||||
|
||||
Standard Codecs:
|
||||
----------------
|
||||
|
||||
Standard codecs should live inside an encodings/ package directory in the
|
||||
Standard Python Code Library. The __init__.py file of that directory should
|
||||
include a Codec Lookup compatible search function implementing a lazy module
|
||||
based codec lookup.
|
||||
|
||||
Python should provide a few standard codecs for the most relevant
|
||||
encodings, e.g.
|
||||
|
||||
'utf-8': 8-bit variable length encoding
|
||||
'utf-16': 16-bit variable length encoding (little/big endian)
|
||||
'utf-16-le': utf-16 but explicitly little endian
|
||||
'utf-16-be': utf-16 but explicitly big endian
|
||||
'ascii': 7-bit ASCII codepage
|
||||
'iso-8859-1': ISO 8859-1 (Latin 1) codepage
|
||||
'unicode-escape': See Unicode Constructors for a definition
|
||||
'raw-unicode-escape': See Unicode Constructors for a definition
|
||||
'native': Dump of the Internal Format used by Python
|
||||
|
||||
Common aliases should also be provided per default, e.g. 'latin-1'
|
||||
for 'iso-8859-1'.
|
||||
|
||||
Note: 'utf-16' should be implemented by using and requiring byte order
|
||||
marks (BOM) for file input/output.
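
To make the differences between these codecs concrete, the sketch below encodes one string with the codecs of present-day Python 3, which are assumed here to match the list above. The byte values are shown in hex.

```python
import codecs

text = "a\u1234"

for name in ("utf-8", "utf-16-le", "utf-16-be"):
    print(name, text.encode(name).hex())
# utf-8 61e188b4
# utf-16-le 61003412
# utf-16-be 00611234

# Plain 'utf-16' output starts with a byte order mark ...
with_bom = text.encode("utf-16")
print(with_bom[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE))   # True

# ... and the decoder uses the mark to pick the byte order.
big_endian = codecs.BOM_UTF16_BE + text.encode("utf-16-be")
print(big_endian.decode("utf-16") == text)                          # True

# Aliases resolve to the same codec.
same_codec = codecs.lookup("latin-1").name == codecs.lookup("iso-8859-1").name
print(same_codec)                                                   # True
```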
|
||||
|
||||
All other encodings such as the CJK ones to support Asian scripts
|
||||
should be implemented in separate packages which do not get included
|
||||
in the core Python distribution and are not a part of this proposal.
|
||||
|
||||
|
||||
Codecs Interface Definition:
|
||||
----------------------------
|
||||
|
||||
The following base classes should be defined in the module
|
||||
"codecs". They provide not only templates for use by encoding module
|
||||
implementors, but also define the interface which is expected by the
|
||||
Unicode implementation.
|
||||
|
||||
Note that the Codec Interface defined here is well suited for a
|
||||
larger range of applications. The Unicode implementation expects
|
||||
Unicode objects on input for .encode() and .write() and character
|
||||
buffer compatible objects on input for .decode(). Output of .encode()
|
||||
and .read() should be a Python string and .decode() must return a
|
||||
Unicode object.
|
||||
|
||||
First, we have the stateless encoders/decoders. These do not work in
|
||||
chunks as the stream codecs (see below) do, because all components are
|
||||
expected to be available in memory.
|
||||
|
||||
    class Codec:

        """ Defines the interface for stateless encoders/decoders.

            The .encode()/.decode() methods may implement different error
            handling schemes by providing the errors argument. These
            string values are defined:

             'strict'  - raise a ValueError (or a subclass)
             'ignore'  - ignore the character and continue with the next
             'replace' - replace with a suitable replacement character;
                         Python will use the official U+FFFD REPLACEMENT
                         CHARACTER for the builtin Unicode codecs.

        """
        def encode(self,input,errors='strict'):

            """ Encodes the object input and returns a tuple (output
                object, length consumed).

                errors defines the error handling to apply. It defaults to
                'strict' handling.

                The method may not store state in the Codec instance. Use
                StreamCodec for codecs which have to keep state in order to
                make encoding/decoding efficient.

            """
            ...

        def decode(self,input,errors='strict'):

            """ Decodes the object input and returns a tuple (output
                object, length consumed).

                input must be an object which provides the bf_getreadbuf
                buffer slot. Python strings, buffer objects and memory
                mapped files are examples of objects providing this slot.

                errors defines the error handling to apply. It defaults to
                'strict' handling.

                The method may not store state in the Codec instance. Use
                StreamCodec for codecs which have to keep state in order to
                make encoding/decoding efficient.

            """
            ...
|
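
As an illustration of the stateless interface, here is a sketch of a small codec for present-day Python 3. The class HexCodec and its encoding scheme (upper-case hex of the UTF-8 bytes) are made up for this example; only the (output object, length consumed) return convention is taken from the interface above.

```python
import codecs


class HexCodec(codecs.Codec):
    """Hypothetical stateless codec: text <-> hex of its UTF-8 bytes."""

    def encode(self, input, errors="strict"):
        data = input.encode("utf-8", errors)
        return data.hex().upper().encode("ascii"), len(input)

    def decode(self, input, errors="strict"):
        raw = bytes(input)            # accept any buffer-like object
        text = bytes.fromhex(raw.decode("ascii")).decode("utf-8", errors)
        return text, len(raw)


codec = HexCodec()
print(codec.encode("\u00e9"))               # (b'C3A9', 1)
print(codec.decode(b"C3A9"))                # ('é', 4)
print(codec.decode(memoryview(b"41")))      # ('A', 2)
```

Because neither method stores anything on the instance, a single HexCodec object can be shared freely between callers.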
||||
|
||||
StreamWriter and StreamReader define the interface for stateful
|
||||
encoders/decoders which work on streams. These allow processing of the
|
||||
data in chunks to efficiently use memory. If you have large strings in
|
||||
memory, you may want to wrap them with cStringIO objects and then use
|
||||
these codecs on them to be able to do chunk processing as well,
|
||||
e.g. to provide progress information to the user.
|
||||
|
||||
    class StreamWriter(Codec):

        def __init__(self,stream,errors='strict'):

            """ Creates a StreamWriter instance.

                stream must be a file-like object open for writing
                (binary) data.

                The StreamWriter may implement different error handling
                schemes by providing the errors keyword argument. These
                parameters are defined:

                 'strict'  - raise a ValueError (or a subclass)
                 'ignore'  - ignore the character and continue with the next
                 'replace' - replace with a suitable replacement character

            """
            self.stream = stream
            self.errors = errors

        def write(self,object):

            """ Writes the object's contents encoded to self.stream.
            """
            data, consumed = self.encode(object,self.errors)
            self.stream.write(data)

        def reset(self):

            """ Flushes and resets the codec buffers used for keeping state.

                Calling this method should ensure that the data on the
                output is put into a clean state, that allows appending
                of new fresh data without having to rescan the whole
                stream to recover state.

            """
            pass

        def __getattr__(self,name,

                        getattr=getattr):

            """ Inherit all other methods from the underlying stream.
            """
            return getattr(self.stream,name)
|
||||
|
||||
    class StreamReader(Codec):

        def __init__(self,stream,errors='strict'):

            """ Creates a StreamReader instance.

                stream must be a file-like object open for reading
                (binary) data.

                The StreamReader may implement different error handling
                schemes by providing the errors keyword argument. These
                parameters are defined:

                 'strict'  - raise a ValueError (or a subclass)
                 'ignore'  - ignore the character and continue with the next
                 'replace' - replace with a suitable replacement character

            """
            self.stream = stream
            self.errors = errors

        def read(self,size=-1):

            """ Decodes data from the stream self.stream and returns the
                resulting object.

                size indicates the approximate maximum number of bytes to
                read from the stream for decoding purposes. The decoder
                can modify this setting as appropriate. The default value
                -1 indicates to read and decode as much as possible. size
                is intended to prevent having to decode huge files in one
                step.

                The method should use a greedy read strategy meaning that
                it should read as much data as is allowed within the
                definition of the encoding and the given size, e.g. if
                optional encoding endings or state markers are available
                on the stream, these should be read too.

            """
            # Unsliced reading:
            if size < 0:
                return self.decode(self.stream.read(),self.errors)[0]

            # Sliced reading:
            read = self.stream.read
            decode = self.decode
            data = read(size)
            i = 0
            while 1:
                try:
                    object, decodedbytes = decode(data,self.errors)
                except ValueError,why:
                    # This method is slow but should work under pretty much
                    # all conditions; at most 10 tries are made
                    i = i + 1
                    newdata = read(1)
                    if not newdata or i > 10:
                        raise
                    data = data + newdata
                else:
                    return object

        def reset(self):

            """ Resets the codec buffers used for keeping state.

                Note that no stream repositioning should take place.
                This method is primarily intended to be able to recover
                from decoding errors.

            """
            pass

        def __getattr__(self,name,

                        getattr=getattr):

            """ Inherit all other methods from the underlying stream.
            """
            return getattr(self.stream,name)
|
||||
|
||||
XXX What about .readline(), .readlines() ? These could be implemented
|
||||
using .read() as generic functions instead of requiring their
|
||||
implementation by all codecs. Also see Line Breaks.
|
||||
|
||||
Stream codec implementors are free to combine the StreamWriter and
|
||||
StreamReader interfaces into one class. Even combining all these with
|
||||
the Codec class should be possible.
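
The stream interfaces can be tried out with the codecs module of present-day Python 3. The sketch below assumes its codecs.getwriter() and codecs.getreader() factories behave like the stream_writer and stream_reader factories of this proposal, and uses io.BytesIO in place of cStringIO.

```python
import codecs
import io

raw = io.BytesIO()                    # stands in for a binary file

writer = codecs.getwriter("utf-8")(raw, errors="strict")
writer.write("abc\u1234")
print(raw.getvalue())                 # b'abc\xe1\x88\xb4'

raw.seek(0)
reader = codecs.getreader("utf-8")(raw, errors="strict")
pieces = []
while True:
    piece = reader.read(2)            # read a small chunk at a time
    if not piece:
        break
    pieces.append(piece)

decoded = "".join(pieces)
print(decoded == "abc\u1234")         # True
```

Reading in small chunks never splits the three-byte UTF-8 sequence of U+1234, because the reader keeps undecoded bytes as state between calls.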
|
||||
|
||||
Implementors are free to add additional methods to enhance the codec
|
||||
functionality or provide extra state information needed for them to
|
||||
work. The internal codec implementation will only use the above
|
||||
interfaces, though.
|
||||
|
||||
It is not required by the Unicode implementation to use these base
|
||||
classes, only the interfaces must match; this allows writing Codecs as
|
||||
extension types.
|
||||
|
||||
As a guideline, large mapping tables should be implemented using static
|
||||
C data in separate (shared) extension modules. That way multiple
|
||||
processes can share the same data.
|
||||
|
||||
A tool to auto-convert Unicode mapping files to mapping modules should be
|
||||
provided to simplify support for additional mappings (see References).
|
||||
|
||||
|
||||
Whitespace:
|
||||
-----------
|
||||
|
||||
The .split() method will have to know about what is considered
|
||||
whitespace in Unicode.
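
For example, in present-day Python 3 (assumed here to follow this rule) .split() treats Unicode space characters like the ASCII space:

```python
text = "a\u2003b\u3000c d"    # EM SPACE, IDEOGRAPHIC SPACE, ASCII space
parts = text.split()
print(parts)                  # ['a', 'b', 'c', 'd']
print("\u2003".isspace())     # True
```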
|
||||
|
||||
|
||||
Case Conversion:
|
||||
----------------
|
||||
|
||||
Case conversion is rather complicated with Unicode data, since there
|
||||
are many different conditions to respect. See
|
||||
|
||||
http://www.unicode.org/unicode/reports/tr13/
|
||||
|
||||
for some guidelines on implementing case conversion.
|
||||
|
||||
For Python, we should only implement the 1-1 conversions included in
|
||||
Unicode. Locale dependent and other special case conversions (see the
|
||||
Unicode standard file SpecialCasing.txt) should be left to user land
|
||||
routines and not go into the core interpreter.
|
||||
|
||||
The methods .capitalize() and .iscapitalized() should follow the case
|
||||
mapping algorithm defined in the above technical report as closely as
|
||||
possible.
|
||||
|
||||
|
||||
Line Breaks:
|
||||
------------
|
||||
|
||||
Line breaking should be done for all Unicode characters having the B
|
||||
property as well as the combinations CRLF, CR, LF (interpreted in that
|
||||
order) and other special line separators defined by the standard.
|
||||
|
||||
The Unicode type should provide a .splitlines() method which returns a
|
||||
list of lines according to the above specification. See Unicode
|
||||
Methods.
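
A short sketch of this behaviour, using str.splitlines() from present-day Python 3. Note that its flag is spelled keepends there rather than include_breaks; U+2028 LINE SEPARATOR serves as an example of a special line separator.

```python
text = "one\r\ntwo\rthree\nfour\u2028five"

lines = text.splitlines()
print(lines)
# ['one', 'two', 'three', 'four', 'five']

lines_with_breaks = text.splitlines(keepends=True)
print(lines_with_breaks)
# ['one\r\n', 'two\r', 'three\n', 'four\u2028', 'five']
```

CRLF is consumed as a single break, which is why the combinations have to be tried in the order CRLF, CR, LF.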
|
||||
|
||||
|
||||
Unicode Character Properties:
|
||||
-----------------------------
|
||||
|
||||
A separate module "unicodedata" should provide a compact interface to
|
||||
all Unicode character properties defined in the standard's
|
||||
UnicodeData.txt file.
|
||||
|
||||
Among other things, these properties provide ways to recognize
|
||||
numbers, digits, spaces, whitespace, etc.
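
A few such lookups, sketched with the unicodedata module of present-day Python 3. Its function names are assumed here and are not specified by this proposal.

```python
import unicodedata

print(unicodedata.category("A"))              # Lu (uppercase letter)
print(unicodedata.category(" "))              # Zs (space separator)
print(unicodedata.digit("\u0663"))            # 3  (ARABIC-INDIC DIGIT THREE)
print(unicodedata.bidirectional("\u2029"))    # B  (paragraph separator)
print(unicodedata.name("\u00e9"))             # LATIN SMALL LETTER E WITH ACUTE
```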
|
||||
|
||||
Since this module will have to provide access to all Unicode
|
||||
characters, it will eventually have to contain the data from
|
||||
UnicodeData.txt which takes up around 600kB. For this reason, the data
|
||||
should be stored in static C data. This enables compilation as a shared
|
||||
module which the underlying OS can share between processes (unlike
|
||||
normal Python code modules).
|
||||
|
||||
There should be a standard Python interface for accessing this information
|
||||
so that other implementors can plug in their own possibly enhanced versions,
|
||||
e.g. ones that do decompressing of the data on-the-fly.
|
||||
|
||||
|
||||
Private Code Point Areas:
|
||||
-------------------------
|
||||
|
||||
Support for these is left to user land Codecs and not explicitly
|
||||
integrated into the core. Note that due to the Internal Format being
|
||||
implemented, only the area between \uE000 and \uF8FF is useable for
|
||||
private encodings.
|
||||
|
||||
|
||||
Internal Format:
|
||||
----------------
|
||||
|
||||
The internal format for Unicode objects should use a Python specific
|
||||
fixed format <PythonUnicode> implemented as 'unsigned short' (or
|
||||
another unsigned numeric type having 16 bits). Byte order is platform
|
||||
dependent.
|
||||
|
||||
This format will hold UTF-16 encodings of the corresponding Unicode
|
||||
ordinals. The Python Unicode implementation will address these values
|
||||
as if they were UCS-2 values. UCS-2 and UTF-16 are the same for all
|
||||
currently defined Unicode character points. UTF-16 without surrogates
|
||||
provides access to about 64k characters and covers all characters in
|
||||
the Basic Multilingual Plane (BMP) of Unicode.
|
||||
|
||||
It is the Codec's responsibility to ensure that the data they pass to
|
||||
the Unicode object constructor respects this assumption. The
|
||||
constructor does not check the data for Unicode compliance or use of
|
||||
surrogates.
|
||||
|
||||
Future implementations can extend the 32 bit restriction to the full
|
||||
set of all UTF-16 addressable characters (around 1M characters).
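
The difference between BMP characters and surrogate pairs can be made visible by listing 16-bit code units. The helper utf16_units below is hypothetical and written for present-day Python 3; it is a sketch, not part of the proposed API.

```python
def utf16_units(text):
    """Return the 16-bit UTF-16 code units of text as integers."""
    data = text.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big")
            for i in range(0, len(data), 2)]


# A BMP character occupies one unit; UCS-2 and UTF-16 agree here.
print([hex(unit) for unit in utf16_units("\u1234")])       # ['0x1234']

# A character outside the BMP needs a surrogate pair.
print([hex(unit) for unit in utf16_units("\U00010000")])   # ['0xd800', '0xdc00']
```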
|
||||
|
||||
The Unicode API should provide interface routines from <PythonUnicode>
|
||||
to the compiler's wchar_t which can be 16 or 32 bit depending on the
|
||||
compiler/libc/platform being used.
|
||||
|
||||
Unicode objects should have a pointer to a cached Python string object
|
||||
<defencstr> holding the object's value using the current <default
|
||||
encoding>. This is needed for performance and internal parsing (see
|
||||
Internal Argument Parsing) reasons. The buffer is filled when the
|
||||
first conversion request to the <default encoding> is issued on the
|
||||
object.
|
||||
|
||||
Interning is not needed (for now), since Python identifiers are
|
||||
defined as being ASCII only.
|
||||
|
||||
codecs.BOM should return the byte order mark (BOM) for the format
|
||||
used internally. The codecs module should provide the following
|
||||
additional constants for convenience and reference (codecs.BOM will
|
||||
either be BOM_BE or BOM_LE depending on the platform):
|
||||
|
||||
BOM_BE: '\376\377'
|
||||
(corresponds to Unicode U+0000FEFF in UTF-16 on big endian
|
||||
platforms == ZERO WIDTH NO-BREAK SPACE)
|
||||
|
||||
BOM_LE: '\377\376'
|
||||
(corresponds to Unicode U+0000FFFE in UTF-16 on little endian
|
||||
platforms == defined as being an illegal Unicode character)
|
||||
|
||||
BOM4_BE: '\000\000\376\377'
|
||||
(corresponds to Unicode U+0000FEFF in UCS-4)
|
||||
|
||||
BOM4_LE: '\377\376\000\000'
|
||||
(corresponds to Unicode U+0000FFFE in UCS-4)
|
||||
|
||||
Note that Unicode sees big endian byte order as being "correct". The
|
||||
swapped order is taken to be an indicator for a "wrong" format, hence
|
||||
the illegal character definition.
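
These constants can be checked against the codecs module of present-day Python 3. The 4-byte marks are spelled BOM_UTF32_BE and BOM_UTF32_LE there, which is an assumption about the modern module rather than something this proposal defines.

```python
import codecs
import sys

print(codecs.BOM_BE == b"\376\377")    # True
print(codecs.BOM_LE == b"\377\376")    # True

# codecs.BOM is whichever mark matches the platform's byte order.
native = codecs.BOM_LE if sys.byteorder == "little" else codecs.BOM_BE
print(codecs.BOM == native)            # True

print(codecs.BOM_UTF32_BE == b"\000\000\376\377")    # True
print(codecs.BOM_UTF32_LE == b"\377\376\000\000")    # True
```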
|
||||
|
||||
The configure script should provide aid in deciding whether Python can
|
||||
use the native wchar_t type or not (it has to be a 16-bit unsigned
|
||||
type).
|
||||
|
||||
|
||||
Buffer Interface:
|
||||
-----------------
|
||||
|
||||
Implement the buffer interface using the <defencstr> Python string
|
||||
object as basis for bf_getcharbuf (corresponds to the "t#" argument
|
||||
parsing marker) and the internal buffer for bf_getreadbuf (corresponds
|
||||
to the "s#" argument parsing marker). If bf_getcharbuf is requested
|
||||
and the <defencstr> object does not yet exist, it is created first.
|
||||
|
||||
This has the advantage of being able to write to output streams (which
|
||||
typically use this interface) without additional specification of the
|
||||
encoding to use.
|
||||
|
||||
The internal format can also be accessed using the 'unicode-internal'
|
||||
codec, e.g. via u.encode('unicode-internal').
|
||||
|
||||
|
||||
Pickle/Marshalling:
|
||||
-------------------
|
||||
|
||||
Should have native Unicode object support. The objects should be
|
||||
encoded using platform independent encodings.
|
||||
|
||||
Marshal should use UTF-8 and Pickle should either choose
|
||||
Raw-Unicode-Escape (in text mode) or UTF-8 (in binary mode) as
|
||||
encoding. Using UTF-8 instead of UTF-16 has the advantage of
|
||||
eliminating the need to store a BOM.
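
The encodings chosen here can be observed in the pickle module of present-day Python 3, assuming that protocol 0 is the text mode and protocol 2 a binary mode in the sense of this section:

```python
import marshal
import pickle

u = "abc\u1234"

# Text mode: the character appears in Raw-Unicode-Escape form.
text_pickle = pickle.dumps(u, protocol=0)
print(b"\\u1234" in text_pickle)               # True

# Binary mode: the string is stored as UTF-8.
binary_pickle = pickle.dumps(u, protocol=2)
print(u.encode("utf-8") in binary_pickle)      # True

# Both formats, and marshal, round-trip the value.
print(pickle.loads(text_pickle) == u)          # True
print(pickle.loads(binary_pickle) == u)        # True
print(marshal.loads(marshal.dumps(u)) == u)    # True
```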
|
||||
|
||||
|
||||
Regular Expressions:
|
||||
--------------------
|
||||
|
||||
Secret Labs AB is working on a Unicode-aware regular expression
|
||||
machinery. It works on plain 8-bit, UCS-2, and (optionally) UCS-4
|
||||
internal character buffers.
|
||||
|
||||
Also see
|
||||
|
||||
http://www.unicode.org/unicode/reports/tr18/
|
||||
|
||||
for some remarks on how to treat Unicode REs.
|
||||
|
||||
|
||||
Formatting Markers:
|
||||
-------------------
|
||||
|
||||
Format markers are used in Python format strings. If Python strings
|
||||
are used as format strings, the following interpretations should be in
|
||||
effect:
|
||||
|
||||
'%s': '%s' does str(u) for Unicode objects embedded
|
||||
in Python strings, so the output will be
|
||||
u.encode(<default encoding>)
|
||||
|
||||
In case the format string is a Unicode object, all parameters are coerced
|
||||
to Unicode first and then put together and formatted according to the format
|
||||
string. Numbers are first converted to strings and then to Unicode.
|
||||
|
||||
'%s': Python strings are interpreted as Unicode
|
||||
string using the <default encoding>. Unicode
|
||||
objects are taken as is.
|
||||
|
||||
All other string formatters should work accordingly.
|
||||
|
||||
Example:
|
||||
|
||||
u"%s %s" % (u"abc", "abc") == u"abc abc"
|
||||
|
||||
|
||||
Internal Argument Parsing:
|
||||
--------------------------
|
||||
|
||||
These markers are used by the PyArg_ParseTuple() APIs:
|
||||
|
||||
'U': Check for Unicode object and return a pointer to it
|
||||
|
||||
's': For Unicode objects: auto convert them to the <default encoding>
|
||||
and return a pointer to the object's <defencstr> buffer.
|
||||
|
||||
's#': Access to the Unicode object via the bf_getreadbuf buffer interface
|
||||
(see Buffer Interface); note that the length relates to the buffer
|
||||
length, not the Unicode string length (this may be different
|
||||
depending on the Internal Format).
|
||||
|
||||
't#': Access to the Unicode object via the bf_getcharbuf buffer interface
|
||||
(see Buffer Interface); note that the length relates to the buffer
|
||||
length, not necessarily to the Unicode string length (this may
|
||||
be different depending on the <default encoding>).
|
||||
|
||||
|
||||
File/Stream Output:
|
||||
-------------------
|
||||
|
||||
Since file.write(object) and most other stream writers use the "s#"
|
||||
argument parsing marker for binary files and "t#" for text files, the
|
||||
buffer interface implementation determines the encoding to use (see
|
||||
Buffer Interface).
|
||||
|
||||
For explicit handling of files using Unicode, the standard
|
||||
stream codecs as available through the codecs module should
|
||||
be used.
|
||||
|
||||
XXX There should be a short-cut open(filename,mode,encoding) available which
|
||||
also assures that mode contains the 'b' character when needed.
|
||||
|
||||
|
||||
File/Stream Input:
|
||||
------------------
|
||||
|
||||
Only the user knows what encoding the input data uses, so no special
|
||||
magic is applied. The user will have to explicitly convert the string
|
||||
data to Unicode objects as needed or use the file wrappers defined in
|
||||
the codecs module (see File/Stream Output).
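
Explicit file handling can be sketched as follows in present-day Python 3, whose built-in open() accepts an encoding argument much like the short-cut suggested under File/Stream Output. The file name demo.txt is arbitrary.

```python
import os
import tempfile

text = "abc\u1234\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.txt")

    # Writing: the encoding is stated explicitly.
    with open(path, "w", encoding="utf-8", newline="") as f:
        f.write(text)

    # The file itself contains only encoded bytes.
    with open(path, "rb") as f:
        raw = f.read()

    # Reading: the caller has to know and name the encoding again.
    with open(path, "r", encoding="utf-8", newline="") as f:
        decoded = f.read()

print(raw)                # b'abc\xe1\x88\xb4\n'
print(decoded == text)    # True
```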
|
||||
|
||||
|
||||
Unicode Methods & Attributes:
|
||||
-----------------------------
|
||||
|
||||
All Python string methods, plus:
|
||||
|
||||
.encode([encoding=<default encoding>][,errors="strict"])
|
||||
--> see Unicode Output
|
||||
|
||||
.splitlines([include_breaks=0])
|
||||
--> breaks the Unicode string into a list of (Unicode) lines;
|
||||
returns the lines with line breaks included, if include_breaks
|
||||
is true. See Line Breaks for a specification of how line breaking
|
||||
is done.
|
||||
|
||||
|
||||
Code Base:
|
||||
----------
|
||||
|
||||
We should use Fredrik Lundh's Unicode object implementation as a basis.
|
||||
It already implements most of the string methods needed and provides a
|
||||
well written code base which we can build upon.
|
||||
|
||||
The object sharing implemented in Fredrik's implementation should
|
||||
be dropped.
|
||||
|
||||
|
||||
Test Cases:
|
||||
-----------
|
||||
|
||||
Test cases should follow those in Lib/test/test_string.py and include
|
||||
additional checks for the Codec Registry and the Standard Codecs.
|
||||
|
||||
|
||||
References:
|
||||
-----------
|
||||
|
||||
Unicode Consortium:
|
||||
http://www.unicode.org/
|
||||
|
||||
Unicode FAQ:
|
||||
http://www.unicode.org/unicode/faq/
|
||||
|
||||
Unicode 3.0:
|
||||
http://www.unicode.org/unicode/standard/versions/Unicode3.0.html
|
||||
|
||||
Unicode-TechReports:
|
||||
http://www.unicode.org/unicode/reports/techreports.html
|
||||
|
||||
Unicode-Mappings:
|
||||
ftp://ftp.unicode.org/Public/MAPPINGS/
|
||||
|
||||
Introduction to Unicode (a little outdated but still nice to read):
|
||||
http://www.nada.kth.se/i18n/ucs/unicode-iso10646-oview.html
|
||||
|
||||
Encodings:
|
||||
|
||||
Overview:
|
||||
http://czyborra.com/utf/
|
||||
|
||||
UCS-2:
|
||||
http://www.uazone.com/multiling/unicode/ucs2.html
|
||||
|
||||
UTF-7:
|
||||
Defined in RFC2152, e.g.
|
||||
http://www.uazone.com/multiling/ml-docs/rfc2152.txt
|
||||
|
||||
UTF-8:
|
||||
Defined in RFC2279, e.g.
|
||||
http://info.internet.isi.edu/in-notes/rfc/files/rfc2279.txt
|
||||
|
||||
UTF-16:
|
||||
http://www.uazone.com/multiling/unicode/wg2n1035.html
|
||||
|
||||
|
||||
History of this Proposal:
|
||||
-------------------------
|
||||
1.2:
|
||||
1.1: Added note about comparisons and hash values. Added note about
|
||||
case mapping algorithms. Changed stream codecs .read() and
|
||||
.write() method to match the standard file-like object methods
|
||||
(bytes consumed information is no longer returned by the methods)
|
||||
1.0: changed encode Codec method to be symmetric to the decode method
|
||||
(they both return (object, data consumed) now and thus become
|
||||
interchangeable); removed __init__ method of Codec class (the
|
||||
methods are stateless) and moved the errors argument down to the
|
||||
methods; made the Codec design more generic w/r to type of input
|
||||
and output objects; changed StreamWriter.flush to StreamWriter.reset
|
||||
in order to avoid overriding the stream's .flush() method;
|
||||
renamed .breaklines() to .splitlines(); renamed the module unicodec
|
||||
to codecs; modified the File I/O section to refer to the stream codecs.
|
||||
0.9: changed errors keyword argument definition; added 'replace' error
|
||||
handling; changed the codec APIs to accept buffer like objects on
|
||||
input; some minor typo fixes; added Whitespace section and
|
||||
included references for Unicode characters that have the whitespace
|
||||
and the line break characteristic; added note that search functions
|
||||
can expect lower-case encoding names; dropped slicing and offsets
|
||||
in the codec APIs
|
||||
0.8: added encodings package and raw unicode escape encoding; untabified
|
||||
the proposal; added notes on Unicode format strings; added
|
||||
.breaklines() method
|
||||
0.7: added a whole new set of codec APIs; added a different encoder
|
||||
lookup scheme; fixed some names
|
||||
0.6: changed "s#" to "t#"; changed <defencbuf> to <defencstr> holding
|
||||
a real Python string object; changed Buffer Interface to delegate
|
||||
requests to <defencstr>'s buffer interface; removed the explicit
|
||||
reference to the unicodec.codecs dictionary (the module can implement
|
||||
this in a way fit for the purpose); removed the settable default
|
||||
encoding; moved UnicodeError from unicodec to exceptions; "s#"
|
||||
now returns the internal data; passed the UCS-2/UTF-16 checking
|
||||
from the Unicode constructor to the Codecs
|
||||
0.5: moved sys.bom to unicodec.BOM; added sections on case mapping,
|
||||
private use encodings and Unicode character properties
|
||||
0.4: added Codec interface, notes on %-formatting, changed some encoding
|
||||
details, added comments on stream wrappers, fixed some discussion
|
||||
points (most important: Internal Format), clarified the
|
||||
'unicode-escape' encoding, added encoding references
|
||||
0.3: added references, comments on codec modules, the internal format,
|
||||
bf_getcharbuffer and the RE engine; added 'unicode-escape' encoding
|
||||
proposed by Tim Peters and fixed repr(u) accordingly
|
||||
0.2: integrated Guido's suggestions, added stream codecs and file
|
||||
wrapping
|
||||
0.1: first version
|
||||
|
||||
|
||||
-----------------------------------------------------------------------------
|
||||
Written by Marc-Andre Lemburg, 1999-2000, mal@lemburg.com
|
||||
-----------------------------------------------------------------------------
|