Strings and Unicode
bdarnell edited this page 2011-05-30 11:04:04 -07:00
Summary
Tornado supports both python2 and python3, which requires care in dealing with strings (see also PEP 3333). Even though each version of python only has two string types, there are logically three types that must be considered when supporting both versions:
- `bytes`: Represented by the `str` type in python2 and the `bytes` type in python3. To unambiguously refer to this type in tornado code, use `tornado.util.bytes_type`, and use `tornado.util.b("")` to create byte literals (byte literal support wasn't added to python until version 2.6, so until we drop support for 2.5 we must use our own aliases).
- `unicode`: Represented by the `unicode` type in python2 and the `str` type in python3. Tornado code refers to this type with the python2 names: `unicode` for the type and `u""` for literals.
- `str`: The native string type, called `str` in both versions but equivalent to `bytes` in python2 and `unicode` in python3.
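As a rough illustration of the three logical types, version-independent aliases along these lines could be defined once and used everywhere. This is a minimal sketch, not necessarily how `tornado.util` implements them: `bytes_type` and `b` are named after the helpers mentioned above, `unicode_type` is a hypothetical name introduced here, and the latin1 encoding in `b` is an assumption chosen so that each character maps to exactly one byte.

```python
import sys

if sys.version_info[0] >= 3:
    # python3: bytes and text are distinct built-in types.
    bytes_type = bytes
    unicode_type = str  # hypothetical alias, for illustration only

    def b(s):
        # Turn a native (text) literal into bytes, one byte per character.
        return s.encode("latin1")
else:
    # python2: the native str type is already a byte string.
    bytes_type = str
    unicode_type = unicode  # noqa: F821 (only defined on python2)

    def b(s):
        return s

# A byte literal that works on every supported version.
CRLF = b("\r\n")
```

With aliases like these, a check such as `isinstance(value, bytes_type)` means the same thing on either version.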
Tornado uses UTF-8 as its default encoding, and the `tornado.escape` module provides `utf8`, `to_unicode`, and `native_str` functions to convert arguments to the three string types. In general, tornado methods should accept any string type as arguments. Return values should be native strings when possible. Data from external sources should only be converted to unicode if a definite encoding is known; otherwise it should be left as bytes.
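The behavior of the three conversion functions can be sketched as follows. These are simplified stand-ins written for python3 only and named after the `tornado.escape` functions; the real implementations may differ in detail, and the `TypeError` on non-string input is an assumption of this sketch.

```python
def utf8(value):
    """Return a byte string: text is encoded as UTF-8, bytes pass through."""
    if isinstance(value, bytes):
        return value
    if not isinstance(value, str):
        raise TypeError("expected bytes or str, got %r" % type(value))
    return value.encode("utf-8")


def to_unicode(value):
    """Return a unicode string: bytes are decoded as UTF-8, text passes through."""
    if isinstance(value, str):
        return value
    if not isinstance(value, bytes):
        raise TypeError("expected bytes or str, got %r" % type(value))
    return value.decode("utf-8")


# On python3 the native string type is unicode, so native_str is the same
# conversion as to_unicode. On python2 it would correspond to utf8 instead.
native_str = to_unicode
```

Because each function passes its own target type through unchanged, a method can call it on any string argument without first checking which type the caller supplied.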
Detailed rules
- Low-level code such as `IOStream` generally deals solely in bytes.
- Output methods such as `RequestHandler.write` accept either bytes or unicode. Unicode strings will be encoded as utf8, but byte strings will never be decoded, so applications can output non-utf8 data.
- HTTP headers are generally ascii (officially they're latin1, but use of non-ascii is rare), so we mostly represent them (and data derived from them) with native strings (note that in python2 if a header contains non-ascii data tornado will decode the latin1 and re-encode as utf8!)
- Query parameters are sent percent-encoded, but the underlying character set is unspecified. In `HTTPRequest.arguments` the percent-encoding has been undone, resulting in byte strings for the argument values. In `RequestHandler.get_argument` these bytes are decoded according to `RequestHandler.decode_argument`, allowing the application to choose the encoding to be used (default utf8). Note that because keys are nearly always ascii and having byte strings as keys is awkward, the keys are converted to native strings (using latin1 on python3).
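The two-stage handling of query parameters described in the last rule can be sketched with the python3 standard library. The functions `parse_query_bytes` and `decode_argument` below are hypothetical standalone helpers written for illustration; they are not Tornado's code and only approximate the behavior described above.

```python
from urllib.parse import unquote_to_bytes


def parse_query_bytes(query):
    """Undo percent-encoding only: keys become native strings (via latin1),
    values stay as byte strings because their character set is unknown."""
    arguments = {}
    for pair in query.split("&"):
        if not pair:
            continue
        key, _, value = pair.partition("=")
        key_bytes = unquote_to_bytes(key.replace("+", " "))
        value_bytes = unquote_to_bytes(value.replace("+", " "))
        arguments.setdefault(key_bytes.decode("latin1"), []).append(value_bytes)
    return arguments


def decode_argument(value, encoding="utf-8"):
    """Second stage: decode a raw value with an application-chosen encoding."""
    return value.decode(encoding)


args = parse_query_bytes("name=caf%C3%A9&legacy=%E9")
name = decode_argument(args["name"][0])               # UTF-8 by default
legacy = decode_argument(args["legacy"][0], "latin1")  # application override
```

Keeping the raw bytes until the second stage is what lets an application handle a client that sends a non-utf8 encoding: the `legacy` value above is not valid UTF-8 and would raise `UnicodeDecodeError` if decoded with the default.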