Created Strings and Unicode (markdown)

2011-05-30 11:03:47 -07:00 · 2011-05-30 11:03:47 -07:00 · b402e33f77
parent fef9f0e38a
commit b402e33f77
1 changed files with 16 additions and 0 deletions
--- a/Strings-and-Unicode.md
+++ b/Strings-and-Unicode.md
@@ -0,0 +1,16 @@
+# Summary
+
+Tornado supports both python2 and python3, which requires care in dealing with strings (see also [PEP 3333](http://www.python.org/dev/peps/pep-3333/#a-note-on-string-types)).  Even though each version of python only has two string types, there are logically *three* types that must be considered when supporting both versions:
+
+* `bytes`:  Represented by the `str` type in python2 and the `bytes` type in python3.  To refer to this type unambiguously in tornado code, use `tornado.util.bytes_type`, and use `tornado.util.b("")` to create byte literals.  Byte literal support wasn't added to python until version 2.6, so until we drop support for 2.5 we must use our own aliases.
+* `unicode`: Represented by the `unicode` type in python2 and the `str` type in python3.  Tornado code refers to this type with the python2 names: `unicode` for the type and `u""` for literals.
+* `str`: The native string type, called `str` in both versions but equivalent to `bytes` in python2 and `unicode` in python3.
+
+Tornado uses UTF-8 as its default encoding, and the `tornado.escape` module provides the `utf8`, `to_unicode`, and `native_str` functions to convert arguments to `bytes`, `unicode`, and `str` respectively.  In general, tornado methods should accept any string type as arguments.  Return values should be native strings when possible.  Data from external sources should only be converted to unicode if a definite encoding is known; otherwise it should be left as bytes.
+
+## Detailed rules
+
+* Low-level code such as `IOStream` generally deals solely in bytes.
+* Output methods such as `RequestHandler.write` accept either bytes or unicode.  Unicode strings will be encoded as utf8, but byte strings are written unchanged and never decoded, so applications can output non-utf8 data.
+* HTTP headers are generally ascii (officially they're latin1, but use of non-ascii is rare), so we mostly represent them (and data derived from them) with native strings.  Note that in python2, if a header contains non-ascii data, tornado will decode it as latin1 and re-encode it as utf8!
+* Query parameters are sent percent-encoded, but the underlying character set is unspecified.  In `HTTPRequest.arguments` the percent-encoding has been undone, resulting in byte strings for the argument values.  In `RequestHandler.get_argument` these bytes are decoded according to `RequestHandler.decode_argument`, allowing the application to choose the encoding to be used (default utf8).  Note that because keys are nearly always ascii and having byte strings as keys is awkward, the keys are converted to native strings (using latin1 on python3).
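
The output and query-parameter rules above can be illustrated with a minimal sketch. It assumes Python 3 and uses only the standard library; the helpers `as_bytes`, `as_text`, and `decode_query_value` are hypothetical stand-ins written for this example and are not Tornado's implementation of `utf8`, `to_unicode`, or `decode_argument`.

```python
# Python 3 sketch; every helper here is a hypothetical stand-in, not Tornado code.
from urllib.parse import unquote_to_bytes


def as_bytes(value):
    """Return bytes: text is encoded as UTF-8, bytes pass through untouched."""
    if isinstance(value, bytes):
        return value
    return value.encode("utf-8")


def as_text(value):
    """Return str: bytes are decoded as UTF-8, text passes through untouched."""
    if isinstance(value, str):
        return value
    return value.decode("utf-8")


def decode_query_value(raw, encoding="utf-8"):
    """Undo percent-encoding to get bytes, then decode with the chosen encoding."""
    return unquote_to_bytes(raw).decode(encoding)


# Output rule: text is encoded as UTF-8, while bytes are never decoded,
# so non-UTF-8 data (latin1 here) is passed through unchanged.
body_text = as_bytes("caf\u00e9")
body_raw = as_bytes(b"caf\xe9")

# Query-parameter rule: the percent-decoded value is bytes, and the
# application picks the encoding used to turn it into text.
arg_utf8 = decode_query_value("caf%C3%A9")
arg_latin1 = decode_query_value("caf%E9", encoding="latin1")
```

Keeping the percent-decoding step separate from the character decoding step mirrors the split described above between `HTTPRequest.arguments` (bytes) and `RequestHandler.get_argument` (decoded text).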