* Add features table

Matthew Honnibal 2015-01-16 19:04:03 +11:00
parent 1590788dd4
commit 2e14f09d2f
1 changed files with 45 additions and 0 deletions


@@ -106,4 +106,49 @@ Create a bag-of-words representation:
.. py:attribute:: head: Token
Features
--------
+--------------------------------------------------------------------------+
| Boolean Features                                                         |
+----------+---------------------------------------------------------------+
| IS_ALPHA | :py:meth:`str.isalpha`                                        |
+----------+---------------------------------------------------------------+
| IS_DIGIT | :py:meth:`str.isdigit`                                        |
+----------+---------------------------------------------------------------+
| IS_LOWER | :py:meth:`str.islower`                                        |
+----------+---------------------------------------------------------------+
| IS_SPACE | :py:meth:`str.isspace`                                        |
+----------+---------------------------------------------------------------+
| IS_TITLE | :py:meth:`str.istitle`                                        |
+----------+---------------------------------------------------------------+
| IS_UPPER | :py:meth:`str.isupper`                                        |
+----------+---------------------------------------------------------------+
| IS_ASCII | all(ord(c) < 128 for c in string)                             |
+----------+---------------------------------------------------------------+
| IS_PUNCT | all(unicodedata.category(c).startswith('P') for c in string)  |
+----------+---------------------------------------------------------------+
| LIKE_URL | Using various heuristics, does the string resemble a URL?     |
+----------+---------------------------------------------------------------+
| LIKE_NUM | "Two", "10", "1,000", "10.54", "1/2" etc. all match           |
+----------+---------------------------------------------------------------+
| ID of string features                                                    |
+----------+---------------------------------------------------------------+
| SIC      | The original string, unmodified.                              |
+----------+---------------------------------------------------------------+
| NORM1    | The string after level 1 normalization: case, spelling        |
+----------+---------------------------------------------------------------+
| NORM2    | The string after level 2 normalization                        |
+----------+---------------------------------------------------------------+
| SHAPE    | Word shape, e.g. 10 --> dd, Garden --> Xxxx, Hi!5 --> Xx!d    |
+----------+---------------------------------------------------------------+
| PREFIX   | A short slice from the start of the string.                   |
+----------+---------------------------------------------------------------+
| SUFFIX   | A short slice from the end of the string.                     |
+----------+---------------------------------------------------------------+
| CLUSTER  | Brown cluster ID of the word                                  |
+----------+---------------------------------------------------------------+
| LEMMA    | The word's lemma, i.e. morphological suffixes removed         |
+----------+---------------------------------------------------------------+
| TAG      | The word's part-of-speech tag                                 |
+----------+---------------------------------------------------------------+
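
Three of the rows above can be illustrated in plain Python. The sketch below is not the library's implementation: the function names ``is_ascii``, ``is_punct`` and ``word_shape`` are hypothetical, and the ``max_run`` cap of 3 is an assumption inferred only from the ``Garden --> Xxxx`` example in the SHAPE row. The first two functions simply wrap the expressions given in the IS_ASCII and IS_PUNCT rows.

```python
import unicodedata


def is_ascii(string):
    # IS_ASCII row: every code point is below 128.
    return all(ord(c) < 128 for c in string)


def is_punct(string):
    # IS_PUNCT row: every character falls in a Unicode "P*" category.
    return all(unicodedata.category(c).startswith('P') for c in string)


def word_shape(string, max_run=3):
    # Hypothetical sketch of the SHAPE row: map uppercase letters to 'X',
    # lowercase letters to 'x', digits to 'd', and keep other characters.
    # Runs of the same mapped character are cut off after max_run
    # (an assumption chosen so that "Garden" yields "Xxxx").
    shape = []
    last = None
    run = 0
    for c in string:
        if c.isalpha():
            mapped = 'X' if c.isupper() else 'x'
        elif c.isdigit():
            mapped = 'd'
        else:
            mapped = c
        run = run + 1 if mapped == last else 1
        last = mapped
        if run <= max_run:
            shape.append(mapped)
    return ''.join(shape)


for word in ("10", "Garden", "Hi!5"):
    print(word, "-->", word_shape(word))
```

Note that ``all()`` over an empty string returns ``True``, so both boolean sketches treat the empty string as matching; whether the real features do the same is not stated in the table.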