From 2e14f09d2f96d47f230d9c0845f47d4b20c19552 Mon Sep 17 00:00:00 2001
From: Matthew Honnibal
Date: Fri, 16 Jan 2015 19:04:03 +1100
Subject: [PATCH] * Add features table

---
 docs/source/quickstart.rst | 45 ++++++++++++++++++++++++++++++++++++++
 1 file changed, 45 insertions(+)

diff --git a/docs/source/quickstart.rst b/docs/source/quickstart.rst
index 9ac049e29..96d2a52e6 100644
--- a/docs/source/quickstart.rst
+++ b/docs/source/quickstart.rst
@@ -106,4 +106,49 @@ Create a bag-of-words representation:
 .. py:attribute:: head: Token
+Features
+--------
++--------------------------------------------------------------------------+
+| Boolean Features                                                         |
++----------+---------------------------------------------------------------+
+| IS_ALPHA | :py:meth:`str.isalpha`                                        |
++----------+---------------------------------------------------------------+
+| IS_DIGIT | :py:meth:`str.isdigit`                                        |
++----------+---------------------------------------------------------------+
+| IS_LOWER | :py:meth:`str.islower`                                        |
++----------+---------------------------------------------------------------+
+| IS_SPACE | :py:meth:`str.isspace`                                        |
++----------+---------------------------------------------------------------+
+| IS_TITLE | :py:meth:`str.istitle`                                        |
++----------+---------------------------------------------------------------+
+| IS_UPPER | :py:meth:`str.isupper`                                        |
++----------+---------------------------------------------------------------+
+| IS_ASCII | all(ord(c) < 128 for c in string)                             |
++----------+---------------------------------------------------------------+
+| IS_PUNCT | all(unicodedata.category(c).startswith('P') for c in string)  |
++----------+---------------------------------------------------------------+
+| LIKE_URL | Using various heuristics, does the string resemble a URL?     |
++----------+---------------------------------------------------------------+
+| LIKE_NUM | "Two", "10", "1,000", "10.54", "1/2" etc all match            |
++----------+---------------------------------------------------------------+
+| ID of string features                                                    |
++----------+---------------------------------------------------------------+
+| SIC      | The original string, unmodified.                              |
++----------+---------------------------------------------------------------+
+| NORM1    | The string after level 1 normalization: case, spelling        |
++----------+---------------------------------------------------------------+
+| NORM2    | The string after level 2 normalization                        |
++----------+---------------------------------------------------------------+
+| SHAPE    | Word shape, e.g. 10 --> dd, Garden --> Xxxx, Hi!5 --> Xx!d    |
++----------+---------------------------------------------------------------+
+| PREFIX   | A short slice from the start of the string.                   |
++----------+---------------------------------------------------------------+
+| SUFFIX   | A short slice from the end of the string.                     |
++----------+---------------------------------------------------------------+
+| CLUSTER  | Brown cluster ID of the word                                  |
++----------+---------------------------------------------------------------+
+| LEMMA    | The word's lemma, i.e. morphological suffixes removed         |
++----------+---------------------------------------------------------------+
+| TAG      | The word's part-of-speech tag                                 |
++----------+---------------------------------------------------------------+
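
The IS_ASCII and IS_PUNCT rows give their definitions as Python expressions, and the SHAPE row gives three input/output examples. The sketch below wraps those in plain Python to make them easy to try out. It is an illustration only, not the library's implementation: the function names are hypothetical, and capping each run of identical shape characters at three is an assumption inferred from the Garden --> Xxxx example in the table.

```python
import unicodedata


def is_ascii(string):
    # IS_ASCII row: every character has a code point below 128.
    return all(ord(c) < 128 for c in string)


def is_punct(string):
    # IS_PUNCT row: every character is in a Unicode punctuation category (P*).
    return all(unicodedata.category(c).startswith('P') for c in string)


def word_shape(string, max_run=3):
    # Hypothetical helper: digits become 'd', uppercase letters 'X',
    # lowercase letters 'x', and any other character is kept unchanged.
    # max_run=3 is an assumption chosen to reproduce the table's examples.
    shape = []
    last = None
    run = 0
    for c in string:
        if c.isdigit():
            s = 'd'
        elif c.isupper():
            s = 'X'
        elif c.islower():
            s = 'x'
        else:
            s = c
        # Count how many times the same shape character has repeated.
        run = run + 1 if s == last else 1
        last = s
        if run <= max_run:
            shape.append(s)
    return ''.join(shape)


if __name__ == '__main__':
    for word in ('10', 'Garden', 'Hi!5'):
        print(word, '-->', word_shape(word))
```

Because all() of an empty iterable is True, both is_ascii('') and is_punct('') return True in this sketch; a caller that cares about empty strings would need to check for them separately.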