spaCy/docs/source/features.rst

Lexeme Features
===============

A lexeme is an entry in the lexicon --- the vocabulary --- for a word, punctuation
symbol, whitespace unit, etc.  Lexemes come with lots of pre-computed information,
that help you write good feature functions.  Features are integer-valued where
possible --- instead of strings, spaCy refers to strings by consecutive ID numbers,
which you can use to look up the string values if necessary.

String features
---------------

+---------+-------------------------------------------------------------------+
| SIC     | The word as it appeared in the sentence, unaltered.               |
+---------+-------------------------------------------------------------------+
| NORM    | For frequent words, case normalization is applied.                |
|         | Otherwise, back-off to SHAPE.                                     |
+---------+-------------------------------------------------------------------+
| SHAPE   | Remap the characters of the word as follows:                      |
|         |                                                                   |
|         | a-z --> x, A-Z --> X, 0-9 --> d, ,.;:"'?!$- --> self, other --> \*|
|         |                                                                   |
|         | Trim sequences of length 3+ to 3, e.g                             |
|         |                                                                   |
|         | apples --> xxx, Apples --> Xxxx, app9LES@ --> xxx9XXX*            |
+---------+-------------------------------------------------------------------+
| ASCIIED | Use unidecode.unidecode(sic) to approximate the word using the    |
|         | ascii characters.                                                 |
+---------+-------------------------------------------------------------------+
| PREFIX  | sic_unicode_string[:1]                                            |
+---------+-------------------------------------------------------------------+
| SUFFIX  | sic_unicode_string[-3:]                                           |
+---------+-------------------------------------------------------------------+


Integer features
----------------

+--------------+--------------------------------------------------------------+
| LENGTH       |  Length of the string, in unicode                            |
+--------------+--------------------------------------------------------------+
| CLUSTER      | Brown cluster                                                |
+--------------+--------------------------------------------------------------+
| POS_TYPE     | K-means cluster of word's tag affinities                     |
+--------------+--------------------------------------------------------------+
| SENSE_TYPE   | K-means cluster of word's sense affinities                   |
+--------------+--------------------------------------------------------------+

Boolean features
----------------

+-------------+--------------------------------------------------------------+
| IS_ALPHA    | The result of sic.isalpha()                                  |
+-------------+--------------------------------------------------------------+
| IS_ASCII    | Check whether all the word's characters are ascii characters |
+-------------+--------------------------------------------------------------+
| IS_DIGIT    | The result of sic.isdigit()                                  |
+-------------+--------------------------------------------------------------+
| IS_LOWER    | The result of sic.islower()                                  |
+-------------+--------------------------------------------------------------+
| IS_PUNCT    | Check whether all characters are in the class TODO           |
+-------------+--------------------------------------------------------------+
| IS_SPACE    | The result of sic.isspace()                                  |
+-------------+--------------------------------------------------------------+
| IS_TITLE    | The result of sic.istitle()                                  |
+-------------+--------------------------------------------------------------+
| IS_UPPER    | The result of sic.isupper()                                  |
+-------------+--------------------------------------------------------------+
| LIKE_URL    | Check whether the string looks like it could be a URL. Aims  |
|             | for low false negative rate.                                 |
+-------------+--------------------------------------------------------------+
| LIKE_NUMBER | Check whether the string looks like it could be a numeric    |
|             | entity, e.g. 10,000 10th .10 . Skews for low false negative  |
|             | rate.                                                        |
+-------------+--------------------------------------------------------------+
| IN_LIST     | Facility for loading arbitrary run-time word lists?          |
+-------------+--------------------------------------------------------------+
* Add draft work on features 2015-01-30 05:46:52 +00:00			`Lexeme Features`
			`===============`

			`A lexeme is an entry in the lexicon --- the vocabulary --- for a word, punctuation`
			`symbol, whitespace unit, etc. Lexemes come with lots of pre-computed information,`
			`that help you write good feature functions. Features are integer-valued where`
			`possible --- instead of strings, spaCy refers to strings by consecutive ID numbers,`
			`which you can use to look up the string values if necessary.`

			`String features`
			`---------------`

			`+---------+-------------------------------------------------------------------+`
			`\| SIC \| The word as it appeared in the sentence, unaltered. \|`
			`+---------+-------------------------------------------------------------------+`
			`\| NORM \| For frequent words, case normalization is applied. \|`
			`\| \| Otherwise, back-off to SHAPE. \|`
			`+---------+-------------------------------------------------------------------+`
			`\| SHAPE \| Remap the characters of the word as follows: \|`
			`\| \| \|`
			`\| \| a-z --> x, A-Z --> X, 0-9 --> d, ,.;:"'?!$- --> self, other --> \*\|`
			`\| \| \|`
			`\| \| Trim sequences of length 3+ to 3, e.g \|`
			`\| \| \|`
			`\| \| apples --> xxx, Apples --> Xxxx, app9LES@ --> xxx9XXX* \|`
			`+---------+-------------------------------------------------------------------+`
			`\| ASCIIED \| Use unidecode.unidecode(sic) to approximate the word using the \|`
			`\| \| ascii characters. \|`
			`+---------+-------------------------------------------------------------------+`
			`\| PREFIX \| sic_unicode_string[:1] \|`
			`+---------+-------------------------------------------------------------------+`
			`\| SUFFIX \| sic_unicode_string[-3:] \|`
			`+---------+-------------------------------------------------------------------+`


			`Integer features`
			`----------------`

			`+--------------+--------------------------------------------------------------+`
			`\| LENGTH \| Length of the string, in unicode \|`
			`+--------------+--------------------------------------------------------------+`
			`\| CLUSTER \| Brown cluster \|`
			`+--------------+--------------------------------------------------------------+`
			`\| POS_TYPE \| K-means cluster of word's tag affinities \|`
			`+--------------+--------------------------------------------------------------+`
			`\| SENSE_TYPE \| K-means cluster of word's sense affinities \|`
			`+--------------+--------------------------------------------------------------+`

			`Boolean features`
			`----------------`

			`+-------------+--------------------------------------------------------------+`
			`\| IS_ALPHA \| The result of sic.isalpha() \|`
			`+-------------+--------------------------------------------------------------+`
			`\| IS_ASCII \| Check whether all the word's characters are ascii characters \|`
			`+-------------+--------------------------------------------------------------+`
			`\| IS_DIGIT \| The result of sic.isdigit() \|`
			`+-------------+--------------------------------------------------------------+`
			`\| IS_LOWER \| The result of sic.islower() \|`
			`+-------------+--------------------------------------------------------------+`
			`\| IS_PUNCT \| Check whether all characters are in the class TODO \|`
			`+-------------+--------------------------------------------------------------+`
			`\| IS_SPACE \| The result of sic.isspace() \|`
			`+-------------+--------------------------------------------------------------+`
			`\| IS_TITLE \| The result of sic.istitle() \|`
			`+-------------+--------------------------------------------------------------+`
			`\| IS_UPPER \| The result of sic.isupper() \|`
			`+-------------+--------------------------------------------------------------+`
			`\| LIKE_URL \| Check whether the string looks like it could be a URL. Aims \|`
			`\| \| for low false negative rate. \|`
			`+-------------+--------------------------------------------------------------+`
			`\| LIKE_NUMBER \| Check whether the string looks like it could be a numeric \|`
			`\| \| entity, e.g. 10,000 10th .10 . Skews for low false negative \|`
			`\| \| rate. \|`
			`+-------------+--------------------------------------------------------------+`
			`\| IN_LIST \| Facility for loading arbitrary run-time word lists? \|`
			`+-------------+--------------------------------------------------------------+`