diff --git a/docs/source/features.rst b/docs/source/features.rst new file mode 100644 index 000000000..3e586dd68 --- /dev/null +++ b/docs/source/features.rst @@ -0,0 +1,78 @@ +Lexeme Features +=============== + +A lexeme is an entry in the lexicon --- the vocabulary --- for a word, punctuation +symbol, whitespace unit, etc. Lexemes come with lots of pre-computed information, +that help you write good feature functions. Features are integer-valued where +possible --- instead of strings, spaCy refers to strings by consecutive ID numbers, +which you can use to look up the string values if necessary. + +String features +--------------- + ++---------+-------------------------------------------------------------------+ +| SIC | The word as it appeared in the sentence, unaltered. | ++---------+-------------------------------------------------------------------+ +| NORM | For frequent words, case normalization is applied. | +| | Otherwise, back-off to SHAPE. | ++---------+-------------------------------------------------------------------+ +| SHAPE | Remap the characters of the word as follows: | +| | | +| | a-z --> x, A-Z --> X, 0-9 --> d, ,.;:"'?!$- --> self, other --> \*| +| | | +| | Trim sequences of length 3+ to 3, e.g | +| | | +| | apples --> xxx, Apples --> Xxxx, app9LES@ --> xxx9XXX* | ++---------+-------------------------------------------------------------------+ +| ASCIIED | Use unidecode.unidecode(sic) to approximate the word using the | +| | ascii characters. | ++---------+-------------------------------------------------------------------+ +| PREFIX | sic_unicode_string[:1] | ++---------+-------------------------------------------------------------------+ +| SUFFIX | sic_unicode_string[-3:] | ++---------+-------------------------------------------------------------------+ + + +Integer features +---------------- + ++--------------+--------------------------------------------------------------+ +| LENGTH | Length of the string, in unicode | ++--------------+--------------------------------------------------------------+ +| CLUSTER | Brown cluster | ++--------------+--------------------------------------------------------------+ +| POS_TYPE | K-means cluster of word's tag affinities | ++--------------+--------------------------------------------------------------+ +| SENSE_TYPE | K-means cluster of word's sense affinities | ++--------------+--------------------------------------------------------------+ + +Boolean features +---------------- + ++-------------+--------------------------------------------------------------+ +| IS_ALPHA | The result of sic.isalpha() | ++-------------+--------------------------------------------------------------+ +| IS_ASCII | Check whether all the word's characters are ascii characters | ++-------------+--------------------------------------------------------------+ +| IS_DIGIT | The result of sic.isdigit() | ++-------------+--------------------------------------------------------------+ +| IS_LOWER | The result of sic.islower() | ++-------------+--------------------------------------------------------------+ +| IS_PUNCT | Check whether all characters are in the class TODO | ++-------------+--------------------------------------------------------------+ +| IS_SPACE | The result of sic.isspace() | ++-------------+--------------------------------------------------------------+ +| IS_TITLE | The result of sic.istitle() | ++-------------+--------------------------------------------------------------+ +| IS_UPPER | The result of sic.isupper() | ++-------------+--------------------------------------------------------------+ +| LIKE_URL | Check whether the string looks like it could be a URL. Aims | +| | for low false negative rate. | ++-------------+--------------------------------------------------------------+ +| LIKE_NUMBER | Check whether the string looks like it could be a numeric | +| | entity, e.g. 10,000 10th .10 . Skews for low false negative | +| | rate. | ++-------------+--------------------------------------------------------------+ +| IN_LIST | Facility for loading arbitrary run-time word lists? | ++-------------+--------------------------------------------------------------+ +