diff --git a/docs/source/index.rst b/docs/source/index.rst
index e1a0b0112..af87ad18f 100644
--- a/docs/source/index.rst
+++ b/docs/source/index.rst
@@ -3,9 +3,9 @@
    You can adapt this file completely to your liking, but it should at least
    contain the root `toctree` directive.
 
-================================
-spaCy: Industrial-strength NLP
-================================
+===================================
+spaCy: Text-processing for products
+===================================
 
 spaCy is a library for industrial-strength text processing in Python and Cython.
 Its core values are efficiency, accuracy and minimalism: you get a fast pipeline of
@@ -15,22 +15,23 @@ spaCy is particularly good for feature extraction, because it pre-loads lexical
 resources, maps strings to integer IDs, and supports output of numpy arrays:
 
     >>> from spacy.en import English
-    >>> from spacy.en import attrs
     >>> nlp = English()
-    >>> tokens = nlp(u'An example sentence', pos_tag=True, parse=True)
-    >>> tokens.to_array((attrs.LEMMA, attrs.POS, attrs.SHAPE, attrs.CLUSTER))
+    >>> tokens = nlp(u'An example sentence', tag=True, parse=True)
+    >>> from spacy.en import attrs
+    >>> feats = tokens.to_array((attrs.LEMMA, attrs.POS, attrs.SHAPE, attrs.CLUSTER))
+    >>> for lemma, pos, shape, cluster in feats:
+    ...     print nlp.strings[lemma], nlp.tagger.tags[pos], nlp.strings[shape], cluster
 
 spaCy also makes it easy to add in-line mark up. Let's say you want to mark all
 adverbs in red:
 
     >>> from spacy.defs import ADVERB
     >>> color = lambda t: u'\033[91m' % t if t.pos == ADVERB else u'%s'
-    >>> print u''.join(color(t) + unicode(t) for t in tokens)
+    >>> print u''.join(color(token) + unicode(token) for t in tokens)
 
-Tokens.__iter__ produces a sequence of Token objects. The Token.__unicode__
-method --- invoked by unicode(t) --- pads each token with any whitespace that
-followed it. So, u''.join(unicode(t) for t in tokens) is guaranteed to restore
-the original string.
+Easy. The trick here is that the Token objects know to pad themselves with
+whitespace when you ask for their unicode representation, so you can always get
+back the original string.
 
 spaCy is also very efficient --- much more efficient than any other language
 processing tools available. The table below compares the time to tokenize, POS
@@ -61,6 +62,12 @@ and what you're competing to do is write papers --- so it's very hard to write
 software useful to non-academics. Seeing this gap, I resigned from my post-doc,
 and wrote spaCy.
 
+spaCy is dual-licensed: you can either use it under the GPL, or pay a one-time
+fee of $5000 for a commercial license. I think this is excellent value:
+you'll find NLTK etc much more expensive, because what you save on license
+cost, you'll lose many times over in lost productivity. $5000 does not buy you
+much developer time.
+
 .. toctree::
     :hidden:
     :maxdepth: 3
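
Review note on the adverb-colouring example: as written in the patch, the new `print` line refers to `token` while the loop variable is `t`, and the `color` lambda returns the literal string `u'%s'` for non-adverbs, so the snippet does not appear to run as shown. The sketch below illustrates the intended idea (pad each token with its trailing whitespace so joining restores the input, and wrap adverbs in ANSI red) without depending on spaCy. The `Token` class, its `whitespace` and `is_adverb` fields, and the `mark_adverbs` helper are hypothetical stand-ins invented for this illustration, not the library's API.

```python
from dataclasses import dataclass

RED = "\033[91m"   # ANSI escape: bright red foreground
RESET = "\033[0m"  # ANSI escape: restore default colours


@dataclass
class Token:
    """Hypothetical stand-in for a spaCy token; not the library's class."""
    text: str
    whitespace: str          # whitespace that followed the token in the input
    is_adverb: bool = False  # assumed flag; the patch tests t.pos == ADVERB

    def __str__(self) -> str:
        # Pad with the trailing whitespace so joined tokens rebuild the input.
        return self.text + self.whitespace


def mark_adverbs(tokens):
    """Join the tokens, wrapping each adverb's text in red."""
    parts = []
    for token in tokens:
        if token.is_adverb:
            # Colour only the text; keep the whitespace outside the escape codes.
            parts.append(RED + token.text + RESET + token.whitespace)
        else:
            parts.append(str(token))
    return "".join(parts)


tokens = [
    Token("She", " "),
    Token("ran", " "),
    Token("quickly", "", is_adverb=True),
    Token(".", ""),
]
original = "".join(str(t) for t in tokens)  # round-trips to the input string
marked = mark_adverbs(tokens)
```

Printing `marked` in an ANSI-capable terminal should show the adverb in red; stripping the two escape sequences from `marked` gives back `original`, which is the round-trip property the patch's prose describes.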