* Rework intro text

2015-01-25 00:58:52 +11:00 · 70d4a9dcc5
parent 83a7e91f3c
commit 70d4a9dcc5
1 changed files with 32 additions and 11 deletions
--- a/docs/source/index.rst
+++ b/docs/source/index.rst
@@ -8,21 +8,42 @@ spaCy: Industrial-strength NLP
 ==============================

 spaCy is a new library for text processing in Python and Cython.
+I wrote it because I think small companies are terrible at NLP.  Or rather:
+small companies are using terrible NLP technology.

-Most commercial NLP development is based on obsolete
-technology.  Over the last 3-5 years, the field has advanced dramatically, but
-only the tech giants have really been able to capitalize.  The research is all
-public, but it's been too hard for small companies to read and apply it.  
-Many end up relying on `NLTK`_, which is intended primarily as an educational
-resource.
+To do great NLP, you have to know a little about linguistics, a lot
+about machine learning, and almost everything about the latest research.
+The people who fit this description seldom join small companies, and almost
+never start them.  Most are broke --- they've just finished grad school.
+If they don't want to stay in academia, they join Google, IBM, etc.

-.. _NLTK: https://www.nltk.org/
+The net result is that outside of the tech giants, commercial NLP has changed
+little in the last ten years.  In academia, it's changed entirely.  Amazing
+improvements in quality. Orders of magnitude faster.  But the
+academic code is always GPL, undocumented, unusable, or all three.  You could
+implement the ideas yourself, but the papers are hard to read, and training
+data is exorbitantly expensive.  So what are you left with? NLTK?
+
+I used to think that the NLP community just needed to do more to communicate
+its findings to software engineers.  So I wrote two blog posts, explaining
+`how to write a part-of-speech tagger`_ and `parser`_.  Both were very well received,
+and there's been a bit of interest in `my research software`_ --- even though
+it's entirely undocumented, and mostly unusable to anyone but me.
+
+.. _`my research software`: https://github.com/syllog1sm/redshift/tree/develop
+
+.. _`how to write a part-of-speech tagger`: https://honnibal.wordpress.com/2013/09/11/a-good-part-of-speechpos-tagger-in-about-200-lines-of-python/
+
+.. _`parser`: https://honnibal.wordpress.com/2013/12/18/a-simple-fast-algorithm-for-natural-language-dependency-parsing/
+
+So six months ago I quit my post-doc, and I've been working day and night on
+spaCy since.  I'm now pleased to announce an alpha release.

 If you're a small company doing NLP, I think spaCy will seem like a minor miracle.
-It's by far the fastest NLP software available.  The full processing pipeline
-completes in 7ms per document, including accurate tagging and parsing.  All strings
-are mapped to integer IDs, tokens are linked to embedded word representations,
-and a range of useful features are pre-calculated and cached.
+The full processing pipeline completes in 7ms per document, including accurate
+tagging and parsing.  All strings are mapped to integer IDs, tokens are linked
+to embedded word representations, and a range of useful features are pre-calculated
+and cached.

 If none of that made any sense to you, here's the gist of it.  Computers don't
 understand text. This is unfortunate, because that's what the web almost entirely