* Add post introducing spaCy

2015-08-13 15:49:33 +02:00 · 005074c31e
parent 2f50288813
commit 005074c31e
1 changed files with 93 additions and 0 deletions
--- a/docs/redesign/blog_intro.jade
+++ b/docs/redesign/blog_intro.jade
@@ -0,0 +1,93 @@
+-
+  var urls = {
+    'pos_post': 'https://honnibal.wordpress.com/2013/09/11/a-good-part-of-speechpos-tagger-in-about-200-lines-of-python/',
+    'google_ngrams': "http://googleresearch.blogspot.com.au/2013/05/syntactic-ngrams-over-time.html",
+    'implementation': 'https://gist.github.com/syllog1sm/10343947',
+    'redshift': 'http://github.com/syllog1sm/redshift',
+    'tasker': 'https://play.google.com/store/apps/details?id=net.dinglisch.android.taskerm',
+    'acl_anthology': 'http://aclweb.org/anthology/',
+    'share_twitter': 'http://twitter.com/share?text=[ARTICLE HEADLINE]&url=[ARTICLE LINK]&via=honnibal'
+    }
+
+
+- var my_research_software = '<a href="https://github.com/syllog1sm/redshift/tree/develop">my research software</a>'
+
+- var how_to_write_a_POS_tagger = '<a href="https://honnibal.wordpress.com/2013/09/11/a-good-part-of-speechpos-tagger-in-about-200-lines-of-python/">how to write a part-of-speech tagger</a>'
+
+- var parser_lnk = '<a href="https://honnibal.wordpress.com/2013/12/18/a-simple-fast-algorithm-for-natural-language-dependency-parsing/">parser</a>'
+
+- var buy_a_commercial_license = '<a href="license.html">buy a commercial license</a>'
+
+doctype html
+html(lang='en')
+  head
+    meta(charset='utf-8')
+    title spaCy Blog
+    meta(name='description', content='')
+    meta(name='author', content='Matthew Honnibal')
+    link(rel='stylesheet', href='css/style.css')
+    //if lt IE 9
+      script(src='http://html5shiv.googlecode.com/svn/trunk/html5.js')
+  body#blog
+    header(role='banner')
+      h1.logo spaCy Blog
+      .slogan Blog
+    main#content(role='main')
+      article.post
+        p.
+          <strong>spaCy</strong> is a new library for text processing in Python
+          and Cython. I wrote it because I think small companies are terrible at
+          natural language processing (NLP).  Or rather: small companies are using
+          terrible NLP technology.
+
+        p.
+          To do great NLP, you have to know a little about linguistics, a lot
+          about machine learning, and almost everything about the latest research.
+          The people who fit this description seldom join small companies.
+          Most are broke &ndash; they've just finished grad school.
+          If they don't want to stay in academia, they join Google, IBM, etc.
+
+        p.
+          The net result is that outside of the tech giants, commercial NLP has
+          changed little in the last ten years.  In academia, it's changed entirely.
+          Amazing improvements in quality.  Orders of magnitude faster.  But the
+          academic code is always GPL, undocumented, unusable, or all three.
+          You could implement the ideas yourself, but the papers are hard to read,
+          and training data is exorbitantly expensive.  So what are you left with?
+          A common answer is NLTK, which was written primarily as an educational resource.
+          Nothing past the tokenizer is suitable for production use.
+
+        p.
+          I used to think that the NLP community just needed to do more to communicate
+          its findings to software engineers.  So I wrote two blog posts, explaining
+          !{how_to_write_a_POS_tagger} and !{parser_lnk}.  Both were well
+          received, and there's been a bit of interest in !{my_research_software}
+          &ndash; even though it's entirely undocumented, and mostly unusable to
+          anyone but me.
+        p.
+          So six months ago I quit my post-doc, and I've been working day and night
+          on spaCy since.  I'm now pleased to announce an alpha release.
+      
+        p.
+          If you're a small company doing NLP, I think spaCy will seem like a minor
+          miracle.  It's by far the fastest NLP software ever released.  The
+          full processing pipeline completes in 20ms per document, including accurate
+          tagging and parsing.  All strings are mapped to integer IDs, tokens are
+          linked to embedded word representations, and a range of useful features
+          are pre-calculated and cached.
+
+        p.
+          If none of that made any sense to you, here's the gist of it.  Computers
+          don't understand text.  This is unfortunate, because that's what the
+          web almost entirely consists of.  We want to recommend people text based
+          on other text they liked.  We want to shorten text to display it on a
+          mobile screen.  We want to aggregate it, link it, filter it, categorise
+          it, generate it and correct it.
+
+        p.
+          spaCy provides a library of utility functions that help programmers
+          build such products.  It's commercial open source software: you can
+          either use it under the AGPL, or you can !{buy_a_commercial_license}
+          under generous terms.
+
+    footer(role='contentinfo')