extends ./template_post.jade

-
  var urls = {
    'pos_post': 'https://honnibal.wordpress.com/2013/09/11/a-good-part-of-speechpos-tagger-in-about-200-lines-of-python/',
    'google_ngrams': "http://googleresearch.blogspot.com.au/2013/05/syntactic-ngrams-over-time.html",
    'implementation': 'https://gist.github.com/syllog1sm/10343947',
    'redshift': 'http://github.com/syllog1sm/redshift',
    'tasker': 'https://play.google.com/store/apps/details?id=net.dinglisch.android.taskerm',
    'acl_anthology': 'http://aclweb.org/anthology/',
    'share_twitter': 'http://twitter.com/share?text=[ARTICLE HEADLINE]&url=[ARTICLE LINK]&via=honnibal'
  }

- var my_research_software = '<a href="https://github.com/syllog1sm/redshift/tree/develop">my research software</a>'

- var how_to_write_a_POS_tagger = '<a href="https://honnibal.wordpress.com/2013/09/11/a-good-part-of-speechpos-tagger-in-about-200-lines-of-python/">how to write a part-of-speech tagger</a>'

- var parser_lnk = '<a href="https://honnibal.wordpress.com/2013/12/18/a-simple-fast-algorithm-for-natural-language-dependency-parsing/">parser</a>'

- var buy_a_commercial_license = '<a href="license.html">buy a commercial license</a>'

block body_block
  article.post
    p.
      <strong>spaCy</strong> is a new library for text processing in Python
      and Cython. I wrote it because I think small companies are terrible at
      natural language processing (NLP). Or rather: small companies are using
      terrible NLP technology.

    p.
      To do great NLP, you have to know a little about linguistics, a lot
      about machine learning, and almost everything about the latest research.
      The people who fit this description seldom join small companies.
      Most are broke – they've just finished grad school.
      If they don't want to stay in academia, they join Google, IBM, etc.

    p.
      The net result is that outside of the tech giants, commercial NLP has
      changed little in the last ten years. In academia, it's changed entirely.
      Amazing improvements in quality. Orders of magnitude faster. But the
      academic code is always GPL, undocumented, unusable, or all three.
      You could implement the ideas yourself, but the papers are hard to read,
      and training data is exorbitantly expensive. So what are you left with?
      A common answer is NLTK, which was written primarily as an educational resource.
      Nothing past the tokenizer is suitable for production use.

    p.
      I used to think that the NLP community just needed to do more to communicate
      its findings to software engineers. So I wrote two blog posts, explaining
      !{how_to_write_a_POS_tagger} and !{parser_lnk}. Both were well
      received, and there's been a bit of interest in !{my_research_software}
      – even though it's entirely undocumented, and mostly unusable to
      anyone but me.

    p.
      So six months ago I quit my post-doc, and I've been working day and night
      on spaCy since. I'm now pleased to announce an alpha release.

    p.
      If you're a small company doing NLP, I think spaCy will seem like a minor
      miracle. It's by far the fastest NLP software ever released. The
      full processing pipeline completes in 20ms per document, including accurate
      tagging and parsing. All strings are mapped to integer IDs, tokens are
      linked to embedded word representations, and a range of useful features
      are pre-calculated and cached.

    p.
      If none of that made any sense to you, here's the gist of it. Computers
      don't understand text. This is unfortunate, because that's what the
      web almost entirely consists of. We want to recommend people text based
      on other text they liked. We want to shorten text to display it on a
      mobile screen. We want to aggregate it, link it, filter it, categorise
      it, generate it and correct it.

    p.
      spaCy provides a library of utility functions that help programmers
      build such products. It's commercial open source software: you can
      either use it under the AGPL, or you can !{buy_a_commercial_license}
      under generous terms.

    footer(role='contentinfo')