diff --git a/.travis.yml b/.travis.yml
index 83c7da85f..6571f55bd 100644
--- a/.travis.yml
+++ b/.travis.yml
@@ -8,12 +8,12 @@ python:
   - "2.7"
   - "3.4"
-# command to install dependencies
+# install dependencies
 install:
   - "pip install --upgrade setuptools"
   - "pip install -r requirements.txt"
   - "export PYTHONPATH=`pwd`"
   - "python setup.py build_ext --inplace"

-# command to run tests
+# run tests
 script:
   - py.test tests/
diff --git a/README.md b/README.md
index 95afcb8ae..a72ccf2c6 100644
--- a/README.md
+++ b/README.md
@@ -3,20 +3,18 @@
 spaCy
 http://honnibal.github.io/spaCy

-Fast, state-of-the-art natural language processing pipeline. Commercial licenses available, or use under AGPL.
+A pipeline for fast, state-of-the-art natural language processing. Commercial licenses available, otherwise under AGPL.

 Version 0.80 released
 ---------------------

 2015-04-13

-* Preliminary named entity recognition support. Accuracy is currently
-  substantially behind the current state-of-the-art. I'm working on
-  improvements.
+* Preliminary support for named-entity recognition. Its accuracy is substantially behind the state-of-the-art. I'm working on improvements.

 * Better sentence boundary detection, drawn from the syntactic structure.

-* Lots of bug fixes
+* Lots of bug fixes.

 Supports:

diff --git a/docs/source/guide/overview.rst b/docs/source/guide/overview.rst
index dbcfebfd7..6faaaa67f 100644
--- a/docs/source/guide/overview.rst
+++ b/docs/source/guide/overview.rst
@@ -28,14 +28,14 @@ can access an excellent set of pre-computed orthographic and distributional feat
     >>> are.check_flag(en.CAN_NOUN)
     False

-spaCy makes it easy to write very efficient NLP applications, because your feature
+spaCy makes it easy to write efficient NLP applications, because your feature
 functions have to do almost no work: almost every lexical property you'll want
 is pre-computed for you. See the tutorial for an example POS tagger.

 Benchmark
 ---------

-The tokenizer itself is also very efficient:
+The tokenizer itself is also efficient:

 +--------+-------+--------------+--------------+
 | System | Time  | Words/second | Speed Factor |
@@ -56,7 +56,7 @@ Pros:

 - All tokens come with indices into the original string
 - Full unicode support
-- Extensible to other languages
+- Extendable to other languages
 - Batch operations computed efficiently in Cython
 - Cython API
 - numpy interoperability
diff --git a/docs/source/howworks.rst b/docs/source/howworks.rst
index 6f88db744..00d61d66d 100644
--- a/docs/source/howworks.rst
+++ b/docs/source/howworks.rst
@@ -135,7 +135,7 @@ lexical types.

 In a sample of text, vocabulary size grows exponentially slower
 than word count. So any computations we can perform over the
 vocabulary and apply to the
-word count are very efficient.
+word count are efficient.

 Part-of-speech Tagger
diff --git a/docs/source/index.rst b/docs/source/index.rst
index 60a66b2ae..08fbb8046 100644
--- a/docs/source/index.rst
+++ b/docs/source/index.rst
@@ -37,7 +37,7 @@
 tokenizer is suitable for production use.

 I used to think that the NLP community just needed to do more to communicate
 its findings to software engineers. So I wrote two blog posts, explaining
-`how to write a part-of-speech tagger`_ and `parser`_. Both were very well received,
+`how to write a part-of-speech tagger`_ and `parser`_. Both were well received,
 and there's been a bit of interest in `my research software`_ --- even though
 it's entirely undocumented, and mostly unuseable to anyone but me.
@@ -202,7 +202,7 @@ this:

 We wanted to refine the logic so that only adverbs modifying evocative verbs
 of communication, like "pleaded", were highlighted. We've now built a vector that
-represents that type of word, so now we can highlight adverbs based on very
+represents that type of word, so now we can highlight adverbs based on
 subtle logic, honing in on adverbs that seem the most stylistically
 problematic, given our starting assumptions:

diff --git a/docs/source/license.rst b/docs/source/license.rst
index 833b1aae7..5edf22095 100644
--- a/docs/source/license.rst
+++ b/docs/source/license.rst
@@ -35,7 +35,7 @@ And if you're ever in acquisition or IPO talks, the story is simple.

 spaCy can also be used as free open-source software, under the Aferro GPL
 license. If you use it this way, you must comply with the AGPL license terms.
 When you distribute your project, or offer it as a network service, you must
-distribute the source-code, and grant users an AGPL license to it.
+distribute the source-code and grant users an AGPL license to it.

 .. I left academia in June 2014, just when I should have been submitting my first
diff --git a/docs/source/updates.rst b/docs/source/updates.rst
index a526ee757..c796f31a5 100644
--- a/docs/source/updates.rst
+++ b/docs/source/updates.rst
@@ -7,8 +7,8 @@ Updates
 Five days ago I presented the alpha release of spaCy, a natural language
 processing library that brings state-of-the-art technology to small companies.
-spaCy has been very well received, and there are now a lot of eyes on the project.
-Naturally, lots of issues have surfaced. I'm very grateful to those who've reported
+spaCy has been well received, and there are now a lot of eyes on the project.
+Naturally, lots of issues have surfaced. I'm grateful to those who've reported
 them. I've worked hard to address them as quickly as I could.

 Bug Fixes
 ---------
@@ -26,7 +26,7 @@ Bug Fixes
   just store an index into that list, instead of a hash.

 * Parse tree navigation API was rough, and buggy.
-  The parse-tree navigation API was the last thing I added before v0.3.  I've
+  The parse-tree navigation API was the last thing I added before v0.3. I've
   now replaced it with something better. The previous API design was flawed,
   and the implementation was buggy --- Token.child() and Token.head were
   sometimes inconsistent.
@@ -108,9 +108,9 @@
 input to be segmented into sentences, but with no sentence segmenter. This
 caused a drop in parse accuracy of 4%!

 Over the last five days, I've worked hard to correct this. I implemented the
-modifications to the parsing algorithm I had planned, from Dongdong Zhang et al
+modifications to the parsing algorithm I had planned, from Dongdong Zhang et al.
 (2013), and trained and evaluated the parser on raw text, using the version of
-the WSJ distributed by Read et al (2012), and used in Dridan and Oepen's
+the WSJ distributed by Read et al. (2012), and used in Dridan and Oepen's
 experiments. I'm pleased to say that on the WSJ at least, spaCy 0.4 performs
 almost exactly
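The index.rst hunk at line 202 describes highlighting adverbs by comparing each adverb's head verb against a prototype vector averaged from example verbs of communication. A minimal sketch of that idea, written against the 0.x-era API these docs quote (spacy.en.English, Lexeme.repvec, Token.pos, Token.head); the verb list and the 0.6 similarity threshold are illustrative assumptions, not values taken from the docs::

    import numpy

    from spacy.en import English
    from spacy.parts_of_speech import ADV, VERB

    def cosine(a, b):
        # Cosine similarity between two word vectors.
        return numpy.dot(a, b) / (numpy.linalg.norm(a) * numpy.linalg.norm(b))

    nlp = English()

    # Prototype vector for evocative verbs of communication, like "pleaded":
    # the mean of a few hand-picked examples (an assumed, illustrative list).
    say_verbs = [u'pleaded', u'begged', u'confessed', u'whispered']
    say_vector = sum(nlp.vocab[verb].repvec for verb in say_verbs) / len(say_verbs)

    tokens = nlp(u"'Give it back,' he pleaded abjectly, 'it's mine.'")
    for tok in tokens:
        # Upper-case an adverb only when it modifies a verb whose vector lies
        # close to the prototype; 0.6 is an arbitrary illustrative threshold.
        if tok.pos == ADV and tok.head.pos == VERB \
           and cosine(tok.head.repvec, say_vector) >= 0.6:
            print(tok.orth_.upper())
        else:
            print(tok.orth_)

Later spaCy releases renamed repvec to vector, so this sketch is tied to the 0.x series described here.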