mirror of https://github.com/explosion/spaCy.git

Minor copyediting
commit 1b79d947b9 (parent 7bddd15e27)
@@ -8,12 +8,12 @@ python:
 - "2.7"
 - "3.4"

-# command to install dependencies
+# install dependencies
 install:
 - "pip install --upgrade setuptools"
 - "pip install -r requirements.txt"
 - "export PYTHONPATH=`pwd`"
 - "python setup.py build_ext --inplace"
-# command to run tests
+# run tests
 script:
 - py.test tests/

@@ -3,20 +3,18 @@ spaCy

 http://honnibal.github.io/spaCy

-Fast, state-of-the-art natural language processing pipeline. Commercial licenses available, or use under AGPL.
+A pipeline for fast, state-of-the-art natural language processing. Commercial licenses available, otherwise under AGPL.

 Version 0.80 released
 ---------------------

 2015-04-13

-* Preliminary named entity recognition support. Accuracy is currently
-  substantially behind the current state-of-the-art. I'm working on
-  improvements.
+* Preliminary support for named-entity recognition. Its accuracy is substantially behind the state-of-the-art. I'm working on improvements.

 * Better sentence boundary detection, drawn from the syntactic structure.

-* Lots of bug fixes
+* Lots of bug fixes.

 Supports:
@@ -28,14 +28,14 @@ can access an excellent set of pre-computed orthographic and distributional feat
 >>> are.check_flag(en.CAN_NOUN)
 False

-spaCy makes it easy to write very efficient NLP applications, because your feature
+spaCy makes it easy to write efficient NLP applications, because your feature
 functions have to do almost no work: almost every lexical property you'll want
 is pre-computed for you. See the tutorial for an example POS tagger.

 Benchmark
 ---------

-The tokenizer itself is also very efficient:
+The tokenizer itself is also efficient:

 +--------+-------+--------------+--------------+
 | System | Time | Words/second | Speed Factor |
@@ -56,7 +56,7 @@ Pros:

 - All tokens come with indices into the original string
 - Full unicode support
-- Extensible to other languages
+- Extendable to other languages
 - Batch operations computed efficiently in Cython
 - Cython API
 - numpy interoperability
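
As a rough illustration of the first and last bullets in that list (character indices back into the original string, and numpy interoperability), here is a minimal sketch written against a current spaCy release rather than the 0.80 API this diff documents; the model name and attribute names below are assumptions, check your installed version:

    # Sketch: character offsets and numpy interop in a modern spaCy release,
    # not the 0.80-era API that this README described.
    import spacy
    from spacy import attrs

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("spaCy tokens remember where they came from.")

    for token in doc:
        # token.idx is the character offset of the token in the original
        # string, so the source text can always be recovered by slicing.
        assert doc.text[token.idx : token.idx + len(token.text)] == token.text

    # Lexical attributes export to a numpy array in one call.
    array = doc.to_array([attrs.LOWER, attrs.IS_ALPHA])
    print(array.shape)  # (number of tokens, 2)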
@@ -135,7 +135,7 @@ lexical types.

 In a sample of text, vocabulary size grows exponentially slower than word
 count. So any computations we can perform over the vocabulary and apply to the
-word count are very efficient.
+word count are efficient.


 Part-of-speech Tagger
@@ -37,7 +37,7 @@ tokenizer is suitable for production use.

 I used to think that the NLP community just needed to do more to communicate
 its findings to software engineers. So I wrote two blog posts, explaining
-`how to write a part-of-speech tagger`_ and `parser`_. Both were very well received,
+`how to write a part-of-speech tagger`_ and `parser`_. Both were well received,
 and there's been a bit of interest in `my research software`_ --- even though
 it's entirely undocumented, and mostly unuseable to anyone but me.

@@ -202,7 +202,7 @@ this:

 We wanted to refine the logic so that only adverbs modifying evocative verbs
 of communication, like "pleaded", were highlighted. We've now built a vector that
-represents that type of word, so now we can highlight adverbs based on very
+represents that type of word, so now we can highlight adverbs based on
 subtle logic, honing in on adverbs that seem the most stylistically
 problematic, given our starting assumptions:

@@ -35,7 +35,7 @@ And if you're ever in acquisition or IPO talks, the story is simple.
 spaCy can also be used as free open-source software, under the Aferro GPL
 license. If you use it this way, you must comply with the AGPL license terms.
 When you distribute your project, or offer it as a network service, you must
-distribute the source-code, and grant users an AGPL license to it.
+distribute the source-code and grant users an AGPL license to it.


 .. I left academia in June 2014, just when I should have been submitting my first
@@ -7,8 +7,8 @@ Updates
 Five days ago I presented the alpha release of spaCy, a natural language
 processing library that brings state-of-the-art technology to small companies.

-spaCy has been very well received, and there are now a lot of eyes on the project.
-Naturally, lots of issues have surfaced. I'm very grateful to those who've reported
+spaCy has been well received, and there are now a lot of eyes on the project.
+Naturally, lots of issues have surfaced. I'm grateful to those who've reported
 them. I've worked hard to address them as quickly as I could.

 Bug Fixes
@@ -26,7 +26,7 @@ Bug Fixes
 just store an index into that list, instead of a hash.

 * Parse tree navigation API was rough, and buggy.
-The parse-tree navigation API was the last thing I added before v0.3. I've
+The parse-tree navigation API was the last thing I added before v0.3. I've
 now replaced it with something better. The previous API design was flawed,
 and the implementation was buggy --- Token.child() and Token.head were
 sometimes inconsistent.
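
For context on the API mentioned in that bullet, here is a minimal sketch of head/child navigation as it looks in later spaCy releases; the head, children, and dep_ names below come from modern spaCy, not the 0.3-era Token.child()/Token.head API the post describes:

    # Sketch of parse-tree navigation in a modern spaCy release,
    # not the 0.3-era API discussed above.
    import spacy

    nlp = spacy.load("en_core_web_sm")
    doc = nlp("She pleaded desperately with the court.")

    for token in doc:
        # Every token points at its syntactic head; the root points at itself.
        print(token.text, token.dep_, "<-", token.head.text)

    # Walking upward from any token terminates at the sentence root.
    token = doc[2]  # "desperately"
    while token.head is not token:
        token = token.head
    print("root:", token.text)

    # children is the inverse relation: the tokens whose head is this token.
    print("children of root:", [child.text for child in token.children])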
@@ -108,9 +108,9 @@ input to be segmented into sentences, but with no sentence segmenter. This
 caused a drop in parse accuracy of 4%!

 Over the last five days, I've worked hard to correct this. I implemented the
-modifications to the parsing algorithm I had planned, from Dongdong Zhang et al
+modifications to the parsing algorithm I had planned, from Dongdong Zhang et al.
 (2013), and trained and evaluated the parser on raw text, using the version of
-the WSJ distributed by Read et al (2012), and used in Dridan and Oepen's
+the WSJ distributed by Read et al. (2012), and used in Dridan and Oepen's
 experiments.

 I'm pleased to say that on the WSJ at least, spaCy 0.4 performs almost exactly