💫 Industrial-strength Natural Language Processing (NLP) in Python
Wolfgang Seeker 5e2e8e951a add baseclass DocIterator for iterators over documents
add classes for English and German noun chunks

the respective iterators are set for the document when created by the parser
as they depend on the annotation scheme of the parsing model
2016-03-16 15:53:35 +01:00
bin introduce lang field for LexemeC to hold language id 2016-03-10 13:01:34 +01:00
contributors Add contributor. 2015-10-07 17:55:46 -07:00
corpora/en * Add wordnet 2015-09-21 19:06:48 +10:00
examples move displacy to its own subdomain 2016-02-19 14:03:52 +01:00
include * Add header files to repo, to prevent cross-compilation problems 2016-02-06 22:57:11 +01:00
lang_data add tokenizer files for German, add/change code to train German pos tagger 2016-02-18 13:24:20 +01:00
spacy add baseclass DocIterator for iterators over documents 2016-03-16 15:53:35 +01:00
website move displacy to its own subdomain 2016-02-19 14:03:52 +01:00
.gitignore Added Windows file to .gitignore 2015-10-13 10:58:30 +03:00
.travis.yml Update .travis.yml 2016-02-09 19:34:24 +01:00
LICENSE.txt * Change from AGPL to MIT 2015-09-28 07:37:12 +10:00
MANIFEST.in fix windows readme 2015-12-21 21:58:53 +01:00
README-MSVC.txt fix windows readme 2015-12-21 21:58:53 +01:00
README.md Update README.md 2016-02-19 19:36:47 +01:00
bootstrap_python_env.sh * Add bootstrap script 2015-03-16 14:01:36 -04:00
buildbot.json add run section to buildbot.json 2016-02-26 23:04:33 +01:00
fabfile.py Merge branch 'master' of https://github.com/honnibal/spaCy 2015-12-28 18:03:06 +01:00
package.json Update package.json 2016-02-14 20:19:26 +01:00
requirements.txt upgrade to latest sputnik 2016-03-08 15:30:17 +01:00
setup.py add baseclass DocIterator for iterators over documents 2016-03-16 15:53:35 +01:00
tox.ini refactor setup.py 2015-12-13 23:32:23 +01:00
wordnet_license.txt * Add WordNet license file 2015-02-01 16:11:53 +11:00

README.md


spaCy: Industrial-strength NLP

spaCy is a library for advanced natural language processing in Python and Cython.

Documentation and details: https://spacy.io/

spaCy is built on the very latest research, but it isn't researchware. It was designed from day 1 to be used in real products. It's commercial open-source software, released under the MIT license.

Features

  • Labelled dependency parsing (91.8% accuracy on OntoNotes 5)
  • Named entity recognition (82.6% accuracy on OntoNotes 5)
  • Part-of-speech tagging (97.1% accuracy on OntoNotes 5)
  • Easy to use word vectors
  • All strings mapped to integer IDs
  • Export to numpy data arrays
  • Alignment maintained to original string, ensuring easy mark-up calculation
  • Range of easy-to-use orthographic features
  • No pre-processing required. spaCy takes raw text as input, warts and newlines and all.
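
Two of the features above, integer IDs for strings and alignment to the original text, can be sketched in plain Python. This is a minimal illustration of the idea only, not spaCy's implementation or API: `StringStore`, `tokenize`, and `mark_up` are hypothetical names invented for this example, and the regex tokenizer is a deliberately crude stand-in.

```python
import re


class StringStore:
    """Hypothetical helper: maps each distinct string to a stable integer ID."""

    def __init__(self):
        self._ids = {}
        self._strings = []

    def add(self, text):
        # Reuse the existing ID if the string was seen before.
        if text not in self._ids:
            self._ids[text] = len(self._strings)
            self._strings.append(text)
        return self._ids[text]

    def lookup(self, string_id):
        return self._strings[string_id]


def tokenize(text, store):
    """Return (string_id, start, end) triples with offsets into the raw text."""
    # Crude assumption: a token is a run of word characters or one punctuation mark.
    return [(store.add(m.group()), m.start(), m.end())
            for m in re.finditer(r"\w+|[^\w\s]", text)]


def mark_up(text, tokens, target_id, open_tag="<b>", close_tag="</b>"):
    """Wrap every token whose ID equals target_id, keeping all original whitespace."""
    pieces = []
    cursor = 0
    for string_id, start, end in tokens:
        pieces.append(text[cursor:start])  # untouched text between tokens
        span = text[start:end]
        pieces.append(open_tag + span + close_tag if string_id == target_id else span)
        cursor = end
    pieces.append(text[cursor:])  # trailing text after the last token
    return "".join(pieces)


store = StringStore()
raw = "Hello,  world!\nHello again."  # raw input, double space and newline included
tokens = tokenize(raw, store)
hello_id = store.add("Hello")
marked = mark_up(raw, tokens, hello_id)
print(marked)
```

Because each token keeps its character offsets, the marked-up output preserves the double space and the newline of the raw input, and comparing tokens is an integer comparison rather than a string comparison.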

Top Performance

  • Fastest in the world: <50ms per document. No faster system has ever been announced.
  • Accuracy within 1% of the current state of the art on all tasks performed (parsing, named entity recognition, part-of-speech tagging). The only more accurate systems are an order of magnitude slower or more.

Supports

  • CPython 2.6, 2.7, 3.3, 3.4, 3.5 (64-bit only)
  • OS X
  • Linux
  • Windows (Cygwin, MinGW, Visual Studio)

Difficult to support:

  • PyPy