diff --git a/README.rst b/README.rst
index b6bee922b..3ff65ced7 100644
--- a/README.rst
+++ b/README.rst
@@ -286,372 +286,57 @@ and ``--model`` are optional and enable additional tests:
 
     python -m pytest --vectors --model --slow
 
-Changelog
-=========
-
-2017-01-16 `v1.6.0 `_: *Improvements to tokenizer and tests*
------------------------------------------------------------------------------------------------------------
-
-**✨ Major features and improvements**
-
-* Updated token exception handling mechanism to allow the usage of arbitrary functions as token exception matchers.
-* Improve how tokenizer exceptions for English contractions and punctuations are generated.
-* Update language data for Hungarian and Swedish tokenization.
-* Update to use `Thinc v6 `_ to prepare for `spaCy v2.0 `_.
-
-**🔴 Bug fixes**
-
-* Fix issue `#326 `_: Tokenizer is now more consistent and handles abbreviations correctly.
-* Fix issue `#344 `_: Tokenizer now handles URLs correctly.
-* Fix issue `#483 `_: Period after two or more uppercase letters is split off in tokenizer exceptions.
-* Fix issue `#631 `_: Add ``richcmp`` method to ``Token``.
-* Fix issue `#718 `_: Contractions with ``She`` are now handled correctly.
-* Fix issue `#736 `_: Times are now tokenized with correct string values.
-* Fix issue `#743 `_: ``Token`` is now hashable.
-* Fix issue `#744 `_: ``were`` and ``Were`` are now excluded correctly from contractions.
-
-**📋 Tests**
-
-* Modernise and reorganise all tests and remove model dependencies where possible.
-* Improve test speed to ~20s for basic tests (from previously >80s) and ~100s including models (from previously >200s).
-* Add fixtures for spaCy components and test utilities, e.g. to create ``Doc`` object manually.
-* Add `documentation for tests `_ to explain conventions and organisation.
-
-**👥 Contributors**
-
-Thanks to `@oroszgy `_, `@magnusburton `_, `@guyrosin `_ and `@danielhers `_ for the pull requests!
-
-2016-12-27 `v1.5.0 `_: *Alpha support for Swedish and Hungarian*
------------------------------------------------------------------------------------------------------------------------
-
-**✨ Major features and improvements**
-
-* **NEW:** Alpha support for Swedish tokenization.
-* **NEW:** Alpha support for Hungarian tokenization.
-* Update language data for Spanish tokenization.
-* Speed up tokenization when no data is preloaded by caching the first 10,000 vocabulary items seen.
-
-**🔴 Bug fixes**
-
-* List the ``language_data`` package in the ``setup.py``.
-* Fix missing ``vec_path`` declaration that was failing if ``add_vectors`` was set.
-* Allow ``Vocab`` to load without ``serializer_freqs``.
-
-**📖 Documentation and examples**
-
-* **NEW:** `spaCy Jupyter notebooks `_ repo: ongoing collection of easy-to-run spaCy examples and tutorials.
-* Fix issue `#657 `_: Generalise dependency parsing `annotation specs `_ beyond English.
-* Fix various typos and inconsistencies.
-
-**👥 Contributors**
-
-Thanks to `@oroszgy `_, `@magnusburton `_, `@jmizgajski `_, `@aikramer2 `_, `@fnorf `_ and `@bhargavvader `_ for the pull requests!
-
-2016-12-18 `v1.4.0 `_: *Improved language data and alpha Dutch support*
-------------------------------------------------------------------------------------------------------------------------------
-
-**✨ Major features and improvements**
-
-* **NEW:** Alpha support for Dutch tokenization.
-* Reorganise and improve format for language data.
-* Add shared tag map, entity rules, emoticons and punctuation to language data.
-* Convert entity rules, morphological rules and lemmatization rules from JSON to Python.
-* Update language data for English, German, Spanish, French, Italian and Portuguese.
-
-**🔴 Bug fixes**
-
-* Fix issue `#649 `_: Update and reorganise stop lists.
-* Fix issue `#672 `_: Make ``token.ent_iob_`` return unicode.
-* Fix issue `#674 `_: Add missing lemmas for contracted forms of "be" to ``TOKENIZER_EXCEPTIONS``.
-* Fix issue `#683 `_ ``Morphology`` class now supplies tag map value for the special space tag if it's missing.
-* Fix issue `#684 `_: Ensure ``spacy.en.English()`` loads the Glove vector data if available. Previously was inconsistent with behaviour of ``spacy.load('en')``.
-* Fix issue `#685 `_: Expand ``TOKENIZER_EXCEPTIONS`` with unicode apostrophe (``’``).
-* Fix issue `#689 `_: Correct typo in ``STOP_WORDS``.
-* Fix issue `#691 `_: Add tokenizer exceptions for "gonna" and "Gonna".
-
-**⚠️ Backwards incompatibilities**
-
-No changes to the public, documented API, but the previously undocumented language data and model initialisation processes have been refactored and reorganised. If you were relying on the ``bin/init_model.py`` script, see the new `spaCy Developer Resources `_ repo. Code that references internals of the ``spacy.en`` or ``spacy.de`` packages should also be reviewed before updating to this version.
-
-**📖 Documentation and examples**
-
-* **NEW:** `"Adding languages" `_ workflow.
-* **NEW:** `"Part-of-speech tagging" `_ workflow.
-* **NEW:** `spaCy Developer Resources `_ repo – scripts, tools and resources for developing spaCy.
-* Fix various typos and inconsistencies.
-
-**👥 Contributors**
-
-Thanks to `@dafnevk `_, `@jvdzwaan `_, `@RvanNieuwpoort `_, `@wrvhage `_, `@jaspb `_, `@savvopoulos `_ and `@davedwards `_ for the pull requests!
-
-2016-12-03 `v1.3.0 `_: *Improve API consistency*
---------------------------------------------------------------------------------------------------------
-
-**✨ API improvements**
-
-* Add ``Span.sentiment`` attribute.
-* `#658 `_: Add ``Span.noun_chunks`` iterator (thanks `@pokey `_).
-* `#642 `_: Let ``--data-path`` be specified when running download.py scripts (thanks `@ExplodingCabbage `_).
-* `#638 `_: Add German stopwords (thanks `@souravsingh `_).
-* `#614 `_: Fix ``PhraseMatcher`` to work with new ``Matcher`` (thanks `@sadovnychyi `_).
-
-**🔴 Bug fixes**
-
-* Fix issue `#605 `_: ``accept`` argument to ``Matcher`` now rejects matches as expected.
-* Fix issue `#617 `_: ``Vocab.load()`` now works with string paths, as well as ``Path`` objects.
-* Fix issue `#639 `_: Stop words in ``Language`` class now used as expected.
-* Fix issues `#656 `_, `#624 `_: ``Tokenizer`` special-case rules now support arbitrary token attributes.
-
-
-**📖 Documentation and examples**
-
-* Add `"Customizing the tokenizer" `_ workflow.
-* Add `"Training the tagger, parser and entity recognizer" `_ workflow.
-* Add `"Entity recognition" `_ workflow.
-* Fix various typos and inconsistencies.
-
-**👥 Contributors**
-
-Thanks to `@pokey `_, `@ExplodingCabbage `_, `@souravsingh `_, `@sadovnychyi `_, `@manojsakhwar `_, `@TiagoMRodrigues `_, `@savkov `_, `@pspiegelhalter `_, `@chenb67 `_, `@kylepjohnson `_, `@YanhaoYang `_, `@tjrileywisc `_, `@dechov `_, `@wjt `_, `@jsmootiv `_ and `@blarghmatey `_ for the pull requests!
-
-2016-11-04 `v1.2.0 `_: *Alpha tokenizers for Chinese, French, Spanish, Italian and Portuguese*
-------------------------------------------------------------------------------------------------------------------------------------------------------
-
-**✨ Major features and improvements**
-
-* **NEW:** Support Chinese tokenization, via `Jieba `_.
-* **NEW:** Alpha support for French, Spanish, Italian and Portuguese tokenization.
-
-**🔴 Bug fixes**
-
-* Fix issue `#376 `_: POS tags for "and/or" are now correct.
-* Fix issue `#578 `_: ``--force`` argument on download command now operates correctly.
-* Fix issue `#595 `_: Lemmatization corrected for some base forms.
-* Fix issue `#588 `_: `Matcher` now rejects empty patterns.
-* Fix issue `#592 `_: Added exception rule for tokenization of "Ph.D."
-* Fix issue `#599 `_: Empty documents now considered tagged and parsed.
-* Fix issue `#600 `_: Add missing ``token.tag`` and ``token.tag_`` setters.
-* Fix issue `#596 `_: Added missing unicode import when compiling regexes that led to incorrect tokenization.
-* Fix issue `#587 `_: Resolved bug that caused ``Matcher`` to sometimes segfault.
-* Fix issue `#429 `_: Ensure missing entity types are added to the entity recognizer.
-
-2016-10-23 `v1.1.0 `_: *Bug fixes and adjustments*
-----------------------------------------------------------------------------------------------------------
-
-* Rename new ``pipeline`` keyword argument of ``spacy.load()`` to ``create_pipeline``.
-* Rename new ``vectors`` keyword argument of ``spacy.load()`` to ``add_vectors``.
-
-**🔴 Bug fixes**
-
-* Fix issue `#544 `_: Add ``vocab.resize_vectors()`` method, to support changing to vectors of different dimensionality.
-* Fix issue `#536 `_: Default probability was incorrect for OOV words.
-* Fix issue `#539 `_: Unspecified encoding when opening some JSON files.
-* Fix issue `#541 `_: GloVe vectors were being loaded incorrectly.
-* Fix issue `#522 `_: Similarities and vector norms were calculated incorrectly.
-* Fix issue `#461 `_: ``ent_iob`` attribute was incorrect after setting entities via ``doc.ents``
-* Fix issue `#459 `_: Deserialiser failed on empty doc
-* Fix issue `#514 `_: Serialization failed after adding a new entity label.
-
-2016-10-18 `v1.0.0 `_: *Support for deep learning workflows and entity-aware rule matcher*
---------------------------------------------------------------------------------------------------------------------------------------------------
-
-**✨ Major features and improvements**
-
-* **NEW:** `custom processing pipelines `_, to support deep learning workflows
-* **NEW:** `Rule matcher `_ now supports entity IDs and attributes
-* **NEW:** Official/documented `training APIs `_ and `GoldParse` class
-* Download and use GloVe vectors by default
-* Make it easier to load and unload word vectors
-* Improved rule matching functionality
-* Move basic data into the code, rather than the json files. This makes it simpler to use the tokenizer without the models installed, and makes adding new languages much easier.
-* Replace file-system strings with ``Path`` objects. You can now load resources over your network, or do similar trickery, by passing any object that supports the ``Path`` protocol.
-
-**⚠️ Backwards incompatibilities**
-
-* The data_dir keyword argument of ``Language.__init__`` (and its subclasses ``English.__init__`` and ``German.__init__``) has been renamed to ``path``.
-* Details of how the Language base-class and its sub-classes are loaded, and how defaults are accessed, have been heavily changed. If you have your own subclasses, you should review the changes.
-* The deprecated ``token.repvec`` name has been removed.
-* The ``.train()`` method of Tagger and Parser has been renamed to ``.update()``
-* The previously undocumented ``GoldParse`` class has a new ``__init__()`` method. The old method has been preserved in ``GoldParse.from_annot_tuples()``.
-* Previously undocumented details of the ``Parser`` class have changed.
-* The previously undocumented ``get_package`` and ``get_package_by_name`` helper functions have been moved into a new module, ``spacy.deprecated``, in case you still need them while you update.
-
-**🔴 Bug fixes**
-
-* Fix ``get_lang_class`` bug when GloVe vectors are used.
-* Fix Issue `#411 `_: ``doc.sents`` raised IndexError on empty string.
-* Fix Issue `#455 `_: Correct lemmatization logic
-* Fix Issue `#371 `_: Make ``Lexeme`` objects hashable
-* Fix Issue `#469 `_: Make ``noun_chunks`` detect root NPs
-
-**👥 Contributors**
-
-Thanks to `@daylen `_, `@RahulKulhari `_, `@stared `_, `@adamhadani `_, `@izeye `_ and `@crawfordcomeaux `_ for the pull requests!
-
-2016-05-10 `v0.101.0 `_: *Fixed German model*
-------------------------------------------------------------------------------------------------------
-
-* Fixed bug that prevented German parses from being deprojectivised.
-* Bug fixes to sentence boundary detection.
-* Add rich comparison methods to the Lexeme class.
-* Add missing ``Doc.has_vector`` and ``Span.has_vector`` properties.
-* Add missing ``Span.sent`` property.
-
-2016-05-05 `v0.100.7 `_: *German!*
-------------------------------------------------------------------------------------------
-
-spaCy finally supports another language, in addition to English. We're lucky
-to have Wolfgang Seeker on the team, and the new German model is just the
-beginning. Now that there are multiple languages, you should consider loading
-spaCy via the ``load()`` function. This function also makes it easier to load extra
-word vector data for English:
-
-.. code:: python
-
-    import spacy
-    en_nlp = spacy.load('en', vectors='en_glove_cc_300_1m_vectors')
-    de_nlp = spacy.load('de')
-
-To support use of the load function, there are also two new helper functions:
-``spacy.get_lang_class`` and ``spacy.set_lang_class``. Once the German model is
-loaded, you can use it just like the English model:
-
-.. code:: python
-
-    doc = nlp(u'''Wikipedia ist ein Projekt zum Aufbau einer Enzyklopädie aus freien Inhalten, zu dem du mit deinem Wissen beitragen kannst. Seit Mai 2001 sind 1.936.257 Artikel in deutscher Sprache entstanden.''')
-
-    for sent in doc.sents:
-        print(sent.root.text, sent.root.n_lefts, sent.root.n_rights)
-
-    # (u'ist', 1, 2)
-    # (u'sind', 1, 3)
-
-The German model provides tokenization, POS tagging, sentence boundary detection,
-syntactic dependency parsing, recognition of organisation, location and person
-entities, and word vector representations trained on a mix of open subtitles and
-Wikipedia data. It doesn't yet provide lemmatisation or morphological analysis,
-and it doesn't yet recognise numeric entities such as numbers and dates.
-
-**Bugfixes**
-
-* spaCy < 0.100.7 had a bug in the semantics of the ``Token.__str__`` and ``Token.__unicode__`` built-ins: they included a trailing space.
-* Improve handling of "infixed" hyphens. Previously the tokenizer struggled with multiple hyphens, such as "well-to-do".
-* Improve handling of periods after mixed-case tokens
-* Improve lemmatization for English special-case tokens
-* Fix bug that allowed spaces to be treated as heads in the syntactic parse
-* Fix bug that led to inconsistent sentence boundaries before and after serialisation.
-* Fix bug from deserialising untagged documents.
-
-2016-03-08 `v0.100.6 `_: *Add support for GloVe vectors*
------------------------------------------------------------------------------------------------------------------
-
-This release offers improved support for replacing the word vectors used by spaCy.
-To install Stanford's GloVe vectors, trained on the Common Crawl, just run:
-
-.. code:: bash
-
-    sputnik --name spacy install en_glove_cc_300_1m_vectors
-
-To reduce memory usage and loading time, we've trimmed the vocabulary down to 1m entries.
-
-This release also integrates all the code necessary for German parsing. A German model
-will be released shortly. To assist in multi-lingual processing, we've added a ``load()``
-function. To load the English model with the GloVe vectors:
-
-.. code:: python
-
-    spacy.load('en', vectors='en_glove_cc_300_1m_vectors')
-
-2016-02-07 `v0.100.5 `_
---------------------------------------------------------------------------------
-
-Fix incorrect use of header file, caused from problem with thinc
-
-2016-02-07 `v0.100.4 `_: *Fix OSX problem introduced in 0.100.3*
-------------------------------------------------------------------------------------------------------------------------
-
-Small correction to right_edge calculation
-
-2016-02-06 `v0.100.3 `_
---------------------------------------------------------------------------------
-
-Support multi-threading, via the ``.pipe`` method. spaCy now releases the GIL around the
-parser and entity recognizer, so systems that support OpenMP should be able to do
-shared memory parallelism at close to full efficiency.
-
-We've also greatly reduced loading time, and fixed a number of bugs.
-
-2016-01-21 `v0.100.2 `_
---------------------------------------------------------------------------------
-
-Fix data version lock that affected v0.100.1
-
-2016-01-21 `v0.100.1 `_: *Fix install for OSX*
-------------------------------------------------------------------------------------------------------
-
-v0.100 included header files built on Linux that caused installation to fail on OSX.
-This should now be corrected. We also update the default data distribution, to
-include a small fix to the tokenizer.
-
-2016-01-19 `v0.100 `_: *Revise setup.py, better model downloads, bug fixes*
----------------------------------------------------------------------------------------------------------------------------------
-
-* Redo setup.py, and remove ugly headers_workaround hack. Should result in fewer install problems.
-* Update data downloading and installation functionality, by migrating to the Sputnik data-package manager. This will allow us to offer finer grained control of data installation in future.
-* Fix bug when using custom entity types in ``Matcher``. This should work by default when using the
-  ``English.__call__`` method of running the pipeline. If invoking ``Parser.__call__`` directly to do NER,
-  you should call the ``Parser.add_label()`` method to register your entity type.
-* Fix head-finding rules in ``Span``.
-* Fix problem that caused ``doc.merge()`` to sometimes hang
-* Fix problems in handling of whitespace
-
-2015-11-08 `v0.99 `_: *Improve span merging, internal refactoring*
------------------------------------------------------------------------------------------------------------------------
-
-* Merging multi-word tokens into one, via the ``doc.merge()`` and ``span.merge()`` methods, no longer invalidates existing ``Span`` objects. This makes it much easier to merge multiple spans, e.g. to merge all named entities, or all base noun phrases. Thanks to @andreasgrv for help on this patch.
-* Lots of internal refactoring, especially around the machine learning module, thinc. The thinc API has now been improved, and the spacy._ml wrapper module is no longer necessary.
-* The lemmatizer now lower-cases non-noun, noun-verb and non-adjective words.
-* A new attribute, ``.rank``, is added to Token and Lexeme objects, giving the frequency rank of the word.
-
-2015-11-03 `v0.98 `_: *Smaller package, bug fixes*
---------------------------------------------------------------------------------------------------------
-
-* Remove binary data from PyPi package.
-* Delete archive after downloading data
-* Use updated cymem, preshed and thinc packages
-* Fix information loss in deserialize
-* Fix ``__str__`` methods for Python2
-
-2015-10-23 `v0.97 `_: *Load the StringStore from a json list, instead of a text file*
-------------------------------------------------------------------------------------------------------------------------------------------
-
-* Fix bugs in download.py
-* Require ``--force`` to over-write the data directory in download.py
-* Fix bugs in ``Matcher`` and ``doc.merge()``
-
-2015-10-19 `v0.96 `_: *Hotfix to .merge method*
-----------------------------------------------------------------------------------------------------
-
-* Fix bug that caused text to be lost after ``.merge``
-* Fix bug in Matcher when matched entities overlapped
-
-2015-10-18 `v0.95 `_: *Bugfixes*
--------------------------------------------------------------------------------------
-
-* Reform encoding of symbols
-* Fix bugs in ``Matcher``
-* Fix bugs in ``Span``
-* Add tokenizer rule to fix numeric range tokenization
-* Add specific string-length cap in Tokenizer
-* Fix ``token.conjuncts``
-
-2015-10-09 `v0.94 `_
--------------------------------------------------------------------------
-
-* Fix memory error that caused crashes on 32bit platforms
-* Fix parse errors caused by smart quotes and em-dashes
-
-2015-09-22 `v0.93 `_
--------------------------------------------------------------------------
-
-Bug fixes to word vectors
+🛠 Changelog
+===========
+
+=========== ============== ===========
+Version     Date           Description
+=========== ============== ===========
+`v1.6.0`_   ``2017-01-16`` Improvements to tokenizer and tests
+`v1.5.0`_   ``2016-12-27`` Alpha support for Swedish and Hungarian
+`v1.4.0`_   ``2016-12-18`` Improved language data and alpha Dutch support
+`v1.3.0`_   ``2016-12-03`` Improve API consistency
+`v1.2.0`_   ``2016-11-04`` Alpha tokenizers for Chinese, French, Spanish, Italian and Portuguese
+`v1.1.0`_   ``2016-10-23`` Bug fixes and adjustments
+`v1.0.0`_   ``2016-10-18`` Support for deep learning workflows and entity-aware rule matcher
+`v0.101.0`_ ``2016-05-10`` Fixed German model
+`v0.100.7`_ ``2016-05-05`` German support
+`v0.100.6`_ ``2016-03-08`` Add support for GloVe vectors
+`v0.100.5`_ ``2016-02-07`` Fix incorrect use of header file
+`v0.100.4`_ ``2016-02-07`` Fix OSX problem introduced in 0.100.3
+`v0.100.3`_ ``2016-02-06`` Multi-threading, faster loading and bugfixes
+`v0.100.2`_ ``2016-01-21`` Fix data version lock
+`v0.100.1`_ ``2016-01-21`` Fix install for OSX
+`v0.100`_   ``2016-01-19`` Revise setup.py, better model downloads, bug fixes
+`v0.99`_    ``2015-11-08`` Improve span merging, internal refactoring
+`v0.98`_    ``2015-11-03`` Smaller package, bug fixes
+`v0.97`_    ``2015-10-23`` Load the StringStore from a json list, instead of a text file
+`v0.96`_    ``2015-10-19`` Hotfix to .merge method
+`v0.95`_    ``2015-10-18`` Bug fixes
+`v0.94`_    ``2015-10-09`` Fix memory and parse errors
+`v0.93`_    ``2015-09-22`` Bug fixes to word vectors
+=========== ============== ===========
+
+.. _v1.6.0: https://github.com/explosion/spaCy/releases/tag/v1.6.0
+.. _v1.5.0: https://github.com/explosion/spaCy/releases/tag/v1.5.0
+.. _v1.4.0: https://github.com/explosion/spaCy/releases/tag/v1.4.0
+.. _v1.3.0: https://github.com/explosion/spaCy/releases/tag/v1.3.0
+.. _v1.2.0: https://github.com/explosion/spaCy/releases/tag/v1.2.0
+.. _v1.1.0: https://github.com/explosion/spaCy/releases/tag/v1.1.0
+.. _v1.0.0: https://github.com/explosion/spaCy/releases/tag/v1.0.0
+.. _v0.101.0: https://github.com/explosion/spaCy/releases/tag/0.101.0
+.. _v0.100.7: https://github.com/explosion/spaCy/releases/tag/0.100.7
+.. _v0.100.6: https://github.com/explosion/spaCy/releases/tag/0.100.6
+.. _v0.100.5: https://github.com/explosion/spaCy/releases/tag/0.100.5
+.. _v0.100.4: https://github.com/explosion/spaCy/releases/tag/0.100.4
+.. _v0.100.3: https://github.com/explosion/spaCy/releases/tag/0.100.3
+.. _v0.100.2: https://github.com/explosion/spaCy/releases/tag/0.100.2
+.. _v0.100.1: https://github.com/explosion/spaCy/releases/tag/0.100.1
+.. _v0.100: https://github.com/explosion/spaCy/releases/tag/0.100
+.. _v0.99: https://github.com/explosion/spaCy/releases/tag/0.99
+.. _v0.98: https://github.com/explosion/spaCy/releases/tag/0.98
+.. _v0.97: https://github.com/explosion/spaCy/releases/tag/0.97
+.. _v0.96: https://github.com/explosion/spaCy/releases/tag/0.96
+.. _v0.95: https://github.com/explosion/spaCy/releases/tag/0.95
+.. _v0.94: https://github.com/explosion/spaCy/releases/tag/0.94
+.. _v0.93: https://github.com/explosion/spaCy/releases/tag/0.93