647 Commits

Author SHA1 Message Date
Mark Abraham a0ffa346c0 Fix broken link in docs 2020-03-13 14:07:26 +01:00
Renaud Richardet eccf6b1686 small typo in code sample 2020-03-09 14:49:11 +01:00
Adriane Boyd 0c31f03ec5 Update docs [ci skip] 2020-03-09 13:41:17 +01:00
Adriane Boyd 1139247532 Revert changes to token_match priority from #4374
* Revert changes to priority of `token_match` so that it has priority
over all other tokenizer patterns

* Add lookahead and potentially slow lookbehind back to the default URL
pattern

* Expand character classes in URL pattern to improve matching around
lookaheads and lookbehinds related to #4882

* Revert changes to Hungarian tokenizer

* Revert (xfail) several URL tests to their status before #4374

* Update `tokenizer.explain()` and docs accordingly
2020-03-09 12:09:41 +01:00
Ines Montani de11ea753a Merge branch 'master' into develop 2020-02-18 14:47:23 +01:00
Kabir Khan f6ed07b85c Use nlp.pipe in EntityRuler for phrase patterns in add_patterns (#4931)
* Fix ent_ids and labels properties when id attribute used in patterns

* use set for labels

* sort ent_ids for comparison in entity_ruler tests

* fixing entity_ruler ent_ids test

* add to set

* Run make_doc optimistically if using phrase matcher patterns.

* remove unused coveragerc I was testing with

* format

* Refactor EntityRuler.add_patterns to use nlp.pipe for phrase patterns. Improves speed substantially.

* Removing old add_patterns function

* Fixing spacing

* Make sure token_patterns are loaded as well; previously the generator was being emptied in from_disk
2020-02-16 18:17:47 +01:00
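The generator problem mentioned in the last bullet of this commit can be illustrated with a minimal sketch. This is not the EntityRuler code: `load_patterns` and the pattern dicts are hypothetical stand-ins, used only to show why a second pass over an already consumed generator yields nothing.

```python
def load_patterns():
    """Hypothetical loader that yields pattern dicts lazily."""
    yield {"label": "ORG", "pattern": "Explosion"}            # phrase pattern (string)
    yield {"label": "GPE", "pattern": [{"LOWER": "berlin"}]}  # token pattern (list)


# Pitfall: the first comprehension consumes the generator completely,
# so the second comprehension sees no items at all.
lazy = load_patterns()
phrase_lazy = [p for p in lazy if isinstance(p["pattern"], str)]
token_lazy = [p for p in lazy if isinstance(p["pattern"], list)]

# One possible fix: materialize the generator once, then filter the list twice.
patterns = list(load_patterns())
phrase_patterns = [p for p in patterns if isinstance(p["pattern"], str)]
token_patterns = [p for p in patterns if isinstance(p["pattern"], list)]
```

In the lazy version, `token_lazy` ends up empty even though a token pattern was yielded; in the materialized version both lists are populated.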
Julin S 479e81bafc fix link (#4977) 2020-02-10 20:31:26 -05:00
Ines Montani 9c08d9baa3 Remove old sections [ci skip] (closes #4961) 2020-02-03 13:10:46 +01:00
Preston Badeer b216ff43c9 Update vectors-similarity.md (#4889)
These links are broken on the website, due to quotes around the URLs.
2020-01-08 16:49:40 +01:00
Geoffrey Gordon Ashbrook 53929138d7 remove extra word typo (#4875)
"let you find you"
2020-01-06 12:37:42 +01:00
Ines Montani 400257a802 Update index.md [ci skip] 2020-01-04 01:52:18 +01:00
Ines Montani db55577c45 Drop Python 2.7 and 3.5 (#4828)
* Remove unicode declarations

* Remove Python 3.5 and 2.7 from CI

* Don't require pathlib

* Replace compat helpers

* Remove OrderedDict

* Use f-strings

* Set Cython compiler language level

* Fix typo

* Re-add OrderedDict for Table

* Update setup.cfg

* Revert CONTRIBUTING.md

* Revert lookups.md

* Revert top-level.md

* Small adjustments and docs [ci skip]
2019-12-22 01:53:56 +01:00
Ines Montani 158b98a3ef Merge branch 'master' into develop 2019-12-21 18:55:03 +01:00
Ines Montani 1b838d1313 Divide models into core and starters [ci skip] 2019-12-21 14:10:22 +01:00
Nicolai Bjerre Pedersen de5453cdcb Fix link to user hooks in docs (#4778)
* Fix link to user hooks in docs

* Update mr_bjerre.md

Mistake in contributor agreement

* Apparently hard to get it right (wrong name of sca)
2019-12-06 19:17:12 +01:00
Ines Montani cbacb0f1a4 Update shape docs and examples (resolves #4615) [ci skip] 2019-11-23 17:16:55 +01:00
Ines Montani 235fe6fe3b Auto-format [ci skip] 2019-11-20 13:14:58 +01:00
adrianeboyd 2c876eb672 Add tokenizer explain() debugging method (#4596)
* Expose tokenizer rules as a property

Expose the tokenizer rules property in the same way as the other core
properties. (The cache resetting is overkill, but consistent with
`from_bytes` for now.)

Add tests and update Tokenizer API docs.

* Update Hungarian punctuation to remove empty string

Update Hungarian punctuation definitions so that `_units` does not match
an empty string.

* Use _load_special_tokenization consistently

Use `_load_special_tokenization()` and have it handle `None` checks.

* Fix precedence of `token_match` vs. special cases

Remove `token_match` check from `_split_affixes()` so that special cases
have precedence over `token_match`. `token_match` is checked only before
infixes are split.

* Add `make_debug_doc()` to the Tokenizer

Add `make_debug_doc()` to the Tokenizer as a working implementation of
the pseudo-code in the docs.

Add a test (marked as slow) that checks that `nlp.tokenizer()` and
`nlp.tokenizer.make_debug_doc()` return the same non-whitespace tokens
for all languages that have `examples.sentences` that can be imported.

* Update tokenization usage docs

Update pseudo-code and algorithm description to correspond to
`nlp.tokenizer.make_debug_doc()` with example debugging usage.

Add more examples for customizing tokenizers while preserving the
existing defaults.

Minor edits / clarifications.

* Revert "Update Hungarian punctuation to remove empty string"

This reverts commit f0a577f7a5.

* Rework `make_debug_doc()` as `explain()`

Rework `make_debug_doc()` as `explain()`, which returns a list of
`(pattern_string, token_string)` tuples rather than a non-standard
`Doc`. Update docs and tests accordingly, leaving the visualization for
future work.

* Handle cases with bad tokenizer patterns

Detect when tokenizer patterns match empty prefixes and suffixes so that
`explain()` does not hang on bad patterns.

* Remove unused displacy image

* Add tokenizer.explain() to usage docs
2019-11-20 13:07:25 +01:00
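The idea behind `explain()` in this commit (return `(pattern_string, token_string)` tuples and refuse to loop forever on patterns that match an empty string) can be sketched in plain Python. This is a toy, not spaCy's tokenizer: the special-case table, the two regexes, the label strings and the choice to raise `ValueError` are all assumptions made for illustration.

```python
import re

# Hypothetical rules for this sketch only.
SPECIAL_CASES = {"don't": ["do", "n't"]}
PREFIX_RE = re.compile(r"^[\(\"]")
SUFFIX_RE = re.compile(r"[\)\"\.,!]$")


def explain(text, prefix_re=PREFIX_RE, suffix_re=SUFFIX_RE):
    """Return (pattern_string, token_string) tuples for a whitespace-split text."""
    result = []
    for chunk in text.split():
        prefixes, suffixes = [], []
        while chunk and chunk not in SPECIAL_CASES:
            match = prefix_re.search(chunk)
            if match:
                if match.end() == 0:
                    # An empty match would never shorten the chunk.
                    raise ValueError("prefix pattern matched an empty string")
                prefixes.append(("PREFIX", chunk[:match.end()]))
                chunk = chunk[match.end():]
                continue
            match = suffix_re.search(chunk)
            if match:
                if match.start() == len(chunk):
                    raise ValueError("suffix pattern matched an empty string")
                suffixes.append(("SUFFIX", chunk[match.start():]))
                chunk = chunk[:match.start()]
                continue
            break
        result.extend(prefixes)
        if chunk in SPECIAL_CASES:
            result.extend(("SPECIAL", piece) for piece in SPECIAL_CASES[chunk])
        elif chunk:
            result.append(("TOKEN", chunk))
        # Suffixes were stripped from the outside in, so reverse them.
        result.extend(reversed(suffixes))
    return result
```

Calling `explain("(don't!)")` on this toy shows which rule produced each piece, which is the kind of debugging output the commit describes.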
Ines Montani e8b9cee6fd Make example consistent with model (closes #4587) [ci skip] 2019-11-18 12:41:48 +01:00
Ines Montani e01a1a237f Auto-format [ci skip] 2019-11-18 12:41:31 +01:00
adrianeboyd 62e00fd9da Update tokenization usage docs (#4666)
Update pseudo-code and algorithm description to correspond to current
tokenizer behavior.

Add more examples for customizing tokenizers while preserving the
existing defaults.

Minor edits / clarifications.
2019-11-18 12:35:13 +01:00
Ines Montani 5adcb352e9 Adjust order of docs sections [ci skip] 2019-11-17 16:08:56 +01:00
Ines Montani e30d08410a Add CI for Python 3.8 (#4479)
* Add 3.8 classifier

* Update azure-pipelines.yml

* Remove 3.8 warning from docs [ci skip]
2019-11-15 01:13:48 +01:00
adrianeboyd faaa832518 Generalize handling of tokenizer special cases (#4259)
* Generalize handling of tokenizer special cases

Handle tokenizer special cases more generally by using the Matcher
internally to match special cases after the affix/token_match
tokenization is complete.

Instead of only matching special cases while processing balanced or
nearly balanced prefixes and suffixes, this recognizes special cases in
a wider range of contexts:

* Allows arbitrary numbers of prefixes/suffixes around special cases
* Allows special cases separated by infixes

Existing tests/settings that couldn't be preserved as before:

* The emoticon '")' is no longer a supported special case
* The emoticon ':)' in "example:)" is a false positive again

When merged with #4258 (or the relevant cache bugfix), the affix and
token_match properties should be modified to flush and reload all
special cases to use the updated internal tokenization with the Matcher.

* Remove accidentally added test case

* Really remove accidentally added test

* Reload special cases when necessary

Reload special cases when affixes or token_match are modified. Skip
reloading during initialization.

* Update error code number

* Fix offset and whitespace in Matcher special cases

* Fix offset bugs when merging and splitting tokens
* Set final whitespace on final token in inserted special case

* Improve cache flushing in tokenizer

* Separate cache and specials memory (temporarily)
* Flush cache when adding special cases
* Repeated `self._cache = PreshMap()` and `self._specials = PreshMap()`
are necessary due to this bug:
https://github.com/explosion/preshed/issues/21

* Remove reinitialized PreshMaps on cache flush

* Update UD bin scripts

* Update imports for `bin/`
* Add all currently supported languages
* Update subtok merger for new Matcher validation
* Modify blinded check to look at tokens instead of lemmas (for corpora
with tokens but not lemmas like Telugu)

* Use special Matcher only for cases with affixes

* Reinsert specials cache checks during normal tokenization for special
cases as much as possible
  * Additionally include specials cache checks while splitting on infixes
  * Since the special Matcher needs consistent affix-only tokenization
    for the special cases themselves, introduce the argument
    `with_special_cases` in order to do tokenization with or without
    specials cache checks
* After normal tokenization, postprocess with special cases Matcher for
special cases containing affixes

* Replace PhraseMatcher with Aho-Corasick

Replace PhraseMatcher with the Aho-Corasick algorithm over numpy arrays
of the hash values for the relevant attribute. The implementation is
based on FlashText.

The speed should be similar to the previous PhraseMatcher. It is now
possible to easily remove match IDs and matches don't go missing with
large keyword lists / vocabularies.

Fixes #4308.

* Restore support for pickling

* Fix internal keyword add/remove for numpy arrays

* Add test for #4248, clean up test

* Improve efficiency of special cases handling

* Use PhraseMatcher instead of Matcher
* Improve efficiency of merging/splitting special cases in document
  * Process merge/splits in one pass without repeated token shifting
  * Merge in place if no splits

* Update error message number

* Remove UD script modifications

Only used for timing/testing, should be a separate PR

* Remove final traces of UD script modifications

* Update UD bin scripts

* Update imports for `bin/`
* Add all currently supported languages
* Update subtok merger for new Matcher validation
* Modify blinded check to look at tokens instead of lemmas (for corpora
with tokens but not lemmas like Telugu)

* Add missing loop for match ID set in search loop

* Remove cruft in matching loop for partial matches

There was a bit of unnecessary code left over from FlashText in the
matching loop to handle partial token matches, which we don't have with
PhraseMatcher.

* Replace dict trie with MapStruct trie

* Fix how match ID hash is stored/added

* Update fix for match ID vocab

* Switch from map_get_unless_missing to map_get

* Switch from numpy array to Token.get_struct_attr

Access token attributes directly in Doc instead of making a copy of the
relevant values in a numpy array.

Add unsatisfactory warning for hash collision with reserved terminal
hash key. (Ideally it would change the reserved terminal hash and redo
the whole trie, but for now, I'm hoping there won't be collisions.)

* Restructure imports to export find_matches

* Implement full remove()

Remove unnecessary trie paths and free unused maps.

Parallel to Matcher, raise KeyError when attempting to remove a match ID
that has not been added.

* Switch to PhraseMatcher.find_matches

* Switch to local cdef functions for span filtering

* Switch special case reload threshold to variable

Refer to variable instead of hard-coded threshold

* Move more of special case retokenize to cdef nogil

Move as much of the special case retokenization to nogil as possible.

* Rewrap sort as stdsort for OS X

* Rewrap stdsort with specific types

* Switch to qsort

* Fix merge

* Improve cmp functions

* Fix realloc

* Fix realloc again

* Initialize span struct while retokenizing

* Temporarily skip retokenizing

* Revert "Move more of special case retokenize to cdef nogil"

This reverts commit 0b7e52c797.

* Revert "Switch to qsort"

This reverts commit a98d71a942.

* Fix specials check while caching

* Modify URL test with emoticons

The multiple suffix tests result in the emoticon `:>`, which is now
retokenized into one token as a special case after the suffixes are
split off.

* Refactor _apply_special_cases()

* Use cdef ints for span info used in multiple spots

* Modify _filter_special_spans() to prefer earlier

Parallel to #4414, modify _filter_special_spans() so that the earlier
span is preferred for overlapping spans of the same length.

* Replace MatchStruct with Entity

Replace MatchStruct with Entity since the existing Entity struct is
nearly identical.

* Replace Entity with more general SpanC

* Replace MatchStruct with SpanC

* Add error in debug-data if no dev docs are available (see #4575)

* Update azure-pipelines.yml

* Revert "Update azure-pipelines.yml"

This reverts commit ed1060cf59.

* Use latest wasabi

* Reorganise install_requires

* add dframcy to universe.json (#4580)

* Update universe.json [ci skip]

* Fix multiprocessing for as_tuples=True (#4582)

* Fix conllu script (#4579)

* force extensions to avoid clash between example scripts

* fix arg order and default file encoding

* add example config for conllu script

* newline

* move extension definitions to main function

* few more encodings fixes

* Add load_from_docbin example [ci skip]

TODO: upload the file somewhere

* Update README.md

* Add warnings about 3.8 (resolves #4593) [ci skip]

* Fixed typo: Added space between "recognize" and "various" (#4600)

* Fix DocBin.merge() example (#4599)

* Replace function registries with catalogue (#4584)

* Replace functions registries with catalogue

* Update __init__.py

* Fix test

* Revert unrelated flag [ci skip]

* Bugfix/dep matcher issue 4590 (#4601)

* add contributor agreement for prilopes

* add test for issue #4590

* fix on_match params for DependencyMatcher (#4590)

* Minor updates to language example sentences (#4608)

* Add punctuation to Spanish example sentences

* Combine multilanguage examples for lang xx

* Add punctuation to nb examples

* Always realloc to a larger size

Avoid potential (unlikely) edge case and cymem error seen in #4604.

* Add error in debug-data if no dev docs are available (see #4575)

* Update debug-data for GoldCorpus / Example

* Ignore None label in misaligned NER data
2019-11-13 21:24:35 +01:00
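The span-filtering rule mentioned in this commit (among overlapping spans of the same length, prefer the earlier one) can be sketched as follows. This is a plain-Python illustration, not the Cython `_filter_special_spans()` itself: representing spans as `(start, end)` token offsets and preferring longer spans first are assumptions of this sketch.

```python
def filter_special_spans(spans):
    """Keep non-overlapping (start, end) spans, preferring longer, then earlier."""
    # Sort by descending length, then by ascending start offset.
    ordered = sorted(spans, key=lambda s: (-(s[1] - s[0]), s[0]))
    kept = []
    seen_tokens = set()
    for start, end in ordered:
        # Skip a span if any of its tokens is already covered.
        if not any(i in seen_tokens for i in range(start, end)):
            kept.append((start, end))
            seen_tokens.update(range(start, end))
    # Return the surviving spans in document order.
    return sorted(kept)
```

With the input `[(1, 3), (0, 2)]`, both spans cover two tokens and overlap on token 1, so only the earlier span `(0, 2)` survives.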
Ines Montani 9d5ff177c4 Work around Markdown rendering issue surfaced in #4600 [ci skip] 2019-11-11 17:12:08 +01:00
walterhenry 5563c42ef5 Fixed typo: Added space between "recognize" and "various" (#4600) 2019-11-06 23:06:36 +01:00
Ines Montani 828ef27a32 Add warnings about 3.8 (resolves #4593) [ci skip] 2019-11-05 18:30:11 +01:00
Ines Montani 4e1de85e43 Update syntax iterators [ci skip] 2019-10-30 14:31:40 +01:00
Ines Montani 493be8e9db Update new version identifier [ci skip] 2019-10-25 11:42:49 +02:00
Ines Montani f31876154d Adjust formatting [ci skip] 2019-10-25 11:19:46 +02:00
Kabir Khan 93640373c7 Make entity_ruler ent_id resolution 2x faster and add docs for… (#4513)
* Update entityruler.py

* Making ent_id resolution 2x faster and adding docs

* Fixing newlines in docstrings

* Fixing newlines in docstrings
2019-10-25 11:16:42 +02:00
adrianeboyd 7fc39f124c Fix logic in rules+model entity example [ci skip] (#4510) 2019-10-23 14:41:21 +02:00
adrianeboyd 3195a8f170 Add Entity Linking to menu (#4489) 2019-10-21 12:17:30 +02:00
Ines Montani 573e543e4a Alphanumeric -> alphabetic [ci skip]
see ines/spacy-course#38
2019-10-06 13:30:01 +02:00
Ines Montani e65dffd80b Clarify serialization of extension attributes (closes #4377) [ci skip] 2019-10-05 11:58:00 +02:00
Sofie Van Landeghem 4e7259c6cf Bugfix initializing DocBin with attributes (#4368)
* docbin init fix + documentation fix + unit tests

* newline

* try with zlib instead of gzip (python 2 incompatibilities)
2019-10-03 14:48:45 +02:00
Ines Montani 80cf385f65 Update v2-2.md [ci skip] 2019-10-02 16:58:21 +02:00
Ines Montani b6670bf0c2 Use consistent spelling 2019-10-02 10:37:39 +02:00
Ines Montani 475e3188ce Add docs on filtering overlapping spans for merging (resolves #4352) [ci skip] 2019-10-01 21:59:50 +02:00
Ines Montani 0dd127bb00 Update v2-2.md [ci skip] 2019-10-01 21:37:06 +02:00
Ines Montani bc7e7db208 Fix wording [ci skip] 2019-10-01 14:20:44 +02:00
Ines Montani 2a3a4565cd Update infobox [ci skip] 2019-10-01 14:19:34 +02:00
Ines Montani 66aa0d479f Update v2.2 page [ci skip] 2019-10-01 14:11:05 +02:00
Ines Montani a8a1800f2a Update lemma data documentation [ci skip] 2019-10-01 13:22:13 +02:00
Ines Montani 932ad9cb91 Fix typos and formatting [ci skip] 2019-10-01 12:30:04 +02:00
Ines Montani 3d8fd4b461 Revert #4334 2019-09-29 17:32:12 +02:00
Ines Montani 3bd4da068e Fix link [ci skip] 2019-09-29 17:30:38 +02:00
Ines Montani 089f44cc56 Update serialization docs [ci skip] 2019-09-29 17:11:13 +02:00
Ines Montani c9cd516d96 Move tests out of package (#4334)
* Move tests out of package

* Fix typo
2019-09-28 18:05:00 +02:00
Ines Montani 10742d3219 Update v2 docs [ci skip] 2019-09-28 15:57:22 +02:00
Ines Montani 59beab8405 Update v2-2.md [ci skip] 2019-09-27 18:10:43 +02:00
Ines Montani 685e4b2554 Update v2-2.md [ci skip] 2019-09-27 16:35:01 +02:00
Em Zhan aafa091541 Fix typo in documentation (#4322)
* Fix typo 'probj' instead of 'pobj'

* Add spaCy contributor agreement for zqianem
2019-09-25 19:42:18 +02:00
Ines Montani 197406de1d Update v2-2.md [ci skip] 2019-09-19 14:33:58 +02:00
Ines Montani ddc09b08ed Update v2-2.md [ci skip] 2019-09-19 00:58:30 +02:00
Ines Montani 9c940eab94 Update version in examples [ci skip] 2019-09-18 21:23:26 +02:00
Ines Montani f873548f6c Add backwards incompatibility [ci skip] 2019-09-18 21:21:48 +02:00
Ines Montani dd1810f05a Update DocBin and add docs 2019-09-18 20:23:21 +02:00
Ines Montani d62690b3ba Update examples 2019-09-18 19:57:36 +02:00
Matthew Honnibal 931e96b6c7 DocPallet->DocBin in docs 2019-09-18 15:17:26 +02:00
Matthew Honnibal f537cbeacc Update v2-2 docs 2019-09-18 14:07:55 +02:00
Ines Montani 16c2522791 Merge branch 'master' into develop 2019-09-14 16:42:01 +02:00
Ines Montani 86befc80bf WIP: Add v2.2 page [ci skip] 2019-09-14 16:41:48 +02:00
Ines Montani 04d36d2471 Remove unused link [ci skip] 2019-09-14 16:41:19 +02:00
Ines Montani 5c8b5e68ec Fix docs consistency [ci skip] 2019-09-14 16:23:37 +02:00
Ines Montani bbf7337eaf Update adding languages docs [ci skip] 2019-09-14 15:32:15 +02:00
Ines Montani 25b2b3ff45 Remove LEMMA from exception examples [ci skip] 2019-09-12 16:26:27 +02:00
Ines Montani 82c16b7943 Remove u-strings and fix formatting [ci skip] 2019-09-12 16:11:15 +02:00
Ines Montani a31e9e1cd5 Update training docs [ci skip] 2019-09-12 15:32:39 +02:00
Ines Montani b544dcb3c5 Document debug-data [ci skip] 2019-09-12 15:26:20 +02:00
Ines Montani c0a4cab178 Update "Adding languages" docs [ci skip] 2019-09-12 14:53:06 +02:00
Ines Montani e7c20ad1d2 Update colors entry points docs [ci skip] 2019-09-12 12:59:10 +02:00
Ines Montani 7b59a919e6 Update entry points docs [ci skip] 2019-09-12 12:52:06 +02:00
Sofie Van Landeghem 0b4b4f1819 Documentation for Entity Linking (#4065)
* document token ent_kb_id

* document span kb_id

* update pipeline documentation

* prior and context weights as bools instead

* entitylinker api documentation

* drop for both models

* finish entitylinker documentation

* small fixes

* documentation for KB

* candidate documentation

* links to api pages in code

* small fix

* frequency examples as counts for consistency

* consistent documentation about tensors returned by predict

* add entity linking to usage 101

* add entity linking infobox and KB section to 101

* entity-linking in linguistic features

* small typo corrections

* training example and docs for entity_linker

* predefined nlp and kb

* revert back to similarity encodings for simplicity (for now)

* set prior probabilities to 0 when excluded

* code clean up

* bugfix: deleting kb ID from tokens when entities were removed

* refactor train el example to use either model or vocab

* pretrain_kb example for example kb generation

* add to training docs for KB + EL example scripts

* small fixes

* error numbering

* ensure the language of vocab and nlp stay consistent across serialization

* equality with =

* avoid conflict in errors file

* add error 151

* final adjustments to the train scripts - consistency

* update of goldparse documentation

* small corrections

* push commit

* typo fix

* add candidate API to kb documentation

* update API sidebar with EntityLinker and KnowledgeBase

* remove EL from 101 docs

* remove entity linker from 101 pipelines / rephrase

* custom el model instead of existing model

* set version to 2.2 for EL functionality

* update documentation for 2 CLI scripts
2019-09-12 11:38:34 +02:00
Sofie Van Landeghem 6b012cebff Make pos/tag distinction more clear in docs (#4246)
* make distinction between tag and pos more prominent in docs

* out of the 101
2019-09-06 10:31:21 +02:00
adrianeboyd 8fe7bdd0fa Improve token pattern checking without validation (#4105)
* Fix typo in rule-based matching docs

* Improve token pattern checking without validation

Add more detailed token pattern checks without full JSON pattern validation and
provide more detailed error messages.

Addresses #4070 (also related: #4063, #4100).

* Check whether top-level attributes in patterns and attr for PhraseMatcher are
  in token pattern schema

* Check whether attribute value types are supported in general (as opposed to
  per attribute with full validation)

* Report various internal error types (OverflowError, AttributeError, KeyError)
  as ValueError with standard error messages

* Check for tagger/parser in PhraseMatcher pipeline for attributes TAG, POS,
  LEMMA, and DEP

* Add error messages with relevant details on how to use validate=True or nlp()
  instead of nlp.make_doc()

* Support attr=TEXT for PhraseMatcher

* Add NORM to schema

* Expand tests for pattern validation, Matcher, PhraseMatcher, and EntityRuler

* Remove unnecessary .keys()

* Rephrase error messages

* Add another type check to Matcher

Add another type check to Matcher for more understandable error messages
in some rare cases.

* Support phrase_matcher_attr=TEXT for EntityRuler

* Don't use spacy.errors in examples and bin scripts

* Fix error code

* Auto-format

Also try to get Azure pipelines to finally start a build :(

* Update errors.py


Co-authored-by: Ines Montani <ines@ines.io>
Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
2019-08-21 14:00:37 +02:00
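The lightweight checks described in this commit (attribute names checked against the token pattern schema, value types checked in general, problems reported as `ValueError`) can be sketched roughly as below. The attribute set, the accepted value types and the message wording are assumptions for illustration and do not reproduce spaCy's schema or error codes.

```python
# Hypothetical subset of allowed attribute names, for this sketch only.
TOKEN_PATTERN_ATTRS = {
    "ORTH", "TEXT", "LOWER", "NORM", "LEMMA", "POS", "TAG", "DEP", "OP", "_",
}
# Assumed set of generally supported value types.
SUPPORTED_VALUE_TYPES = (str, int, dict)


def check_token_pattern(pattern):
    """Raise ValueError with a descriptive message if the pattern looks wrong."""
    if not isinstance(pattern, list):
        raise ValueError(f"Pattern must be a list of dicts, got {type(pattern).__name__}")
    for i, token_spec in enumerate(pattern):
        if not isinstance(token_spec, dict):
            raise ValueError(f"Token {i} must be a dict, got {type(token_spec).__name__}")
        for attr, value in token_spec.items():
            if str(attr).upper() not in TOKEN_PATTERN_ATTRS:
                raise ValueError(f"Token {i}: unknown attribute {attr!r}")
            if not isinstance(value, SUPPORTED_VALUE_TYPES):
                raise ValueError(
                    f"Token {i}: unsupported value type "
                    f"{type(value).__name__} for attribute {attr!r}"
                )
```

A misspelled key such as `{"LOWR": "hello"}` then fails early with a readable message instead of surfacing later as an unrelated internal error.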
Ines Montani 3134a9b6e0 Add section on expanding regex match to token boundaries (see #4158) [ci skip] 2019-08-21 12:53:31 +02:00
Ines Montani 66aba2d676 Improve regex matching docs [ci skip] 2019-08-19 13:59:41 +02:00
Sofie Van Landeghem cc66f47893 Make enabling/disabling jupyter mode more explicit (#4144)
* make enabling/disabling jupyter mode more explicit

* markup fix
2019-08-19 11:53:34 +02:00
Ines Montani e520eb3f6c Make visualized NER examples more clear (closes #4104) [ci skip] 2019-08-18 16:29:29 +02:00
Ines Montani 1362f793cf Improve docs on phrase pattern attributes (closes #4100) [ci skip] 2019-08-11 11:13:49 +02:00
Ines Montani 8b4a0fabbb Adjust docs example [ci skip] 2019-08-07 00:46:47 +02:00
adrianeboyd 69aca7d839 Add validate option to EntityRuler (#4089)
* Add validate option to EntityRuler

* Add validate to EntityRuler, passed to Matcher and PhraseMatcher

* Add validate to usage and API docs

* Update website/docs/usage/rule-based-matching.md

Co-Authored-By: Ines Montani <ines@ines.io>

* Update website/docs/usage/rule-based-matching.md

Co-Authored-By: Ines Montani <ines@ines.io>
2019-08-07 00:40:53 +02:00
Ines Montani 4ae320e5c2 Use consistent casing for entity ruler patterns (see #4063) [ci skip] 2019-08-06 12:20:22 +02:00
Ines Montani 223bde5cf6 Improve docs on matcher attributes [ci skip] (closes #4063) 2019-08-06 12:13:42 +02:00
Ines Montani 2bfae0b167 Auto-format 2019-08-06 12:13:31 +02:00
Ines Montani bd39e5e630 Add "Processing text" section [ci skip] 2019-07-25 17:38:03 +02:00
Ines Montani a5e3d2f318 Improve section on disabling pipes [ci skip] 2019-07-25 14:25:34 +02:00
Ines Montani 02e444ec7c Add section on special tokenizer component [ci skip] 2019-07-25 14:25:03 +02:00
Ines Montani 1fa6d6ba55 Improve consistency of docs examples [ci skip] 2019-07-25 14:24:56 +02:00
Ines Montani 1167c303a0 Fix typos [ci skip] 2019-07-19 13:08:18 +02:00
Ines Montani c3ead02ea5 Adjust wording [ci skip] 2019-07-17 16:06:25 +02:00
Ines Montani 1d5ff3e455 Add infobox 2019-07-17 15:29:36 +02:00
Ines Montani 114cb18892 Improve wording 2019-07-17 15:27:53 +02:00
Ines Montani 7522beef9e Add "Things to try" prompts 2019-07-17 15:25:02 +02:00
Ines Montani 9f02e3c027 Adjust example
Not actually supported in this alignment interpretation
2019-07-17 15:13:50 +02:00
Ines Montani 1ea472468a Add usage docs for aligning tokenization 2019-07-17 15:08:33 +02:00
pmbaumgartner 9a86d95ea2 fix custom attribute links 2019-07-14 20:23:54 -04:00
Ines Montani ebe58e7fa1 Document gold.docs_to_json [ci skip] 2019-07-10 10:27:33 +02:00
Ines Montani 881f5bc401 Auto-format 2019-07-10 10:27:29 +02:00
Ines Montani d361e380b8 Fix matcher callback example (closes #3862) 2019-06-26 14:47:26 +02:00
Alejandro Alcalde 4866a7ee9e Changed learning rate by its param name. (#3855)
* Changed learning rate by its param name.

I spent a while searching for the name of the learning rate parameter. `beta1` and `beta2` are easy to find because they are marked as code, but the learning rate wasn't. I think writing the actual parameter name would be helpful.

* Signing SCA
2019-06-20 10:29:20 +02:00
Ramanan Balakrishnan eb12703d10 minor fix to broken link in documentation (#3819) [ci skip] 2019-06-04 11:15:35 +02:00
Ines Montani 0c74506c9c Fix typos in docs (closes #3802) [ci skip] 2019-06-01 11:35:01 +02:00
mak 89379a7fa4 Corrected example model URL in requirements.txt (#3786)
The URL used to show how to add a model to the requirements.txt had the old release path (excl. explosion).
2019-05-29 10:51:55 +02:00
Aaron Kub 719a15f23d fixing regex matcher examples (#3708) (#3719) 2019-05-10 14:23:52 +02:00
张晓飞 ba1ff00370 update response after calling add_pipe (#3661)
* update response after calling add_pipe

component:print_info is appended last, so it needs to be shown at the end of the pipeline

* Create henry860916.md
2019-05-01 12:02:18 +02:00
Ramiro Gómez 8ee4100f8f Remove dangling M (#3657)
I assume this is a typo. Sorry if it has a meaning that I'm not aware of.
2019-04-29 19:44:43 +02:00
Amit Chaudhary 167d63af31 Fix broken link to Dive Into Python 3 website (#3656)
* Fix broken link to Dive Into Python 3 website

* Sign spaCy Contributor Agreement
2019-04-29 19:44:00 +02:00
Ivan Tham fa94f83697 Improve redundant variable name (#3643)
* Improve redundant variable name

* Apply suggestions from code review

Co-Authored-By: pickfire <pickfire@riseup.net>
2019-04-26 16:50:14 +02:00
Ines Montani 0dce4585b1 Add course to 101 2019-04-19 15:59:51 +02:00
Ines Montani 38395d9518 Merge branch 'spacy.io' 2019-04-19 15:26:20 +02:00
Ines Montani 7ac5bb0a7b Update landing and feature overview 2019-04-19 15:23:08 +02:00
fizban99 f2f2df6e78 entity types for colors should be in uppercase (#3599)
although the text indicates the entity types should be in lowercase, the sample code shows uppercase, which is the correct format.
2019-04-17 11:22:56 +02:00
Ines Montani 9e7deeaf48 Remove Datacamp 2019-04-13 17:46:32 +02:00
Ines Montani 2f0f439c54 Remove non-existent example (closes #3533) 2019-04-03 09:59:17 +02:00
Ines Montani 200d8bdb3c Merge branch 'spacy.io' [ci skip] 2019-03-23 16:46:34 +01:00
Ines Montani 06bf130890 💫 Add better and serializable sentencizer (#3471)
* Add better serializable sentencizer component

* Replace default factory

* Add tests

* Tidy up

* Pass test

* Update docs
2019-03-23 15:45:02 +01:00
Ines Montani b532386a60 Fix typo [ci skip] 2019-03-22 18:36:17 +01:00
Ines Montani 5073ce63fd Merge branch 'spacy.io' [ci skip] 2019-03-22 15:17:11 +01:00
Ines Montani 0712efc6b3 Update version requirements [ci skip] 2019-03-21 10:23:54 +01:00
Ines Montani d4eed4a84f Add note on unicode build to troubleshooting guide (see #3421) [ci skip] 2019-03-19 10:27:02 +01:00
Ines Montani a611b32fbf Update model docs [ci skip] 2019-03-17 11:48:18 +01:00
Ines Montani cbcba699dd Fix missing ids 2019-03-14 17:56:53 +01:00
Ines Montani 4cfe4aa224 Fix small issues in the docs [ci skip] 2019-03-12 22:57:15 +01:00
Ines Montani ba7eb2d131 Update section [ci skip] 2019-03-12 16:18:34 +01:00
Ines Montani cecc31b765 Don't auto-slugify accordion links [ci skip] 2019-03-12 15:30:49 +01:00
Ines Montani 72fb324d95 Add vector training script to bin [ci skip] 2019-03-12 12:07:56 +01:00
Ines Montani 3abf0e6b9f Replace dev-resources links with real examples 2019-03-12 12:07:40 +01:00
Ines Montani 59c0620487 Auto-format 2019-03-12 12:07:11 +01:00
Ines Montani 7c05ca01e8 💫 Support mutable default values for extension attributes (#3389)
* Support mutable default values in extensions

* Update documentation
2019-03-11 12:50:44 +01:00
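Why mutable defaults need special handling can be shown with a small sketch. It is not spaCy's extension machinery: `ExtensionStore` and its methods are hypothetical, and copying the default on first access is just one common way to avoid sharing a single list or dict between objects.

```python
import copy


class ExtensionStore:
    """Toy per-object attribute store with registered defaults (hypothetical)."""

    def __init__(self):
        self._defaults = {}
        self._values = {}

    def set_extension(self, name, default=None):
        self._defaults[name] = default

    def get(self, obj_id, name):
        key = (obj_id, name)
        if key not in self._values:
            # Copy the default so each object gets its own list/dict/set.
            self._values[key] = copy.deepcopy(self._defaults[name])
        return self._values[key]


store = ExtensionStore()
store.set_extension("tags", default=[])
store.get(1, "tags").append("musician")
```

After the `append`, object 1 holds `["musician"]`, while a lookup for any other object still starts from a fresh empty list.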
Ines Montani 8dbf1e9037 Also fix #3387 on develop 2019-03-10 23:36:28 +01:00
Ines Montani 9a8f169e5c Update v2-1.md 2019-03-10 18:58:51 +01:00
Ines Montani 296446a1c8 Tidy up and improve docs and docstrings (#3370)

## Description
* tidy up and adjust Cython code to code style
* improve docstrings and make calling `help()` nicer
* add URLs to new docs pages to docstrings wherever possible, mostly to user-facing objects
* fix various typos and inconsistencies in docs

### Types of change
enhancement, docs

## Checklist
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
2019-03-08 11:42:26 +01:00
Ines Montani 48a206a95f Fix displaCy visualizations in docs (closes #3357) [ci skip] 2019-03-06 13:20:44 +01:00
Ines Montani c478a2ccb6 Update backwards incompat [ci skip] 2019-02-27 11:56:56 +01:00
Ines Montani 1b6238101a Add table explaining training metrics [closes #2644] 2019-02-25 10:03:43 +01:00
Ines Montani 62b558ab72 💫 Support lexical attributes in retokenizer attrs (closes #2390) (#3325)
* Fix formatting and whitespace

* Add support for lexical attributes (closes #2390)

* Document lexical attribute setting during retokenization

* Assign variable outside of nested loop
2019-02-24 21:13:51 +01:00
Ines Montani aa52305461 Improve pipeline model and meta example [ci skip] 2019-02-24 18:45:39 +01:00
Ines Montani df19e2bff6 💫 Allow setting of custom attributes during retokenization (closes #3314) (#3324)

## Description

This PR adds the ability to override custom extension attributes during merging. This will only work for attributes that are writable, i.e. attributes registered with a default value like `default=False` or attributes that have both a getter *and* a setter implemented.

```python
Token.set_extension('is_musician', default=False)

doc = nlp("I like David Bowie.")
with doc.retokenize() as retokenizer:
    attrs = {"LEMMA": "David Bowie", "_": {"is_musician": True}}
    retokenizer.merge(doc[2:4], attrs=attrs)

assert doc[2].text == "David Bowie"
assert doc[2].lemma_ == "David Bowie"
assert doc[2]._.is_musician
```

### Types of change
enhancement

## Checklist
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
2019-02-24 18:38:47 +01:00
Ines Montani 403b9cd58b Add docs on adding to existing tokenizer rules [ci skip] 2019-02-24 18:35:19 +01:00
Ines Montani 383e2e1f12 Update Python versions [ci skip] 2019-02-24 11:49:45 +01:00
Ines Montani b624cb4b89 Update v2-1.md 2019-02-24 11:49:27 +01:00
Ines Montani 0fc908d7a5 Add note on merging speed in v2.1 (see #3300) [ci skip] 2019-02-21 12:34:18 +01:00
Ines Montani 236aa94ded Update v2-1.md 2019-02-21 12:33:56 +01:00
Sofie 9a478b6db8 Clean up of char classes, few tokenizer fixes and faster default French tokenizer (#3293)
* splitting up latin unicode interval

* removing hyphen as infix for French

* adding failing test for issue 1235

* test for issue #3002 which now works

* partial fix for issue #2070

* keep the hyphen as infix for French (as it was)

* restore french expressions with hyphen as infix (as it was)

* added succeeding unit test for Issue #2656

* Fix issue #2822 with custom Italian exception

* Fix issue #2926 by allowing numbers right before infix /

* remove duplicate

* remove xfail for Issue #2179 fixed by Matt

* adjust documentation and remove reference to regex lib
2019-02-20 22:10:13 +01:00
Ines Montani 57ae71ea95 Add docs on serializing the pipeline (see #3289) [ci skip] 2019-02-18 14:13:29 +01:00
Ines Montani 38e4422c0d Improve matcher example (resolves #3287) 2019-02-18 13:26:37 +01:00
Ines Montani 660cfe44c5 Fix formatting 2019-02-18 13:26:22 +01:00
Ines Montani 212ff359ef Fix links [ci skip] 2019-02-17 22:25:50 +01:00
Ines Montani e597110d31
💫 Update website (#3285)

## Description

The new website is implemented using [Gatsby](https://www.gatsbyjs.org) with [Remark](https://github.com/remarkjs/remark) and [MDX](https://mdxjs.com/). This allows authoring content in **straightforward Markdown** without the usual limitations. Standard elements can be overwritten with powerful [React](http://reactjs.org/) components and wherever Markdown syntax isn't enough, JSX components can be used. Hopefully, this update will also make it much easier to contribute to the docs. Once this PR is merged, I'll implement auto-deployment via [Netlify](https://netlify.com) on a specific branch (to avoid building the website on every PR). There's a bunch of other cool stuff that the new setup will allow us to do – including writing front-end tests, service workers, offline support, implementing a search and so on.

This PR also includes various new docs pages and content.
Resolves #3270. Resolves #3222. Resolves #2947. Resolves #2837.


### Types of change
enhancement

## Checklist
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
2019-02-17 19:31:19 +01:00
ines 3f4fd2c5d5 Update usage documentation 2017-10-03 14:26:20 +02:00
Reza Gharibi 0461b82158 Fix typos 2017-09-27 03:56:20 +03:30
Reza Gharibi fa1844b132 Fix typo 2017-09-27 03:55:54 +03:30
Reza Gharibi b5dd7e7cc4 Fix typo 2017-09-27 03:55:28 +03:30
Ines Montani b8e81daccf Fix typo (closes #1312) 2017-09-14 12:49:59 +02:00
ines d15775c3ad Fix typos and commands in alpha docs 2017-08-21 13:40:11 +02:00
ines 3c33003078 Port over typo corrections from #1245 2017-08-20 12:00:17 +02:00
ines a29f132ffd Change python -m spacy to spacy
Reflects latest change to entry point or auto-alias
2017-08-14 13:04:48 +02:00
Nikolai Kruglikov 08e443e083 Fix small typo in documentation 2017-08-14 12:19:04 +02:00
ines ab8ffbaab7 Add text classification to v2 overview 2017-07-22 17:56:51 +02:00
ines 0fb89dd204 Add text classification usage guide template 2017-07-22 17:56:07 +02:00
ines d05ab1b3a0 Add text classification to 101 overview and change order 2017-07-22 17:55:53 +02:00
Jarle Mathiesen f20533ec0c fix small typo 2017-06-24 12:31:33 +02:00
Savva Kolbachev 800a8faff4 Changed the capital of Lithuania to Vilnius
Hi,
There is a typo about the capital of Lithuania.

Vilnius is the capital of Lithuania https://en.wikipedia.org/wiki/Vilnius
Ljubljana is the capital of Slovenia https://en.wikipedia.org/wiki/Ljubljana
2017-06-12 23:27:00 +03:00
Ines Montani 57f64b9e1c Merge pull request #1124 from v3t3a/patch-3
docs - Fix url error for Displacy Ent visualizer
2017-06-12 21:20:32 +02:00
Ines Montani b2a28028cf Merge pull request #1115 from v3t3a/patch-2
docs - Add read() method when opening file (Lightning tour)
2017-06-12 21:19:25 +02:00
Vetea eae1f7b19c Fix url error for Displacy Ent visualizer 2017-06-12 14:30:02 +02:00
ines 49026a1346 Fix typos in example (see #1105) 2017-06-08 19:15:50 +02:00
Vetea cc3aee1189 Add read() method when opening file
Add read() method for 

to avoid :
```TypeError: Argument 'string' has incorrect type (expected str, got _io.TextIOWrapper)```

Test with:
spaCy : v2.0.0 Alpha
python : 3.5.2+ (default, Sep 22 2016, 12:18:14)
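
The error quoted above comes from passing an open file object where a string is expected. A minimal sketch of the difference, using only the standard library: `process` is a hypothetical stand-in for a callable that, like the one in the traceback, accepts only `str`, and the sample file is made up for the example.

```python
import os
import tempfile

def process(text):
    # Hypothetical stand-in for a callable that only accepts str.
    if not isinstance(text, str):
        raise TypeError("expected str, got {}".format(type(text).__name__))
    return text.split()

# Write a small sample file so the example is self-contained.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "w", encoding="utf8") as f:
    f.write("Hello world")

with open(path, encoding="utf8") as f:
    try:
        process(f)  # passing the file object itself raises TypeError
        failed = False
    except TypeError:
        failed = True
    tokens = process(f.read())  # read() returns the file contents as str
```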
2017-06-08 11:27:09 +02:00
ines 6b799bac54 Fix formatting and details 2017-06-06 14:37:49 +02:00
ines fd9ae0f0e0 Update v2 comparison table 2017-06-05 16:39:11 +02:00
ines a3f9745a14 Update similarity usage guide and examples 2017-06-05 15:37:33 +02:00
ines fd35d910b8 Update v2 docs and benchmarks 2017-06-05 14:13:38 +02:00
ines 040553ca59 Update architecture and features table 2017-06-05 13:33:01 +02:00
ines 505d43b832 Update norms example 2017-06-04 23:33:26 +02:00
ines f8e93b6d0a Update norms example 2017-06-04 23:24:29 +02:00
ines a857b2b511 Update norms example 2017-06-04 23:21:37 +02:00
ines 47d066b293 Add under construction 2017-06-04 23:17:54 +02:00
ines e9816daa6a Add details on syntax iterators 2017-06-04 23:16:33 +02:00
ines 990cb81556 Add info on syntax iterators 2017-06-04 21:47:22 +02:00
ines e4eb33daf7 Add links to production use guide 2017-06-04 20:56:58 +02:00
ines 63cd539d04 Add more details on model packages and requirements.txt (see #1099) 2017-06-04 20:52:10 +02:00
ines 97ff83d163 Fix docs on model loading 2017-06-04 20:44:59 +02:00
ines b6002db797 Add v2 label 2017-06-04 18:53:03 +02:00
ines 468ff1a7dd Update v2 docs and add benchmarks stub 2017-06-04 15:34:28 +02:00
Matthew Honnibal 23fd6b1782 Add intro narrative for v2 2017-06-04 15:10:37 +02:00
ines 3419ecbfdd Update docs on model shortcut links 2017-06-04 13:55:00 +02:00
ines 586e901143 Add v2 intro stub 2017-06-04 13:42:37 +02:00
ines 4f8f62d9b3 Merge branch 'v2-docs-edits' into develop 2017-06-04 13:40:58 +02:00
ines 809903dcad Fix link and update wording 2017-06-04 13:29:20 +02:00
ines 22dd18c364 Remove redundant CPU commands 2017-06-04 13:29:13 +02:00
ines 1d6377218a Update architecture blurb and move other info 2017-06-04 13:28:58 +02:00
ines 7a66c9f039 Fix formatting 2017-06-04 13:14:00 +02:00
Matthew Honnibal f2c4a9f690 Edits to spacy-101 page 2017-06-04 13:10:27 +02:00
Matthew Honnibal aca53b95e1 Link architecture blurb 2017-06-04 13:10:06 +02:00
Matthew Honnibal 64ca5123bb Add Architecture 101 blurb 2017-06-04 13:09:19 +02:00
Matthew Honnibal e77ed953f4 Update GPU instructions 2017-06-04 12:03:22 +02:00
ines 1d3b012e56 Update adding languages docs and add 101 2017-06-03 23:54:23 +02:00
ines a3715a81d5 Update adding languages guide 2017-06-03 22:16:38 +02:00
ines ec6d2bc81d Add table of contents mixin 2017-06-03 22:16:26 +02:00
ines 9acf8686f7 Update note on compact mode issues 2017-06-03 13:31:16 +02:00
ines c60431357d Port over docs typo corrections 2017-06-03 11:31:30 +02:00
ines c6dc2fafc0 Add Spanish and move example sentences to meta 2017-06-01 17:49:56 +02:00
ines b577ed79ee Move social image logic out to function and move files 2017-06-01 14:27:44 +02:00
ines 5e60b09dcd Fix custom tokenizer example 2017-06-01 13:02:50 +02:00
ines 8274dffad6 Update NER training draft 2017-06-01 12:51:36 +02:00
ines 04fac3f52a Add NER training example code 2017-06-01 12:47:47 +02:00
ines 7f5e7e7320 Fix typo 2017-06-01 12:47:36 +02:00
ines 4a927154d8 Update v2 docs 2017-06-01 11:56:32 +02:00
ines 03bbb96db8 Remove outdated examples 2017-06-01 11:56:02 +02:00
ines 789e69b73f Update training guide 2017-06-01 11:53:23 +02:00
ines 2f40d6e7e7 Add training 101 2017-06-01 11:53:16 +02:00
ines abed463bbb Update serialization 101 2017-06-01 11:52:58 +02:00
ines 72380c952a Update training section in NER guide and add links 2017-06-01 11:52:49 +02:00
ines 22b1f72870 Add spaCy 101 intro 2017-05-31 12:44:09 +02:00
ines a18b95ca12 Update docs on testing 2017-05-31 12:43:40 +02:00
ines 981196c181 Fix typo 2017-05-31 11:34:31 +02:00
ines f86289566a Update new in v2 section and add note on Matcher acceptors 2017-05-30 13:53:06 +02:00
ines ce4e45d0bb Update 101 intro 2017-05-29 22:15:06 +02:00
ines 687ed28340 Update processing pipelines guide 2017-05-29 14:21:00 +02:00
ines d5992f408f Update note on vocab consistency 2017-05-29 14:14:26 +02:00
ines a2134951f2 Update 101 and add note on pipeline order and tensors 2017-05-29 11:45:32 +02:00
ines 17b635eaab Update alpha docs note and fix typo 2017-05-29 11:09:24 +02:00
ines fbe105f1eb Add note on L in long integers in Python 2 2017-05-29 11:05:05 +02:00
ines 9d74810f6f Update examples 2017-05-29 01:09:52 +02:00
ines 42cf414138 Update Matcher example 2017-05-29 01:09:52 +02:00
ines 00b2094dc3 Fix typos, long integers and tests 2017-05-29 01:09:52 +02:00
ines d71c6db76e Add missing Chainer install for GPU if building spaCy from source 2017-05-28 23:34:59 +02:00
ines e0f9ccdaa3 Update texts and rename vectorizer to tensorizer 2017-05-28 23:26:13 +02:00
ines 606879b217 Update hash strings examples 2017-05-28 19:42:44 +02:00
ines c7b57ea314 Update docs and change integer IDs to hash values 2017-05-28 19:25:34 +02:00
ines 738b4f7187 Add quickstart options and docs for GPU 2017-05-28 19:20:11 +02:00
ines 4c00cb8c8b Update 101 and add community/FAQ and table of contents 2017-05-28 18:45:49 +02:00
ines 8a148b6563 Fix code, links and formatting 2017-05-28 18:29:16 +02:00
ines 414193e9ba Update docs to reflect StringStore changes 2017-05-28 18:19:11 +02:00
ines 69bda9aed7 Update text, examples, typos, wording and formatting 2017-05-28 16:41:01 +02:00
ines f8185b8e11 Rename vocab-stringstore to vocab 2017-05-28 16:37:14 +02:00
ines 10d05c2b92 Fix typos, wording and formatting 2017-05-28 01:30:12 +02:00
ines db116cbeda Update tokenization 101 and add illustration 2017-05-28 00:22:40 +02:00
ines b03fb2d7b0 Update 101 and usage docs 2017-05-28 00:22:40 +02:00
ines ae11c8d60f Add emoji sentiment to lightning tour matcher example 2017-05-27 20:02:20 +02:00
ines 22bf5f63bf Update Matcher docs and add social media analysis example 2017-05-27 17:58:18 +02:00
ines 0d33ead507 Fix initialisation of Doc in lightning tour example 2017-05-27 17:58:06 +02:00
ines e05bcd6aa8 Update docs to reflect flattened model meta.json
Don't use "setup" key and instead, keep "lang" on root level and add
"pipeline".
2017-05-27 17:57:46 +02:00
ines 1b982f0838 Update train command and add docs on hyperparameters 2017-05-26 14:02:38 +02:00
ines 93ee5c4a52 Update serialization info 2017-05-26 13:22:45 +02:00
ines f122d82f29 Update usage docs and add "under construction" 2017-05-26 13:17:48 +02:00
ines 286c3d0719 Update usage and 101 docs 2017-05-26 12:46:29 +02:00
ines 6d76c1ea16 Add 101 for Vocab, Lexeme and StringStore 2017-05-26 12:45:01 +02:00