6347 Commits

Author SHA1 Message Date
Ines Montani 3c3658ef9f Merge branch 'master' into develop 2019-09-12 18:03:01 +02:00
Ines Montani 228bbf506d Improve label properties on pipes 2019-09-12 18:02:44 +02:00
Paul O'Leary McCann 7d8df69158 Bloom-filter backed Lookup Tables (#4268)
* Improve load_language_data helper

* WIP: Add Lookups implementation

* Start moving lemma data over to JSON

* WIP: move data over for more languages

* Convert more languages

* Fix lemmatizer fixtures in tests

* Finish conversion

* Auto-format JSON files

* Fix test for now

* Make sure tables are stored on instance

* Update docstrings

* Update docstrings and errors

* Update test

* Add Lookups.__len__

* Add serialization methods

* Add Lookups.remove_table

* Use msgpack for serialization to disk

* Fix file exists check

* Try using OrderedDict for everything

* Update .flake8 [ci skip]

* Try fixing serialization

* Update test_lookups.py

* Update test_serialize_vocab_strings.py

* Lookups / Tables now work

This implements the stubs in the Lookups/Table classes. Currently this
is in Cython but with no type declarations, so that could be improved.

* Add lookups to setup.py

* Actually add lookups pyx

The previous commit added the old py file...

* Lookups work-in-progress

* Move from pyx back to py

* Add string based lookups, fix serialization

* Update tests, language/lemmatizer to work with string lookups

There are some outstanding issues here:

- a pickling-related test fails due to the bloom filter
- some custom lemmatizers (fr/nl at least) have issues

More generally, there's a question of how to deal with the case where
you have a string but want to use the lookup table. Currently the table
allows access by string or id, but that's getting pretty awkward.

* Change lemmatizer lookup method to pass (orth, string)

* Fix token lookup

* Fix French lookup

* Fix lt lemmatizer test

* Fix Dutch lemmatizer

* Fix lemmatizer lookup test

This was using a normal dict instead of a Table, so checks for the
string instead of an integer key failed.

* Make uk/nl/ru lemmatizer lookup methods consistent

The mentioned lemmatizers all have their own implementation of the
`lookup` method, which accesses a `Lookups` table. The way that was
called in `token.pyx` was changed, so this should be updated to have the
same arguments as `lookup` in `lemmatizer.py` (specifically (orth/id,
string)).

Prior to this change tests weren't failing, but there would probably be
issues with normal use of a model. More tests should probably be added.

Additionally, the language-specific `lookup` implementations seem like
they might not be needed, since they handle things like lower-casing
that aren't actually language specific.

* Make recently added Greek method compatible

* Remove redundant class/method

Leftovers from a merge not cleaned up adequately.
2019-09-12 17:26:11 +02:00
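
Note on 7d8df69158: the message above describes a table that can be accessed by string or by integer id, and a `lookup` call that receives (orth, string). The following is a minimal sketch of that access pattern only, not spaCy's implementation. `ToyTable`, `hash_key` and `toy_lookup` are hypothetical names, the BLAKE2 hash stands in for whatever hash the real string store uses, and the bloom filter is left out entirely.

```python
import hashlib
from typing import Dict, Optional, Union

Key = Union[str, int]


def hash_key(text: str) -> int:
    """Map a string to a stable 64-bit integer id (stand-in hash, an assumption)."""
    digest = hashlib.blake2b(text.encode("utf8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


class ToyTable:
    """Hypothetical table that stores integer keys but accepts strings too."""

    def __init__(self) -> None:
        self._data: Dict[int, str] = {}

    @staticmethod
    def _to_id(key: Key) -> int:
        # Strings are hashed; integers are assumed to be ids already.
        return hash_key(key) if isinstance(key, str) else key

    def set(self, key: Key, value: str) -> None:
        self._data[self._to_id(key)] = value

    def get(self, key: Key, default: Optional[str] = None) -> Optional[str]:
        return self._data.get(self._to_id(key), default)

    def __contains__(self, key: Key) -> bool:
        return self._to_id(key) in self._data


def toy_lookup(table: ToyTable, orth: int, string: str) -> str:
    """Look up by id and fall back to the surface string when there is no entry."""
    return table.get(orth, string)


table = ToyTable()
table.set("was", "be")
lemma_known = toy_lookup(table, hash_key("was"), "was")
lemma_unknown = toy_lookup(table, hash_key("spaCy"), "spaCy")
```

Passing both the id and the string means the caller never has to convert between them inside the table, which is the awkwardness the commit message mentions.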
Sofie Van Landeghem 9be4d1c105 Allow copying of user_data in as_doc (#4282)
* Allow copying the user_data with as_doc + unit test

* add option to docs

* add typing

* import fix

* workaround to avoid bool clashing ...

* bint instead of bool
2019-09-12 17:08:14 +02:00
Matthew Honnibal 7d782aa97b Add more docstrings for MorphAnalysis 2019-09-12 16:48:30 +02:00
Ines Montani b544dcb3c5 Document debug-data [ci skip] 2019-09-12 15:26:20 +02:00
Ines Montani 05a2df6616 Remove not implemented file validation [ci skip] 2019-09-12 15:26:02 +02:00
Ines Montani 10257f3131 Document Lookups [ci skip] 2019-09-12 14:00:14 +02:00
Ines Montani 32404e613c Create directory if it doesn't exist 2019-09-12 14:00:01 +02:00
Ines Montani 625ce2db8e Update Language docs [ci skip] 2019-09-12 13:03:38 +02:00
Ines Montani 655b434553 Merge branch 'master' into develop 2019-09-12 11:39:18 +02:00
Sofie Van Landeghem 0b4b4f1819 Documentation for Entity Linking (#4065)
* document token ent_kb_id

* document span kb_id

* update pipeline documentation

* prior and context weights as bools instead

* entitylinker api documentation

* drop for both models

* finish entitylinker documentation

* small fixes

* documentation for KB

* candidate documentation

* links to api pages in code

* small fix

* frequency examples as counts for consistency

* consistent documentation about tensors returned by predict

* add entity linking to usage 101

* add entity linking infobox and KB section to 101

* entity-linking in linguistic features

* small typo corrections

* training example and docs for entity_linker

* predefined nlp and kb

* revert back to similarity encodings for simplicity (for now)

* set prior probabilities to 0 when excluded

* code clean up

* bugfix: deleting kb ID from tokens when entities were removed

* refactor train el example to use either model or vocab

* pretrain_kb example for example kb generation

* add to training docs for KB + EL example scripts

* small fixes

* error numbering

* ensure the language of vocab and nlp stay consistent across serialization

* equality with =

* avoid conflict in errors file

* add error 151

* final adjustments to the train scripts - consistency

* update of goldparse documentation

* small corrections

* push commit

* typo fix

* add candidate API to kb documentation

* update API sidebar with EntityLinker and KnowledgeBase

* remove EL from 101 docs

* remove entity linker from 101 pipelines / rephrase

* custom el model instead of existing model

* set version to 2.2 for EL functionality

* update documentation for 2 CLI scripts
2019-09-12 11:38:34 +02:00
Ines Montani 4d4b3b0783 Add "labels" to Language.meta 2019-09-12 11:34:25 +02:00
Ines Montani ac0e27a825 💫 Add Language.pipe_labels (#4276)
* Add Language.pipe_labels

* Update spacy/language.py

Co-Authored-By: Matthew Honnibal <honnibal+gh@gmail.com>
2019-09-12 10:56:28 +02:00
tamuhey 71909cdf22 Fix iss4278 (#4279)
* fix: len(tuple) == 2

* (#4278) add fail test

* add contributor's agreement
2019-09-12 10:44:49 +02:00
Ines Montani 8ebc3711dc Fix bug in Parser.labels and add test (#4275) 2019-09-11 18:29:35 +02:00
Matthew Honnibal 7fbb559045 Set version to v2.2.0.dev6 2019-09-11 18:07:20 +02:00
Matthew Honnibal f7a096b462 Update morphology 2019-09-11 18:06:43 +02:00
Matthew Honnibal f8ce9dde0f Set version to v2.2.0.dev5 2019-09-11 17:41:21 +02:00
Matthew Honnibal c47c0269b1 Update morphology features 2019-09-11 15:16:53 +02:00
Ines Montani af25323653 Tidy up and auto-format 2019-09-11 14:00:36 +02:00
Matthew Honnibal af93997993 Fix conllu converter 2019-09-11 13:28:07 +02:00
Matthew Honnibal 178d010b25 Set version to 2.2.0.dev4 2019-09-11 12:28:37 +02:00
Ines Montani e82a8d0d7a Merge branch 'master' into develop 2019-09-11 11:52:38 +02:00
Ines Montani 8f9f48b04c Add GreekLemmatizer.lookup (resolves #4272) 2019-09-11 11:44:40 +02:00
Ines Montani 6279d74c65 Tidy up and auto-format 2019-09-11 11:38:22 +02:00
Matthew Honnibal 7b858ba606 Update from master 2019-09-10 20:14:08 +02:00
Ines Montani 669a7d37ce Exclude vocab when testing to_bytes 2019-09-10 19:45:16 +02:00
adrianeboyd e367864e59 Update Ukrainian create_lemmatizer kwargs (#4266)
Allow Ukrainian create_lemmatizer to accept lookups kwarg.
2019-09-10 11:14:46 +02:00
adrianeboyd c32126359a Allow period as suffix following punctuation (#4248)
Addresses rare cases (such as `_MATH_.`, see #1061) where the final
period was not recognized as a suffix following punctuation.
2019-09-09 19:19:22 +02:00
Ines Montani 3e8f136ba7 💫 WIP: Basic lookup class scaffolding and JSON for all lemmatizer data (#4178)
* Improve load_language_data helper

* WIP: Add Lookups implementation

* Start moving lemma data over to JSON

* WIP: move data over for more languages

* Convert more languages

* Fix lemmatizer fixtures in tests

* Finish conversion

* Auto-format JSON files

* Fix test for now

* Make sure tables are stored on instance

* Update docstrings

* Update docstrings and errors

* Update test

* Add Lookups.__len__

* Add serialization methods

* Add Lookups.remove_table

* Use msgpack for serialization to disk

* Fix file exists check

* Try using OrderedDict for everything

* Update .flake8 [ci skip]

* Try fixing serialization

* Update test_lookups.py

* Update test_serialize_vocab_strings.py

* Fix serialization for lookups

* Fix lookups

* Fix lookups

* Fix lookups

* Try to fix serialization

* Try to fix serialization

* Try to fix serialization

* Try to fix serialization

* Give up on serialization test

* Xfail more serialization tests for 3.5

* Fix lookups for 2.7
2019-09-09 19:17:55 +02:00
Sofie Van Landeghem 482c7cd1b9 pulling tqdm imports in functions to avoid bug (tmp fix) (#4263) 2019-09-09 16:32:11 +02:00
Mihai Gliga 25aecd504f adding Romanian tag_map (#4257)
* adding Romanian tag_map

* added SCA file

* forgotten import
2019-09-09 11:53:09 +02:00
Matthew Honnibal 1653b818c5 Update Lithuanian tag map 2019-09-08 20:57:58 +02:00
adrianeboyd 3780e2ff50 Flush tokenizer cache when necessary (#4258)
Flush tokenizer cache when affixes, token_match, or special cases are
modified.

Fixes #4238, same issue as in #1250.
2019-09-08 20:52:46 +02:00
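
Note on 3780e2ff50: the fix above flushes a cache whenever the rules that produced the cached results change. A minimal sketch of that idea follows, assuming a toy tokenizer with a single suffix rule. `ToyTokenizer` and its attributes are hypothetical and do not mirror spaCy's `Tokenizer` API.

```python
import re
from typing import Dict, List


class ToyTokenizer:
    """Hypothetical tokenizer that caches per-word results."""

    def __init__(self, suffix_pattern: str) -> None:
        self._suffix_re = re.compile(suffix_pattern)
        self._cache: Dict[str, List[str]] = {}

    @property
    def suffix_pattern(self) -> str:
        return self._suffix_re.pattern

    @suffix_pattern.setter
    def suffix_pattern(self, pattern: str) -> None:
        # Cached splits were computed with the old rule, so drop them all.
        self._suffix_re = re.compile(pattern)
        self._cache.clear()

    def tokenize_word(self, word: str) -> List[str]:
        if word in self._cache:
            return self._cache[word]
        match = self._suffix_re.search(word)
        if match and match.start() > 0:
            tokens = [word[: match.start()], word[match.start():]]
        else:
            tokens = [word]
        self._cache[word] = tokens
        return tokens


tokenizer = ToyTokenizer(r"\.$")
before = tokenizer.tokenize_word("end.")   # split on the trailing period
tokenizer.suffix_pattern = r"!$"           # rule change flushes the cache
after = tokenizer.tokenize_word("end.")    # recomputed with the new rule
```

Without the `clear()` call in the setter, the second call would return the stale split from the cache.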
Matthew Honnibal da8830d909 Set version to v2.2.0.dev3 2019-09-08 18:22:03 +02:00
Matthew Honnibal 1a65c5b7af Update develop from master 2019-09-08 18:21:41 +02:00
Matthew Honnibal aec6174ae6 Fix lemmatizer 2019-09-08 18:09:53 +02:00
Matthew Honnibal fde4f8ac8e Create lookups if not passed in 2019-09-08 18:08:09 +02:00
Pavle Vidanović d03401f532 Lemmatizer lookup dictionary for Serbian and basic tag set adde… (#4251)
* Serbian stopwords added. (cyrillic alphabet)

* spaCy Contribution agreement included.

* Test initialize updated

* Serbian language code update. --bugfix

* Tokenizer exceptions added. Init file updated.

* Norm exceptions and lexical attributes added.

* Examples added.

* Tests added.

* sr_lang examples update.

* Tokenizer exceptions updated. (Serbian)

* Lemmatizer created. Licence included.

* Test updated.

* Tag map basic added.

* tag_map.py file removed since it uses default spacy tags.
2019-09-08 14:19:15 +02:00
Ivan Šarić b01025dd06 adds Croatian lemma_lookup.json, license file and corresponding tests (#4252) 2019-09-08 13:40:45 +02:00
adrianeboyd aec755d3a3 Modify retokenizer to use span root attributes (#4219)
* Modify retokenizer to use span root attributes

* tag/pos/morph are set to root tag/pos/morph

* lemma and norm are reset and end up as orth (not ideal, but better
than orth of first token)

* Also handle individual merge case

* Add test

* Attempt to handle ent_iob and ent_type in merges

* Fix check for whether B-ENT should become I-ENT

* Move IOB consistency check to after attrs

Move all IOB consistency checks after attrs are set and simplify to
check entire document, modifying I to B at the beginning of the document
or if the entity type of the previous token isn't the same.

* Move IOB consistency check for single merge

Move IOB consistency check after the token array is compressed for the
single merge case.

* Update spacy/tokens/_retokenize.pyx

Co-Authored-By: Matthew Honnibal <honnibal+gh@gmail.com>

* Remove single vs. multiple merge distinction

Remove original single-instance `_merge()` and use `_bulk_merge()` (now
renamed `_merge()`) for all merges.

* Add out-of-bound check in previous entity check
2019-09-08 13:04:49 +02:00
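
Note on aec755d3a3: the IOB consistency check described above (turn I into B at the beginning of the document, or when the previous token's entity type differs) can be sketched in plain Python. This is an illustration of the stated rule only, not the code in `_retokenize.pyx`. `fix_iob` and the (iob, ent_type) tuple representation are assumptions made for the example.

```python
from typing import List, Tuple

Tag = Tuple[str, str]  # (iob, ent_type); "O" tokens carry an empty ent_type


def fix_iob(tags: List[Tag]) -> List[Tag]:
    """Return a copy of tags in which every entity starts with B."""
    fixed: List[Tag] = []
    prev_type = ""
    for i, (iob, ent_type) in enumerate(tags):
        # An I tag is only valid when it continues an entity of the same type.
        if iob == "I" and (i == 0 or prev_type != ent_type):
            iob = "B"
        fixed.append((iob, ent_type))
        # Tokens outside an entity reset the previous type.
        prev_type = ent_type if iob != "O" else ""
    return fixed


repaired = fix_iob(
    [("I", "ORG"), ("I", "ORG"), ("O", ""), ("I", "PER"), ("I", "ORG")]
)
```

Running the check over the whole document after attributes are set, as the commit describes, keeps the rule in one place instead of repeating it for each merge.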
Bae Yong-Ju a55f5a744f Fix ValueError exception on empty Korean text. (#4245) 2019-09-06 10:29:40 +02:00
Adriane Boyd 0f28418446 Add regression test for #1061 back to test suite 2019-09-04 20:42:24 +02:00
Adriane Boyd c39c13f26b Add guillemets/chevrons to German orth variants
Add guillemets/chevrons to German orth variants for both German/Austrian
and Swiss conventions.
2019-09-04 20:05:08 +02:00
Adriane Boyd 6b0fec76fd Fix handling of preset entities in NER
* Fix check of valid ent_type for B
* Add valid L as preset-I followed by not-I
2019-09-04 13:42:42 +02:00
Ines Montani 419ae59c79 Make flaky test test_issue_1971_4 more explicit 2019-08-31 14:08:05 +02:00
Ines Montani cd90752193 Tidy up and auto-format [ci skip] 2019-08-31 13:39:06 +02:00
Matthew Honnibal 67c3d03905 Revert morphology serialisation 2019-08-30 13:13:07 +02:00
Adriane Boyd 893f11a9e3 Serialize tag_map directly
Fix Aspect_prof typo
2019-08-30 11:30:03 +02:00