10800 Commits

Author SHA1 Message Date
Matthew Honnibal 28741ff5db Require preshed v3.0.0 2019-09-10 19:13:07 +02:00
adrianeboyd e367864e59 Update Ukrainian create_lemmatizer kwargs (#4266)
Allow Ukrainian create_lemmatizer to accept lookups kwarg.
2019-09-10 11:14:46 +02:00
adrianeboyd c32126359a Allow period as suffix following punctuation (#4248)
Addresses rare cases (such as `_MATH_.`, see #1061) where the final
period was not recognized as a suffix following punctuation.
2019-09-09 19:19:22 +02:00
Ines Montani 3e8f136ba7 💫 WIP: Basic lookup class scaffolding and JSON for all lemmatizer data (#4178)
* Improve load_language_data helper

* WIP: Add Lookups implementation

* Start moving lemma data over to JSON

* WIP: move data over for more languages

* Convert more languages

* Fix lemmatizer fixtures in tests

* Finish conversion

* Auto-format JSON files

* Fix test for now

* Make sure tables are stored on instance

* Update docstrings

* Update docstrings and errors

* Update test

* Add Lookups.__len__

* Add serialization methods

* Add Lookups.remove_table

* Use msgpack for serialization to disk

* Fix file exists check

* Try using OrderedDict for everything

* Update .flake8 [ci skip]

* Try fixing serialization

* Update test_lookups.py

* Update test_serialize_vocab_strings.py

* Fix serialization for lookups

* Fix lookups

* Fix lookups

* Fix lookups

* Try to fix serialization

* Try to fix serialization

* Try to fix serialization

* Try to fix serialization

* Give up on serialization test

* Xfail more serialization tests for 3.5

* Fix lookups for 2.7
2019-09-09 19:17:55 +02:00
Sofie Van Landeghem 482c7cd1b9 pulling tqdm imports in functions to avoid bug (tmp fix) (#4263) 2019-09-09 16:32:11 +02:00
Mihai Gliga 25aecd504f adding Romanian tag_map (#4257)
* adding Romanian tag_map

* added SCA file

* forgotten import
2019-09-09 11:53:09 +02:00
Matthew Honnibal 1653b818c5 Update Lithuanian tag map 2019-09-08 20:57:58 +02:00
adrianeboyd 3780e2ff50 Flush tokenizer cache when necessary (#4258)
Flush tokenizer cache when affixes, token_match, or special cases are
modified.

Fixes #4238, same issue as in #1250.
2019-09-08 20:52:46 +02:00
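A minimal sketch of the idea behind #4258, assuming a toy tokenizer rather than spaCy's actual `Tokenizer`: results are cached per whitespace-delimited chunk, so the cache has to be cleared whenever a setting that affects splitting changes. The class name `CachingTokenizer` and its single-character `suffixes` setting are hypothetical and exist only for this illustration.

```python
class CachingTokenizer:
    """Toy whitespace tokenizer that splits off suffix characters and caches results.

    Hypothetical illustration only; this is not spaCy's Tokenizer.
    """

    def __init__(self, suffixes=()):
        self._suffixes = tuple(suffixes)  # single-character suffixes
        self._cache = {}                  # chunk -> list of tokens

    @property
    def suffixes(self):
        return self._suffixes

    @suffixes.setter
    def suffixes(self, value):
        self._suffixes = tuple(value)
        # Cached splits were computed with the old suffixes, so drop them.
        self._cache.clear()

    def _split(self, chunk):
        if chunk not in self._cache:
            rest, tail = chunk, []
            # Peel suffix characters off the end, keeping at least one character.
            while len(rest) > 1 and rest[-1] in self._suffixes:
                tail.insert(0, rest[-1])
                rest = rest[:-1]
            self._cache[chunk] = [rest] + tail
        return self._cache[chunk]

    def __call__(self, text):
        return [tok for chunk in text.split() for tok in self._split(chunk)]


tokenizer = CachingTokenizer()
before = tokenizer("done.")   # cached with no suffixes defined
tokenizer.suffixes = ["."]    # modifying the setting flushes the cache
after = tokenizer("done.")    # recomputed with the new suffixes
```

Without the `clear()` call in the setter, the second call would return the stale cached split from before the suffixes were changed.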
Matthew Honnibal da8830d909 Set version to v2.2.0.dev3 2019-09-08 18:22:03 +02:00
Matthew Honnibal 1a65c5b7af Update develop from master 2019-09-08 18:21:41 +02:00
Matthew Honnibal aec6174ae6 Fix lemmatizer 2019-09-08 18:09:53 +02:00
Matthew Honnibal fde4f8ac8e Create lookups if not passed in 2019-09-08 18:08:09 +02:00
Pavle Vidanović d03401f532 Lemmatizer lookup dictionary for Serbian and basic tag set adde… (#4251)
* Serbian stopwords added. (cyrillic alphabet)

* spaCy Contribution agreement included.

* Test initialize updated

* Serbian language code update. --bugfix

* Tokenizer exceptions added. Init file updated.

* Norm exceptions and lexical attributes added.

* Examples added.

* Tests added.

* sr_lang examples update.

* Tokenizer exceptions updated. (Serbian)

* Lemmatizer created. Licence included.

* Test updated.

* Tag map basic added.

* tag_map.py file removed since it uses default spacy tags.
2019-09-08 14:19:15 +02:00
Ivan Šarić b01025dd06 adds Croatian lemma_lookup.json, license file and corresponding tests (#4252) 2019-09-08 13:40:45 +02:00
adrianeboyd aec755d3a3 Modify retokenizer to use span root attributes (#4219)
* Modify retokenizer to use span root attributes

* tag/pos/morph are set to root tag/pos/morph

* lemma and norm are reset and end up as orth (not ideal, but better
than orth of first token)

* Also handle individual merge case

* Add test

* Attempt to handle ent_iob and ent_type in merges

* Fix check for whether B-ENT should become I-ENT

* Move IOB consistency check to after attrs

Move all IOB consistency checks after attrs are set and simplify to
check entire document, modifying I to B at the beginning of the document
or if the entity type of the previous token isn't the same.

* Move IOB consistency check for single merge

Move IOB consistency check after the token array is compressed for the
single merge case.

* Update spacy/tokens/_retokenize.pyx

Co-Authored-By: Matthew Honnibal <honnibal+gh@gmail.com>

* Remove single vs. multiple merge distinction

Remove original single-instance `_merge()` and use `_bulk_merge()` (now
renamed `_merge()`) for all merges.

* Add out-of-bound check in previous entity check
2019-09-08 13:04:49 +02:00
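The IOB consistency check described in #4219 can be sketched in plain Python. This is an illustration under stated assumptions, not the code in `spacy/tokens/_retokenize.pyx`: the function name `fix_iob` and the list of `(iob, ent_type)` pairs are hypothetical stand-ins for the token array.

```python
def fix_iob(tags):
    """Return a copy of (iob, ent_type) pairs with inconsistent "I" tags changed to "B".

    An "I" tag becomes "B" when it is the first token of the document or when
    the previous token has a different entity type.
    """
    fixed = []
    for i, (iob, ent_type) in enumerate(tags):
        if iob == "I":
            prev_type = fixed[i - 1][1] if i > 0 else None
            if i == 0 or prev_type != ent_type:
                iob = "B"
        fixed.append((iob, ent_type))
    return fixed


example = [("I", "ORG"), ("I", "ORG"), ("O", ""), ("I", "PER"), ("I", "LOC")]
result = fix_iob(example)
```

Running the check over the whole document after the attributes are set, as the commit message describes, means a single pass covers every merge instead of patching each merged span separately.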
Sofie Van Landeghem 53a9ca45c9 Docs: bufsize instead of buffsize (#4247) 2019-09-06 11:11:54 +02:00
Sofie Van Landeghem 6b012cebff Make pos/tag distinction more clear in docs (#4246)
* make distinction between tag and pos more prominent in docs

* out of the 101
2019-09-06 10:31:21 +02:00
Bae Yong-Ju a55f5a744f Fix ValueError exception on empty Korean text. (#4245) 2019-09-06 10:29:40 +02:00
Ines Montani 232a029de6 Send referrer for internal links [ci skip] 2019-09-05 10:41:46 +02:00
Matthew Honnibal d039ed2267 Merge pull request #4237 from adrianeboyd/feature/gold-train-orth-variants
Add guillemets/chevrons to German orth variants
2019-09-04 23:10:49 +02:00
Matthew Honnibal b94c34ec8f Merge pull request #4239 from adrianeboyd/bugfix/tokenizer-cache-test-1061
Add regression test for #1061 back to test suite
2019-09-04 23:10:12 +02:00
Adriane Boyd 0f28418446 Add regression test for #1061 back to test suite 2019-09-04 20:42:24 +02:00
Adriane Boyd c39c13f26b Add guillemets/chevrons to German orth variants
Add guillemets/chevrons to German orth variants for both German/Austrian
and Swiss conventions.
2019-09-04 20:05:08 +02:00
Ines Montani 2f31f96fce Update languages.json [ci skip] 2019-09-04 18:15:42 +02:00
Ines Montani 2245e95e2d Update languages.json [ci skip] 2019-09-04 17:11:40 +02:00
Matthew Honnibal 17c039406b Merge pull request #4232 from adrianeboyd/bugfix/entityruler-ner-4229
Fix handling of preset entities in NER
2019-09-04 15:02:31 +02:00
Adriane Boyd 6b0fec76fd Fix handling of preset entities in NER
* Fix check of valid ent_type for B
* Add valid L as preset-I followed by not-I
2019-09-04 13:42:42 +02:00
Ines Montani 419ae59c79 Make flaky test test_issue_1971_4 more explicit 2019-08-31 14:08:05 +02:00
Ines Montani dad5621166 Tidy up and auto-format [ci skip] 2019-08-31 13:39:31 +02:00
Ines Montani cd90752193 Tidy up and auto-format [ci skip] 2019-08-31 13:39:06 +02:00
Ines Montani bcd1b12f43 Add contributor agreement [ci skip] 2019-08-30 17:02:43 +02:00
Matthew Honnibal 67c3d03905 Revert morphology serialisation 2019-08-30 13:13:07 +02:00
Matthew Honnibal efcb51ddc8 Merge pull request #4217 from adrianeboyd/bugfix/morph-en-serialization
Morphology tag_map-related bugfixes
2019-08-30 12:46:29 +02:00
Adriane Boyd 893f11a9e3 Serialize tag_map directly
Fix Aspect_prof typo
2019-08-30 11:30:03 +02:00
Adriane Boyd 02babf9317 English tag map without unsupported features/values 2019-08-30 11:29:19 +02:00
Matthew Honnibal 516650f58f Merge pull request #4207 from svlandeg/bugfix/serialize-tok-exc
Bugfix for serializing tokenizer rules/exceptions
2019-08-30 11:04:58 +02:00
Matthew Honnibal f3c3ce7f1e Update vocab 2019-08-29 21:19:54 +02:00
Matthew Honnibal fc0a3c8c38 Add morphology serialization 2019-08-29 21:17:34 +02:00
Matthew Honnibal c94fc9edb9 Fix noise addition 2019-08-29 15:39:32 +02:00
Matthew Honnibal 32842a3cd4 Disable whitespace corruption 2019-08-29 15:01:58 +02:00
Matthew Honnibal 3c1c0ec18e Add tests for NER oracle with whitespace 2019-08-29 14:33:39 +02:00
Matthew Honnibal 6511e1d8d3 Fix NER gold-standard around whitespace 2019-08-29 14:33:07 +02:00
adrianeboyd 82159b5c19 Updates/bugfixes for NER/IOB converters (#4186)
* Updates/bugfixes for NER/IOB converters

* Converter formats `ner` and `iob` use autodetect to choose a converter if
  possible

* `iob2json` is reverted to handle sentence-per-line data like
  `word1|pos1|ent1 word2|pos2|ent2`

  * Fix bug in `merge_sentences()` so the second sentence in each batch isn't
    skipped

* `conll_ner2json` is made more general so it can handle more formats with
  whitespace-separated columns

  * Supports all formats where the first column is the token and the final
    column is the IOB tag; if present, the second column is the POS tag

  * As in CoNLL 2003 NER, blank lines separate sentences, `-DOCSTART- -X- O O`
    separates documents

  * Add option for segmenting sentences (new flag `-s`)

  * Parser-based sentence segmentation with a provided model, otherwise with
    sentencizer (new option `-b` to specify model)

  * Can group sentences into documents with `n_sents` as long as sentence
    segmentation is available

  * Only applies automatic segmentation when there are no existing delimiters
    in the data

* Provide info about settings applied during conversion with warnings and
  suggestions if settings conflict or might not be optimal.

* Add tests for common formats

* Add '(default)' back to docs for -c auto

* Add document count back to output

* Revert changes to converter output message

* Use explicit tabs in convert CLI test data

* Adjust/add messages for n_sents=1 default

* Add sample NER data to training examples

* Update README

* Add links in docs to example NER data

* Define msg within converters
2019-08-29 12:04:01 +02:00
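The two input formats named in #4186 can be illustrated with a small stand-alone parser. This is a sketch based only on the descriptions in the commit message, not the converters' actual code: the function names `parse_iob_line` and `read_conll_ner` and the returned tuple layout are assumptions.

```python
def parse_iob_line(line):
    """Parse one sentence-per-line entry such as 'word1|pos1|ent1 word2|pos2|ent2'."""
    tokens = []
    for item in line.split():
        # Split from the right so a '|' inside the word itself is preserved.
        word, pos, ent = item.rsplit("|", 2)
        tokens.append((word, pos, ent))
    return tokens


def read_conll_ner(text):
    """Read whitespace-separated columns into documents of sentences.

    First column is the token, last column is the IOB tag, and the second
    column is the POS tag when more than two columns are present. Blank lines
    separate sentences; lines starting with -DOCSTART- separate documents.
    """
    docs = [[]]
    sentence = []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("-DOCSTART-"):
            if sentence:
                docs[-1].append(sentence)
                sentence = []
            if docs[-1]:
                docs.append([])
            continue
        if not line:
            if sentence:
                docs[-1].append(sentence)
                sentence = []
            continue
        cols = line.split()
        if len(cols) < 2:
            raise ValueError("expected at least a token and an IOB tag: %r" % line)
        pos = cols[1] if len(cols) > 2 else None
        sentence.append((cols[0], pos, cols[-1]))
    if sentence:
        docs[-1].append(sentence)
    return [doc for doc in docs if doc]


sample = (
    "-DOCSTART- -X- O O\n\n"
    "EU NNP I-NP B-ORG\nrejects VBZ I-VP O\n\n"
    "Peter NNP I-NP B-PER\n\n"
    "-DOCSTART- -X- O O\n\n"
    "Berlin B-LOC\n"
)
docs = read_conll_ner(sample)
iob_tokens = parse_iob_line("I|PRP|O like|VBP|O London|NNP|B-GPE")
```

Reading only the first, second, and last columns is what lets one reader cover several column layouts, which matches the generalization the commit message describes for `conll_ner2json`.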
adrianeboyd 5feb342f5e Add more token attributes to token pattern schema (#4210)
Add token attributes with tests to token pattern schema.
2019-08-29 12:02:26 +02:00
Matthew Honnibal 216f63a987 Merge pull request #4208 from adrianeboyd/bugfix/orth-vs-noise
Add separate noise vs orth level to train CLI
2019-08-29 10:26:42 +02:00
Adriane Boyd f3906950d3 Add separate noise vs orth level to train CLI 2019-08-29 09:10:35 +02:00
Matthew Honnibal 7d6d438566 Set version to v2.2.0.dev2 2019-08-28 18:30:43 +02:00
Matthew Honnibal bc5ce49859 Fix 'noise_level' in train cmd 2019-08-28 17:55:38 +02:00
Matthew Honnibal 782056d117 Fix morph rules 2019-08-28 16:59:45 +02:00
Matthew Honnibal 6b2ea883ed Merge pull request #4205 from adrianeboyd/feature/gold-train-orth-variants
Add train_docs() option to add orth variants
2019-08-28 16:54:06 +02:00