Commit Graph

11406 Commits

Author SHA1 Message Date
adrianeboyd 697bec764d
Normalize IS_SENT_START to SENT_START for Matcher (#5080) 2020-03-03 12:22:39 +01:00
adrianeboyd 2281c4708c
Restore empty tokenizer properties (#5026)
* Restore empty tokenizer properties

* Check for types in tokenizer.from_bytes()

* Add test for setting empty tokenizer rules
2020-03-02 11:55:02 +01:00
Sofie Van Landeghem c6b12ab02a
Bugfix/get doc (#5049)
* new (broken) unit test

* fixing get_doc method
2020-03-02 11:49:28 +01:00
adrianeboyd 65d7bab10f
Initialize all values in a2b/b2a in new align (#5063) 2020-02-27 18:43:00 +01:00
Matthew Honnibal b4e0d2bf50
Improve Makefile (#5067)
* Improve pex making

* Update gitignore
2020-02-26 20:59:10 +01:00
Adriane Boyd 9f740a9891 Add a few more Danish tokenizer exceptions 2020-02-26 14:59:03 +01:00
Ines Montani 1c212215cd
Merge pull request #5064 from adrianeboyd/feature/german-tokenization
Improve German tokenization
2020-02-26 13:41:44 +01:00
Ines Montani 56978f5cd8
Merge pull request #5060 from svlandeg/feature/update-thinc
update thinc
2020-02-26 13:40:23 +01:00
Adriane Boyd d1f703d78d Improve German tokenization
Improve German tokenization with respect to Tiger.
2020-02-26 13:06:52 +01:00
Ines Montani 54da6a2a07 Update pyproject.toml 2020-02-26 12:51:53 +01:00
Ines Montani ed9358420e Merge branch 'master' into pr/5060 2020-02-26 12:51:29 +01:00
adrianeboyd ff184b7a9c
Add tag_map argument to CLI debug-data and train (#4750) (#5038)
Add an argument for a path to a JSON-formatted tag map, which is used to
update and extend the default language tag map.
2020-02-26 12:10:38 +01:00
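The tag map behaviour described in this commit (a JSON file that updates and extends a default tag map) can be sketched roughly as follows. This is a minimal illustration, not the code from the commit: `DEFAULT_TAG_MAP`, `load_tag_map`, and the sample tags are hypothetical names and data chosen for the example.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical default tag map; real tag maps are language-specific.
DEFAULT_TAG_MAP = {"NN": {"pos": "NOUN"}, "VB": {"pos": "VERB"}}


def load_tag_map(path, default=DEFAULT_TAG_MAP):
    """Return a copy of `default` updated with the entries of a JSON file."""
    with Path(path).open(encoding="utf8") as f:
        overrides = json.load(f)
    merged = dict(default)    # copy, so the default map is left untouched
    merged.update(overrides)  # existing tags are replaced, new tags are added
    return merged


# Usage: write a small JSON tag map to a temporary file and merge it.
with tempfile.TemporaryDirectory() as tmp:
    tag_map_path = Path(tmp) / "tag_map.json"
    tag_map_path.write_text(
        json.dumps({"NN": {"pos": "PROPN"}, "XX": {"pos": "X"}}),
        encoding="utf8",
    )
    merged_tag_map = load_tag_map(tag_map_path)
```

Here "update" means an entry in the JSON file replaces the default entry for the same tag, and "extend" means tags absent from the default map are added.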
svlandeg 18ff97589d update spacy to 2.2.4.dev0 2020-02-26 10:50:05 +01:00
svlandeg 62406a9513 update from thinc 7.4.0.dev2 to 7.4.0 2020-02-26 10:30:35 +01:00
Ines Montani c7e3c034d2
Merge pull request #5061 from explosion/fix/pyproject-toml-master
Update pyproject.toml
2020-02-25 20:22:26 +01:00
Ines Montani dc36ec98a4 Update pyproject.toml 2020-02-25 16:46:14 +01:00
Ines Montani acb4e3c7ba
Merge pull request #5039 from adrianeboyd/typo/website-token-api-shape
Fix formatting in Token API
2020-02-25 14:57:25 +01:00
Ines Montani d50152b917
Merge pull request #5019 from questoph/master
Optimizing tokenization for Luxembourgish (dealing with apostrophe infixes)
2020-02-25 14:48:50 +01:00
Ines Montani 4440a072d2
Merge pull request #5006 from svlandeg/bugfix/multiproc-underscore
load Underscore state when multiprocessing
2020-02-25 14:46:02 +01:00
Ines Montani 38fc05986c
Merge pull request #5058 from bryant1410/patch-1
Add missing comma in a dependency specification
2020-02-25 14:44:29 +01:00
svlandeg d848a68340 thinc 7.4.0.dev2 2020-02-25 12:07:42 +01:00
Santiago Castro 54d8665ff7
Add missing comma in a dependency specification
Conda is complaining that it can't parse that line otherwise.
2020-02-24 16:15:28 -05:00
svlandeg b49a3afd0c use clean_underscore fixture 2020-02-23 15:49:20 +01:00
Ines Montani 4890db6339 Auto-format and fix image [ci skip] 2020-02-23 13:56:50 +01:00
Tom Keefe ddf63b97a8
make idx available via to_array (#5030) 2020-02-22 14:13:06 +01:00
Sofie Van Landeghem 44f4142ce4
add two abbreviations and some additional unit tests (#5040) 2020-02-22 14:12:32 +01:00
Sofie Van Landeghem 479bd8d09f
add lemma option to displacy 'dep' visualiser (#5041)
* add lemma option to displacy 'dep' visualiser

* more compact list comprehension

* add option to doc

* fix test and add lemmas to util.get_doc

* fix capital

* remove lemma from get_doc

* cleanup
2020-02-22 14:11:51 +01:00
Adriane Boyd 3853d385fa Fix formatting in Token API 2020-02-20 13:41:24 +01:00
adrianeboyd 2164e71ea8
Improved Romanian tokenization for UD RRT (#5036)
Modifications to Romanian tokenization to improve tokenization for
UD_Romanian-RRT.
2020-02-19 16:15:59 +01:00
Jan Jessewitsch c7e4fe9c5c
Fix/Improve German stop words (#5024)
* Fix German stop words

Two stop words ("einige" and "einigen") were stuck together.
Remove three nouns that may serve as stop words in a specific context (e.g. religious or news) but are not applicable for general use.

* Create Jan-711.md
2020-02-17 18:59:22 +01:00
Kabir Khan f6ed07b85c
Use nlp.pipe in EntityRuler for phrase patterns in add_patterns (#4931)
* Fix ent_ids and labels properties when id attribute used in patterns

* use set for labels

* sort ent_ids for comparison in entity_ruler tests

* fixing entity_ruler ent_ids test

* add to set

* Run make_doc optimistically if using phrase matcher patterns.

* remove unused coveragerc I was testing with

* format

* Refactor EntityRuler.add_patterns to use nlp.pipe for phrase patterns. Improves speed substantially.

* Removing old add_patterns function

* Fixing spacing

* Make sure token_patterns are loaded as well; previously the generator was being emptied in from_disk
2020-02-16 18:17:47 +01:00
Sofie Van Landeghem 72c964bcf4
define pretrained_dims which is used by build_text_classifier (#5004) 2020-02-16 17:21:17 +01:00
adrianeboyd 3b22eb651b
Sync Span __eq__ and __hash__ (#5005)
* Sync Span __eq__ and __hash__

Use the same tuple for `__eq__` and `__hash__`, including all attributes
except `vector` and `vector_norm`.

* Update entity comparison in tests

Update `assert_docs_equal()` test util to compare `Span` properties for
ents rather than `Span` objects.
2020-02-16 17:20:36 +01:00
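The idea of using the same tuple for `__eq__` and `__hash__` can be illustrated with a small sketch. `SimpleSpan` and its fields are hypothetical stand-ins, not spaCy's `Span`; the sketch only assumes that a vector-like attribute is left out of the comparison key, as the commit message describes for `vector` and `vector_norm`.

```python
class SimpleSpan:
    """Hypothetical stand-in for a span; not spaCy's actual Span class."""

    def __init__(self, doc_id, start, end, label, vector=None):
        self.doc_id = doc_id
        self.start = start
        self.end = end
        self.label = label
        self.vector = vector  # deliberately excluded from the comparison key

    def _key(self):
        # Single source of truth for both equality and hashing.
        return (self.doc_id, self.start, self.end, self.label)

    def __eq__(self, other):
        if not isinstance(other, SimpleSpan):
            return NotImplemented
        return self._key() == other._key()

    def __hash__(self):
        return hash(self._key())


a = SimpleSpan(1, 0, 2, "ORG", vector=[0.1, 0.2])
b = SimpleSpan(1, 0, 2, "ORG", vector=[0.9, 0.9])
c = SimpleSpan(1, 0, 2, "PERSON")
```

Deriving both methods from one tuple keeps the invariant that objects comparing equal also hash equal, so `a` and `b` collapse to a single entry in a set or dict while `c` stays distinct.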
adrianeboyd 0c47a53b5e
Use int only in key2row for better performance (#4990)
Cast all keys and rows to `int` in `vectors.key2row` for more efficient
access and serialization.
2020-02-16 17:19:41 +01:00
adrianeboyd 5b102963bf
Require HEAD for is_parsed in Doc.from_array() (#5011)
Modify flag settings so that `DEP` is not sufficient to set `is_parsed`
and only run `set_children_from_heads()` if `HEAD` is provided.

Then the combination `[SENT_START, DEP]` will set deps and not clobber
sent starts with a lot of one-word sentences.
2020-02-16 17:17:09 +01:00
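The flag logic described in this commit can be sketched as a small decision function. This is an assumption-laden illustration, not the code in `Doc.from_array()`: `parse_flags` and its return keys are hypothetical, and attributes are represented as plain strings.

```python
def parse_flags(attrs):
    """Decide, from the attribute names provided, which steps would run.

    Hypothetical helper: only HEAD marks the doc as parsed and triggers
    recomputing the tree structure; DEP alone does neither.
    """
    has_head = "HEAD" in set(attrs)
    return {
        "is_parsed": has_head,
        "run_set_children_from_heads": has_head,
    }


with_head = parse_flags(["HEAD", "DEP"])
without_head = parse_flags(["SENT_START", "DEP"])
```

Under this sketch, `["SENT_START", "DEP"]` leaves both flags off, which corresponds to the case in the commit message where dependency labels are set without overwriting the provided sentence starts.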
Sofie Van Landeghem 2572460175
add tok2vec parameters to train script to facilitate init_tok2vec (#5021) 2020-02-16 17:16:41 +01:00
Sofie Van Landeghem a27c77ce62
add message when cli train script throws exception (#5009)
* add message when cli train script throws exception

* fix formatting
2020-02-15 15:50:17 +01:00
Christos Aridas ff8e71f46d
Update streamlit app (#5017)
* Update streamlit app [ci skip]

* Add all labels by default

* Tidy up and auto-format

Co-authored-by: Ines Montani <ines@ines.io>
2020-02-15 15:49:09 +01:00
nlptechbook 979a3fd1f5
Update universe.json (#5022)
e-book is available from https://nostarch.com/NLPPython
2020-02-15 15:44:55 +01:00
questoph 5352fc8fc3 Update tokenizer_exceptions.py 2020-02-14 12:02:15 +01:00
questoph d1f0b397b5 Update punctuation.py 2020-02-13 22:18:51 +01:00
svlandeg 6e717c62ed avoid the tests interacting with each other through the global Underscore variable 2020-02-12 13:21:31 +01:00
svlandeg 7939c63886 use English instead of model 2020-02-12 12:26:27 +01:00
svlandeg 46628d8890 add some asserts 2020-02-12 12:12:52 +01:00
svlandeg 51d37033c8 remove old comment 2020-02-12 12:10:05 +01:00
svlandeg 65f5b48b5d add comment 2020-02-12 12:06:27 +01:00
svlandeg 05dedaa2cf add unit test 2020-02-12 12:00:13 +01:00
svlandeg ecbb9c4b9f load Underscore state when multiprocessing 2020-02-12 11:50:42 +01:00
adrianeboyd 99a543367d
Set GPU before loading any models in train CLI (#4989)
Set the GPU before loading any existing models in the train CLI so that
you can start with a base model and train on GPU.
2020-02-11 17:45:41 -05:00
adrianeboyd 842dfddbb9
Standardize Greek tag map setup (#4997)
* Rename `tag_map.py` to `tag_map_fine.py` to indicate that it's not the
default tag map
* Remove duplicate generic UD tag map and load `../tag_map.py` instead
2020-02-11 17:44:56 -05:00