Commit Graph

10173 Commits

Author SHA1 Message Date
svlandeg 4142e8dd1b train and predict per article (saving time for doc encoding) 2019-05-13 17:02:34 +02:00
svlandeg 3b81b00954 evaluating on dev set during training 2019-05-13 14:26:04 +02:00
Ines Montani 8baff1c7c0
💫 Improve introspection of custom extension attributes (#3729)
* Add custom __dir__ to Underscore (see #3707)

* Make sure custom extension methods keep their docstrings (see #3707)

* Improve tests

* Prepend note on partial to docstring (see #3707)

* Remove print statement

* Handle cases where docstring is None
2019-05-12 00:53:11 +02:00
Ines Montani f96af8526a Merge branch 'spacy.io' [ci skip] 2019-05-11 23:03:56 +02:00
Matthew Honnibal 3aceeeaaeb Set version to v2.1.4 2019-05-11 22:57:53 +02:00
Ines Montani aea1c93a05 Replace cytoolz.partition_all with util.minibatch 2019-05-11 21:12:09 +02:00
Ines Montani 0bf6441863 Fix .iob converter (closes #3620) 2019-05-11 19:15:26 +02:00
Matthew Honnibal f6e9394aa5 Fix push-tag script 2019-05-11 19:04:35 +02:00
Matthew Honnibal a5159ddcf5 Set version to v2.1.4.dev1 2019-05-11 19:03:51 +02:00
Ines Montani 7534f7cb44 Fix return value of Language.update (closes #3692) 2019-05-11 18:40:19 +02:00
Ines Montani 503b8c85f1 Add TWiML podcast to universe [ci skip] 2019-05-11 17:48:22 +02:00
Ines Montani 0daf2422a3 Auto-format 2019-05-11 17:48:07 +02:00
Ines Montani 6b3a79ac96 Call rmtree and copytree with strings (closes #3713) 2019-05-11 15:48:35 +02:00
devforfu 21af12eb53 Make "text" key in JSONL format optional when "tokens" key is provided (#3721)
* Fix issue with forcing text key when it is not required

* Extending the docs to reflect the new behavior
2019-05-11 15:41:29 +02:00
Ines Montani 6cfa1e1f47 Fix DependencyParser.predict docs (resolves #3561) 2019-05-11 15:37:54 +02:00
Ines Montani 25f5592d57 Improve Token.prob and Lexeme.prob docs (resolves #3701) 2019-05-11 15:23:41 +02:00
Aaron Kub 719a15f23d fixing regex matcher examples (#3708) (#3719) 2019-05-10 14:23:52 +02:00
Luca Dorigo 82d034f976 Update glossary.py to match information found in documentation (#3704) (closes ##3679)
* Update glossary.py to match information found in documentation

I used regexes to add any dependency tag that was in the documentation but not in the glossary. Solves #3679 👍

* Adds forgotten colon
2019-05-10 14:23:20 +02:00
Wannaphong Phatthiyaphaibun 5a14a13f64 fix thai bug (#3693)
fix tokenize for pythainlp
2019-05-10 14:21:34 +02:00
Luca Dorigo 2663f4133c Submit contributor agreement (#3705) 2019-05-10 14:19:18 +02:00
Ines Montani 65b55f1aaa Add version tag to `--base-model` argument (closes #3720) 2019-05-10 14:06:47 +02:00
svlandeg b6d788064a some first experiments with different architectures and metrics 2019-05-10 12:53:14 +02:00
svlandeg 9d089c0410 grouping clusters of instances per doc+mention 2019-05-09 18:11:49 +02:00
svlandeg c6ca8649d7 first stab at model - not functional yet 2019-05-09 17:23:19 +02:00
richardpaulhudson a1e07f0d14 Request to include Holmes in spaCy Universe (#3685)
* Request to add Holmes to spaCy Universe

Dear spaCy team, I would be grateful if you would consider my Python library Holmes for inclusion in the spaCy Universe. Holmes transforms the syntactic structures delivered by spaCy into semantic structures that, together with various other techniques including ontological matching and word embeddings, serve as the basis for information extraction. Holmes supports several use cases including chatbot, structured search, topic matching and supervised document classification. I had the basic idea for Holmes around 15 years ago and now spaCy has made it possible to build an implementation that is stable and fast enough to actually be of use - thank you! At present Holmes supports English and German (I am based in Munich) but could easily be extended to support any other language with a spaCy model.

* Added
2019-05-08 02:42:03 +02:00
Ines Montani 505c9e0e19 Add util.filter_spans helper (#3686) 2019-05-08 02:33:40 +02:00
svlandeg 9f33732b96 using entity descriptions and article texts as input embedding vectors for training 2019-05-07 16:03:42 +02:00
F0rge1cE dd1e6b0bc6 Fix offset bug in loading pre-trained word2vec. (#3689)
* Fix offset bug in loading pre-trained word2vec.

* add contributor agreement
2019-05-06 23:00:38 +02:00
Bram Vanroy 8e6f8deaf6 Re-added Universe readme (#3688) (closes #3680) 2019-05-06 21:08:01 +02:00
Ines Montani 78cb807a9a Auto-format [ci skip] 2019-05-06 16:58:29 +02:00
svlandeg 7e348d7f7f baseline evaluation using highest-freq candidate 2019-05-06 15:13:50 +02:00
Ines Montani dd153b2b33 Simplify helper (see #3681) [ci skip] 2019-05-06 15:13:10 +02:00
Ines Montani f8fce6c03c Fix typo (see #3681) 2019-05-06 15:02:11 +02:00
Ines Montani f2a56c1b56 Rewrite example to use Retokenizer (resolves #3681)
Also add helper to filter spans
2019-05-06 14:51:18 +02:00
svlandeg 6961215578 refactor code to separate functionality into different files 2019-05-06 10:56:56 +02:00
Brad Jascob 955b95cb8b Fix inconsistant lemmatizer issue #3484 (#3646)
* Fix inconsistant lemmatizer issue #3484

* Remove test case
2019-05-04 18:16:03 +02:00
svlandeg f5190267e7 run only 100M of WP data as training dataset (9%) 2019-05-03 18:09:09 +02:00
svlandeg 4e929600e5 fix WP id parsing, speed up processing and remove ambiguous strings in one doc (for now) 2019-05-03 17:37:47 +02:00
svlandeg 34600c92bd try catch per article to ensure the pipeline goes on 2019-05-03 15:10:09 +02:00
Ines Montani b4d142e3c4 Adjust wording and formatting [ci skip] 2019-05-03 12:00:31 +02:00
Ines Montani 04658ebbb2 Relax jsonschema pin (closes #3628) 2019-05-03 11:58:58 +02:00
d5555 ba4bcbf285 Update universe.json (#3653) [ci skip]
* Update universe.json

* Update universe.json
2019-05-03 11:50:12 +02:00
svlandeg bbcb9da466 creating training data with clean WP texts and QID entities true/false 2019-05-03 10:44:29 +02:00
svlandeg cba9680d13 run NER on clean WP text and link to gold-standard entity IDs 2019-05-02 17:24:52 +02:00
svlandeg 581dc9742d parsing clean text from WP articles to use as input data for NER and NEL 2019-05-02 17:09:56 +02:00
svlandeg 8353552191 cleanup 2019-05-01 23:26:16 +02:00
svlandeg 1ae41daaa9 allow small rounding errors 2019-05-01 23:05:40 +02:00
Dobita21 f95ecedd83 Add Thai lex_attrs (#3655)
* test sPacy commit to git fri 04052019 10:54

* change Data format from my format to master format

* ทัทั้งนี้ ---> ทั้งนี้

* delete stop_word translate from Eng

* Adjust formatting and readability

* add Thai norm_exception

* Add Dobita21 SCA

* editรึ : หรือ,

* Update Dobita21.md

* Auto-format

* Integrate norms into language defaults

* add acronym and some norm exception words

* add lex_attrs

* Add lexical attribute getters into the language defaults

* fix LEX_ATTRS


Co-authored-by: Donut <dobita21@gmail.com>
Co-authored-by: Ines Montani <ines@ines.io>
2019-05-01 12:03:14 +02:00
张晓飞 ba1ff00370 update response after calling add_pipe (#3661)
* update response after calling add_pipe

component:print_info is appened in the last, so need show it at the end of  pipeline

* Create henry860916.md
2019-05-01 12:02:18 +02:00
BreakBB 8952004dfc Update French example sents and add two German stop words (#3662)
* Update french example sentences

* Add 'anderem' and 'ihren' to German stop words
2019-05-01 12:01:35 +02:00