11406 Commits

Author SHA1 Message Date
Sofie Van Landeghem 1c01842588
add pyx and pxd files to the distribution (#5000) 2020-02-11 17:42:17 -05:00
Antti Ajanki e1f777b151
Improvements for Finnish tokenizer (#4985)
* don't split on a colon. Colon is used to attach suffixes for abbreviations
* tokenize on any of LIST_HYPHENS (except a single hyphen), not just on --
* simplify infix rules by merging similar rules
2020-02-10 20:32:43 -05:00
Julin S 479e81bafc
fix link (#4977) 2020-02-10 20:31:26 -05:00
adrianeboyd 5d8cb60e43
Update lower pin for srsly to 1.0.1 (#4976) 2020-02-10 20:30:54 -05:00
Ines Montani 9c08d9baa3 Remove old sections [ci skip] (closes #4961) 2020-02-03 13:10:46 +01:00
Filip Bednárik d4f4060bf3
Add Slovak language tools implementation (#4943)
* Add correct stopwords for Slovak language

* Add SNK Tags

* Disable formatting lint for TAGS

* Add example sentences for Slovak language

* Add Slovak numerals in base form

* Add lex_attrs to sk init

* Add contributor agreement
2020-02-03 13:03:59 +01:00
Tyler Couto 9fa9d7f2cb
Fix for Issue 4665 - conllu2json (#4953)
* Fix for Issue 4665 - conllu2json

- Allowing HEAD to be an underscore

* Added contributor agreement
2020-02-03 13:01:48 +01:00
Ines Montani abd5c06374 Adjust formatting [ci skip] 2020-02-03 13:00:02 +01:00
Martin A. Kayser 02a44c5be2
Adding a note on retrieving the string rep of the match_id (#4904)
Stolen from here: https://stackoverflow.com/questions/47638877/using-phrasematcher-in-spacy-to-find-multiple-match-types
2020-02-03 12:58:58 +01:00
Omri Mendels 6ff947e1f9
Added presidio-research to universe.json (#4950)
* Added presidio-research to universe.json

Added a reference to Presidio Research, the data-science toolbox for Microsoft Presidio.

* Updated url
2020-02-03 12:57:55 +01:00
Matthew Honnibal d031440de2
Update setup.cfg 2020-01-29 17:35:46 +01:00
Paco Nathan 49fefb6139 Submitting `PyTextRank` for inclusion in the spaCy uniVerse (#4942)
* submitting PyTextRank for consideration of including in the spaCy uniVerse

* including SCA
2020-01-28 11:37:54 +01:00
adrianeboyd a938566b62 Fix Sentencizer.pipe() for empty doc (#4940) 2020-01-28 11:36:49 +01:00
adrianeboyd 7ad000fce7 Update docs for train CLI --use_gpu option (#4927) 2020-01-20 17:02:47 +01:00
Yohei Tamura 708a4d27eb fix nlp.evaluate (#4924) (#4925)
* new file:   test_issue4924.py

* modified:   spacy/gold.pyx

* modified:   test_issue4924.py for python2
2020-01-20 12:17:46 +01:00
Kabir Khan b9afcd56e3 Fix ent_ids and labels properties when id attribute used in patterns (#4900)
* Fix ent_ids and labels properties when id attribute used in patterns

* use set for labels

* sort ent_ids for comparison in entity_ruler tests

* fixing entity_ruler ent_ids test

* add to set
2020-01-16 02:01:31 +01:00
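The idea behind this fix (collecting labels and ids in sets so each value is reported once) can be sketched in plain Python. This is a minimal illustration, not the EntityRuler code: the helper names `collect_labels` and `collect_ent_ids` are hypothetical, and the only thing taken from spaCy is the pattern dictionary shape with `label`, `pattern` and an optional `id` key.

```python
# Hypothetical helpers; not the actual EntityRuler implementation.
# Patterns follow the EntityRuler dict shape: "label", "pattern", optional "id".
PATTERNS = [
    {"label": "ORG", "pattern": "Apple", "id": "apple"},
    {"label": "ORG", "pattern": "Apple Inc.", "id": "apple"},
    {"label": "GPE", "pattern": "San Francisco", "id": "san-francisco"},
    {"label": "GPE", "pattern": "Berlin"},  # no id on this pattern
]


def collect_labels(patterns):
    """Return each label once, sorted, regardless of how many ids share it."""
    return tuple(sorted({p["label"] for p in patterns}))


def collect_ent_ids(patterns):
    """Return each id once, sorted; patterns without an id are skipped."""
    return tuple(sorted({p["id"] for p in patterns if "id" in p}))


labels = collect_labels(PATTERNS)
ent_ids = collect_ent_ids(PATTERNS)
```

Sorting the results makes comparisons in tests independent of set ordering, which is the concern the "sort ent_ids for comparison" bullet addresses.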
Sofie Van Landeghem fbfc418745 run normal textcat train script with transformers (#4834)
* keep trf tok2vec and wordpiecer components during update

* also support transformer models for other example scripts
2020-01-16 02:01:23 +01:00
adrianeboyd 90c52128dc Improve train CLI with base model (#4911)
Improve train CLI with a provided base model so that you can:

* add a new component
* extend an existing component
* replace an existing component

When the final model and best model are saved, reenable any disabled
components and merge the meta information to include the full pipeline
and accuracy information for all components in the base model plus the
newly added components if needed.
2020-01-16 01:58:51 +01:00
Bram Vanroy 718704022a Changes to spacy_conll in universe (#4914)
* Update information on spacy_conll

* Typo fix
2020-01-16 01:56:39 +01:00
Matthew Honnibal 1785eebfe0
Merge pull request #4909 from svlandeg/bugfix/cnn_window
bugfix typo conv_window
2020-01-14 11:23:14 +01:00
svlandeg ee828d5a9a bugfix typo conv_window 2020-01-14 09:02:58 +01:00
Sofie Van Landeghem c70ccd543d Friendly error warning for NEL example script (#4881)
* make model positional arg and raise error if no vectors

* small doc fixes
2020-01-14 01:51:14 +01:00
adrianeboyd d24bca62f6 Add CJK to character classes (#4884)
* Add CJK character class as uncased

* Incorporate Chinese URL test case

Un-xfail Chinese URL test instance
2020-01-08 16:50:19 +01:00
Preston Badeer b216ff43c9 Update vectors-similarity.md (#4889)
These links are broken on the website, due to quotes around the URLs.
2020-01-08 16:49:40 +01:00
adrianeboyd aef83e8070 Mark most Hungarian tokenizer test cases as slow (#4883)
* Mark most Hungarian tokenizer test cases as slow

Mark most Hungarian tokenizer test cases as slow to reduce the runtime
of the test suite in ordinary usage:

* for normal tests: run default tests plus 10% of the detailed tests
* for slow tests: run all tests

* Rework to mark individual tests as slow
2020-01-08 12:34:06 +01:00
Sofie Van Landeghem 7b96a5e10f Reduce mem usage in training Entity Linker (#4811)
* move nlp processing for el pipe to batch training instead of preprocessing

* adding dev eval back in, and limit in articles instead of entities

* use pipe whenever possible

* few more small doc changes

* access dev data through generator

* tqdm description

* small fixes

* update documentation
2020-01-06 14:59:50 +01:00
Sofie Van Landeghem 6e9b61b49d add warning in debug_data for punctuation in entities (#4853) 2020-01-06 14:59:28 +01:00
adrianeboyd d652ff215d Add trailing whitespace to multiline test text (#4877) 2020-01-06 14:58:59 +01:00
adrianeboyd de69bc6509 Fix and improve URL pattern (#4882)
* match domains longer than `hostname.domain.tld` like `www.foo.co.uk`
* expand allowed characters in domain names while only matching
lowercase TLDs so that "this.That" isn't matched as a URL and can be
split on the period as an infix (relevant for at least English, German,
and Tatar)
2020-01-06 14:58:30 +01:00
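The two rules described in this commit can be illustrated with a deliberately simplified regular expression. This is a sketch under stated assumptions, not spaCy's actual URL pattern: `URL_SKETCH` and `looks_like_url` are hypothetical names, and the pattern only covers bare host names with ASCII characters.

```python
import re

# Hypothetical, simplified pattern; NOT the pattern used by spaCy.
# Any number of dot-separated host labels (mixed case allowed),
# followed by a TLD that must be lowercase.
URL_SKETCH = re.compile(
    r"^(?:[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?\.)+"  # labels, e.g. "www.foo.co."
    r"[a-z]{2,}$"  # lowercase-only TLD, e.g. "uk"
)


def looks_like_url(text: str) -> bool:
    """Return True if the text matches the simplified host-name sketch."""
    return URL_SKETCH.match(text) is not None


long_domain = looks_like_url("www.foo.co.uk")  # more than hostname.domain.tld
capitalized = looks_like_url("this.That")      # TLD starts with an uppercase letter
```

Because "this.That" fails the lowercase TLD check, a tokenizer using a rule like this would leave the string available for splitting on the period as an infix.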
Sofie Van Landeghem a1b22e90cd serialize ENT_ID (#4852)
* expand serialization test for custom token attribute

* add failing test for issue 4849

* define ENT_ID as attr and use in doc serialization

* fix few typos
2020-01-06 14:57:34 +01:00
Geoffrey Gordon Ashbrook 53929138d7 remove extra word typo (#4875)
"let you find you"
2020-01-06 12:37:42 +01:00
Ines Montani 400257a802 Update index.md [ci skip] 2020-01-04 01:52:18 +01:00
Al Johri 1aa2d4dac9 stop rendering mathjax by default in displacy (#4840)
* stop rendering mathjax by default in displacy

* Replace f-string and add comment

Co-authored-by: Ines Montani <ines@ines.io>
2020-01-01 13:15:05 +01:00
Anastasiia Iurshina db9257559c Adds script shebang (#4846) 2019-12-29 14:25:05 +01:00
Anastasiia Iurshina 1830a12578 Fixes typos (#4843)
* Fixes typos

* Fixes typo

* Contributor agreement
2019-12-29 14:24:13 +01:00
Ivan Echevarria ef13e0c038 Add n_process to Language.pipe documentation (#4842) [ci skip]
* Add n_process to documentation

* Auto-format and add default [ci skip]

Co-authored-by: Ines Montani <ines@ines.io>
2019-12-29 14:23:33 +01:00
Al Johri fd4a7bd2b7 sign contributor agreement for AlJohri (#4839) [ci skip] 2019-12-29 14:17:28 +01:00
Ines Montani 3431ac42de Fix typo 2019-12-21 21:17:45 +01:00
Ines Montani 7c69d30de5 Tidy up and expect warning 2019-12-21 21:14:52 +01:00
Sofie Van Landeghem 732142bf28 facilitate larger training files (#4827)
* add warning for large file and change start var to long

* type for file_length
2019-12-21 21:12:19 +01:00
Ines Montani cb4145adc7 Tidy up and auto-format 2019-12-21 19:04:17 +01:00
Olamilekan Wahab a741de7cf6 Adding support for Yoruba Language (#4614)
* Adding Support for Yoruba

* test text

* Updated test string.

* Fixing encoding declaration.

* Adding encoding to stop_words.py

* Added contributor agreement and removed iranlowo.

* Added removed test files and removed iranlowo to keep project bare.

* Returned CONTRIBUTING.md to default state.

* Added deleted conftest entries

* Tidy up and auto-format

* Revert CONTRIBUTING.md

Co-authored-by: Ines Montani <ines@ines.io>
2019-12-21 14:11:50 +01:00
Ines Montani 1b838d1313 Divide models into core and starters [ci skip] 2019-12-21 14:10:22 +01:00
Ines Montani 0750d59e5a Allow setting ner_missing_tag on docs_to_json 2019-12-21 13:47:21 +01:00
Sofie Van Landeghem 8ebbb85117 Documentation for PhraseMatcher constructor (#4826)
* add max_length as argument for init PhraseMatcher

* improve error message too
2019-12-20 23:00:04 +01:00
Sofie Van Landeghem 12158c1e3a Restore tqdm imports (#4804)
* set 4.38.0 to minimal version with color bug fix

* set imports back to proper place

* add upper range for tqdm
2019-12-16 13:12:19 +01:00
Ines Montani c466e02466 Update universe [ci skip] 2019-12-13 15:57:39 +01:00
Sofie Van Landeghem 557dcf5659 NEL requires sentences to be set (#4801) 2019-12-13 15:55:18 +01:00
tamuhey 1707e77c5e add char_span to Span (#4793) 2019-12-13 15:54:58 +01:00
Sofie Van Landeghem f9b541f9ef More robust set entities method in KB (#4794)
* add unit test for setting entities with duplicate identifiers

* count the number of actual unique identifiers and throw duplicate warning
2019-12-13 10:45:29 +01:00
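The behaviour named in the bullets above (count the unique identifiers, warn about duplicates) can be sketched without spaCy. This is a minimal illustration, not the KnowledgeBase implementation: `set_entities_sketch` is a hypothetical function, and keeping the first occurrence of a duplicated identifier is an assumption made for the example.

```python
import warnings


def set_entities_sketch(entity_list, freq_list):
    """Hypothetical stand-in for a KB 'set entities' step.

    Stores one entry per unique identifier and warns if duplicates were seen.
    Assumption: the first occurrence of a duplicated identifier is kept.
    """
    if len(entity_list) != len(freq_list):
        raise ValueError("entity_list and freq_list must have the same length")

    entries = {}
    duplicates = []
    for entity, freq in zip(entity_list, freq_list):
        if entity in entries:
            duplicates.append(entity)  # remember it, but do not overwrite
            continue
        entries[entity] = freq

    if duplicates:
        warnings.warn(
            f"Ignored {len(duplicates)} duplicate identifier(s): "
            f"{sorted(set(duplicates))}"
        )
    return entries


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = set_entities_sketch(["Q1", "Q2", "Q1"], [10, 5, 3])
```

Sizing storage from the number of unique identifiers rather than the length of the input list is what makes this kind of method robust to repeated entries.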