2010 Commits

Author SHA1 Message Date
Ines Montani 0d07d7fc80 Apply emoticon exceptions to tokenizer 2016-12-07 21:11:59 +01:00
Ines Montani 71f0f34cb3 Fix formatting 2016-12-07 21:11:29 +01:00
Ines Montani 9413bcd9ee Declare encoding and unicode literals 2016-12-07 21:10:34 +01:00
Ines Montani a280ff2657 Fix __all__ 2016-12-07 21:10:12 +01:00
Ines Montani ba8721953c Add missing emoticons 2016-12-07 21:09:44 +01:00
Ines Montani 1285c4ba93 Update English language data 2016-12-07 20:33:28 +01:00
Ines Montani 79dce0aabe Add emoticons 2016-12-07 20:33:28 +01:00
Ines Montani a662a95294 Add line breaks 2016-12-07 20:33:28 +01:00
Ines Montani 07f0efb102 Add test for tokenizer regular expressions 2016-12-07 20:33:28 +01:00
Ines Montani e0712d1b32 Reformat language data 2016-12-07 20:33:28 +01:00
Matthew Honnibal 0c0f4c965d Increment version 2016-12-03 11:16:52 +01:00
Matthew Honnibal f6e356aada Add (and test) Span.sentiment attribute. By default we average token.span, but can override with custom hook. Re Issue #667 2016-12-02 11:05:50 +01:00
Janneke van der Zwaan 88869e0e07 Merge github.com:explosion/spaCy into dutch 2016-11-30 17:13:39 +01:00
Janneke van der Zwaan 51ade86b86 Update language data with tag map from UD_Dutch 2016-11-30 14:41:23 +01:00
Janneke van der Zwaan 90f6ff12c9 Update Dutch language data: use Dutch tag map; remove tokenizer exceptions 2016-11-30 11:59:39 +01:00
dafnevk 7b8f4c49f2 Added language Dutch to init file 2016-11-29 16:42:05 +01:00
Matthew Honnibal 296d33a4fc Merge branch 'master' of ssh://github.com/explosion/spaCy 2016-11-26 12:36:18 +01:00
Matthew Honnibal 1f6c37c6f5 Fix create_tokenizer when nlp is None 2016-11-26 12:36:04 +01:00
Matthew Honnibal c7889492f9 Fix model saving error for Python 3 2016-11-25 18:04:30 -06:00
Matthew Honnibal bc0a202c9c Fix unicode problem in nonproj module 2016-11-25 17:29:17 -06:00
Matthew Honnibal 6dd3b94fa6 Filter out deprecated attributes when reading special-case tokenization rules. 2016-11-25 09:57:18 -06:00
Matthew Honnibal e879c79b8c Merge branch 'master' of https://github.com/explosion/spaCy 2016-11-25 09:18:28 -06:00
Matthew Honnibal a335c6dcc2 Exclude morphs from deprecated token attributes for now 2016-11-25 16:17:32 +01:00
Matthew Honnibal f799a07f25 Merge branch 'master' of https://github.com/explosion/spaCy 2016-11-25 09:16:43 -06:00
Matthew Honnibal 159e8c46e1 Merge old training fixes with newer state 2016-11-25 09:16:36 -06:00
Matthew Honnibal 846e80f2f4 Exclude morphs from deprecated token attributes for now 2016-11-25 16:14:54 +01:00
Matthew Honnibal 664f2dd1c0 Allow dep to be None in scorer, for missing labels. 2016-11-25 09:02:49 -06:00
Matthew Honnibal 39341598bb Fix NER label calculation 2016-11-25 09:02:22 -06:00
Matthew Honnibal ca773a1f53 Tweak arc_eager n_gold to deal with negative costs, and improve error message. 2016-11-25 09:01:52 -06:00
Matthew Honnibal a2f55e7015 Pass cfg through loading, for training. 2016-11-25 09:01:20 -06:00
Matthew Honnibal 608d8f5421 Pass cfg through parser, and have is_valid default to 1, not 0 when resetting state 2016-11-25 09:00:21 -06:00
Matthew Honnibal cc7e607a8a Fix gold.pyx for 1.0 2016-11-25 08:57:59 -06:00
root 080d29e092 Fix train.py for 1.0 2016-11-25 08:55:33 -06:00
Matthew Honnibal 6652f2a135 Test #656, #624: special case rules for tokenizer with attributes. 2016-11-25 12:44:13 +01:00
Matthew Honnibal 1e0f566d95 Fix #656, #624: Support arbitrary token attributes when adding special-case rules. 2016-11-25 12:43:24 +01:00
Matthew Honnibal 87613edf8f Add set_struct_attr staticmethod to token 2016-11-25 12:41:47 +01:00
Matthew Honnibal fb69aa648f Merge branch 'master' of ssh://github.com/explosion/spaCy 2016-11-25 11:35:44 +01:00
Matthew Honnibal 9a03a3f85e Add get_struct_attr staticmethod to Token, to match Lexeme.get_struct_attr. 2016-11-25 11:35:17 +01:00
Matthew Honnibal 53d8ca8f51 Add spacy.attrs.intify_attrs function, to normalize strings in token attribute dictionaries. 2016-11-25 11:34:30 +01:00
Ines Montani d21ad01840 Add emoticons 2016-11-24 19:13:00 +01:00
dafnevk d8c7ac203a Added nl module for dutch 2016-11-24 16:39:49 +01:00
dafnevk 3db8b0d322 Added language class and some language data (with some TODOs) for Dutch 2016-11-24 15:56:38 +01:00
Ines Montani 4dcfafde02 Add line breaks 2016-11-24 14:57:37 +01:00
Ines Montani 6247c005a2 Add test for tokenizer regular expressions 2016-11-24 13:51:59 +01:00
Ines Montani de747e39e7 Reformat language data 2016-11-24 13:51:32 +01:00
Matthew Honnibal b8c4f5ea76 Allow German noun chunks to work on Span. Update the German noun chunks iterator, so that it also works on Span objects. 2016-11-24 23:30:15 +11:00
Pokey Rule 3e3bda142d Add noun_chunks to Span 2016-11-24 10:47:20 +00:00
Janneke van der Zwaan 83daade0e4 Add directory and initial (empty) files for language Dutch 2016-11-24 09:45:41 +01:00
Matthew Honnibal 09f68bc641 Fix Issue #639: stop words in language class not used. This patch is messy, but it's better not to change too much until the language data loading can be properly refactored. 2016-11-24 00:13:55 +01:00
Matthew Honnibal 48e1dc29d4 Fix default path loading. 2016-11-23 23:48:55 +01:00
Matthew Honnibal e01c1875ee Work on test for #615 2016-11-23 23:48:41 +01:00
ExplodingCabbage 6c4f488e89 Fix syntax mistake 2016-11-23 15:12:45 +00:00
Matthew Honnibal 60eb2343ce Only try to load vectors if they exist. 2016-11-23 13:50:24 +01:00
Matthew Honnibal 618ac36093 Fix use of path argument in Language.__init__. Needs to be keyword arg, not positional. 2016-11-23 13:26:34 +01:00
Mark Amery fbe19680a6 Fix another bug related to Language.__init__'s path parameter 2016-11-20 20:31:34 +00:00
Mark Amery b0a07c21a0 Fix `path` param of `Language.__init__` always being ignored. There was an explicitly-declared `path` keyword argument, so 'path' would never be present in `**overrides`. This line just overwrote any manually-specified value the user might've passed to the `path` parameter. 2016-11-20 16:29:57 +00:00
Mark Amery 1988fce389 Merge remote-tracking branch 'origin/master' into specify-data-path 2016-11-20 16:07:14 +00:00
Mark Amery 3871007c72 Let --data-path be specified when running download.py scripts. Resolves https://github.com/explosion/spaCy/issues/637 2016-11-20 15:48:04 +00:00
Ines Montani dad2c6cae9 Strip trailing whitespace 2016-11-20 16:45:51 +01:00
Ines Montani 3082e49326 Update and reformat German stopwords 2016-11-20 16:45:26 +01:00
Sourav Singh 6745eac309 Update language_data.py 2016-11-20 19:52:02 +05:30
Sourav Singh 4d9aae7d6a Add German Stopwords 2016-11-19 22:47:53 +05:30
Matthew Honnibal 7afb2544a7 Merge pull request #627 from sadovnychyi/patch-1: Remove duplicated line of vocab declaration 2016-11-16 06:09:18 +11:00
Yanhao 762169da29 Fixed bug: eg.guess is a tag id, rather than tag 2016-11-15 14:11:22 +08:00
Dmytro Sadovnychyi e70a7050e1 Remove duplicated line of vocab declaration. As already declared on line 211. 2016-11-13 18:52:49 +08:00
Matthew Honnibal f123f92e0c Fix #617: Vocab.load() required Path. Should work with string as well. 2016-11-10 22:48:48 +01:00
Matthew Honnibal e86f440ca6 Fix test for issue 617 2016-11-10 22:48:10 +01:00
Matthew Honnibal faa7610c56 Merge branch 'master' of ssh://github.com/explosion/spaCy 2016-11-10 22:46:38 +01:00
Matthew Honnibal a2c7de8329 spacy/tests/regression/test_issue617.py: Test Issue #617 2016-11-10 22:46:23 +01:00
tiago 2a3e342c1f Added a test case to cover the span.merge returning values 2016-11-09 18:57:50 +00:00
tiago b38cfd0ef9 now span.merge returns token like it says on documentation 2016-11-09 14:58:19 +00:00
Dmitry Sadovnychyi 9488222e79 Fix PhraseMatcher to work with updated Matcher (#613) 2016-11-09 00:14:26 +08:00
Dmitry Sadovnychyi 86c056ba64 Add basic test for PhraseMatcher (#613) 2016-11-09 00:10:32 +08:00
Matthew Honnibal 3ea15b257f Fix test for 605 2016-11-06 11:59:26 +01:00
Matthew Honnibal efe7790439 Test #590: Order dependence in Matcher rules. 2016-11-06 11:21:36 +01:00
Matthew Honnibal 5cd3acb265 Fix #605: Acceptor now rejects matches as expected. 2016-11-06 10:50:42 +01:00
Matthew Honnibal 75805397dd Test Issue #605 2016-11-06 10:42:32 +01:00
Matthew Honnibal 014b6936ac Fix #608 -- __version__ should be available at the base of the package. 2016-11-04 21:21:02 +01:00
Matthew Honnibal 42b0736db7 Increment version 2016-11-04 20:04:21 +01:00
Matthew Honnibal 9f93386994 Update version 2016-11-04 19:28:16 +01:00
Matthew Honnibal 1fb09c3dc1 Fix morphology tagger 2016-11-04 19:19:09 +01:00
Matthew Honnibal a36353df47 Temporarily put back the tokenize_from_strings method, while tests aren't updated yet. 2016-11-04 19:18:07 +01:00
Matthew Honnibal f0917b6808 Fix Issue #376: and/or was tagged as a noun. 2016-11-04 15:21:28 +01:00
Matthew Honnibal 737816e86e Fix #368: Tokenizer handled pattern 'unicode close quote, period' incorrectly. 2016-11-04 15:16:20 +01:00
Matthew Honnibal ab952b4756 Fix #578 -- Sputnik had been purging all files on --force, not just the relevant one. 2016-11-04 10:44:11 +01:00
Matthew Honnibal 6e37ba1d82 Fix #602, #603 --- Broken build 2016-11-04 09:54:24 +01:00
Matthew Honnibal 293c79c09a Fix #595: Lemmatization was incorrect for base forms, because morphological analyser wasn't adding morphology properly. 2016-11-04 00:29:07 +01:00
Matthew Honnibal e30348b331 Prefer to import from symbols instead of parts_of_speech 2016-11-04 00:27:55 +01:00
Matthew Honnibal 4a8a2b6001 Test #595 -- Bug in lemmatization of base forms. 2016-11-04 00:27:32 +01:00
Matthew Honnibal f1605df2ec Fix #588: Matcher should reject empty pattern. 2016-11-03 00:16:44 +01:00
Matthew Honnibal 72b9bd57ec Test Issue #588: Matcher accepts invalid, empty patterns. 2016-11-03 00:09:35 +01:00
Matthew Honnibal 41a90a7fbb Add tokenizer exception for 'Ph.D.', to fix 592. 2016-11-03 00:03:34 +01:00
Matthew Honnibal 532318e80b Import Jieba inside zh.make_doc 2016-11-02 23:49:19 +01:00
Matthew Honnibal f292f7f0e6 Fix Issue #599, by considering empty documents to be parsed and tagged. Implementation is a bit dodgy. 2016-11-02 23:48:43 +01:00
Matthew Honnibal b6b01d4680 Remove deprecated tokens_from_list test. 2016-11-02 23:47:21 +01:00
Matthew Honnibal 3d6c79e595 Test Issue #599: .is_tagged and .is_parsed attributes not reflected after deserialization for empty documents. 2016-11-02 23:40:11 +01:00
Matthew Honnibal 05a8b752a2 Fix Issue #600: Missing setters for Token attribute. 2016-11-02 23:28:59 +01:00
Matthew Honnibal 125c910a8d Test Issue #600 2016-11-02 23:24:13 +01:00
Matthew Honnibal e0c9695615 Fix doc strings for tokenizer 2016-11-02 23:15:39 +01:00
Matthew Honnibal 80824f6d29 Fix test 2016-11-02 20:48:40 +01:00