3579 Commits

Author SHA1 Message Date
Matthew Honnibal 0ca5832427 Improve negative example handling in NER oracle 2017-07-20 00:18:49 +02:00
Matthew Honnibal a231b56d40 Add text-classification hook to pipeline 2017-07-20 00:18:15 +02:00
Matthew Honnibal 7ea50182a5 Add support for text-classification labels to GoldParse 2017-07-20 00:17:47 +02:00
Matthew Honnibal 727481377e Add text-classifier thinc models 2017-07-20 00:17:17 +02:00
Matthew Honnibal f014138c11 Fix parser tests 2017-07-20 00:16:52 +02:00
Ines Montani c91642efd5 Port over changes from #1168 2017-07-01 11:43:54 +02:00
Jim Regan d81ceb0cd5 Merge branch 'develop' into polish 2017-06-26 22:42:27 +01:00
Jim O'Regan 2f84c73585 a start 2017-06-26 22:40:04 +01:00
Jim O'Regan 28d7f0a672 reference 2017-06-26 22:38:28 +01:00
Matthew Honnibal 91e52543ef Merge pull request #1118 from Gregory-Howard/patch-2: Update _tokenizer_exceptions_list (adding cities) 2017-06-20 11:16:07 +02:00
Matthew Honnibal 8ea785e01a Merge pull request #1119 from oroszgy/patch-3: Fixed conllu converter 2017-06-20 11:14:41 +02:00
Tpt 7745b3ae04 Adds noun chunks to French syntax iterators 2017-06-12 15:29:58 +02:00
Tpt 57e8254f63 Adds function to extract french noun chunks 2017-06-12 15:20:49 +02:00
György Orosz 62dbf9025c Fixed conllu converter 2017-06-09 22:53:56 +02:00
Grégory Howard cd974b32b7 Update _tokenizer_exceptions_list (adding cities) 2017-06-09 17:58:18 +02:00
ines 34a2eecb17 Add simple "naughty strings" test (see #1107) 2017-06-06 17:43:51 +02:00
ines 045574a936 Update package name and increment version 2017-06-05 20:41:30 +02:00
Matthew Honnibal 1f5874a927 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2017-06-05 20:20:00 +02:00
ines 03db56f48c Detect spaCy version and add package title. Package title allows customised package names (like spacy-nightly) 2017-06-05 20:11:02 +02:00
Matthew Honnibal c0d90f52f7 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2017-06-05 19:20:13 +02:00
ines cc9c5dc7a3 Fix noun chunks test 2017-06-05 16:39:04 +02:00
Matthew Honnibal 836bfa2d0f Add factory for experimental SimilarityHook component 2017-06-05 15:40:22 +02:00
Matthew Honnibal d59fa32df1 Add experimental SimilarityHook component 2017-06-05 15:40:03 +02:00
Matthew Honnibal 5489b49203 Remove print statement 2017-06-05 13:20:41 +02:00
Matthew Honnibal fc4204a12a Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2017-06-05 13:13:23 +02:00
Matthew Honnibal 2479cde446 Support disable keyword in Language.__init__ 2017-06-05 13:13:07 +02:00
ines ea167e14db Fix model package loading from link 2017-06-05 13:10:49 +02:00
ines dd6dc4c120 Update spacy.load() helper functions 2017-06-05 13:02:31 +02:00
Matthew Honnibal b4cdd05466 Add vectors.pyx in setup 2017-06-05 12:45:29 +02:00
Matthew Honnibal 280d419529 Add pickle method for vectors 2017-06-05 12:36:04 +02:00
Matthew Honnibal 30369d580f Start testing Vectors class 2017-06-05 12:32:49 +02:00
Matthew Honnibal eb7cbb62c2 Flesh out Vectors class 2017-06-05 12:32:08 +02:00
ines 51d7414e94 Make sure sents are a list 2017-06-05 12:30:13 +02:00
Matthew Honnibal ebb6c49cd5 Make alignment case-insensitive for gold 2017-06-04 20:26:42 -05:00
Matthew Honnibal fc4dd62e84 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2017-06-04 20:19:05 -05:00
Matthew Honnibal 8f8f90b46b Disable labeller if not parsing 2017-06-04 20:18:54 -05:00
Matthew Honnibal c52fde40f4 Improve train CLI 2017-06-04 20:18:37 -05:00
Matthew Honnibal a053b1218e Fix item counting during training 2017-06-04 20:18:20 -05:00
Matthew Honnibal b3b5521625 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2017-06-04 20:17:18 -05:00
Matthew Honnibal 9bc4a26213 Add option of data augmentation noise 2017-06-04 20:16:57 -05:00
Matthew Honnibal 7b2ede783d Add SP tag to tag map if missing 2017-06-04 20:16:30 -05:00
ines a0f4592f0a Update tests 2017-06-05 02:26:13 +02:00
ines 3e105bcd36 Update tests 2017-06-05 02:09:27 +02:00
Matthew Honnibal 516798e9fc Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2017-06-05 01:35:21 +02:00
Matthew Honnibal 193bf913c0 Set is_tagged=True after tagging 2017-06-05 01:35:07 +02:00
ines 078232932c Fix tokenizer fixture scope 2017-06-05 01:06:34 +02:00
Matthew Honnibal 58be0e1f6f Update tests 2017-06-04 16:35:06 -05:00
Matthew Honnibal b78cc318c3 Fix loading of morphology exceptions 2017-06-04 16:34:32 -05:00
Matthew Honnibal bb98d45a63 Fix tests 2017-06-04 16:00:44 -05:00
Matthew Honnibal 55d0621532 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2017-06-04 15:53:25 -05:00
Matthew Honnibal 5b9f116aca Update tests 2017-06-04 15:53:17 -05:00
Matthew Honnibal 2a3bd5ee90 Fix fetching of noun chunk iterator 2017-06-04 15:53:05 -05:00
Matthew Honnibal 3680c51b8f Avoid clobbering preset POS tags 2017-06-04 15:52:42 -05:00
Matthew Honnibal 939e8ed567 Add lookup properties for components in Language 2017-06-04 15:52:09 -05:00
Matthew Honnibal e28f90b672 Fix syntax iterators 2017-06-04 15:51:50 -05:00
ines 8a29308d0b Remove unused imports 2017-06-04 22:39:29 +02:00
Ines Montani 112c5787eb Merge pull request #1101 from oroszgy/hu_tokenizer_fix: More robust Hungarian tokenizer. 2017-06-04 22:37:51 +02:00
ines 96867a24ae Fix typo 2017-06-04 22:36:40 +02:00
ines f432bb4b48 Fix fixture scopes 2017-06-04 22:34:31 +02:00
Matthew Honnibal 6d0356e6cc Whitespace 2017-06-04 14:55:24 -05:00
Matthew Honnibal 8a683a4494 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2017-06-04 21:53:56 +02:00
Matthew Honnibal 92ae36f84e Improve way noun chunks iterator is looked up 2017-06-04 21:53:39 +02:00
ines 9254a3dd78 Import and add Spanish syntax iterators 2017-06-04 21:42:15 +02:00
ines 7db1a0e83e Make sure printed values are always strings 2017-06-04 21:27:20 +02:00
Matthew Honnibal 51e1541ddb Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2017-06-04 14:26:29 -05:00
Matthew Honnibal add9a33782 Return False for vocab.has_vector 2017-06-04 14:26:14 -05:00
Matthew Honnibal 675f448313 Fix vector linkage on Doc 2017-06-04 14:25:30 -05:00
Matthew Honnibal f4662e9218 Fix vector linkage for token 2017-06-04 14:19:58 -05:00
ines 070e026ed9 Ensure path on read_json 2017-06-04 20:44:37 +02:00
ines e1e73936b1 Raise correct error 2017-06-04 20:44:27 +02:00
ines 848e47669e Fix typo 2017-06-04 20:44:15 +02:00
ines c4614c02a2 Fix dev resources URL 2017-06-04 15:45:50 +02:00
ines a66cf24ee8 xfail tokenizer serialization tests for now. Tests pass locally, but not on Travis – needs more investigation 2017-06-04 13:58:20 +02:00
ines 7b7d46b64e Fix typo and success message 2017-06-04 13:45:50 +02:00
ines 90d117f378 Update version 2017-06-04 13:41:16 +02:00
Matthew Honnibal 7ca215bc26 Resolve lex_attr_getters conflict 2017-06-03 16:12:01 -05:00
Matthew Honnibal 21eef90dbc Support specifying which GPU 2017-06-03 16:10:23 -05:00
Matthew Honnibal d0e42f9275 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2017-06-03 15:30:32 -05:00
Matthew Honnibal 8a17b99b1c Use NORM attribute, not LOWER 2017-06-03 15:30:16 -05:00
ines 4c643d74c5 Add norm exceptions to other Language classes 2017-06-03 22:29:21 +02:00
ines fa7e576c57 Change order of exception dicts 2017-06-03 21:52:06 +02:00
Matthew Honnibal 3f5c85d8de Reorder setting of lex attrs, to avoid clobbering 2017-06-03 14:47:55 -05:00
Matthew Honnibal aeb7520133 Make norm use lower-case 2017-06-03 14:47:38 -05:00
Matthew Honnibal de3954843e Populate norm exceptions with lower-case 2017-06-03 14:47:12 -05:00
Matthew Honnibal f6955a459c Fix prev commit 2017-06-03 14:38:37 -05:00
Matthew Honnibal 468ca6c760 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2017-06-03 14:33:51 -05:00
Matthew Honnibal c647a0d33e Fix training counter for gold preprocessing 2017-06-03 14:33:39 -05:00
ines e47eef5e03 Update German tokenizer exceptions and tests 2017-06-03 21:07:44 +02:00
ines d77c2cc8bb Add tests for English norm exceptions 2017-06-03 20:59:50 +02:00
ines 0d6fa8b241 Add German norm exceptions 2017-06-03 20:54:18 +02:00
ines 5bd311c77e Fix update of norm exceptions 2017-06-03 20:54:09 +02:00
Matthew Honnibal 94e063ae2a Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2017-06-03 13:31:40 -05:00
Matthew Honnibal fea1144e6d Set max batch size in evaluate 2017-06-03 13:31:33 -05:00
Matthew Honnibal 805495af27 Fix off-by-one in number of tags 2017-06-03 13:29:23 -05:00
Matthew Honnibal e62f46d39f Clarify gold.pyx slightly 2017-06-03 13:28:52 -05:00
Matthew Honnibal 43353b5413 Improve train CLI script 2017-06-03 13:28:20 -05:00
ines 746653880c Add English norm exceptions to lex_attrs 2017-06-03 20:27:28 +02:00
ines 095eeeb12f Update English tokenizer exceptions and add norms 2017-06-03 20:27:16 +02:00
ines e5d426406a Add base norm exceptions 2017-06-03 20:27:05 +02:00
ines 4c2bbc3ccc Add add_lookups util function 2017-06-03 19:44:47 +02:00
ines 05fe6758a7 Set lexeme attributes for tokenizer special cases 2017-06-03 19:44:39 +02:00
ines 3152ee5ca2 Update serialization tests for tokenizer 2017-06-03 17:05:28 +02:00
ines 7c919aeb09 Make sure serializers and deserializers are ordered 2017-06-03 17:05:09 +02:00
ines 1ebd0d3f27 Add assert_packed_msg_equal util function 2017-06-03 17:04:30 +02:00
ines de974f7bef Add serializer tests for tokenizer 2017-06-03 13:26:34 +02:00
ines 0153b66a86 Return self in Tokenizer.from_bytes 2017-06-03 13:26:13 +02:00
ines 82154a1861 Add letter spacing to arrow label 2017-06-03 13:25:41 +02:00
ines 32c6f05de9 Adjust spacing and sizing in compact mode 2017-06-03 13:25:32 +02:00
ines cc8c8617a4 Shut down displaCy server on KeyboardInterrupt 2017-06-03 13:24:56 +02:00
ines 70fbba7d08 Clone Doc to never merge punctuation on original Doc 2017-06-03 13:24:43 +02:00
ines 459a1e8470 Fix whitespace 2017-06-03 11:31:18 +02:00
ines 5109bba910 Port over fix from #1070 2017-06-03 11:31:11 +02:00
ines d21459f87d Update serializer tests 2017-06-02 21:42:26 +02:00
ines 6669583f4e Use OrderedDict 2017-06-02 21:07:56 +02:00
ines 2f1025a94c Port over Spanish changes from #1096 2017-06-02 19:09:58 +02:00
ines d86e7cde93 Add entity recognizer to parser serialization tests 2017-06-02 18:40:06 +02:00
ines 0051c05964 Add tests for serializing parser 2017-06-02 18:37:19 +02:00
ines fdd0923be4 Translate model=True in exclude to lower_model and upper_model 2017-06-02 18:37:07 +02:00
ines cef547a9f0 Add serialization tests for tensorizer 2017-06-02 18:18:30 +02:00
ines 924c58bde3 Fix serialization of optional elements 2017-06-02 18:18:17 +02:00
ines f74a45c1fe Remove unnecessary argument 2017-06-02 18:17:46 +02:00
ines 43b4d63f85 Add serialization tests for tagger 2017-06-02 17:29:34 +02:00
ines 1b593bbd6d Fix encoding on tagger serialization 2017-06-02 17:29:21 +02:00
Matthew Honnibal 5f4d328e2c Fix serialization of tag_map in NeuralTagger 2017-06-02 10:18:37 -05:00
Matthew Honnibal ed6f575e06 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2017-06-02 04:26:39 -05:00
ines acd65c00f6 Add serialization tests for StringStore and Vocab 2017-06-02 10:57:42 +02:00
ines 41a6adf1f6 Initialise Vocab length correctly 2017-06-02 10:57:25 +02:00
ines 53b82f972a Add strings to Vocab in init, instead of StringStore 2017-06-02 10:57:06 +02:00
ines 023f38bdd4 Fix return value of Vocab.from_bytes 2017-06-02 10:56:40 +02:00
ines 9692c98f57 Add test utils for temp file and temp dir 2017-06-02 10:56:09 +02:00
Matthew Honnibal c650bc481c Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2017-06-01 13:03:57 -05:00
Matthew Honnibal 307d615c5f Fix serialization for tagger when tag_map has changed 2017-06-01 12:18:36 -05:00
Matthew Honnibal 1d18cedae8 Fiddle with msgpack bytes vs unicode 2017-06-01 10:48:43 -05:00
ines 7a2380f617 Rename "nn_tagger" to "tagger" 2017-06-01 17:37:53 +02:00
ines e5ae6ccf4e Fix typo 2017-06-01 16:46:15 +02:00
ines a3e4f91f4a Only load vocab if it exists 2017-06-01 14:38:35 +02:00
Matthew Honnibal d310b0aab3 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2017-06-01 04:58:03 -05:00
Matthew Honnibal 3ff7d7fcef Merge for updated requirements 2017-06-01 04:57:47 -05:00
Matthew Honnibal 5eae3b9a1e Fix to/from disk in tagger 2017-06-01 04:55:49 -05:00
ines d5c8d2f5fd Update about.py and increment version 2017-06-01 11:52:24 +02:00
Matthew Honnibal 4c97371051 Fixes for thinc 6.7 2017-06-01 04:22:16 -05:00
Matthew Honnibal 53d00a0371 Move weight serialization to Thinc 2017-06-01 03:04:36 -05:00
Matthew Honnibal ae8010b526 Move weight serialization to Thinc 2017-06-01 02:56:12 -05:00
Gyorgy Orosz f0c3b09242 More robust Hungarian tokenizer. 2017-05-31 22:28:40 +02:00
Matthew Honnibal c8a58cfcf8 Fix Python2/3 load bug 2017-05-31 15:21:44 -05:00
Matthew Honnibal 99982684b0 Fix normalize_string_keys function 2017-05-31 14:08:16 -05:00
Matthew Honnibal 67ade63fc4 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2017-05-31 08:28:42 -05:00
Matthew Honnibal 490b38e6bb Fix reference to thinc copy_array util 2017-05-31 08:25:21 -05:00
Matthew Honnibal 9805e0e369 Fix vocab pickling 2017-05-31 08:25:01 -05:00
Matthew Honnibal 6c51cd77b4 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2017-05-31 15:06:56 +02:00
Matthew Honnibal 8dfb9546f0 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2017-05-31 07:21:14 -05:00
Matthew Honnibal 480ef8bfc8 Add compat function to normalize dict keys 2017-05-31 07:14:29 -05:00
Matthew Honnibal 92f9e5cc9a Silence env_opt, and fix serialization for GPU 2017-05-31 07:14:11 -05:00
Matthew Honnibal 0561df2a9d Fix tokenizer serialization 2017-05-31 14:12:38 +02:00
Matthew Honnibal 4a398c15b7 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2017-05-31 13:44:16 +02:00
Matthew Honnibal 097ab9c6e4 Fix transition system to/from disk 2017-05-31 13:44:00 +02:00
Matthew Honnibal b1469d3360 Fix string serialisation 2017-05-31 13:43:44 +02:00
Matthew Honnibal e9419072e7 Fix tokenizer serialisation 2017-05-31 13:43:31 +02:00
Matthew Honnibal 33e5ec737f Fix to/from disk methods 2017-05-31 13:43:10 +02:00
ines 5e1c361270 Update tests README with info on model tests 2017-05-31 12:22:58 +02:00
Matthew Honnibal fe28602f2e Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2017-05-31 11:43:56 +02:00
Matthew Honnibal 66af019d5d Fix serialization of tokenizer 2017-05-31 11:43:40 +02:00
Ines Montani e6cf3c7e1c Merge pull request #1093 from oroszgy/hu_emoji_fix: Fixed emoji handling for Hungarian 2017-05-31 11:33:24 +02:00
Matthew Honnibal e98eff275d Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2017-05-31 10:29:15 +02:00
Matthew Honnibal 53a3824334 Fix mistake in ner feature 2017-05-31 03:01:02 +02:00
Matthew Honnibal 8a693c2605 Write binary file during training 2017-05-31 02:59:18 +02:00
Matthew Honnibal 498ad85309 Try using tensor for vector/similarity methods 2017-05-30 23:35:17 +02:00
Matthew Honnibal a131981f3b Work on vectors 2017-05-30 23:34:50 +02:00
Matthew Honnibal 6937e311a4 Update doc tests 2017-05-30 23:34:23 +02:00
Matthew Honnibal cc911feab2 Fix bug in NER state 2017-05-30 22:12:19 +02:00
Gyorgy Orosz 8c0b4b850e Fixed emoji handling for Hungarian 2017-05-30 21:34:46 +02:00
Matthew Honnibal be4a640f0c Fix arc eager label costs for uint64 2017-05-30 20:37:58 +02:00
Matthew Honnibal b127645afc Fix test_misc merge conflict 2017-05-29 18:31:44 -05:00
Matthew Honnibal e0e8eae7c7 Tweak package test 2017-05-29 18:30:42 -05:00
Matthew Honnibal 11840ff5dd Store tag map before normalizing props 2017-05-29 17:53:48 -05:00
Matthew Honnibal b92a89f87b Make it easier to reference embedding tables 2017-05-29 17:53:29 -05:00
Matthew Honnibal 293d1b425b Serialize in consistent order 2017-05-29 17:53:06 -05:00
Matthew Honnibal 9bf22a94aa Fix tag set serialisation 2017-05-29 17:52:36 -05:00
Matthew Honnibal 2a061e2777 Fix serialisation, for reals this time 2017-05-29 17:52:08 -05:00
ines 20a7003c0d Update model fixtures and reorganise tests 2017-05-29 22:14:31 +02:00
ines 795fe43a4d Add load_test_model function with importorskip(). Loads model only if it can be imported, i.e. if it's installed as a package. 2017-05-29 22:11:31 +02:00
ines ad3c8b3ad9 Fix formatting 2017-05-29 22:10:50 +02:00
ines 6e3937efc5 Check for arguments of model markers to specify models to test. Lets user set --models --en for only English models 2017-05-29 22:10:16 +02:00
Matthew Honnibal 35d981241f Fix model deserialization 2017-05-29 14:46:31 -05:00
Matthew Honnibal 5b29f227ae Fix serialization 2017-05-29 14:35:53 -05:00
Matthew Honnibal 1e6df0a2a1 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2017-05-29 14:30:12 -05:00
ines 08382f21e3 Pass model meta to nlp object in load_model 2017-05-29 20:44:11 +02:00
ines 6145fe6a93 Catch all kwargs on Language 2017-05-29 20:43:48 +02:00
ines 0d7d50fe22 Add __version__ to __init__.py 2017-05-29 20:43:24 +02:00
Matthew Honnibal 6522ea6c8b More serialization fixes. Still broken 2017-05-29 13:23:47 -05:00
Matthew Honnibal 9c9ee24411 Fix broken lambda scoping in Python 2 2017-05-29 13:23:28 -05:00
Matthew Honnibal f1acdaab55 Fix serialization of weight offsets 2017-05-29 13:23:11 -05:00
Matthew Honnibal c044e9c21c Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2017-05-29 08:41:02 -05:00
Matthew Honnibal aa4c33914b Work on serialization 2017-05-29 08:40:45 -05:00
ines 9e83a17e95 Use new model templates 2017-05-29 15:27:24 +02:00
ines 567485a818 Fix and document model loading with pipeline and overrides 2017-05-29 14:10:10 +02:00
Matthew Honnibal deac7eb01c Fix for serialization 2017-05-29 13:54:18 +02:00
Matthew Honnibal 04c32aa091 Fix for serialization 2017-05-29 13:53:32 +02:00
Matthew Honnibal a1960c2d09 Fix for serialization 2017-05-29 13:47:42 +02:00
Matthew Honnibal 7b06bb896e Fix for serialization 2017-05-29 13:42:55 +02:00