Commit Graph

156 Commits

Author SHA1 Message Date
ines e1efd589c3 Fix json imports and use ujson 2017-04-15 12:13:34 +02:00
ines c05ec4b89a Add compat functions and remove old workarounds
Add ensure_path util function to handle checking instance of path
2017-04-15 12:11:16 +02:00
ines d24589aa72 Clean up imports, unused code, whitespace, docstrings 2017-04-15 12:05:47 +02:00
ines 561f2a3eb4 Use consistent formatting for docstrings 2017-04-15 11:59:21 +02:00
Matthew Honnibal d013aba7b5 Merge branch 'master' of https://github.com/explosion/spaCy 2017-03-17 18:30:53 +01:00
Matthew Honnibal 854cfce7cf Make vocabs more compatible across versions
Previously, symbols were inserted into the string-store
before strings were loaded. This meant that adding a symbol
would invalidate saved models. We now make sure that strings
are loaded faithfully, so that compatibility is maintained.
2017-03-17 18:29:04 +01:00
Matthew Honnibal 1cc841e600 Merge branch 'master' of https://github.com/explosion/spaCy 2017-03-17 08:18:11 -05:00
Matthew Honnibal 4bfc55b532 Auto-add words to vocab when loading vectors
When calling vocab.load_vectors_from_bin_loc, ensure that missing
entries are added to the vocab. Otherwise, loading vectors into an
empty vocab object resulted in no vectors being added.
2017-03-17 08:15:59 -05:00
Matthew Honnibal 4382f175b3 Squelch compiler warnings 2017-03-11 12:44:43 -06:00
Matthew Honnibal d814892805 Hackish pickle support for Vocab. 2017-03-07 20:25:12 +01:00
ines aa92d4e9b5 Fix unicode regex for Python 2 (see #834) 2017-02-16 23:49:54 +01:00
ines 85d249d451 Revert "Revert "Merge pull request #836 from raphael0202/load_vectors (closes #834)""
This reverts commit ea05f78660.
2017-02-16 23:26:25 +01:00
ines ea05f78660 Revert "Merge pull request #836 from raphael0202/load_vectors (closes #834)"
This reverts commit 7d8c9eee7f, reversing
changes made to f6b69babcc.
2017-02-16 15:27:12 +01:00
Raphaël Bournhonesque e17dc2db75 Remove useless import 2017-02-16 12:10:24 +01:00
Raphaël Bournhonesque 3fd2742649 load_vectors should accept arbitrary space characters as word tokens
Fix bug  #834
2017-02-16 12:08:30 +01:00
Daniel Hershcovich 99eb494a82 Fix #737: support loading word vectors with " " as a word 2017-01-12 17:00:14 +02:00
Daniel Hershcovich 8e603cc917 Avoid "True if ... else False" 2017-01-11 11:18:22 +02:00
Matthew Honnibal cade536d1e Merge branch 'master' of ssh://github.com/explosion/spaCy 2016-12-27 21:04:10 +01:00
Matthew Honnibal ce4539dafd Allow the vocabulary to grow to 10,000, to prevent cold-start problem. 2016-12-27 21:03:45 +01:00
Ines Montani 8978806ea6 Allow Vocab to load without serializer_freqs 2016-12-21 18:05:23 +01:00
Ines Montani be8ed811f6 Remove trailing whitespace 2016-12-21 18:04:41 +01:00
Matthew Honnibal 6ee1df93c5 Set tag_map to None if it's not seen in the data by vocab 2016-12-18 16:51:10 +01:00
Matthew Honnibal 1e0f566d95 Fix #656, #624: Support arbitrary token attributes when adding special-case rules. 2016-11-25 12:43:24 +01:00
Matthew Honnibal f123f92e0c Fix #617: Vocab.load() required Path. Should work with string as well. 2016-11-10 22:48:48 +01:00
Matthew Honnibal b86f8af0c1 Fix doc strings 2016-11-01 12:25:36 +01:00
Matthew Honnibal 6036ec7c77 Fix vector norm when loading lexemes. 2016-10-23 19:40:18 +02:00
Matthew Honnibal 3e688e6d4b Fix issue #514 -- serializer fails when new entity type has been added. The fix here is quite ugly. It's best to add the entities ASAP after loading the NLP pipeline, to mitigate the brittleness. 2016-10-23 17:45:44 +02:00
Matthew Honnibal f62088d646 Fix compile error 2016-10-23 14:50:50 +02:00
Matthew Honnibal a0a4ada42a Fix calculation of L2-norm for Lexeme 2016-10-23 14:44:45 +02:00
Matthew Honnibal 7ab03050d4 Add resize_vectors method to Vocab 2016-10-21 01:44:50 +02:00
Matthew Honnibal f5fe4f595b Fix json loading, for Python 3. 2016-10-20 21:23:26 +02:00
Matthew Honnibal 5ec32f5d97 Fix loading of GloVe vectors, to address Issue #541 2016-10-20 18:27:48 +02:00
Matthew Honnibal d10c17f2a4 Fix Issue #536: oov_prob was 0 for OOV words. 2016-10-19 23:38:47 +02:00
Matthew Honnibal 2bbb050500 Fix default of serializer_freqs 2016-10-18 19:55:41 +02:00
Matthew Honnibal 2cc515b2ed Add add_flag method to Vocab, re Issue #504. 2016-10-14 12:15:38 +02:00
Matthew Honnibal ea23b64cc8 Refactor training, with new spacy.train module. Defaults still a little awkward. 2016-10-09 12:24:24 +02:00
Matthew Honnibal ca32a1ab01 Revert "Work on Issue #285: intern strings into document-specific pools, to address streaming data memory growth. StringStore.__getitem__ now raises KeyError when it can't find the string. Use StringStore.intern() to get the old behaviour. Still need to hunt down all uses of StringStore.__getitem__ in library and do testing, but logic looks good."
This reverts commit 8423e8627f.
2016-09-30 20:20:22 +02:00
Matthew Honnibal 1f1cd5013f Revert "Changes to vocab for new stringstore scheme"
This reverts commit a51149a717.
2016-09-30 20:10:30 +02:00
Matthew Honnibal a51149a717 Changes to vocab for new stringstore scheme 2016-09-30 20:01:19 +02:00
Matthew Honnibal 8423e8627f Work on Issue #285: intern strings into document-specific pools, to address streaming data memory growth. StringStore.__getitem__ now raises KeyError when it can't find the string. Use StringStore.intern() to get the old behaviour. Still need to hunt down all uses of StringStore.__getitem__ in library and do testing, but logic looks good. 2016-09-30 10:14:47 +02:00
Matthew Honnibal 95aaea0d3f Refactor so that the tokenizer data is read from Python data, rather than from disk 2016-09-25 14:49:53 +02:00
Matthew Honnibal df88690177 Fix encoding of path variable 2016-09-24 21:13:15 +02:00
Matthew Honnibal af847e07fc Fix usage of pathlib for Python3 -- turning paths to strings. 2016-09-24 21:05:27 +02:00
Matthew Honnibal 453683aaf0 Fix spacy/vocab.pyx 2016-09-24 20:50:31 +02:00
Matthew Honnibal fd65cf6cbb Finish refactoring data loading 2016-09-24 20:26:17 +02:00
Matthew Honnibal 83e364188c Mostly finished loading refactoring. Design is in place, but doesn't work yet. 2016-09-24 15:42:01 +02:00
Matthew Honnibal 872695759d Merge pull request #306 from wbwseeker/german_noun_chunks
add German noun chunk functionality
2016-04-08 00:54:24 +10:00
Henning Peters b8f63071eb add lang registration facility 2016-03-25 18:54:45 +01:00
Wolfgang Seeker 5e2e8e951a add baseclass DocIterator for iterators over documents
add classes for English and German noun chunks

the respective iterators are set for the document when created by the parser
as they depend on the annotation scheme of the parsing model
2016-03-16 15:53:35 +01:00
Wolfgang Seeker 03fb498dbe introduce lang field for LexemeC to hold language id
put noun_chunk logic into iterators.py for each language separately
2016-03-10 13:01:34 +01:00