Commit Graph

8896 Commits

Author SHA1 Message Date
Matthew Honnibal 82277f63a3 💫 Small efficiency fixes to tokenizer (#2587)
This patch improves tokenizer speed by about 10%, and reduces memory usage in the `Vocab` by removing a redundant index. The `vocab._by_orth` and `vocab._by_hash` indexed on different data in v1, but in v2 the orth and the hash are identical.

The patch also fixes an uninitialized variable in the tokenizer, the `has_special` flag. This checks whether a chunk we're tokenizing triggers a special-case rule. If it does, then we avoid caching within the chunk. This check led to incorrectly rejecting some chunks from the cache. 

With the `en_core_web_md` model, we now tokenize the IMDB train data at 503,104k words per second. Prior to this patch, we had 465,764k words per second.

Before switching to the regex library and supporting more languages, we had 1.3m words per second for the tokenizer. In order to recover the missing speed, we need to:

* Fix the variable-length lookarounds in the suffix, infix and `token_match` rules
* Improve the performance of the `token_match` regex
* Switch back from the `regex` library to the `re` library.

## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
2018-07-24 23:35:54 +02:00
ines 3c30d1763c Merge branch 'master' into develop 2018-07-21 15:34:18 +02:00
Matthew Honnibal 1a1c7304cf Set version to 2.0.12.dev1 2018-07-21 13:08:01 +02:00
Matthew Honnibal a723fafea3 Require thinc 6.10.3.dev1 2018-07-21 12:49:09 +02:00
ines 1ea881c80b Allow ignoring warnings and only overwrite if set explicitly 2018-07-20 22:50:19 +02:00
ines 95641f4026 Only install pathlib backport on Python < 3.4 2018-07-20 21:08:29 +02:00
Matthew Honnibal e0caf3ae8c Fix msgpack for new version 2018-07-20 17:32:00 +02:00
Matthew Honnibal 899f1cf442 Add regression test for issue 2179 2018-07-20 17:15:44 +02:00
Matthew Honnibal 9db77fd914 Fix deserialization for msgpack 2018-07-20 14:11:09 +02:00
Matthew Honnibal adde3826e2 Build against thinc 6.10.3.dev0 2018-07-20 13:34:54 +02:00
katarkor 5ca853bee0 changed tag_map, morph_rules, lemmatizer for Norwegian (#2565)
* changed tag_map, morph_rules, lemmatizer for Norwegian

* Move unicode declaration up

Hopefully fixes test failure on Python 2

* Update CONTRIBUTOR_AGREEMENT.md

* Move unicode declarations

Hopefully fixes test this time

* Revert "Merge remote-tracking branch 'origin/patch-1'"

This reverts commit f5ccd5dd0d, reversing
changes made to dd07e180ea.

* Update contributor agreement [ci skip]
2018-07-19 19:38:24 +02:00
kororo 2784babef9 Add ExcelCy into Universe list (#2572)
Hi guys,

This is my first spaCy extension. I am excited to able to do this. Please do let me know if there is any suggestions or modifications I need to do. Feel free to use/contribute the repo that I made.

## Description
ExcelCy is a SpaCy toolkit to help improve the data training experiences. It provides easy annotation using Excel file format. It has helper to pre-train entity annotation with phrase and regex matcher pipe.

### Types of change
Update to Universe list in website.

## Checklist
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
2018-07-19 19:28:33 +02:00
ines 4339f64128 Merge branch 'master' into develop 2018-07-19 16:15:03 +02:00
ines d489ffb78b Fix formatting [ci skip] 2018-07-19 13:22:25 +02:00
ines c0b62ce13c Ignore pytest cache 2018-07-19 12:30:09 +02:00
Ines Montani e7b075565d
💫 Rule-based NER component (#2513)
* Add helper function for reading in JSONL

* Add rule-based NER component

* Fix whitespace

* Add component to factories

* Add tests

* Add option to disable indent on json_dumps compat

Otherwise, reading JSONL back in line by line won't work

* Fix error code
2018-07-18 19:43:16 +02:00
ines d84b13e02c Merge branch 'master' into develop 2018-07-18 18:57:00 +02:00
Ole Henrik Skogstrøm 6e2930a4a2 Conll(u)-bio converter (#2525)
* Started simple conllxbiluo converter

* Fix missing BIO to BILUO conversion
2018-07-18 18:55:42 +02:00
ines 02aefe7cc0 Merge branch 'master' into develop 2018-07-18 18:52:59 +02:00
Ioannis Daras 6ed18412d0 Greek language optimizations (#2558)
* Greek language optimizations

* Add encoding on files containing greek words

* Add encoding on files containing greek words
2018-07-18 18:51:38 +02:00
ines 80e7485630 Merge branch 'master' into develop 2018-07-18 17:28:47 +02:00
Xiang Ji 19a5ef1c58 Fix venv command examples (#2560) [ci skip]
* Fix venv command examples

The documentation refers to `venv`, which is native to Python3.
However, the command examples are as if they were still `virtualenv`,
which is a package independent of `venv`:

- It doesn't need to be installed via `pip`. In fact `pip install venv` would
return an error.
- The correct way to invoke `venv` is `python3 -m venv`, not `venv`, which would
return command not found.

See https://docs.python.org/3/library/venv.html

I suspect the documentation simply replaced all occurrences of `virtualenv` with
`venv`. However they are different modules and are used differently.

* Update comment [ci skip]
2018-07-18 10:31:24 +02:00
Paul O'Leary McCann 61ef0739b8 Add Japanese stop words. (#2549)
List created by taking the 2000 top words from a Wikipedia dump and
removing everything that wasn't hiragana.

Tried going through kanji words and deciding what to keep but there were
too many obvious non-stopwords (東京 was in the top 500) and many other
words where it wasn't clear if they should be included or not.
2018-07-17 10:12:48 +02:00
Tero K f35980f865 Enhancement/lang fi examples (#2547)
* Added a file with examples in finnish

* added contributor agreement
2018-07-15 09:50:27 +02:00
Paul O'Leary McCann 1987f3f784 Add Japanese lemmas (#2543)
This info was already available from Mecab, forgot to add it before.
2018-07-13 10:55:14 +02:00
ines 50c367ee96 Update meta [ci skip] 2018-07-10 13:51:45 +02:00
ines 3a321e79ac Merge branch 'master' into develop 2018-07-10 13:49:08 +02:00
Eleni170 6042723535 Add support for Greek language (#2535)
* Add contributor agreement

* Support for Greek language

* Fix missing el_tokenizer
2018-07-10 13:48:38 +02:00
ines 71bfc92913 Exclude models for non-stable versions [ci skip] 2018-07-10 13:44:55 +02:00
Stefan Schweter 3dfc7f86be lemmatizer: correct lemma for Rang (#2537)
<!--- Provide a general summary of your changes in the title. -->

## Description

This PR corrects the German lemma form for the word "Rang". Initially, the lemma form was "ringen", which is not correct, because it refers to the verb ("ringen") and not to the noun ("Rang").

### Types of change

The lemma form for "Rang" is corrected to "Rang", see also the [Duden](https://www.duden.de/rechtschreibung/Rang) entry.

## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
2018-07-10 13:11:19 +02:00
ines b5200962c0 Adjust formatting [ci skip] 2018-07-09 18:35:46 +02:00
Alex Villarreal bd35bf7f09 Guidance to handle binary files in git in Windows (#2526)
Adds guidance on what to do if users encounter the error described in [1634](https://github.com/explosion/spaCy/issues/1634), which probably only happens in Windows environments.
2018-07-09 18:31:37 +02:00
ines fd6207426a Merge branch 'master' into develop 2018-07-09 18:05:10 +02:00
Duygu Altinok 00b9a58558 German lemmatizer additions (#2529)
* lemma of was-> was

* added new pairs issue @2486

* added article tests
2018-07-09 11:10:15 +02:00
Ole Henrik Skogstrøm c21efea9bb Add sent property to token (#2521)
* Add sent property to token

* Refactored and cleaned up copy paste errors.
2018-07-06 15:54:15 +02:00
ines 38e07ade4c Add test for custom tokenizer serialization (resolves #2494) 2018-07-06 12:40:51 +02:00
ines c2581f9172 Tidy up tokenizer test 2018-07-06 12:40:28 +02:00
Matthew Honnibal 43dcaa473e Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2018-07-06 12:36:42 +02:00
Matthew Honnibal 6c8d627733 Fix tokenizer deserialization 2018-07-06 12:36:33 +02:00
ines c001d46153 Tidy up 2018-07-06 12:33:42 +02:00
Matthew Honnibal 63f5651f8d Fix tokenizer serialization 2018-07-06 12:32:11 +02:00
Matthew Honnibal e1569fda4e Fix compile error in matcher 2018-07-06 12:29:23 +02:00
Matthew Honnibal f5b2076700 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2018-07-06 12:23:14 +02:00
Matthew Honnibal 1a2f61725c Fix tokenizer serialization 2018-07-06 12:23:04 +02:00
ines 9e09477b2f Remove unused import 2018-07-06 12:18:17 +02:00
ines 26f04a6ac3 Fix Matcher tests and add test for any token with operator 2018-07-06 12:17:50 +02:00
Matthew Honnibal f5703b7a91 Clean up unused stuff in matcher 2018-07-06 12:16:44 +02:00
Matthew Honnibal 08c362d541 Suppress compiler warning about unreachable code 2018-07-06 11:31:22 +02:00
Matthew Honnibal 8ae1bec8bf Fix init_model 2018-07-05 14:02:06 +02:00
Matthew Honnibal 7b09a4ca49 Fix lemmatization 2018-07-05 13:56:02 +02:00