mirror of https://github.com/explosion/spaCy.git
Update readme with release notes for v0.100.8
parent 72564213e3
commit 1b8b888a57
32
README.rst
@@ -13,6 +13,38 @@ spaCy is built on the very latest research, but it isn't researchware. It was
designed from day 1 to be used in real products. It's commercial open-source
software, released under the MIT license.

2016-04-05 v0.100.7: German!
----------------------------

spaCy finally supports another language, in addition to English. We're lucky to have Wolfgang Seeker on the team, and the new German model is just the beginning.
Now that there are multiple languages, you should consider loading spaCy via the load() function. This function also makes it easier to load extra word vector data for English:

    import spacy

    en_nlp = spacy.load('en', vectors='en_glove_cc_300_1m_vectors')
    de_nlp = spacy.load('de')

To support use of the load function, there are also two new helper functions: spacy.get_lang_class and spacy.set_lang_class.
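The two helper functions suggest a registry that maps a language code to a class, which load() then instantiates. The following is a minimal sketch of that pattern using only the standard library. The LANGUAGES dict, the English and German stand-in classes, and this load() are illustrative assumptions, not spaCy's actual implementation:

```python
# Illustrative sketch only: a registry of language classes keyed by language code.
# LANGUAGES, English, German and load() are stand-ins, not spaCy's real code.
LANGUAGES = {}


def set_lang_class(name, cls):
    """Register the class that implements the language `name`."""
    LANGUAGES[name] = cls


def get_lang_class(name):
    """Look up a registered language class, failing loudly if it is unknown."""
    try:
        return LANGUAGES[name]
    except KeyError:
        raise RuntimeError("Language not supported: %s" % name)


class English(object):
    lang = 'en'

    def __init__(self, vectors=None):
        # Name of an optional word-vector package, if one was requested.
        self.vectors = vectors


class German(object):
    lang = 'de'

    def __init__(self, vectors=None):
        self.vectors = vectors


def load(name, vectors=None):
    """Instantiate whichever class is registered for `name`."""
    return get_lang_class(name)(vectors=vectors)


set_lang_class(English.lang, English)
set_lang_class(German.lang, German)

en_nlp = load('en', vectors='en_glove_cc_300_1m_vectors')
de_nlp = load('de')
print(type(en_nlp).__name__, en_nlp.vectors)
print(type(de_nlp).__name__, de_nlp.vectors)
```

With a registry like this, supporting a further language would only require registering its class under a new code; load() itself would not need to change.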
Once the German model is loaded, you can use it just like the English model:

    doc = de_nlp(u'''Wikipedia ist ein Projekt zum Aufbau einer Enzyklopädie aus freien Inhalten, zu dem du mit deinem Wissen beitragen kannst. Seit Mai 2001 sind 1.936.257 Artikel in deutscher Sprache entstanden.''')
    for sent in doc.sents:
        print(sent.root.text, sent.root.n_lefts, sent.root.n_rights)
    # (u'ist', 1, 2)
    # (u'sind', 1, 3)

The German model provides tokenization, POS tagging, sentence boundary detection, syntactic dependency parsing, recognition of organisation, location and person entities, and word vector representations trained on a mix of open subtitles and Wikipedia data. It doesn't yet provide lemmatisation or morphological analysis, and it doesn't yet recognise numeric entities such as numbers and dates.
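In the German example above, sent.root is the head of each sentence, and n_lefts and n_rights count its direct dependents to the left and to the right. The sketch below reproduces that counting on a hand-written toy parse. The words and heads lists and both helper functions are hypothetical, and the parse is an assumption for illustration, not model output:

```python
# Hand-written toy dependency parse (an assumption for illustration, not model output).
# heads[i] is the index of token i's head; the root points at itself.
words = ['Wikipedia', 'ist', 'ein', 'Projekt', '.']
heads = [1, 1, 3, 1, 1]


def find_root(heads):
    """Return the index of the token that is its own head."""
    for i, head in enumerate(heads):
        if head == i:
            return i
    raise ValueError("no root found")


def count_children(heads, index):
    """Count direct dependents to the left and to the right of `index`."""
    n_lefts = sum(1 for i, head in enumerate(heads) if head == index and i < index)
    n_rights = sum(1 for i, head in enumerate(heads) if head == index and i > index)
    return n_lefts, n_rights


root = find_root(heads)
n_lefts, n_rights = count_children(heads, root)
print(words[root], n_lefts, n_rights)  # under Python 3: ist 1 2
```

Only direct dependents are counted, so "ein", which attaches to "Projekt" in this toy parse, does not contribute to the root's totals.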

Bugfixes
--------
* spaCy < 0.100.7 had a bug in the semantics of the Token.__str__ and Token.__unicode__
  built-ins: they included a trailing space.
* Improve handling of "infixed" hyphens. Previously the tokenizer struggled with multiple hyphens, such as "well-to-do".
* Improve handling of periods after mixed-case tokens.
* Improve lemmatisation for English special-case tokens.
* Fix bug that allowed spaces to be treated as heads in the syntactic parse.
* Fix bug that led to inconsistent sentence boundaries before and after serialisation.
* Fix bug when deserialising untagged documents.
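To make the infixed-hyphen fix above concrete, here is a minimal sketch that splits every hyphen sitting between two letters into its own token. It uses a plain regular expression and a hypothetical split_infix_hyphens helper. It assumes each such hyphen becomes a separate token, which may differ from the rules spaCy's tokenizer actually applies:

```python
import re

# A hyphen with a letter on both sides; the lookarounds leave the letters in place.
INFIX_HYPHEN = re.compile(r'(?<=[^\W\d_])-(?=[^\W\d_])')


def split_infix_hyphens(chunk):
    """Split a whitespace-delimited chunk on every letter-hyphen-letter boundary."""
    tokens = []
    start = 0
    for match in INFIX_HYPHEN.finditer(chunk):
        tokens.append(chunk[start:match.start()])  # text before the hyphen
        tokens.append(match.group())               # the hyphen itself
        start = match.end()
    tokens.append(chunk[start:])                   # remaining text after the last hyphen
    return tokens


print(split_infix_hyphens('well-to-do'))  # ['well', '-', 'to', '-', 'do']
print(split_infix_hyphens('-5'))          # ['-5']
```

Because the pattern requires a letter on both sides, a leading minus sign as in "-5" is left attached to the number.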

Features
--------