Update readme with release notes for v0.100.8

2016-05-05 00:16:13 +10:00 · 2016-05-05 00:16:13 +10:00 · 1b8b888a57
parent 72564213e3
commit 1b8b888a57
1 changed files with 32 additions and 0 deletions
--- a/README.rst
+++ b/README.rst
@ -13,6 +13,38 @@ spaCy is built on the very latest research, but it isn't researchware.  It was
 designed from day 1 to be used in real products. It's commercial open-source
 software, released under the MIT license.

+2016-04-05 v0.100.7: German!
+----------------------------
+
+spaCy finally supports another language, in addition to English. We're lucky to have Wolfgang Seeker on the team, and the new German model is just the beginning.
+Now that there are multiple languages, you should consider loading spaCy via the load() function. This function also makes it easier to load extra word vector data for English:
+    
+    import spacy
+    
+    en_nlp = spacy.load('en', vectors='en_glove_cc_300_1m_vectors')
+    de_nlp = spacy.load('de')
+    
+To support use of the load function, there are also two new helper functions: spacy.get_lang_class and spacy.set_lang_class.
+Once the German model is loaded, you can use it just like the English model:
+
+    doc = nlp(u'''Wikipedia ist ein Projekt zum Aufbau einer Enzyklopädie aus freien Inhalten, zu dem du mit deinem Wissen beitragen kannst. Seit Mai 2001 sind 1.936.257 Artikel in deutscher Sprache entstanden.''')
+    for sent in doc.sents:
+        print(sent.root.text, sent.root.n_lefts, sent.root.n_rights)
+    # (u'ist', 1, 2)
+    # (u'sind', 1, 3)
+    
+The German model provides tokenization, POS tagging, sentence boundary detection, syntactic dependency parsing, recognition of organisation, location and person entities, and word vector representations trained on a mix of open subtitles and Wikipedia data. It doesn't yet provide lemmatisation or morphological analysis, and it doesn't yet recognise numeric entities such as numbers and dates.
+
+Bugfixes
+--------
+* spaCy < 0.100.7 had a bug in the semantics of the Token.__str__ and Token.__unicode__
+built-ins: they included a trailing space.
+* Improve handling of "infixed" hyphens. Previously the tokenizer struggled with multiple hyphens, such as "well-to-do".
+* Improve handling of periods after mixed-case tokens
+* Improve lemmatization for English special-case tokens
+* Fix bug that allowed spaces to be treated as heads in the syntactic parse
+* Fix bug that led to inconsistent sentence boundaries before and after serialisation.
+* Fix bug from deserialising untagged documents.

 Features
 --------