spaCy/website/docs/usage/v2.jade

412 lines
16 KiB
Plaintext
Raw Blame History

This file contains invisible Unicode characters

This file contains invisible Unicode characters that are indistinguishable to humans but may be processed differently by a computer. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

//- 💫 DOCS > USAGE > WHAT'S NEW IN V2.0
include ../../_includes/_mixins
p
| We're very excited to finally introduce spaCy v2.0.
p
| On this page, you'll find a summary of the #[+a("#features") new features],
| information on the #[+a("#incompat") backwards incompatibilities],
| including a handy overview of what's been renamed or deprecated.
| To help you make the most of v2.0, we also
| #[strong re-wrote almost all of the usage guides and API docs], and added
| more real-world examples. If you're new to spaCy, or just want to brush
| up on some NLP basics and the details of the library, check out
| the #[+a("/docs/usage/spacy-101") spaCy 101 guide] that explains the most
| important concepts with examples and illustrations.
+h(2, "features") New features
p
| This section contains an overview of the most important
| #[strong new features and improvements]. The #[+a("/docs/api") API docs]
| include additional deprecation notes. New methods and functions that
| were introduced in this version are marked with a #[+tag-new(2)] tag.
+h(3, "features-pipelines") Improved processing pipelines
+aside-code("Example").
# Modify an existing pipeline
nlp = spacy.load('en')
nlp.pipeline.append(my_component)
# Register a factory to create a component
spacy.set_factory('my_factory', my_factory)
nlp = Language(pipeline=['my_factory', mycomponent])
p
| It's now much easier to #[strong customise the pipeline] with your own
| components, functions that receive a #[code Doc] object, modify and
| return it. If your component is stateful, you can define and register a
| factory which receives the shared #[code Vocab] object and returns a
|  component. spaCy's default components can be added to your pipeline by
| using their string IDs. This way, you won't have to worry about finding
| and implementing them simply add #[code "tagger"] to the pipeline,
| and spaCy will know what to do.
+image
include ../../assets/img/docs/pipeline.svg
+infobox
| #[strong API:] #[+api("language") #[code Language]]
| #[strong Usage:] #[+a("/docs/usage/language-processing-pipeline") Processing text]
+h(3, "features-hash-ids") Hash values instead of integer IDs
+aside-code("Example").
doc = nlp(u'I love coffee')
assert doc.vocab.strings[u'coffee'] == 3197928453018144401
assert doc.vocab.strings[3197928453018144401] == u'coffee'
beer_hash = doc.vocab.strings.add(u'beer')
assert doc.vocab.strings[u'beer'] == beer_hash
assert doc.vocab.strings[beer_hash] == u'beer'
p
| The #[+api("stringstore") #[code StringStore]] now resolves all strings
| to hash values instead of integer IDs. This means that the string-to-int
| mapping #[strong no longer depends on the vocabulary state], making a lot
| of workflows much simpler, especially during training. Unlike integer IDs
| in spaCy v1.x, hash values will #[strong always match] even across
| models. Strings can now be added explicitly using the new
| #[+api("stringstore#add") #[code Stringstore.add]] method. A token's hash
| is available via #[code token.orth].
+infobox
| #[strong API:] #[+api("stringstore") #[code StringStore]]
| #[strong Usage:] #[+a("/docs/usage/spacy-101#vocab") Vocab, hashes and lexemes 101]
+h(3, "features-serializer") Saving, loading and serialization
+aside-code("Example").
nlp = spacy.load('en') # shortcut link
nlp = spacy.load('en_core_web_sm') # package
nlp = spacy.load('/path/to/en') # unicode path
nlp = spacy.load(Path('/path/to/en')) # pathlib Path
nlp.to_disk('/path/to/nlp')
nlp = English().from_disk('/path/to/nlp')
p
| spay's serialization API has been made consistent across classes and
| objects. All container classes, i.e. #[code Language], #[code Doc],
| #[code Vocab] and #[code StringStore] now have a #[code to_bytes()],
| #[code from_bytes()], #[code to_disk()] and #[code from_disk()] method
| that supports the Pickle protocol.
p
| The improved #[code spacy.load] makes loading models easier and more
| transparent. You can load a model by supplying its
| #[+a("/docs/usage/models#usage") shortcut link], the name of an installed
| #[+a("/docs/usage/saving-loading#generating") model package] or a path.
| The #[code Language] class to initialise will be determined based on the
| model's settings. For a blank language, you can import the class directly,
| e.g. #[code from spacy.lang.en import English].
+infobox
| #[strong API:] #[+api("spacy#load") #[code spacy.load]], #[+api("binder") #[code Binder]]
| #[strong Usage:] #[+a("/docs/usage/saving-loading") Saving and loading]
+h(3, "features-displacy") displaCy visualizer with Jupyter support
+aside-code("Example").
from spacy import displacy
doc = nlp(u'This is a sentence about Facebook.')
displacy.serve(doc, style='dep') # run the web server
html = displacy.render(doc, style='ent') # generate HTML
p
| Our popular dependency and named entity visualizers are now an official
| part of the spaCy library! displaCy can run a simple web server, or
| generate raw HTML markup or SVG files to be exported. You can pass in one
| or more docs, and customise the style. displaCy also auto-detects whether
| you're running #[+a("https://jupyter.org") Jupyter] and will render the
| visualizations in your notebook.
+infobox
| #[strong API:] #[+api("displacy") #[code displacy]]
| #[strong Usage:] #[+a("/docs/usage/visualizers") Visualizing spaCy]
+h(3, "features-language") Improved language data and lazy loading
p
| Language-specfic data now lives in its own submodule, #[code spacy.lang].
| Languages are lazy-loaded, i.e. only loaded when you import a
| #[code Language] class, or load a model that initialises one. This allows
| languages to contain more custom data, e.g. lemmatizer lookup tables, or
| complex regular expressions. The language data has also been tidied up
| and simplified. spaCy now also supports simple lookup-based lemmatization.
+infobox
| #[strong API:] #[+api("language") #[code Language]]
| #[strong Code:] #[+src(gh("spaCy", "spacy/lang")) spacy/lang]
| #[strong Usage:] #[+a("/docs/usage/adding-languages") Adding languages]
+h(3, "features-matcher") Revised matcher API
+aside-code("Example").
from spacy.matcher import Matcher
matcher = Matcher(nlp.vocab)
matcher.add('HEARTS', None, [{'ORTH': '❤️', 'OP': '+'}])
assert len(matcher) == 1
assert 'HEARTS' in matcher
p
| Patterns can now be added to the matcher by calling
| #[+api("matcher-add") #[code matcher.add()]] with a match ID, an optional
| callback function to be invoked on each match, and one or more patterns.
| This allows you to write powerful, pattern-specific logic using only one
| matcher. For example, you might only want to merge some entity types,
| and set custom flags for other matched patterns.
+infobox
| #[strong API:] #[+api("matcher") #[code Matcher]]
| #[strong Usage:] #[+a("/docs/usage/rule-based-matching") Rule-based matching]
+h(3, "features-models") Neural network models for English, German, French, Spanish and multi-language NER
+aside-code("Example", "bash").
python -m spacy download en # default English model
python -m spacy download de # default German model
python -m spacy download fr # default French model
python -m spacy download es # default Spanish model
python -m spacy download xx_ent_wiki_sm # multi-language NER
p
| spaCy v2.0 comes with new and improved neural network models for English,
| German, French and Spanish, as well as a multi-language named entity
| recognition model trained on Wikipedia. #[strong GPU usage] is now
| supported via #[+a("http://chainer.org") Chainer]'s CuPy module.
+infobox
| #[strong Details:] #[+a("/docs/api/language-models") Languages],
| #[+src(gh("spacy-models")) spacy-models]
| #[strong Usage:] #[+a("/docs/usage/models") Models],
| #[+a("/docs/usage#gpu") Using spaCy with GPU]
+h(2, "incompat") Backwards incompatibilities
+table(["Old", "New"])
+row
+cell
| #[code spacy.en]
| #[code spacy.xx]
+cell
| #[code spacy.lang.en]
| #[code spacy.lang.xx]
+row
+cell #[code spacy.orth]
+cell #[code spacy.lang.xx.lex_attrs]
+row
+cell #[code cli.model]
+cell -
+row
+cell #[code Language.save_to_directory]
+cell #[+api("language#to_disk") #[code Language.to_disk]]
+row
+cell #[code Language.create_make_doc]
+cell #[+api("language#attributes") #[code Language.tokenizer]]
+row
+cell
| #[code Vocab.load]
| #[code Vocab.load_lexemes]
| #[code Vocab.load_vectors]
| #[code Vocab.load_vectors_from_bin_loc]
+cell
| #[+api("vocab#from_disk") #[code Vocab.from_disk]]
| #[+api("vocab#from_bytes") #[code Vocab.from_bytes]]
+row
+cell
| #[code Vocab.dump]
| #[code Vocab.dump_vectors]
+cell
| #[+api("vocab#to_disk") #[code Vocab.to_disk]]
| #[+api("vocab#to_bytes") #[code Vocab.to_bytes]]
+row
+cell
| #[code StringStore.load]
+cell
| #[+api("stringstore#from_disk") #[code StringStore.from_disk]]
| #[+api("stringstore#from_bytes") #[code StringStore.from_bytes]]
+row
+cell
| #[code StringStore.dump]
+cell
| #[+api("stringstore#to_disk") #[code StringStore.to_disk]]
| #[+api("stringstore#to_bytes") #[code StringStore.to_bytes]]
+row
+cell #[code Tokenizer.load]
+cell -
+row
+cell #[code Tagger.load]
+cell
| #[+api("tagger#from_disk") #[code Tagger.from_disk]]
| #[+api("tagger#from_bytes") #[code Tagger.from_bytes]]
+row
+cell #[code DependencyParser.load]
+cell
| #[+api("dependencyparser#from_disk") #[code DependencyParser.from_disk]]
| #[+api("dependencyparser#from_bytes") #[code DependencyParser.from_bytes]]
+row
+cell #[code EntityRecognizer.load]
+cell
| #[+api("entityrecognizer#from_disk") #[code EntityRecognizer.from_disk]]
| #[+api("entityrecognizer#from_bytes") #[code EntityRecognizer.from_bytes]]
+row
+cell #[code Matcher.load]
+cell -
+row
+cell
| #[code Matcher.add_pattern]
| #[code Matcher.add_entity]
+cell #[+api("matcher#add") #[code Matcher.add]]
+row
+cell #[code Matcher.get_entity]
+cell #[+api("matcher#get") #[code Matcher.get]]
+row
+cell #[code Matcher.has_entity]
+cell #[+api("matcher#contains") #[code Matcher.__contains__]]
+row
+cell #[code Doc.read_bytes]
+cell #[+api("binder") #[code Binder]]
+row
+cell #[code Token.is_ancestor_of]
+cell #[+api("token#is_ancestor") #[code Token.is_ancestor]]
+h(2, "migrating") Migrating from spaCy 1.x
p
+infobox("Some tips")
| Before migrating, we strongly recommend writing a few
| #[strong simple tests] specific to how you're using spaCy in your
| application. This makes it easier to check whether your code requires
| changes, and if so, which parts are affected.
| (By the way, feel free contribute your tests to
| #[+src(gh("spaCy", "spacy/tests")) our test suite] this will also ensure
| we never accidentally introduce a bug in a workflow that's
| important to you.) If you've trained your own models, keep in mind that
| your train and runtime inputs must match. This means you'll have to
| #[strong retrain your models] with spaCy v2.0 to make them compatible.
+h(3, "migrating-saving-loading") Saving, loading and serialization
p
| Double-check all calls to #[code spacy.load()] and make sure they don't
| use the #[code path] keyword argument. If you're only loading in binary
| data and not a model package that can construct its own #[code Language]
| class and pipeline, you should now use the
| #[+api("language#from_disk") #[code Language.from_disk()]] method.
+code-new.
nlp = spacy.load('/model')
nlp = English().from_disk('/model/data')
+code-old nlp = spacy.load('en', path='/model')
p
| Review all other code that writes state to disk or bytes.
| All containers, now share the same, consistent API for saving and
| loading. Replace saving with #[code to_disk()] or #[code to_bytes()], and
| loading with #[code from_disk()] and #[code from_bytes()].
+code-new.
nlp.to_disk('/model')
nlp.vocab.to_disk('/vocab')
+code-old.
nlp.save_to_directory('/model')
nlp.vocab.dump('/vocab')
p
| If you've trained models with input from v1.x, you'll need to
| #[strong retrain them] with spaCy v2.0. All previous models will not
| be compatible with the new version.
+h(3, "migrating-strings") Strings and hash values
p
| The change from integer IDs to hash values may not actually affect your
| code very much. However, if you're adding strings to the vocab manually,
| you now need to call #[+api("stringstore#add") #[code StringStore.add()]]
| explicitly. You can also now be sure that the string-to-hash mapping will
| always match across vocabularies.
+code-new.
nlp.vocab.strings.add(u'coffee')
nlp.vocab.strings[u'coffee'] # 3197928453018144401
other_nlp.vocab.strings[u'coffee'] # 3197928453018144401
+code-old.
nlp.vocab.strings[u'coffee'] # 3672
other_nlp.vocab.strings[u'coffee'] # 40259
+h(3, "migrating-languages") Processing pipelines and language data
p
| If you're importing language data or #[code Language] classes, make sure
| to change your import statements to import from #[code spacy.lang]. If
| you've added your own custom language, it needs to be moved to
| #[code spacy/lang/xx] and adjusted accordingly.
+code-new from spacy.lang.en import English
+code-old from spacy.en import English
p
| If you've been using custom pipeline components, check out the new
| guide on #[+a("/docs/usage/language-processing-pipelines") processing pipelines].
| Appending functions to the pipeline still works but you might be able
| to make this more convenient by registering "component factories".
| Components of the processing pipeline can now be disabled by passing a
| list of their names to the #[code disable] keyword argument on loading
| or processing.
+code-new.
nlp = spacy.load('en', disable=['tagger', 'ner'])
doc = nlp(u"I don't want parsed", disable=['parser'])
+code-old.
nlp = spacy.load('en', tagger=False, entity=False)
doc = nlp(u"I don't want parsed", parse=False)
+h(3, "migrating-matcher") Adding patterns and callbacks to the matcher
p
| If you're using the matcher, you can now add patterns in one step. This
| should be easy to update simply merge the ID, callback and patterns
| into one call to #[+api("matcher#add") #[code Matcher.add()]].
+code-new.
matcher.add('GoogleNow', merge_phrases, [{ORTH: 'Google'}, {ORTH: 'Now'}])
+code-old.
matcher.add_entity('GoogleNow', on_match=merge_phrases)
matcher.add_pattern('GoogleNow', [{ORTH: 'Google'}, {ORTH: 'Now'}])
p
| If you've been using #[strong acceptor functions], you'll need to move
| this logic into the
| #[+a("/docs/usage/rule-based-matching#on_match") #[code on_match] callbacks].
| The callback function is invoked on every match and will give you access to
| the doc, the index of the current match and all total matches. This lets
| you both accept or reject the match, and define the actions to be
| triggered.