mirror of https://github.com/explosion/spaCy.git
Update docs and consistency [ci skip]
parent 52bd3a8b48
commit aa6a7cd6e7
|
@ -5,7 +5,7 @@
|
|||
Thanks for your interest in contributing to spaCy 🎉 The project is maintained
|
||||
by [@honnibal](https://github.com/honnibal) and [@ines](https://github.com/ines),
|
||||
and we'll do our best to help you get started. This page will give you a quick
|
||||
overview of how things are organised and most importantly, how to get involved.
|
||||
overview of how things are organized and most importantly, how to get involved.
|
||||
|
||||
## Table of contents
|
||||
|
||||
|
@ -195,7 +195,7 @@ modules in `.py` files, not Cython modules in `.pyx` and `.pxd` files.**
|
|||
### Code formatting
|
||||
|
||||
[`black`](https://github.com/ambv/black) is an opinionated Python code
|
||||
formatter, optimised to produce readable code and small diffs. You can run
|
||||
formatter, optimized to produce readable code and small diffs. You can run
|
||||
`black` from the command-line, or via your code editor. For example, if you're
|
||||
using [Visual Studio Code](https://code.visualstudio.com/), you can add the
|
||||
following to your `settings.json` to use `black` for formatting and auto-format
|
||||
|
@ -286,7 +286,7 @@ Code that interacts with the file-system should accept objects that follow the
|
|||
If the function is user-facing and takes a path as an argument, it should check
|
||||
whether the path is provided as a string. Strings should be converted to
|
||||
`pathlib.Path` objects. Serialization and deserialization functions should always
|
||||
accept **file-like objects**, as it makes the library io-agnostic. Working on
|
||||
accept **file-like objects**, as it makes the library IO-agnostic. Working on
|
||||
buffers makes the code more general, easier to test, and compatible with Python
|
||||
3's asynchronous IO.
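
A minimal sketch of the path convention described in this hunk, using only the standard library. The helper names `ensure_path`, `to_stream` and `from_stream` are hypothetical illustrations and are not taken from spaCy's code base; JSON is used purely as a stand-in serialization format.

```python
import io
import json
from pathlib import Path
from typing import IO, Union


def ensure_path(path: Union[str, Path]) -> Path:
    """Convert a string to a pathlib.Path; return Path objects unchanged."""
    if isinstance(path, str):
        return Path(path)
    return path


def to_stream(data: dict, stream: IO[bytes]) -> None:
    """Serialize to any writable binary file-like object."""
    stream.write(json.dumps(data).encode("utf8"))


def from_stream(stream: IO[bytes]) -> dict:
    """Deserialize from any readable binary file-like object."""
    return json.loads(stream.read().decode("utf8"))


# An in-memory buffer works just as well as an open file,
# which keeps the functions easy to test without touching the disk.
buffer = io.BytesIO()
to_stream({"lang": "en"}, buffer)
buffer.seek(0)
restored = from_stream(buffer)
```

Because the functions only rely on `write` and `read`, the same code should work with a file opened in binary mode, a socket wrapper, or a buffer.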
|
||||
|
||||
|
@ -384,7 +384,7 @@ of Python and C++, with additional complexity and syntax from numpy. The
|
|||
many "traps for new players". Working in Cython is very rewarding once you're
|
||||
over the initial learning curve. As with C and C++, the first way you write
|
||||
something in Cython will often be the performance-optimal approach. In contrast,
|
||||
Python optimisation generally requires a lot of experimentation. Is it faster to
|
||||
Python optimization generally requires a lot of experimentation. Is it faster to
|
||||
have an `if item in my_dict` check, or to use `.get()`? What about `try`/`except`?
|
||||
Does this numpy operation create a copy? There's no way to guess the answers to
|
||||
these questions, and you'll usually be dissatisfied with your results — so
|
||||
|
@ -400,7 +400,7 @@ Python. If it's not fast enough the first time, just switch to Cython.
|
|||
- [PEP 8 Style Guide for Python Code](https://www.python.org/dev/peps/pep-0008/) (python.org)
|
||||
- [Official Cython documentation](http://docs.cython.org/en/latest/) (cython.org)
|
||||
- [Writing C in Cython](https://explosion.ai/blog/writing-c-in-cython) (explosion.ai)
|
||||
- [Multi-threading spaCy’s parser and named entity recogniser](https://explosion.ai/blog/multithreading-with-cython) (explosion.ai)
|
||||
- [Multi-threading spaCy’s parser and named entity recognizer](https://explosion.ai/blog/multithreading-with-cython) (explosion.ai)
|
||||
|
||||
## Adding tests
|
||||
|
||||
|
@ -412,7 +412,7 @@ name. For example, tests for the `Tokenizer` can be found in
|
|||
all test files and test functions need to be prefixed with `test_`.
|
||||
|
||||
When adding tests, make sure to use descriptive names, keep the code short and
|
||||
concise and only test for one behaviour at a time. Try to `parametrize` test
|
||||
concise and only test for one behavior at a time. Try to `parametrize` test
|
||||
cases wherever possible, use our pre-defined fixtures for spaCy components and
|
||||
avoid unnecessary imports.
|
||||
|
||||
|
|
|
@ -49,9 +49,8 @@ It's commercial open-source software, released under the MIT license.
|
|||
|
||||
## 💬 Where to ask questions
|
||||
|
||||
The spaCy project is maintained by [@honnibal](https://github.com/honnibal) and
|
||||
[@ines](https://github.com/ines), along with core contributors
|
||||
[@svlandeg](https://github.com/svlandeg) and
|
||||
The spaCy project is maintained by [@honnibal](https://github.com/honnibal),
|
||||
[@ines](https://github.com/ines), [@svlandeg](https://github.com/svlandeg) and
|
||||
[@adrianeboyd](https://github.com/adrianeboyd). Please understand that we won't
|
||||
be able to provide individual support via email. We also believe that help is
|
||||
much more valuable if it's shared publicly, so that more people can benefit from
|
||||
|
|
|
@ -47,9 +47,9 @@ cdef class Tokenizer:
|
|||
`infix_finditer` (callable): A function matching the signature of
|
||||
`re.compile(string).finditer` to find infixes.
|
||||
token_match (callable): A boolean function matching strings to be
|
||||
recognised as tokens.
|
||||
recognized as tokens.
|
||||
url_match (callable): A boolean function matching strings to be
|
||||
recognised as tokens after considering prefixes and suffixes.
|
||||
recognized as tokens after considering prefixes and suffixes.
|
||||
|
||||
EXAMPLE:
|
||||
>>> tokenizer = Tokenizer(nlp.vocab)
|
||||
|
|
|
@ -184,7 +184,7 @@ yourself. For details on how to get started with training your own model, check
|
|||
out the [training quickstart](/usage/training#quickstart).
|
||||
|
||||
<!-- TODO:
|
||||
<Project id="en_core_bert">
|
||||
<Project id="en_core_trf_lg">
|
||||
|
||||
The easiest way to get started is to clone a transformers-based project
|
||||
template. Swap in your data, edit the settings and hyperparameters and train,
|
||||
|
|
|
@ -368,7 +368,7 @@ from is called `spacy`. So, when using spaCy, never call anything else `spacy`.
|
|||
|
||||
</Accordion>
|
||||
|
||||
<Accordion title="NER model doesn't recognise other entities anymore after training" id="catastrophic-forgetting">
|
||||
<Accordion title="NER model doesn't recognize other entities anymore after training" id="catastrophic-forgetting">
|
||||
|
||||
If your training data only contained new entities and you didn't mix in any
|
||||
examples the model previously recognized, it can cause the model to "forget"
|
||||
|
|
|
@ -429,7 +429,7 @@ nlp = spacy.load("en_core_web_sm")
|
|||
doc = nlp("fb is hiring a new vice president of global policy")
|
||||
ents = [(e.text, e.start_char, e.end_char, e.label_) for e in doc.ents]
|
||||
print('Before', ents)
|
||||
# the model didn't recognise "fb" as an entity :(
|
||||
# The model didn't recognize "fb" as an entity :(
|
||||
|
||||
fb_ent = Span(doc, 0, 1, label="ORG") # create a Span for the new entity
|
||||
doc.ents = list(doc.ents) + [fb_ent]
|
||||
|
@ -558,11 +558,11 @@ import spacy
|
|||
nlp = spacy.load("my_custom_el_model")
|
||||
doc = nlp("Ada Lovelace was born in London")
|
||||
|
||||
# document level
|
||||
# Document level
|
||||
ents = [(e.text, e.label_, e.kb_id_) for e in doc.ents]
|
||||
print(ents) # [('Ada Lovelace', 'PERSON', 'Q7259'), ('London', 'GPE', 'Q84')]
|
||||
|
||||
# token level
|
||||
# Token level
|
||||
ent_ada_0 = [doc[0].text, doc[0].ent_type_, doc[0].ent_kb_id_]
|
||||
ent_ada_1 = [doc[1].text, doc[1].ent_type_, doc[1].ent_kb_id_]
|
||||
ent_london_5 = [doc[5].text, doc[5].ent_type_, doc[5].ent_kb_id_]
|
||||
|
@ -914,12 +914,12 @@ from spacy.lang.char_classes import ALPHA, ALPHA_LOWER, ALPHA_UPPER
|
|||
from spacy.lang.char_classes import CONCAT_QUOTES, LIST_ELLIPSES, LIST_ICONS
|
||||
from spacy.util import compile_infix_regex
|
||||
|
||||
# default tokenizer
|
||||
# Default tokenizer
|
||||
nlp = spacy.load("en_core_web_sm")
|
||||
doc = nlp("mother-in-law")
|
||||
print([t.text for t in doc]) # ['mother', '-', 'in', '-', 'law']
|
||||
|
||||
# modify tokenizer infix patterns
|
||||
# Modify tokenizer infix patterns
|
||||
infixes = (
|
||||
LIST_ELLIPSES
|
||||
+ LIST_ICONS
|
||||
|
@ -929,8 +929,8 @@ infixes = (
|
|||
al=ALPHA_LOWER, au=ALPHA_UPPER, q=CONCAT_QUOTES
|
||||
),
|
||||
r"(?<=[{a}]),(?=[{a}])".format(a=ALPHA),
|
||||
# EDIT: commented out regex that splits on hyphens between letters:
|
||||
#r"(?<=[{a}])(?:{h})(?=[{a}])".format(a=ALPHA, h=HYPHENS),
|
||||
# ✅ Commented out regex that splits on hyphens between letters:
|
||||
# r"(?<=[{a}])(?:{h})(?=[{a}])".format(a=ALPHA, h=HYPHENS),
|
||||
r"(?<=[{a}0-9])[:<>=/](?=[{a}])".format(a=ALPHA),
|
||||
]
|
||||
)
|
||||
|
|
|
@ -108,11 +108,11 @@ class, or defined within a [model package](/usage/saving-loading#models).
|
|||
>
|
||||
> [components.tagger]
|
||||
> factory = "tagger"
|
||||
> # settings for the tagger component
|
||||
> # Settings for the tagger component
|
||||
>
|
||||
> [components.parser]
|
||||
> factory = "parser"
|
||||
> # settings for the parser component
|
||||
> # Settings for the parser component
|
||||
> ```
|
||||
|
||||
When you load a model, spaCy first consults the model's
|
||||
|
@ -171,11 +171,11 @@ lang = "en"
|
|||
pipeline = ["tagger", "parser", "ner"]
|
||||
data_path = "path/to/en_core_web_sm/en_core_web_sm-2.0.0"
|
||||
|
||||
cls = spacy.util.get_lang_class(lang) # 1. Get Language instance, e.g. English()
|
||||
nlp = cls() # 2. Initialize it
|
||||
cls = spacy.util.get_lang_class(lang) # 1. Get Language class, e.g. English
|
||||
nlp = cls() # 2. Initialize it
|
||||
for name in pipeline:
|
||||
    nlp.add_pipe(name)                     # 3. Add the component to the pipeline
|
||||
nlp.from_disk(data_path)               # 4. Load in the binary data
|
||||
    nlp.add_pipe(name)              # 3. Add the component to the pipeline
|
||||
nlp.from_disk(data_path)        # 4. Load in the binary data
|
||||
```
|
||||
|
||||
When you call `nlp` on a text, spaCy will **tokenize** it and then **call each
|
||||
|
@ -187,9 +187,9 @@ which is then processed by the component next in the pipeline.
|
|||
|
||||
```python
|
||||
### The pipeline under the hood
|
||||
doc = nlp.make_doc("This is a sentence") # create a Doc from raw text
|
||||
for name, proc in nlp.pipeline: # iterate over components in order
|
||||
    doc = proc(doc)                           # apply each component
|
||||
doc = nlp.make_doc("This is a sentence") # Create a Doc from raw text
|
||||
for name, proc in nlp.pipeline: # Iterate over components in order
|
||||
    doc = proc(doc)                           # Apply each component
|
||||
```
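
The loop above can be mirrored in a dependency-free toy, shown here only to illustrate the "each component receives the doc and returns it" contract. `ToyDoc`, `make_doc`, `count_words` and `lowercase` are hypothetical stand-ins and are not spaCy classes or functions.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class ToyDoc:
    """Hypothetical stand-in for a Doc: tokens plus free-form annotations."""
    words: List[str]
    annotations: Dict[str, object] = field(default_factory=dict)


def make_doc(text: str) -> ToyDoc:
    # Stand-in tokenizer: split on whitespace only
    return ToyDoc(words=text.split())


def count_words(doc: ToyDoc) -> ToyDoc:
    doc.annotations["n_words"] = len(doc.words)
    return doc


def lowercase(doc: ToyDoc) -> ToyDoc:
    doc.annotations["lower"] = [w.lower() for w in doc.words]
    return doc


# Ordered (name, callable) pairs, analogous to the pipeline iterated above
pipeline: List[Tuple[str, Callable[[ToyDoc], ToyDoc]]] = [
    ("count_words", count_words),
    ("lowercase", lowercase),
]

doc = make_doc("This is a sentence")
for name, proc in pipeline:
    doc = proc(doc)  # Each component receives the doc and returns it

print(doc.annotations)
```

The order of the pairs matters: a component can only rely on annotations written by the components that ran before it.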
|
||||
|
||||
The current processing pipeline is available as `nlp.pipeline`, which returns a
|
||||
|
@ -473,7 +473,7 @@ only being able to modify it afterwards.
|
|||
>
|
||||
> @Language.component("my_component")
|
||||
> def my_component(doc):
|
||||
>     # do something to the doc here
|
||||
>     # Do something to the doc here
|
||||
>     return doc
|
||||
> ```
|
||||
|
||||
|
|
|
@ -511,21 +511,21 @@ from spacy.language import Language
|
|||
from spacy.matcher import Matcher
|
||||
from spacy.tokens import Token
|
||||
|
||||
# We're using a component factory because the component needs to be initialized
|
||||
# with the shared vocab via the nlp object
|
||||
# We're using a component factory because the component needs to be
|
||||
# initialized with the shared vocab via the nlp object
|
||||
@Language.factory("html_merger")
|
||||
def create_bad_html_merger(nlp, name):
|
||||
    return BadHTMLMerger(nlp)
|
||||
    return BadHTMLMerger(nlp.vocab)
|
||||
|
||||
class BadHTMLMerger:
|
||||
    def __init__(self, nlp):
|
||||
    def __init__(self, vocab):
|
||||
        patterns = [
|
||||
            [{"ORTH": "<"}, {"LOWER": "br"}, {"ORTH": ">"}],
|
||||
            [{"ORTH": "<"}, {"LOWER": "br/"}, {"ORTH": ">"}],
|
||||
        ]
|
||||
        # Register a new token extension to flag bad HTML
|
||||
        Token.set_extension("bad_html", default=False)
|
||||
        self.matcher = Matcher(nlp.vocab)
|
||||
        self.matcher = Matcher(vocab)
|
||||
        self.matcher.add("BAD_HTML", patterns)
|
||||
|
||||
    def __call__(self, doc):
|
||||
|
|
|
@ -792,7 +792,7 @@ you save the transformer outputs for later use.
|
|||
|
||||
<!-- TODO:
|
||||
|
||||
<Project id="en_core_bert">
|
||||
<Project id="en_core_trf_lg">
|
||||
|
||||
Try out a BERT-based model pipeline using this project template: swap in your
|
||||
data, edit the settings and hyperparameters and train, evaluate, package and
|
||||
|
|
|
@ -66,7 +66,7 @@ menu:
|
|||
- **Architectures: ** [TransformerModel](/api/architectures#TransformerModel),
|
||||
[Tok2VecListener](/api/architectures#transformers-Tok2VecListener),
|
||||
[Tok2VecTransformer](/api/architectures#Tok2VecTransformer)
|
||||
- **Models:** [`en_core_bert_sm`](/models/en)
|
||||
- **Models:** [`en_core_trf_lg_sm`](/models/en)
|
||||
- **Implementation:**
|
||||
[`spacy-transformers`](https://github.com/explosion/spacy-transformers)
|
||||
|
||||
|
@ -293,7 +293,8 @@ format for documenting argument and return types.
|
|||
|
||||
- **Usage: ** [Embeddings & Transformers](/usage/embeddings-transformers),
|
||||
[Training models](/usage/training), [Projects](/usage/projects),
|
||||
[Custom pipeline components](/usage/processing-pipelines#custom-components)
|
||||
[Custom pipeline components](/usage/processing-pipelines#custom-components),
|
||||
[Custom tokenizers](/usage/linguistic-features#custom-tokenizer)
|
||||
- **API Reference: ** [Library architecture](/api),
|
||||
[Model architectures](/api/architectures), [Data formats](/api/data-formats)
|
||||
- **New Classes: ** [`Example`](/api/example), [`Tok2Vec`](/api/tok2vec),
|
||||
|
|
|
@ -363,7 +363,7 @@ body [id]:target
|
|||
color: var(--color-red-medium)
|
||||
background: var(--color-red-transparent)
|
||||
|
||||
&.italic
|
||||
&.italic, &.comment
|
||||
font-style: italic
|
||||
|
||||
|
||||
|
@ -384,9 +384,11 @@ body [id]:target
|
|||
// Settings for ini syntax (config files)
|
||||
[class*="language-ini"]
|
||||
color: var(--syntax-comment)
|
||||
font-style: italic !important
|
||||
|
||||
.token
|
||||
color: var(--color-subtle)
|
||||
font-style: normal !important
|
||||
|
||||
|
||||
.gatsby-highlight-code-line
|
||||
|
@ -424,6 +426,7 @@ body [id]:target
|
|||
|
||||
.cm-comment
|
||||
color: var(--syntax-comment)
|
||||
font-style: italic
|
||||
|
||||
.cm-keyword
|
||||
color: var(--syntax-keyword)
|
||||
|
|