Update docs and consistency [ci skip]

This commit is contained in:
Ines Montani 2020-08-21 13:49:18 +02:00
parent 52bd3a8b48
commit aa6a7cd6e7
11 changed files with 43 additions and 40 deletions

View File

@ -5,7 +5,7 @@
Thanks for your interest in contributing to spaCy 🎉 The project is maintained
by [@honnibal](https://github.com/honnibal) and [@ines](https://github.com/ines),
and we'll do our best to help you get started. This page will give you a quick
overview of how things are organised and most importantly, how to get involved.
overview of how things are organized and most importantly, how to get involved.
## Table of contents
@ -195,7 +195,7 @@ modules in `.py` files, not Cython modules in `.pyx` and `.pxd` files.**
### Code formatting
[`black`](https://github.com/ambv/black) is an opinionated Python code
formatter, optimised to produce readable code and small diffs. You can run
formatter, optimized to produce readable code and small diffs. You can run
`black` from the command-line, or via your code editor. For example, if you're
using [Visual Studio Code](https://code.visualstudio.com/), you can add the
following to your `settings.json` to use `black` for formatting and auto-format
@ -286,7 +286,7 @@ Code that interacts with the file-system should accept objects that follow the
If the function is user-facing and takes a path as an argument, it should check
whether the path is provided as a string. Strings should be converted to
`pathlib.Path` objects. Serialization and deserialization functions should always
accept **file-like objects**, as it makes the library io-agnostic. Working on
accept **file-like objects**, as it makes the library IO-agnostic. Working on
buffers makes the code more general, easier to test, and compatible with Python
3's asynchronous IO.
@ -384,7 +384,7 @@ of Python and C++, with additional complexity and syntax from numpy. The
many "traps for new players". Working in Cython is very rewarding once you're
over the initial learning curve. As with C and C++, the first way you write
something in Cython will often be the performance-optimal approach. In contrast,
Python optimisation generally requires a lot of experimentation. Is it faster to
Python optimization generally requires a lot of experimentation. Is it faster to
have an `if item in my_dict` check, or to use `.get()`? What about `try`/`except`?
Does this numpy operation create a copy? There's no way to guess the answers to
these questions, and you'll usually be dissatisfied with your results — so
@ -400,7 +400,7 @@ Python. If it's not fast enough the first time, just switch to Cython.
- [PEP 8 Style Guide for Python Code](https://www.python.org/dev/peps/pep-0008/) (python.org)
- [Official Cython documentation](http://docs.cython.org/en/latest/) (cython.org)
- [Writing C in Cython](https://explosion.ai/blog/writing-c-in-cython) (explosion.ai)
- [Multi-threading spaCys parser and named entity recogniser](https://explosion.ai/blog/multithreading-with-cython) (explosion.ai)
- [Multi-threading spaCys parser and named entity recognizer](https://explosion.ai/blog/multithreading-with-cython) (explosion.ai)
## Adding tests
@ -412,7 +412,7 @@ name. For example, tests for the `Tokenizer` can be found in
all test files and test functions need to be prefixed with `test_`.
When adding tests, make sure to use descriptive names, keep the code short and
concise and only test for one behaviour at a time. Try to `parametrize` test
concise and only test for one behavior at a time. Try to `parametrize` test
cases wherever possible, use our pre-defined fixtures for spaCy components and
avoid unnecessary imports.

View File

@ -49,9 +49,8 @@ It's commercial open-source software, released under the MIT license.
## 💬 Where to ask questions
The spaCy project is maintained by [@honnibal](https://github.com/honnibal) and
[@ines](https://github.com/ines), along with core contributors
[@svlandeg](https://github.com/svlandeg) and
The spaCy project is maintained by [@honnibal](https://github.com/honnibal),
[@ines](https://github.com/ines), [@svlandeg](https://github.com/svlandeg) and
[@adrianeboyd](https://github.com/adrianeboyd). Please understand that we won't
be able to provide individual support via email. We also believe that help is
much more valuable if it's shared publicly, so that more people can benefit from

View File

@ -47,9 +47,9 @@ cdef class Tokenizer:
`infix_finditer` (callable): A function matching the signature of
`re.compile(string).finditer` to find infixes.
token_match (callable): A boolean function matching strings to be
recognised as tokens.
recognized as tokens.
url_match (callable): A boolean function matching strings to be
recognised as tokens after considering prefixes and suffixes.
recognized as tokens after considering prefixes and suffixes.
EXAMPLE:
>>> tokenizer = Tokenizer(nlp.vocab)

View File

@ -184,7 +184,7 @@ yourself. For details on how to get started with training your own model, check
out the [training quickstart](/usage/training#quickstart).
<!-- TODO:
<Project id="en_core_bert">
<Project id="en_core_trf_lg">
The easiest way to get started is to clone a transformers-based project
template. Swap in your data, edit the settings and hyperparameters and train,

View File

@ -368,7 +368,7 @@ from is called `spacy`. So, when using spaCy, never call anything else `spacy`.
</Accordion>
<Accordion title="NER model doesn't recognise other entities anymore after training" id="catastrophic-forgetting">
<Accordion title="NER model doesn't recognize other entities anymore after training" id="catastrophic-forgetting">
If your training data only contained new entities and you didn't mix in any
examples the model previously recognized, it can cause the model to "forget"

View File

@ -429,7 +429,7 @@ nlp = spacy.load("en_core_web_sm")
doc = nlp("fb is hiring a new vice president of global policy")
ents = [(e.text, e.start_char, e.end_char, e.label_) for e in doc.ents]
print('Before', ents)
# the model didn't recognise "fb" as an entity :(
# The model didn't recognize "fb" as an entity :(
fb_ent = Span(doc, 0, 1, label="ORG") # create a Span for the new entity
doc.ents = list(doc.ents) + [fb_ent]
@ -558,11 +558,11 @@ import spacy
nlp = spacy.load("my_custom_el_model")
doc = nlp("Ada Lovelace was born in London")
# document level
# Document level
ents = [(e.text, e.label_, e.kb_id_) for e in doc.ents]
print(ents) # [('Ada Lovelace', 'PERSON', 'Q7259'), ('London', 'GPE', 'Q84')]
# token level
# Token level
ent_ada_0 = [doc[0].text, doc[0].ent_type_, doc[0].ent_kb_id_]
ent_ada_1 = [doc[1].text, doc[1].ent_type_, doc[1].ent_kb_id_]
ent_london_5 = [doc[5].text, doc[5].ent_type_, doc[5].ent_kb_id_]
@ -914,12 +914,12 @@ from spacy.lang.char_classes import ALPHA, ALPHA_LOWER, ALPHA_UPPER
from spacy.lang.char_classes import CONCAT_QUOTES, LIST_ELLIPSES, LIST_ICONS
from spacy.util import compile_infix_regex
# default tokenizer
# Default tokenizer
nlp = spacy.load("en_core_web_sm")
doc = nlp("mother-in-law")
print([t.text for t in doc]) # ['mother', '-', 'in', '-', 'law']
# modify tokenizer infix patterns
# Modify tokenizer infix patterns
infixes = (
LIST_ELLIPSES
+ LIST_ICONS
@ -929,8 +929,8 @@ infixes = (
al=ALPHA_LOWER, au=ALPHA_UPPER, q=CONCAT_QUOTES
),
r"(?<=[{a}]),(?=[{a}])".format(a=ALPHA),
# EDIT: commented out regex that splits on hyphens between letters:
#r"(?<=[{a}])(?:{h})(?=[{a}])".format(a=ALPHA, h=HYPHENS),
# ✅ Commented out regex that splits on hyphens between letters:
# r"(?<=[{a}])(?:{h})(?=[{a}])".format(a=ALPHA, h=HYPHENS),
r"(?<=[{a}0-9])[:<>=/](?=[{a}])".format(a=ALPHA),
]
)

View File

@ -108,11 +108,11 @@ class, or defined within a [model package](/usage/saving-loading#models).
>
> [components.tagger]
> factory = "tagger"
> # settings for the tagger component
> # Settings for the tagger component
>
> [components.parser]
> factory = "parser"
> # settings for the parser component
> # Settings for the parser component
> ```
When you load a model, spaCy first consults the model's
@ -171,11 +171,11 @@ lang = "en"
pipeline = ["tagger", "parser", "ner"]
data_path = "path/to/en_core_web_sm/en_core_web_sm-2.0.0"
cls = spacy.util.get_lang_class(lang) # 1. Get Language instance, e.g. English()
nlp = cls() # 2. Initialize it
cls = spacy.util.get_lang_class(lang) # 1. Get Language class, e.g. English
nlp = cls() # 2. Initialize it
for name in pipeline:
nlp.add_pipe(name) # 3. Add the component to the pipeline
nlp.from_disk(model_data_path) # 4. Load in the binary data
nlp.add_pipe(name) # 3. Add the component to the pipeline
nlp.from_disk(model_data_path) # 4. Load in the binary data
```
When you call `nlp` on a text, spaCy will **tokenize** it and then **call each
@ -187,9 +187,9 @@ which is then processed by the component next in the pipeline.
```python
### The pipeline under the hood
doc = nlp.make_doc("This is a sentence") # create a Doc from raw text
for name, proc in nlp.pipeline: # iterate over components in order
doc = proc(doc) # apply each component
doc = nlp.make_doc("This is a sentence") # Create a Doc from raw text
for name, proc in nlp.pipeline: # Iterate over components in order
doc = proc(doc) # Apply each component
```
The current processing pipeline is available as `nlp.pipeline`, which returns a
@ -473,7 +473,7 @@ only being able to modify it afterwards.
>
> @Language.component("my_component")
> def my_component(doc):
> # do something to the doc here
> # Do something to the doc here
> return doc
> ```

View File

@ -511,21 +511,21 @@ from spacy.language import Language
from spacy.matcher import Matcher
from spacy.tokens import Token
# We're using a component factory because the component needs to be initialized
# with the shared vocab via the nlp object
# We're using a component factory because the component needs to be
# initialized with the shared vocab via the nlp object
@Language.factory("html_merger")
def create_bad_html_merger(nlp, name):
return BadHTMLMerger(nlp)
return BadHTMLMerger(nlp.vocab)
class BadHTMLMerger:
def __init__(self, nlp):
def __init__(self, vocab):
patterns = [
[{"ORTH": "<"}, {"LOWER": "br"}, {"ORTH": ">"}],
[{"ORTH": "<"}, {"LOWER": "br/"}, {"ORTH": ">"}],
]
# Register a new token extension to flag bad HTML
Token.set_extension("bad_html", default=False)
self.matcher = Matcher(nlp.vocab)
self.matcher = Matcher(vocab)
self.matcher.add("BAD_HTML", patterns)
def __call__(self, doc):

View File

@ -792,7 +792,7 @@ you save the transformer outputs for later use.
<!-- TODO:
<Project id="en_core_bert">
<Project id="en_core_trf_lg">
Try out a BERT-based model pipeline using this project template: swap in your
data, edit the settings and hyperparameters and train, evaluate, package and

View File

@ -66,7 +66,7 @@ menu:
- **Architectures: ** [TransformerModel](/api/architectures#TransformerModel),
[Tok2VecListener](/api/architectures#transformers-Tok2VecListener),
[Tok2VecTransformer](/api/architectures#Tok2VecTransformer)
- **Models:** [`en_core_bert_sm`](/models/en)
- **Models:** [`en_core_trf_lg_sm`](/models/en)
- **Implementation:**
[`spacy-transformers`](https://github.com/explosion/spacy-transformers)
@ -293,7 +293,8 @@ format for documenting argument and return types.
- **Usage: ** [Embeddings & Transformers](/usage/embeddings-transformers),
[Training models](/usage/training), [Projects](/usage/projects),
[Custom pipeline components](/usage/processing-pipelines#custom-components)
[Custom pipeline components](/usage/processing-pipelines#custom-components),
[Custom tokenizers](/usage/linguistic-features#custom-tokenizer)
- **API Reference: ** [Library architecture](/api),
[Model architectures](/api/architectures), [Data formats](/api/data-formats)
- **New Classes: ** [`Example`](/api/example), [`Tok2Vec`](/api/tok2vec),

View File

@ -363,7 +363,7 @@ body [id]:target
color: var(--color-red-medium)
background: var(--color-red-transparent)
&.italic
&.italic, &.comment
font-style: italic
@ -384,9 +384,11 @@ body [id]:target
// Settings for ini syntax (config files)
[class*="language-ini"]
color: var(--syntax-comment)
font-style: italic !important
.token
color: var(--color-subtle)
font-style: normal !important
.gatsby-highlight-code-line
@ -424,6 +426,7 @@ body [id]:target
.cm-comment
color: var(--syntax-comment)
font-style: italic
.cm-keyword
color: var(--syntax-keyword)