From aa6a7cd6e72bfd8515b7c3b6ddb4c0951c6513e6 Mon Sep 17 00:00:00 2001
From: Ines Montani <ines@ines.io>
Date: Fri, 21 Aug 2020 13:49:18 +0200
Subject: [PATCH] Update docs and consistency [ci skip]

---
 CONTRIBUTING.md                               | 12 +++++------
 README.md                                     |  5 ++---
 spacy/tokenizer.pyx                           |  4 ++--
 website/docs/usage/embeddings-transformers.md |  2 +-
 website/docs/usage/index.md                   |  4 ++--
 website/docs/usage/linguistic-features.md     | 14 ++++++-------
 website/docs/usage/processing-pipelines.md    | 20 +++++++++----------
 website/docs/usage/rule-based-matching.md     | 10 +++++-----
 website/docs/usage/training.md                |  2 +-
 website/docs/usage/v3.md                      |  5 +++--
 website/src/styles/layout.sass                |  5 ++++-
 11 files changed, 43 insertions(+), 40 deletions(-)
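
Note on the `linguistic-features.md` hunk: the example there keeps the hyphen infix pattern commented out so that "mother-in-law" stays a single token. The snippet below is a minimal sketch of that effect using only the standard `re` module, not spaCy's tokenizer. `ALPHA` and `HYPHENS` are simplified stand-ins for the character classes in `spacy.lang.char_classes`, and `split_on_infixes` is a hypothetical helper written for this note.

```python
import re

# Simplified stand-ins (assumptions): spaCy's real ALPHA and HYPHENS
# classes cover far more characters than these.
ALPHA = "a-zA-Z"
HYPHENS = "-"

# The two patterns from the docs example, built with the stand-ins above.
comma_infix = r"(?<=[{a}]),(?=[{a}])".format(a=ALPHA)
hyphen_infix = r"(?<=[{a}])(?:{h})(?=[{a}])".format(a=ALPHA, h=HYPHENS)


def split_on_infixes(text, patterns):
    """Split text at every infix match, keeping each infix as its own piece."""
    if not patterns:
        return [text]
    regex = re.compile("|".join(patterns))
    pieces = []
    start = 0
    for match in regex.finditer(text):
        if match.start() > start:
            pieces.append(text[start:match.start()])
        pieces.append(match.group())
        start = match.end()
    if start < len(text):
        pieces.append(text[start:])
    return pieces


# With the hyphen pattern active, the word is split at each hyphen.
print(split_on_infixes("mother-in-law", [comma_infix, hyphen_infix]))
# ['mother', '-', 'in', '-', 'law']

# With the hyphen pattern left out, the word stays whole.
print(split_on_infixes("mother-in-law", [comma_infix]))
# ['mother-in-law']
```

In spaCy itself the pattern list is compiled with `compile_infix_regex` and assigned to the tokenizer, as the docs example in the hunk shows; the helper above only imitates the splitting step.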

diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
index 81cfbf8cb..0abde2abf 100644
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -5,7 +5,7 @@
 Thanks for your interest in contributing to spaCy 🎉 The project is maintained
 by [@honnibal](https://github.com/honnibal) and [@ines](https://github.com/ines),
 and we'll do our best to help you get started. This page will give you a quick
-overview of how things are organised and most importantly, how to get involved.
+overview of how things are organized and most importantly, how to get involved.
 
 ## Table of contents
 
@@ -195,7 +195,7 @@ modules in `.py` files, not Cython modules in `.pyx` and `.pxd` files.**
 ### Code formatting
 
 [`black`](https://github.com/ambv/black) is an opinionated Python code
-formatter, optimised to produce readable code and small diffs. You can run
+formatter, optimized to produce readable code and small diffs. You can run
 `black` from the command-line, or via your code editor. For example, if you're
 using [Visual Studio Code](https://code.visualstudio.com/), you can add the
 following to your `settings.json` to use `black` for formatting and auto-format
@@ -286,7 +286,7 @@ Code that interacts with the file-system should accept objects that follow the
 If the function is user-facing and takes a path as an argument, it should check
 whether the path is provided as a string. Strings should be converted to
 `pathlib.Path` objects. Serialization and deserialization functions should always
-accept **file-like objects**, as it makes the library io-agnostic. Working on
+accept **file-like objects**, as it makes the library IO-agnostic. Working on
 buffers makes the code more general, easier to test, and compatible with Python
 3's asynchronous IO.
 
@@ -384,7 +384,7 @@ of Python and C++, with additional complexity and syntax from numpy. The
 many "traps for new players". Working in Cython is very rewarding once you're
 over the initial learning curve. As with C and C++, the first way you write
 something in Cython will often be the performance-optimal approach. In contrast,
-Python optimisation generally requires a lot of experimentation. Is it faster to
+Python optimization generally requires a lot of experimentation. Is it faster to
 have an `if item in my_dict` check, or to use `.get()`? What about `try`/`except`?
 Does this numpy operation create a copy? There's no way to guess the answers to
 these questions, and you'll usually be dissatisfied with your results — so
@@ -400,7 +400,7 @@ Python. If it's not fast enough the first time, just switch to Cython.
 - [PEP 8 Style Guide for Python Code](https://www.python.org/dev/peps/pep-0008/) (python.org)
 - [Official Cython documentation](http://docs.cython.org/en/latest/) (cython.org)
 - [Writing C in Cython](https://explosion.ai/blog/writing-c-in-cython) (explosion.ai)
-- [Multi-threading spaCy’s parser and named entity recogniser](https://explosion.ai/blog/multithreading-with-cython) (explosion.ai)
+- [Multi-threading spaCy’s parser and named entity recognizer](https://explosion.ai/blog/multithreading-with-cython) (explosion.ai)
 
 ## Adding tests
 
@@ -412,7 +412,7 @@ name. For example, tests for the `Tokenizer` can be found in
 all test files and test functions need to be prefixed with `test_`.
 
 When adding tests, make sure to use descriptive names, keep the code short and
-concise and only test for one behaviour at a time. Try to `parametrize` test
+concise and only test for one behavior at a time. Try to `parametrize` test
 cases wherever possible, use our pre-defined fixtures for spaCy components and
 avoid unnecessary imports.
 
diff --git a/README.md b/README.md
index 1fece1e5a..cef2a1fdd 100644
--- a/README.md
+++ b/README.md
@@ -49,9 +49,8 @@ It's commercial open-source software, released under the MIT license.
 
 ## 💬 Where to ask questions
 
-The spaCy project is maintained by [@honnibal](https://github.com/honnibal) and
-[@ines](https://github.com/ines), along with core contributors
-[@svlandeg](https://github.com/svlandeg) and
+The spaCy project is maintained by [@honnibal](https://github.com/honnibal),
+[@ines](https://github.com/ines), [@svlandeg](https://github.com/svlandeg) and
 [@adrianeboyd](https://github.com/adrianeboyd). Please understand that we won't
 be able to provide individual support via email. We also believe that help is
 much more valuable if it's shared publicly, so that more people can benefit from
diff --git a/spacy/tokenizer.pyx b/spacy/tokenizer.pyx
index a13299fff..9fda1800b 100644
--- a/spacy/tokenizer.pyx
+++ b/spacy/tokenizer.pyx
@@ -47,9 +47,9 @@ cdef class Tokenizer:
         `infix_finditer` (callable): A function matching the signature of
             `re.compile(string).finditer` to find infixes.
         token_match (callable): A boolean function matching strings to be
-            recognised as tokens.
+            recognized as tokens.
         url_match (callable): A boolean function matching strings to be
-            recognised as tokens after considering prefixes and suffixes.
+            recognized as tokens after considering prefixes and suffixes.
 
         EXAMPLE:
             >>> tokenizer = Tokenizer(nlp.vocab)
diff --git a/website/docs/usage/embeddings-transformers.md b/website/docs/usage/embeddings-transformers.md
index 70562cf7e..33385ff51 100644
--- a/website/docs/usage/embeddings-transformers.md
+++ b/website/docs/usage/embeddings-transformers.md
@@ -184,7 +184,7 @@ yourself. For details on how to get started with training your own model, check
 out the [training quickstart](/usage/training#quickstart).
 
 <!-- TODO:
-<Project id="en_core_bert">
+<Project id="en_core_trf_lg">
 
 The easiest way to get started is to clone a transformers-based project
 template. Swap in your data, edit the settings and hyperparameters and train,
diff --git a/website/docs/usage/index.md b/website/docs/usage/index.md
index c90c23b28..ede4ab6f9 100644
--- a/website/docs/usage/index.md
+++ b/website/docs/usage/index.md
@@ -169,7 +169,7 @@ $ python setup.py build_ext --inplace           # compile spaCy
 
 Compared to regular install via pip, the
 [`requirements.txt`](https://github.com/explosion/spaCy/tree/master/requirements.txt)
-additionally installs developer dependencies such as Cython. See the 
+additionally installs developer dependencies such as Cython. See the
 [quickstart widget](#quickstart) to get the right commands for your platform and
 Python version.
 
@@ -368,7 +368,7 @@ from is called `spacy`. So, when using spaCy, never call anything else `spacy`.
 
 </Accordion>
 
-<Accordion title="NER model doesn't recognise other entities anymore after training" id="catastrophic-forgetting">
+<Accordion title="NER model doesn't recognize other entities anymore after training" id="catastrophic-forgetting">
 
 If your training data only contained new entities and you didn't mix in any
 examples the model previously recognized, it can cause the model to "forget"
diff --git a/website/docs/usage/linguistic-features.md b/website/docs/usage/linguistic-features.md
index 3aa0df7b4..f52c2b2ad 100644
--- a/website/docs/usage/linguistic-features.md
+++ b/website/docs/usage/linguistic-features.md
@@ -429,7 +429,7 @@ nlp = spacy.load("en_core_web_sm")
 doc = nlp("fb is hiring a new vice president of global policy")
 ents = [(e.text, e.start_char, e.end_char, e.label_) for e in doc.ents]
 print('Before', ents)
-# the model didn't recognise "fb" as an entity :(
+# The model didn't recognize "fb" as an entity :(
 
 fb_ent = Span(doc, 0, 1, label="ORG") # create a Span for the new entity
 doc.ents = list(doc.ents) + [fb_ent]
@@ -558,11 +558,11 @@ import spacy
 nlp = spacy.load("my_custom_el_model")
 doc = nlp("Ada Lovelace was born in London")
 
-# document level
+# Document level
 ents = [(e.text, e.label_, e.kb_id_) for e in doc.ents]
 print(ents)  # [('Ada Lovelace', 'PERSON', 'Q7259'), ('London', 'GPE', 'Q84')]
 
-# token level
+# Token level
 ent_ada_0 = [doc[0].text, doc[0].ent_type_, doc[0].ent_kb_id_]
 ent_ada_1 = [doc[1].text, doc[1].ent_type_, doc[1].ent_kb_id_]
 ent_london_5 = [doc[5].text, doc[5].ent_type_, doc[5].ent_kb_id_]
@@ -914,12 +914,12 @@ from spacy.lang.char_classes import ALPHA, ALPHA_LOWER, ALPHA_UPPER
 from spacy.lang.char_classes import CONCAT_QUOTES, LIST_ELLIPSES, LIST_ICONS
 from spacy.util import compile_infix_regex
 
-# default tokenizer
+# Default tokenizer
 nlp = spacy.load("en_core_web_sm")
 doc = nlp("mother-in-law")
 print([t.text for t in doc]) # ['mother', '-', 'in', '-', 'law']
 
-# modify tokenizer infix patterns
+# Modify tokenizer infix patterns
 infixes = (
     LIST_ELLIPSES
     + LIST_ICONS
@@ -929,8 +929,8 @@ infixes = (
             al=ALPHA_LOWER, au=ALPHA_UPPER, q=CONCAT_QUOTES
         ),
         r"(?<=[{a}]),(?=[{a}])".format(a=ALPHA),
-        # EDIT: commented out regex that splits on hyphens between letters:
-        #r"(?<=[{a}])(?:{h})(?=[{a}])".format(a=ALPHA, h=HYPHENS),
+        # ✅ Commented out regex that splits on hyphens between letters:
+        # r"(?<=[{a}])(?:{h})(?=[{a}])".format(a=ALPHA, h=HYPHENS),
         r"(?<=[{a}0-9])[:<>=/](?=[{a}])".format(a=ALPHA),
     ]
 )
diff --git a/website/docs/usage/processing-pipelines.md b/website/docs/usage/processing-pipelines.md
index bc8c990e8..a863c6c32 100644
--- a/website/docs/usage/processing-pipelines.md
+++ b/website/docs/usage/processing-pipelines.md
@@ -108,11 +108,11 @@ class, or defined within a [model package](/usage/saving-loading#models).
 >
 > [components.tagger]
 > factory = "tagger"
-> # settings for the tagger component
+> # Settings for the tagger component
 >
 > [components.parser]
 > factory = "parser"
-> # settings for the parser component
+> # Settings for the parser component
 > ```
 
 When you load a model, spaCy first consults the model's
@@ -171,11 +171,11 @@ lang = "en"
 pipeline = ["tagger", "parser", "ner"]
 data_path = "path/to/en_core_web_sm/en_core_web_sm-2.0.0"
 
-cls = spacy.util.get_lang_class(lang)   # 1. Get Language instance, e.g. English()
-nlp = cls()                             # 2. Initialize it
+cls = spacy.util.get_lang_class(lang)  # 1. Get Language class, e.g. English
+nlp = cls()                            # 2. Initialize it
 for name in pipeline:
-    nlp.add_pipe(name)                  # 3. Add the component to the pipeline
-nlp.from_disk(model_data_path)          # 4. Load in the binary data
+    nlp.add_pipe(name)                 # 3. Add the component to the pipeline
+nlp.from_disk(data_path)               # 4. Load in the binary data
 ```
 
 When you call `nlp` on a text, spaCy will **tokenize** it and then **call each
@@ -187,9 +187,9 @@ which is then processed by the component next in the pipeline.
 
 ```python
 ### The pipeline under the hood
-doc = nlp.make_doc("This is a sentence")   # create a Doc from raw text
-for name, proc in nlp.pipeline:             # iterate over components in order
-    doc = proc(doc)                         # apply each component
+doc = nlp.make_doc("This is a sentence")  # Create a Doc from raw text
+for name, proc in nlp.pipeline:           # Iterate over components in order
+    doc = proc(doc)                       # Apply each component
 ```
 
 The current processing pipeline is available as `nlp.pipeline`, which returns a
@@ -473,7 +473,7 @@ only being able to modify it afterwards.
 >
 > @Language.component("my_component")
 > def my_component(doc):
->    # do something to the doc here
+>    # Do something to the doc here
 >    return doc
 > ```
 
diff --git a/website/docs/usage/rule-based-matching.md b/website/docs/usage/rule-based-matching.md
index ce6625897..7fdce032e 100644
--- a/website/docs/usage/rule-based-matching.md
+++ b/website/docs/usage/rule-based-matching.md
@@ -511,21 +511,21 @@ from spacy.language import Language
 from spacy.matcher import Matcher
 from spacy.tokens import Token
 
-# We're using a component factory because the component needs to be initialized
-# with the shared vocab via the nlp object
+# We're using a component factory because the component needs to be
+# initialized with the shared vocab via the nlp object
 @Language.factory("html_merger")
 def create_bad_html_merger(nlp, name):
-    return BadHTMLMerger(nlp)
+    return BadHTMLMerger(nlp.vocab)
 
 class BadHTMLMerger:
-    def __init__(self, nlp):
+    def __init__(self, vocab):
         patterns = [
             [{"ORTH": "<"}, {"LOWER": "br"}, {"ORTH": ">"}],
             [{"ORTH": "<"}, {"LOWER": "br/"}, {"ORTH": ">"}],
         ]
         # Register a new token extension to flag bad HTML
         Token.set_extension("bad_html", default=False)
-        self.matcher = Matcher(nlp.vocab)
+        self.matcher = Matcher(vocab)
         self.matcher.add("BAD_HTML", patterns)
 
     def __call__(self, doc):
diff --git a/website/docs/usage/training.md b/website/docs/usage/training.md
index 892fb7f48..116561cd2 100644
--- a/website/docs/usage/training.md
+++ b/website/docs/usage/training.md
@@ -792,7 +792,7 @@ you save the transformer outputs for later use.
 
 <!-- TODO:
 
-<Project id="en_core_bert">
+<Project id="en_core_trf_lg">
 
 Try out a BERT-based model pipeline using this project template: swap in your
 data, edit the settings and hyperparameters and train, evaluate, package and
diff --git a/website/docs/usage/v3.md b/website/docs/usage/v3.md
index 3111bf38e..d71ecba31 100644
--- a/website/docs/usage/v3.md
+++ b/website/docs/usage/v3.md
@@ -66,7 +66,7 @@ menu:
 - **Architectures: ** [TransformerModel](/api/architectures#TransformerModel),
   [Tok2VecListener](/api/architectures#transformers-Tok2VecListener),
   [Tok2VecTransformer](/api/architectures#Tok2VecTransformer)
-- **Models:** [`en_core_bert_sm`](/models/en)
+- **Models:** [`en_core_trf_lg`](/models/en)
 - **Implementation:**
   [`spacy-transformers`](https://github.com/explosion/spacy-transformers)
 
@@ -293,7 +293,8 @@ format for documenting argument and return types.
 
 - **Usage: ** [Embeddings & Transformers](/usage/embeddings-transformers),
   [Training models](/usage/training), [Projects](/usage/projects),
-  [Custom pipeline components](/usage/processing-pipelines#custom-components)
+  [Custom pipeline components](/usage/processing-pipelines#custom-components),
+  [Custom tokenizers](/usage/linguistic-features#custom-tokenizer)
 - **API Reference: ** [Library architecture](/api),
   [Model architectures](/api/architectures), [Data formats](/api/data-formats)
 - **New Classes: ** [`Example`](/api/example), [`Tok2Vec`](/api/tok2vec),
diff --git a/website/src/styles/layout.sass b/website/src/styles/layout.sass
index 775523190..b71eccd80 100644
--- a/website/src/styles/layout.sass
+++ b/website/src/styles/layout.sass
@@ -363,7 +363,7 @@ body [id]:target
         color: var(--color-red-medium)
         background: var(--color-red-transparent)
 
-    &.italic
+    &.italic, &.comment
         font-style: italic
 
 
@@ -384,9 +384,11 @@ body [id]:target
 // Settings for ini syntax (config files)
 [class*="language-ini"]
     color: var(--syntax-comment)
+    font-style: italic !important
 
     .token
         color: var(--color-subtle)
+        font-style: normal !important
 
 
 .gatsby-highlight-code-line
@@ -424,6 +426,7 @@ body [id]:target
 
     .cm-comment
         color: var(--syntax-comment)
+        font-style: italic
 
     .cm-keyword
         color: var(--syntax-keyword)