<a href="https://explosion.ai"><img src="https://explosion.ai/assets/img/logo.svg" width="125" height="125" align="right" /></a>

# Contribute to spaCy

Thanks for your interest in contributing to spaCy 🎉 The project is maintained
by [@honnibal](https://github.com/honnibal) and [@ines](https://github.com/ines),
and we'll do our best to help you get started. This page will give you a quick
overview of how things are organized and, most importantly, how to get involved.

## Table of contents

1. [Issues and bug reports](#issues-and-bug-reports)
2. [Contributing to the code base](#contributing-to-the-code-base)
3. [Code conventions](#code-conventions)
4. [Adding tests](#adding-tests)
5. [Updating the website](#updating-the-website)
6. [Publishing extensions and plugins](#publishing-spacy-extensions-and-plugins)
7. [Code of conduct](#code-of-conduct)

## Issues and bug reports

First, [do a quick search](https://github.com/issues?q=+is%3Aissue+user%3Aexplosion)
to see if the issue has already been reported. If so, it's often better to just
leave a comment on an existing issue, rather than creating a new one. Old issues
also often include helpful tips and solutions to common problems. You should
also check the [troubleshooting guide](https://spacy.io/usage/#troubleshooting)
to see if your problem is already listed there.

If you're looking for help with your code, consider posting a question on
[Stack Overflow](http://stackoverflow.com/questions/tagged/spacy) instead. If you
tag it `spacy` and `python`, more people will see it and hopefully be able to
help. Please understand that we won't be able to provide individual support via
email. We also believe that help is much more valuable if it's **shared publicly**,
so that more people can benefit from it.

### Submitting issues

When opening an issue, use a **descriptive title** and include your
**environment** (operating system, Python version, spaCy version). Our
[issue template](https://github.com/explosion/spaCy/issues/new) helps you
remember the most important details to include. If you've discovered a bug, you
can also submit a [regression test](#fixing-bugs) straight away. When you're
opening an issue to report the bug, simply refer to your pull request in the
issue body. A few more tips:

- **Describing your issue:** Try to provide as many details as possible. What
  exactly goes wrong? _How_ is it failing? Is there an error?
  "XY doesn't work" usually isn't that helpful for tracking down problems. Always
  remember to include the code you ran and, if possible, extract only the
  relevant parts instead of dumping your entire script. This will make it easier
  for us to reproduce the error.

- **Getting info about your spaCy installation and environment:** If you're
  using spaCy v1.7+, you can use the command line interface to print details and
  even format them as Markdown to copy-paste into GitHub issues:
  `python -m spacy info --markdown`.

- **Checking the model compatibility:** If you're having problems with a
  [statistical model](https://spacy.io/models), it may be because the
  model is incompatible with your spaCy installation. In spaCy v2.0+, you can check
  this on the command line by running `python -m spacy validate`.

- **Sharing a model's output, like dependencies and entities:** spaCy v2.0+
  comes with [built-in visualizers](https://spacy.io/usage/visualizers) that
  you can run from within your script or a Jupyter notebook. For some issues, it's
  helpful to **include a screenshot** of the visualization. You can simply drag and
  drop the image into GitHub's editor and it will be uploaded and included.

- **Sharing long blocks of code or logs:** If you need to include long code,
  logs or tracebacks, you can wrap them in `<details>` and `</details>`. This
  [collapses the content](https://developer.mozilla.org/en/docs/Web/HTML/Element/details)
  so it only becomes visible on click, making the issue easier to read and follow.

### Issue labels

[See this page](https://github.com/explosion/spaCy/labels) for an overview of
the system we use to tag our issues and pull requests.

## Contributing to the code base

You don't have to be an NLP expert or Python pro to contribute, and we're happy
to help you get started. If you're new to spaCy, a good place to start is the
[spaCy 101 guide](https://spacy.io/usage/spacy-101) and the
[`help wanted (easy)`](https://github.com/explosion/spaCy/issues?q=is%3Aissue+is%3Aopen+label%3A%22help+wanted+%28easy%29%22)
label, which we use to tag bugs and feature requests that are easy and
self-contained. If you've decided to take on one of these problems and you're
making good progress, don't forget to add a quick comment to the issue. You can
also use the issue to ask questions, or share your work in progress.

### What belongs in spaCy?

Every library has a different inclusion philosophy — a policy of what should be
shipped in the core library, and what could be provided in other packages. Our
philosophy is to prefer a smaller core library. We generally ask the following
questions:

- **What would this feature look like if implemented in a separate package?**
  Some features would be very difficult to implement externally – for example,
  changes to spaCy's built-in methods. In contrast, a library of word
  alignment functions could easily live as a separate package that depended on
  spaCy — there's little difference between writing `import word_aligner` and
  `import spacy.word_aligner`. spaCy v2.0+ makes it easy to implement
  [custom pipeline components](https://spacy.io/usage/processing-pipelines#custom-components),
  and add your own attributes, properties and methods to the `Doc`, `Token` and
  `Span`. If you're looking to implement a new spaCy feature, starting with a
  custom component package is usually the best strategy. You won't have to worry
  about spaCy's internals and you can test your module in an isolated
  environment. And if it works well, we can always integrate it into the core
  library later.

- **Would the feature be easier to implement if it relied on "heavy" dependencies spaCy doesn't currently require?**
  Python has a very rich ecosystem. Libraries like scikit-learn, SciPy, Gensim or
  TensorFlow/Keras do lots of useful things — but we don't want to have them as
  dependencies. If the feature requires functionality in one of these libraries,
  it's probably better to break it out into a different package.

- **Is the feature orthogonal to the current spaCy functionality, or overlapping?**
  spaCy strongly prefers to avoid having six different ways of doing the same thing.
  As better techniques are developed, we prefer to drop support for "the old way".
  However, it's rare that one approach _entirely_ dominates another. It's very
  common that there's still a use-case for the "obsolete" approach. For instance,
  [WordNet](https://wordnet.princeton.edu/) is still very useful — but word
  vectors are better for most use-cases, and the two approaches to lexical
  semantics do a lot of the same things. spaCy therefore only supports word
  vectors, and support for WordNet is currently left for other packages.

- **Do you need the feature to get basic things done?** We do want spaCy to be
  at least somewhat self-contained. If we keep needing some feature in our
  recipes, that does provide some argument for bringing it "in house".

### Getting started

To make changes to spaCy's code base, you need to fork and then clone the GitHub
repository and build spaCy from source. You'll need to make sure that you have a
development environment consisting of a Python distribution including header
files, a compiler, [pip](https://pip.pypa.io/en/latest/installing/),
[virtualenv](https://virtualenv.pypa.io/en/stable/) and
[git](https://git-scm.com) installed. The compiler is usually the trickiest part.

```bash
python -m pip install -U pip
git clone https://github.com/explosion/spaCy
cd spaCy

python -m venv .env
source .env/bin/activate
export PYTHONPATH=`pwd`
pip install -r requirements.txt
python setup.py build_ext --inplace
```

If you've made changes to `.pyx` files, you need to recompile spaCy before you
can test your changes by re-running `python setup.py build_ext --inplace`.
Changes to `.py` files will be effective immediately.

📖 **For more details and instructions, see the documentation on [compiling spaCy from source](https://spacy.io/usage/#source) and the [quickstart widget](https://spacy.io/usage/#section-quickstart) to get the right commands for your platform and Python version.**

### Contributor agreement

If you've made a contribution to spaCy, you should fill in the
[spaCy contributor agreement](.github/CONTRIBUTOR_AGREEMENT.md) to ensure that
your contribution can be used across the project. If you agree to be bound by
the terms of the agreement, fill in the [template](.github/CONTRIBUTOR_AGREEMENT.md)
and include it with your pull request, or submit it separately to
[`.github/contributors/`](/.github/contributors). The name of the file should be
your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.

### Fixing bugs

When fixing a bug, first create an
[issue](https://github.com/explosion/spaCy/issues) if one does not already exist.
The description text can be very short – we don't want to make this too
bureaucratic.

Next, create a test file named `test_issue[ISSUE NUMBER].py` in the
[`spacy/tests/regression`](spacy/tests/regression) folder. Test for the bug
you're fixing, and make sure the test fails. Then add and commit your test file,
referencing the issue number in the commit message. Finally, fix the bug, make
sure your test passes and reference the issue in your commit message.

📖 **For more information on how to add tests, check out the [tests README](spacy/tests/README.md).**
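
As a rough sketch, a regression test file could look like the following. The
issue number `1234`, the resulting file name and the `split_words` stand-in are
all placeholders for illustration, not real spaCy code. A real test would import
the affected functionality from `spacy` and should fail before the fix is applied.

```python
# Hypothetical file: spacy/tests/regression/test_issue1234.py
# 1234 is a placeholder issue number.


def split_words(text):
    """Stand-in for the code under test (a real test would import from spacy)."""
    return text.split()


def test_issue1234():
    """Trailing whitespace should not produce an empty final word."""
    words = split_words("hello world ")
    assert words == ["hello", "world"]
```

Keeping the issue number in both the file name and the function name makes it
easy to trace a failing test back to the original report.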

## Code conventions

Code should loosely follow [pep8](https://www.python.org/dev/peps/pep-0008/).
As of `v2.1.0`, spaCy uses [`black`](https://github.com/ambv/black) for code
formatting and [`flake8`](http://flake8.pycqa.org/en/latest/) for linting its
Python modules. If you've built spaCy from source, you'll already have both
tools installed.

**⚠️ Note that formatting and linting are currently only possible for Python
modules in `.py` files, not Cython modules in `.pyx` and `.pxd` files.**

### Code formatting

[`black`](https://github.com/ambv/black) is an opinionated Python code
formatter, optimized to produce readable code and small diffs. You can run
`black` from the command-line, or via your code editor. For example, if you're
using [Visual Studio Code](https://code.visualstudio.com/), you can add the
following to your `settings.json` to use `black` for formatting and auto-format
your files on save:

```json
{
  "python.formatting.provider": "black",
  "[python]": {
    "editor.formatOnSave": true
  }
}
```

[See here](https://github.com/ambv/black#editor-integration) for the full
list of available editor integrations.

#### Disabling formatting

There are a few cases where auto-formatting doesn't improve readability – for
example, in some of the language data files like the `tag_map.py`, or in
the tests that construct `Doc` objects from lists of words and other labels.
Wrapping a block in `# fmt: off` and `# fmt: on` lets you disable formatting
for that particular code. Here's an example:

```python
# fmt: off
text = "I look forward to using Thingamajig.  I've been told it will make my life easier..."
heads = [1, 0, -1, -2, -1, -1, -5, -1, 3, 2, 1, 0, 2, 1, -3, 1, 1, -3, -7]
deps = ["nsubj", "ROOT", "advmod", "prep", "pcomp", "dobj", "punct", "",
        "nsubjpass", "aux", "auxpass", "ROOT", "nsubj", "aux", "ccomp",
        "poss", "nsubj", "ccomp", "punct"]
# fmt: on
```

### Code linting

[`flake8`](http://flake8.pycqa.org/en/latest/) is a tool for enforcing code
style. It scans one or more files and outputs errors and warnings. This feedback
can help you stick to general standards and conventions, and can be very useful
for spotting potential mistakes and inconsistencies in your code. The most
important things to watch out for are syntax errors and undefined names, but you
also want to keep an eye on unused declared variables or repeated
(i.e. overwritten) dictionary keys. If your code was formatted with `black`
(see above), you shouldn't see any formatting-related warnings.

The [`.flake8`](.flake8) config defines the configuration we use for this
codebase. For example, we're not super strict about the line length, and we're
excluding very large files like lemmatization and tokenizer exception tables.

Ideally, running the following command from within the repo directory should
not return any errors or warnings:

```bash
flake8 spacy
```

#### Disabling linting

Sometimes, you explicitly want to write code that's not compatible with our
rules. For example, a module's `__init__.py` might import a function so other
modules can import it from there, but `flake8` will complain about an unused
import. And although it's generally discouraged, there might be cases where it
makes sense to use a bare `except`.

To ignore a given line, you can add a comment like `# noqa: F401`, specifying
the code of the error or warning we want to ignore. It's also possible to
ignore several comma-separated codes at once, e.g. `# noqa: E731,E123`. Here
are some examples:

```python
# The imported class isn't used in this file, but imported here, so it can be
# imported *from* here by another module.
from .submodule import SomeClass  # noqa: F401

try:
    do_something()
except:  # noqa: E722
    # This bare except is justified, for some specific reason
    do_something_else()
```

### Python conventions

All Python code must be **compatible with Python 3.6+**.
Code that interacts with the file-system should accept objects that follow the
`pathlib.Path` API, without assuming that the object inherits from `pathlib.Path`.
If the function is user-facing and takes a path as an argument, it should check
whether the path is provided as a string. Strings should be converted to
`pathlib.Path` objects. Serialization and deserialization functions should always
accept **file-like objects**, as it makes the library IO-agnostic. Working on
buffers makes the code more general, easier to test, and compatible with Python
3's asynchronous IO.
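
The following is a minimal sketch of these two conventions, using only the
standard library. The helper names (`ensure_path`, `write_config`,
`read_config`, `config_to_disk`) and the JSON format are assumptions made for
illustration and are not meant to describe spaCy's actual internals.

```python
import io
import json
from pathlib import Path


def ensure_path(path):
    """Convert a string to a Path; pass any other object through unchanged."""
    if isinstance(path, str):
        return Path(path)
    return path


def write_config(config, file_):
    """Serialize a dict to any writable text file-like object."""
    json.dump(config, file_)


def read_config(file_):
    """Deserialize a dict from any readable text file-like object."""
    return json.load(file_)


def config_to_disk(config, path):
    """User-facing wrapper: accepts a string or an object with the Path API."""
    path = ensure_path(path)
    with path.open("w", encoding="utf8") as file_:
        write_config(config, file_)


# The serialization functions can be exercised without touching the disk.
buffer = io.StringIO()
write_config({"lang": "en"}, buffer)
buffer.seek(0)
restored = read_config(buffer)
```

Because `ensure_path` only checks for strings, any object that follows the
`pathlib.Path` API is passed through untouched, and the buffer-based functions
can be tested entirely in memory.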

Although spaCy uses a lot of classes, **inheritance is viewed with some suspicion**
— it's seen as a mechanism of last resort. You should discuss plans to extend
the class hierarchy before implementing.

We have a number of conventions around variable naming that are still being
documented, and aren't 100% strict. A general policy is that instances of the
class `Doc` should by default be called `doc`, `Token` `token`, `Lexeme` `lex`,
`Vocab` `vocab` and `Language` `nlp`. You should avoid using these names for
variables of other types. For instance, don't name a text string `doc`; you
should usually call this `text`. Two general code style preferences further help
with naming. First, **lean away from introducing temporary variables**, as these
clutter your namespace. This is one reason why comprehension expressions are
often preferred. Second, **keep your functions shortish**, so they can work in a
smaller scope. Of course, this is a question of trade-offs.
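
To illustrate these preferences, here is a small sketch in plain Python. The
function `count_long_words` and its `min_length` parameter are hypothetical
names chosen for this example. The string argument is called `text` rather than
`doc`, and a generator expression replaces a loop with temporary variables.

```python
def count_long_words(text, min_length=5):
    """Count whitespace-separated words with at least `min_length` characters."""
    return sum(1 for word in text.split() if len(word) >= min_length)


n_long = count_long_words("spaCy makes tokenization look easy")
```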

### Cython conventions

spaCy's core data structures are implemented as [Cython](http://cython.org/) `cdef`
classes. Memory is managed through the `cymem.cymem.Pool` class, which allows
you to allocate memory which will be freed when the `Pool` object is garbage
collected. This means you usually don't have to worry about freeing memory. You
just have to decide which Python object owns the memory, and make it own the
`Pool`. When that object goes out of scope, the memory will be freed. You do
have to take care that no pointers outlive the object that owns them — but this
is generally quite easy.

All Cython modules should have the `# cython: infer_types=True` compiler
directive at the top of the file. This makes the code much cleaner, as it avoids
the need for many type declarations. If possible, you should prefer to declare
your functions `nogil`, even if you don't especially care about multi-threading.
The reason is that `nogil` functions help the Cython compiler reason about your
code quite a lot: you're telling the compiler that no Python dynamics are
possible. This lets the compiler raise many more errors, and ensures your
function will run at C speed.

Cython gives you many choices of sequences: you could have a Python list, a
numpy array, a memory view, a C++ vector, or a pointer. Pointers are preferred,
because they are fastest, have the most explicit semantics, and let the compiler
check your code more strictly. C++ vectors are also great — but you should only
use them internally in functions. It's less friendly to accept a vector as an
argument, because that asks the user to do much more work.

Here's how to get a pointer from a numpy array, memory view or vector:

```cython
cdef void get_pointers(np.ndarray[int, mode='c'] numpy_array, vector[int] cpp_vector, int[::1] memory_view) nogil:
    pointer1 = <int*>numpy_array.data
    pointer2 = cpp_vector.data()
    pointer3 = &memory_view[0]
```

Both C arrays and C++ vectors reassure the compiler that no Python operations
are possible on your variable. This is a big advantage: it lets the Cython
compiler raise many more errors for you.

When getting a pointer from a numpy array or memoryview, take care that the data
is actually stored in C-contiguous order — otherwise you'll get a pointer to
nonsense. The type-declarations in the code above should generate runtime errors
if buffers with incorrect memory layouts are passed in.

To iterate over the array, the following style is preferred:

```cython
cdef int c_total(const int* int_array, int length) nogil:
    total = 0
    for item in int_array[:length]:
        total += item
    return total
```

If this is confusing, consider that the compiler couldn't deal with
`for item in int_array:` — there's no length attached to a raw pointer, so how
could we figure out where to stop? The length is provided in the slice notation
as a solution to this. Note that we don't have to declare the type of `item` in
the code above — the compiler can easily infer it. This gives us tidy code that
looks quite like Python, but is exactly as fast as C — because we've made sure
the compilation to C is trivial.

Your functions cannot be declared `nogil` if they need to create Python objects
or call Python functions. This is perfectly okay — you shouldn't torture your
code just to get `nogil` functions. However, if your function isn't `nogil`, you
should compile your module with `cython -a --cplus my_module.pyx` and open the
resulting `my_module.html` file in a browser. This will let you see how Cython
is compiling your code. Calls into the Python run-time will be in bright yellow.
This lets you easily see whether Cython is able to correctly type your code, or
whether there are unexpected problems.

Finally, if you're new to Cython, you should expect to find the first steps a
bit frustrating. It's a very large language, since it's essentially a superset
of Python and C++, with additional complexity and syntax from numpy. The
[documentation](http://docs.cython.org/en/latest/) isn't great, and there are
many "traps for new players". Working in Cython is very rewarding once you're
over the initial learning curve. As with C and C++, the first way you write
something in Cython will often be the performance-optimal approach. In contrast,
Python optimization generally requires a lot of experimentation. Is it faster to
have an `if item in my_dict` check, or to use `.get()`? What about `try`/`except`?
Does this numpy operation create a copy? There's no way to guess the answers to
these questions, and you'll usually be dissatisfied with your results — so
there's no way to know when to stop this process. In the worst case, you'll make
a mess that invites the next reader to try their luck too. This is like one of
those [volcanic gas-traps](http://www.wemjournal.org/article/S1080-6032%2809%2970088-2/abstract),
where the rescuers keep passing out from low oxygen, causing another rescuer to
follow — only to succumb themselves. In short, just say no to optimizing your
Python. If it's not fast enough the first time, just switch to Cython.

### Resources to get you started

- [PEP 8 Style Guide for Python Code](https://www.python.org/dev/peps/pep-0008/) (python.org)
- [Official Cython documentation](http://docs.cython.org/en/latest/) (cython.org)
- [Writing C in Cython](https://explosion.ai/blog/writing-c-in-cython) (explosion.ai)
- [Multi-threading spaCy’s parser and named entity recognizer](https://explosion.ai/blog/multithreading-with-cython) (explosion.ai)

## Adding tests

spaCy uses the [pytest](http://doc.pytest.org/) framework for testing. For more
info on this, see the [pytest documentation](http://docs.pytest.org/en/latest/contents.html).
Tests for spaCy modules and classes live in their own directories of the same
name. For example, tests for the `Tokenizer` can be found in
[`/spacy/tests/tokenizer`](spacy/tests/tokenizer). To be interpreted and run,
all test files and test functions need to be prefixed with `test_`.

When adding tests, make sure to use descriptive names, keep the code short and
concise and only test for one behavior at a time. Try to `parametrize` test
cases wherever possible, use our pre-defined fixtures for spaCy components and
avoid unnecessary imports.

Extensive tests that take a long time should be marked with `@pytest.mark.slow`.
Tests that require the model to be loaded should be marked with
`@pytest.mark.models`. Loading the models is expensive and not necessary if
you're not actually testing the model performance. If all you need is a `Doc`
object with annotations like heads, POS tags or the dependency parse, you can
use the `get_doc()` utility function to construct it manually.

📖 **For more guidelines and information on how to add tests, check out the [tests README](spacy/tests/README.md).**

## Updating the website

For instructions on how to build and run the [website](https://spacy.io) locally, see **[Setup and installation](https://github.com/explosion/spaCy/blob/master/website/README.md#setup-and-installation-setup)** in the _website_ directory's README.

The docs can always use another example or more detail, and they should always
be up to date and not misleading. To quickly find the correct file to edit,
simply click on the "Suggest edits" button at the bottom of a page.

📖 **For more info and troubleshooting guides, check out the [website README](website).**

## Publishing spaCy extensions and plugins

We're very excited about all the new possibilities for **community extensions**
and plugins in spaCy v2.0, and we can't wait to see what you build with it!

- An extension or plugin should add substantial functionality, be
  **well-documented** and **open-source**. It should be available for users to download
  and install as a Python package, for example via [PyPI](http://pypi.python.org).

- Extensions that write to `Doc`, `Token` or `Span` attributes should be wrapped
  as [pipeline components](https://spacy.io/usage/processing-pipelines#custom-components)
  that users can **add to their processing pipeline** using `nlp.add_pipe()`.

- When publishing your extension on GitHub, **tag it** with the topics
  [`spacy`](https://github.com/topics/spacy?o=desc&s=stars) and
  [`spacy-extension`](https://github.com/topics/spacy-extension?o=desc&s=stars)
  to make it easier to find. Those are also the topics we're linking to from the
  spaCy website. If you're sharing your project on Twitter, feel free to tag
  [@spacy_io](https://twitter.com/spacy_io) so we can check it out.

- Once your extension is published, you can open an issue on the
  [issue tracker](https://github.com/explosion/spacy/issues) to suggest it for the
  [resources directory](https://spacy.io/usage/resources#extensions) on the
  website.

📖 **For more tips and best practices, see the [checklist for developing spaCy extensions](https://spacy.io/usage/processing-pipelines#extensions).**

## Code of conduct

spaCy adheres to the
[Contributor Covenant Code of Conduct](http://contributor-covenant.org/version/1/4/).
By participating, you are expected to uphold this code.

* Correct some grammatical inaccuracies in lang\ru\examples.py

* Move contributor agreement to separate file

* Set version to 2.0.13.dev4

* Add Persian(Farsi) language support (#2797)

* Also include lowercase norm exceptions

* Remove in favour of https://github.com/explosion/spaCy/graphs/contributors

* Rule-based French Lemmatizer (#2818)

<!--- Provide a general summary of your changes in the title. -->

## Description
<!--- Use this section to describe your changes. If your changes required
testing, include information about the testing environment and the tests you
ran. If your test fixes a bug reported in an issue, don't forget to include the
issue number. If your PR is still a work in progress, that's totally fine – just
include a note to let us know. -->

Add a rule-based French Lemmatizer following the english one and the excellent PR for [greek language optimizations](https://github.com/explosion/spaCy/pull/2558) to adapt the Lemmatizer class.

### Types of change
<!-- What type of change does your PR cover? Is it a bug fix, an enhancement
or new feature, or a change to the documentation? -->

- Lemma dictionary used can be found [here](http://infolingu.univ-mlv.fr/DonneesLinguistiques/Dictionnaires/telechargement.html), I used the XML version.
- Add several files containing exhaustive list of words for each part of speech 
- Add some lemma rules
- Add POS that are not checked in the standard Lemmatizer, i.e PRON, DET, ADV and AUX
- Modify the Lemmatizer class to check in lookup table as a last resort if POS not mentionned
- Modify the lemmatize function to check in lookup table as a last resort
- Init files are updated so the model can support all the functionalities mentioned above
- Add words to tokenizer_exceptions_list.py in respect to regex used in tokenizer_exceptions.py

## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [X] I have submitted the spaCy Contributor Agreement.
- [X] I ran the tests, and all new and existing tests passed.
- [X] My changes don't require a change to the documentation, or if they do, I've added all required information.

* Set version to 2.0.13

* Fix formatting and consistency

* Update docs for new version [ci skip]

* Increment version [ci skip]

* Add info on wheels [ci skip]

* Adding "This is a sentence" example to Sinhala (#2846)

* Add wheels badge

* Update badge [ci skip]

* Update README.rst [ci skip]

* Update murmurhash pin

* Increment version to 2.0.14.dev0

* Update GPU docs for v2.0.14

* Add wheel to setup_requires

* Import prefer_gpu and require_gpu functions from Thinc

* Add tests for prefer_gpu() and require_gpu()

* Update requirements and setup.py

* Workaround bug in thinc require_gpu

* Set version to v2.0.14

* Update push-tag script

* Unhack prefer_gpu

* Require thinc 6.10.6

* Update prefer_gpu and require_gpu docs [ci skip]

* Fix specifiers for GPU

* Set version to 2.0.14.dev1

* Set version to 2.0.14

* Update Thinc version pin

* Increment version

* Fix msgpack-numpy version pin

* Increment version

* Update version to 2.0.16

* Update version [ci skip]

* Redundant ')' in the Stop words' example (#2856)

<!--- Provide a general summary of your changes in the title. -->

## Description
<!--- Use this section to describe your changes. If your changes required
testing, include information about the testing environment and the tests you
ran. If your test fixes a bug reported in an issue, don't forget to include the
issue number. If your PR is still a work in progress, that's totally fine – just
include a note to let us know. -->

### Types of change
<!-- What type of change does your PR cover? Is it a bug fix, an enhancement
or new feature, or a change to the documentation? -->

## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [ ] I have submitted the spaCy Contributor Agreement.
- [ ] I ran the tests, and all new and existing tests passed.
- [ ] My changes don't require a change to the documentation, or if they do, I've added all required information.

* Documentation improvement regarding joblib and SO (#2867)

Some documentation improvements

## Description
1. Fixed the dead URL to joblib
2. Fixed Stack Overflow brand name (with space)

### Types of change
Documentation

## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.

* raise error when setting overlapping entities as doc.ents (#2880)

* Fix out-of-bounds access in NER training

The helper method state.B(1) gets the index of the first token of the
buffer, or -1 if no such token exists. Normally this is safe because we
pass this to functions like state.safe_get(), which returns an empty
token. Here we used it directly as an array index, which is not okay!

This error may have been the cause of out-of-bounds access errors during
training. Similar errors may still be around, so much be hunted down.
Hunting this one down took a long time...I printed out values across
training runs and diffed, looking for points of divergence between
runs, when no randomness should be allowed.

* Change PyThaiNLP Url (#2876)

* Fix missing comma

* Add example showing a fix-up rule for space entities

* Set version to 2.0.17.dev0

* Update regex version

* Revert "Update regex version"

This reverts commit 62358dd867d15bc6a475942dff34effba69dd70a.

* Try setting older regex version, to align with conda

* Set version to 2.0.17

* Add spacy-js to universe [ci-skip]

* Add spacy-raspberry to universe (closes #2889)

* Add script to validate universe json [ci skip]

* Removed space in docs + added contributor indo (#2909)

* - removed unneeded space in documentation

* - added contributor info

* Allow input text of length up to max_length, inclusive (#2922)

* Include universe spec for spacy-wordnet component (#2919)

* feat: include universe spec for spacy-wordnet component

* chore: include spaCy contributor agreement

* Minor formatting changes [ci skip]

* Fix image [ci skip]

Twitter URL doesn't work on live site

* Check if the word is in one of the regular lists specific to each POS (#2886)

* 💫 Create random IDs for SVGs to prevent ID clashes (#2927)

Resolves #2924.

## Description
Fixes problem where multiple visualizations in Jupyter notebooks would have clashing arc IDs, resulting in weirdly positioned arc labels. Generating a random ID prefix so even identical parses won't receive the same IDs for consistency (even if effect of ID clash isn't noticable here.)

### Types of change
bug fix

## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.

* Fix typo [ci skip]

* fixes symbolic link on py3 and windows (#2949)

* fixes symbolic link on py3 and windows
during setup of spacy using command
python -m spacy link en_core_web_sm en
closes #2948

* Update spacy/compat.py

Co-Authored-By: cicorias <cicorias@users.noreply.github.com>

* Fix formatting

* Update universe [ci skip]

* Catalan Language Support (#2940)

* Catalan language Support

* Ddding Catalan to documentation

* Sort languages alphabetically [ci skip]

* Update tests for pytest 4.x (#2965)

<!--- Provide a general summary of your changes in the title. -->

## Description
- [x] Replace marks in params for pytest 4.0 compat ([see here](https://docs.pytest.org/en/latest/deprecations.html#marks-in-pytest-mark-parametrize))
- [x] Un-xfail passing tests (some fixes in a recent update resolved a bunch of issues, but tests were apparently never updated here)

### Types of change
<!-- What type of change does your PR cover? Is it a bug fix, an enhancement
or new feature, or a change to the documentation? -->

## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.

* Fix regex pin to harmonize with conda (#2964)

* Update README.rst

* Fix bug where Vocab.prune_vector did not use 'batch_size' (#2977)

Fixes #2976

* Fix typo

* Fix typo

* Remove duplicate file

* Require thinc 7.0.0.dev2

Fixes bug in gpu_ops that would use cupy instead of numpy on CPU

* Add missing import

* Fix error IDs

* Fix tests

											
										
										
[Stack Overflow](http://stackoverflow.com/questions/tagged/spacy) instead. If you
tag it `spacy` and `python`, more people will see it and hopefully be able to
help. Please understand that we won't be able to provide individual support via
email. We also believe that help is much more valuable if it's **shared publicly**,
so that more people can benefit from it.

### Submitting issues

When opening an issue, use a **descriptive title** and include your
**environment** (operating system, Python version, spaCy version). Our
[issue template](https://github.com/explosion/spaCy/issues/new) helps you
remember the most important details to include. If you've discovered a bug, you
can also submit a [regression test](#fixing-bugs) straight away. When you're
opening an issue to report the bug, simply refer to your pull request in the
issue body. A few more tips:

- **Describing your issue:** Try to provide as many details as possible. What
  exactly goes wrong? _How_ is it failing? Is there an error?
  "XY doesn't work" usually isn't that helpful for tracking down problems. Always
  remember to include the code you ran and if possible, extract only the relevant
  parts and don't just dump your entire script. This will make it easier for us to
  reproduce the error.
- **Getting info about your spaCy installation and environment:** If you're
  using spaCy v1.7+, you can use the command line interface to print details and
  even format them as Markdown to copy-paste into GitHub issues:
  `python -m spacy info --markdown`.
- **Checking the model compatibility:** If you're having problems with a
  [statistical model](https://spacy.io/models), it may be because the
  model is incompatible with your spaCy installation. In spaCy v2.0+, you can check
  this on the command line by running `python -m spacy validate`.
- **Sharing a model's output, like dependencies and entities:** spaCy v2.0+
  comes with [built-in visualizers](https://spacy.io/usage/visualizers) that
  you can run from within your script or a Jupyter notebook. For some issues, it's
  helpful to **include a screenshot** of the visualization. You can simply drag and
  drop the image into GitHub's editor and it will be uploaded and included.
- **Sharing long blocks of code or logs:** If you need to include long code,
  logs or tracebacks, you can wrap them in `<details>` and `</details>`. This
  [collapses the content](https://developer.mozilla.org/en/docs/Web/HTML/Element/details)
  so it only becomes visible on click, making the issue easier to read and follow.

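
The last tip, wrapping long logs in `<details>`, can be scripted if you paste tracebacks often. The following is only a minimal sketch: `wrap_in_details` is a hypothetical helper (not part of spaCy), and the `<summary>` line is an assumption added so the collapsed block has a readable label.

```python
def wrap_in_details(summary: str, body: str, language: str = "") -> str:
    """Return Markdown that hides `body` behind a clickable summary line.

    Hypothetical helper for illustration only. It assumes `body` does not
    itself contain a triple-backtick fence.
    """
    fence = "`" * 3
    lines = [
        "<details>",
        f"<summary>{summary}</summary>",
        "",  # blank line so the fenced block is parsed as Markdown
        f"{fence}{language}",
        body.rstrip("\n"),
        fence,
        "",
        "</details>",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    example_log = "Traceback (most recent call last):\n  ...\nValueError: example"
    print(wrap_in_details("Full traceback", example_log))
```

Paste the printed text into the issue body; the log stays collapsed until a reader clicks the summary.
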
### Issue labels

[See this page](https://github.com/explosion/spaCy/labels) for an overview of
the system we use to tag our issues and pull requests.

## Contributing to the code base

You don't have to be an NLP expert or Python pro to contribute, and we're happy
to help you get started. If you're new to spaCy, a good place to start is the
[spaCy 101 guide](https://spacy.io/usage/spacy-101) and the
[`help wanted (easy)`](https://github.com/explosion/spaCy/issues?q=is%3Aissue+is%3Aopen+label%3A%22help+wanted+%28easy%29%22)
label, which we use to tag bugs and feature requests that are easy and
self-contained. If you've decided to take on one of these problems and you're
making good progress, don't forget to add a quick comment to the issue. You can
also use the issue to ask questions, or share your work in progress.

### What belongs in spaCy?

Every library has a different inclusion philosophy – a policy of what should be
shipped in the core library, and what could be provided in other packages. Our
philosophy is to prefer a smaller core library. We generally ask the following
questions:

- **What would this feature look like if implemented in a separate package?**
  Some features would be very difficult to implement externally – for example,
  changes to spaCy's built-in methods. In contrast, a library of word
  alignment functions could easily live as a separate package that depended on
  spaCy – there's little difference between writing `import word_aligner` and
  `import spacy.word_aligner`. spaCy v2.0+ makes it easy to implement
  [custom pipeline components](https://spacy.io/usage/processing-pipelines#custom-components),
  and add your own attributes, properties and methods to the `Doc`, `Token` and
  `Span`. If you're looking to implement a new spaCy feature, starting with a
  custom component package is usually the best strategy. You won't have to worry
  about spaCy's internals and you can test your module in an isolated
  environment. And if it works well, we can always integrate it into the core
  library later.
- **Would the feature be easier to implement if it relied on "heavy" dependencies spaCy doesn't currently require?**
  Python has a very rich ecosystem. Libraries like scikit-learn, SciPy, Gensim or
  TensorFlow/Keras do lots of useful things – but we don't want to have them as
  dependencies. If the feature requires functionality in one of these libraries,
  it's probably better to break it out into a different package.
- **Is the feature orthogonal to the current spaCy functionality, or overlapping?**
  spaCy strongly prefers to avoid having 6 different ways of doing the same thing.
  As better techniques are developed, we prefer to drop support for "the old way".
  However, it's rare that one approach _entirely_ dominates another. It's very
  common that there's still a use-case for the "obsolete" approach. For instance,
  [WordNet](https://wordnet.princeton.edu/) is still very useful – but word
  vectors are better for most use-cases, and the two approaches to lexical
  semantics do a lot of the same things. spaCy therefore only supports word
  vectors, and support for WordNet is currently left for other packages.
- **Do you need the feature to get basic things done?** We do want spaCy to be
  at least somewhat self-contained. If we keep needing some feature in our
  recipes, that does provide some argument for bringing it "in house".

### Getting started

To make changes to spaCy's code base, you need to fork and then clone the GitHub
repository and build spaCy from source. You'll need to make sure that you have a
development environment consisting of a Python distribution including header
files, a compiler, [pip](https://pip.pypa.io/en/latest/installing/),
[virtualenv](https://virtualenv.pypa.io/en/stable/) and
[git](https://git-scm.com) installed. The compiler is usually the trickiest part.

```
python -m pip install -U pip
git clone https://github.com/explosion/spaCy
cd spaCy

python -m venv .env
source .env/bin/activate
export PYTHONPATH=`pwd`
pip install -r requirements.txt
python setup.py build_ext --inplace
```

If you've made changes to `.pyx` files, you need to recompile spaCy before you
can test your changes by re-running `python setup.py build_ext --inplace`.
Changes to `.py` files will be effective immediately.

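
As a rough illustration of that rule, the sketch below checks a list of changed file paths and reports whether a rebuild is needed. `needs_rebuild` is a hypothetical helper, the example paths are only illustrative, and treating `.pxd` files like `.pyx` files is an assumption of this sketch rather than something stated above.

```python
from pathlib import PurePosixPath
from typing import Iterable

# Edits to `.pyx` files require a rebuild. Including `.pxd` here is an
# assumption of this sketch.
COMPILED_SUFFIXES = {".pyx", ".pxd"}
REBUILD_COMMAND = "python setup.py build_ext --inplace"


def needs_rebuild(changed_files: Iterable[str]) -> bool:
    """Return True if any changed file is a Cython source that must be recompiled."""
    return any(PurePosixPath(path).suffix in COMPILED_SUFFIXES for path in changed_files)


if __name__ == "__main__":
    changed = ["spacy/tokens/doc.pyx", "spacy/util.py"]  # illustrative paths
    if needs_rebuild(changed):
        print(f"Cython sources changed, re-run: {REBUILD_COMMAND}")
    else:
        print("Only .py files changed, no rebuild needed.")
```
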
📖 **For more details and instructions, see the documentation on [compiling spaCy from source](https://spacy.io/usage/#source) and the [quickstart widget](https://spacy.io/usage/#section-quickstart) to get the right commands for your platform and Python version.**

### Contributor agreement

If you've made a contribution to spaCy, you should fill in the
[spaCy contributor agreement](.github/CONTRIBUTOR_AGREEMENT.md) to ensure that
your contribution can be used across the project. If you agree to be bound by
the terms of the agreement, fill in the [template](.github/CONTRIBUTOR_AGREEMENT.md)
and include it with your pull request, or submit it separately to
[`.github/contributors/`](/.github/contributors). The name of the file should be
your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.

### Fixing bugs

When fixing a bug, first create an
[issue](https://github.com/explosion/spaCy/issues) if one does not already exist.
The description text can be very short – we don't want to make this too
bureaucratic.

Next, create a test file named `test_issue[ISSUE NUMBER].py` in the
[`spacy/tests/regression`](spacy/tests/regression) folder. Test for the bug
you're fixing, and make sure the test fails. Next, add and commit your test file
referencing the issue number in the commit message. Finally, fix the bug, make
sure your test passes and reference the issue in your commit message.

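
To make the workflow above concrete, here is a minimal sketch of what such a regression test file might look like. Everything in it is a placeholder: `1234` is not a real issue number, and `split_on_whitespace` is a hypothetical stand-in for the code being fixed, not a spaCy function. A real test would import and exercise the affected spaCy code instead.

```python
# test_issue1234.py ("1234" is a placeholder issue number)


def split_on_whitespace(text):
    """Hypothetical stand-in for the code under test, not part of spaCy.

    The imagined buggy version called text.split(" "), which yields empty
    strings for repeated spaces. The fixed version calls text.split().
    """
    return text.split()


def test_issue1234():
    # With the buggy version this assertion fails; with the fix it passes.
    assert split_on_whitespace("hello  world") == ["hello", "world"]
```

Commit the failing test first, then the fix, and mention the issue number in both commit messages so the history shows the bug being reproduced and then resolved.
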
📖 **For more information on how to add tests, check out the [tests README](spacy/tests/README.md).**

## Code conventions

Code should loosely follow [PEP 8](https://www.python.org/dev/peps/pep-0008/).
As of `v2.1.0`, spaCy uses [`black`](https://github.com/ambv/black) for code
formatting and [`flake8`](http://flake8.pycqa.org/en/latest/) for linting its
Python modules. If you've built spaCy from source, you'll already have both
tools installed.

**⚠️ Note that formatting and linting are currently only possible for Python
modules in `.py` files, not Cython modules in `.pyx` and `.pxd` files.**

### Code formatting

[`black`](https://github.com/ambv/black) is an opinionated Python code
formatter, optimized to produce readable code and small diffs. You can run
`black` from the command line, or via your code editor. For example, if you're
using [Visual Studio Code](https://code.visualstudio.com/), you can add the
following to your `settings.json` to use `black` for formatting and auto-format
your files on save:

 								```json
 								{
-												fix typos

											
										
										
											2020-08-17 12:05:48 +00:00
+								  "python.formatting.provider": "black",
 								  "[python]": {
 								    "editor.formatOnSave": true
 								  }
-												💫 Tidy up and auto-format .py files (#2983)

<!--- Provide a general summary of your changes in the title. -->

## Description
- [x] Use [`black`](https://github.com/ambv/black) to auto-format all `.py` files.
- [x] Update flake8 config to exclude very large files (lemmatization tables etc.)
- [x] Update code to be compatible with flake8 rules
- [x] Fix various small bugs, inconsistencies and messy stuff in the language data
- [x] Update docs to explain new code style (`black`, `flake8`, when to use `# fmt: off` and `# fmt: on` and what `# noqa` means)

Once #2932 is merged, which auto-formats and tidies up the CLI, we'll be able to run `flake8 spacy` actually get meaningful results.

At the moment, the code style and linting isn't applied automatically, but I'm hoping that the new [GitHub Actions](https://github.com/features/actions) will let us auto-format pull requests and post comments with relevant linting information.

### Types of change
enhancement, code style

## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.

											
										
										
											2018-11-30 16:03:03 +00:00
+								}
 								```

[See here](https://github.com/ambv/black#editor-integration) for the full
list of available editor integrations.

#### Disabling formatting

There are a few cases where auto-formatting doesn't improve readability – for
example, in some of the language data files like `tag_map.py`, or in
the tests that construct `Doc` objects from lists of words and other labels.
Wrapping a block in `# fmt: off` and `# fmt: on` lets you disable formatting
for that particular code. Here's an example:

```python
# fmt: off
text = "I look forward to using Thingamajig.  I've been told it will make my life easier..."
heads = [1, 0, -1, -2, -1, -1, -5, -1, 3, 2, 1, 0, 2, 1, -3, 1, 1, -3, -7]
deps = ["nsubj", "ROOT", "advmod", "prep", "pcomp", "dobj", "punct", "",
        "nsubjpass", "aux", "auxpass", "ROOT", "nsubj", "aux", "ccomp",
        "poss", "nsubj", "ccomp", "punct"]
# fmt: on
```

### Code linting

[`flake8`](http://flake8.pycqa.org/en/latest/) is a tool for enforcing code
style. It scans one or more files and outputs errors and warnings. This feedback
can help you stick to general standards and conventions, and can be very useful
for spotting potential mistakes and inconsistencies in your code. The most
important things to watch out for are syntax errors and undefined names, but you
also want to keep an eye on unused declared variables or repeated
(i.e. overwritten) dictionary keys. If your code was formatted with `black`
(see above), you shouldn't see any formatting-related warnings.

The [`.flake8`](.flake8) config defines the configuration we use for this
codebase. For example, we're not super strict about the line length, and we're
excluding very large files like lemmatization and tokenizer exception tables.
Ideally, running the following command from within the repo directory should
not return any errors or warnings:

```bash
flake8 spacy
```
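
To illustrate the kinds of issues described in this section, here's a small
sketch of code that runs without errors but contains a repeated dictionary key
and an unused variable. The names `TAG_MAP` and `count_words` are made up for
this example and aren't taken from the code base. Both problems are valid
Python, which is why a linter is useful for spotting them:

```python
# A repeated key is legal Python, but the earlier value is silently overwritten.
TAG_MAP = {
    "NN": "NOUN",
    "VB": "VERB",
    "NN": "PROPN",  # repeated key: replaces the "NOUN" entry above
}


def count_words(text):
    words = text.split()
    doubled = len(words) * 2  # assigned but never used afterwards
    return len(words)


print(TAG_MAP["NN"])  # the second "NN" entry wins
print(count_words("This is a sentence"))
```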

#### Disabling linting

Sometimes, you explicitly want to write code that's not compatible with our
rules. For example, a module's `__init__.py` might import a function so other
modules can import it from there, but `flake8` will complain about an unused
import. And although it's generally discouraged, there might be cases where it
makes sense to use a bare `except`.

To ignore a given line, you can add a comment like `# noqa: F401`, specifying
the code of the error or warning you want to ignore. It's also possible to
ignore several comma-separated codes at once, e.g. `# noqa: E731,E123`. Here
are some examples:

```python
# The imported class isn't used in this file, but imported here, so it can be
# imported *from* here by another module.
from .submodule import SomeClass  # noqa: F401

try:
    do_something()
except:  # noqa: E722
    # This bare except is justified, for some specific reason
    do_something_else()
```

### Python conventions

All Python code must be written to be **compatible with Python 3.6+**.
Code that interacts with the file-system should accept objects that follow the
`pathlib.Path` API, without assuming that the object inherits from `pathlib.Path`.
If the function is user-facing and takes a path as an argument, it should check
whether the path is provided as a string. Strings should be converted to
`pathlib.Path` objects. Serialization and deserialization functions should always
accept **file-like objects**, as this makes the library IO-agnostic. Working on
buffers makes the code more general, easier to test, and compatible with
Python's asynchronous IO.
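
Here's a minimal sketch of what these path and IO conventions could look like
in practice. It's not the actual spaCy implementation: the helper names
`to_path`, `write_data`, `read_data` and `save_to_disk`, and the choice of JSON
as the format, are assumptions made for this example.

```python
import io
import json
from pathlib import Path


def to_path(path):
    """Convert strings to Path objects; pass other path-like objects through."""
    if isinstance(path, str):
        return Path(path)
    return path


def write_data(data, file_):
    """Serialize to any writable text file-like object."""
    json.dump(data, file_)


def read_data(file_):
    """Deserialize from any readable text file-like object."""
    return json.load(file_)


def save_to_disk(data, path):
    """User-facing helper: accepts a string or an object with the Path API."""
    path = to_path(path)
    with path.open("w", encoding="utf8") as file_:
        write_data(data, file_)


# Because the (de)serialization functions work on buffers, they can be
# exercised in memory without touching the file system.
buffer = io.StringIO()
write_data({"lang": "en"}, buffer)
buffer.seek(0)
print(read_data(buffer))
```

Note that `to_path` only checks for strings and never requires the argument to
be a `pathlib.Path` instance, so any object that follows the same API works.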

Although spaCy uses a lot of classes, **inheritance is viewed with some suspicion**:
it's seen as a mechanism of last resort. You should discuss plans to extend
the class hierarchy before implementing.

We have a number of conventions around variable naming that are still being
documented, and aren't 100% strict. A general policy is that instances of the
class `Doc` should by default be called `doc`, `Token` `token`, `Lexeme` `lex`,
`Vocab` `vocab` and `Language` `nlp`. You should avoid using these names for
variables of other types. For instance, don't name a text string `doc`; you
should usually call this `text`. Two general code style preferences further help
with naming. First, **lean away from introducing temporary variables**, as these
clutter your namespace. This is one reason why comprehension expressions are
often preferred. Second, **keep your functions shortish**, so they can work in a
smaller scope. Of course, this is a question of trade-offs.
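
As a small illustration of the preference for comprehensions over temporary
variables, here are two versions of the same function. The function names are
made up for this example. Both return the same result, but the second one
doesn't introduce `result` or `lowered` into the namespace:

```python
def lowercase_words_loop(text):
    # Loop version: needs two temporary variables.
    result = []
    for word in text.split():
        lowered = word.lower()
        result.append(lowered)
    return result


def lowercase_words(text):
    # Comprehension version: same behavior, no temporary variables.
    return [word.lower() for word in text.split()]


text = "Hello World"
print(lowercase_words_loop(text))
print(lowercase_words(text))
```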

### Cython conventions

spaCy's core data structures are implemented as [Cython](http://cython.org/) `cdef`
classes. Memory is managed through the `cymem.cymem.Pool` class, which allows
you to allocate memory which will be freed when the `Pool` object is garbage
collected. This means you usually don't have to worry about freeing memory. You
just have to decide which Python object owns the memory, and make it own the
`Pool`. When that object goes out of scope, the memory will be freed. You do
have to take care that no pointers outlive the object that owns them, but this
is generally quite easy.

All Cython modules should have the `# cython: infer_types=True` compiler
directive at the top of the file. This makes the code much cleaner, as it avoids
the need for many type declarations. If possible, you should prefer to declare
your functions `nogil`, even if you don't especially care about multi-threading.
The reason is that `nogil` functions help the Cython compiler reason about your
code quite a lot: you're telling the compiler that no Python dynamics are
possible. This lets the compiler raise many errors, and ensures your function
will run at C speed.

Cython gives you many choices of sequences: you could have a Python list, a
numpy array, a memory view, a C++ vector, or a pointer. Pointers are preferred,
because they are fastest, have the most explicit semantics, and let the compiler
check your code more strictly. C++ vectors are also great, but you should only
use them internally in functions. It's less friendly to accept a vector as an
argument, because that asks the user to do much more work.

Here's how to get a pointer from a numpy array, memory view or vector:

```cython
cdef void get_pointers(np.ndarray[int, mode='c'] numpy_array, vector[int] cpp_vector, int[::1] memory_view) nogil:
    pointer1 = <int*>numpy_array.data
    pointer2 = cpp_vector.data()
    pointer3 = &memory_view[0]
```

Both C arrays and C++ vectors reassure the compiler that no Python operations
are possible on your variable. This is a big advantage: it lets the Cython
compiler raise many more errors for you.

When getting a pointer from a numpy array or memoryview, take care that the data
is actually stored in C-contiguous order. Otherwise, you'll get a pointer to
nonsense. The type declarations in the code above should generate runtime errors
if buffers with incorrect memory layouts are passed in.

To iterate over the array, the following style is preferred:

```cython
cdef int c_total(const int* int_array, int length) nogil:
    total = 0
    for item in int_array[:length]:
        total += item
    return total
```

If this is confusing, consider that the compiler couldn't deal with
`for item in int_array:`. There's no length attached to a raw pointer, so how
could we figure out where to stop? The length is provided in the slice notation
as a solution to this. Note that we don't have to declare the type of `item` in
the code above, as the compiler can easily infer it. This gives us tidy code that
looks quite like Python, but is exactly as fast as C, because we've made sure
the compilation to C is trivial.

Your functions cannot be declared `nogil` if they need to create Python objects
or call Python functions. This is perfectly okay: you shouldn't torture your
code just to get `nogil` functions. However, if your function isn't `nogil`, you
should compile your module with `cython -a --cplus my_module.pyx` and open the
resulting `my_module.html` file in a browser. This will let you see how Cython
is compiling your code. Calls into the Python run-time will be in bright yellow.
This lets you easily see whether Cython is able to correctly type your code, or
whether there are unexpected problems.

Finally, if you're new to Cython, you should expect to find the first steps a
bit frustrating. It's a very large language, since it's essentially a superset
of Python and C++, with additional complexity and syntax from numpy. The
[documentation](http://docs.cython.org/en/latest/) isn't great, and there are
many "traps for new players". Working in Cython is very rewarding once you're
over the initial learning curve. As with C and C++, the first way you write
something in Cython will often be the performance-optimal approach. In contrast,
Python optimization generally requires a lot of experimentation. Is it faster to
have an `if item in my_dict` check, or to use `.get()`? What about `try`/`except`?
Does this numpy operation create a copy? There's no way to guess the answers to
these questions, and you'll usually be dissatisfied with your results, so
there's no way to know when to stop this process. In the worst case, you'll make
a mess that invites the next reader to try their luck too. This is like one of
those [volcanic gas-traps](http://www.wemjournal.org/article/S1080-6032%2809%2970088-2/abstract),
where the rescuers keep passing out from low oxygen, causing another rescuer to
follow, only to succumb themselves. In short, just say no to optimizing your
Python. If it's not fast enough the first time, just switch to Cython.
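
To make the dictionary lookup question concrete, here's a sketch of the three
variants mentioned above, timed with the standard library's `timeit` module.
The names `counts`, `lookup_in`, `lookup_get` and `lookup_try` are made up for
this example. All three return the same results, and the timings you see will
likely vary with the Python version, the machine and whether the key is
present, which is exactly why this kind of experimentation rarely gives a
satisfying answer:

```python
import timeit

counts = {"apple": 3, "pear": 5}


def lookup_in(key):
    # Variant 1: explicit membership check.
    if key in counts:
        return counts[key]
    return 0


def lookup_get(key):
    # Variant 2: dict.get() with a default.
    return counts.get(key, 0)


def lookup_try(key):
    # Variant 3: try/except around the lookup.
    try:
        return counts[key]
    except KeyError:
        return 0


if __name__ == "__main__":
    for func in (lookup_in, lookup_get, lookup_try):
        for key in ("apple", "missing"):
            seconds = timeit.timeit(lambda: func(key), number=10_000)
            print(f"{func.__name__}({key!r}): {seconds:.4f}s")
```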

### Resources to get you started

- [PEP 8 Style Guide for Python Code](https://www.python.org/dev/peps/pep-0008/) (python.org)
- [Official Cython documentation](http://docs.cython.org/en/latest/) (cython.org)
- [Writing C in Cython](https://explosion.ai/blog/writing-c-in-cython) (explosion.ai)
- [Multi-threading spaCy’s parser and named entity recognizer](https://explosion.ai/blog/multithreading-with-cython) (explosion.ai)

## Adding tests

spaCy uses the [pytest](http://doc.pytest.org/) framework for testing. For more
info on this, see the [pytest documentation](http://docs.pytest.org/en/latest/contents.html).
Tests for spaCy modules and classes live in their own directories of the same
name. For example, tests for the `Tokenizer` can be found in
[`/spacy/tests/tokenizer`](spacy/tests/tokenizer). To be interpreted and run,
all test files and test functions need to be prefixed with `test_`.

When adding tests, make sure to use descriptive names, keep the code short and
concise and only test for one behavior at a time. Try to `parametrize` test
cases wherever possible, use our pre-defined fixtures for spaCy components and
avoid unnecessary imports.

Extensive tests that take a long time should be marked with `@pytest.mark.slow`.
Tests that require the model to be loaded should be marked with
`@pytest.mark.models`. Loading the models is expensive and not necessary if
you're not actually testing the model performance. If all you need is a `Doc`
object with annotations like heads, POS tags or the dependency parse, you can
use the `get_doc()` utility function to construct it manually.
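
As a rough sketch of these guidelines, here's what a parametrized test and a
test marked as slow could look like. The function `split_words` is a
hypothetical stand-in for the code under test, not part of spaCy, and the
example assumes that `pytest` is installed and that the `slow` marker is
registered in the project's pytest configuration:

```python
import pytest


def split_words(text):
    # Hypothetical stand-in for the code under test.
    return text.split()


# One behavior per test, with the cases supplied via parametrize.
@pytest.mark.parametrize(
    "text,expected",
    [
        ("hello world", ["hello", "world"]),
        ("  padded  ", ["padded"]),
        ("", []),
    ],
)
def test_split_words_on_whitespace(text, expected):
    assert split_words(text) == expected


# Long-running tests are marked so they can be selected or skipped.
@pytest.mark.slow
def test_split_words_long_text():
    text = "word " * 100_000
    assert len(split_words(text)) == 100_000
```

Both the file (e.g. `test_split_words.py`) and the functions use the `test_`
prefix, so pytest can discover them.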

📖 **For more guidelines and information on how to add tests, check out the [tests README](spacy/tests/README.md).**

## Updating the website

For instructions on how to build and run the [website](https://spacy.io) locally, see **[Setup and installation](https://github.com/explosion/spaCy/blob/master/website/README.md#setup-and-installation-setup)** in the _website_ directory's README.

The docs can always use another example or more detail, and they should always
be up to date and not misleading. To quickly find the correct file to edit,
simply click on the "Suggest edits" button at the bottom of a page.

📖 **For more info and troubleshooting guides, check out the [website README](website).**

## Publishing spaCy extensions and plugins

We're very excited about all the new possibilities for **community extensions**
and plugins in spaCy v2.0, and we can't wait to see what you build with it!

- An extension or plugin should add substantial functionality, be
  **well-documented** and **open-source**. It should be available for users to download
  and install as a Python package – for example via [PyPI](http://pypi.python.org).
- Extensions that write to `Doc`, `Token` or `Span` attributes should be wrapped
  as [pipeline components](https://spacy.io/usage/processing-pipelines#custom-components)
  that users can **add to their processing pipeline** using `nlp.add_pipe()`.
- When publishing your extension on GitHub, **tag it** with the topics
  [`spacy`](https://github.com/topics/spacy?o=desc&s=stars) and
  [`spacy-extension`](https://github.com/topics/spacy-extension?o=desc&s=stars)
  to make it easier to find. Those are also the topics we're linking to from the
  spaCy website. If you're sharing your project on Twitter, feel free to tag
  [@spacy_io](https://twitter.com/spacy_io) so we can check it out.
- Once your extension is published, you can open an issue on the
  [issue tracker](https://github.com/explosion/spacy/issues) to suggest it for the
  [resources directory](https://spacy.io/usage/resources#extensions) on the
  website.

📖 **For more tips and best practices, see the [checklist for developing spaCy extensions](https://spacy.io/usage/processing-pipelines#extensions).**

## Code of conduct

spaCy adheres to the
[Contributor Covenant Code of Conduct](http://contributor-covenant.org/version/1/4/).
By participating, you are expected to uphold this code.