<a href="https://explosion.ai"><img src="https://explosion.ai/assets/img/logo.svg" width="125" height="125" align="right" /></a>
# Contribute to spaCy
Thanks for your interest in contributing to spaCy 🎉 The project is maintained
by [@honnibal](https://github.com/honnibal) and [@ines](https://github.com/ines),
and we'll do our best to help you get started. This page will give you a quick
overview of how things are organised and, most importantly, how to get involved.
## Table of contents
1. [Issues and bug reports](#issues-and-bug-reports)
2. [Contributing to the code base](#contributing-to-the-code-base)
3. [Code conventions](#code-conventions)
4. [Adding tests](#adding-tests)
5. [Updating the website](#updating-the-website)
6. [Publishing extensions and plugins](#publishing-spacy-extensions-and-plugins)
7. [Code of conduct](#code-of-conduct)
## Issues and bug reports
First, [do a quick search](https://github.com/issues?q=+is%3Aissue+user%3Aexplosion)
to see if the issue has already been reported. If so, it's often better to just
leave a comment on an existing issue, rather than creating a new one. Old issues
also often include helpful tips and solutions to common problems. You should
also check the [troubleshooting guide](https://spacy.io/usage/#troubleshooting)
to see if your problem is already listed there.
If you're looking for help with your code, consider posting a question on
[Stack Overflow](http://stackoverflow.com/questions/tagged/spacy) instead. If you
tag it `spacy` and `python`, more people will see it and hopefully be able to
help. Please understand that we won't be able to provide individual support via
email. We also believe that help is much more valuable if it's **shared publicly**,
so that more people can benefit from it.
### Submitting issues
When opening an issue, use a **descriptive title** and include your
**environment** (operating system, Python version, spaCy version). Our
[issue template](https://github.com/explosion/spaCy/issues/new) helps you
remember the most important details to include. If you've discovered a bug, you
can also submit a [regression test](#fixing-bugs) straight away. When you're
opening an issue to report the bug, simply refer to your pull request in the
issue body. A few more tips:
- **Describing your issue:** Try to provide as many details as possible. What
exactly goes wrong? _How_ is it failing? Is there an error?
"XY doesn't work" usually isn't that helpful for tracking down problems. Always
remember to include the code you ran and if possible, extract only the relevant
parts and don't just dump your entire script. This will make it easier for us to
reproduce the error.
- **Getting info about your spaCy installation and environment:** If you're
using spaCy v1.7+, you can use the command line interface to print details and
even format them as Markdown to copy-paste into GitHub issues:
`python -m spacy info --markdown`.
- **Checking the model compatibility:** If you're having problems with a
[statistical model](https://spacy.io/models), it may be because the
model is incompatible with your spaCy installation. In spaCy v2.0+, you can check
this on the command line by running `python -m spacy validate`.
- **Sharing a model's output, like dependencies and entities:** spaCy v2.0+
comes with [built-in visualizers](https://spacy.io/usage/visualizers) that
you can run from within your script or a Jupyter notebook. For some issues, it's
helpful to **include a screenshot** of the visualization. You can simply drag and
drop the image into GitHub's editor and it will be uploaded and included.
- **Sharing long blocks of code or logs:** If you need to include long code,
logs or tracebacks, you can wrap them in `<details>` and `</details>`. This
[collapses the content](https://developer.mozilla.org/en/docs/Web/HTML/Element/details)
so it only becomes visible on click, making the issue easier to read and follow.
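
As a rough illustration of the last tip, a traceback can be captured and wrapped in `<details>` before pasting it into an issue. This is only a sketch: `wrap_in_details` is a hypothetical helper and not part of spaCy, and the blank line after `<summary>` is there on the assumption that GitHub needs it to render the fenced block inside the collapsed section.

```python
import traceback


def wrap_in_details(summary, content, language=""):
    """Return a collapsible Markdown block for pasting into a GitHub issue.

    Hypothetical helper for illustration; it is not part of spaCy.
    """
    fence = "`" * 3  # Markdown code fence, built here to keep this example readable
    return (
        "<details>\n"
        "<summary>{}</summary>\n\n"
        "{}{}\n{}\n{}\n\n"
        "</details>"
    ).format(summary, fence, language, content.rstrip("\n"), fence)


try:
    {}["missing"]  # deliberately raise a KeyError
except KeyError:
    # Capture the full traceback as text instead of printing it.
    report = wrap_in_details("Full traceback", traceback.format_exc())

print(report)
```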
### Issue labels
To distinguish issues that are opened by us, the maintainers, we usually add a
💫 to the title. [See this page](https://github.com/explosion/spaCy/labels)
for an overview of the system we use to tag our issues and pull requests.
## Contributing to the code base
You don't have to be an NLP expert or Python pro to contribute, and we're happy
to help you get started. If you're new to spaCy, a good place to start is the
[spaCy 101 guide](https://spacy.io/usage/spacy-101) and the
[`help wanted (easy)`](https://github.com/explosion/spaCy/issues?q=is%3Aissue+is%3Aopen+label%3A%22help+wanted+%28easy%29%22)
label, which we use to tag bugs and feature requests that are easy and
self-contained. If you've decided to take on one of these problems and you're
making good progress, don't forget to add a quick comment to the issue. You can
also use the issue to ask questions, or share your work in progress.
### What belongs in spaCy?
Every library has a different inclusion philosophy — a policy of what should be
shipped in the core library, and what could be provided in other packages. Our
philosophy is to prefer a smaller core library. We generally ask the following
questions:
- **What would this feature look like if implemented in a separate package?**
Some features would be very difficult to implement externally, such as
changes to spaCy's built-in methods. In contrast, a library of word
alignment functions could easily live as a separate package that depended on
spaCy — there's little difference between writing `import word_aligner` and
`import spacy.word_aligner`. spaCy v2.0+ makes it easy to implement
[custom pipeline components](https://spacy.io/usage/processing-pipelines#custom-components),
and add your own attributes, properties and methods to the `Doc`, `Token` and
`Span`. If you're looking to implement a new spaCy feature, starting with a
custom component package is usually the best strategy. You won't have to worry
about spaCy's internals and you can test your module in an isolated
environment. And if it works well, we can always integrate it into the core
library later.
- **Would the feature be easier to implement if it relied on "heavy" dependencies spaCy doesn't currently require?**
Python has a very rich ecosystem. Libraries like scikit-learn, SciPy, Gensim or
TensorFlow/Keras do lots of useful things — but we don't want to have them as
dependencies. If the feature requires functionality in one of these libraries,
it's probably better to break it out into a different package.
- **Is the feature orthogonal to the current spaCy functionality, or overlapping?**
spaCy strongly prefers to avoid having 6 different ways of doing the same thing.
As better techniques are developed, we prefer to drop support for "the old way".
However, it's rare that one approach _entirely_ dominates another. It's very
common that there's still a use-case for the "obsolete" approach. For instance,
[WordNet](https://wordnet.princeton.edu/) is still very useful — but word
vectors are better for most use-cases, and the two approaches to lexical
semantics do a lot of the same things. spaCy therefore only supports word
vectors, and support for WordNet is currently left for other packages.
- **Do you need the feature to get basic things done?** We do want spaCy to be
at least somewhat self-contained. If we keep needing some feature in our
recipes, that does provide some argument for bringing it "in house".
### Getting started
To make changes to spaCy's code base, you need to fork then clone the GitHub repository
and build spaCy from source. You'll need to make sure that you have a
development environment consisting of a Python distribution including header
files, a compiler, [pip](https://pip.pypa.io/en/latest/installing/),
[virtualenv](https://virtualenv.pypa.io/en/stable/) and
[git](https://git-scm.com) installed. The compiler is usually the trickiest part.
```
python -m pip install -U pip
git clone https://github.com/explosion/spaCy
cd spaCy
python -m venv .env
source .env/bin/activate
export PYTHONPATH=`pwd`
pip install -r requirements.txt
python setup.py build_ext --inplace
```
If you've made changes to `.pyx` files, you need to recompile spaCy before you
can test your changes by re-running `python setup.py build_ext --inplace`.
Changes to `.py` files will be effective immediately.
📖 **For more details and instructions, see the documentation on [compiling spaCy from source](https://spacy.io/usage/#source) and the [quickstart widget](https://spacy.io/usage/#section-quickstart) to get the right commands for your platform and Python version.**
### Contributor agreement
If you've made a contribution to spaCy, you should fill in the
[spaCy contributor agreement](.github/CONTRIBUTOR_AGREEMENT.md) to ensure that
your contribution can be used across the project. If you agree to be bound by
the terms of the agreement, fill in the [template](.github/CONTRIBUTOR_AGREEMENT.md)
and include it with your pull request, or submit it separately to
[`.github/contributors/`](/.github/contributors). The name of the file should be
your GitHub username, with the extension `.md`. For example, the user
example_user would create the file `.github/contributors/example_user.md`.
### Fixing bugs
When fixing a bug, first create an
[issue](https://github.com/explosion/spaCy/issues) if one does not already exist.
The description text can be very short; we don't want to make this too
bureaucratic.
Next, create a test file named `test_issue[ISSUE NUMBER].py` in the
[`spacy/tests/regression`](spacy/tests/regression) folder. Test for the bug
you're fixing, and make sure the test fails. Then add and commit your test file,
referencing the issue number in the commit message. Finally, fix the bug, make
sure your test passes and reference the issue in your commit message.
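
A regression test following this workflow might look roughly like the sketch below. The issue number `1234` and the function `split_on_whitespace` are invented for illustration; a real test would import the affected spaCy code instead of defining a stand-in.

```python
# Hypothetical file: spacy/tests/regression/test_issue1234.py
# The issue number and the function below are made up for this sketch.


def split_on_whitespace(text):
    """Stand-in for the code under test; a real test would import it from spaCy."""
    # The imagined bug: text.split(" ") returned empty strings for repeated
    # spaces. Calling split() without arguments is the imagined fix.
    return text.split()


def test_issue1234():
    # Written first, so it fails before the fix and passes afterwards.
    assert split_on_whitespace("hello  world") == ["hello", "world"]
```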
📖 **For more information on how to add tests, check out the [tests README](spacy/tests/README.md).**
## Code conventions
Code should loosely follow [pep8](https://www.python.org/dev/peps/pep-0008/).
As of `v2.1.0`, spaCy uses [`black`](https://github.com/ambv/black) for code
formatting and [`flake8`](http://flake8.pycqa.org/en/latest/) for linting its
Python modules. If you've built spaCy from source, you'll already have both
tools installed.
**⚠️ Note that formatting and linting are currently only possible for Python
modules in `.py` files, not Cython modules in `.pyx` and `.pxd` files.**
### Code formatting
[`black`](https://github.com/ambv/black) is an opinionated Python code
formatter, optimised to produce readable code and small diffs. You can run
`black` from the command-line, or via your code editor. For example, if you're
using [Visual Studio Code](https://code.visualstudio.com/), you can add the
following to your `settings.json` to use `black` for formatting and auto-format
your files on save:
```json
{
    "python.formatting.provider": "black",
    "[python]": {
        "editor.formatOnSave": true
    }
}
```
[See here](https://github.com/ambv/black#editor-integration) for the full
list of available editor integrations.
#### Disabling formatting
There are a few cases where auto-formatting doesn't improve readability, for
example in some of the language data files like `tag_map.py`, or in
the tests that construct `Doc` objects from lists of words and other labels.
Wrapping a block in `# fmt: off` and `# fmt: on` lets you disable formatting
for that particular code. Here's an example:
```python
# fmt: off
text = "I look forward to using Thingamajig. I've been told it will make my life easier..."
heads = [1, 0, -1, -2, -1, -1, -5, -1, 3, 2, 1, 0, 2, 1, -3, 1, 1, -3, -7]
deps = ["nsubj", "ROOT", "advmod", "prep", "pcomp", "dobj", "punct", "",
        "nsubjpass", "aux", "auxpass", "ROOT", "nsubj", "aux", "ccomp",
        "poss", "nsubj", "ccomp", "punct"]
# fmt: on
```
### Code linting
[`flake8`](http://flake8.pycqa.org/en/latest/) is a tool for enforcing code
style. It scans one or more files and outputs errors and warnings. This feedback
can help you stick to general standards and conventions, and can be very useful
for spotting potential mistakes and inconsistencies in your code. The most
important things to watch out for are syntax errors and undefined names, but you
also want to keep an eye on unused declared variables or repeated
(i.e. overwritten) dictionary keys. If your code was formatted with `black`
(see above), you shouldn't see any formatting-related warnings.
The [`.flake8`](.flake8) config defines the configuration we use for this
codebase. For example, we're not super strict about the line length, and we're
excluding very large files like lemmatization and tokenizer exception tables.
Ideally, running the following command from within the repo directory should
not return any errors or warnings:
```bash
flake8 spacy
```
#### Disabling linting
Sometimes, you explicitly want to write code that's not compatible with our
rules. For example, a module's `__init__.py` might import a function so other
modules can import it from there, but `flake8` will complain about an unused
import. And although it's generally discouraged, there might be cases where it
makes sense to use a bare `except`.
To ignore a given line, you can add a comment like `# noqa: F401`, specifying
the code of the error or warning we want to ignore. It's also possible to
ignore several comma-separated codes at once, e.g. `# noqa: E731,E123`. Here
are some examples:
```python
# The imported class isn't used in this file, but imported here, so it can be
# imported *from* here by another module.
from .submodule import SomeClass  # noqa: F401

try:
    do_something()
except:  # noqa: E722
    # This bare except is justified, for some specific reason
    do_something_else()
```
### Python conventions
All Python code must be written in an **intersection of Python 2 and Python 3**.
This is easy in Cython, but somewhat ugly in Python. Logic that deals with
Python or platform compatibility should only live in
[`spacy.compat`](spacy/compat.py). To distinguish them from the builtin
functions, replacement functions are suffixed with an underscore, for example
`unicode_`. If you need to access the user's version or platform information,
for example to show more specific error messages, you can use the `is_config()`
helper function.
```python
from .compat import unicode_, is_config

compatible_unicode = unicode_('hello world')
if is_config(windows=True, python2=True):
    print("You are using Python 2 on Windows.")
```
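
To show how the underscore convention keeps compatibility logic in one place, here is a minimal sketch of what such a module could look like. It is not the actual contents of `spacy.compat`: apart from `unicode_`, the names (`is_python2`, `ensure_text`) are assumptions made for this example.

```python
import sys

# Sketch of a compatibility module. Only `unicode_` is a name taken from the
# text above; everything else is made up for illustration.
is_python2 = sys.version_info[0] == 2

if is_python2:
    unicode_ = unicode  # noqa: F821 (this builtin only exists on Python 2)
else:
    unicode_ = str


def ensure_text(value, encoding="utf8"):
    """Return `value` as a unicode string on both Python 2 and Python 3."""
    if isinstance(value, bytes) and not isinstance(value, unicode_):
        return value.decode(encoding)
    return unicode_(value)
```

Other modules can then import `unicode_` without doing any version checks of their own.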
Code that interacts with the file-system should accept objects that follow the
`pathlib.Path` API, without assuming that the object inherits from `pathlib.Path`.
If the function is user-facing and takes a path as an argument, it should check
whether the path is provided as a string. Strings should be converted to
`pathlib.Path` objects. Serialization and deserialization functions should always
accept **file-like objects**, as it makes the library io-agnostic. Working on
buffers makes the code more general, easier to test, and compatible with Python
3's asynchronous IO.
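
The path and file-like conventions can be sketched as follows. The helper names (`ensure_path`, `write_data`, `read_data`, `to_disk`) and the JSON format are illustrative assumptions rather than a description of spaCy's own serialization code, and the sketch targets Python 3 only for brevity.

```python
import io
import json
from pathlib import Path


def ensure_path(path):
    """Convert strings to Path objects; pass other path-like objects through."""
    if isinstance(path, str):
        return Path(path)
    return path


def write_data(data, file_):
    """Serialize to any binary file-like object (an open file, BytesIO, ...)."""
    file_.write(json.dumps(data).encode("utf8"))


def read_data(file_):
    """Deserialize from any binary file-like object."""
    return json.loads(file_.read().decode("utf8"))


def to_disk(data, path):
    """User-facing wrapper: accepts a string or an object with the Path API."""
    path = ensure_path(path)
    with path.open("wb") as file_:
        write_data(data, file_)


# Working on a buffer means no file system is needed to exercise the logic.
buffer = io.BytesIO()
write_data({"lang": "en"}, buffer)
buffer.seek(0)
print(read_data(buffer))
```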

Although spaCy uses a lot of classes, **inheritance is viewed with some suspicion**:
it's seen as a mechanism of last resort. You should discuss plans to extend
the class hierarchy before implementing.

We have a number of conventions around variable naming that are still being
documented, and aren't 100% strict. A general policy is that instances of the
class `Doc` should by default be called `doc`, `Token` `token`, `Lexeme` `lex`,
`Vocab` `vocab` and `Language` `nlp`. You should avoid using these names for
variables of other types. For instance, don't name a text string `doc`; you
should usually call this `text`. Two general code style preferences further help
with naming. First, **lean away from introducing temporary variables**, as these
clutter your namespace. This is one reason why comprehension expressions are
often preferred. Second, **keep your functions shortish**, so they can work in a
smaller scope. Of course, this is a question of trade-offs.
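
To illustrate the two style preferences, here's a small sketch in plain Python. Both functions are hypothetical and only exist for this comparison. They return the same result, but the second one avoids temporary variables by using a generator expression, and the input string is called `text` rather than `doc`.

```python
def count_long_words_verbose(text, min_length=5):
    """Less preferred: several temporary variables clutter the namespace."""
    words = text.split()
    long_words = []
    for word in words:
        if len(word) >= min_length:
            long_words.append(word)
    count = len(long_words)
    return count


def count_long_words(text, min_length=5):
    """Preferred: a short function with a single generator expression."""
    return sum(1 for word in text.split() if len(word) >= min_length)
```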

### Cython conventions

spaCy's core data structures are implemented as [Cython](http://cython.org/) `cdef`
classes. Memory is managed through the `cymem.cymem.Pool` class, which allows
you to allocate memory which will be freed when the `Pool` object is garbage
collected. This means you usually don't have to worry about freeing memory. You
just have to decide which Python object owns the memory, and make it own the
`Pool`. When that object goes out of scope, the memory will be freed. You do
have to take care that no pointers outlive the object that owns them — but this
is generally quite easy.

All Cython modules should have the `# cython: infer_types=True` compiler
directive at the top of the file. This makes the code much cleaner, as it avoids
the need for many type declarations. If possible, you should prefer to declare
your functions `nogil`, even if you don't especially care about multi-threading.
The reason is that `nogil` functions help the Cython compiler reason about your
code quite a lot — you're telling the compiler that no Python dynamics are
possible. This lets the compiler raise many more errors, and ensures your
function will run at C speed.

Cython gives you many choices of sequences: you could have a Python list, a
numpy array, a memory view, a C++ vector, or a pointer. Pointers are preferred,
because they are fastest, have the most explicit semantics, and let the compiler
check your code more strictly. C++ vectors are also great — but you should only
use them internally in functions. It's less friendly to accept a vector as an
argument, because that asks the user to do much more work.

Here's how to get a pointer from a numpy array, memory view or vector:

```cython
cdef void get_pointers(np.ndarray[int, mode='c'] numpy_array, vector[int] cpp_vector, int[::1] memory_view) nogil:
    pointer1 = <int*>numpy_array.data
    pointer2 = cpp_vector.data()
    pointer3 = &memory_view[0]
```

Both C arrays and C++ vectors reassure the compiler that no Python operations
are possible on your variable. This is a big advantage: it lets the Cython
compiler raise many more errors for you.

When getting a pointer from a numpy array or memoryview, take care that the data
is actually stored in C-contiguous order — otherwise you'll get a pointer to
nonsense. The type-declarations in the code above should generate runtime errors
if buffers with incorrect memory layouts are passed in.

To iterate over the array, the following style is preferred:

```cython
cdef int c_total(const int* int_array, int length) nogil:
    total = 0
    for item in int_array[:length]:
        total += item
    return total
```

If this is confusing, consider that the compiler couldn't deal with
`for item in int_array:` — there's no length attached to a raw pointer, so how
could we figure out where to stop? The length is provided in the slice notation
as a solution to this. Note that we don't have to declare the type of `item` in
the code above — the compiler can easily infer it. This gives us tidy code that
looks quite like Python, but is exactly as fast as C — because we've made sure
the compilation to C is trivial.

Your functions cannot be declared `nogil` if they need to create Python objects
or call Python functions. This is perfectly okay — you shouldn't torture your
code just to get `nogil` functions. However, if your function isn't `nogil`, you
should compile your module with `cython -a --cplus my_module.pyx` and open the
resulting `my_module.html` file in a browser. This will let you see how Cython
is compiling your code. Calls into the Python run-time will be in bright yellow.
This lets you easily see whether Cython is able to correctly type your code, or
whether there are unexpected problems.

Finally, if you're new to Cython, you should expect to find the first steps a
bit frustrating. It's a very large language, since it's essentially a superset
of Python and C++, with additional complexity and syntax from numpy. The
[documentation](http://docs.cython.org/en/latest/) isn't great, and there are
many "traps for new players". Working in Cython is very rewarding once you're
over the initial learning curve. As with C and C++, the first way you write
something in Cython will often be the performance-optimal approach. In contrast,
Python optimisation generally requires a lot of experimentation. Is it faster to
have an `if item in my_dict` check, or to use `.get()`? What about `try`/`except`?
Does this numpy operation create a copy? There's no way to guess the answers to
these questions, and you'll usually be dissatisfied with your results — so
there's no way to know when to stop this process. In the worst case, you'll make
a mess that invites the next reader to try their luck too. This is like one of
those [volcanic gas-traps](http://www.wemjournal.org/article/S1080-6032%2809%2970088-2/abstract),
where the rescuers keep passing out from low oxygen, causing another rescuer to
follow — only to succumb themselves. In short, just say no to optimizing your
Python. If it's not fast enough the first time, just switch to Cython.

### Resources to get you started

- [PEP 8 Style Guide for Python Code](https://www.python.org/dev/peps/pep-0008/) (python.org)
- [Official Cython documentation](http://docs.cython.org/en/latest/) (cython.org)
- [Writing C in Cython](https://explosion.ai/blog/writing-c-in-cython) (explosion.ai)
- [Multi-threading spaCy's parser and named entity recogniser](https://explosion.ai/blog/multithreading-with-cython) (explosion.ai)

## Adding tests

spaCy uses the [pytest](http://doc.pytest.org/) framework for testing. For more
info on this, see the [pytest documentation](http://docs.pytest.org/en/latest/contents.html).
Tests for spaCy modules and classes live in their own directories of the same
name. For example, tests for the `Tokenizer` can be found in
[`/spacy/tests/tokenizer`](spacy/tests/tokenizer). To be interpreted and run,
all test files and test functions need to be prefixed with `test_`.

When adding tests, make sure to use descriptive names, keep the code short and
concise and only test for one behaviour at a time. Try to `parametrize` test
cases wherever possible, use our pre-defined fixtures for spaCy components and
avoid unnecessary imports.

Extensive tests that take a long time should be marked with `@pytest.mark.slow`.
Tests that require the model to be loaded should be marked with
`@pytest.mark.models`. Loading the models is expensive and not necessary if
you're not actually testing the model performance. If all you need is a `Doc`
object with annotations like heads, POS tags or the dependency parse, you can
use the `get_doc()` utility function to construct it manually.
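
As a rough sketch of what these guidelines can look like in practice, the example below parametrizes one test and marks another as slow. The `simple_tokenize` function is a hypothetical stand-in so that the example runs without spaCy or a model. Real tests would use the pre-defined fixtures instead.

```python
import pytest


def simple_tokenize(text):
    """Hypothetical stand-in for a tokenizer: split on whitespace."""
    return text.split()


# One behaviour per test, with the inputs supplied via parametrize.
@pytest.mark.parametrize(
    "text,expected_length", [("hello world", 2), ("one", 1), ("", 0)]
)
def test_simple_tokenize_length(text, expected_length):
    assert len(simple_tokenize(text)) == expected_length


# Long-running tests are marked, so they can be treated separately.
@pytest.mark.slow
def test_simple_tokenize_long_text():
    text = "word " * 100000
    assert len(simple_tokenize(text)) == 100000
```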

📖 **For more guidelines and information on how to add tests, check out the [tests README](spacy/tests/README.md).**

## Updating the website

For instructions on how to build and run the [website](https://spacy.io) locally, see **[Setup and installation](https://github.com/explosion/spaCy/blob/master/website/README.md#setup-and-installation-setup)** in the _website_ directory's README.

The docs can always use another example or more detail, and they should always
be up to date and not misleading. To quickly find the correct file to edit,
simply click on the "Suggest edits" button at the bottom of a page.

📖 **For more info and troubleshooting guides, check out the [website README](website).**

## Publishing spaCy extensions and plugins

We're very excited about all the new possibilities for **community extensions**
and plugins in spaCy v2.0, and we can't wait to see what you build with it!

- An extension or plugin should add substantial functionality, be
**well-documented** and **open-source**. It should be available for users to download
and install as a Python package, for example via [PyPI](http://pypi.python.org).
- Extensions that write to `Doc`, `Token` or `Span` attributes should be wrapped
as [pipeline components](https://spacy.io/usage/processing-pipelines#custom-components)
that users can **add to their processing pipeline** using `nlp.add_pipe()`.
- When publishing your extension on GitHub, **tag it** with the topics
[`spacy`](https://github.com/topics/spacy?o=desc&s=stars) and
[`spacy-extension`](https://github.com/topics/spacy-extension?o=desc&s=stars)
to make it easier to find. Those are also the topics we're linking to from the
spaCy website. If you're sharing your project on Twitter, feel free to tag
[@spacy_io](https://twitter.com/spacy_io) so we can check it out.
- Once your extension is published, you can open an issue on the
[issue tracker](https://github.com/explosion/spacy/issues) to suggest it for the
[resources directory](https://spacy.io/usage/resources#extensions) on the
website.

📖 **For more tips and best practices, see the [checklist for developing spaCy extensions](https://spacy.io/usage/processing-pipelines#extensions).**

## Code of conduct

spaCy adheres to the
[Contributor Covenant Code of Conduct](http://contributor-covenant.org/version/1/4/).
By participating, you are expected to uphold this code.