2020-07-26 11:42:08 +00:00
spaCy's tagger, parser, text categorizer and many other components are powered
by **statistical models**. Every "decision" these components make – for example,
💫 Update website (#3285)
<!--- Provide a general summary of your changes in the title. -->
## Description
The new website is implemented using [Gatsby](https://www.gatsbyjs.org) with [Remark](https://github.com/remarkjs/remark) and [MDX](https://mdxjs.com/). This allows authoring content in **straightforward Markdown** without the usual limitations. Standard elements can be overwritten with powerful [React](http://reactjs.org/) components and wherever Markdown syntax isn't enough, JSX components can be used. Hopefully, this update will also make it much easier to contribute to the docs. Once this PR is merged, I'll implement auto-deployment via [Netlify](https://netlify.com) on a specific branch (to avoid building the website on every PR). There's a bunch of other cool stuff that the new setup will allow us to do – including writing front-end tests, service workers, offline support, implementing a search and so on.
This PR also includes various new docs pages and content.
Resolves #3270. Resolves #3222. Resolves #2947. Resolves #2837.
### Types of change
enhancement
## Checklist
<!--- Before you submit the PR, go over this checklist and make sure you can
tick off all the boxes. [] -> [x] -->
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
2019-02-17 18:31:19 +00:00
which part-of-speech tag to assign, or whether a word is a named entity – is a
**prediction** based on the model's current **weight values**. The weight
values are estimated based on examples the model has seen
during **training**. To train a model, you first need training data – examples
of text, and the labels you want the model to predict. This could be a
part-of-speech tag, a named entity or any other information.

Training is an iterative process in which the model's predictions are compared
against the reference annotations in order to estimate the **gradient of the
loss**. The gradient of the loss is then used to calculate the gradient of the
weights through [backpropagation](https://thinc.ai/backprop101). The gradients
indicate how the weight values should be changed so that the model's
predictions become more similar to the reference labels over time.
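
The loop described above can be sketched in a few lines of plain Python. This is a toy logistic regression with a single weight vector, not spaCy's or Thinc's actual implementation: the feature vectors, labels, learning rate and the names `TRAIN`, `predict` and `train` are all made up for illustration.

```python
import math

# Hypothetical toy training data: (features, reference label) pairs.
TRAIN = [
    ([1.0, 0.0], 1),
    ([0.9, 0.2], 1),
    ([0.1, 1.0], 0),
    ([0.0, 0.8], 0),
]


def predict(weights, features):
    """Return the model's probability that the label is 1."""
    score = sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-score))


def train(examples, n_iter=200, learn_rate=0.5):
    """Fit the weights by repeatedly comparing predictions to the labels."""
    weights = [0.0] * len(examples[0][0])
    for _ in range(n_iter):
        for features, label in examples:
            prediction = predict(weights, features)
            # Gradient of the loss with respect to the score: for a sigmoid
            # output with log loss, this is prediction minus reference label.
            d_score = prediction - label
            # Backpropagate to get the gradient of each weight, then take a
            # small step in the opposite direction.
            for i, x in enumerate(features):
                weights[i] -= learn_rate * d_score * x
    return weights


weights = train(TRAIN)
predictions = [round(predict(weights, features)) for features, _ in TRAIN]
print(predictions)
```

Each pass nudges the weights so that the predictions on the training examples move closer to the reference labels.
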
> - **Training data:** Examples and their annotations.
> - **Text:** The input text the model should predict a label for.
> - **Label:** The label the model should predict.
> - **Gradient:** The direction and rate of change for a numeric value.
>   Adjusting the weights against their gradients to minimise the loss should result in predictions that
>   are closer to the reference labels on the training data.

![The training process](../../images/training.svg)

When training a model, we don't just want it to memorize our examples – we want
it to come up with a theory that can be **generalized across unseen data**.
After all, we don't just want the model to learn that this one instance of
"Amazon" right here is a company – we want it to learn that "Amazon", in
contexts _like this_, is most likely a company. That's why the training data
should always be representative of the data we want to process. A model trained
on Wikipedia, where sentences in the first person are extremely rare, will
likely perform badly on Twitter. Similarly, a model trained on romantic novels
will likely perform badly on legal text.

This also means that in order to know how the model is performing, and whether
it's learning the right things, you don't only need **training data** – you'll
also need **evaluation data**. If you only test the model with the data it was
trained on, you'll have no idea how well it's generalizing. If you want to train
a model from scratch, you usually need at least a few hundred examples for both
training and evaluation. A good rule of thumb is that you should have 10
samples for each significant figure of accuracy you report.
If you only have 100 samples and your model predicts 92 of them correctly, you
would report an accuracy of 0.9 rather than 0.92.
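
To make the need for evaluation data concrete, here's a small sketch in plain Python. It assumes a deliberately bad "model" that only memorizes its training texts, and it is not spaCy's evaluation API. The example texts, the `ORG`/`LOC` labels and the helper names `accuracy`, `memorizer` and `round_to_sig_figs` are hypothetical.

```python
def accuracy(predict, examples):
    """Fraction of examples for which predict(text) matches the label."""
    correct = sum(1 for text, label in examples if predict(text) == label)
    return correct / len(examples)


def round_to_sig_figs(value, sig_figs):
    """Round a number to the given count of significant figures."""
    return float(f"{value:.{sig_figs}g}")


# Hypothetical toy data: (text, label) pairs.
train_data = [("Amazon hired staff", "ORG"), ("Amazon river cruise", "LOC")]
dev_data = [("Amazon shares fell", "ORG"), ("Amazon rainforest fire", "LOC")]

# A "model" that only memorizes the exact training texts.
memory = dict(train_data)


def memorizer(text):
    # Falls back to a fixed guess for any text it has not seen before.
    return memory.get(text, "ORG")


train_accuracy = accuracy(memorizer, train_data)
dev_accuracy = accuracy(memorizer, dev_data)
print(train_accuracy, dev_accuracy)

# 92 correct out of 100 samples, reported with one significant figure.
reported = round_to_sig_figs(92 / 100, 1)
print(reported)
```

The memorizer looks perfect on the data it was trained on, and only the held-out examples reveal that it hasn't generalized.
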