spaCy/website/usage/_training/_tips.jade

//- 💫 DOCS > USAGE > TRAINING > OPTIMIZATION TIPS AND ADVICE
p
| There are lots of conflicting "recipes" for training deep neural
| networks at the moment. The cutting-edge models take a very long time to
| train, so most researchers can't run enough experiments to figure out
| what's #[em really] going on. For what it's worth, here's a recipe that seems
| to work well on a lot of NLP problems:
+list("numbers")
+item
| Initialise with batch size 1, and compound to a maximum determined
| by your data size and problem type.
+item
| Use the Adam solver with a fixed learning rate.
+item
| Use averaged parameters.
+item
| Use L2 regularization.
+item
| Clip gradients by L2 norm to 1.
+item
| On small data sizes, start at a high dropout rate, with linear decay.
p
| This recipe has been cobbled together experimentally. Here's why the
| various elements of the recipe made enough sense to try initially, and
| what you might try changing, depending on your problem.
+h(3, "tips-batch-size") Compounding batch size
p
| The trick of increasing the batch size is starting to become quite
| popular (see #[+a("https://arxiv.org/abs/1711.00489") Smith et al., 2017]).
| Their recipe is quite different from how spaCy's models are being
| trained, but there are some similarities. In training the various spaCy
| models, we haven't found much advantage from decaying the learning
| rate, but starting with a low batch size has definitely helped. You
| should try it out on your data, and see how you go. Here's our current
| strategy:
+code("Batch heuristic").
from spacy.util import minibatch, compounding

def get_batches(train_data, model_type):
    max_batch_sizes = {'tagger': 32, 'parser': 16, 'ner': 16, 'textcat': 64}
    max_batch_size = max_batch_sizes[model_type]
    # Halve the maximum batch size for smaller datasets.
    if len(train_data) < 1000:
        max_batch_size /= 2
    if len(train_data) < 500:
        max_batch_size /= 2
    # Start at batch size 1 and compound up towards the maximum.
    batch_size = compounding(1., max_batch_size, 1.001)
    batches = minibatch(train_data, size=batch_size)
    return batches
p
| This will set the batch size to start at #[code 1], and increase each
| batch until it reaches a maximum size. The tagger, parser and entity
| recognizer all take whole sentences as input, so they're learning a lot
| of labels in a single example. You therefore need smaller batches for
| them. The batch size for the text categorizer should be somewhat larger,
| especially if your documents are long.
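p
| If you're wondering how quickly the compounding schedule grows, you can
| inspect the #[code compounding] generator directly. The growth is
| geometric: at a rate of #[code 1.001], it takes roughly
| #[code log(32) / log(1.001)], i.e. around 3,500 batches, to reach a
| maximum of #[code 32]. A quick sketch:
+code.
from spacy.util import compounding
batch_sizes = compounding(1., 32., 1.001)
# The generator yields 1.0, 1.001, 1.002001, ... so the effective integer
# batch size stays at 1 for several hundred batches, and only approaches
# the maximum of 32 after a few thousand updates.
print([int(next(batch_sizes)) for _ in range(5)])   # [1, 1, 1, 1, 1]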
+h(3, "tips-hyperparams") Learning rate, regularization and gradient clipping
p
| By default spaCy uses the Adam solver, with default settings
| (learning rate #[code 0.001], #[code beta1=0.9], #[code beta2=0.999]).
| Some researchers have said they found these settings terrible on their
| problems, but they've always performed very well in training spaCy's
| models, in combination with the rest of our recipe. You can change these
| settings directly, by modifying the corresponding attributes on the
| #[code optimizer] object. You can also set environment variables to
| adjust the defaults.
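p
| For example, the #[code optimizer] returned by
| #[+api("language#begin_training") #[code nlp.begin_training]] exposes its
| hyper-parameters as plain attributes. As a minimal sketch, using the
| attribute names from Thinc's Adam implementation at the time of writing
| (they may differ in other versions):
+code.
optimizer = nlp.begin_training()
# Learning rate and Adam's moment decay rates, i.e. the defaults quoted above.
optimizer.alpha = 0.001
optimizer.b1 = 0.9
optimizer.b2 = 0.999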
p
| There are two other key hyper-parameters of the solver: #[code L2]
| #[strong regularization], and #[strong gradient clipping]
| (#[code max_grad_norm]). Gradient clipping is a hack that's not discussed
| often, but that everybody seems to be using. It's quite important in helping
| to ensure the network doesn't diverge, which is a fancy way of saying
| "fall over during training". The effect is sort of similar to setting the
| learning rate low. It can also compensate for a large batch size (this is
| a good example of how the choices of all these hyper-parameters
| intersect).
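p
| Both of these also live on the #[code optimizer] object. Again as a hedged
| sketch, using the attribute names from Thinc's optimizer at the time of
| writing:
+code.
# L2 regularization strength, and the gradient clipping threshold
# ("clip gradients by L2 norm to 1", as in the recipe above).
optimizer.L2 = 1e-6
optimizer.max_grad_norm = 1.0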
+h(3, "tips-dropout") Dropout rate
p
| For small datasets, it's useful to set a
| #[strong high dropout rate at first], and #[strong decay] it down towards
| a more reasonable value. This helps avoid the network immediately
| overfitting, while still encouraging it to learn some of the more
| interesting things in your data. spaCy comes with a
| #[+api("top-level#util.decaying") #[code decaying]] utility function to
| facilitate this. You might try setting:
+code.
from spacy.util import decaying
dropout = decaying(0.6, 0.2, 1e-4)
p
| You can then draw values from the iterator with #[code next(dropout)],
| which you would pass to the #[code drop] keyword argument of
| #[+api("language#update") #[code nlp.update]]. It's pretty much always a
| good idea to use at least #[strong some dropout]. All of the models
| currently use Bernoulli dropout, for no particularly principled reason;
| we just haven't experimented with other schemes like Gaussian dropout
| yet.
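p
| Putting that together, a single update might look like the following
| sketch, assuming you already have an #[code optimizer] and a batch of
| texts and annotations:
+code.
# dropout is the decaying(0.6, 0.2, 1e-4) iterator created above; each call
# to next(dropout) returns a slightly lower rate.
nlp.update(texts, annotations, drop=next(dropout), sgd=optimizer)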
+h(3, "tips-param-avg") Parameter averaging
p
| The last part of our optimization recipe is #[strong parameter averaging],
| an old trick introduced by
| #[+a("https://cseweb.ucsd.edu/~yfreund/papers/LargeMarginsUsingPerceptron.pdf") Freund and Schapire (1999)],
| popularised in the NLP community by
| #[+a("http://www.aclweb.org/anthology/P04-1015") Collins (2002)],
| and explained in more detail by
| #[+a("http://leon.bottou.org/projects/sgd") Leon Bottou]. Just about the
| only other people who seem to be using this for neural network training
| are the SyntaxNet team (one of whom is Michael Collins), but it really
| seems to work great on every problem.
p
| The trick is to store the moving average of the weights during training.
| We don't optimize this average; we just track it. Then when we want to
| actually use the model, we use the averages, not the most recent value.
| In spaCy (and #[+a(gh("thinc")) Thinc]) this is done by using a
| context manager, #[+api("language#use_params") #[code use_params]], to
| temporarily replace the weights:
+code.
with nlp.use_params(optimizer.averages):
    nlp.to_disk('/model')
p
| The context manager is handy because you naturally want to evaluate and
| save the model at various points during training (e.g. after each epoch).
| After evaluating and saving, the context manager will exit and the
| weights will be restored, so you resume training from the most recent
| value, rather than the average. By evaluating the model after each epoch,
| you can remove one hyper-parameter from consideration (the number of
| epochs). Having one less magic number to guess is extremely nice, so
| having the averaging under a context manager is very convenient.
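p
| As a sketch, a training loop that uses the averages for evaluation and
| saving might look like the following. The #[code evaluate] helper,
| #[code n_epochs] and #[code dev_data] are hypothetical placeholders for
| your own scoring code and data:
+code.
import random
for epoch in range(n_epochs):
    random.shuffle(train_data)
    for batch in get_batches(train_data, 'ner'):
        texts, annotations = zip(*batch)
        nlp.update(texts, annotations, drop=next(dropout), sgd=optimizer)
    with nlp.use_params(optimizer.averages):
        # Evaluate and save using the averaged weights...
        print(epoch, evaluate(nlp, dev_data))
        nlp.to_disk('/model-epoch-%d' % epoch)
    # ...then drop back to the most recent weights and keep training.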
+h(3, "tips-transfer-learning") Transfer learning
p
| Finally, if you're training from a small data set, it's very useful to
| start off with some knowledge already in the model. #[strong Word vectors]
| are an easy and reliable way to do that, but depending on the
| application, you may also be able to start with useful knowledge from one
| of spaCy's #[+a("/models") pre-trained models], such as the parser,
| entity recognizer and tagger. If you're adapting a pre-trained model and
| you want it to retain accuracy on the tasks it was originally trained
| for, you should consider the "catastrophic forgetting" problem.
| #[+a("https://explosion.ai/blog/pseudo-rehearsal-catastrophic-forgetting", true) See this blog post]
| to read more about the problem and our suggested solution,
| pseudo-rehearsal.
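p
| As a small example, instead of starting from a blank model you might load
| one of the packages that ships with word vectors and add your new label
| on top. The model and label names below are just illustrations:
+code.
import spacy
# A package with vectors, rather than spacy.blank('en'), so the existing
# knowledge (vocabulary, vectors, pipeline weights) carries over.
nlp = spacy.load('en_core_web_md')
ner = nlp.get_pipe('ner')
ner.add_label('MY_LABEL')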