mirror of https://github.com/explosion/spaCy.git
Update training tips
This commit is contained in: parent b20779bac4, commit 6ae0ebfa3a
@@ -5,7 +5,43 @@ p
    | networks at the moment. The cutting-edge models take a very long time to
    | train, so most researchers can't run enough experiments to figure out
    | what's #[em really] going on. For what it's worth, here's a recipe that seems
    | to work well on a lot of NLP problems:

+list("numbers")
    +item
        | Initialise with batch size 1, and compound to a maximum determined
        | by your data size and problem type.
    +item
        | Use the Adam solver with a fixed learning rate.
    +item
        | Use averaged parameters.
    +item
        | Use L2 regularization.
    +item
        | Clip gradients by their L2 norm to 1.
    +item
        | On small data sizes, start at a high dropout rate, with linear decay.

p
    | This recipe has been cobbled together experimentally. Here's why the
    | various elements of the recipe made enough sense to try initially, and
    | what you might try changing, depending on your problem.
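
p
    | To make the recipe concrete, here's a minimal sketch of how these pieces
    | could be wired together with spaCy's training utilities. The tiny
    | #[code train_data] list, the #[code textcat] pipe, the number of epochs
    | and the exact hyper-parameter values below are placeholders – swap in
    | your own task and tune the numbers.

+code("Recipe sketch (illustrative)").
    import random
    import spacy
    from spacy.util import minibatch, compounding, decaying

    nlp = spacy.blank('en')
    textcat = nlp.create_pipe('textcat')
    textcat.add_label('POSITIVE')
    nlp.add_pipe(textcat)

    # Placeholder dataset – substitute your own (text, annotations) pairs.
    train_data = [
        ('I loved it', {'cats': {'POSITIVE': 1.0}}),
        ('I hated it', {'cats': {'POSITIVE': 0.0}}),
    ]

    optimizer = nlp.begin_training()            # Adam by default; learning rate stays fixed
    dropout = decaying(0.6, 0.2, 1e-4)          # start at a high dropout rate, decay linearly
    batch_sizes = compounding(1., 32., 1.001)   # batch size 1, compounded upwards

    for epoch in range(10):
        random.shuffle(train_data)
        for batch in minibatch(train_data, size=batch_sizes):
            texts, annotations = zip(*batch)
            nlp.update(texts, annotations, sgd=optimizer, drop=next(dropout))

    # Evaluate or save with the averaged parameters.
    with nlp.use_params(optimizer.averages):
        nlp.to_disk('/tmp/model')

p
    | Note that #[code use_params(optimizer.averages)] only swaps the averaged
    | weights in temporarily, so training can continue with the raw parameters
    | afterwards.
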
+h(3, "tips-batch-size") Compounding batch size
p
    | The trick of increasing the batch size is starting to become quite
    | popular (see #[+a("https://arxiv.org/abs/1711.00489") Smith et al., 2017]).
    | Their recipe is quite different from how spaCy's models are being
    | trained, but there are some similarities. In training the various spaCy
    | models, we haven't found much advantage from decaying the learning
    | rate – but starting with a low batch size has definitely helped. You
    | should try it out on your data, and see how you go. Here's our current
    | strategy:

+code("Batch heuristic").
    def get_batches(train_data, model_type):

@@ -27,15 +63,6 @@ p
    | them. The batch size for the text categorizer should be somewhat larger,
    | especially if your documents are long.
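
p
    | For instance, the same compounding schedule can simply be given a larger
    | maximum for the text categorizer. The maxima below are illustrative, not
    | recommended values, and #[code train_data] is a placeholder.

+code("Per-component batch sizes (illustrative)").
    from spacy.util import minibatch, compounding

    train_data = [('a placeholder text', {})] * 100   # substitute real examples

    # A parser-style component might cap the batch size lower than a text
    # categorizer working on long documents.
    parser_batches = minibatch(train_data, size=compounding(1., 16., 1.001))
    textcat_batches = minibatch(train_data, size=compounding(1., 64., 1.001))

    for batch in textcat_batches:
        pass  # feed each batch to nlp.update(...) here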
+h(3, "tips-hyperparams") Learning rate, regularization and gradient clipping
p
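
p
    | As a quick reminder of what the gradient clipping and L2 items in the
    | recipe mean mechanically, here's a small NumPy sketch. The clipping
    | threshold and penalty strength are illustrative values, not a statement
    | of spaCy's internal settings.

+code("Clipping and L2 penalty (illustrative)").
    import numpy as np

    def clip_by_l2_norm(grad, max_norm=1.0):
        # Rescale the gradient if its L2 norm exceeds max_norm.
        norm = np.linalg.norm(grad)
        if norm > max_norm:
            grad = grad * (max_norm / norm)
        return grad

    def l2_penalty_grad(weights, l2=1e-6):
        # Gradient of the penalty term 0.5 * l2 * ||W||**2, added to the loss gradient.
        return l2 * weights

    print(clip_by_l2_norm(np.array([3.0, 4.0])))   # [0.6 0.8] – norm clipped to 1.0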