mirror of https://github.com/explosion/spaCy.git
Add BILUO scheme to annotation docs
parent 99b631617d
commit 465a1dd710
@@ -71,6 +71,44 @@ include _annotation/_dep-labels
include _annotation/_named-entities

+h(3, "biluo") BILUO Scheme

p
| spaCy translates character offsets into the BILUO scheme, in order to
| decide the cost of each action given the current state of the entity
| recognizer. The costs are then used to calculate the gradient of the
| loss, to train the model.
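
The translation from character offsets to BILUO tags described above can be sketched in plain Python. This is a minimal illustration, not spaCy's implementation: the helper names `whitespace_tokens` and `offsets_to_biluo` are hypothetical, the whitespace tokenizer stands in for spaCy's own, and the `PREFIX-LABEL` tag spelling (for example `B-GPE`) is an assumption made for the example.

```python
import re


def whitespace_tokens(text):
    """Return (start_char, end_char) for each whitespace-separated token.

    Stand-in tokenizer for illustration only.
    """
    return [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def offsets_to_biluo(tokens, entities):
    """Convert (start_char, end_char, label) entities into one tag per token.

    Raises ValueError if an entity does not line up with token boundaries.
    """
    tags = ["O"] * len(tokens)
    # Map character positions to the index of the token that starts/ends there.
    starts = {start: i for i, (start, _) in enumerate(tokens)}
    ends = {end: i for i, (_, end) in enumerate(tokens)}
    for start_char, end_char, label in entities:
        first = starts.get(start_char)
        last = ends.get(end_char)
        if first is None or last is None or last < first:
            raise ValueError(
                "entity (%d, %d) does not align with tokens" % (start_char, end_char)
            )
        if first == last:
            tags[first] = "U-" + label  # single-token entity
        else:
            tags[first] = "B-" + label  # first token
            for i in range(first + 1, last):
                tags[i] = "I-" + label  # inner tokens
            tags[last] = "L-" + label  # final token
    return tags


text = "I flew to New York City on Monday"
entities = [(10, 23, "GPE"), (27, 33, "DATE")]
tags = offsets_to_biluo(whitespace_tokens(text), entities)
print(tags)
```

Raising an error on misaligned offsets is a design choice for this sketch; it makes annotation mistakes visible instead of silently tagging the tokens as `O`.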

+aside("Why BILUO, not IOB?")
| There are several coding schemes for encoding entity annotations as
| token tags. These coding schemes are equally expressive, but not
| necessarily equally learnable.
| #[+a("http://www.aclweb.org/anthology/W09-1119") Ratinov and Roth]
| showed that the minimal #[strong Begin], #[strong In], #[strong Out]
| scheme was more difficult to learn than the #[strong BILUO] scheme that
| we use, which explicitly marks boundary tokens.

+table([ "Tag", "Description" ])
+row
+cell #[code #[span.u-color-theme B] EGIN]
+cell The first token of a multi-token entity.

+row
+cell #[code #[span.u-color-theme I] N]
+cell An inner token of a multi-token entity.

+row
+cell #[code #[span.u-color-theme L] AST]
+cell The final token of a multi-token entity.

+row
+cell #[code #[span.u-color-theme U] NIT]
+cell A single-token entity.

+row
+cell #[code #[span.u-color-theme O] UT]
+cell A non-entity token.
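
To illustrate how the five tags in the table combine, a tag sequence can be read back into token-level entity spans. The sketch below is illustrative only: `biluo_to_spans` is a hypothetical helper, not part of spaCy, and it again assumes tags are spelled `PREFIX-LABEL`.

```python
def biluo_to_spans(tags):
    """Return (start_token, end_token, label) spans; end_token is exclusive.

    Raises ValueError for sequences the scheme does not allow,
    such as a B tag that is never closed by an L tag.
    """
    spans = []
    start = None  # token index where the open multi-token entity began
    label = None  # label of the open multi-token entity
    for i, tag in enumerate(tags):
        prefix, _, tag_label = tag.partition("-")
        if prefix in ("O", "U", "B"):
            # None of these may appear inside an unfinished entity.
            if start is not None:
                raise ValueError("entity opened at token %d was not closed" % start)
            if prefix == "U":
                spans.append((i, i + 1, tag_label))
            elif prefix == "B":
                start, label = i, tag_label
        elif prefix in ("I", "L"):
            if start is None or tag_label != label:
                raise ValueError("unexpected tag %r at token %d" % (tag, i))
            if prefix == "L":
                spans.append((start, i + 1, label))
                start = label = None
        else:
            raise ValueError("unknown tag %r at token %d" % (tag, i))
    if start is not None:
        raise ValueError("entity opened at token %d was not closed" % start)
    return spans


example = ["O", "O", "O", "B-GPE", "I-GPE", "L-GPE", "O", "U-DATE"]
print(biluo_to_spans(example))
```

Because the boundary tokens are marked explicitly, an ill-formed sequence can be detected at the token where it goes wrong, which the checks above make use of.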
+h(2, "json-input") JSON input format for training
p