Fix is_sent_start when converting from JSON (fix #7635) (#7655)

Data in the JSON format is split into sentences, and each sentence is saved with is_sent_start flags. Currently the flags are 1 for the first token and 0 for the others. When deserialized, this produces the pattern True, None, None, None, ..., which makes single-sentence documents look as though their sentence boundaries were never set. Because items saved in the JSON format have already been split into sentences, every is_sent_start value should be either True or False, so non-initial tokens now get -1 (False) instead of 0 (unset).
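To make the difference concrete, here is a minimal pure-Python sketch of the flag convention described above. It does not call spaCy. `decode_sent_start` and `build_sent_starts` are hypothetical helpers written for this illustration, and the mapping 1 -> True, -1 -> False, 0 -> None is assumed from the behavior the message describes.

```python
from typing import List, Optional


def decode_sent_start(value: int) -> Optional[bool]:
    """Hypothetical helper: map a stored flag to the value a reader would see.

    Assumed convention: positive -> True, negative -> False, 0 -> None (unset).
    """
    if value > 0:
        return True
    if value < 0:
        return False
    return None


def build_sent_starts(sentences: List[List[str]], non_start: int) -> List[int]:
    """Hypothetical helper: one flag per token, 1 for each sentence-initial token."""
    flags = []
    for sent in sentences:
        for i, _token in enumerate(sent):
            flags.append(1 if i == 0 else non_start)
    return flags


# A document that consists of a single sentence.
sentences = [["This", "is", "one", "sentence", "."]]

# Old behavior: non-initial tokens get 0, which reads back as "unset".
before = [decode_sent_start(v) for v in build_sent_starts(sentences, non_start=0)]
# New behavior: non-initial tokens get -1, which reads back as False.
after = [decode_sent_start(v) for v in build_sent_starts(sentences, non_start=-1)]

print(before)  # [True, None, None, None, None]
print(after)   # [True, False, False, False, False]
```

With 0 as the filler, a single-sentence document has only one explicit value, so it is hard to tell apart from a document with no boundaries set. With -1, every token carries an explicit True or False.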
2021-04-08 17:24:52 +09:00 · c362006cb9
parent 82d3caf861
commit c362006cb9
1 changed file with 1 addition and 1 deletion
--- a/spacy/training/gold_io.pyx
+++ b/spacy/training/gold_io.pyx
@@ -121,7 +121,7 @@ def json_to_annotations(doc):
                if i == 0:
                    sent_starts.append(1)
                else:
-                    sent_starts.append(0)
+                    sent_starts.append(-1)
            if "brackets" in sent:
                brackets.extend((b["first"] + sent_start_i,
                                 b["last"] + sent_start_i, b["label"])