mirror of https://github.com/explosion/spaCy.git
Data in the JSON format is split into sentences, and each sentence is saved with is_sent_start flags. Currently the flags are 1 for the first token and 0 for the others. When deserialized, this produces the pattern True, None, None, None..., which makes single-sentence documents look as though their sentence boundaries have not been set. Since items saved in JSON format have already been split into sentences, every is_sent_start value should be True or False, never None. This commit therefore writes -1 instead of 0 for non-initial tokens.
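The effect described in the commit message can be illustrated with a small standalone sketch. It does not use spaCy itself; the helper names `decode_sent_start` and `sentence_flags` are hypothetical, and the mapping of 1 to True, 0 to None and -1 to False is an assumption inferred from the message and the diff rather than a copy of the library's code.

```python
from typing import List, Optional


def decode_sent_start(value: int) -> Optional[bool]:
    """Hypothetical helper: map a stored integer flag to a ternary value.

    Assumed mapping: 1 -> True, -1 -> False, 0 -> None (boundary not set).
    """
    if value == 1:
        return True
    if value == -1:
        return False
    return None


def sentence_flags(n_tokens: int, non_initial: int) -> List[int]:
    """Build the flags for one sentence: 1 for the first token,
    `non_initial` for every other token."""
    return [1 if i == 0 else non_initial for i in range(n_tokens)]


# Before the change: non-initial tokens are stored as 0.
old = [decode_sent_start(v) for v in sentence_flags(4, 0)]
# After the change: non-initial tokens are stored as -1.
new = [decode_sent_start(v) for v in sentence_flags(4, -1)]

print(old)  # [True, None, None, None]
print(new)  # [True, False, False, False]
```

With 0 as the non-initial flag, a single-sentence document contains only one True followed by None values, which cannot be told apart from a document with no boundary annotation. With -1, every token carries an explicit True or False.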
parent 82d3caf861
commit c362006cb9
@@ -121,7 +121,7 @@ def json_to_annotations(doc):
                 if i == 0:
                     sent_starts.append(1)
                 else:
-                    sent_starts.append(0)
+                    sent_starts.append(-1)
             if "brackets" in sent:
                 brackets.extend((b["first"] + sent_start_i,
                                  b["last"] + sent_start_i, b["label"])