Fix is_sent_start when converting from JSON (fix #7635) (#7655)

Data in the JSON format is split into sentences, and each sentence is
saved with is_sent_start flags. Currently the flags are 1 for the first
token and 0 for the others. When deserialized, this produces the pattern
True, None, None, None..., which makes single-sentence documents look
as though their sentence boundaries were never set.

Since items saved in JSON format have already been split into sentences,
the is_sent_start values should all be True or False. Non-initial tokens
are therefore now written as -1 (False) instead of 0 (unset).
Paul O'Leary McCann 2021-04-08 17:24:52 +09:00 committed by GitHub
parent 82d3caf861
commit c362006cb9
1 changed file with 1 addition and 1 deletion

@@ -121,7 +121,7 @@ def json_to_annotations(doc):
                 if i == 0:
                     sent_starts.append(1)
                 else:
-                    sent_starts.append(0)
+                    sent_starts.append(-1)
                 if "brackets" in sent:
                     brackets.extend((b["first"] + sent_start_i,
                                      b["last"] + sent_start_i, b["label"])