Commit Graph

68 Commits

Author SHA1 Message Date
Adriane Boyd 5eeb25f043 Tidy up code 2021-06-28 12:08:15 +02:00
Matthew Honnibal f9946154d9
Add SpanCategorizer component (#6747)
* Draft spancat model

* Add spancat model

* Add test for extract_spans

* Add extract_spans layer

* Upd extract_spans

* Add spancat model

* Add test for spancat model

* Upd spancat model

* Update spancat component

* Upd spancat

* Update spancat model

* Add quick spancat test

* Import SpanCategorizer

* Fix SpanCategorizer component

* Import SpanGroup

* Fix span extraction

* Fix import

* Fix import

* Upd model

* Update spancat models

* Add scoring, update defaults

* Update and add docs

* Fix type

* Update spacy/ml/extract_spans.py

* Auto-format and fix import

* Fix comment

* Fix type

* Fix type

* Update website/docs/api/spancategorizer.md

* Fix comment

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Better defense

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Fix labels list

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update spacy/ml/extract_spans.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update spacy/pipeline/spancat.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Set annotations during update

* Set annotations in spancat

* fix imports in test

* Update spacy/pipeline/spancat.py

* replace MaxoutLogistic with LinearLogistic

* fix config

* various small fixes

* remove set_annotations parameter in update

* use our beloved tupley format with recent support for doc.spans

* bugfix to allow renaming the default span_key (scores weren't showing up)

* use different key in docs example

* change defaults to better-working parameters from project (WIP)

* register spacy.extract_spans.v1 for legacy purposes

* Upd dev version so can build wheel

* layers instead of architectures for smaller building blocks

* Update website/docs/api/spancategorizer.md

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update website/docs/api/spancategorizer.md

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Include additional scores from overrides in combined score weights

* Parameterize spans key in scoring

Parameterize the `SpanCategorizer` `spans_key` for scoring purposes so
that it's possible to evaluate multiple `spancat` components in the same
pipeline.

* Use the (intentionally very short) default spans key `sc` in the
  `SpanCategorizer`
* Adjust the default score weights to include the default key
* Adjust the scorer to use `spans_{spans_key}` as the prefix for the
  returned score
* Revert addition of `attr_name` argument to `score_spans` and adjust
  the key in the `getter` instead.

Note that for `spancat` components with a custom `span_key`, the score
weights currently need to be modified manually in
`[training.score_weights]` for them to be available during training. To
suppress the default score weights `spans_sc_p/r/f` during training, set
them to `null` in `[training.score_weights]`.

* Update website/docs/api/scorer.md

* Fix scorer for spans key containing underscore

* Increment version

* Add Spans to Evaluate CLI (#8439)

* Add Spans to Evaluate CLI

* Change to spans_key

* Add spans per_type output

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Fix spancat GPU issues (#8455)

* Fix GPU issues

* Require thinc >=8.0.6

* Switch to glorot_uniform_init

* Fix and test ngram suggester

* Include final ngram in doc for all sizes
* Fix ngrams for docs of the same length as ngram size
* Handle batches of docs that result in no ngrams
* Add tests

Co-authored-by: Ines Montani <ines@ines.io>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: svlandeg <sofie.vanlandeghem@gmail.com>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Co-authored-by: Nirant <NirantK@users.noreply.github.com>
2021-06-24 12:35:27 +02:00
Adriane Boyd 30e1a89aeb
Fix displacy output in evaluate CLI (#7122)
Now that `nlp.evaluate()` does not modify the examples, rerun the
pipeline on the (limited) texts in order to provide the predicted
annotation in the displacy output option.
2021-02-19 23:01:20 +11:00
Ines Montani 26bf642afd
Fix issue #7019: Handle None scores in evaluate printer (#7026) 2021-02-11 16:45:23 +11:00
Ines Montani d0c3775712 Replace links to nightly docs [ci skip] 2021-01-30 20:09:38 +11:00
Ines Montani 991669c934 Tidy up and auto-format 2021-01-05 13:41:53 +11:00
Adriane Boyd a4b32b9552
Handle missing reference values in scorer (#6286)
* Handle missing reference values in scorer

Handle missing values in reference doc during scoring where it is
possible to detect an unset state for the attribute. If no reference
docs contain annotation, `None` is returned instead of a score. `spacy
evaluate` displays `-` for missing scores and the missing scores are
saved as `None`/`null` in the metrics.

Attributes without unset states:

* `token.head`: relies on `token.dep` to recognize unset values
* `doc.cats`: unable to handle missing annotation

Additional changes:

* add optional `has_annotation` check to `score_scans` to replace
`doc.sents` hack
* update `score_token_attr_per_feat` to handle missing and empty morph
representations
* fix bug in `Doc.has_annotation` for normalization of `IS_SENT_START`
vs. `SENT_START`

* Fix import

* Update return types
2020-11-03 15:47:18 +01:00
Adriane Boyd 563a21834e Save raw scores in evaluate output 2020-10-19 15:49:09 +02:00
Adriane Boyd dd207ca6d0 Add dep_las_per_type and more generic PRF printer 2020-10-19 15:49:02 +02:00
Adriane Boyd 4300858ecb Include per-type/feat scores in evaluate output 2020-10-19 15:48:55 +02:00
Ines Montani 604be54a5c Support --code in evaluate CLI [ci skip] 2020-09-29 21:20:56 +02:00
Ines Montani 822ea4ef61 Refactor CLI 2020-09-28 15:09:59 +02:00
Sofie Van Landeghem 8e7557656f
Renaming gold & annotation_setter (#6042)
* version bump to 3.0.0a16

* rename "gold" folder to "training"

* rename 'annotation_setter' to 'set_extra_annotations'

* formatting
2020-09-09 10:31:03 +02:00
Ines Montani ab1bb421ed Update docs links in codebase 2020-09-04 12:58:50 +02:00
Ines Montani b5a0657fd6 "model" terminology consistency in docs 2020-09-03 13:13:03 +02:00
Ines Montani 3ae5e02f4f Update docs, types and API consistency 2020-08-17 16:45:24 +02:00
Ines Montani 37814b608d Remove env_opt and simplfy default Optimizer 2020-08-14 14:59:54 +02:00
Ines Montani 1d01d89b79 Update CLI docs and evaluate command [ci skip] 2020-08-07 14:40:58 +02:00
Ines Montani e68459296d Tidy up and auto-format 2020-08-05 16:00:59 +02:00
Matthew Honnibal ecb3c4e8f4
Create corpus iterator and batcher from registry during training (#5865)
* Move batchers into their own module (and registry)

* Update CLI

* Update Corpus and batcher

* Update tests

* Update one config

* Merge 'evaluation' block back under [training]

* Import batchers in gold __init__

* Fix batchers

* Update config

* Update schema

* Update util

* Don't assume train and dev are actually paths

* Update onto-joint config

* Fix missing import

* Format

* Format

* Update spacy/gold/corpus.py

Co-authored-by: Ines Montani <ines@ines.io>

* Fix name

* Update default config

* Fix get_length option in batchers

* Update test

* Add comment

* Pass path into Corpus

* Update docstring

* Update schema and configs

* Update config

* Fix test

* Fix paths

* Fix print

* Fix create_train_batches

* [training.read_train] -> [training.train_corpus]

* Update onto-joint config

Co-authored-by: Ines Montani <ines@ines.io>
2020-08-04 15:09:37 +02:00
Adriane Boyd 0cddb0dbe9
Move timing into Language.evaluate (#5836)
Move timing into `Language.evaluate` so that only the processing is
timing, not processing + scoring. `Language.evaluate` returns
`scores["speed"]` as words per second, which should be identical to how
the speed was added to the scores previously. Also add the speed to the
evaluate CLI output.
2020-07-29 11:02:31 +02:00
Adriane Boyd baf19fd652 Update cats scoring to provide overall score
* Provide top-level score as `attr_score`
* Provide a description of the score as `attr_score_desc`
* Provide all potential scores keys, setting unused keys to `None`
* Update CLI evaluate accordingly
2020-07-27 12:26:10 +02:00
Ines Montani e92df281ce Tidy up, autoformat, add types 2020-07-25 15:01:15 +02:00
Adriane Boyd 2bcceb80c4
Refactor the Scorer to improve flexibility (#5731)
* Refactor the Scorer to improve flexibility

Refactor the `Scorer` to improve flexibility for arbitrary pipeline
components.

* Individual pipeline components provide their own `evaluate` methods
that score a list of `Example`s and return a dictionary of scores
* `Scorer` is initialized either:
  * with a provided pipeline containing components to be scored
  * with a default pipeline containing the built-in statistical
    components (senter, tagger, morphologizer, parser, ner)
* `Scorer.score` evaluates a list of `Example`s and returns a dictionary
of scores referring to the scores provided by the components in the
pipeline

Significant differences:

* `tags_acc` is renamed to `tag_acc` to be consistent with `token_acc`
and the new `morph_acc`, `pos_acc`, and `lemma_acc`
* Scoring is no longer cumulative: `Scorer.score` scores a list of
examples rather than a single example and does not retain any state
about previously scored examples
* PRF values in the returned scores are no longer multiplied by 100

* Add kwargs to Morphologizer.evaluate

* Create generalized scoring methods in Scorer

* Generalized static scoring methods are added to `Scorer`
  * Methods require an attribute (either on Token or Doc) that is
used to key the returned scores

Naming differences:

* `uas`, `las`, and `las_per_type` in the scores dict are renamed to
`dep_uas`, `dep_las`, and `dep_las_per_type`

Scoring differences:

* `Doc.sents` is now scored as spans rather than on sentence-initial
token positions so that `Doc.sents` and `Doc.ents` can be scored with
the same method (this lowers scores since a single incorrect sentence
start results in two incorrect spans)

* Simplify / extend hasattr check for eval method

* Add hasattr check to tokenizer scoring
* Simplify to hasattr check for component scoring

* Reset Example alignment if docs are set

Reset the Example alignment if either doc is set in case the
tokenization has changed.

* Add PRF tokenization scoring for tokens as spans

Add PRF scores for tokens as character spans. The scores are:

* token_acc: # correct tokens / # gold tokens
* token_p/r/f: PRF for (token.idx, token.idx + len(token))

* Add docstring to Scorer.score_tokenization

* Rename component.evaluate() to component.score()

* Update Scorer API docs

* Update scoring for positive_label in textcat

* Fix TextCategorizer.score kwargs

* Update Language.evaluate docs

* Update score names in default config
2020-07-25 12:53:02 +02:00
Ines Montani 43b960c01b
Refactor pipeline components, config and language data (#5759)
* Update with WIP

* Update with WIP

* Update with pipeline serialization

* Update types and pipe factories

* Add deep merge, tidy up and add tests

* Fix pipe creation from config

* Don't validate default configs on load

* Update spacy/language.py

Co-authored-by: Ines Montani <ines@ines.io>

* Adjust factory/component meta error

* Clean up factory args and remove defaults

* Add test for failing empty dict defaults

* Update pipeline handling and methods

* provide KB as registry function instead of as object

* small change in test to make functionality more clear

* update example script for EL configuration

* Fix typo

* Simplify test

* Simplify test

* splitting pipes.pyx into separate files

* moving default configs to each component file

* fix batch_size type

* removing default values from component constructors where possible (TODO: test 4725)

* skip instead of xfail

* Add test for config -> nlp with multiple instances

* pipeline.pipes -> pipeline.pipe

* Tidy up, document, remove kwargs

* small cleanup/generalization for Tok2VecListener

* use DEFAULT_UPSTREAM field

* revert to avoid circular imports

* Fix tests

* Replace deprecated arg

* Make model dirs require config

* fix pickling of keyword-only arguments in constructor

* WIP: clean up and integrate full config

* Add helper to handle function args more reliably

Now also includes keyword-only args

* Fix config composition and serialization

* Improve config debugging and add visual diff

* Remove unused defaults and fix type

* Remove pipeline and factories from meta

* Update spacy/default_config.cfg

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update spacy/default_config.cfg

* small UX edits

* avoid printing stack trace for debug CLI commands

* Add support for language-specific factories

* specify the section of the config which holds the model to debug

* WIP: add Language.from_config

* Update with language data refactor WIP

* Auto-format

* Add backwards-compat handling for Language.factories

* Update morphologizer.pyx

* Fix morphologizer

* Update and simplify lemmatizers

* Fix Japanese tests

* Port over tagger changes

* Fix Chinese and tests

* Update to latest Thinc

* WIP: xfail first Russian lemmatizer test

* Fix component-specific overrides

* fix nO for output layers in debug_model

* Fix default value

* Fix tests and don't pass objects in config

* Fix deep merging

* Fix lemma lookup data registry

Only load the lookups if an entry is available in the registry (and if spacy-lookups-data is installed)

* Add types

* Add Vocab.from_config

* Fix typo

* Fix tests

* Make config copying more elegant

* Fix pipe analysis

* Fix lemmatizers and is_base_form

* WIP: move language defaults to config

* Fix morphology type

* Fix vocab

* Remove comment

* Update to latest Thinc

* Add morph rules to config

* Tidy up

* Remove set_morphology option from tagger factory

* Hack use_gpu

* Move [pipeline] to top-level block and make [nlp.pipeline] list

Allows separating component blocks from component order – otherwise, ordering the config would mean a changed component order, which is bad. Also allows initial config to define more components and not use all of them

* Fix use_gpu and resume in CLI

* Auto-format

* Remove resume from config

* Fix formatting and error

* [pipeline] -> [components]

* Fix types

* Fix tagger test: requires set_morphology?

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: svlandeg <sofie.vanlandeghem@gmail.com>
Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
2020-07-22 13:42:59 +02:00
Adriane Boyd ec819fc311
Provide default output for evaluate in CLI (#5784) 2020-07-20 14:42:46 +02:00
Ines Montani 73332ddb67 Update CLI commans to use one shared util file 2020-07-10 17:57:40 +02:00
Ines Montani 412dbb1f38
Remove dead and/or deprecated code (#5710)
* Remove dead and/or deprecated code

* Remove n_threads

Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
2020-07-06 13:06:25 +02:00
Ines Montani dbfa292ed3 Output more stats in evaluate 2020-06-28 15:34:28 +02:00
Ines Montani d54f33441a Merge branch 'feature/project-cli' of https://github.com/explosion/spaCy into feature/project-cli 2020-06-27 21:17:00 +02:00
Ines Montani cd0dd78276 Simplify model loading (now supported via load_model) 2020-06-27 21:16:57 +02:00
Matthew Honnibal 8e3baebdce Merge branch 'feature/project-cli' of https://github.com/explosion/spaCy into feature/project-cli 2020-06-27 21:16:18 +02:00
Matthew Honnibal d8c70b415e Fix Example usage in evaluate 2020-06-27 21:15:25 +02:00
Ines Montani e33d2b1bea Add success message 2020-06-27 21:15:13 +02:00
Ines Montani 42eb381ec6 Improve output handling in evaluate 2020-06-27 21:13:11 +02:00
Matthew Honnibal 8c29268749
Improve spacy.gold (no GoldParse, no json format!) (#5555)
* Update errors

* Remove beam for now (maybe)

Remove beam_utils

Update setup.py

Remove beam

* Remove GoldParse

WIP on removing goldparse

Get ArcEager compiling after GoldParse excise

Update setup.py

Get spacy.syntax compiling after removing GoldParse

Rename NewExample -> Example and clean up

Clean html files

Start updating tests

Update Morphologizer

* fix error numbers

* fix merge conflict

* informative error when calling to_array with wrong field

* fix error catching

* fixing language and scoring tests

* start testing get_aligned

* additional tests for new get_aligned function

* Draft create_gold_state for arc_eager oracle

* Fix import

* Fix import

* Remove TokenAnnotation code from nonproj

* fixing NER one-to-many alignment

* Fix many-to-one IOB codes

* fix test for misaligned

* attempt to fix cases with weird spaces

* fix spaces

* test_gold_biluo_different_tokenization works

* allow None as BILUO annotation

* fixed some tests + WIP roundtrip unit test

* add spaces to json output format

* minibatch utiltiy can deal with strings, docs or examples

* fix augment (needs further testing)

* various fixes in scripts - needs to be further tested

* fix test_cli

* cleanup

* correct silly typo

* add support for MORPH in to/from_array, fix morphologizer overfitting test

* fix tagger

* fix entity linker

* ensure test keeps working with non-linked entities

* pipe() takes docs, not examples

* small bug fix

* textcat bugfix

* throw informative error when running the components with the wrong type of objects

* fix parser tests to work with example (most still failing)

* fix BiluoPushDown parsing entities

* small fixes

* bugfix tok2vec

* fix renames and simple_ner labels

* various small fixes

* prevent writing dummy values like deps because that could interfer with sent_start values

* fix the fix

* implement split_sent with aligned SENT_START attribute

* test for split sentences with various alignment issues, works

* Return ArcEagerGoldParse from ArcEager

* Update parser and NER gold stuff

* Draft new GoldCorpus class

* add links to to_dict

* clean up

* fix test checking for variants

* Fix oracles

* Start updating converters

* Move converters under spacy.gold

* Move things around

* Fix naming

* Fix name

* Update converter to produce DocBin

* Update converters

* Allow DocBin to take list of Doc objects.

* Make spacy convert output docbin

* Fix import

* Fix docbin

* Fix compile in ArcEager

* Fix import

* Serialize all attrs by default

* Update converter

* Remove jsonl converter

* Add json2docs converter

* Draft Corpus class for DocBin

* Work on train script

* Update Corpus

* Update DocBin

* Allocate Doc before starting to add words

* Make doc.from_array several times faster

* Update train.py

* Fix Corpus

* Fix parser model

* Start debugging arc_eager oracle

* Update header

* Fix parser declaration

* Xfail some tests

* Skip tests that cause crashes

* Skip test causing segfault

* Remove GoldCorpus

* Update imports

* Update after removing GoldCorpus

* Fix module name of corpus

* Fix mimport

* Work on parser oracle

* Update arc_eager oracle

* Restore ArcEager.get_cost function

* Update transition system

* Update test_arc_eager_oracle

* Remove beam test

* Update test

* Unskip

* Unskip tests

* add links to to_dict

* clean up

* fix test checking for variants

* Allow DocBin to take list of Doc objects.

* Fix compile in ArcEager

* Serialize all attrs by default

Move converters under spacy.gold

Move things around

Fix naming

Fix name

Update converter to produce DocBin

Update converters

Make spacy convert output docbin

Fix import

Fix docbin

Fix import

Update converter

Remove jsonl converter

Add json2docs converter

* Allocate Doc before starting to add words

* Make doc.from_array several times faster

* Start updating converters

* Work on train script

* Draft Corpus class for DocBin

Update Corpus

Fix Corpus

* Update DocBin

Add missing strings when serializing

* Update train.py

* Fix parser model

* Start debugging arc_eager oracle

* Update header

* Fix parser declaration

* Xfail some tests

Skip tests that cause crashes

Skip test causing segfault

* Remove GoldCorpus

Update imports

Update after removing GoldCorpus

Fix module name of corpus

Fix mimport

* Work on parser oracle

Update arc_eager oracle

Restore ArcEager.get_cost function

Update transition system

* Update tests

Remove beam test

Update test

Unskip

Unskip tests

* Add get_aligned_parse method in Example

Fix Example.get_aligned_parse

* Add kwargs to Corpus.dev_dataset to match train_dataset

* Update nonproj

* Use get_aligned_parse in ArcEager

* Add another arc-eager oracle test

* Remove Example.doc property

Remove Example.doc

Remove Example.doc

Remove Example.doc

Remove Example.doc

* Update ArcEager oracle

Fix Break oracle

* Debugging

* Fix Corpus

* Fix eg.doc

* Format

* small fixes

* limit arg for Corpus

* fix test_roundtrip_docs_to_docbin

* fix test_make_orth_variants

* fix add_label test

* Update tests

* avoid writing temp dir in json2docs, fixing 4402 test

* Update test

* Add missing costs to NER oracle

* Update test

* Work on Example.get_aligned_ner method

* Clean up debugging

* Xfail tests

* Remove prints

* Remove print

* Xfail some tests

* Replace unseen labels for parser

* Update test

* Update test

* Xfail test

* Fix Corpus

* fix imports

* fix docs_to_json

* various small fixes

* cleanup

* Support gold_preproc in Corpus

* Support gold_preproc

* Pass gold_preproc setting into corpus

* Remove debugging

* Fix gold_preproc

* Fix json2docs converter

* Fix convert command

* Fix flake8

* Fix import

* fix output_dir (converted to Path by typer)

* fix var

* bugfix: update states after creating golds to avoid out of bounds indexing

* Improve efficiency of ArEager oracle

* pull merge_sent into iob2docs to avoid Doc creation for each line

* fix asserts

* bugfix excl Span.end in iob2docs

* Support max_length in Corpus

* Fix arc_eager oracle

* Filter out uannotated sentences in NER

* Remove debugging in parser

* Simplify NER alignment

* Fix conversion of NER data

* Fix NER init_gold_batch

* Tweak efficiency of precomputable affine

* Update onto-json default

* Update gold test for NER

* Fix parser test

* Update test

* Add NER data test

* Fix convert for single file

* Fix test

* Hack scorer to avoid evaluating non-nered data

* Fix handling of NER data in Example

* Output unlabelled spans from O biluo tags in iob_utils

* Fix unset variable

* Return kept examples from init_gold_batch

* Return examples from init_gold_batch

* Dont return Example from init_gold_batch

* Set spaces on gold doc after conversion

* Add test

* Fix spaces reading

* Improve NER alignment

* Improve handling of missing values in NER

* Restore the 'cutting' in parser training

* Add assertion

* Print epochs

* Restore random cuts in parser/ner training

* Implement Doc.copy

* Implement Example.copy

* Copy examples at the start of Language.update

* Don't unset example docs

* Tweak parser model slightly

* attempt to fix _guess_spaces

* _add_entities_to_doc first, so that links don't get overwritten

* fixing get_aligned_ner for one-to-many

* fix indexing into x_text

* small fix biluo_tags_from_offsets

* Add onto-ner config

* Simplify NER alignment

* Fix NER scoring for partially annotated documents

* fix indexing into x_text

* fix test_cli failing tests by ignoring spans in doc.ents with empty label

* Fix limit

* Improve NER alignment

* Fix count_train

* Remove print statement

* fix tests, we're not having nothing but None

* fix clumsy fingers

* Fix tests

* Fix doc.ents

* Remove empty docs in Corpus and improve limit

* Update config

Co-authored-by: svlandeg <sofie.vanlandeghem@gmail.com>
2020-06-26 19:34:12 +02:00
Ines Montani 275bab62df Refactor CLI 2020-06-21 21:35:01 +02:00
Ines Montani c12713a8be Port CLI to Typer and add project stubs 2020-06-21 13:44:00 +02:00
Sofie Van Landeghem c0f4a1e43b
train is from-config by default (#5575)
* verbose and tag_map options

* adding init_tok2vec option and only changing the tok2vec that is specified

* adding omit_extra_lookups and verifying textcat config

* wip

* pretrain bugfix

* add replace and resume options

* train_textcat fix

* raw text functionality

* improve UX when KeyError or when input data can't be parsed

* avoid unnecessary access to goldparse in TextCat pipe

* save performance information in nlp.meta

* add noise_level to config

* move nn_parser's defaults to config file

* multitask in config - doesn't work yet

* scorer offering both F and AUC options, need to be specified in config

* add textcat verification code from old train script

* small fixes to config files

* clean up

* set default config for ner/parser to allow create_pipe to work as before

* two more test fixes

* small fixes

* cleanup

* fix NER pickling + additional unit test

* create_pipe as before
2020-06-12 02:02:07 +02:00
Ines Montani 24f72c669c Merge branch 'develop' into master-tmp 2020-05-21 18:39:06 +02:00
adrianeboyd bdff76dede
Various updates/additions to CLI scripts (#5362)
* `debug-data`: determine coverage of provided vectors

* `evaluate`: support `blank:lg` model to make it possible to just evaluate
tokenization

* `init-model`: add option to truncate vectors to N most frequent vectors
from word2vec file

* `train`:

  * if training on GPU, only run evaluation/timing on CPU in the first
    iteration

  * if training is aborted, exit with a non-0 exit status
2020-04-29 12:56:46 +02:00
adrianeboyd b71a11ff6d
Update morphologizer (#5108)
* Add pos and morph scoring to Scorer

Add pos, morph, and morph_per_type to `Scorer`. Report pos and morph
accuracy in `spacy evaluate`.

* Update morphologizer for v3

* switch to tagger-based morphologizer
* use `spacy.HashCharEmbedCNN` for morphologizer defaults
* add `Doc.is_morphed` flag

* Add morphologizer to train CLI

* Add basic morphologizer pipeline tests

* Add simple morphologizer training example

* Remove subword_features from CharEmbed models

Remove `subword_features` argument from `spacy.HashCharEmbedCNN.v1` and
`spacy.HashCharEmbedBiLSTM.v1` since in these cases `subword_features`
is always `False`.

* Rename setting in morphologizer example

Use `with_pos_tags` instead of `without_pos_tags`.

* Fix kwargs for spacy.HashCharEmbedBiLSTM.v1

* Remove defaults for spacy.HashCharEmbedBiLSTM.v1

Remove default `nM/nC` for `spacy.HashCharEmbedBiLSTM.v1`.

* Set random seed for textcat overfitting test
2020-04-02 14:46:32 +02:00
Ines Montani 83e0a6f3e3
Modernize plac commands for Python 3 (#4836) 2020-01-01 13:15:46 +01:00
Ines Montani a892821c51 More formatting changes 2019-12-25 17:59:52 +01:00
Ines Montani db55577c45
Drop Python 2.7 and 3.5 (#4828)
* Remove unicode declarations

* Remove Python 3.5 and 2.7 from CI

* Don't require pathlib

* Replace compat helpers

* Remove OrderedDict

* Use f-strings

* Set Cython compiler language level

* Fix typo

* Re-add OrderedDict for Table

* Update setup.cfg

* Revert CONTRIBUTING.md

* Revert lookups.md

* Revert top-level.md

* Small adjustments and docs [ci skip]
2019-12-22 01:53:56 +01:00
adrianeboyd b841d3fe75 Add a tagger-based SentenceRecognizer (#4713)
* Add sent_starts to GoldParse

* Add SentTagger pipeline component

Add `SentTagger` pipeline component as a subclass of `Tagger`.

* Model reduces default parameters from `Tagger` to be small and fast
* Hard-coded set of two labels:
  * S (1): token at beginning of sentence
  * I (0): all other sentence positions
* Sets `token.sent_start` values

* Add sentence segmentation to Scorer

Report `sent_p/r/f` for sentence boundaries, which may be provided by
various pipeline components.

* Add sentence segmentation to CLI evaluate

* Add senttagger metrics/scoring to train CLI

* Rename SentTagger to SentenceRecognizer

* Add SentenceRecognizer to spacy.pipes imports

* Add SentenceRecognizer serialization test

* Shorten component name to sentrec

* Remove duplicates from train CLI output metrics
2019-11-28 11:10:07 +01:00
adrianeboyd faaa832518 Generalize handling of tokenizer special cases (#4259)
* Generalize handling of tokenizer special cases

Handle tokenizer special cases more generally by using the Matcher
internally to match special cases after the affix/token_match
tokenization is complete.

Instead of only matching special cases while processing balanced or
nearly balanced prefixes and suffixes, this recognizes special cases in
a wider range of contexts:

* Allows arbitrary numbers of prefixes/affixes around special cases
* Allows special cases separated by infixes

Existing tests/settings that couldn't be preserved as before:

* The emoticon '")' is no longer a supported special case
* The emoticon ':)' in "example:)" is a false positive again

When merged with #4258 (or the relevant cache bugfix), the affix and
token_match properties should be modified to flush and reload all
special cases to use the updated internal tokenization with the Matcher.

* Remove accidentally added test case

* Really remove accidentally added test

* Reload special cases when necessary

Reload special cases when affixes or token_match are modified. Skip
reloading during initialization.

* Update error code number

* Fix offset and whitespace in Matcher special cases

* Fix offset bugs when merging and splitting tokens
* Set final whitespace on final token in inserted special case

* Improve cache flushing in tokenizer

* Separate cache and specials memory (temporarily)
* Flush cache when adding special cases
* Repeated `self._cache = PreshMap()` and `self._specials = PreshMap()`
are necessary due to this bug:
https://github.com/explosion/preshed/issues/21

* Remove reinitialized PreshMaps on cache flush

* Update UD bin scripts

* Update imports for `bin/`
* Add all currently supported languages
* Update subtok merger for new Matcher validation
* Modify blinded check to look at tokens instead of lemmas (for corpora
with tokens but not lemmas like Telugu)

* Use special Matcher only for cases with affixes

* Reinsert specials cache checks during normal tokenization for special
cases as much as possible
  * Additionally include specials cache checks while splitting on infixes
  * Since the special Matcher needs consistent affix-only tokenization
    for the special cases themselves, introduce the argument
    `with_special_cases` in order to do tokenization with or without
    specials cache checks
* After normal tokenization, postprocess with special cases Matcher for
special cases containing affixes

* Replace PhraseMatcher with Aho-Corasick

Replace PhraseMatcher with the Aho-Corasick algorithm over numpy arrays
of the hash values for the relevant attribute. The implementation is
based on FlashText.

The speed should be similar to the previous PhraseMatcher. It is now
possible to easily remove match IDs and matches don't go missing with
large keyword lists / vocabularies.

Fixes #4308.

* Restore support for pickling

* Fix internal keyword add/remove for numpy arrays

* Add test for #4248, clean up test

* Improve efficiency of special cases handling

* Use PhraseMatcher instead of Matcher
* Improve efficiency of merging/splitting special cases in document
  * Process merge/splits in one pass without repeated token shifting
  * Merge in place if no splits

* Update error message number

* Remove UD script modifications

Only used for timing/testing, should be a separate PR

* Remove final traces of UD script modifications

* Update UD bin scripts

* Update imports for `bin/`
* Add all currently supported languages
* Update subtok merger for new Matcher validation
* Modify blinded check to look at tokens instead of lemmas (for corpora
with tokens but not lemmas like Telugu)

* Add missing loop for match ID set in search loop

* Remove cruft in matching loop for partial matches

There was a bit of unnecessary code left over from FlashText in the
matching loop to handle partial token matches, which we don't have with
PhraseMatcher.

* Replace dict trie with MapStruct trie

* Fix how match ID hash is stored/added

* Update fix for match ID vocab

* Switch from map_get_unless_missing to map_get

* Switch from numpy array to Token.get_struct_attr

Access token attributes directly in Doc instead of making a copy of the
relevant values in a numpy array.

Add unsatisfactory warning for hash collision with reserved terminal
hash key. (Ideally it would change the reserved terminal hash and redo
the whole trie, but for now, I'm hoping there won't be collisions.)

* Restructure imports to export find_matches

* Implement full remove()

Remove unnecessary trie paths and free unused maps.

Parallel to Matcher, raise KeyError when attempting to remove a match ID
that has not been added.

* Switch to PhraseMatcher.find_matches

* Switch to local cdef functions for span filtering

* Switch special case reload threshold to variable

Refer to variable instead of hard-coded threshold

* Move more of special case retokenize to cdef nogil

Move as much of the special case retokenization to nogil as possible.

* Rewrap sort as stdsort for OS X

* Rewrap stdsort with specific types

* Switch to qsort

* Fix merge

* Improve cmp functions

* Fix realloc

* Fix realloc again

* Initialize span struct while retokenizing

* Temporarily skip retokenizing

* Revert "Move more of special case retokenize to cdef nogil"

This reverts commit 0b7e52c797.

* Revert "Switch to qsort"

This reverts commit a98d71a942.

* Fix specials check while caching

* Modify URL test with emoticons

The multiple suffix tests result in the emoticon `:>`, which is now
retokenized into one token as a special case after the suffixes are
split off.

* Refactor _apply_special_cases()

* Use cdef ints for span info used in multiple spots

* Modify _filter_special_spans() to prefer earlier

Parallel to #4414, modify _filter_special_spans() so that the earlier
span is preferred for overlapping spans of the same length.

* Replace MatchStruct with Entity

Replace MatchStruct with Entity since the existing Entity struct is
nearly identical.

* Replace Entity with more general SpanC

* Replace MatchStruct with SpanC

* Add error in debug-data if no dev docs are available (see #4575)

* Update azure-pipelines.yml

* Revert "Update azure-pipelines.yml"

This reverts commit ed1060cf59.

* Use latest wasabi

* Reorganise install_requires

* add dframcy to universe.json (#4580)

* Update universe.json [ci skip]

* Fix multiprocessing for as_tuples=True (#4582)

* Fix conllu script (#4579)

* force extensions to avoid clash between example scripts

* fix arg order and default file encoding

* add example config for conllu script

* newline

* move extension definitions to main function

* few more encodings fixes

* Add load_from_docbin example [ci skip]

TODO: upload the file somewhere

* Update README.md

* Add warnings about 3.8 (resolves #4593) [ci skip]

* Fixed typo: Added space between "recognize" and "various" (#4600)

* Fix DocBin.merge() example (#4599)

* Replace function registries with catalogue (#4584)

* Replace functions registries with catalogue

* Update __init__.py

* Fix test

* Revert unrelated flag [ci skip]

* Bugfix/dep matcher issue 4590 (#4601)

* add contributor agreement for prilopes

* add test for issue #4590

* fix on_match params for DependencyMacther (#4590)

* Minor updates to language example sentences (#4608)

* Add punctuation to Spanish example sentences

* Combine multilanguage examples for lang xx

* Add punctuation to nb examples

* Always realloc to a larger size

Avoid potential (unlikely) edge case and cymem error seen in #4604.

* Add error in debug-data if no dev docs are available (see #4575)

* Update debug-data for GoldCorpus / Example

* Ignore None label in misaligned NER data
2019-11-13 21:24:35 +01:00
Sofie Van Landeghem e48a09df4e Example class for training data (#4543)
* OrigAnnot class instead of gold.orig_annot list of zipped tuples

* from_orig to replace from_annot_tuples

* rename to RawAnnot

* some unit tests for GoldParse creation and internal format

* removing orig_annot and switching to lists instead of tuple

* rewriting tuples to use RawAnnot (+ debug statements, WIP)

* fix pop() changing the data

* small fixes

* pop-append fixes

* return RawAnnot for existing GoldParse to have uniform interface

* clean up imports

* fix merge_sents

* add unit test for 4402 with new structure (not working yet)

* introduce DocAnnot

* typo fixes

* add unit test for merge_sents

* rename from_orig to from_raw

* fixing unit tests

* fix nn parser

* read_annots to produce text, doc_annot pairs

* _make_golds fix

* rename golds_to_gold_annots

* small fixes

* fix encoding

* have golds_to_gold_annots use DocAnnot

* missed a spot

* merge_sents as function in DocAnnot

* allow specifying only part of the token-level annotations

* refactor with Example class + underlying dicts

* pipeline components to work with Example objects (wip)

* input checking

* fix yielding

* fix calls to update

* small fixes

* fix scorer unit test with new format

* fix kwargs order

* fixes for ud and conllu scripts

* fix reading data for conllu script

* add in proper errors (not fixed numbering yet to avoid merge conflicts)

* fixing few more small bugs

* fix EL script
2019-11-11 17:35:27 +01:00
Ines Montani cf4ec88b38 Use latest wasabi 2019-11-04 02:38:45 +01:00
adrianeboyd b5d999e510 Add textcat to train CLI (#4226)
* Add doc.cats to spacy.gold at the paragraph level

Support `doc.cats` as `"cats": [{"label": string, "value": number}]` in
the spacy JSON training format at the paragraph level.

* `spacy.gold.docs_to_json()` writes `docs.cats`

* `GoldCorpus` reads in cats in each `GoldParse`

* Update instances of gold_tuples to handle cats

Update iteration over gold_tuples / gold_parses to handle addition of
cats at the paragraph level.

* Add textcat to train CLI

* Add textcat options to train CLI
* Add textcat labels in `TextCategorizer.begin_training()`
* Add textcat evaluation to `Scorer`:
  * For binary exclusive classes with provided label: F1 for label
  * For 2+ exclusive classes: F1 macro average
  * For multilabel (not exclusive): ROC AUC macro average (currently
relying on sklearn)
* Provide user info on textcat evaluation settings, potential
incompatibilities
* Provide pipeline to Scorer in `Language.evaluate` for textcat config
* Customize train CLI output to include only metrics relevant to current
pipeline
* Add textcat evaluation to evaluate CLI

* Fix handling of unset arguments and config params

Fix handling of unset arguments and model confiug parameters in Scorer
initialization.

* Temporarily add sklearn requirement

* Remove sklearn version number

* Improve Scorer handling of models without textcats

* Fixing Scorer handling of models without textcats

* Update Scorer output for python 2.7

* Modify inf in Scorer for python 2.7

* Auto-format

Also make small adjustments to make auto-formatting with black easier and produce nicer results

* Move error message to Errors

* Update documentation

* Add cats to annotation JSON format [ci skip]

* Fix tpl flag and docs [ci skip]

* Switch to internal roc_auc_score

Switch to internal `roc_auc_score()` adapted from scikit-learn.

* Add AUCROCScore tests and improve errors/warnings

* Add tests for AUCROCScore and roc_auc_score
* Add missing error for only positive/negative values
* Remove unnecessary warnings and errors

* Make reduced roc_auc_score functions private

Because most of the checks and warnings have been stripped for the
internal functions and access is only intended through `ROCAUCScore`,
make the functions for roc_auc_score adapted from scikit-learn private.

* Check that data corresponds with multilabel flag

Check that the training instances correspond with the multilabel flag,
adding the multilabel flag if required.

* Add textcat score to early stopping check

* Add more checks to debug-data for textcat

* Add example training data for textcat

* Add more checks to textcat train CLI

* Check configuration when extending base model
* Fix typos

* Update textcat example data

* Provide licensing details and licenses for data
* Remove two labels with no positive instances from jigsaw-toxic-comment
data.


Co-authored-by: Ines Montani <ines@ines.io>
2019-09-15 22:31:31 +02:00