Commit Graph

1715 Commits

Author SHA1 Message Date
Ilya Nikitin c323789721
`token.md`: Fix documentation of `Token.ancestors` (#10917) 2022-06-06 14:32:36 +09:00
Raphael Mitsch 8387ce4c01
Add Doc.from_json() (#10688)
* Implement Doc.from_json: rough draft.

* Implement Doc.from_json: first draft with tests.

* Implement Doc.from_json: added documentation on website for Doc.to_json(), Doc.from_json().

* Implement Doc.from_json: formatting changes.

* Implement Doc.to_json(): reverting unrelated formatting changes.

* Implement Doc.to_json(): fixing entity and span conversion. Moving fixture and doc <-> json conversion tests into single file.

* Implement Doc.from_json(): replaced entity/span converters with doc.char_span() calls.

* Implement Doc.from_json(): handling sentence boundaries in spans.

* Implementing Doc.from_json(): added parser-free sentence boundaries transfer.

* Implementing Doc.from_json(): added parser-free sentence boundaries transfer.

* Implementing Doc.from_json(): incorporated various PR feedback.

* Renaming fixture for document without dependencies.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Implementing Doc.from_json(): using two sent_starts instead of one.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Implementing Doc.from_json(): doc_without_dependency_parser() -> doc_without_deps.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Implementing Doc.from_json(): incorporating various PR feedback. Rebased on latest master.

* Implementing Doc.from_json(): refactored Doc.from_json() to work with annotation IDs instead of their string representations.

* Implement Doc.from_json(): reverting unwanted formatting/rebasing changes.

* Implement Doc.from_json(): added check for char_span() calculation for entities.

* Update spacy/tokens/doc.pyx

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Implement Doc.from_json(): minor refactoring, additional check for token attribute consistency with corresponding test.

* Implement Doc.from_json(): removed redundancy in annotation type key naming.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Implement Doc.from_json(): Simplifying setting annotation values.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Implement doc.from_json(): renaming annot_types to token_attrs.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Implement Doc.from_json(): adjustments for renaming of annot_types to token_attrs.

* Implement Doc.from_json(): removing default categories.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Implement Doc.from_json(): simplifying lexeme initialization.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Implement Doc.from_json(): simplifying lexeme initialization.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Implement Doc.from_json(): refactoring to only have keys for present annotations.

* Implement Doc.from_json(): fix check for tokens' HEAD attributes.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Implement Doc.from_json(): refactoring Doc.from_json().

* Implement Doc.from_json(): fixing span_group retrieval.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Implement Doc.from_json(): fixing span retrieval.

* Implement Doc.from_json(): added schema for Doc JSON format. Minor refactoring in Doc.from_json().

* Implement Doc.from_json(): added comment regarding Token and Span extension support.

* Implement Doc.from_json(): renaming inconsistent_props to partial_attrs..

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Implement Doc.from_json(): adjusting error message.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Implement Doc.from_json(): extending E1038 message.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Implement Doc.from_json(): added params to E1038 raises.

* Implement Doc.from_json(): combined attribute collection with partial attributes check.

* Implement Doc.from_json(): added optional schema validation.

* Implement Doc.from_json(): fixed optional fields in schema, tests.

* Implement Doc.from_json(): removed redundant None check for DEP.

* Implement Doc.from_json(): added passing of schema validatoin message to E1037..

* Implement Doc.from_json(): removing redundant error E1040.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Implement Doc.from_json(): changing message for E1037.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Implement Doc.from_json(): adjusted website docs and docstring of Doc.from_json().

* Update spacy/tests/doc/test_json_doc_conversion.py

* Implement Doc.from_json(): docstring update.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Implement Doc.from_json(): docstring update.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Implement Doc.from_json(): website docs update.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Implement Doc.from_json(): docstring formatting.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Implement Doc.from_json(): docstring formatting.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Implement Doc.from_json(): fixing Doc reference in website docs.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Implement Doc.from_json(): reformatted website/docs/api/doc.md.

* Implement Doc.from_json(): bumped IDs of new errors to avoid merge conflicts.

* Implement Doc.from_json(): fixing bug in tests.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Implement Doc.from_json(): fix setting of sentence starts for docs without DEP.

* Implement Doc.from_json(): add check for valid char spans when manually setting sentence boundaries. Refactor sentence boundary setting slightly. Move error message for lack of support for partial token annotations to errors.py.

* Implement Doc.from_json(): simplify token sentence start manipulation.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Combine related error messages

* Update spacy/tests/doc/test_json_doc_conversion.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-06-02 14:03:47 +02:00
Adriane Boyd a322d6d5f2
Add SpanRuler component (#9880)
* Add SpanRuler component

Add a `SpanRuler` component similar to `EntityRuler` that saves a list
of matched spans to `Doc.spans[spans_key]`. The matches from the token
and phrase matchers are deduplicated and sorted before assignment but
are not otherwise filtered.

* Update spacy/pipeline/span_ruler.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Fix cast

* Add self.key property

* Use number of patterns as length

* Remove patterns kwarg from init

* Update spacy/tests/pipeline/test_span_ruler.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Add options for spans filter and setting to ents

* Add `spans_filter` option as a registered function'
* Make `spans_key` optional and if `None`, set to `doc.ents` instead of
`doc.spans[spans_key]`.

* Update and generalize tests

* Add test for setting doc.ents, fix key property type

* Fix typing

* Allow independent doc.spans and doc.ents

* If `spans_key` is set, set `doc.spans` with `spans_filter`.
* If `annotate_ents` is set, set `doc.ents` with `ents_fitler`.
  * Use `util.filter_spans` by default as `ents_filter`.
  * Use a custom warning if the filter does not work for `doc.ents`.

* Enable use of SpanC.id in Span

* Support id in SpanRuler as Span.id

* Update types

* `id` can only be provided as string (already by `PatternType`
definition)

* Update all uses of Span.id/ent_id in Doc

* Rename Span id kwarg to span_id

* Update types and docs

* Add ents filter to mimic EntityRuler overwrite_ents

* Refactor `ents_filter` to take `entities, spans` args for more
  filtering options
* Give registered filters more descriptive names
* Allow registered `filter_spans` filter
  (`spacy.first_longest_spans_filter.v1`) to take any number of
  `Iterable[Span]` objects as args so it can be used for spans filter
  or ents filter

* Implement future entity ruler as span ruler

Implement a compatible `entity_ruler` as `future_entity_ruler` using
`SpanRuler` as the underlying component:
* Add `sort_key` and `sort_reverse` to allow the sorting behavior to be
  customized. (Necessary for the same sorting/filtering as in
  `EntityRuler`.)
* Implement `overwrite_overlapping_ents_filter` and
  `preserve_existing_ents_filter` to support
  `EntityRuler.overwrite_ents` settings.
* Add `remove_by_id` to support `EntityRuler.remove` functionality.
* Refactor `entity_ruler` tests to parametrize all tests to test both
  `entity_ruler` and `future_entity_ruler`
* Implement `SpanRuler.token_patterns` and `SpanRuler.phrase_patterns`
  properties.

Additional changes:

* Move all config settings to top-level attributes to avoid duplicating
  settings in the config vs. `span_ruler/cfg`. (Also avoids a lot of
  casting.)

* Format

* Fix filter make method name

* Refactor to use same error for removing by label or ID

* Also provide existing spans to spans filter

* Support ids property

* Remove token_patterns and phrase_patterns

* Update docstrings

* Add span ruler docs

* Fix types

* Apply suggestions from code review

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Move sorting into filters

* Check for all tokens in seen tokens in entity ruler filters

* Remove registered sort key

* Set Token.ent_id in a backwards-compatible way in Doc.set_ents

* Remove sort options from API docs

* Update docstrings

* Rename entity ruler filters

* Fix and parameterize scoring

* Add id to Span API docs

* Fix typo in API docs

* Include explicit labeled=True for scorer

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-06-02 13:12:53 +02:00
Max Tarlov 709d6d9114
Update documentation for displacy style kwargs (#10841)
* Update docs for displacy style kwargs

Added "span" to the accepted values for the style kwarg in the displacy.serve and displacy.render top-level functions. These styles are new as of SpaCy 3.3, so I added the "new" tag for that option only

* restored alpha ordering
2022-05-30 09:11:55 +02:00
Peter Baumgartner bf95f0a1dd
add doc cleaner to menu (#10862) 2022-05-30 08:51:19 +02:00
Freddy Heppell 322c5a3ac4
Fix misspelt keyword in StringStore example 2022-05-29 10:49:19 +01:00
Sofie Van Landeghem 83ed1f391b
Remove NBSP's across tables in the docs (#10842) 2022-05-25 09:48:39 +02:00
Lj Miranda 1d34aa2b3d
Add spacy-span-analyzer to debug data (#10668)
* Rename to spans_key for consistency

* Implement spans length in debug data

* Implement how span bounds and spans are obtained

In this commit, I implemented how span boundaries (the tokens) around a
given span and spans are obtained. I've put them in the compile_gold()
function so that it's accessible later on. I will do the actual
computation of the span and boundary distinctiveness in the main
function above.

* Compute for p_spans and p_bounds

* Add computation for SD and BD

* Fix mypy issues

* Add weighted average computation

* Fix compile_gold conditional logic

* Add test for frequency distribution computation

* Add tests for kl-divergence computation

* Fix weighted average computation

* Make tables more compact by rounding them

* Add more descriptive checks for spans

* Modularize span computation methods

In this commit, I added the _get_span_characteristics and
_print_span_characteristics functions so that they can be reusable
anywhere.

* Remove unnecessary arguments and make fxs more compact

* Update a few parameter arguments

* Add tests for print_span and get_span methods

* Update API to talk about span characteristics in brief

* Add better reporting of spans_length

* Add test for span length reporting

* Update formatting of span length report

Removed '' to indicate that it's not a string, then
sort the n-grams by their length, not by their frequency.

* Apply suggestions from code review

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Show all frequency distribution when -V

In this commit, I displayed the full frequency distribution of the
span lengths when --verbose is passed. To make things simpler, I
rewrote some of the formatter functions so that I can call them
whenever.

Another notable change is that instead of showing percentages as
Integers, I showed them as floats (max 2-decimal places). I did this
because it looks weird when it displays (0%).

* Update logic on how total is computed

The way the 90% thresholding is computed now is that we keep
adding the percentages until we reach >= 90%. I also updated the wording
and used the term "At least" to denote that >= 90% of your spans have
these distributions.

* Fix display when showing the threshold percentage

* Apply suggestions from code review

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Add better phrasing for span information

* Update spacy/cli/debug_data.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Add minor edits for whitespaces etc.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-05-23 19:06:38 +02:00
Peter Baumgartner 7ce3460b23
add floret to static vectors docs (#10833) 2022-05-23 09:16:31 +02:00
kadarakos a3814ee739
oov confusion fix (#10828) 2022-05-23 09:15:51 +02:00
Adriane Boyd a82ec56aae
Remove cuda extras for non-linux arm in install widget (#10796)
* Remove cuda extras for non-linux arm platforms in install widget
* Extend cuda versions install widget
* Update GPU install docs to clarify cuda
2022-05-20 09:57:41 +02:00
Adriane Boyd b65d652881
Override SpanGroups.setdefault to provide default SpanGroup (#10772)
* Fix mistake in SpanGroup API docs

* Restrict SpanGroups.setdefault to SpanGroup only

* Refactor to support default span iterable
2022-05-12 10:06:25 +02:00
Richard Hudson d524f6415f
Add documentation tip about overriding variables (#10780) 2022-05-11 10:15:32 +02:00
Raphael Mitsch 2904359685
Allow assets to be optional in spacy project (#10714)
* Allow assets to be optional in spacy project: draft for optional flag/download_all options.

* Allow assets to be optional in spacy project: added OPTIONAL_DEFAULT reflecting default asset optionality.

* Allow assets to be optional in spacy project: renamed --all to --extra.

* Allow assets to be optional in spacy project: included optional flag in project config test.

* Allow assets to be optional in spacy project: added documentation.

* Allow assets to be optional in spacy project: fixing deprecated --all reference.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Allow assets to be optional in spacy project: fixed project_assets() docstring.

* Allow assets to be optional in spacy project: adjusted wording in justification of optional assets.

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Allow assets to be optional in spacy project: switched to  as keyword in project.yml. Updated docs.

* Allow assets to be optional in spacy project: updated comment.

* Allow assets to be optional in spacy project: replacing 'optional' with 'extra' in output.

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Allow assets to be optional in spacy project: replacing 'optional' with 'extra' in docstring..

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Allow assets to be optional in spacy project: replacing 'optional' with 'extra' in test..

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Allow assets to be optional in spacy project: replacing 'optional' with 'extra' in test.

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Allow assets to be optional in spacy project: renamed OPTIONAL_DEFAULT to EXTRA_DEFAULT.

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-05-10 10:40:11 +02:00
Sofie Van Landeghem 1543558d08
Add test for old architectures (#10751)
* add v1 and v2 tests for tok2vec architectures

* textcat architectures are not "layers"

* test older textcat architectures

* test older parser architecture
2022-05-10 08:24:42 +02:00
Madeesh Kannan 733114bdd9
`training.md`: Fix typos (#10775) 2022-05-09 19:44:14 +02:00
Raphael Mitsch e626df959f
Document different ways to create a pipeline (#10762)
* Document different ways to create a pipeline: moved up/slightly modified paragraph on pipeline creation.

* Document different ways to create a pipeline: changed Finnish to Ukrainian in example for language without trained pipeline.

* Document different ways to create a pipeline: added explanation of blank pipeline.

* Document different ways to create a pipeline: exchanged Ukrainian with Yoruba.
2022-05-06 15:40:59 +02:00
Sofie Van Landeghem e03b9f8095
Small doc typos (#10750)
* fix typos

* formatting
2022-05-03 13:55:27 +02:00
Adriane Boyd 497a708c71
Docs for v3.3 (#10628)
* Temporarily disable CI tests

* Start v3.3 website updates

* Add trainable lemmatizer to pipeline design

* Fix Vectors.most_similar

* Add floret vector info to pipeline design

* Add Lower and Upper Sorbian

* Add span to sidebar

* Work on release notes

* Copy from release notes

* Update pipeline design graphic

* Upgrading note about Doc.from_docs

* Add tables and details

* Update website/docs/models/index.md

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Fix da lemma acc

* Add minimal intro, various updates

* Round lemma acc

* Add section on floret / word lists

* Add new pipelines table, minor edits

* Fix displacy spans example title

* Clarify adding non-trainable lemmatizer

* Update adding-languages URLs

* Revert "Temporarily disable CI tests"

This reverts commit 1dee505920.

* Spell out words/sec

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-04-28 14:09:35 +02:00
harmbuisman c066fb8a4e
#10672: fixes displacy output for manual unsorted entities (#10673)
* #10672: fixes displacy output for manual unsorted entities

* #10672: removed unused import

* fix prettier formatting

Co-authored-by: Harm Buisman <h.buisman@iknl.nl>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-04-27 09:51:58 +02:00
Adriane Boyd 455f089c9b
Support exclude in Doc.from_docs (#10689)
* Support exclude in Doc.from_docs

* Update API docs

* Add new tag to docs
2022-04-25 18:19:03 +02:00
single-fingal 4228f3c757
Fix a few minor bugs in the SpanGroup API web docs (#10650)
* Fix a few minor bugs in the SpanGroup API web docs

* Update SpanGroup docs examples to have Spans reflect intended "errors"
2022-04-14 09:59:48 +02:00
Lj Miranda 02dafa3a84
Add debug diff command in spaCy CLI (#10502)
* Add initial design for diff command

For now, the diffing process looks like this:
- The default config is created based from some values in the user
config (e.g. which pipeline components were used, the lang, etc.)
- The user must supply manually if it was optimized for acc/efficiency
and if pretraining was involved.

* Make diff command structure similar to siblings

* Include gpu as a user option for CLI

* Make variables more explicit

* Fix type declaration for optimize enum

* Improve docstrings for diff CLI

* Add debug-diff to website API docs

* Switch position of configs so that user config is modded

* Add markdown flag for debug diff

This commit adds a --markdown (--md) flag that allows easier
copy-pasting to Github issues. Please note that this commit is dependent
on an unreleased version of wasabi (for the time being).

For posterity, the related PR is found here: https://github.com/ines/wasabi/pull/20

* Bump version of wasabi to 0.9.1

So that we can use the add_symbols parameter.

* Apply suggestions from code review

Co-authored-by: Ines Montani <ines@ines.io>

* Update docs based on code review suggestions

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Change command name from diff -> diff-config

* Clarify when options are relevant or not

* Rerun prettier on cli.md

Co-authored-by: Ines Montani <ines@ines.io>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-04-07 10:48:45 +02:00
Adriane Boyd 0d0153db63
Update default spans_key to sc in API docs (#10616) 2022-04-04 18:09:15 +02:00
Adriane Boyd ca54de27bb
Support more internal methods for SpanGroup (#10476)
* Added new convenience cython functions to SpanGroup to avoid unnecessary allocation/deallocation of objects

* Replaced sorting in has_overlap with C++ for efficiency. Also, added a test for has_overlap

* Added a method to efficiently merge SpanGroups

* Added __delitem__, __add__ and __iadd__. Also, allowed to pass span lists to merge function. Replaced extend() body with call to merge

* Renamed merge to concat and added missing things to documentation

* Added operator+ and operator += in the documentation

* Added a test for Doc deallocation

* Update spacy/tokens/span_group.pyx

* Updated SpanGroup tests to use new span list comparison function rather than assert_span_list_equal, eliminating the need to have a separate assert_not_equal fnction

* Fixed typos in SpanGroup documentation

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Minor changes requested by Sofie: rearranged import statements. Added new=3.2.1 tag to SpanGroup.__setitem__ documentation

* SpanGroup: moved repetitive list index check/adjustment in a separate function

* Turn off formatting that hurts readability spacy/tests/doc/test_span_group.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Remove formatting that hurts readability spacy/tests/doc/test_span_group.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Turn off formatting that hurts readability in spacy/tests/doc/test_span_group.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Support more internal methods for SpanGroup

Add support for:

* `__setitem__`
* `__delitem__`
* `__iadd__`: for `SpanGroup` or `Iterable[Span]`
* `__add__`: for `SpanGroup` only

Adapted from #9698 with the scope limited to the magic methods.

* Use v3.3 as new version in docs

* Add new tag to SpanGroup.copy in API docs

* Remove duplicate import

* Apply suggestions from code review

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Remaining suggestions and formatting

Co-authored-by: nrodnova <nrodnova@hotmail.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: Natalia Rodnova <4512370+nrodnova@users.noreply.github.com>
2022-04-01 09:56:26 +02:00
Adriane Boyd f98b41c390
Add vector deduplication (#10551)
* Add vector deduplication

* Add `Vocab.deduplicate_vectors()`
* Always run deduplication in `spacy init vectors`
* Clean up a few vector-related error messages and docs examples

* Always unique with numpy

* Fix types
2022-03-30 08:54:23 +02:00
Adriane Boyd 85778dfcf4
Add edit tree lemmatizer (#10231)
* Add edit tree lemmatizer

Co-authored-by: Daniël de Kok <me@danieldk.eu>

* Hide edit tree lemmatizer labels

* Use relative imports

* Switch to single quotes in error message

* Type annotation fixes

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Reformat edit_tree_lemmatizer with black

* EditTreeLemmatizer.predict: take Iterable

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Validate edit trees during deserialization

This change also changes the serialized representation. Rather than
mirroring the deep C structure, we use a simple flat union of the match
and substitution node types.

* Move edit_trees to _edit_tree_internals

* Fix invalid edit tree format error message

* edit_tree_lemmatizer: remove outdated TODO comment

* Rename factory name to trainable_lemmatizer

* Ignore type instead of casting truths to List[Union[Ints1d, Floats2d, List[int], List[str]]] for thinc v8.0.14

* Switch to Tagger.v2

* Add documentation for EditTreeLemmatizer

* docs: Fix 3.2 -> 3.3 somewhere

* trainable_lemmatizer documentation fixes

* docs: EditTreeLemmatizer is in edit_tree_lemmatizer.py

Co-authored-by: Daniël de Kok <me@danieldk.eu>
Co-authored-by: Daniël de Kok <me@github.danieldk.eu>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-03-28 11:13:50 +02:00
Adriane Boyd d5666fd12d
Add NORM to Matcher feature in docs (#10560) 2022-03-28 10:35:47 +02:00
Adriane Boyd 3711af74e5
Add tokenizer option to allow Matcher handling for all rules (#10452)
* Add tokenizer option to allow Matcher handling for all rules

Add tokenizer option `with_faster_rules_heuristics` that determines
whether the special cases applied by the internal `Matcher` are filtered
by whether they contain affixes or space. If `True` (default), the rules
are filtered to prioritize speed over rare edge cases. If `False`, all
rules are included in the final `Matcher`-based pass over the doc.

* Reset all caches when reloading special cases

* Revert "Reset all caches when reloading special cases"

This reverts commit 4ef6bd171d.

* Initialize max_length properly

* Add new tag to API docs

* Rename to faster heuristics
2022-03-24 13:21:32 +01:00
Lj Miranda a79cd3542b
Add displacy support for overlapping Spans (#10332)
* Fix docstring for EntityRenderer

* Add warning in displacy if doc.spans are empty

* Implement parse_spans converter

One notable change here is that the default spans_key is sc, and
it's set by the user through the options.

* Implement SpanRenderer

Here, I implemented a SpanRenderer that looks similar to the
EntityRenderer except for some templates.  The spans_key, by default, is
set to sc, but can be configured in the options (see parse_spans). The
way I rendered these spans is per-token, i.e., I first check if each
token (1) belongs to a given span type and (2) a starting token of a
given span type. Once I have this information, I render them into the
markup.

* Fix mypy issues on typing

* Add tests for displacy spans support

* Update colors from RGB to hex

Co-authored-by: Ines Montani <ines@ines.io>

* Remove unnecessary CSS properties

* Add documentation for website

* Remove unnecesasry scripts

* Update wording on the documentation

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Put typing dependency on top of file

* Put back z-index so that spans overlap properly

* Make warning more explicit for spans_key

Co-authored-by: Ines Montani <ines@ines.io>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2022-03-16 18:14:34 +01:00
Daniël de Kok e5debc68e4
Tagger: use unnormalized probabilities for inference (#10197)
* Tagger: use unnormalized probabilities for inference

Using unnormalized softmax avoids use of the relatively expensive exp function,
which can significantly speed up non-transformer models (e.g. I got a speedup
of 27% on a German tagging + parsing pipeline).

* Add spacy.Tagger.v2 with configurable normalization

Normalization of probabilities is disabled by default to improve
performance.

* Update documentation, models, and tests to spacy.Tagger.v2

* Move Tagger.v1 to spacy-legacy

* docs/architectures: run prettier

* Unnormalized softmax is now a Softmax_v2 option

* Require thinc 8.0.14 and spacy-legacy 3.0.9
2022-03-15 14:15:31 +01:00
Adriane Boyd e8357923ec
Various install docs updates (#10487)
* Simplify quickstart source install to use only editable pip install

* Update pytorch install instructions to more recent versions
2022-03-15 11:12:50 +01:00
Adriane Boyd 0dc454ba95
Update docs for Vocab.get_vector (#10486)
* Update docs for Vocab.get_vector

* Clarify description of 0-vector dimensions
2022-03-15 09:10:47 +01:00
Edward 2eef47dd26
Save span candidates produced by spancat suggesters (#10413)
* Add save_candidates attribute

* Change spancat api

* Add unit test

* reimplement method to produce a list of doc

* Add method to docs

* Add new version tag

* Add intended use to docstring

* prettier formatting
2022-03-14 16:46:58 +01:00
Adriane Boyd 297dd82c86
Fix initial special cases for Tokenizer.explain (#10460)
Add the missing initial check for special cases to `Tokenizer.explain`
to align with `Tokenizer._tokenize_affixes`.
2022-03-11 10:50:47 +01:00
Peter Baumgartner 01ec6349ea
Add `path.mkdir` to custom component examples of `to_disk` (#10348)
* add `path.mkdir` to examples

* add ensure_path + mkdir

* update highlights
2022-03-08 16:04:10 +01:00
Adriane Boyd 60520d8669
Fix types in API docs for moves in parser and ner (#10464) 2022-03-08 13:51:11 +01:00
Adriane Boyd b2bbefd0b5
Add Finnish, Korean, and Swedish models and Korean support notes (#10355)
* Add Finnish, Korean, and Swedish models to website

* Add Korean language support notes
2022-03-07 17:03:45 +01:00
Paul O'Leary McCann 91acc3ea75
Fix entity linker batching (#9669)
* Partial fix of entity linker batching

* Add import

* Better name

* Add `use_gold_ents` option, docs

* Change to v2, create stub v1, update docs etc.

* Fix error type

Honestly no idea what the right type to use here is.
ConfigValidationError seems wrong. Maybe a NotImplementedError?

* Make mypy happy

* Add hacky fix for init issue

* Add legacy pipeline entity linker

* Fix references to class name

* Add __init__.py for legacy

* Attempted fix for loss issue

* Remove placeholder V1

* formatting

* slightly more interesting train data

* Handle batches with no usable examples

This adds a test for batches that have docs but not entities, and a
check in the component that detects such cases and skips the update step
as thought the batch were empty.

* Remove todo about data verification

Check for empty data was moved further up so this should be OK now - the
case in question shouldn't be possible.

* Fix gradient calculation

The model doesn't know which entities are not in the kb, so it generates
embeddings for the context of all of them.

However, the loss does know which entities aren't in the kb, and it
ignores them, as there's no sensible gradient.

This has the issue that the gradient will not be calculated for some of
the input embeddings, which causes a dimension mismatch in backprop.
That should have caused a clear error, but with numpyops it was causing
nans to happen, which is another problem that should be addressed
separately.

This commit changes the loss to give a zero gradient for entities not in
the kb.

* add failing test for v1 EL legacy architecture

* Add nasty but simple working check for legacy arch

* Clarify why init hack works the way it does

* Clarify use_gold_ents use case

* Fix use gold ents related handling

* Add tests for no gold ents and fix other tests

* Use aligned ents function (not working)

This doesn't actually work because the "aligned" ents are gold-only. But
if I have a different function that returns the intersection, *then*
this will work as desired.

* Use proper matching ent check

This changes the process when gold ents are not used so that the
intersection of ents in the pred and gold is used.

* Move get_matching_ents to Example

* Use model attribute to check for legacy arch

* Rename flag

* bump spacy-legacy to lower 3.0.9

Co-authored-by: svlandeg <svlandeg@github.com>
2022-03-04 09:17:36 +01:00
Adriane Boyd 8e93fa8507
Fix Vectors.n_keys for floret vectors (#10394)
Fix `Vectors.n_keys` for floret vectors to match docstring description
and avoid W007 warnings in similarity methods.
2022-03-01 09:21:25 +01:00
Sofie Van Landeghem 3f68bbcfec
Clean up loggers docs (#10351)
* update docs to point to spacy-loggers docs

* remove unused error code
2022-02-25 16:29:12 +01:00
Sofie Van Landeghem a16b14e591
Merge branch 'master' into copy/develop 2022-02-16 14:04:59 +01:00
Ines Montani 7b883da9fd
Merge pull request #10239 from explosion/docs/spacy-tailored-pipelines [ci skip] 2022-02-08 18:04:01 +01:00
Ines Montani f2c2b97e56 Add spaCy Tailored Pipelines 2022-02-08 11:46:42 +01:00
Sofie Van Landeghem deb143fa70
Token sent attributes more consistent (#10164)
* remove duplicate line

* add sent start/end token attributes to the docs

* let has_annotation work with IS_SENT_END

* elif instead of if

* add has_annotation test for sent attributes

* fix typo

* remove duplicate is_sent_start entry in docs
2022-02-08 08:35:37 +01:00
Peter Baumgartner 836f689cc7
YAML multiline tip for project.yml files (#10187)
* MultiHashEmbed vector docs correction

* add in multi-line tip

* convert to sidebar tip
2022-02-08 08:35:09 +01:00
Lj Miranda 72fece712f
Add shuffle parameter to Corpus API docs (#10220)
* Add shuffle parameter to Corpus API docs

* Update website/docs/api/corpus.md

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-02-07 14:55:53 +01:00
Sofie Van Landeghem 14513f82da
Merge pull request #10215 from explosion/master
update develop
2022-02-06 13:45:41 +01:00
Lj Miranda 345e7f6bc4
Clarify Span.ents documentation (#10154)
* Clarify Span.ents documentation

Ref: #10135

Retain current behaviour. Span.ents will only include entities within
said span. You can't get tokens outside of the original span.

* Reword docstrings

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Update API docs in the website

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2022-01-31 08:41:42 +01:00
Adriane Boyd 4f441dfa24
Fix infix as prefix in Tokenizer.explain (#10140)
* Fix infix as prefix in Tokenizer.explain

Update `Tokenizer.explain` to align with the `Tokenizer` algorithm:

* skip infix matches that are prefixes in the current substring

* Update tokenizer pseudocode in docs
2022-01-28 17:00:54 +01:00