Commit Graph

9330 Commits

Author SHA1 Message Date
Sofie Van Landeghem 869cc4ab0b
warn when an unsupported/unknown key is given to the dependency matcher (#12928) 2023-08-22 09:03:35 +02:00
Adriane Boyd 198488ee86
Extend to weasel v0.3 (#12908)
* Extend to weasel v0.3

* Clean up unused imports in test_cli
2023-08-16 17:36:53 +02:00
Sofie Van Landeghem 0737443096
feat: add example stubs (3) (#12801)
* feat: add example stubs

* fix: add required annotations

* fix: mypy issues

* fix: use Py36-compatible Portocol

* Minor reformatting

* adding further type specifications and removing internal methods

* black formatting

* widen type to iterable

* add private methods that are being used by the built-in convertors

* revert changes to corpus.py

* fixes

* fixes

* fix typing of PlainTextCorpus

---------

Co-authored-by: Basile Dura <basile@bdura.me>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2023-08-02 08:15:12 +02:00
Adriane Boyd 0fe43f40f1
Support registered vectors (#12492)
* Support registered vectors

* Format

* Auto-fill [nlp] on load from config and from bytes/disk

* Only auto-fill [nlp]

* Undo all changes to Language.from_disk

* Expand BaseVectors

These methods are needed in various places for training and vector
similarity.

* isort

* More linting

* Only fill [nlp.vectors]

* Update spacy/vocab.pyx

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Revert changes to test related to auto-filling [nlp]

* Add vectors registry

* Rephrase error about vocab methods for vectors

* Switch to dummy implementation for BaseVectors.to_ops

* Add initial draft of docs

* Remove example from BaseVectors docs

* Apply suggestions from code review

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Update website/docs/api/basevectors.mdx

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Fix type and lint bpemb example

* Update website/docs/api/basevectors.mdx

---------

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2023-08-01 15:46:08 +02:00
Adriane Boyd 5888afa884
Update numpy build constraints for numpy 1.25 (#12839)
* Update numpy build constraints for numpy 1.25

Starting in numpy 1.25 (see
https://github.com/numpy/numpy/releases/tag/v1.25.0), the numpy C API is
backwards-compatible by default.

For python 3.9+, we should be able to drop the specific numpy build
requirements and use `numpy>=1.25`, which is currently
backwards-compatible to `numpy>=1.19`.

In the future, the python <3.9 requirements could be dropped and the
lower numpy pin could correspond to the oldest supported version for the
current lower python pin.

* Turn off fail-fast

* Revert "Turn off fail-fast"

This reverts commit 4306f516bc.

* Update for python 3.6

* Fix typo
2023-07-24 10:32:56 +02:00
Jacobo Myerston 4f8daa4f00
Add Left and Right Pointing Angle Brackets as punctuation to ancient Greek (#12829)
* Update universe.json

* Update universe.json

add some missing commas in the greCy's description.

* Update punctuation.py

Add mathematical left and right angle brackets as punctuation for ancient Greek for better tokenization.
2023-07-20 11:16:01 +02:00
svlandeg 79ec68f01b Merge branch 'upstream_master' into sync_develop 2023-07-19 12:08:52 +02:00
Basile Dura b0228d8ea6
ci: add cython linter (#12694)
* chore: add cython-linter dev dependency

* fix: lexeme.pyx

* fix: morphology.pxd

* fix: tokenizer.pxd

* fix: vocab.pxd

* fix: morphology.pxd (line length)

* ci: add cython-lint

* ci: fix cython-lint call

* Fix kb/candidate.pyx.

* Fix kb/kb.pyx.

* Fix kb/kb_in_memory.pyx.

* Fix kb.

* Fix training/ partially.

* Fix training/. Ignore trailing whitespaces and too long lines.

* Fix ml/.

* Fix matcher/.

* Fix pipeline/.

* Fix tokens/.

* Fix build errors. Fix vocab.pyx.

* Fix cython-lint install and run.

* Fix lexeme.pyx, parts_of_speech.pxd, vectors.pyx. Temporarily disable cython-lint execution.

* Fix attrs.pyx, lexeme.pyx, symbols.pxd, isort issues.

* Make cython-lint install conditional. Fix tokenizer.pyx.

* Fix remaining files. Reenable cython-lint check.

* Readded parentheses.

* Fix test_build_dependencies().

* Add explanatory comment to cython-lint execution.

---------

Co-authored-by: Raphael Mitsch <r.mitsch@outlook.com>
2023-07-19 12:03:31 +02:00
Adriane Boyd 1509c96694
Clean up unused code in Language (#12836)
Follow-up to #12701.
2023-07-18 14:10:30 +02:00
Adriane Boyd 6bf7c65329
Update matcher pattern validation tests (#12835)
- parametrize over individual token patterns (as originally intended, as
far as I can tell)
- add a test for lowercase `in` in patterns
2023-07-18 10:00:07 +02:00
Ian Thompson ef20e114e0
Typo fix in `Language.replace_listeners` docs (#12823)
* modified:   spacy/language.py
	- corrected typo in docstring for :method:`Language.replace_listeners`
	- added noqa comment on unused local variable assignment in :method:`Language.from_config` as I wasn't sure if it should be unassigned

modified:   website/docs/api/language.mdx
	- corrected typo in `Language.replace_listeners` markdown

* modified:   spacy/language.py
	- removed noqa comment

---------

Co-authored-by: Ian Thompson <ian.thompson@hrblock.com>
2023-07-14 09:45:54 +02:00
Connor Brinton 0566c3a166
🐛 Escape annotated HTML tags in span renderer (#12817)
These changes add a missing call to `escape_html` in the displaCy span
renderer. Previously span-annotated tokens would be inserted into the
page markup without being escaped, resulting in potentially incorrect
rendering. When I encountered this issue, it resulted in some docs and
span underlines being superimposed on top of properly rendered docs and
span underlines near the beginning of the visualization (due to an
unescaped `<span>` tag).
2023-07-13 17:33:05 +02:00
Sofie Van Landeghem b1b20bf69d
Replace projects functionality with weasel (#12769)
* Setting up weasel branch (#12456)

* remove project-specific functionality

* remove project-specific tests

* remove project-specific schemas

* remove project-specific information in about

* remove project-specific functions in util.py

* remove project-specific error strings

* remove project-specific CLI commands

* black formatting

* restore some functions that are used beyond projects

* remove project imports

* remove imports

* remove remote_storage tests

* remove one more project unit test

* update for PR 12394

* remove get_hash and get_checksum

* remove upload_ and download_file methods

* remove ensure_pathy

* revert clumsy fingers

* reinstate E970

* feat: use weasel as spacy project command (#12473)

* feat: use weasel as spacy project command

* build: use constrained requirement for weasel

* feat: add weasel to the library requirements

* build: update weasel to new version

* build: use specific weasel tag

* build: use weasel-0.1.0rc1 from PyPI

* fix: remove weasel from requirements.txt

* fix: requirements.txt and setup.cfg need to reflect each other

* feat: remove legacy spacy project code

* bump version

* further merge fixes

* isort

---------

Co-authored-by: Basile Dura <bdura@users.noreply.github.com>
2023-07-07 09:10:27 +02:00
Sofie Van Landeghem 9e63006b12
Merge pull request #12800 from explosion/master_copy
Sync develop with master
2023-07-07 08:44:19 +02:00
svlandeg 991bcc111e disable tests until 3.7 models are available 2023-07-07 08:09:57 +02:00
Madeesh Kannan d195923164
Set version to `3.7.0.dev0` (#12799) 2023-07-06 18:29:03 +02:00
svlandeg d26e4e0849 Revert "feat: add example stubs (#12679)"
This reverts commit 30bb34533a.
2023-07-06 17:02:38 +02:00
Basile Dura 30bb34533a
feat: add example stubs (#12679)
* feat: add example stubs

* fix: add required annotations

* fix: mypy issues

* fix: use Py36-compatible Portocol

* Minor reformatting

---------

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Co-authored-by: svlandeg <svlandeg@github.com>
2023-07-06 16:49:43 +02:00
Adriane Boyd a1191146f5 Revert "Temporarily skip tests for compat table"
This reverts commit dd5e00c735.
2023-07-06 12:47:50 +02:00
Adriane Boyd 830dcca367
SpanFinder: set default max_length to 25 (#12791)
When the default `max_length` is not set and there are longer training
documents, it can be difficult to train and evaluate the span finder due
to memory limits and the time it takes to evaluate a huge number of
predicted spans.
2023-07-06 09:55:34 +02:00
Madeesh Kannan 8113cfb257
`Language.replace_listeners`: Pass the replaced listener and the `tok2vec` pipe to the callback (#12785)
* `Language.replace_listeners`: Pass the replaced listener and the `tok2vec` pipe to the callback

* Update developer docs

* `isort` fixes

* Add error message to assertion

* Add clarification to dev docs

* Replace assertion with exception

* Doc fixes
2023-07-05 13:36:04 +02:00
Adriane Boyd fb0da3e097
Support custom token/lexeme attribute for vectors (#12625)
* Support custom token/lexeme attribute for vectors

* Fix imports

* Back off to ORTH without Vectors.attr

* Fallback if vectors.attr doesn't exist

* Update docs
2023-06-28 09:43:14 +02:00
Adriane Boyd 337a360cc7
Use spans_ prefix for default span finder scores (#12753) 2023-06-27 19:32:17 +02:00
Adriane Boyd 65f6c9cd10
Support overriding registered functions in configs (#12623)
Support overriding registered functions in configs. Previously the registry name was parsed as a section name rather than as a registry name.
2023-06-27 17:36:33 +02:00
Adriane Boyd c067b5264c
Address issues with source with component names and replacing listeners (#12701)
When sourcing a component, the object from the original pipeline is added to the new pipeline as the same object. This creates a situation where there are several attributes that cannot be in sync between the original pipeline and the new pipeline at the same time for this one object:

* component.name
* component.listener_map / component.listening_components for tok2vec and transformer

When running replace_listeners on a component, the config is not updated correctly if the state of the component is incorrect for the current pipeline (in particular changes that should be applied from model.attrs["replace_listener_cfg"] as used in spacy-transformers) due to the fact that:

* find_listeners relies on component.name to set the name in the listener_map
* replace_listeners relies on listener_map to determine how to modify the configs

In addition, there are several places where pipeline components are modified and the listener map and/or internal component names aren't currently updated.

In cases where there is a component shared by two pipelines that cannot be in sync, this PR chooses to prioritize the most recently modified or initialized pipeline. There is no actual solution with the current source behavior that will make both pipelines usable, so the current pipeline is updated whenever components are added/renamed/removed or the pipeline is initialized for training.
2023-06-27 10:47:07 +02:00
Adriane Boyd e1664217f5
Add spancat_singlelabel to debug data CLI (#12749) 2023-06-26 10:25:20 +02:00
Adriane Boyd cb4fdc83e4
Merge pull request #12742 from adrianeboyd/chore/v3.6.0
Set version to v3.6.0
2023-06-21 15:34:28 +02:00
Adriane Boyd 34971bcbd1 Set version to v3.6.0 2023-06-21 12:59:36 +02:00
Adriane Boyd dd5e00c735 Temporarily skip tests for compat table 2023-06-21 12:59:36 +02:00
Sofie Van Landeghem d3ac8e897c
default value for phrasematcher in pyi (#12714) 2023-06-21 10:10:13 +02:00
Ziad Amerr 3125b97ace
Fixed e941 link rendering by removing the dot (#12735) 2023-06-19 13:31:08 +02:00
Daniël de Kok e2b70df012
Configure isort to use the Black profile, recursively isort the `spacy` module (#12721)
* Use isort with Black profile

* isort all the things

* Fix import cycles as a result of import sorting

* Add DOCBIN_ALL_ATTRS type definition

* Add isort to requirements

* Remove isort from build dependencies check

* Typo
2023-06-14 17:48:41 +02:00
Sofie Van Landeghem d65e3c31a6
use system-independent commands (#12693) 2023-06-08 11:43:36 +02:00
Adriane Boyd 0f9d2b01fb
Set version v3.6.0.dev1 (#12703) 2023-06-07 16:23:14 +02:00
kadarakos c003aac29a
SpanFinder into spaCy from experimental (#12507)
* span finder integrated into spacy from experimental

* black

* isort

* black

* default spankey constant

* black

* Update spacy/pipeline/spancat.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* rename

* rename

* max_length and min_length as Optional[int] and strict checking

* black

* mypy fix for integer type infinity

* revert line order

* implement all comparison operators for inf int

* avoid two for loops over all docs by not precomputing

* interleave thresholding with span creation

* black

* revert to not interleaving (relized its faster)

* black

* Update spacy/errors.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* update dosctring

* enforce that the gold and predicted documents have the same text

* new error for ensuring reference and predicted texts are the same

* remove todo

* adjust test

* black

* handle misaligned tokenization

* return correct variable

* failing overfit test

* only use a single spans_key like in spancat

* black

* remove debug lines

* typo

* remove comment

* remove near duplicate reduntant method

* use the 'spans_key' variable name everywhere

* Update spacy/pipeline/span_finder.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* flaky test fix suggestion, hand set bias terms

* only test suggester and test result exhaustively

* make it clear that the span_finder_suggester is more general (not specific to span_finder)

* Update spacy/tests/pipeline/test_span_finder.py

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Apply suggestions from code review

* remove question comment

* move preset_spans_suggester test to spancat tests

* Add docs and unify default configs for spancat and span finder

* Add `allow_overlap=True` to span finder scorer

* Fix offset bug in set_annotations

* Ignore labels in span finder scorer

* Format

* Add span_finder to quickstart template

* Move settings to self.cfg, store min/max unset as None

* Remove debugging

* Update docstrings and docs

* Update spacy/pipeline/span_finder.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Fix imports

---------

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2023-06-07 15:52:28 +02:00
Basile Dura c3c064ace4
fix: `InitializableComponent` type hints (#12692)
* fix: InitializableComponent type hints

* fix: avoid circular dependency

* style: clean imports in language.py

* style: use relative imports

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* fix: apply black

---------

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2023-06-02 14:29:52 +02:00
Adriane Boyd c4112a1da3
Require that all SpanGroup spans are from the current doc (#12569)
* Require that all SpanGroup spans are from the current doc

The restriction on only adding spans from the current doc were already
implemented for all operations except for `SpanGroup.__init__`.

Initialize copied spans for `SpanGroup.copy` with `Doc.char_span` in
order to validate the character offsets and to make it possible to copy
spans between documents with differing tokenization. Currently there is
no validation that the document texts are identical, but the span char
offsets must be valid spans in the target doc, which prevents you from
ending up with completely invalid spans.

* Undo change in test_beam_overfitting_IO
2023-06-01 19:19:17 +02:00
Adriane Boyd c936db2faf
Address numpy 1.25 deprecations in test suite (#12684)
* Address upcoming numpy v1.25 deprecations in test suite

* Temporarily test most recent numpy prerelease in CI

* Revert "Temporarily test most recent numpy prerelease in CI"

This reverts commit d75a66e55e.
2023-05-31 17:23:07 +02:00
Basile Dura 6ea4155487
feat: add comparison operators in `span.pyi` (#12652)
* feat: add comparison operators in span.pyi

remove Cython-specific `__richcmp__`

* fix: comparison operators should be defined for any other object
2023-05-23 08:50:37 +02:00
Basile Dura 95fd46b1dd
feat: add type hinting on SpanGroup.__iter__ (#12642) 2023-05-17 14:20:00 +02:00
Sani 873c16a4df
Malay language support (#12602)
* add malay lang

* fix token len

* black format

* reformat conftest malay

* remove exceptions not exist in dbp

* format code
2023-05-17 12:45:21 +02:00
Adriane Boyd 3637148c4d
Add scorer option to return per-component scores (#12540)
* Add scorer option to return per-component scores

Add `per_component` option to `Language.evaluate` and `Scorer.score` to
return scores keyed by `tokenizer` (hard-coded) or by component name.

Add option to `evaluate` CLI to score by component. Per-component scores
can only be saved to JSON.

* Update help text and messages
2023-05-12 15:36:54 +02:00
Adriane Boyd b5af0fe836
Revert "Use Latin normalization for Serbian attrs (#12608)" (#12621)
This reverts commit 6f314f99c4.

We are reverting this until we can support this normalization more
consistently across vectors, training corpora, and lemmatizer data.
2023-05-11 11:54:16 +02:00
Adriane Boyd 1279b464bb
In initialize only calculate current vectors hash if needed (#12607) 2023-05-08 16:51:58 +02:00
Adriane Boyd 6f314f99c4
Use Latin normalization for Serbian attrs (#12608)
* Use Latin normalization for Serbian attrs

Use Latin normalization for Serbian `NORM`, `PREFIX`, and `SUFFIX`.

* Update NORMs in tokenizer exceptions and related tests

* Add tests for all custom lex attrs

* Remove unused imports
2023-05-08 12:33:56 +02:00
Adriane Boyd fbd12eb4a4 Set version to v3.6.0.dev0 2023-05-08 09:10:35 +02:00
Adriane Boyd dbc71ecd44
Remove #egg from download URLs (#12567)
The current URLs will become invalid in pip 25.0. According to the pip
docs, the egg= URLs are currently only needed for editable VCS installs.
2023-05-04 17:13:12 +02:00
Lj Miranda 298e6036b7
Add spans in spacy benchmark (#12575)
* Add spans in spacy benchmark

The current implementation of spaCy benchmark accuracy / spacy evaluate
doesn't include the "spans" type, so calling the command doesn't render
the HTML displaCy file needed.

This PR attempts to fix that by creating a new parameter for "spans"
and calling the appropriate displaCy value.

* Reformat file with black

* Add tests for evaluate

* Fix spans -> span for displacy style

* Update test to check render instead

* Update source so mypy passes

* Add parser information to avoid warnings
2023-04-28 14:32:52 +02:00
kadarakos 34d1164b0e
Spancat speed improvement (#12577)
* avoid nesting then flattening

* mypy fix

* Apply suggestions from code review

* Add type for indices

* Run full matrix for mypy

* Add back modified type: ignore

* Revert "Run full matrix for mypy"

This reverts commit e218873d04.

---------

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2023-04-27 15:27:13 +02:00
Patrick J. Burns ab4ba04c32
Update LatinDefaults for lang 'la' (#12538)
* Add noun chunking to la syntax iterators

* Expand list of numeral, ordinal words

* Expand abbreviations in la tokenizer_exceptions

* Add example sents

* Update spacy/lang/la/syntax_iterators.py

Reorganize la syntax iterators

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Minor updates based on review

* fix call

---------

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2023-04-20 16:55:40 +02:00