spaCy

Commit Graph

Author	SHA1	Message	Date
Matthew Honnibal	bda4bb0184	Try disabling pretraining tests to probe windows ci failure (#13646 )	2024-10-02 01:01:40 +02:00
Matthew Honnibal	59ac7e6bdb	Format	2024-09-09 11:22:52 +02:00
Matthew Honnibal	1b8d560d0e	Support 'memory zones' for user memory management (#13621 ) Add a context manager nlp.memory_zone(), which will begin memory_zone() blocks on the vocab, string store, and potentially other components. Example usage: ``` with nlp.memory_zone(): for doc in nlp.pipe(texts): do_something(doc) # do_something(doc) <-- Invalid ``` Once the memory_zone() block expires, spaCy will free any shared resources that were allocated for the text-processing that occurred within the memory_zone. If you create Doc objects within a memory zone, it's invalid to access them once the memory zone is expired. The purpose of this is that spaCy creates and stores Lexeme objects in the Vocab that can be shared between multiple Doc objects. It also interns strings. Normally, spaCy can't know when all Doc objects using a Lexeme are out-of-scope, so new Lexemes accumulate in the vocab, causing memory pressure. Memory zones solve this problem by telling spaCy "okay none of the documents allocated within this block will be accessed again". This lets spaCy free all new Lexeme objects and other data that were created during the block. The mechanism is general, so memory_zone() context managers can be added to other components that could benefit from them, e.g. pipeline components. I experimented with adding memory zone support to the tokenizer as well, for its cache. However, this seems unnecessarily complicated. It makes more sense to just stick a limit on the cache size. This lets spaCy benefit from the efficiency advantage of the cache better, because we can maintain a (bounded) cache even if only small batches of documents are being processed.	2024-09-09 11:19:39 +02:00
ykyogoku	608f65ce40	add Tibetan (#13510 )	2024-09-09 11:18:03 +02:00
Muzaffer Cikay	acbf2a428f	Add Kurdish Kurmanji language (#13561 ) * Add Kurdish Kurmanji language * Add lex_attrs	2024-09-09 11:15:40 +02:00
Alex Strick van Linschoten	045cd43c3f	Fix typos in docs (#13466 ) * fix typos * prettier formatting --------- Co-authored-by: svlandeg <svlandeg@github.com>	2024-04-29 11:10:17 +02:00
Sofie Van Landeghem	2e2334632b	Fix use_gold_ents behaviour for EntityLinker (#13400 ) * fix type annotation in docs * only restore entities after loss calculation * restore entities of sample in initialization * rename overfitting function * fix EL scorer * Relax test * fix formatting * Update spacy/pipeline/entity_linker.py Co-authored-by: Raphael Mitsch <r.mitsch@outlook.com> * rename to _ensure_ents * further rename * allow for scorer to be None --------- Co-authored-by: Raphael Mitsch <r.mitsch@outlook.com>	2024-04-16 12:00:22 +02:00
Matthew Honnibal	0518c36f04	Sanitize direct download (#13313 ) The 'direct' option in 'spacy download' is supposed to only download from our model releases repository. However, users were able to pass in a relative path, allowing download from arbitrary repositories. This meant that a service that sourced strings from user input and which used the direct option would allow users to install arbitrary packages.	2024-02-20 13:17:51 +01:00
Daniël de Kok	fdfdbcd9f4	Make `Language.pipe` workers exit cleanly (#13321 ) Also warn when any worker exited with a non-zero exit code and modify test to ensure that workers exit cleanly by default.	2024-02-12 14:39:38 +01:00
Daniël de Kok	2dbb332cea	`TextCatParametricAttention.v1`: set key transform dimensions (#13249 ) * TextCatParametricAttention.v1: set key transform dimensions This is necessary for tok2vec implementations that initialize lazily (e.g. curated transformers). * Add lazily-initialized tok2vec to simulate transformers Add a lazily-initialized tok2vec to the tests and test the current textcat models with it. Fix some additional issues found using this test. * isort * Add `test.` prefix to `LazyInitTok2Vec.v1`	2024-02-02 13:01:59 +01:00
Daniël de Kok	68d7841df5	Extension serialization attr tests: add teardown (#13284 ) The doc/token extension serialization tests add extensions that are not serializable with pickle. This didn't cause issues before due to the implicit run order of tests. However, test ordering has changed with pytest 8.0.0, leading to failed tests in test_language. Update the fixtures in the extension serialization tests to do proper teardown and remove the extensions.	2024-01-29 13:51:56 +01:00
Daniël de Kok	afac7fb650	test_find_available_port: use port 5001 (#13255 ) macOS now uses port 5000 for the AirPlay receiver functionality, so this test will always fail on a macOS desktop (unless AirPlay receiver functionality is disabled like in CI).	2024-01-23 20:11:16 +01:00
Daniël de Kok	e2a3952de5	Add spacy.TextCatParametricAttention.v1 (#13201 ) * Add spacy.TextCatParametricAttention.v1 This layer is a simplification of the ensemble classifier that only uses parametric attention. We have found empirically that with a sufficient amount of training data, using the ensemble classifier with BoW does not provide significant improvement in classifier accuracy. However, plugging in a BoW classifier does reduce GPU training and inference performance substantially, since it uses a GPU-only kernel. * Fix merge fallout	2024-01-02 10:03:06 +01:00
Daniël de Kok	7ebba86402	Add TextCatReduce.v1 (#13181 ) * Add TextCatReduce.v1 This is a textcat classifier that pools the vectors generated by a tok2vec implementation and then applies a classifier to the pooled representation. Three reductions are supported for pooling: first, max, and mean. When multiple reductions are enabled, the reductions are concatenated before providing them to the classification layer. This model is a generalization of the TextCatCNN model, which only supports mean reductions and is a bit of a misnomer, because it can also be used with transformers. This change also reimplements TextCatCNN.v2 using the new TextCatReduce.v1 layer. * Doc fixes Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Fully specify `TextCatCNN` <-> `TextCatReduce` equivalence * Move TextCatCNN docs to legacy, in prep for moving to spacy-legacy * Add back a test for TextCatCNN.v2 * Replace TextCatCNN in pipe configurations and templates * Add an infobox to the `TextCatReduce` section with an `TextCatCNN` anchor * Add last reduction (`use_reduce_last`) * Remove non-working TextCatCNN Netlify redirect * Revert layer changes for the quickstart * Revert one more quickstart change * Remove unused import * Fix docstring * Fix setting name in error message --------- Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2023-12-21 11:00:06 +01:00
Daniël de Kok	da7ad97519	Update `TextCatBOW` to use the fixed `SparseLinear` layer (#13149 ) * Update `TextCatBOW` to use the fixed `SparseLinear` layer A while ago, we fixed the `SparseLinear` layer to use all available parameters: https://github.com/explosion/thinc/pull/754 This change updates `TextCatBOW` to `v3` which uses the new `SparseLinear_v2` layer. This results in a sizeable improvement on a text categorization task that was tested. While at it, this `spacy.TextCatBOW.v3` also adds the `length_exponent` option to make it possible to change the hidden size. Ideally, we'd just have an option called `length`. But the way that `TextCatBOW` uses hashes results in a non-uniform distribution of parameters when the length is not a power of two. * Replace TextCatBOW `length_exponent` parameter by `length` We now round up the length to the next power of two if it isn't a power of two. * Remove some tests for TextCatBOW.v2 * Fix missing import	2023-11-29 09:11:54 +01:00
Lise	b6e022381d	Feature/nn and fo language extensions (#13116 ) * add language extensions for norwegian nynorsk and faroese * update docstring for nn/examples.py * use relative imports * add fo and nn tokenizers to pytest fixtures * add unittests for fo and nn and fix bug in nn * remove module docstring from fo/__init__.py * add comments about example sentences' origin * add license information to faroese data credit * format unittests using black * add __init__ files to tests/lang/nn and tests/lang/fo * fix import order and use relative imports in fo/__init__.py and nn/__init__.py * Make the tests a bit more compact * Add fo and nn to website languages * Add note about jul. * Add "jul." as exception --------- Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2023-11-20 07:49:59 +01:00
Adriane Boyd	0c25725359	Update Tokenizer.explain for special cases with whitespace (#13086 ) * Update Tokenizer.explain for special cases with whitespace Update `Tokenizer.explain` to skip special case matches if the exact text has not been matched due to intervening whitespace. Enable fuzzy `Tokenizer.explain` tests with additional whitespace normalization. * Add unit test for special cases with whitespace, xfail fuzzy tests again	2023-11-06 17:29:59 +01:00
Adriane Boyd	ff9ddb6a07	Unskip python 3.12 remote tests (#13110 )	2023-11-06 11:59:45 +01:00
Raphael Mitsch	c4e2daf6ef	Fix displacy span stacking (#13068 ) * Fix displacy span stacking. * Format. Remove counter. * Remove test files. * Add unit test. Refactor to allow for unit test. * Fix off-by-one error in tests.	2023-11-02 12:02:18 +01:00
Adriane Boyd	ea1befa8ff	Support Any comparisons for Token and Span (#13058 ) * Support Any comparisons for Token and Span * Preserve previous behavior for None	2023-10-12 11:53:33 +02:00
Adriane Boyd	77c568e524	Restore spacy.cli.project API (#13053 ) * Restore spacy.cli.project API * Fix typing errors, add simple import test	2023-10-10 15:35:25 +02:00
Adriane Boyd	6d0185f7fb	Revert "Load the cli module lazily for spacy.info (#12962 )" This reverts commit `beda27a91e`.	2023-10-04 12:33:33 +02:00
Adriane Boyd	1b043dde3f	Revert "disable tests until 3.7 models are available" This reverts commit `991bcc111e`.	2023-10-01 18:48:31 +02:00
Adriane Boyd	406794a081	Merge remote-tracking branch 'upstream/master' into chore/update-develop-from-master-v3.7-1	2023-09-28 15:09:06 +02:00
Daniël de Kok	beda27a91e	Load the cli module lazily for spacy.info (#12962 ) * Load the cli module lazily for spacy.info This ensures the `spacy` module can still be imported when the user chooses not to install `typer`/`requests`. * Add test --------- Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>	2023-09-28 11:36:44 +02:00
Adriane Boyd	36d4767aca	Skip project remotes test for python 3.12 (#12980 ) `weasel` (using `cloudpathlib`) does not currently support remote paths for python 3.12.	2023-09-13 13:16:05 +02:00
Sofie Van Landeghem	869cc4ab0b	warn when an unsupported/unknown key is given to the dependency matcher (#12928 )	2023-08-22 09:03:35 +02:00
Adriane Boyd	198488ee86	Extend to weasel v0.3 (#12908 ) * Extend to weasel v0.3 * Clean up unused imports in test_cli	2023-08-16 17:36:53 +02:00
Adriane Boyd	245e2ddc25	Allow pydantic v2 using transitional v1 support (#12888 )	2023-08-08 11:27:28 +02:00
Adriane Boyd	45af8a5dcf	Update br tags (#12882 ) * Fix displacy br tag * Prefer <br>, also update package CLI	2023-08-04 10:52:41 +02:00
Peter Baumgartner	a0a195688f	Tests for CLI app - `init config` generates `train`-able config (#12173 ) * remove migration support form * initial test commit * add fixture * add combo test * pull out parameter example data * fix formatting on examples * remove unused import * remove unnecessary fmt:off instructions * only set logger level if verbose flag is explicitly set --------- Co-authored-by: svlandeg <svlandeg@github.com>	2023-07-31 14:45:04 +02:00
Sofie Van Landeghem	c9e9dccf79	Add displaCy data structures to docs (2) (#12875 ) * Add data structures to docs * Adjusted descriptions for more consistency * Add _optional_ flag to parameters * Add tests and adjust optional title key in doc * Add title to dep visualizations * fix typo --------- Co-authored-by: thomashacker <EdwardSchmuhl@web.de>	2023-07-31 10:47:57 +02:00
Victoria	49055ed7c8	Add cli for finding locations of registered func (#12757 ) * Add cli for finding locations of registered func * fixes: naming and typing * isort * update naming * remove to find-function * remove file:// bit * use registry name if given and exit gracefully if a registry was not found * clean up failure msg * specify registry_name options * mypy fixes * return location for internal usage * add documentation * more mypy fixes * clean up example * add section to menu * add tests --------- Co-authored-by: svlandeg <svlandeg@github.com>	2023-07-31 09:39:00 +02:00
Adriane Boyd	5888afa884	Update numpy build constraints for numpy 1.25 (#12839 ) * Update numpy build constraints for numpy 1.25 Starting in numpy 1.25 (see https://github.com/numpy/numpy/releases/tag/v1.25.0), the numpy C API is backwards-compatible by default. For python 3.9+, we should be able to drop the specific numpy build requirements and use `numpy>=1.25`, which is currently backwards-compatible to `numpy>=1.19`. In the future, the python <3.9 requirements could be dropped and the lower numpy pin could correspond to the oldest supported version for the current lower python pin. * Turn off fail-fast * Revert "Turn off fail-fast" This reverts commit `4306f516bc`. * Update for python 3.6 * Fix typo	2023-07-24 10:32:56 +02:00
svlandeg	79ec68f01b	Merge branch 'upstream_master' into sync_develop	2023-07-19 12:08:52 +02:00
Basile Dura	b0228d8ea6	ci: add cython linter (#12694 ) * chore: add cython-linter dev dependency * fix: lexeme.pyx * fix: morphology.pxd * fix: tokenizer.pxd * fix: vocab.pxd * fix: morphology.pxd (line length) * ci: add cython-lint * ci: fix cython-lint call * Fix kb/candidate.pyx. * Fix kb/kb.pyx. * Fix kb/kb_in_memory.pyx. * Fix kb. * Fix training/ partially. * Fix training/. Ignore trailing whitespaces and too long lines. * Fix ml/. * Fix matcher/. * Fix pipeline/. * Fix tokens/. * Fix build errors. Fix vocab.pyx. * Fix cython-lint install and run. * Fix lexeme.pyx, parts_of_speech.pxd, vectors.pyx. Temporarily disable cython-lint execution. * Fix attrs.pyx, lexeme.pyx, symbols.pxd, isort issues. * Make cython-lint install conditional. Fix tokenizer.pyx. * Fix remaining files. Reenable cython-lint check. * Readded parentheses. * Fix test_build_dependencies(). * Add explanatory comment to cython-lint execution. --------- Co-authored-by: Raphael Mitsch <r.mitsch@outlook.com>	2023-07-19 12:03:31 +02:00
Adriane Boyd	6bf7c65329	Update matcher pattern validation tests (#12835 ) - parametrize over individual token patterns (as originally intended, as far as I can tell) - add a test for lowercase `in` in patterns	2023-07-18 10:00:07 +02:00
Connor Brinton	0566c3a166	🐛 Escape annotated HTML tags in span renderer (#12817 ) These changes add a missing call to `escape_html` in the displaCy span renderer. Previously span-annotated tokens would be inserted into the page markup without being escaped, resulting in potentially incorrect rendering. When I encountered this issue, it resulted in some docs and span underlines being superimposed on top of properly rendered docs and span underlines near the beginning of the visualization (due to an unescaped `<span>` tag).	2023-07-13 17:33:05 +02:00
Sofie Van Landeghem	b1b20bf69d	Replace projects functionality with weasel (#12769 ) * Setting up weasel branch (#12456) * remove project-specific functionality * remove project-specific tests * remove project-specific schemas * remove project-specific information in about * remove project-specific functions in util.py * remove project-specific error strings * remove project-specific CLI commands * black formatting * restore some functions that are used beyond projects * remove project imports * remove imports * remove remote_storage tests * remove one more project unit test * update for PR 12394 * remove get_hash and get_checksum * remove upload_ and download_file methods * remove ensure_pathy * revert clumsy fingers * reinstate E970 * feat: use weasel as spacy project command (#12473) * feat: use weasel as spacy project command * build: use constrained requirement for weasel * feat: add weasel to the library requirements * build: update weasel to new version * build: use specific weasel tag * build: use weasel-0.1.0rc1 from PyPI * fix: remove weasel from requirements.txt * fix: requirements.txt and setup.cfg need to reflect each other * feat: remove legacy spacy project code * bump version * further merge fixes * isort --------- Co-authored-by: Basile Dura <bdura@users.noreply.github.com>	2023-07-07 09:10:27 +02:00
svlandeg	991bcc111e	disable tests until 3.7 models are available	2023-07-07 08:09:57 +02:00
Adriane Boyd	a1191146f5	Revert "Temporarily skip tests for compat table" This reverts commit `dd5e00c735`.	2023-07-06 12:47:50 +02:00
Adriane Boyd	fb0da3e097	Support custom token/lexeme attribute for vectors (#12625 ) * Support custom token/lexeme attribute for vectors * Fix imports * Back off to ORTH without Vectors.attr * Fallback if vectors.attr doesn't exist * Update docs	2023-06-28 09:43:14 +02:00
Adriane Boyd	337a360cc7	Use spans_ prefix for default span finder scores (#12753 )	2023-06-27 19:32:17 +02:00
Adriane Boyd	65f6c9cd10	Support overriding registered functions in configs (#12623 ) Support overriding registered functions in configs. Previously the registry name was parsed as a section name rather than as a registry name.	2023-06-27 17:36:33 +02:00
Adriane Boyd	c067b5264c	Address issues with source with component names and replacing listeners (#12701 ) When sourcing a component, the object from the original pipeline is added to the new pipeline as the same object. This creates a situation where there are several attributes that cannot be in sync between the original pipeline and the new pipeline at the same time for this one object: * component.name * component.listener_map / component.listening_components for tok2vec and transformer When running replace_listeners on a component, the config is not updated correctly if the state of the component is incorrect for the current pipeline (in particular changes that should be applied from model.attrs["replace_listener_cfg"] as used in spacy-transformers) due to the fact that: * find_listeners relies on component.name to set the name in the listener_map * replace_listeners relies on listener_map to determine how to modify the configs In addition, there are several places where pipeline components are modified and the listener map and/or internal component names aren't currently updated. In cases where there is a component shared by two pipelines that cannot be in sync, this PR chooses to prioritize the most recently modified or initialized pipeline. There is no actual solution with the current source behavior that will make both pipelines usable, so the current pipeline is updated whenever components are added/renamed/removed or the pipeline is initialized for training.	2023-06-27 10:47:07 +02:00
Adriane Boyd	e1664217f5	Add spancat_singlelabel to debug data CLI (#12749 )	2023-06-26 10:25:20 +02:00
Adriane Boyd	dd5e00c735	Temporarily skip tests for compat table	2023-06-21 12:59:36 +02:00
Daniël de Kok	e2b70df012	Configure isort to use the Black profile, recursively isort the `spacy` module (#12721 ) * Use isort with Black profile * isort all the things * Fix import cycles as a result of import sorting * Add DOCBIN_ALL_ATTRS type definition * Add isort to requirements * Remove isort from build dependencies check * Typo	2023-06-14 17:48:41 +02:00
Sofie Van Landeghem	d65e3c31a6	use system-independent commands (#12693 )	2023-06-08 11:43:36 +02:00
kadarakos	c003aac29a	SpanFinder into spaCy from experimental (#12507 ) * span finder integrated into spacy from experimental * black * isort * black * default spankey constant * black * Update spacy/pipeline/spancat.py Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * rename * rename * max_length and min_length as Optional[int] and strict checking * black * mypy fix for integer type infinity * revert line order * implement all comparison operators for inf int * avoid two for loops over all docs by not precomputing * interleave thresholding with span creation * black * revert to not interleaving (realized it's faster) * black * Update spacy/errors.py Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * update docstring * enforce that the gold and predicted documents have the same text * new error for ensuring reference and predicted texts are the same * remove todo * adjust test * black * handle misaligned tokenization * return correct variable * failing overfit test * only use a single spans_key like in spancat * black * remove debug lines * typo * remove comment * remove near duplicate redundant method * use the 'spans_key' variable name everywhere * Update spacy/pipeline/span_finder.py Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * flaky test fix suggestion, hand set bias terms * only test suggester and test result exhaustively * make it clear that the span_finder_suggester is more general (not specific to span_finder) * Update spacy/tests/pipeline/test_span_finder.py Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> * Apply suggestions from code review * remove question comment * move preset_spans_suggester test to spancat tests * Add docs and unify default configs for spancat and span finder * Add `allow_overlap=True` to span finder scorer * Fix offset bug in set_annotations * Ignore labels in span finder scorer * Format * Add span_finder to quickstart template * Move settings to self.cfg, store min/max unset as None * Remove debugging * Update docstrings and docs * Update spacy/pipeline/span_finder.py Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com> * Fix imports --------- Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com> Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>	2023-06-07 15:52:28 +02:00
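
The memory-zone idea described in commit 1b8d560d0e can be sketched with a toy interning table. This is a simplified illustration only, not spaCy's implementation: `ToyStringStore` and its attributes are hypothetical names, and the sketch assumes zones are never nested. Entries added inside the zone are tracked separately and dropped when the block exits, while entries added outside persist.

```python
from contextlib import contextmanager


class ToyStringStore:
    """Hypothetical stand-in for an interning table (not spaCy's StringStore)."""

    def __init__(self):
        self._permanent = set()   # strings kept for the lifetime of the store
        self._transient = None    # strings added inside the active zone, if any

    def add(self, text):
        # Strings already stored permanently are never duplicated.
        if text in self._permanent:
            return text
        # Inside a zone, new strings go to the transient set instead.
        target = self._permanent if self._transient is None else self._transient
        target.add(text)
        return text

    def __contains__(self, text):
        in_zone = self._transient is not None and text in self._transient
        return text in self._permanent or in_zone

    def __len__(self):
        return len(self._permanent) + len(self._transient or ())

    @contextmanager
    def memory_zone(self):
        # Assumption: zones are not nested in this sketch.
        self._transient = set()
        try:
            yield self
        finally:
            # Everything interned during the block is released here.
            self._transient = None


store = ToyStringStore()
store.add("the")                      # permanent: added outside any zone
with store.memory_zone():
    store.add("zymurgy")              # transient: added inside the zone
    inside = "zymurgy" in store
after = "zymurgy" in store
```

As in the commit message, anything created inside the block should be treated as invalid once the block has exited.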


2610 Commits
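
Commit da7ad97519 notes that the `TextCatBOW` `length` is rounded up to the next power of two when it is not one already. A minimal sketch of that rounding rule follows; the helper name `round_up_to_power_of_two` is hypothetical and this is not taken from spaCy's source.

```python
def round_up_to_power_of_two(length: int) -> int:
    """Return the smallest power of two that is >= length."""
    if length < 1:
        raise ValueError("length must be a positive integer")
    # (length - 1).bit_length() is the exponent of the next power of two;
    # exact powers of two map to themselves.
    return 1 << (length - 1).bit_length()


rounded = [round_up_to_power_of_two(n) for n in (1, 3, 1000, 1024, 262144)]
```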