Commit Graph

16129 Commits

Author SHA1 Message Date
thjbbvlt aee667a781 add elleux 2024-03-28 16:57:25 +01:00
thjbbvlt 0f1a82b24a add abbrev juill. (same as juil.) 2024-03-26 11:26:41 +01:00
thjbbvlt 54fcc5bba2 add dot after some abbrevs 2024-03-26 11:25:43 +01:00
thjbbvlt b0b66fbab4 add months + days abbrev 2024-03-26 11:24:56 +01:00
thjbbvlt 43aec2d8fc missing suffix: 'ce' 2024-03-26 11:17:33 +01:00
thjbbvlt 657849238f fixed variable types 2024-03-26 11:08:49 +01:00
thjbbvlt f3967e8b91 isort + flake8 fixed 2024-03-25 15:43:13 +01:00
thjbbvlt 396266904e formatted with black 2024-03-25 14:34:58 +01:00
thjbbvlt 73b68c4c78 removed few new sentences 2024-03-15 14:28:12 +01:00
thjbbvlt 8e8dd31e67 end french tokenizer + add sentences (examples) 2024-03-15 12:05:01 +01:00
thjbbvlt ef4c65598a works
modifié :         __init__.py
modifié :         punctuation.py
modifié :         tokenizer_exceptions.py
2024-03-15 11:55:27 +01:00
thjbbvlt 5a3928fe1e refactored french tokenizer 2024-03-15 08:45:23 +01:00
Matthew Honnibal 0518c36f04
Sanitize direct download (#13313)
The 'direct' option in 'spacy download' is supposed to only download from our model releases repository. However, users were able to pass in a relative path, allowing download from arbitrary repositories. This meant that a service that sourced strings from user input and which used the direct option would allow users to install arbitrary packages.
2024-02-20 13:17:51 +01:00
Daniël de Kok bff8725f4b
Set version to 3.7.4 (#13327) 2024-02-14 14:46:28 +01:00
Daniël de Kok fdfdbcd9f4
Make `Language.pipe` workers exit cleanly (#13321)
Also warn when any worker exited with a non-zero exit code and modify
test to ensure that workers exit cleanly by default.
2024-02-12 14:39:38 +01:00
Daniël de Kok 14bd9d89a3
Update example that shows model in requirments (#13302)
See #13293.
2024-02-11 19:46:43 +01:00
Daniël de Kok e1249d3722
Test if closing explicitly solves recursive lock issues (#13304) 2024-02-05 10:07:03 +01:00
Daniël de Kok 40422ff904
Set version to 3.7.3 (#13301) 2024-02-02 13:51:26 +01:00
Daniël de Kok 2dbb332cea
`TextCatParametricAttention.v1`: set key transform dimensions (#13249)
* TextCatParametricAttention.v1: set key transform dimensions

This is necessary for tok2vec implementations that initialize
lazily (e.g. curated transformers).

* Add lazily-initialized tok2vec to simulate transformers

Add a lazily-initialized tok2vec to the tests and test the current
textcat models with it.

Fix some additional issues found using this test.

* isort

* Add `test.` prefix to `LazyInitTok2Vec.v1`
2024-02-02 13:01:59 +01:00
Daniël de Kok d84068e460
Run slow tests: v4 -> main (#13290)
* Run slow tests: v4 -> main

* Also update the branch in GPU tests
2024-01-30 13:58:28 +01:00
Sofie Van Landeghem 89a43f39b7
update universe description (#13291) 2024-01-30 13:49:49 +01:00
Daniël de Kok 68d7841df5
Extension serialization attr tests: add teardown (#13284)
The doc/token extension serialization tests add extensions that are not
serializable with pickle. This didn't cause issues before due to the
implicit run order of tests. However, test ordering has changed with
pytest 8.0.0, leading to failed tests in test_language.

Update the fixtures in the extension serialization tests to do proper
teardown and remove the extensions.
2024-01-29 13:51:56 +01:00
Eliana Vornov 00e938a7c3
add custom code support to CLI speed benchmark (#13247)
* add custom code support to CLI speed benchmark

* sort imports

* better copying for warmup docs
2024-01-26 13:29:22 +01:00
Sofie Van Landeghem 68b85ea950
Clarify data_path loading for apply CLI command (#13272)
* attempt to clarify additional annotations on .spacy file

* suggestion by Daniël

* pipeline instead of pipe
2024-01-26 12:10:05 +01:00
Sofie Van Landeghem 7496e03a2c
Clarify vocab docs (#13273)
* add line to ensure that apple is in fact in the vocab

* add that the vocab may be empty
2024-01-26 10:58:48 +01:00
Sofie Van Landeghem a493981163
fix typo (#13254) 2024-01-24 09:29:57 +01:00
Daniël de Kok a8894a8946
Merge pull request #13240 from mauricesvp/patch-1
Fix typo in method name
2024-01-23 20:49:21 +01:00
Daniël de Kok afac7fb650
test_find_available_port: use port 5001 (#13255)
macOS now uses port 5000 for the AirPlay receiver functionality, so this
test will always fail on a macOS desktop (unless AirPlay receiver
functionality is disabled like in CI).
2024-01-23 20:11:16 +01:00
Daniël de Kok 5a2ad4af4b Merge remote-tracking branch 'upstream/master' into patch-1 2024-01-23 19:53:20 +01:00
Daniël de Kok 128197a5fc
Properly clean up pipe multiprocessing workers (#13259)
Before this change, the workers of pipe call with n_process != 1 were
stopped by calling `terminate` on the processes. However, terminating a
process can leave queues, pipes, and other concurrent data structures in
an invalid state.

With this change, we stop using terminate and take the following approach
instead:

* When the all documents are processed, the parent process puts a
  sentinel in the queue of each worker.
* The parent process then calls `join` on each worker process to
  let them finish up gracefully.
* Worker processes break from the queue processing loop when the
  sentinel is encountered, so that they exit.

We need special handling when one of the workers encounters an error and
the error handler is set to raise an exception. In this case, we cannot
rely on the sentinel to finish all workers -- the queue is a FIFO queue
and there may be other work queued up before the sentinel. We use the
following approach to handle error scenarios:

* The parent puts the end-of-work sentinel in the queue of each worker.
* The parent closes the reading-end of the channel of each worker.
* Then:
  - If the worker was waiting for work, it will encounter the sentinel
    and break from the processing loop.
  - If the worker was processing a batch, it will attempt to write
    results to the channel. This will fail because the channel was
    closed by the parent and the worker will break from the processing
    loop.
2024-01-23 18:33:04 +01:00
Raphael Mitsch 3b3b5cdc63
Merge pull request #13253 from explosion/chore/sync-master-with-llm_main
Sync `master` with `docs/llm_main`
2024-01-19 16:50:43 +01:00
Raphael Mitsch 575c405ae3 Fix LLM docs on task factories. 2024-01-19 16:48:54 +01:00
Raphael Mitsch 256468c414 Merge branch 'docs/llm_main' into chore/sync-master-with-llm_main
# Conflicts:
#	website/docs/api/large-language-models.mdx
2024-01-19 16:34:35 +01:00
Raphael Mitsch 91c24c0285
Merge pull request #13251 from explosion/docs/llm_develop
Sync `docs/llm_main` with `docs/llm_develop`
2024-01-19 12:56:38 +01:00
maurice c608baeecc
Fix typo in method name 2024-01-16 21:54:54 +01:00
Raphael Mitsch 0062c22c35
Updated docs w.r.t. infinite doc length changes (#13214)
* Updated docs w.r.t. infinite doc length.

* Fix typo.

* fix typo's

* Fix table formatting.

* Update formatting.

---------

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2024-01-05 14:20:58 +01:00
Daniël de Kok e2a3952de5
Add spacy.TextCatParametricAttention.v1 (#13201)
* Add spacy.TextCatParametricAttention.v1

This layer provides is a simplification of the ensemble classifier that
only uses paramteric attention. We have found empirically that with a
sufficient amount of training data, using the ensemble classifier with
BoW does not provide significant improvement in classifier accuracy.
However, plugging in a BoW classifier does reduce GPU training and
inference performance substantially, since it uses a GPU-only kernel.

* Fix merge fallout
2024-01-02 10:03:06 +01:00
Daniël de Kok 7ebba86402
Add TextCatReduce.v1 (#13181)
* Add TextCatReduce.v1

This is a textcat classifier that pools the vectors generated by a
tok2vec implementation and then applies a classifier to the pooled
representation. Three reductions are supported for pooling: first, max,
and mean. When multiple reductions are enabled, the reductions are
concatenated before providing them to the classification layer.

This model is a generalization of the TextCatCNN model, which only
supports mean reductions and is a bit of a misnomer, because it can also
be used with transformers. This change also reimplements TextCatCNN.v2
using the new TextCatReduce.v1 layer.

* Doc fixes

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Fully specify `TextCatCNN` <-> `TextCatReduce` equivalence

* Move TextCatCNN docs to legacy, in prep for moving to spacy-legacy

* Add back a test for TextCatCNN.v2

* Replace TextCatCNN in pipe configurations and templates

* Add an infobox to the `TextCatReduce` section with an `TextCatCNN` anchor

* Add last reduction (`use_reduce_last`)

* Remove non-working TextCatCNN Netlify redirect

* Revert layer changes for the quickstart

* Revert one more quickstart change

* Remove unused import

* Fix docstring

* Fix setting name in error message

---------

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2023-12-21 11:00:06 +01:00
Steven Crowther 764be103bc
update README to include links to GPU processing, LLM's, and the spaCy blog. (#13197)
* Update README.md to include links for GPU processing, LLM, and spaCy's blog.

* Create ojo4f3.md

* corrected README to most current version with links to GPU processing, LLM's, and the spaCy blog.

* Delete .github/contributors/ojo4f3.md

* changed LLM icon

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>

* Apply suggestions from code review

---------

Co-authored-by: Adriane Boyd <adrianeboyd@gmail.com>
2023-12-18 09:49:07 +01:00
Sofie Van Landeghem 56fc3bc0f3
Type documentation fixes for Doc (#13187)
* correct char_span output type - can be None

* unify type of exclude parameter

* black

* further fixes to from_dict and to_dict

* formatting
2023-12-18 09:00:47 +01:00
Ines Montani 7df328fbfe
Update README.md [ci skip] 2023-12-12 10:19:57 +01:00
Raphael Mitsch d56ee65ddf
Document `spacy-llm`'s `TranslationTask` (#13183)
* Describe translation task.

* Fix references to examples and template.

* Format.
2023-12-11 17:41:04 +01:00
Raphael Mitsch e79a9c5acd
Document `spacy-llm`'s `RawTask` (#13180)
* Add section on RawTask.

* Fix API docs.

* Update website/docs/api/large-language-models.mdx

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

---------

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
2023-12-11 17:14:12 +01:00
Ines Montani 8cfccdd2f8
Update links [ci skip] 2023-12-11 15:51:43 +01:00
Ines Montani f78b91c03b
Update links [ci skip] 2023-12-11 15:51:01 +01:00
Raphael Mitsch 9fcd2bfa08
Add info on endpoint arg. (#13169) 2023-12-05 12:46:29 +01:00
Raphael Mitsch a25a3b996b
Merge pull request #13173 from explosion/docs/llm_main
Sync `llm_develop` with `llm_main`
2023-12-04 16:46:21 +01:00
Raphael Mitsch 55ed2b4e82
Add documentation for EL task (#12988)
* Add documentation for EL task.

* Fix EL factory name.

* Add llm_entity_linker_mentio.

* Apply suggestions from code review

Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>

* Update EL task docs.

* Update EL task docs.

* Update EL task docs.

* Update EL task docs.

* Update EL task docs.

* Update EL task docs.

* Update EL task docs.

* Update EL task docs.

* Update EL task docs.

* Apply suggestions from code review

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

* Incorporate feedback.

* Format.

* Fix link to KB data.

---------

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com>
2023-12-04 15:23:28 +01:00
Adriane Boyd e467573550
Docs: update trf_data examples and pipeline design info (#13164) 2023-12-04 15:15:54 +01:00
Raphael Mitsch 0e43fca036
Add Claude-2.1 mention. (#13167) 2023-12-01 16:48:35 +01:00