73 Commits

Author SHA1 Message Date
Matthew Honnibal f277bfdf0f
Add SpanGroup and Graph container types to represent arbitrary annotations (#6696)
* Draft out initial Spans data structure

* Initial span group commit

* Basic span group support on Doc

* Basic test for span group

* Compile span_group.pyx

* Draft addition of SpanGroup to DocBin

* Add deserialization for SpanGroup

* Add tests for serializing SpanGroup

* Fix serialization of SpanGroup

* Add EdgeC and GraphC structs

* Add draft Graph data structure

* Compile graph

* More work on Graph

* Update GraphC

* Update graph

* Fix walk functions

* Let Graph take nodes and edges on construction

* Fix walking and getting

* Add graph tests

* Fix import

* Add module with the SpanGroups dict thingy

* Update test

* Rename 'span_groups' attribute

* Try to fix c++11 compilation

* Fix test

* Update DocBin

* Try to fix compilation

* Try to fix graph

* Improve SpanGroup docstrings

* Add doc.spans to documentation

* Fix serialization

* Tidy up and add docs

* Update docs [ci skip]

* Add SpanGroup.has_overlap

* WIP updated Graph API

* Start testing new Graph API

* Update Graph tests

* Update Graph

* Add docstring

Co-authored-by: Ines Montani <ines@ines.io>
2021-01-14 17:30:41 +11:00
Adriane Boyd a45d89f09a Add initialize.before_init and after_init callbacks
Add `initialize.before_init` and `initialize.after_init` callbacks to
the config. The `initialize.before_init` callback is a place to
implement one-time tokenizer customizations that are then saved with the
model.
2021-01-12 13:07:44 +01:00
Adriane Boyd 1442d2f213
Improve simple training example in v3 migration (#6438)
* Create the examples once
* Use the examples in the initialization
* Provide the batch size
* Fix `begin_training` migration example
2020-11-30 09:39:45 +08:00
Ines Montani 019a1dd5e8 Fix v3 overview [ci skip] 2020-11-03 18:10:06 +01:00
Ines Montani 20f80587d6
Merge pull request #6257 from walterhenry/develop-proof
A few tiny typo fixes to push through with release of nightly
2020-10-15 18:17:30 +02:00
walterhenry 75b7f86383 Three small typos
Some little typos since v3.0 is out.
2020-10-15 18:06:37 +02:00
Ines Montani 7f05ccc170 Update docs [ci skip] 2020-10-15 12:35:30 +02:00
Ines Montani e50dc2c1c9 Update docs [ci skip] 2020-10-09 12:04:52 +02:00
Sofie Van Landeghem d093d6343b
TrainablePipe (#6213)
* rename Pipe to TrainablePipe

* split functionality between Pipe and TrainablePipe

* remove unnecessary methods from certain components

* cleanup

* hasattr(component, "pipe") should be sufficient again

* remove serialization and vocab/cfg from Pipe

* unify _ensure_examples and validate_examples

* small fixes

* hasattr checks for self.cfg and self.vocab

* make is_resizable and is_trainable properties

* serialize strings.json instead of vocab

* fix KB IO + tests

* fix typos

* more typos

* _added_strings as a set

* few more tests specifically for _added_strings field

* bump to 3.0.0a36
2020-10-08 21:33:49 +02:00
Ines Montani 064575d79d
Merge pull request #6216 from svlandeg/feature/nel-initialize 2020-10-08 11:14:12 +02:00
Ines Montani 43e59bb22a Update docs and install extras [ci skip] 2020-10-08 10:58:50 +02:00
svlandeg bcaad28eda fix typos 2020-10-07 13:05:37 +02:00
Ines Montani ce14520789 Update docs [ci skip] 2020-10-06 14:35:17 +02:00
Ines Montani 11347f34da Tidy up, tests and docs 2020-10-04 13:54:05 +02:00
Ines Montani b6b73a3ca8 Update docs [ci skip] 2020-10-01 17:45:29 +02:00
Ines Montani 0a8a124a6e Update docs [ci skip] 2020-10-01 12:15:53 +02:00
Ines Montani 115481aca7 Update docs [ci skip] 2020-09-30 15:16:00 +02:00
Ines Montani ff9a63bfbd begin_training -> initialize 2020-09-28 21:35:09 +02:00
Ines Montani e06ff8b71d Update docs [ci skip] 2020-09-26 13:18:08 +02:00
Ines Montani 6ca06cb62c Update docs and formatting [ci skip] 2020-09-23 10:14:27 +02:00
Ines Montani 60a317520a
Merge pull request #6109 from svlandeg/feature/2rename 2020-09-23 09:47:12 +02:00
Ines Montani 930b116f00 Update docs [ci skip] 2020-09-23 09:35:21 +02:00
svlandeg b556a10808 rename converts in_to_out 2020-09-22 11:50:19 +02:00
Ines Montani 67fbcb3da5 Tidy up tests and docs 2020-09-21 20:43:54 +02:00
Ines Montani 012b3a7096 Update docs [ci skip] 2020-09-20 17:44:58 +02:00
Ines Montani c8fa2247e3 Merge branch 'develop' of https://github.com/explosion/spaCy into develop 2020-09-17 12:34:15 +02:00
Ines Montani 6761028c6f Update docs [ci skip] 2020-09-17 12:34:11 +02:00
Adriane Boyd 7e4cd7575c
Refactor Doc.is_ flags (#6044)
* Refactor Doc.is_ flags

* Add derived `Doc.has_annotation` method

  * `Doc.has_annotation(attr)` returns `True` for partial annotation

  * `Doc.has_annotation(attr, require_complete=True)` returns `True` for
    complete annotation

* Add deprecation warnings to `is_tagged`, `is_parsed`, `is_sentenced`
and `is_nered`

* Add `Doc._get_array_attrs()`, which returns a full list of `Doc` attrs
for use with `Doc.to_array`, `Doc.to_bytes` and `Doc.from_docs`. The
list is the `DocBin` attributes list plus `SPACY` and `LENGTH`.
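
The partial/complete distinction for `Doc.has_annotation` can be sketched in plain Python. This is only an illustration, not spaCy's implementation: the standalone function below works on a list of per-token values, and using `None` for an unset value is an assumption made for the example.

```python
def has_annotation(values, require_complete=False):
    """Sketch of the partial vs. complete check over per-token values.

    Illustrative only: `values` is a plain list, and None stands in for
    "unset" (an assumption, not how spaCy stores attributes).
    """
    is_set = [v is not None for v in values]
    # Complete: every token is annotated. Partial: at least one token is.
    return all(is_set) if require_complete else any(is_set)


partially_tagged = ["NOUN", None, "VERB"]
print(has_annotation(partially_tagged))
print(has_annotation(partially_tagged, require_complete=True))
```

With one value missing, the partial check passes while the `require_complete=True` check does not.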

Notes on `Doc.has_annotation`:

* `HEAD` is converted to `DEP` because heads don't have an unset state

* Accept `IS_SENT_START` as a synonym of `SENT_START`

Additional changes:

* Add `NORM`, `ENT_ID` and `SENT_START` to default attributes for
`DocBin`

* In `Doc.from_array()` the presence of `DEP` causes `HEAD` to override
`SENT_START`

* In `Doc.from_array()` using `attrs` other than
`Doc._get_array_attrs()` (i.e., a user's custom list rather than our
default internal list) with both `HEAD` and `SENT_START` shows a warning
that `HEAD` will override `SENT_START`

* `set_children_from_heads` does not require dependency labels to set
sentence boundaries and sets `sent_start` for all non-sentence starts to
`-1`

* Fix call to set_children_from_heads

Co-authored-by: Matthew Honnibal <honnibal+gh@gmail.com>
2020-09-17 00:14:01 +02:00
Ines Montani b7faa38960 Update docs [ci skip] 2020-09-15 12:44:03 +02:00
Ines Montani 154752f9c2 Update docs and consistency [ci skip] 2020-09-15 00:32:49 +02:00
Ines Montani 5ebb2a2ac8 Update docs [ci skip] 2020-09-13 22:36:20 +02:00
Ines Montani 47acb45850 Update docs [ci skip] 2020-09-13 22:30:33 +02:00
Ines Montani 8b0dabe987 Update docs [ci skip] 2020-09-12 17:05:10 +02:00
Ines Montani c443c82722 Update docs [ci skip] 2020-09-05 13:41:10 +02:00
Ines Montani b3e338d65e Update docs [ci skip] 2020-09-04 20:58:36 +02:00
Ines Montani 157caf4dfa WIP: update docs [ci skip] 2020-09-04 16:30:31 +02:00
Adriane Boyd b927893309
Merge branch 'develop' into feature/dependency-matcher-v3 2020-09-04 13:03:30 +02:00
Ines Montani 121809dd1e Fix anchor [ci skip] 2020-09-03 16:49:56 +02:00
Ines Montani b5a0657fd6 "model" terminology consistency in docs 2020-09-03 13:13:03 +02:00
Adriane Boyd 960d9cfadc Officially support DependencyMatcher
Add official support for the `DependencyMatcher`. Redesign the pattern
specification. Fix and extend operator implementations. Update API docs
and add usage docs.

Patterns
--------

Refactor pattern structure to:

```
{
  "LEFT_ID": str,
  "REL_OP": str,
  "RIGHT_ID": str,
  "RIGHT_ATTRS": dict,
}
```

The first node contains only `RIGHT_ID` and `RIGHT_ATTRS` and all
subsequent nodes contain all four keys.
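
As an illustration of the structure above, here is a minimal sketch of a two-node pattern written as plain Python dicts, with a small checker for the key layout. The node names `founded` and `subject`, the attribute values and the helper `check_pattern` are hypothetical. The checker is not spaCy's own validation, and its rule that `LEFT_ID` must name an earlier node is an assumption.

```python
# Hypothetical example pattern: an anchor node and one dependent node.
pattern = [
    # First node: only RIGHT_ID and RIGHT_ATTRS.
    {"RIGHT_ID": "founded", "RIGHT_ATTRS": {"ORTH": "founded"}},
    # Subsequent node: all four keys.
    {
        "LEFT_ID": "founded",
        "REL_OP": ">",
        "RIGHT_ID": "subject",
        "RIGHT_ATTRS": {"DEP": "nsubj"},
    },
]

FIRST_KEYS = {"RIGHT_ID", "RIGHT_ATTRS"}
ALL_KEYS = {"LEFT_ID", "REL_OP", "RIGHT_ID", "RIGHT_ATTRS"}


def check_pattern(pattern):
    """Return True if the pattern follows the key layout described above.

    Illustrative helper only; this is not spaCy's validation code.
    """
    seen_ids = set()
    for i, node in enumerate(pattern):
        expected = FIRST_KEYS if i == 0 else ALL_KEYS
        if set(node) != expected:
            return False
        # Assumption: LEFT_ID must refer to a node defined earlier.
        if i > 0 and node["LEFT_ID"] not in seen_ids:
            return False
        seen_ids.add(node["RIGHT_ID"])
    return True


print(check_pattern(pattern))
```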

New operators
-------------

Because of the way patterns are constructed from left to right, it's
helpful to have `follows` operators along with `precedes` operators. Add
operators for simple precedes / follows alongside immediate precedes /
follows.

* `.*`: precedes
* `;`: immediately follows
* `;*`: follows
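
The linear operators can be sketched over token indices. This is an illustration under stated assumptions rather than the matcher's code. It reads `A OP B` with A as the `LEFT_ID` node and B as the `RIGHT_ID` node, includes the existing `.` operator for comparison, and ignores the restriction to nodes in the same parse. The names `LINEAR_OPS` and `candidates` are hypothetical.

```python
# Each entry maps token indices (a, b) to True when "A OP B" holds.
LINEAR_OPS = {
    ".": lambda a, b: a == b - 1,   # A immediately precedes B
    ".*": lambda a, b: a < b,       # A precedes B
    ";": lambda a, b: a == b + 1,   # A immediately follows B
    ";*": lambda a, b: a > b,       # A follows B
}


def candidates(op, a, n_tokens):
    """List the indices b in range(n_tokens) for which `a OP b` holds."""
    return [b for b in range(n_tokens) if LINEAR_OPS[op](a, b)]


# For a five-token sequence with A at index 2:
for op in LINEAR_OPS:
    print(op, candidates(op, 2, 5))
```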

Operator fixes
--------------

* `<` and `<<` do not include the node itself
* Fix reversed order for all operators involving linear precedence (`.`,
  all sibling operators)
* Linear precedence operators do not match nodes outside the same parse

Additional fixes
----------------

* Use v3 Matcher API
* Support `get` and `remove`
* Support pickling
2020-09-02 17:45:29 +02:00
Ines Montani add9de5487 Deprecate (Phrase)Matcher.pipe 2020-08-31 17:01:24 +02:00
Sofie Van Landeghem ec14744ee4
Rename Transformer listener (#6001)
* rename to spacy-transformers.TransformerListener

* add some more tok2vec tests

* use select_pipes

* fix docs - annotation setter was not changed in the end
2020-08-31 12:41:39 +02:00
Adriane Boyd 216efaf5f5 Restrict tokenizer exceptions to ORTH and NORM 2020-08-31 09:55:01 +02:00
Ines Montani 9b86312bab Update docs [ci skip] 2020-08-29 18:43:19 +02:00
Adriane Boyd 870774f475
Merge branch 'develop' into docs/morph-usage-v3 2020-08-29 16:00:50 +02:00
Adriane Boyd f9ed31a757 Update usage docs for lemmatization and morphology 2020-08-29 15:56:50 +02:00
Ines Montani 66d76f5126 Update docs 2020-08-29 12:36:05 +02:00
Ines Montani 8ac5ef1284 Update docs 2020-08-25 11:54:37 +02:00
Matthew Honnibal e559867605
Allow spacy project to push and pull to/from remote storage (#5949)
* Add utils for working with remote storage

* WIP add remote_cache for project

* WIP add push and pull commands

* Use pathy in remote_cache

* Update util

* Update remote_cache

* Update util

* Update project assets

* Update pull script

* Update push script

* Fix type annotation in util

* Work on remote storage

* Remove site and env hash

* Fix imports

* Fix type annotation

* Require pathy

* Require pathy

* Fix import

* Add a util to handle project variable substitution

* Import push and pull commands

* Fix pull command

* Fix push command

* Fix tarfile in remote_storage

* Improve printing

* Fiddle with status messages

* Set version to v3.0.0a9

* Draft docs for spacy project remote storages

* Update docs [ci skip]

* Use Thinc config to simplify and unify template variables

* Auto-format

* Don't import Pathy globally for now

Causes slow and annoying Google Cloud warning

* Tidy up test

* Tidy up and update tests

* Update to latest Thinc

* Update docs

* variables -> vars

* Update docs [ci skip]

* Update docs [ci skip]

Co-authored-by: Ines Montani <ines@ines.io>
2020-08-23 18:32:09 +02:00
svlandeg 1b7cfa7347 Merge remote-tracking branch 'upstream/develop' into feature/docs-docs-docs 2020-08-21 18:36:18 +02:00