diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index 81cfbf8cb..0abde2abf 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -5,7 +5,7 @@ Thanks for your interest in contributing to spaCy πŸŽ‰ The project is maintained by [@honnibal](https://github.com/honnibal) and [@ines](https://github.com/ines), and we'll do our best to help you get started. This page will give you a quick -overview of how things are organised and most importantly, how to get involved. +overview of how things are organized and most importantly, how to get involved. ## Table of contents @@ -195,7 +195,7 @@ modules in `.py` files, not Cython modules in `.pyx` and `.pxd` files.** ### Code formatting [`black`](https://github.com/ambv/black) is an opinionated Python code -formatter, optimised to produce readable code and small diffs. You can run +formatter, optimized to produce readable code and small diffs. You can run `black` from the command-line, or via your code editor. For example, if you're using [Visual Studio Code](https://code.visualstudio.com/), you can add the following to your `settings.json` to use `black` for formatting and auto-format @@ -286,7 +286,7 @@ Code that interacts with the file-system should accept objects that follow the If the function is user-facing and takes a path as an argument, it should check whether the path is provided as a string. Strings should be converted to `pathlib.Path` objects. Serialization and deserialization functions should always -accept **file-like objects**, as it makes the library io-agnostic. Working on +accept **file-like objects**, as it makes the library IO-agnostic. Working on buffers makes the code more general, easier to test, and compatible with Python 3's asynchronous IO. @@ -384,7 +384,7 @@ of Python and C++, with additional complexity and syntax from numpy. The many "traps for new players". Working in Cython is very rewarding once you're over the initial learning curve. As with C and C++, the first way you write something in Cython will often be the performance-optimal approach. In contrast, -Python optimisation generally requires a lot of experimentation. Is it faster to +Python optimization generally requires a lot of experimentation. Is it faster to have an `if item in my_dict` check, or to use `.get()`? What about `try`/`except`? Does this numpy operation create a copy? There's no way to guess the answers to these questions, and you'll usually be dissatisfied with your results β€” so @@ -400,7 +400,7 @@ Python. If it's not fast enough the first time, just switch to Cython. - [PEP 8 Style Guide for Python Code](https://www.python.org/dev/peps/pep-0008/) (python.org) - [Official Cython documentation](http://docs.cython.org/en/latest/) (cython.org) - [Writing C in Cython](https://explosion.ai/blog/writing-c-in-cython) (explosion.ai) -- [Multi-threading spaCy’s parser and named entity recogniser](https://explosion.ai/blog/multithreading-with-cython) (explosion.ai) +- [Multi-threading spaCy’s parser and named entity recognizer](https://explosion.ai/blog/multithreading-with-cython) (explosion.ai) ## Adding tests @@ -412,7 +412,7 @@ name. For example, tests for the `Tokenizer` can be found in all test files and test functions need to be prefixed with `test_`. When adding tests, make sure to use descriptive names, keep the code short and -concise and only test for one behaviour at a time. Try to `parametrize` test +concise and only test for one behavior at a time. 
Try to `parametrize` test cases wherever possible, use our pre-defined fixtures for spaCy components and avoid unnecessary imports. diff --git a/README.md b/README.md index 1fece1e5a..cef2a1fdd 100644 --- a/README.md +++ b/README.md @@ -49,9 +49,8 @@ It's commercial open-source software, released under the MIT license. ## πŸ’¬ Where to ask questions -The spaCy project is maintained by [@honnibal](https://github.com/honnibal) and -[@ines](https://github.com/ines), along with core contributors -[@svlandeg](https://github.com/svlandeg) and +The spaCy project is maintained by [@honnibal](https://github.com/honnibal), +[@ines](https://github.com/ines), [@svlandeg](https://github.com/svlandeg) and [@adrianeboyd](https://github.com/adrianeboyd). Please understand that we won't be able to provide individual support via email. We also believe that help is much more valuable if it's shared publicly, so that more people can benefit from diff --git a/pyproject.toml b/pyproject.toml index 9a646d0d7..77deb44b0 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -6,9 +6,10 @@ requires = [ "cymem>=2.0.2,<2.1.0", "preshed>=3.0.2,<3.1.0", "murmurhash>=0.28.0,<1.1.0", - "thinc>=8.0.0a28,<8.0.0a30", + "thinc>=8.0.0a29,<8.0.0a40", "blis>=0.4.0,<0.5.0", "pytokenizations", - "smart_open>=2.0.0,<3.0.0" + "smart_open>=2.0.0,<3.0.0", + "pathy" ] build-backend = "setuptools.build_meta" diff --git a/requirements.txt b/requirements.txt index 181cb2101..5aafd83dd 100644 --- a/requirements.txt +++ b/requirements.txt @@ -1,7 +1,7 @@ # Our libraries cymem>=2.0.2,<2.1.0 preshed>=3.0.2,<3.1.0 -thinc>=8.0.0a28,<8.0.0a30 +thinc>=8.0.0a29,<8.0.0a40 blis>=0.4.0,<0.5.0 ml_datasets>=0.1.1 murmurhash>=0.28.0,<1.1.0 @@ -9,6 +9,7 @@ wasabi>=0.7.1,<1.1.0 srsly>=2.1.0,<3.0.0 catalogue>=0.0.7,<1.1.0 typer>=0.3.0,<0.4.0 +pathy # Third party dependencies numpy>=1.15.0 requests>=2.13.0,<3.0.0 diff --git a/setup.cfg b/setup.cfg index d56eab3a6..8b4819ed8 100644 --- a/setup.cfg +++ b/setup.cfg @@ -34,18 +34,19 @@ setup_requires = cymem>=2.0.2,<2.1.0 preshed>=3.0.2,<3.1.0 murmurhash>=0.28.0,<1.1.0 - thinc>=8.0.0a28,<8.0.0a30 + thinc>=8.0.0a29,<8.0.0a40 install_requires = # Our libraries murmurhash>=0.28.0,<1.1.0 cymem>=2.0.2,<2.1.0 preshed>=3.0.2,<3.1.0 - thinc>=8.0.0a28,<8.0.0a30 + thinc>=8.0.0a29,<8.0.0a40 blis>=0.4.0,<0.5.0 wasabi>=0.7.1,<1.1.0 srsly>=2.1.0,<3.0.0 catalogue>=0.0.7,<1.1.0 typer>=0.3.0,<0.4.0 + pathy # Third-party dependencies tqdm>=4.38.0,<5.0.0 numpy>=1.15.0 diff --git a/spacy/about.py b/spacy/about.py index 77b00eb48..56bb016c3 100644 --- a/spacy/about.py +++ b/spacy/about.py @@ -1,6 +1,6 @@ # fmt: off __title__ = "spacy-nightly" -__version__ = "3.0.0a8" +__version__ = "3.0.0a9" __release__ = True __download_url__ = "https://github.com/explosion/spacy-models/releases/download" __compatibility__ = "https://raw.githubusercontent.com/explosion/spacy-models/master/compatibility.json" diff --git a/spacy/cli/__init__.py b/spacy/cli/__init__.py index 2b21e2f2b..8aea6ef45 100644 --- a/spacy/cli/__init__.py +++ b/spacy/cli/__init__.py @@ -21,6 +21,8 @@ from .project.clone import project_clone # noqa: F401 from .project.assets import project_assets # noqa: F401 from .project.run import project_run # noqa: F401 from .project.dvc import project_update_dvc # noqa: F401 +from .project.push import project_push # noqa: F401 +from .project.pull import project_pull # noqa: F401 @app.command("link", no_args_is_help=True, deprecated=True, hidden=True) diff --git a/spacy/cli/_util.py b/spacy/cli/_util.py index 9d3ae0913..b527ac2a0 100644 --- 
a/spacy/cli/_util.py +++ b/spacy/cli/_util.py @@ -1,4 +1,5 @@ -from typing import Dict, Any, Union, List, Optional +from typing import Dict, Any, Union, List, Optional, TYPE_CHECKING +import sys from pathlib import Path from wasabi import msg import srsly @@ -8,11 +9,13 @@ from typer.main import get_command from contextlib import contextmanager from thinc.config import Config, ConfigValidationError from configparser import InterpolationError -import sys from ..schemas import ProjectConfigSchema, validate from ..util import import_file +if TYPE_CHECKING: + from pathy import Pathy # noqa: F401 + PROJECT_FILE = "project.yml" PROJECT_LOCK = "project.lock" @@ -93,11 +96,12 @@ def parse_config_overrides(args: List[str]) -> Dict[str, Any]: return result -def load_project_config(path: Path) -> Dict[str, Any]: +def load_project_config(path: Path, interpolate: bool = True) -> Dict[str, Any]: """Load the project.yml file from a directory and validate it. Also make sure that all directories defined in the config exist. path (Path): The path to the project directory. + interpolate (bool): Whether to substitute project variables. RETURNS (Dict[str, Any]): The loaded project.yml. """ config_path = path / PROJECT_FILE @@ -110,16 +114,34 @@ def load_project_config(path: Path) -> Dict[str, Any]: msg.fail(invalid_err, e, exits=1) errors = validate(ProjectConfigSchema, config) if errors: - msg.fail(invalid_err, "\n".join(errors), exits=1) + msg.fail(invalid_err) + print("\n".join(errors)) + sys.exit(1) validate_project_commands(config) # Make sure directories defined in config exist for subdir in config.get("directories", []): dir_path = path / subdir if not dir_path.exists(): dir_path.mkdir(parents=True) + if interpolate: + err = "project.yml validation error" + with show_validation_error(title=err, hint_fill=False): + config = substitute_project_variables(config) return config +def substitute_project_variables(config: Dict[str, Any], overrides: Dict = {}): + key = "vars" + config.setdefault(key, {}) + config[key].update(overrides) + # Need to put variables in the top scope again so we can have a top-level + # section "project" (otherwise, a list of commands in the top scope wouldn't) + # be allowed by Thinc's config system + cfg = Config({"project": config, key: config[key]}) + interpolated = cfg.interpolate() + return dict(interpolated["project"]) + + def validate_project_commands(config: Dict[str, Any]) -> None: """Check that project commands and workflows are valid, don't contain duplicates, don't clash and only refer to commands that exist. @@ -230,3 +252,39 @@ def get_sourced_components(config: Union[Dict[str, Any], Config]) -> List[str]: for name, cfg in config.get("components", {}).items() if "factory" not in cfg and "source" in cfg ] + + +def upload_file(src: Path, dest: Union[str, "Pathy"]) -> None: + """Upload a file. + + src (Path): The source path. + url (str): The destination URL to upload to. + """ + dest = ensure_pathy(dest) + with dest.open(mode="wb") as output_file: + with src.open(mode="rb") as input_file: + output_file.write(input_file.read()) + + +def download_file(src: Union[str, "Pathy"], dest: Path, *, force: bool = False) -> None: + """Download a file using smart_open. + + url (str): The URL of the file. + dest (Path): The destination path. + force (bool): Whether to force download even if file exists. + If False, the download will be skipped. 
+ """ + if dest.exists() and not force: + return None + src = ensure_pathy(src) + with src.open(mode="rb") as input_file: + with dest.open(mode="wb") as output_file: + output_file.write(input_file.read()) + + +def ensure_pathy(path): + """Temporary helper to prevent importing Pathy globally (which can cause + slow and annoying Google Cloud warning).""" + from pathy import Pathy # noqa: F811 + + return Pathy(path) diff --git a/spacy/cli/init_config.py b/spacy/cli/init_config.py index 9b47dea14..94e0bd6fc 100644 --- a/spacy/cli/init_config.py +++ b/spacy/cli/init_config.py @@ -24,7 +24,7 @@ class Optimizations(str, Enum): @init_cli.command("config") def init_config_cli( # fmt: off - output_file: Path = Arg("-", help="File to save config.cfg to (or - for stdout)", allow_dash=True), + output_file: Path = Arg(..., help="File to save config.cfg to or - for stdout (will only output config and no additional logging info)", allow_dash=True), lang: Optional[str] = Opt("en", "--lang", "-l", help="Two-letter code of the language to use"), pipeline: Optional[str] = Opt("tagger,parser,ner", "--pipeline", "-p", help="Comma-separated names of trainable pipeline components to include in the model (without 'tok2vec' or 'transformer')"), optimize: Optimizations = Opt(Optimizations.efficiency.value, "--optimize", "-o", help="Whether to optimize for efficiency (faster inference, smaller model, lower memory consumption) or higher accuracy (potentially larger and slower model). This will impact the choice of architecture, pretrained weights and related hyperparameters."), @@ -110,6 +110,13 @@ def init_config( "word_vectors": reco["word_vectors"], "has_letters": reco["has_letters"], } + if variables["transformer_data"] and not has_spacy_transformers(): + msg.warn( + "To generate a more effective transformer-based config (GPU-only), " + "install the spacy-transformers package and re-run this command. " + "The config generated now does not use transformers." 
+ ) + variables["transformer_data"] = None base_template = template.render(variables).strip() # Giving up on getting the newlines right in jinja for now base_template = re.sub(r"\n\n\n+", "\n\n", base_template) @@ -126,8 +133,6 @@ def init_config( for label, value in use_case.items(): msg.text(f"- {label}: {value}") use_transformer = bool(template_vars.use_transformer) - if use_transformer: - require_spacy_transformers(msg) with show_validation_error(hint_fill=False): config = util.load_config_from_str(base_template) nlp, _ = util.load_model_from_config(config, auto_fill=True) @@ -149,12 +154,10 @@ def save_config(config: Config, output_file: Path, is_stdout: bool = False) -> N print(f"{COMMAND} train {output_file.parts[-1]} {' '.join(variables)}") -def require_spacy_transformers(msg: Printer) -> None: +def has_spacy_transformers() -> bool: try: import spacy_transformers # noqa: F401 + + return True except ImportError: - msg.fail( - "Using a transformer-based pipeline requires spacy-transformers " - "to be installed.", - exits=1, - ) + return False diff --git a/spacy/cli/project/assets.py b/spacy/cli/project/assets.py index 3be784e04..60cf95160 100644 --- a/spacy/cli/project/assets.py +++ b/spacy/cli/project/assets.py @@ -4,10 +4,10 @@ from wasabi import msg import re import shutil import requests -import smart_open from ...util import ensure_path, working_dir from .._util import project_cli, Arg, PROJECT_FILE, load_project_config, get_checksum +from .._util import download_file # TODO: find a solution for caches @@ -44,16 +44,14 @@ def project_assets(project_dir: Path) -> None: if not assets: msg.warn(f"No assets specified in {PROJECT_FILE}", exits=0) msg.info(f"Fetching {len(assets)} asset(s)") - variables = config.get("variables", {}) for asset in assets: - dest = asset["dest"].format(**variables) + dest = asset["dest"] url = asset.get("url") checksum = asset.get("checksum") if not url: # project.yml defines asset without URL that the user has to place check_private_asset(dest, checksum) continue - url = url.format(**variables) fetch_asset(project_path, url, dest, checksum) @@ -132,15 +130,3 @@ def convert_asset_url(url: str) -> str: ) return converted return url - - -def download_file(url: str, dest: Path, chunk_size: int = 1024) -> None: - """Download a file using smart_open. - - url (str): The URL of the file. - dest (Path): The destination path. - chunk_size (int): The size of chunks to read/write. 
- """ - with smart_open.open(url, mode="rb") as input_file: - with dest.open(mode="wb") as output_file: - output_file.write(input_file.read()) diff --git a/spacy/cli/project/dvc.py b/spacy/cli/project/dvc.py index 7386339d9..e0f6cd430 100644 --- a/spacy/cli/project/dvc.py +++ b/spacy/cli/project/dvc.py @@ -99,7 +99,6 @@ def update_dvc_config( if ref_hash == config_hash and not force: return False # Nothing has changed in project.yml, don't need to update dvc_config_path.unlink() - variables = config.get("variables", {}) dvc_commands = [] config_commands = {cmd["name"]: cmd for cmd in config.get("commands", [])} for name in workflows[workflow]: @@ -122,7 +121,7 @@ def update_dvc_config( dvc_commands.append(join_command(full_cmd)) with working_dir(path): dvc_flags = {"--verbose": verbose, "--quiet": silent} - run_dvc_commands(dvc_commands, variables, flags=dvc_flags) + run_dvc_commands(dvc_commands, flags=dvc_flags) with dvc_config_path.open("r+", encoding="utf8") as f: content = f.read() f.seek(0, 0) @@ -131,23 +130,16 @@ def update_dvc_config( def run_dvc_commands( - commands: List[str] = tuple(), - variables: Dict[str, str] = {}, - flags: Dict[str, bool] = {}, + commands: List[str] = tuple(), flags: Dict[str, bool] = {}, ) -> None: """Run a sequence of DVC commands in a subprocess, in order. commands (List[str]): The string commands without the leading "dvc". - variables (Dict[str, str]): Dictionary of variable names, mapped to their - values. Will be used to substitute format string variables in the - commands. flags (Dict[str, bool]): Conditional flags to be added to command. Makes it easier to pass flags like --quiet that depend on a variable or command-line setting while avoiding lots of nested conditionals. """ for command in commands: - # Substitute variables, e.g. "./{NAME}.json" - command = command.format(**variables) command = split_command(command) dvc_command = ["dvc", *command] # Add the flags if they are set to True diff --git a/spacy/cli/project/pull.py b/spacy/cli/project/pull.py new file mode 100644 index 000000000..1bf608c40 --- /dev/null +++ b/spacy/cli/project/pull.py @@ -0,0 +1,36 @@ +from pathlib import Path +from wasabi import msg +from .remote_storage import RemoteStorage +from .remote_storage import get_command_hash +from .._util import project_cli, Arg +from .._util import load_project_config + + +@project_cli.command("pull") +def project_pull_cli( + # fmt: off + remote: str = Arg("default", help="Name or path of remote storage"), + project_dir: Path = Arg(Path.cwd(), help="Location of project directory. Defaults to current working directory.", exists=True, file_okay=False), + # fmt: on +): + """Retrieve any precomputed outputs from a remote storage that are available. + You can alias remotes in your project.yml by mapping them to storage paths. + A storage can be anything that the smart-open library can upload to, e.g. 
+ gcs, aws, ssh, local directories etc + """ + for url, output_path in project_pull(project_dir, remote): + if url is not None: + msg.good(f"Pulled {output_path} from {url}") + + +def project_pull(project_dir: Path, remote: str, *, verbose: bool = False): + config = load_project_config(project_dir) + if remote in config.get("remotes", {}): + remote = config["remotes"][remote] + storage = RemoteStorage(project_dir, remote) + for cmd in config.get("commands", []): + deps = [project_dir / dep for dep in cmd.get("deps", [])] + cmd_hash = get_command_hash("", "", deps, cmd["script"]) + for output_path in cmd.get("outputs", []): + url = storage.pull(output_path, command_hash=cmd_hash) + yield url, output_path diff --git a/spacy/cli/project/push.py b/spacy/cli/project/push.py new file mode 100644 index 000000000..0b070c9d8 --- /dev/null +++ b/spacy/cli/project/push.py @@ -0,0 +1,48 @@ +from pathlib import Path +from wasabi import msg +from .remote_storage import RemoteStorage +from .remote_storage import get_content_hash, get_command_hash +from .._util import load_project_config +from .._util import project_cli, Arg + + +@project_cli.command("push") +def project_push_cli( + # fmt: off + remote: str = Arg("default", help="Name or path of remote storage"), + project_dir: Path = Arg(Path.cwd(), help="Location of project directory. Defaults to current working directory.", exists=True, file_okay=False), + # fmt: on +): + """Persist outputs to a remote storage. You can alias remotes in your project.yml + by mapping them to storage paths. A storage can be anything that the smart-open + library can upload to, e.g. gcs, aws, ssh, local directories etc + """ + for output_path, url in project_push(project_dir, remote): + if url is None: + msg.info(f"Skipping {output_path}") + else: + msg.good(f"Pushed {output_path} to {url}") + + +def project_push(project_dir: Path, remote: str): + """Persist outputs to a remote storage. You can alias remotes in your project.yml + by mapping them to storage paths. A storage can be anything that the smart-open + library can upload to, e.g. gcs, aws, ssh, local directories etc + """ + config = load_project_config(project_dir) + if remote in config.get("remotes", {}): + remote = config["remotes"][remote] + storage = RemoteStorage(project_dir, remote) + for cmd in config.get("commands", []): + cmd_hash = get_command_hash( + "", "", [project_dir / dep for dep in cmd.get("deps", [])], cmd["script"] + ) + for output_path in cmd.get("outputs", []): + output_loc = project_dir / output_path + if output_loc.exists(): + url = storage.push( + output_path, + command_hash=cmd_hash, + content_hash=get_content_hash(output_loc), + ) + yield output_path, url diff --git a/spacy/cli/project/remote_storage.py b/spacy/cli/project/remote_storage.py new file mode 100644 index 000000000..e7e7cbbe8 --- /dev/null +++ b/spacy/cli/project/remote_storage.py @@ -0,0 +1,169 @@ +from typing import Optional, List, Dict, TYPE_CHECKING +import os +import site +import hashlib +import urllib.parse +import tarfile +from pathlib import Path + +from .._util import get_hash, get_checksum, download_file, ensure_pathy +from ...util import make_tempdir + +if TYPE_CHECKING: + from pathy import Pathy # noqa: F401 + + +class RemoteStorage: + """Push and pull outputs to and from a remote file storage. + + Remotes can be anything that `smart-open` can support: AWS, GCS, file system, + ssh, etc. 
+ """ + + def __init__(self, project_root: Path, url: str, *, compression="gz"): + self.root = project_root + self.url = ensure_pathy(url) + self.compression = compression + + def push(self, path: Path, command_hash: str, content_hash: str) -> "Pathy": + """Compress a file or directory within a project and upload it to a remote + storage. If an object exists at the full URL, nothing is done. + + Within the remote storage, files are addressed by their project path + (url encoded) and two user-supplied hashes, representing their creation + context and their file contents. If the URL already exists, the data is + not uploaded. Paths are archived and compressed prior to upload. + """ + loc = self.root / path + if not loc.exists(): + raise IOError(f"Cannot push {loc}: does not exist.") + url = self.make_url(path, command_hash, content_hash) + if url.exists(): + return None + tmp: Path + with make_tempdir() as tmp: + tar_loc = tmp / self.encode_name(str(path)) + mode_string = f"w:{self.compression}" if self.compression else "w" + with tarfile.open(tar_loc, mode=mode_string) as tar_file: + tar_file.add(str(loc), arcname=str(path)) + with tar_loc.open(mode="rb") as input_file: + with url.open(mode="wb") as output_file: + output_file.write(input_file.read()) + return url + + def pull( + self, + path: Path, + *, + command_hash: Optional[str] = None, + content_hash: Optional[str] = None, + ) -> Optional["Pathy"]: + """Retrieve a file from the remote cache. If the file already exists, + nothing is done. + + If the command_hash and/or content_hash are specified, only matching + results are returned. If no results are available, an error is raised. + """ + dest = self.root / path + if dest.exists(): + return None + url = self.find(path, command_hash=command_hash, content_hash=content_hash) + if url is None: + return url + else: + # Make sure the destination exists + if not dest.parent.exists(): + dest.parent.mkdir(parents=True) + tmp: Path + with make_tempdir() as tmp: + tar_loc = tmp / url.parts[-1] + download_file(url, tar_loc) + mode_string = f"r:{self.compression}" if self.compression else "r" + with tarfile.open(tar_loc, mode=mode_string) as tar_file: + # This requires that the path is added correctly, relative + # to root. This is how we set things up in push() + tar_file.extractall(self.root) + return url + + def find( + self, + path: Path, + *, + command_hash: Optional[str] = None, + content_hash: Optional[str] = None, + ) -> Optional["Pathy"]: + """Find the best matching version of a file within the storage, + or `None` if no match can be found. If both the creation and content hash + are specified, only exact matches will be returned. Otherwise, the most + recent matching file is preferred. 
+ """ + name = self.encode_name(str(path)) + if command_hash is not None and content_hash is not None: + url = self.make_url(path, command_hash, content_hash) + urls = [url] if url.exists() else [] + elif command_hash is not None: + urls = list((self.url / name / command_hash).iterdir()) + else: + urls = list((self.url / name).iterdir()) + if content_hash is not None: + urls = [url for url in urls if url.parts[-1] == content_hash] + return urls[-1] if urls else None + + def make_url(self, path: Path, command_hash: str, content_hash: str) -> "Pathy": + """Construct a URL from a subpath, a creation hash and a content hash.""" + return self.url / self.encode_name(str(path)) / command_hash / content_hash + + def encode_name(self, name: str) -> str: + """Encode a subpath into a URL-safe name.""" + return urllib.parse.quote_plus(name) + + +def get_content_hash(loc: Path) -> str: + return get_checksum(loc) + + +def get_command_hash( + site_hash: str, env_hash: str, deps: List[Path], cmd: List[str] +) -> str: + """Create a hash representing the execution of a command. This includes the + currently installed packages, whatever environment variables have been marked + as relevant, and the command. + """ + hashes = [site_hash, env_hash] + [get_checksum(dep) for dep in sorted(deps)] + hashes.extend(cmd) + creation_bytes = "".join(hashes).encode("utf8") + return hashlib.md5(creation_bytes).hexdigest() + + +def get_site_hash(): + """Hash the current Python environment's site-packages contents, including + the name and version of the libraries. The list we're hashing is what + `pip freeze` would output. + """ + site_dirs = site.getsitepackages() + if site.ENABLE_USER_SITE: + site_dirs.extend(site.getusersitepackages()) + packages = set() + for site_dir in site_dirs: + site_dir = Path(site_dir) + for subpath in site_dir.iterdir(): + if subpath.parts[-1].endswith("dist-info"): + packages.add(subpath.parts[-1].replace(".dist-info", "")) + package_bytes = "".join(sorted(packages)).encode("utf8") + return hashlib.md5sum(package_bytes).hexdigest() + + +def get_env_hash(env: Dict[str, str]) -> str: + """Construct a hash of the environment variables that will be passed into + the commands. + + Values in the env dict may be references to the current os.environ, using + the syntax $ENV_VAR to mean os.environ[ENV_VAR] + """ + env_vars = {} + for key, value in env.items(): + if value.startswith("$"): + env_vars[key] = os.environ.get(value[1:], "") + else: + env_vars[key] = value + return get_hash(env_vars) diff --git a/spacy/cli/project/run.py b/spacy/cli/project/run.py index 5c66095aa..6e1deeeee 100644 --- a/spacy/cli/project/run.py +++ b/spacy/cli/project/run.py @@ -44,7 +44,6 @@ def project_run( dry (bool): Perform a dry run and don't execute commands. 
""" config = load_project_config(project_dir) - variables = config.get("variables", {}) commands = {cmd["name"]: cmd for cmd in config.get("commands", [])} workflows = config.get("workflows", {}) validate_subcommand(commands.keys(), workflows.keys(), subcommand) @@ -54,22 +53,20 @@ def project_run( project_run(project_dir, cmd, force=force, dry=dry) else: cmd = commands[subcommand] - variables = config.get("variables", {}) for dep in cmd.get("deps", []): - dep = dep.format(**variables) if not (project_dir / dep).exists(): err = f"Missing dependency specified by command '{subcommand}': {dep}" err_kwargs = {"exits": 1} if not dry else {} msg.fail(err, **err_kwargs) with working_dir(project_dir) as current_dir: - rerun = check_rerun(current_dir, cmd, variables) + rerun = check_rerun(current_dir, cmd) if not rerun and not force: msg.info(f"Skipping '{cmd['name']}': nothing changed") else: msg.divider(subcommand) - run_commands(cmd["script"], variables, dry=dry) + run_commands(cmd["script"], dry=dry) if not dry: - update_lockfile(current_dir, cmd, variables) + update_lockfile(current_dir, cmd) def print_run_help(project_dir: Path, subcommand: Optional[str] = None) -> None: @@ -115,23 +112,15 @@ def print_run_help(project_dir: Path, subcommand: Optional[str] = None) -> None: def run_commands( - commands: List[str] = tuple(), - variables: Dict[str, Any] = {}, - silent: bool = False, - dry: bool = False, + commands: List[str] = tuple(), silent: bool = False, dry: bool = False, ) -> None: """Run a sequence of commands in a subprocess, in order. commands (List[str]): The string commands. - variables (Dict[str, Any]): Dictionary of variable names, mapped to their - values. Will be used to substitute format string variables in the - commands. silent (bool): Don't print the commands. dry (bool): Perform a dry run and don't execut anything. """ for command in commands: - # Substitute variables, e.g. "./{NAME}.json" - command = command.format(**variables) command = split_command(command) # Not sure if this is needed or a good idea. Motivation: users may often # use commands in their config that reference "python" and we want to @@ -173,15 +162,12 @@ def validate_subcommand( ) -def check_rerun( - project_dir: Path, command: Dict[str, Any], variables: Dict[str, Any] -) -> bool: +def check_rerun(project_dir: Path, command: Dict[str, Any]) -> bool: """Check if a command should be rerun because its settings or inputs/outputs changed. project_dir (Path): The current project directory. command (Dict[str, Any]): The command, as defined in the project.yml. - variables (Dict[str, Any]): The variables defined in the project.yml. RETURNS (bool): Whether to re-run the command. """ lock_path = project_dir / PROJECT_LOCK @@ -197,19 +183,16 @@ def check_rerun( # If the entry in the lockfile matches the lockfile entry that would be # generated from the current command, we don't rerun because it means that # all inputs/outputs, hashes and scripts are the same and nothing changed - return get_hash(get_lock_entry(project_dir, command, variables)) != get_hash(entry) + return get_hash(get_lock_entry(project_dir, command)) != get_hash(entry) -def update_lockfile( - project_dir: Path, command: Dict[str, Any], variables: Dict[str, Any] -) -> None: +def update_lockfile(project_dir: Path, command: Dict[str, Any]) -> None: """Update the lockfile after running a command. Will create a lockfile if it doesn't yet exist and will add an entry for the current command, its script and dependencies/outputs. 
project_dir (Path): The current project directory. command (Dict[str, Any]): The command, as defined in the project.yml. - variables (Dict[str, Any]): The variables defined in the project.yml. """ lock_path = project_dir / PROJECT_LOCK if not lock_path.exists(): @@ -217,13 +200,11 @@ def update_lockfile( data = {} else: data = srsly.read_yaml(lock_path) - data[command["name"]] = get_lock_entry(project_dir, command, variables) + data[command["name"]] = get_lock_entry(project_dir, command) srsly.write_yaml(lock_path, data) -def get_lock_entry( - project_dir: Path, command: Dict[str, Any], variables: Dict[str, Any] -) -> Dict[str, Any]: +def get_lock_entry(project_dir: Path, command: Dict[str, Any]) -> Dict[str, Any]: """Get a lockfile entry for a given command. An entry includes the command, the script (command steps) and a list of dependencies and outputs with their paths and file hashes, if available. The format is based on the @@ -231,12 +212,11 @@ def get_lock_entry( project_dir (Path): The current project directory. command (Dict[str, Any]): The command, as defined in the project.yml. - variables (Dict[str, Any]): The variables defined in the project.yml. RETURNS (Dict[str, Any]): The lockfile entry. """ - deps = get_fileinfo(project_dir, command.get("deps", []), variables) - outs = get_fileinfo(project_dir, command.get("outputs", []), variables) - outs_nc = get_fileinfo(project_dir, command.get("outputs_no_cache", []), variables) + deps = get_fileinfo(project_dir, command.get("deps", [])) + outs = get_fileinfo(project_dir, command.get("outputs", [])) + outs_nc = get_fileinfo(project_dir, command.get("outputs_no_cache", [])) return { "cmd": f"{COMMAND} run {command['name']}", "script": command["script"], @@ -245,20 +225,16 @@ def get_lock_entry( } -def get_fileinfo( - project_dir: Path, paths: List[str], variables: Dict[str, Any] -) -> List[Dict[str, str]]: +def get_fileinfo(project_dir: Path, paths: List[str]) -> List[Dict[str, str]]: """Generate the file information for a list of paths (dependencies, outputs). Includes the file path and the file's checksum. project_dir (Path): The current project directory. paths (List[str]): The file paths. - variables (Dict[str, Any]): The variables defined in the project.yml. RETURNS (List[Dict[str, str]]): The lockfile entry for a file. 
""" data = [] for path in paths: - path = path.format(**variables) file_path = project_dir / path md5 = get_checksum(file_path) if file_path.exists() else None data.append({"path": path, "md5": md5}) diff --git a/spacy/displacy/render.py b/spacy/displacy/render.py index 69f6df8f0..07550f9aa 100644 --- a/spacy/displacy/render.py +++ b/spacy/displacy/render.py @@ -252,8 +252,10 @@ class EntityRenderer: colors.update(user_color) colors.update(options.get("colors", {})) self.default_color = DEFAULT_ENTITY_COLOR - self.colors = colors + self.colors = {label.upper(): color for label, color in colors.items()} self.ents = options.get("ents", None) + if self.ents is not None: + self.ents = [ent.upper() for ent in self.ents] self.direction = DEFAULT_DIR self.lang = DEFAULT_LANG template = options.get("template") diff --git a/spacy/displacy/templates.py b/spacy/displacy/templates.py index ff99000f4..b9cbf717b 100644 --- a/spacy/displacy/templates.py +++ b/spacy/displacy/templates.py @@ -51,14 +51,14 @@ TPL_ENTS = """ TPL_ENT = """ {text} - {label} + {label} """ TPL_ENT_RTL = """ {text} - {label} + {label} """ diff --git a/spacy/schemas.py b/spacy/schemas.py index 3eef814c6..170342b54 100644 --- a/spacy/schemas.py +++ b/spacy/schemas.py @@ -303,7 +303,7 @@ class ProjectConfigCommand(BaseModel): class ProjectConfigSchema(BaseModel): # fmt: off - variables: Dict[StrictStr, Union[str, int, float, bool]] = Field({}, title="Optional variables to substitute in commands") + vars: Dict[StrictStr, Any] = Field({}, title="Optional variables to substitute in commands") assets: List[ProjectConfigAsset] = Field([], title="Data assets") workflows: Dict[StrictStr, List[StrictStr]] = Field({}, title="Named workflows, mapped to list of project commands to run in order") commands: List[ProjectConfigCommand] = Field([], title="Project command shortucts") diff --git a/spacy/tests/regression/test_issue4501-5000.py b/spacy/tests/regression/test_issue4501-5000.py index 0d4ce9a30..d16ecc1e6 100644 --- a/spacy/tests/regression/test_issue4501-5000.py +++ b/spacy/tests/regression/test_issue4501-5000.py @@ -65,7 +65,7 @@ def test_issue4590(en_vocab): def test_issue4651_with_phrase_matcher_attr(): - """Test that the EntityRuler PhraseMatcher is deserialize correctly using + """Test that the EntityRuler PhraseMatcher is deserialized correctly using the method from_disk when the EntityRuler argument phrase_matcher_attr is specified. """ @@ -87,7 +87,7 @@ def test_issue4651_with_phrase_matcher_attr(): def test_issue4651_without_phrase_matcher_attr(): - """Test that the EntityRuler PhraseMatcher is deserialize correctly using + """Test that the EntityRuler PhraseMatcher is deserialized correctly using the method from_disk when the EntityRuler argument phrase_matcher_attr is not specified. 
""" diff --git a/spacy/tests/test_cli.py b/spacy/tests/test_cli.py index 89ce740e0..104c7c516 100644 --- a/spacy/tests/test_cli.py +++ b/spacy/tests/test_cli.py @@ -6,9 +6,12 @@ from spacy.schemas import ProjectConfigSchema, RecommendationSchema, validate from spacy.cli.pretrain import make_docs from spacy.cli.init_config import init_config, RECOMMENDATIONS from spacy.cli._util import validate_project_commands, parse_config_overrides -from spacy.util import get_lang_class +from spacy.cli._util import load_project_config, substitute_project_variables +from thinc.config import ConfigValidationError import srsly +from .util import make_tempdir + def test_cli_converters_conllu2json(): # from NorNE: https://github.com/ltgoslo/norne/blob/3d23274965f513f23aa48455b28b1878dad23c05/ud/nob/no_bokmaal-ud-dev.conllu @@ -295,6 +298,24 @@ def test_project_config_validation2(config, n_errors): assert len(errors) == n_errors +def test_project_config_interpolation(): + variables = {"a": 10, "b": {"c": "foo", "d": True}} + commands = [ + {"name": "x", "script": ["hello ${vars.a} ${vars.b.c}"]}, + {"name": "y", "script": ["${vars.b.c} ${vars.b.d}"]}, + ] + project = {"commands": commands, "vars": variables} + with make_tempdir() as d: + srsly.write_yaml(d / "project.yml", project) + cfg = load_project_config(d) + assert cfg["commands"][0]["script"][0] == "hello 10 foo" + assert cfg["commands"][1]["script"][0] == "foo true" + commands = [{"name": "x", "script": ["hello ${vars.a} ${vars.b.e}"]}] + project = {"commands": commands, "vars": variables} + with pytest.raises(ConfigValidationError): + substitute_project_variables(project) + + @pytest.mark.parametrize( "args,expected", [ diff --git a/spacy/tests/test_displacy.py b/spacy/tests/test_displacy.py index adac0f7c3..1fa0eeaa1 100644 --- a/spacy/tests/test_displacy.py +++ b/spacy/tests/test_displacy.py @@ -1,6 +1,6 @@ import pytest from spacy import displacy -from spacy.displacy.render import DependencyRenderer +from spacy.displacy.render import DependencyRenderer, EntityRenderer from spacy.tokens import Span from spacy.lang.fa import Persian @@ -97,3 +97,17 @@ def test_displacy_render_wrapper(en_vocab): assert html.endswith("/div>TEST") # Restore displacy.set_render_wrapper(lambda html: html) + + +def test_displacy_options_case(): + ents = ["foo", "BAR"] + colors = {"FOO": "red", "bar": "green"} + renderer = EntityRenderer({"ents": ents, "colors": colors}) + text = "abcd" + labels = ["foo", "bar", "FOO", "BAR"] + spans = [{"start": i, "end": i + 1, "label": labels[i]} for i in range(len(text))] + result = renderer.render_ents("abcde", spans, None).split("\n\n") + assert "red" in result[0] and "foo" in result[0] + assert "green" in result[1] and "bar" in result[1] + assert "red" in result[2] and "FOO" in result[2] + assert "green" in result[3] and "BAR" in result[3] diff --git a/spacy/tokenizer.pyx b/spacy/tokenizer.pyx index a13299fff..9fda1800b 100644 --- a/spacy/tokenizer.pyx +++ b/spacy/tokenizer.pyx @@ -47,9 +47,9 @@ cdef class Tokenizer: `infix_finditer` (callable): A function matching the signature of `re.compile(string).finditer` to find infixes. token_match (callable): A boolean function matching strings to be - recognised as tokens. + recognized as tokens. url_match (callable): A boolean function matching strings to be - recognised as tokens after considering prefixes and suffixes. + recognized as tokens after considering prefixes and suffixes. 
EXAMPLE: >>> tokenizer = Tokenizer(nlp.vocab) diff --git a/spacy/tokens/doc.pyx b/spacy/tokens/doc.pyx index d37423e2f..cd080bf35 100644 --- a/spacy/tokens/doc.pyx +++ b/spacy/tokens/doc.pyx @@ -1193,8 +1193,7 @@ cdef class Doc: retokenizer.merge(span, attributes[i]) def to_json(self, underscore=None): - """Convert a Doc to JSON. The format it produces will be the new format - for the `spacy train` command (not implemented yet). + """Convert a Doc to JSON. underscore (list): Optional list of string names of custom doc._. attributes. Attribute values need to be JSON-serializable. Values will diff --git a/spacy/util.py b/spacy/util.py index 5eff82866..736f4d805 100644 --- a/spacy/util.py +++ b/spacy/util.py @@ -1,5 +1,5 @@ from typing import List, Union, Dict, Any, Optional, Iterable, Callable, Tuple -from typing import Iterator, Type, Pattern, TYPE_CHECKING +from typing import Iterator, Type, Pattern, Generator, TYPE_CHECKING from types import ModuleType import os import importlib @@ -610,7 +610,7 @@ def working_dir(path: Union[str, Path]) -> None: @contextmanager -def make_tempdir() -> None: +def make_tempdir() -> Generator[Path, None, None]: """Execute a block in a temporary directory and remove the directory and its contents at the end of the with block. diff --git a/website/docs/api/architectures.md b/website/docs/api/architectures.md index 835815496..3089fa1b3 100644 --- a/website/docs/api/architectures.md +++ b/website/docs/api/architectures.md @@ -11,9 +11,17 @@ menu: - ['Entity Linking', 'entitylinker'] --- -TODO: intro and how architectures work, link to -[`registry`](/api/top-level#registry), -[custom functions](/usage/training#custom-functions) usage etc. +A **model architecture** is a function that wires up a +[`Model`](https://thinc.ai/docs/api-model) instance, which you can then use in a +pipeline component or as a layer of a larger network. This page documents +spaCy's built-in architectures that are used for different NLP tasks. All +trainable [built-in components](/api#architecture-pipeline) expect a `model` +argument defined in the config and document their the default architecture. +Custom architectures can be registered using the +[`@spacy.registry.architectures`](/api/top-level#regsitry) decorator and used as +part of the [training config](/usage/training#custom-functions). Also see the +usage documentation on +[layers and model architectures](/usage/layers-architectures). ## Tok2Vec architectures {#tok2vec-arch source="spacy/ml/models/tok2vec.py"} @@ -110,13 +118,11 @@ Instead of defining its own `Tok2Vec` instance, a model architecture like [Tagger](/api/architectures#tagger) can define a listener as its `tok2vec` argument that connects to the shared `tok2vec` component in the pipeline. - - | Name | Description | | ----------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | | `width` | The width of the vectors produced by the "upstream" [`Tok2Vec`](/api/tok2vec) component. ~~int~~ | | `upstream` | A string to identify the "upstream" `Tok2Vec` component to communicate with. The upstream name should either be the wildcard string `"*"`, or the name of the `Tok2Vec` component. You'll almost never have multiple upstream `Tok2Vec` components, so the wildcard string will almost always be fine. 
~~str~~ | -| **CREATES** | The model using the architecture. ~~Model~~ | +| **CREATES** | The model using the architecture. ~~Model[List[Doc], List[Floats2d]]~~ | ### spacy.MultiHashEmbed.v1 {#MultiHashEmbed} @@ -139,15 +145,13 @@ definitions depending on the `Vocab` of the `Doc` object passed in. Vectors from pretrained static vectors can also be incorporated into the concatenated representation. - - | Name | Description | | ------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | | `width` | The output width. Also used as the width of the embedding tables. Recommended values are between `64` and `300`. ~~int~~ | | `rows` | The number of rows for the embedding tables. Can be low, due to the hashing trick. Embeddings for prefix, suffix and word shape use half as many rows. Recommended values are between `2000` and `10000`. ~~int~~ | | `also_embed_subwords` | Whether to use the `PREFIX`, `SUFFIX` and `SHAPE` features in the embeddings. If not using these, you may need more rows in your hash embeddings, as there will be increased chance of collisions. ~~bool~~ | | `also_use_static_vectors` | Whether to also use static word vectors. Requires a vectors table to be loaded in the [Doc](/api/doc) objects' vocab. ~~bool~~ | -| **CREATES** | The model using the architecture. ~~Model~~ | +| **CREATES** | The model using the architecture. ~~Model[List[Doc], List[Floats2d]]~~ | ### spacy.CharacterEmbed.v1 {#CharacterEmbed} @@ -178,15 +182,13 @@ concatenated. A hash-embedded vector of the `NORM` of the word is also concatenated on, and the result is then passed through a feed-forward network to construct a single vector to represent the information. - - | Name | Description | | ----------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------- | | `width` | The width of the output vector and the `NORM` hash embedding. ~~int~~ | | `rows` | The number of rows in the `NORM` hash embedding table. ~~int~~ | | `nM` | The dimensionality of the character embeddings. Recommended values are between `16` and `64`. ~~int~~ | | `nC` | The number of UTF-8 bytes to embed per word. Recommended values are between `3` and `8`, although it may depend on the length of words in the language. ~~int~~ | -| **CREATES** | The model using the architecture. ~~Model~~ | +| **CREATES** | The model using the architecture. ~~Model[List[Doc], List[Floats2d]]~~ | ### spacy.MaxoutWindowEncoder.v1 {#MaxoutWindowEncoder} @@ -277,12 +279,10 @@ Embed [`Doc`](/api/doc) objects with their vocab's vectors table, applying a learned linear projection to control the dimensionality. See the documentation on [static vectors](/usage/embeddings-transformers#static-vectors) for details. - - | Name | Β Description | | ----------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| `nO` | Defaults to `None`. ~~Optional[int]~~ | -| `nM` | Defaults to `None`. ~~Optional[int]~~ | +| `nO` | The output width of the layer, after the linear projection. ~~Optional[int]~~ | +| `nM` | The width of the static vectors. ~~Optional[int]~~ | | `dropout` | Optional dropout rate. 
If set, it's applied per dimension over the whole batch. Defaults to `None`. ~~Optional[float]~~ | | `init_W` | The [initialization function](https://thinc.ai/docs/api-initializers). Defaults to [`glorot_uniform_init`](https://thinc.ai/docs/api-initializers#glorot_uniform_init). ~~Callable[[Ops, Tuple[int, ...]]], FloatsXd]~~ | | `key_attr` | Defaults to `"ORTH"`. ~~str~~ | @@ -292,8 +292,18 @@ on [static vectors](/usage/embeddings-transformers#static-vectors) for details. The following architectures are provided by the package [`spacy-transformers`](https://github.com/explosion/spacy-transformers). See the -[usage documentation](/usage/embeddings-transformers) for how to integrate the -architectures into your training config. +[usage documentation](/usage/embeddings-transformers#transformers) for how to +integrate the architectures into your training config. + + + +Note that in order to use these architectures in your config, you need to +install the +[`spacy-transformers`](https://github.com/explosion/spacy-transformers). See the +[installation docs](/usage/embeddings-transformers#transformers-installation) +for details and system requirements. + + ### spacy-transformers.TransformerModel.v1 {#TransformerModel} @@ -311,7 +321,23 @@ architectures into your training config. > stride = 96 > ``` - +Load and wrap a transformer model from the +[HuggingFace `transformers`](https://huggingface.co/transformers) library. You +can any transformer that has pretrained weights and a PyTorch implementation. +The `name` variable is passed through to the underlying library, so it can be +either a string or a path. If it's a string, the pretrained weights will be +downloaded via the transformers library if they are not already available +locally. + +In order to support longer documents, the +[TransformerModel](/api/architectures#TransformerModel) layer allows you to pass +in a `get_spans` function that will divide up the [`Doc`](/api/doc) objects +before passing them through the transformer. Your spans are allowed to overlap +or exclude tokens. This layer is usually used directly by the +[`Transformer`](/api/transformer) component, which allows you to share the +transformer weights across your pipeline. For a layer that's configured for use +in other components, see +[Tok2VecTransformer](/api/architectures#Tok2VecTransformer). | Name | Description | | ------------------ | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | @@ -541,8 +567,6 @@ specific data and challenge. Stacked ensemble of a bag-of-words model and a neural network model. The neural network has an internal CNN Tok2Vec layer and uses attention. - - | Name | Description | | -------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | | `exclusive_classes` | Whether or not categories are mutually exclusive. ~~bool~~ | @@ -554,7 +578,7 @@ network has an internal CNN Tok2Vec layer and uses attention. | `ngram_size` | Determines the maximum length of the n-grams in the BOW model. For instance, `ngram_size=3`would give unigram, trigram and bigram features. ~~int~~ | | `dropout` | The dropout rate. ~~float~~ | | `nO` | Output dimension, determined by the number of different labels. 
If not set, the [`TextCategorizer`](/api/textcategorizer) component will set it when `begin_training` is called. ~~Optional[int]~~ | -| **CREATES** | The model using the architecture. ~~Model~~ | +| **CREATES** | The model using the architecture. ~~Model[List[Doc], Floats2d]~~ | ### spacy.TextCatCNN.v1 {#TextCatCNN} @@ -581,14 +605,12 @@ A neural network model where token vectors are calculated using a CNN. The vectors are mean pooled and used as features in a feed-forward network. This architecture is usually less accurate than the ensemble, but runs faster. - - | Name | Description | | ------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | | `exclusive_classes` | Whether or not categories are mutually exclusive. ~~bool~~ | | `tok2vec` | The [`tok2vec`](#tok2vec) layer of the model. ~~Model~~ | | `nO` | Output dimension, determined by the number of different labels. If not set, the [`TextCategorizer`](/api/textcategorizer) component will set it when `begin_training` is called. ~~Optional[int]~~ | -| **CREATES** | The model using the architecture. ~~Model~~ | +| **CREATES** | The model using the architecture. ~~Model[List[Doc], Floats2d]~~ | ### spacy.TextCatBOW.v1 {#TextCatBOW} @@ -606,15 +628,13 @@ architecture is usually less accurate than the ensemble, but runs faster. An ngram "bag-of-words" model. This architecture should run much faster than the others, but may not be as accurate, especially if texts are short. - - | Name | Description | | ------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | | `exclusive_classes` | Whether or not categories are mutually exclusive. ~~bool~~ | | `ngram_size` | Determines the maximum length of the n-grams in the BOW model. For instance, `ngram_size=3`would give unigram, trigram and bigram features. ~~int~~ | | `no_output_layer` | Whether or not to add an output layer to the model (`Softmax` activation if `exclusive_classes` is `True`, else `Logistic`. ~~bool~~ | | `nO` | Output dimension, determined by the number of different labels. If not set, the [`TextCategorizer`](/api/textcategorizer) component will set it when `begin_training` is called. ~~Optional[int]~~ | -| **CREATES** | The model using the architecture. ~~Model~~ | +| **CREATES** | The model using the architecture. ~~Model[List[Doc], Floats2d]~~ | ## Entity linking architectures {#entitylinker source="spacy/ml/models/entity_linker.py"} @@ -659,13 +679,11 @@ into the "real world". This requires 3 main components: The `EntityLinker` model architecture is a Thinc `Model` with a [`Linear`](https://thinc.ai/api-layers#linear) output layer. - - | Name | Description | | ----------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | | `tok2vec` | The [`tok2vec`](#tok2vec) layer of the model. ~~Model~~ | | `nO` | Output dimension, determined by the length of the vectors encoding each entity in the KB. If the `nO` dimension is not set, the entity linking component will set it when `begin_training` is called. ~~Optional[int]~~ | -| **CREATES** | The model using the architecture. 
~~Model~~ | +| **CREATES** | The model using the architecture. ~~Model[List[Doc], Floats2d]~~ | ### spacy.EmptyKB.v1 {#EmptyKB} diff --git a/website/docs/api/cli.md b/website/docs/api/cli.md index 9cadb2f0f..7ce95c019 100644 --- a/website/docs/api/cli.md +++ b/website/docs/api/cli.md @@ -123,7 +123,7 @@ $ python -m spacy init config [output_file] [--lang] [--pipeline] [--optimize] [ | Name | Description | | ------------------ | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| `output_file` | Path to output `.cfg` file. If not set, the config is written to stdout so you can pipe it forward to a file. ~~Path (positional)~~ | +| `output_file` | Path to output `.cfg` file or `-` to write the config to stdout (so you can pipe it forward to a file). Note that if you're writing to stdout, no additional logging info is printed. ~~Path (positional)~~ | | `--lang`, `-l` | Optional code of the [language](/usage/models#languages) to use. Defaults to `"en"`. ~~str (option)~~ | | `--pipeline`, `-p` | Comma-separated list of trainable [pipeline components](/usage/processing-pipelines#built-in) to include in the model. Defaults to `"tagger,parser,ner"`. ~~str (option)~~ | | `--optimize`, `-o` | `"efficiency"` or `"accuracy"`. Whether to optimize for efficiency (faster inference, smaller model, lower memory consumption) or higher accuracy (potentially larger and slower model). This will impact the choice of architecture, pretrained weights and related hyperparameters. Defaults to `"efficiency"`. ~~str (option)~~ | @@ -847,6 +847,92 @@ $ python -m spacy project run [subcommand] [project_dir] [--force] [--dry] | `--help`, `-h` | Show help message and available arguments. ~~bool (flag)~~ | | **EXECUTES** | The command defined in the `project.yml`. | +### project push {#project-push tag="command"} + +Upload all available files or directories listed as in the `outputs` section of +commands to a remote storage. Outputs are archived and compressed prior to +upload, and addressed in the remote storage using the output's relative path +(URL encoded), a hash of its command string and dependencies, and a hash of its +file contents. This means `push` should **never overwrite** a file in your +remote. If all the hashes match, the contents are the same and nothing happens. +If the contents are different, the new version of the file is uploaded. Deleting +obsolete files is left up to you. + +Remotes can be defined in the `remotes` section of the +[`project.yml`](/usage/projects#project-yml). Under the hood, spaCy uses the +[`smart-open`](https://github.com/RaRe-Technologies/smart_open) library to +communicate with the remote storages, so you can use any protocol that +`smart-open` supports, including [S3](https://aws.amazon.com/s3/), +[Google Cloud Storage](https://cloud.google.com/storage), SSH and more, although +you may need to install extra dependencies to use certain protocols. 
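+
+For example, with a `project.yml` along the following lines, running
+`spacy project push` would archive and upload the `training/model-best`
+directory to the remote named `default` (the command, names and paths here are
+only an illustration):
+
+```yaml
+### project.yml
+remotes:
+  default: 's3://my-spacy-bucket'
+commands:
+  - name: train
+    script:
+      - 'python -m spacy train configs/config.cfg --output training/'
+    deps:
+      - 'configs/config.cfg'
+      - 'corpus/train.spacy'
+    outputs:
+      - 'training/model-best'
+```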
+ +```cli +$ python -m spacy project push [remote] [project_dir] +``` + +> #### Example +> +> ```cli +> $ python -m spacy project push my_bucket +> ``` +> +> ```yaml +> ### project.yml +> remotes: +> my_bucket: 's3://my-spacy-bucket' +> ``` + +| Name | Description | +| -------------- | --------------------------------------------------------------------------------------- | +| `remote` | The name of the remote to upload to. Defaults to `"default"`. ~~str (positional)~~ | +| `project_dir` | Path to project directory. Defaults to current working directory. ~~Path (positional)~~ | +| `--help`, `-h` | Show help message and available arguments. ~~bool (flag)~~ | +| **UPLOADS** | All project outputs that exist and are not already stored in the remote. | + +### project pull {#project-pull tag="command"} + +Download all files or directories listed as `outputs` for commands, unless they +are not already present locally. When searching for files in the remote, `pull` +won't just look at the output path, but will also consider the **command +string** and the **hashes of the dependencies**. For instance, let's say you've +previously pushed a model checkpoint to the remote, but now you've changed some +hyper-parameters. Because you've changed the inputs to the command, if you run +`pull`, you won't retrieve the stale result. If you train your model and push +the outputs to the remote, the outputs will be saved alongside the prior +outputs, so if you change the config back, you'll be able to fetch back the +result. + +Remotes can be defined in the `remotes` section of the +[`project.yml`](/usage/projects#project-yml). Under the hood, spaCy uses the +[`smart-open`](https://github.com/RaRe-Technologies/smart_open) library to +communicate with the remote storages, so you can use any protocol that +`smart-open` supports, including [S3](https://aws.amazon.com/s3/), +[Google Cloud Storage](https://cloud.google.com/storage), SSH and more, although +you may need to install extra dependencies to use certain protocols. + +```cli +$ python -m spacy project pull [remote] [project_dir] +``` + +> #### Example +> +> ```cli +> $ python -m spacy project pull my_bucket +> ``` +> +> ```yaml +> ### project.yml +> remotes: +> my_bucket: 's3://my-spacy-bucket' +> ``` + +| Name | Description | +| -------------- | --------------------------------------------------------------------------------------- | +| `remote` | The name of the remote to download from. Defaults to `"default"`. ~~str (positional)~~ | +| `project_dir` | Path to project directory. Defaults to current working directory. ~~Path (positional)~~ | +| `--help`, `-h` | Show help message and available arguments. ~~bool (flag)~~ | +| **DOWNLOADS** | All project outputs that do not exist locally and can be found in the remote. | + ### project dvc {#project-dvc tag="command"} Auto-generate [Data Version Control](https://dvc.org) (DVC) config file. Calls diff --git a/website/docs/api/data-formats.md b/website/docs/api/data-formats.md index 8b67aa263..727c0f35c 100644 --- a/website/docs/api/data-formats.md +++ b/website/docs/api/data-formats.md @@ -127,26 +127,24 @@ $ python -m spacy train config.cfg --paths.train ./corpus/train.spacy This section defines settings and controls for the training and evaluation process that are used when you run [`spacy train`](/api/cli#train). 
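+
+For example, a minimal `[training]` block that just makes a few of the
+documented defaults explicit (see the table below for all available settings)
+could look like this:
+
+```ini
+[training]
+seed = ${system.seed}
+dropout = 0.1
+accumulate_gradient = 1
+patience = 1600
+max_epochs = 0
+max_steps = 20000
+eval_frequency = 200
+```
+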
- - | Name | Description | | --------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | -| `seed` | The random seed. Defaults to variable `${system.seed}`. ~~int~~ | -| `dropout` | The dropout rate. Defaults to `0.1`. ~~float~~ | | `accumulate_gradient` | Whether to divide the batch up into substeps. Defaults to `1`. ~~int~~ | +| `batcher` | Callable that takes an iterator of [`Doc`](/api/doc) objects and yields batches of `Doc`s. Defaults to [`batch_by_words`](/api/top-level#batch_by_words). ~~Callable[[Iterator[Doc], Iterator[List[Doc]]]]~~ | +| `dev_corpus` | Callable that takes the current `nlp` object and yields [`Example`](/api/example) objects. Defaults to [`Corpus`](/api/corpus). ~~Callable[[Language], Iterator[Example]]~~ | +| `dropout` | The dropout rate. Defaults to `0.1`. ~~float~~ | +| `eval_frequency` | How often to evaluate during training (steps). Defaults to `200`. ~~int~~ | +| `frozen_components` | Pipeline component names that are "frozen" and shouldn't be updated during training. See [here](/usage/training#config-components) for details. Defaults to `[]`. ~~List[str]~~ | | `init_tok2vec` | Optional path to pretrained tok2vec weights created with [`spacy pretrain`](/api/cli#pretrain). Defaults to variable `${paths.init_tok2vec}`. ~~Optional[str]~~ | -| `raw_text` | Optional path to a jsonl file with unlabelled text documents for a [rehearsal](/api/language#rehearse) step. Defaults to variable `${paths.raw}`. ~~Optional[str]~~ | -| `vectors` | Model name or path to model containing pretrained word vectors to use, e.g. created with [`init model`](/api/cli#init-model). Defaults to `null`. ~~Optional[str]~~ | -| `patience` | How many steps to continue without improvement in evaluation score. Defaults to `1600`. ~~int~~ | | `max_epochs` | Maximum number of epochs to train for. Defaults to `0`. ~~int~~ | | `max_steps` | Maximum number of update steps to train for. Defaults to `20000`. ~~int~~ | -| `eval_frequency` | How often to evaluate during training (steps). Defaults to `200`. ~~int~~ | -| `score_weights` | Score names shown in metrics mapped to their weight towards the final weighted score. See [here](/usage/training#metrics) for details. Defaults to `{}`. ~~Dict[str, float]~~ | -| `frozen_components` | Pipeline component names that are "frozen" and shouldn't be updated during training. See [here](/usage/training#config-components) for details. Defaults to `[]`. ~~List[str]~~ | -| `train_corpus` | Callable that takes the current `nlp` object and yields [`Example`](/api/example) objects. Defaults to [`Corpus`](/api/corpus). ~~Callable[[Language], Iterator[Example]]~~ | -| `dev_corpus` | Callable that takes the current `nlp` object and yields [`Example`](/api/example) objects. Defaults to [`Corpus`](/api/corpus). ~~Callable[[Language], Iterator[Example]]~~ | -| `batcher` | Callable that takes an iterator of [`Doc`](/api/doc) objects and yields batches of `Doc`s. Defaults to [`batch_by_words`](/api/top-level#batch_by_words). ~~Callable[[Iterator[Doc], Iterator[List[Doc]]]]~~ | | `optimizer` | The optimizer. The learning rate schedule and other settings can be configured as part of the optimizer. Defaults to [`Adam`](https://thinc.ai/docs/api-optimizers#adam). ~~Optimizer~~ | +| `patience` | How many steps to continue without improvement in evaluation score. Defaults to `1600`. 
~~int~~ | +| `raw_text` | Optional path to a jsonl file with unlabelled text documents for a [rehearsal](/api/language#rehearse) step. Defaults to variable `${paths.raw}`. ~~Optional[str]~~ | +| `score_weights` | Score names shown in metrics mapped to their weight towards the final weighted score. See [here](/usage/training#metrics) for details. Defaults to `{}`. ~~Dict[str, float]~~ | +| `seed` | The random seed. Defaults to variable `${system.seed}`. ~~int~~ | +| `train_corpus` | Callable that takes the current `nlp` object and yields [`Example`](/api/example) objects. Defaults to [`Corpus`](/api/corpus). ~~Callable[[Language], Iterator[Example]]~~ | +| `vectors` | Model name or path to model containing pretrained word vectors to use, e.g. created with [`init model`](/api/cli#init-model). Defaults to `null`. ~~Optional[str]~~ | ### pretraining {#config-pretraining tag="section,optional"} diff --git a/website/docs/api/morphology.md b/website/docs/api/morphology.md index 1b2e159d0..5d5324061 100644 --- a/website/docs/api/morphology.md +++ b/website/docs/api/morphology.md @@ -7,7 +7,7 @@ source: spacy/morphology.pyx Store the possible morphological analyses for a language, and index them by hash. To save space on each token, tokens only know the hash of their morphological analysis, so queries of morphological attributes are delegated to -this class. See [`MorphAnalysis`](/api/morphology#morphansalysis) for the +this class. See [`MorphAnalysis`](/api/morphology#morphanalysis) for the container storing a single morphological analysis. ## Morphology.\_\_init\_\_ {#init tag="method"} diff --git a/website/docs/api/token.md b/website/docs/api/token.md index 4a8e6eba7..0860797aa 100644 --- a/website/docs/api/token.md +++ b/website/docs/api/token.md @@ -450,8 +450,8 @@ The L2 norm of the token's vector representation. | `pos_` | Coarse-grained part-of-speech from the [Universal POS tag set](https://universaldependencies.org/docs/u/pos/). ~~str~~ | | `tag` | Fine-grained part-of-speech. ~~int~~ | | `tag_` | Fine-grained part-of-speech. ~~str~~ | -| `morph` | Morphological analysis. ~~MorphAnalysis~~ | -| `morph_` | Morphological analysis in the Universal Dependencies [FEATS]https://universaldependencies.org/format.html#morphological-annotation format. ~~str~~ | +| `morph` 3 | Morphological analysis. ~~MorphAnalysis~~ | +| `morph_` 3 | Morphological analysis in the Universal Dependencies [FEATS]https://universaldependencies.org/format.html#morphological-annotation format. ~~str~~ | | `dep` | Syntactic dependency relation. ~~int~~ | | `dep_` | Syntactic dependency relation. ~~str~~ | | `lang` | Language of the parent document's vocabulary. ~~int~~ | diff --git a/website/docs/api/top-level.md b/website/docs/api/top-level.md index 61fca6ec5..797fa0191 100644 --- a/website/docs/api/top-level.md +++ b/website/docs/api/top-level.md @@ -257,7 +257,7 @@ If a setting is not present in the options, the default value will be used. | Name | Description | | --------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | | `ents` | Entity types to highlight or `None` for all types (default). ~~Optional[List[str]]~~ | -| `colors` | Color overrides. Entity types in uppercase should be mapped to color names or values. ~~Dict[str, str]~~ | +| `colors` | Color overrides. 
Entity types should be mapped to color names or values. ~~Dict[str, str]~~ | | `template` 2.2 | Optional template to overwrite the HTML used to render entity spans. Should be a format string and can use `{bg}`, `{text}` and `{label}`. See [`templates.py`](https://github.com/explosion/spaCy/blob/master/spacy/displacy/templates.py) for examples. ~~Optional[str]~~ | By default, displaCy comes with colors for all entity types used by @@ -299,20 +299,20 @@ factories. | Registry name | Description | | ----------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | | `architectures` | Registry for functions that create [model architectures](/api/architectures). Can be used to register custom model architectures and reference them in the `config.cfg`. | -| `factories` | Registry for functions that create [pipeline components](/usage/processing-pipelines#custom-components). Added automatically when you use the `@spacy.component` decorator and also reads from [entry points](/usage/saving-loading#entry-points). | -| `tokenizers` | Registry for tokenizer factories. Registered functions should return a callback that receives the `nlp` object and returns a [`Tokenizer`](/api/tokenizer) or a custom callable. | -| `languages` | Registry for language-specific `Language` subclasses. Automatically reads from [entry points](/usage/saving-loading#entry-points). | -| `lookups` | Registry for large lookup tables available via `vocab.lookups`. | -| `displacy_colors` | Registry for custom color scheme for the [`displacy` NER visualizer](/usage/visualizers). Automatically reads from [entry points](/usage/saving-loading#entry-points). | | `assets` | Registry for data assets, knowledge bases etc. | -| `callbacks` | Registry for custom callbacks to [modify the `nlp` object](/usage/training#custom-code-nlp-callbacks) before training. | -| `readers` | Registry for training and evaluation data readers like [`Corpus`](/api/corpus). | | `batchers` | Registry for training and evaluation [data batchers](#batchers). | -| `optimizers` | Registry for functions that create [optimizers](https://thinc.ai/docs/api-optimizers). | -| `schedules` | Registry for functions that create [schedules](https://thinc.ai/docs/api-schedules). | -| `layers` | Registry for functions that create [layers](https://thinc.ai/docs/api-layers). | -| `losses` | Registry for functions that create [losses](https://thinc.ai/docs/api-loss). | +| `callbacks` | Registry for custom callbacks to [modify the `nlp` object](/usage/training#custom-code-nlp-callbacks) before training. | +| `displacy_colors` | Registry for custom color scheme for the [`displacy` NER visualizer](/usage/visualizers). Automatically reads from [entry points](/usage/saving-loading#entry-points). | +| `factories` | Registry for functions that create [pipeline components](/usage/processing-pipelines#custom-components). Added automatically when you use the `@spacy.component` decorator and also reads from [entry points](/usage/saving-loading#entry-points). | | `initializers` | Registry for functions that create [initializers](https://thinc.ai/docs/api-initializers). | +| `languages` | Registry for language-specific `Language` subclasses. Automatically reads from [entry points](/usage/saving-loading#entry-points). 
| +| `layers` | Registry for functions that create [layers](https://thinc.ai/docs/api-layers). | +| `lookups` | Registry for large lookup tables available via `vocab.lookups`. | +| `losses` | Registry for functions that create [losses](https://thinc.ai/docs/api-loss). | +| `optimizers` | Registry for functions that create [optimizers](https://thinc.ai/docs/api-optimizers). | +| `readers` | Registry for training and evaluation data readers like [`Corpus`](/api/corpus). | +| `schedules` | Registry for functions that create [schedules](https://thinc.ai/docs/api-schedules). | +| `tokenizers` | Registry for tokenizer factories. Registered functions should return a callback that receives the `nlp` object and returns a [`Tokenizer`](/api/tokenizer) or a custom callable. | ### spacy-transformers registry {#registry-transformers} @@ -632,6 +632,23 @@ validate its contents. | `path` | Path to the model's `meta.json`. ~~Union[str, Path]~~ | | **RETURNS** | The model's meta data. ~~Dict[str, Any]~~ | +### util.get_installed_models {#util.get_installed_models tag="function" new="3"} + +List all model packages installed in the current environment. This will include +any spaCy model that was packaged with [`spacy package`](/api/cli#package). +Under the hood, model packages expose a Python entry point that spaCy can check, +without having to load the model. + +> #### Example +> +> ```python +> model_names = util.get_installed_models() +> ``` + +| Name | Description | +| ----------- | ---------------------------------------------------------------------------------- | +| **RETURNS** | The string names of the models installed in the current environment. ~~List[str]~~ | + ### util.is_package {#util.is_package tag="function"} Check if string maps to a package installed via pip. Mainly used to validate diff --git a/website/docs/images/layers-architectures.svg b/website/docs/images/layers-architectures.svg new file mode 100644 index 000000000..22e705ba1 --- /dev/null +++ b/website/docs/images/layers-architectures.svg @@ -0,0 +1,97 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + diff --git a/website/docs/images/projects.svg b/website/docs/images/projects.svg new file mode 100644 index 000000000..8de5f9ef6 --- /dev/null +++ b/website/docs/images/projects.svg @@ -0,0 +1,91 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + diff --git a/website/docs/images/thinc_mypy.jpg b/website/docs/images/thinc_mypy.jpg new file mode 100644 index 000000000..c0f7ee636 Binary files /dev/null and b/website/docs/images/thinc_mypy.jpg differ diff --git a/website/docs/usage/embeddings-transformers.md b/website/docs/usage/embeddings-transformers.md index c2727f5b1..7648a5d45 100644 --- a/website/docs/usage/embeddings-transformers.md +++ b/website/docs/usage/embeddings-transformers.md @@ -9,7 +9,24 @@ menu: next: /usage/training --- - +spaCy supports a number of **transfer and multi-task learning** workflows that +can often help improve your pipeline's efficiency or accuracy. Transfer learning +refers to techniques such as word vector tables and language model pretraining. +These techniques can be used to import knowledge from raw text into your +pipeline, so that your models are able to generalize better from your annotated +examples. 
+ +You can convert **word vectors** from popular tools like +[FastText](https://fasttext.cc) and [Gensim](https://radimrehurek.com/gensim), +or you can load in any pretrained **transformer model** if you install +[`spacy-transformers`](https://github.com/explosion/spacy-transformers). You can +also do your own language model pretraining via the +[`spacy pretrain`](/api/cli#pretrain) command. You can even **share** your +transformer or other contextual embedding model across multiple components, +which can make long pipelines several times more efficient. To use transfer +learning, you'll need at least a few annotated examples for what you're trying +to predict. Otherwise, you could try using a "one-shot learning" approach using +[vectors and similarity](/usage/linguistic-features#vectors-similarity). @@ -53,19 +70,46 @@ of performance. ## Shared embedding layers {#embedding-layers} - +spaCy lets you share a single transformer or other token-to-vector ("tok2vec") +embedding layer between multiple components. You can even update the shared +layer, performing **multi-task learning**. Reusing the tok2vec layer between +components can make your pipeline run a lot faster and result in much smaller +models. However, it can make the pipeline less modular and make it more +difficult to swap components or retrain parts of the pipeline. Multi-task +learning can affect your accuracy (either positively or negatively), and may +require some retuning of your hyper-parameters. ![Pipeline components using a shared embedding component vs. independent embedding layers](../images/tok2vec.svg) | Shared | Independent | | ------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------- | | βœ… **smaller:** models only need to include a single copy of the embeddings | ❌ **larger:** models need to include the embeddings for each component | -| βœ… **faster:** | ❌ **slower:** | +| βœ… **faster:** embed the documents once for your whole pipeline | ❌ **slower:** rerun the embedding for each component | | ❌ **less composable:** all components require the same embedding component in the pipeline | βœ… **modular:** components can be moved and swapped freely | +You can share a single transformer or other tok2vec model between multiple +components by adding a [`Transformer`](/api/transformer) or +[`Tok2Vec`](/api/tok2vec) component near the start of your pipeline. Components +later in the pipeline can "connect" to it by including a **listener layer** like +[Tok2VecListener](/api/architectures#Tok2VecListener) within their model. + ![Pipeline components listening to shared embedding component](../images/tok2vec-listener.svg) - +At the beginning of training, the [`Tok2Vec`](/api/tok2vec) component will grab +a reference to the relevant listener layers in the rest of your pipeline. When +it processes a batch of documents, it will pass forward its predictions to the +listeners, allowing the listeners to **reuse the predictions** when they are +eventually called. A similar mechanism is used to pass gradients from the +listeners back to the model. The [`Transformer`](/api/transformer) component and +[TransformerListener](/api/architectures#TransformerListener) layer do the same +thing for transformer models, but the `Transformer` component will also save the +transformer outputs to the +[`Doc._.trf_data`](/api/transformer#custom_attributes) extension attribute, +giving you access to them after the pipeline has finished running. 
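As a minimal sketch, once a pipeline containing a `Transformer` component has processed a text, you can inspect those saved outputs directly. The package name below is a placeholder for whatever transformer-based pipeline you have installed via `spacy-transformers`:

```python
import spacy

# Placeholder name – substitute a transformer-based pipeline you actually have
# installed (requires the spacy-transformers package).
nlp = spacy.load("en_core_trf_lg_sm")

doc = nlp("Shared embeddings are computed once and reused by later components.")

# The Transformer component stores its raw outputs on this extension attribute,
# so they're still available after the pipeline has finished running.
print(doc._.trf_data)
```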
+ + + + ## Using transformer models {#transformers} @@ -180,7 +224,7 @@ yourself. For details on how to get started with training your own model, check out the [training quickstart](/usage/training#quickstart). + +> #### Example +> +> ```python +> @spacy.registry.architectures.register("spacy.Tagger.v1") +> def build_tagger_model( +> tok2vec: Model[List[Doc], List[Floats2d]], nO: Optional[int] = None +> ) -> Model[List[Doc], List[Floats2d]]: +> t2v_width = tok2vec.get_dim("nO") if tok2vec.has_dim("nO") else None +> output_layer = Softmax(nO, t2v_width, init_W=zero_init) +> softmax = with_array(output_layer) +> model = chain(tok2vec, softmax) +> model.set_ref("tok2vec", tok2vec) +> model.set_ref("softmax", output_layer) +> model.set_ref("output_layer", output_layer) +> return model +> ``` + +​ The Thinc `Model` class is a **generic type** that can specify its input and +output types. Python uses a square-bracket notation for this, so the type +~~Model[List, Dict]~~ says that each batch of inputs to the model will be a +list, and the outputs will be a dictionary. Both `typing.List` and `typing.Dict` +are also generics, allowing you to be more specific about the data. For +instance, you can write ~~Model[List[Doc], Dict[str, float]]~~ to specify that +the model expects a list of [`Doc`](/api/doc) objects as input, and returns a +dictionary mapping strings to floats. Some of the most common types you'll see +are: ​ + +| Type | Description | +| ------------------ | ---------------------------------------------------------------------------------------------------- | +| ~~List[Doc]~~ | A batch of [`Doc`](/api/doc) objects. Most components expect their models to take this as input. | +| ~~Floats2d~~ | A two-dimensional `numpy` or `cupy` array of floats. Usually 32-bit. | +| ~~Ints2d~~ | A two-dimensional `numpy` or `cupy` array of integers. Common dtypes include uint64, int32 and int8. | +| ~~List[Floats2d]~~ | A list of two-dimensional arrays, generally with one array per `Doc` and one row per token. | +| ~~Ragged~~ | A container to handle variable-length sequence data in an unpadded contiguous array. | +| ~~Padded~~ | A container to handle variable-length sequence data in a passed contiguous array. | + +The model type signatures help you figure out which model architectures and +components can **fit together**. For instance, the +[`TextCategorizer`](/api/textcategorizer) class expects a model typed +~~Model[List[Doc], Floats2d]~~, because the model will predict one row of +category probabilities per [`Doc`](/api/doc). In contrast, the +[`Tagger`](/api/tagger) class expects a model typed ~~Model[List[Doc], +List[Floats2d]]~~, because it needs to predict one row of probabilities per +token. + +There's no guarantee that two models with the same type signature can be used +interchangeably. There are many other ways they could be incompatible. However, +if the types don't match, they almost surely _won't_ be compatible. This little +bit of validation goes a long way, especially if you +[configure your editor](https://thinc.ai/docs/usage-type-checking) or other +tools to highlight these errors early. Thinc will also verify that your types +match correctly when your config file is processed at the beginning of training. + + + +If you're using a modern editor like Visual Studio Code, you can +[set up `mypy`](https://thinc.ai/docs/usage-type-checking#install) with the +custom Thinc plugin and get live feedback about mismatched types as you write +code. 
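For example, here's a deliberately mismatched annotation – a sketch assuming Thinc's built-in `Relu` and `Softmax` layers – that a type checker set up this way should flag, because `chain` returns a differently typed model than the annotation claims:

```python
from typing import List

from spacy.tokens import Doc
from thinc.api import Model, Relu, Softmax, chain
from thinc.types import Floats2d

# chain(Relu, Softmax) produces a Model[Floats2d, Floats2d], so annotating the
# result as taking a batch of Docs is a mismatch that static type checking can
# surface long before you start training.
bad_model: Model[List[Doc], Floats2d] = chain(Relu(nO=64), Softmax())
```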
+ +[![](../images/thinc_mypy.jpg)](https://thinc.ai/docs/usage-type-checking#linting) + + + +## Defining sublayers {#sublayers} + +​ Model architecture functions often accept **sublayers as arguments**, so that +you can try **substituting a different layer** into the network. Depending on +how the architecture function is structured, you might be able to define your +network structure entirely through the [config system](/usage/training#config), +using layers that have already been defined. ​The +[transformers documentation](/usage/embeddings-transformers#transformers) +section shows a common example of swapping in a different sublayer. + +In most neural network models for NLP, the most important parts of the network +are what we refer to as the +[embed and encode](https://explosion.ai/blog/embed-encode-attend-predict) steps. +These steps together compute dense, context-sensitive representations of the +tokens. Most of spaCy's default architectures accept a +[`tok2vec` embedding layer](/api/architectures#tok2vec-arch) as an argument, so +you can control this important part of the network separately. This makes it +easy to **switch between** transformer, CNN, BiLSTM or other feature extraction +approaches. And if you want to define your own solution, all you need to do is +register a ~~Model[List[Doc], List[Floats2d]]~~ architecture function, and +you'll be able to try it out in any of spaCy components. ​ + + + +### Registering new architectures + +- Recap concept, link to config docs. ​ + +## Wrapping PyTorch, TensorFlow and other frameworks {#frameworks} + + + +Thinc allows you to wrap models written in other machine learning frameworks +like PyTorch, TensorFlow and MXNet using a unified +[`Model`](https://thinc.ai/docs/api-model) API. As well as **wrapping whole +models**, Thinc lets you call into an external framework for just **part of your +model**: you can have a model where you use PyTorch just for the transformer +layers, using "native" Thinc layers to do fiddly input and output +transformations and add on task-specific "heads", as efficiency is less of a +consideration for those parts of the network. + +Thinc uses a special class, [`Shim`](https://thinc.ai/docs/api-model#shim), to +hold references to external objects. This allows each wrapper space to define a +custom type, with whatever attributes and methods are helpful, to assist in +managing the communication between Thinc and the external library. The +[`Model`](https://thinc.ai/docs/api-model#model) class holds `shim` instances in +a separate list, and communicates with the shims about updates, serialization, +changes of device, etc. + +The wrapper will receive each batch of inputs, convert them into a suitable form +for the underlying model instance, and pass them over to the shim, which will +**manage the actual communication** with the model. The output is then passed +back into the wrapper, and converted for use in the rest of the network. The +equivalent procedure happens during backpropagation. Array conversion is handled +via the [DLPack](https://github.com/dmlc/dlpack) standard wherever possible, so +that data can be passed between the frameworks **without copying the data back** +to the host device unnecessarily. 
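Here's a minimal sketch of the wrapping step, assuming PyTorch and Thinc are installed. A `torch.nn` module is wrapped and then combined with native Thinc layers like any other `Model`:

```python
import torch.nn
from thinc.api import Linear, PyTorchWrapper, chain

# A small PyTorch module, wrapped so that Thinc can call into it. The wrapper
# and its shim handle converting arrays in and out (via DLPack where possible).
torch_block = torch.nn.Sequential(
    torch.nn.Linear(32, 32),
    torch.nn.ReLU(),
    torch.nn.Dropout(0.2),
)
wrapped = PyTorchWrapper(torch_block)

# The wrapped layer composes with native Thinc layers like any other Model.
model = chain(wrapped, Linear(nO=4, nI=32))
```

The table below lists the wrapper layers and shims available for each framework.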
+ +| Framework | Wrapper layer | Shim | DLPack | +| -------------- | ------------------------------------------------------------------------- | --------------------------------------------------------- | --------------- | +| **PyTorch** | [`PyTorchWrapper`](https://thinc.ai/docs/api-layers#pytorchwrapper) | [`PyTorchShim`](https://thinc.ai/docs/api-model#shims) | βœ… | +| **TensorFlow** | [`TensorFlowWrapper`](https://thinc.ai/docs/api-layers#tensorflowwrapper) | [`TensorFlowShim`](https://thinc.ai/docs/api-model#shims) | ❌ 1 | +| **MXNet** | [`MXNetWrapper`](https://thinc.ai/docs/api-layers#mxnetwrapper) | [`MXNetShim`](https://thinc.ai/docs/api-model#shims) | βœ… | + +1. DLPack support in TensorFlow is now + [available](<(https://github.com/tensorflow/tensorflow/issues/24453)>) but + still experimental. + + + +## Models for trainable components {#components} + +- Interaction with `predict`, `get_loss` and `set_annotations` +- Initialization life-cycle with `begin_training`. +- Link to relation extraction notebook. + +```python +def update(self, examples): + docs = [ex.predicted for ex in examples] + refs = [ex.reference for ex in examples] + predictions, backprop = self.model.begin_update(docs) + gradient = self.get_loss(predictions, refs) + backprop(gradient) + +def __call__(self, doc): + predictions = self.model([doc]) + self.set_annotations(predictions) +``` diff --git a/website/docs/usage/linguistic-features.md b/website/docs/usage/linguistic-features.md index 3aa0df7b4..f2ec48d63 100644 --- a/website/docs/usage/linguistic-features.md +++ b/website/docs/usage/linguistic-features.md @@ -429,7 +429,7 @@ nlp = spacy.load("en_core_web_sm") doc = nlp("fb is hiring a new vice president of global policy") ents = [(e.text, e.start_char, e.end_char, e.label_) for e in doc.ents] print('Before', ents) -# the model didn't recognise "fb" as an entity :( +# The model didn't recognize "fb" as an entity :( fb_ent = Span(doc, 0, 1, label="ORG") # create a Span for the new entity doc.ents = list(doc.ents) + [fb_ent] @@ -558,11 +558,11 @@ import spacy nlp = spacy.load("my_custom_el_model") doc = nlp("Ada Lovelace was born in London") -# document level +# Document level ents = [(e.text, e.label_, e.kb_id_) for e in doc.ents] print(ents) # [('Ada Lovelace', 'PERSON', 'Q7259'), ('London', 'GPE', 'Q84')] -# token level +# Token level ent_ada_0 = [doc[0].text, doc[0].ent_type_, doc[0].ent_kb_id_] ent_ada_1 = [doc[1].text, doc[1].ent_type_, doc[1].ent_kb_id_] ent_london_5 = [doc[5].text, doc[5].ent_type_, doc[5].ent_kb_id_] @@ -914,12 +914,12 @@ from spacy.lang.char_classes import ALPHA, ALPHA_LOWER, ALPHA_UPPER from spacy.lang.char_classes import CONCAT_QUOTES, LIST_ELLIPSES, LIST_ICONS from spacy.util import compile_infix_regex -# default tokenizer +# Default tokenizer nlp = spacy.load("en_core_web_sm") doc = nlp("mother-in-law") print([t.text for t in doc]) # ['mother', '-', 'in', '-', 'law'] -# modify tokenizer infix patterns +# Modify tokenizer infix patterns infixes = ( LIST_ELLIPSES + LIST_ICONS @@ -929,8 +929,8 @@ infixes = ( al=ALPHA_LOWER, au=ALPHA_UPPER, q=CONCAT_QUOTES ), r"(?<=[{a}]),(?=[{a}])".format(a=ALPHA), - # EDIT: commented out regex that splits on hyphens between letters: - #r"(?<=[{a}])(?:{h})(?=[{a}])".format(a=ALPHA, h=HYPHENS), + # βœ… Commented out regex that splits on hyphens between letters: + # r"(?<=[{a}])(?:{h})(?=[{a}])".format(a=ALPHA, h=HYPHENS), r"(?<=[{a}0-9])[:<>=/](?=[{a}])".format(a=ALPHA), ] ) @@ -1550,7 +1550,7 @@ import Vectors101 from 
'usage/101/\_vectors-similarity.md' ### Adding word vectors {#adding-vectors} Custom word vectors can be trained using a number of open-source libraries, such -as [Gensim](https://radimrehurek.com/gensim), [Fast Text](https://fasttext.cc), +as [Gensim](https://radimrehurek.com/gensim), [FastText](https://fasttext.cc), or Tomas Mikolov's original [Word2vec implementation](https://code.google.com/archive/p/word2vec/). Most word vector libraries output an easy-to-read text-based format, where each line diff --git a/website/docs/usage/processing-pipelines.md b/website/docs/usage/processing-pipelines.md index bc8c990e8..614f113b3 100644 --- a/website/docs/usage/processing-pipelines.md +++ b/website/docs/usage/processing-pipelines.md @@ -108,11 +108,11 @@ class, or defined within a [model package](/usage/saving-loading#models). > > [components.tagger] > factory = "tagger" -> # settings for the tagger component +> # Settings for the tagger component > > [components.parser] > factory = "parser" -> # settings for the parser component +> # Settings for the parser component > ``` When you load a model, spaCy first consults the model's @@ -171,11 +171,11 @@ lang = "en" pipeline = ["tagger", "parser", "ner"] data_path = "path/to/en_core_web_sm/en_core_web_sm-2.0.0" -cls = spacy.util.get_lang_class(lang) # 1. Get Language instance, e.g. English() -nlp = cls() # 2. Initialize it +cls = spacy.util.get_lang_class(lang) # 1. Get Language class, e.g. English +nlp = cls() # 2. Initialize it for name in pipeline: - nlp.add_pipe(name) # 3. Add the component to the pipeline -nlp.from_disk(model_data_path) # 4. Load in the binary data + nlp.add_pipe(name) # 3. Add the component to the pipeline +nlp.from_disk(model_data_path) # 4. Load in the binary data ``` When you call `nlp` on a text, spaCy will **tokenize** it and then **call each @@ -187,9 +187,9 @@ which is then processed by the component next in the pipeline. ```python ### The pipeline under the hood -doc = nlp.make_doc("This is a sentence") # create a Doc from raw text -for name, proc in nlp.pipeline: # iterate over components in order - doc = proc(doc) # apply each component +doc = nlp.make_doc("This is a sentence") # Create a Doc from raw text +for name, proc in nlp.pipeline: # Iterate over components in order + doc = proc(doc) # Apply each component ``` The current processing pipeline is available as `nlp.pipeline`, which returns a @@ -265,7 +265,7 @@ for doc in nlp.pipe(texts, disable=["tagger", "parser"]): If you need to **execute more code** with components disabled – e.g. to reset the weights or update only some components during training – you can use the -[`nlp.select_pipes`](/api/language#select_pipes) contextmanager. At the end of +[`nlp.select_pipes`](/api/language#select_pipes) context manager. At the end of the `with` block, the disabled pipeline components will be restored automatically. Alternatively, `select_pipes` returns an object that lets you call its `restore()` method to restore the disabled components when needed. This @@ -274,7 +274,7 @@ blocks. ```python ### Disable for block -# 1. Use as a contextmanager +# 1. Use as a context manager with nlp.select_pipes(disable=["tagger", "parser"]): doc = nlp("I won't be tagged and parsed") doc = nlp("I will be tagged and parsed") @@ -473,7 +473,7 @@ only being able to modify it afterwards. 
> > @Language.component("my_component") > def my_component(doc): -> # do something to the doc here +> # Do something to the doc here > return doc > ``` diff --git a/website/docs/usage/projects.md b/website/docs/usage/projects.md index 30e4394d1..1aaaeb3af 100644 --- a/website/docs/usage/projects.md +++ b/website/docs/usage/projects.md @@ -5,9 +5,12 @@ menu: - ['Intro & Workflow', 'intro'] - ['Directory & Assets', 'directory'] - ['Custom Projects', 'custom'] + - ['Remote Storage', 'remote'] - ['Integrations', 'integrations'] --- +## Introduction and workflow {#intro hidden="true"} + > #### πŸͺ Project templates > > Our [`projects`](https://github.com/explosion/projects) repo includes various @@ -19,20 +22,17 @@ spaCy projects let you manage and share **end-to-end spaCy workflows** for different **use cases and domains**, and orchestrate training, packaging and serving your custom models. You can start off by cloning a pre-defined project template, adjust it to fit your needs, load in your data, train a model, export -it as a Python package and share the project templates with your team. spaCy -projects can be used via the new [`spacy project`](/api/cli#project) command. -For an overview of the available project templates, check out the -[`projects`](https://github.com/explosion/projects) repo. spaCy projects also -[integrate](#integrations) with many other cool machine learning and data -science tools to track and manage your data and experiments, iterate on demos -and prototypes and ship your models into production. +it as a Python package, upload your outputs to a remote storage and share your +results with your team. spaCy projects can be used via the new +[`spacy project`](/api/cli#project) command and we provide templates in our +[`projects`](https://github.com/explosion/projects) repo. -## Introduction and workflow {#intro} - +![Illustration of project workflow and commands](../images/projects.svg) + ```yaml ### project.yml -variables: - BATCH_SIZE: 128 +vars: + batch_size: 128 commands: - name: evaluate script: - - 'python scripts/custom_evaluation.py {BATCH_SIZE} ./training/model-best ./corpus/eval.json' + - 'python scripts/custom_evaluation.py ${batch_size} ./training/model-best ./corpus/eval.json' deps: - 'training/model-best' - 'corpus/eval.json' @@ -421,6 +446,114 @@ assets: checksum: '5113dc04e03f079525edd8df3f4f39e3' ``` +## Remote Storage {#remote} + +You can persist your project outputs to a remote storage using the +[`project push`](/api/cli#project-push) command. This can help you **export** +your model packages, **share** work with your team, or **cache results** to +avoid repeating work. The [`project pull`](/api/cli#project-pull) command will +download any outputs that are in the remote storage and aren't available +locally. + +You can list one or more remotes in the `remotes` section of your +[`project.yml`](#project-yml) by mapping a string name to the URL of the +storage. Under the hood, spaCy uses the +[`smart-open`](https://github.com/RaRe-Technologies/smart_open) library to +communicate with the remote storages, so you can use any protocol that +`smart-open` supports, including [S3](https://aws.amazon.com/s3/), +[Google Cloud Storage](https://cloud.google.com/storage), SSH and more, although +you may need to install extra dependencies to use certain protocols. 
+ +> #### Example +> +> ```cli +> $ python -m spacy project pull local +> ``` + +```yaml +### project.yml +remotes: + default: 's3://my-spacy-bucket' + local: '/mnt/scratch/cache' + stuff: 'ssh://myserver.example.com/whatever' +``` + + + +Inside the remote storage, spaCy uses a clever **directory structure** to avoid +overwriting files. The top level of the directory structure is a URL-encoded +version of the output's path. Within this directory are subdirectories named +according to a hash of the command string and the command's dependencies. +Finally, within those directories are files, named according to an MD5 hash of +their contents. + + + + +```yaml +└── urlencoded_file_path # Path of original file + β”œβ”€β”€ some_command_hash # Hash of command you ran + β”‚ β”œβ”€β”€ some_content_hash # Hash of file content + β”‚ └── another_content_hash + └── another_command_hash + └── third_content_hash +``` + + + +For instance, let's say you had the following command in your `project.yml`: + +```yaml +### project.yml +- name: train + help: 'Train a spaCy model using the specified corpus and config' + script: + - 'spacy train ./config.cfg --output training/' + deps: + - 'corpus/train' + - 'corpus/dev' + - 'config.cfg' + outputs: + - 'training/model-best' +``` + +> #### Example +> +> ``` +> └── s3://my-spacy-bucket/training%2Fmodel-best +> └── 1d8cb33a06cc345ad3761c6050934a1b +> └── d8e20c3537a084c5c10d95899fe0b1ff +> ``` + +After you finish training, you run [`project push`](/api/cli#project-push) to +make sure the `training/model-best` output is saved to remote storage. spaCy +will then construct a hash from your command script and the listed dependencies, +`corpus/train`, `corpus/dev` and `config.cfg`, in order to identify the +execution context of your output. It would then compute an MD5 hash of the +`training/model-best` directory, and use those three pieces of information to +construct the storage URL. + +```cli +$ python -m spacy project run train +$ python -m spacy project push +``` + +If you change the command or one of its dependencies (for instance, by editing +the [`config.cfg`](/usage/training#config) file to tune the hyperparameters, a +different creation hash will be calculated, so when you use +[`project push`](/api/cli#project-push) you won't be overwriting your previous +file. The system even supports multiple outputs for the same file and the same +context, which can happen if your training process is not deterministic, or if +you have dependencies that aren't represented in the command. + +In summary, the [`spacy project`](/api/cli#project) remote storages are designed +to make a particular set of trade-offs. Priority is placed on **convenience**, +**correctness** and **avoiding data loss**. You can use +[`project push`](/api/cli#project-push) freely, as you'll never overwrite remote +state, and you don't have to come up with names or version numbers. However, +it's up to you to manage the size of your remote storage, and to remove files +that are no longer relevant to you. + ## Integrations {#integrations} ### Data Version Control (DVC) {#dvc} @@ -517,16 +650,17 @@ and evaluation set. 
```yaml ### project.yml -variables: - PRODIGY_DATASET: 'ner_articles' - PRODIGY_LABELS: 'PERSON,ORG,PRODUCT' - PRODIGY_MODEL: 'en_core_web_md' +vars: + prodigy: + dataset: 'ner_articles' + labels: 'PERSON,ORG,PRODUCT' + model: 'en_core_web_md' commands: - name: annotate - script: - - 'python -m prodigy ner.correct {PRODIGY_DATASET} ./assets/raw_data.jsonl {PRODIGY_MODEL} --labels {PRODIGY_LABELS}' - - 'python -m prodigy data-to-spacy ./corpus/train.json ./corpus/eval.json --ner {PRODIGY_DATASET}' + - 'python -m prodigy ner.correct ${vars.prodigy.dataset} ./assets/raw_data.jsonl ${vars.prodigy.model} --labels ${vars.prodigy.labels}' + - 'python -m prodigy data-to-spacy ./corpus/train.json ./corpus/eval.json --ner ${vars.prodigy.dataset}' - 'python -m spacy convert ./corpus/train.json ./corpus/train.spacy' - 'python -m spacy convert ./corpus/eval.json ./corpus/eval.spacy' - deps: diff --git a/website/docs/usage/rule-based-matching.md b/website/docs/usage/rule-based-matching.md index ce6625897..7fdce032e 100644 --- a/website/docs/usage/rule-based-matching.md +++ b/website/docs/usage/rule-based-matching.md @@ -511,21 +511,21 @@ from spacy.language import Language from spacy.matcher import Matcher from spacy.tokens import Token -# We're using a component factory because the component needs to be initialized -# with the shared vocab via the nlp object +# We're using a component factory because the component needs to be +# initialized with the shared vocab via the nlp object @Language.factory("html_merger") def create_bad_html_merger(nlp, name): - return BadHTMLMerger(nlp) + return BadHTMLMerger(nlp.vocab) class BadHTMLMerger: - def __init__(self, nlp): + def __init__(self, vocab): patterns = [ [{"ORTH": "<"}, {"LOWER": "br"}, {"ORTH": ">"}], [{"ORTH": "<"}, {"LOWER": "br/"}, {"ORTH": ">"}], ] # Register a new token extension to flag bad HTML Token.set_extension("bad_html", default=False) - self.matcher = Matcher(nlp.vocab) + self.matcher = Matcher(vocab) self.matcher.add("BAD_HTML", patterns) def __call__(self, doc): diff --git a/website/docs/usage/saving-loading.md b/website/docs/usage/saving-loading.md index b5e0f4370..3f9435f5e 100644 --- a/website/docs/usage/saving-loading.md +++ b/website/docs/usage/saving-loading.md @@ -243,7 +243,7 @@ file `data.json` in its subdirectory: ### Directory structure {highlight="2-3"} └── /path/to/model β”œβ”€β”€ my_component # data serialized by "my_component" - | └── data.json + β”‚ └── data.json β”œβ”€β”€ ner # data for "ner" component β”œβ”€β”€ parser # data for "parser" component β”œβ”€β”€ tagger # data for "tagger" component diff --git a/website/docs/usage/training.md b/website/docs/usage/training.md index 892fb7f48..59766bada 100644 --- a/website/docs/usage/training.md +++ b/website/docs/usage/training.md @@ -1,13 +1,12 @@ --- title: Training Models -next: /usage/projects +next: /usage/layers-architectures menu: - ['Introduction', 'basics'] - ['Quickstart', 'quickstart'] - ['Config System', 'config'] - ['Custom Functions', 'custom-functions'] - - ['Transfer Learning', 'transfer-learning'] - - ['Parallel Training', 'parallel-training'] + # - ['Parallel Training', 'parallel-training'] - ['Internal API', 'api'] --- @@ -35,8 +34,8 @@ ready-to-use spaCy models. The recommended way to train your spaCy models is via the [`spacy train`](/api/cli#train) command on the command line. It only needs a single [`config.cfg`](#config) **configuration file** that includes all settings -and hyperparameters. 
You can optionally [overwritten](#config-overrides) -settings on the command line, and load in a Python file to register +and hyperparameters. You can optionally [overwrite](#config-overrides) settings +on the command line, and load in a Python file to register [custom functions](#custom-code) and architectures. This quickstart widget helps you generate a starter config with the **recommended settings** for your specific use case. It's also available in spaCy as the @@ -82,7 +81,7 @@ $ python -m spacy init fill-config base_config.cfg config.cfg Instead of exporting your starter config from the quickstart widget and auto-filling it, you can also use the [`init config`](/api/cli#init-config) -command and specify your requirement and settings and CLI arguments. You can now +command and specify your requirement and settings as CLI arguments. You can now add your data and run [`train`](/api/cli#train) with your config. See the [`convert`](/api/cli#convert) command for details on how to convert your data to spaCy's binary `.spacy` format. You can either include the data paths in the @@ -92,23 +91,8 @@ spaCy's binary `.spacy` format. You can either include the data paths in the $ python -m spacy train config.cfg --output ./output --paths.train ./train.spacy --paths.dev ./dev.spacy ``` - - ## Training config {#config} - - Training config files include all **settings and hyperparameters** for training your model. Instead of providing lots of arguments on the command line, you only need to pass your `config.cfg` file to [`spacy train`](/api/cli#train). Under @@ -126,9 +110,10 @@ Some of the main advantages and features of spaCy's training config are: functions like [model architectures](/api/architectures), [optimizers](https://thinc.ai/docs/api-optimizers) or [schedules](https://thinc.ai/docs/api-schedules) and define arguments that are - passed into them. You can also register your own functions to define - [custom architectures](#custom-functions), reference them in your config and - tweak their parameters. + passed into them. You can also + [register your own functions](#custom-functions) to define custom + architectures or methods, reference them in your config and tweak their + parameters. - **Interpolation.** If you have hyperparameters or other settings used by multiple components, define them once and reference them as [variables](#config-interpolation). @@ -226,21 +211,21 @@ passed to the component factory as arguments. This lets you configure the model settings and hyperparameters. If a component block defines a `source`, the component will be copied over from an existing pretrained model, with its existing weights. This lets you include an already trained component in your -model pipeline, or update a pretrained components with more data specific to -your use case. +model pipeline, or update a pretrained component with more data specific to your +use case. 
```ini ### config.cfg (excerpt) [components] -# "parser" and "ner" are sourced from pretrained model +# "parser" and "ner" are sourced from a pretrained model [components.parser] source = "en_core_web_sm" [components.ner] source = "en_core_web_sm" -# "textcat" and "custom" are created blank from built-in / custom factory +# "textcat" and "custom" are created blank from a built-in / custom factory [components.textcat] factory = "textcat" @@ -294,11 +279,11 @@ batch_size = 128 ``` To refer to a function instead, you can make `[training.batch_size]` its own -section and use the `@` syntax specify the function and its arguments – in this -case [`compounding.v1`](https://thinc.ai/docs/api-schedules#compounding) defined -in the [function registry](/api/top-level#registry). All other values defined in -the block are passed to the function as keyword arguments when it's initialized. -You can also use this mechanism to register +section and use the `@` syntax to specify the function and its arguments – in +this case [`compounding.v1`](https://thinc.ai/docs/api-schedules#compounding) +defined in the [function registry](/api/top-level#registry). All other values +defined in the block are passed to the function as keyword arguments when it's +initialized. You can also use this mechanism to register [custom implementations and architectures](#custom-functions) and reference them from your configs. @@ -404,13 +389,11 @@ recipe once the dish has already been prepared. You have to make a new one. spaCy includes a variety of built-in [architectures](/api/architectures) for different tasks. For example: - - | Architecture | Description | | ----------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | | [HashEmbedCNN](/api/architectures#HashEmbedCNN) | Build spaCy’s "standard" embedding layer, which uses hash embedding with subword features and a CNN with layer-normalized maxout. ~~Model[List[Doc], List[Floats2d]]~~ | | [TransitionBasedParser](/api/architectures#TransitionBasedParser) | Build a [transition-based parser](https://explosion.ai/blog/parsing-english-in-python) model used in the default [`EntityRecognizer`](/api/entityrecognizer) and [`DependencyParser`](/api/dependencyparser). ~~Model[List[Docs], List[List[Floats2d]]]~~ | -| [TextCatEnsemble](/api/architectures#TextCatEnsemble) | Stacked ensemble of a bag-of-words model and a neural network model with an internal CNN embedding layer. Used in the default [`TextCategorizer`](/api/textcategorizer). ~~Model~~ | +| [TextCatEnsemble](/api/architectures#TextCatEnsemble) | Stacked ensemble of a bag-of-words model and a neural network model with an internal CNN embedding layer. Used in the default [`TextCategorizer`](/api/textcategorizer). ~~Model[List[Doc], Floats2d]~~ | @@ -726,9 +709,9 @@ a stream of items into a stream of batches. spaCy has several useful built-in [batching strategies](/api/top-level#batchers) with customizable sizes, but it's also easy to implement your own. For instance, the following function takes the stream of generated [`Example`](/api/example) objects, and removes those which -have the exact same underlying raw text, to avoid duplicates within each batch. -Note that in a more realistic implementation, you'd also want to check whether -the annotations are exactly the same. 
+have the same underlying raw text, to avoid duplicates within each batch. Note +that in a more realistic implementation, you'd also want to check whether the +annotations are the same. > #### config.cfg > @@ -759,71 +742,10 @@ def filter_batch(size: int) -> Callable[[Iterable[Example]], Iterator[List[Examp return create_filtered_batches ``` - - ### Defining custom architectures {#custom-architectures} -## Transfer learning {#transfer-learning} - - - -### Using transformer models like BERT {#transformers} - -spaCy v3.0 lets you use almost any statistical model to power your pipeline. You -can use models implemented in a variety of frameworks. A transformer model is -just a statistical model, so the -[`spacy-transformers`](https://github.com/explosion/spacy-transformers) package -actually has very little work to do: it just has to provide a few functions that -do the required plumbing. It also provides a pipeline component, -[`Transformer`](/api/transformer), that lets you do multi-task learning and lets -you save the transformer outputs for later use. - - - -For more details on how to integrate transformer models into your training -config and customize the implementations, see the usage guide on -[training transformers](/usage/embeddings-transformers#transformers-training). - -### Pretraining with spaCy {#pretraining} - - - -## Parallel Training with Ray {#parallel-training} - - - ## Internal training API {#api} @@ -843,8 +765,8 @@ called the **gold standard**. It's initialized with a [`Doc`](/api/doc) object that will hold the predictions, and another `Doc` object that holds the gold-standard annotations. It also includes the **alignment** between those two documents if they differ in tokenization. The `Example` class ensures that spaCy -can rely on one **standardized format** that's passed through the pipeline. -Here's an example of a simple `Example` for part-of-speech tags: +can rely on one **standardized format** that's passed through the pipeline. For +instance, let's say we want to define gold-standard part-of-speech tags: ```python words = ["I", "like", "stuff"] @@ -856,9 +778,10 @@ reference = Doc(vocab, words=words).from_array("TAG", numpy.array(tag_ids, dtype example = Example(predicted, reference) ``` -Alternatively, the `reference` `Doc` with the gold-standard annotations can be -created from a dictionary with keyword arguments specifying the annotations, -like `tags` or `entities`. Using the `Example` object and its gold-standard +As this is quite verbose, there's an alternative way to create the reference +`Doc` with the gold-standard annotations. The function `Example.from_dict` takes +a dictionary with keyword arguments specifying the annotations, like `tags` or +`entities`. Using the resulting `Example` object and its gold-standard annotations, the model can be updated to learn a sentence of three words with their assigned part-of-speech tags. @@ -883,8 +806,8 @@ example = Example.from_dict(predicted, {"tags": tags}) Here's another example that shows how to define gold-standard named entities. The letters added before the labels refer to the tags of the [BILUO scheme](/usage/linguistic-features#updating-biluo) – `O` is a token -outside an entity, `U` an single entity unit, `B` the beginning of an entity, -`I` a token inside an entity and `L` the last token of an entity. +outside an entity, `U` a single entity unit, `B` the beginning of an entity, `I` +a token inside an entity and `L` the last token of an entity. 
```python doc = Doc(nlp.vocab, words=["Facebook", "released", "React", "in", "2014"]) @@ -958,7 +881,7 @@ dictionary of annotations: ```diff text = "Facebook released React in 2014" annotations = {"entities": ["U-ORG", "O", "U-TECHNOLOGY", "O", "U-DATE"]} -+ example = Example.from_dict(nlp.make_doc(text), {"entities": entities}) ++ example = Example.from_dict(nlp.make_doc(text), annotations) - nlp.update([text], [annotations]) + nlp.update([example]) ``` diff --git a/website/docs/usage/v3.md b/website/docs/usage/v3.md index 837818a83..2a47fd264 100644 --- a/website/docs/usage/v3.md +++ b/website/docs/usage/v3.md @@ -10,6 +10,32 @@ menu: ## Summary {#summary} + + +
+ +
+ + + +- [Summary](#summary) +- [New features](#features) +- [Training & config system](#features-training) +- [Transformer-based pipelines](#features-transformers) +- [Custom models](#features-custom-models) +- [End-to-end project workflows](#features-projects) +- [New built-in components](#features-pipeline-components) +- [New custom component API](#features-components) +- [Python type hints](#features-types) +- [New methods & attributes](#new-methods) +- [New & updated documentation](#new-docs) +- [Backwards incompatibilities](#incompat) +- [Migrating from spaCy v2.x](#migrating) + + + +
+ ## New Features {#features} ### New training workflow and config system {#features-training} @@ -28,6 +54,8 @@ menu: ### Transformer-based pipelines {#features-transformers} +![Pipeline components listening to shared embedding component](../images/tok2vec-listener.svg) + - **Usage:** [Embeddings & Transformers](/usage/embeddings-transformers), @@ -38,7 +66,7 @@ menu: - **Architectures: ** [TransformerModel](/api/architectures#TransformerModel), [Tok2VecListener](/api/architectures#transformers-Tok2VecListener), [Tok2VecTransformer](/api/architectures#Tok2VecTransformer) -- **Models:** [`en_core_bert_sm`](/models/en) +- **Models:** [`en_core_trf_lg_sm`](/models/en) - **Implementation:** [`spacy-transformers`](https://github.com/explosion/spacy-transformers) @@ -46,8 +74,57 @@ menu: ### Custom models using any framework {#features-custom-models} + + + + +- **Thinc: ** + [Wrapping PyTorch, TensorFlow & MXNet](https://thinc.ai/docs/usage-frameworks) +- **API:** [Model architectures](/api/architectures), [`Pipe`](/api/pipe) + + + ### Manage end-to-end workflows with projects {#features-projects} + + +> #### Example +> +> ```cli +> # Clone a project template +> $ python -m spacy project clone example +> $ cd example +> # Download data assets +> $ python -m spacy project assets +> # Run a workflow +> $ python -m spacy project run train +> ``` + +spaCy projects let you manage and share **end-to-end spaCy workflows** for +different **use cases and domains**, and orchestrate training, packaging and +serving your custom models. You can start off by cloning a pre-defined project +template, adjust it to fit your needs, load in your data, train a model, export +it as a Python package, upload your outputs to a remote storage and share your +results with your team. + +![Illustration of project workflow and commands](../images/projects.svg) + +spaCy projects also make it easy to **integrate with other tools** in the data +science and machine learning ecosystem, including [DVC](/usage/projects#dvc) for +data version control, [Prodigy](/usage/projects#prodigy) for creating labelled +data, [Streamlit](/usage/projects#streamlit) for building interactive apps, +[FastAPI](/usage/projects#fastapi) for serving models in production, +[Ray](/usage/projects#ray) for parallel training, +[Weights & Biases](/usage/projects#wandb) for experiment tracking, and more! + + + - **Usage:** [spaCy projects](/usage/projects), @@ -59,6 +136,16 @@ menu: ### New built-in pipeline components {#features-pipeline-components} +spaCy v3.0 includes several new trainable and rule-based components that you can +add to your pipeline and customize for your use case: + +> #### Example +> +> ```python +> nlp = spacy.blank("en") +> nlp.add_pipe("lemmatizer") +> ``` + | Name | Description | | ----------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | | [`SentenceRecognizer`](/api/sentencerecognizer) | Trainable component for sentence segmentation. 
| @@ -78,15 +165,37 @@ menu: ### New and improved pipeline component APIs {#features-components} -- `Language.factory`, `Language.component` -- `Language.analyze_pipes` -- Adding components from other models +> #### Example +> +> ```python +> @Language.component("my_component") +> def my_component(doc): +> return doc +> +> nlp.add_pipe("my_component") +> nlp.add_pipe("ner", source=other_nlp) +> nlp.analyze_pipes(pretty=True) +> ``` + +Defining, configuring, reusing, training and analyzing pipeline components is +now easier and more convenient. The `@Language.component` and +`@Language.factory` decorators let you register your component, define its +default configuration and meta data, like the attribute values it assigns and +requires. Any custom component can be included during training, and sourcing +components from existing pretrained models lets you **mix and match custom +pipelines**. The `nlp.analyze_pipes` method outputs structured information about +the current pipeline and its components, including the attributes they assign, +the scores they compute during training and whether any required attributes +aren't set. - **Usage:** [Custom components](/usage/processing-pipelines#custom_components), - [Defining components during training](/usage/training#config-components) -- **API:** [`Language`](/api/language) + [Defining components for training](/usage/training#config-components) +- **API:** [`@Language.component`](/api/language#component), + [`@Language.factory`](/api/language#factory), + [`Language.add_pipe`](/api/language#add_pipe), + [`Language.analyze_pipes`](/api/language#analyze_pipes) - **Implementation:** [`spacy/language.py`](https://github.com/explosion/spaCy/tree/develop/spacy/language.py) @@ -136,14 +245,15 @@ in your config and see validation errors if the argument values don't match. -### New methods, attributes and commands +### New methods, attributes and commands {#new-methods} The following methods, attributes and commands are new in spaCy v3.0. | Name | Description | | ----------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | | [`Token.lex`](/api/token#attributes) | Access a token's [`Lexeme`](/api/lexeme). | -| [`Language.select_pipes`](/api/language#select_pipes) | Contextmanager for enabling or disabling specific pipeline components for a block. | +| [`Token.morph`](/api/token#attributes) [`Token.morph_`](/api/token#attributes) | Access a token's morphological analysis. | +| [`Language.select_pipes`](/api/language#select_pipes) | Context manager for enabling or disabling specific pipeline components for a block. | | [`Language.analyze_pipes`](/api/language#analyze_pipes) | [Analyze](/usage/processing-pipelines#analysis) components and their interdependencies. | | [`Language.resume_training`](/api/language#resume_training) | Experimental: continue training a pretrained model and initialize "rehearsal" for components that implement a `rehearse` method to prevent catastrophic forgetting. | | [`@Language.factory`](/api/language#factory) [`@Language.component`](/api/language#component) | Decorators for [registering](/usage/processing-pipelines#custom-components) pipeline component factories and simple stateless component functions. 
| @@ -153,9 +263,55 @@ The following methods, attributes and commands are new in spaCy v3.0. | [`Pipe.score`](/api/pipe#score) | Method on trainable pipeline components that returns a dictionary of evaluation scores. | | [`registry`](/api/top-level#registry) | Function registry to map functions to string names that can be referenced in [configs](/usage/training#config). | | [`util.load_meta`](/api/top-level#util.load_meta) [`util.load_config`](/api/top-level#util.load_config) | Updated helpers for loading a model's [`meta.json`](/api/data-formats#meta) and [`config.cfg`](/api/data-formats#config). | +| [`util.get_installed_models`](/api/top-level#util.get_installed_models) | Names of all models installed in the environment. | | [`init config`](/api/cli#init-config) [`init fill-config`](/api/cli#init-fill-config) [`debug config`](/api/cli#debug-config) | CLI commands for initializing, auto-filling and debugging [training configs](/usage/training). | | [`project`](/api/cli#project) | Suite of CLI commands for cloning, running and managing [spaCy projects](/usage/projects). | +### New and updated documentation {#new-docs} + + + +
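
Before moving on to the new documentation itself, here is a minimal, unofficial sketch that ties together a few of the new methods and decorators listed in the table above (`@Language.component`, string-based `add_pipe`, `Token.morph` and `Language.select_pipes`). The component name `doc_length` is invented purely for illustration:

```python
import spacy
from spacy.language import Language


@Language.component("doc_length")
def doc_length(doc):
    # A stateless component: receives the Doc and returns it unchanged.
    print("Number of tokens:", len(doc))
    return doc


nlp = spacy.blank("en")
nlp.add_pipe("doc_length")  # components are now added by string name

doc = nlp("This is a sentence.")
print(doc[0].morph)  # Token.morph: the token's morphological analysis

# Temporarily disable components for a block
with nlp.select_pipes(disable=["doc_length"]):
    doc = nlp("Nothing is printed for this text.")
```

Because the component is registered under a string name, the same function can also be referenced from a training config or added to any other pipeline.
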
+ +To help you get started with spaCy v3.0 and the new features, we've added +several new or rewritten documentation pages, including a new usage guide on +[embeddings, transformers and transfer learning](/usage/embeddings-transformers), +a guide on [training models](/usage/training) rewritten from scratch, a page +explaining the new [spaCy projects](/usage/projects) and updated usage +documentation on +[custom pipeline components](/usage/processing-pipelines#custom-components). +We've also added a bunch of new illustrations and new API reference pages +documenting spaCy's machine learning [model architectures](/api/architectures) +and the expected [data formats](/api/data-formats). API pages about +[pipeline components](/api/#architecture-pipeline) now include more information, +like the default config and implementation, and we've adopted a more detailed +format for documenting argument and return types. + +
+ +[![Library architecture](../images/architecture.svg)](/api) + +
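
The updated usage guides linked below go into much more detail on mixing and matching pipeline components across models. As a rough sketch of the idea (the package name `en_core_web_sm` is only a placeholder for any installed, v3-compatible pipeline that provides an `ner` component):

```python
import spacy

# Placeholder model package: swap in any installed v3-compatible pipeline
# that provides an "ner" component.
other_nlp = spacy.load("en_core_web_sm")

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")            # rule-based sentence segmentation
nlp.add_pipe("ner", source=other_nlp)  # reuse the pretrained NER component
nlp.analyze_pipes(pretty=True)         # show components and the attributes they assign
```
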
+ + + +- **Usage: ** [Embeddings & Transformers](/usage/embeddings-transformers), + [Training models](/usage/training), + [Layers & Architectures](/usage/layers-architectures), + [Projects](/usage/projects), + [Custom pipeline components](/usage/processing-pipelines#custom-components), + [Custom tokenizers](/usage/linguistic-features#custom-tokenizer) +- **API Reference: ** [Library architecture](/api), + [Model architectures](/api/architectures), [Data formats](/api/data-formats) +- **New Classes: ** [`Example`](/api/example), [`Tok2Vec`](/api/tok2vec), + [`Transformer`](/api/transformer), [`Lemmatizer`](/api/lemmatizer), + [`Morphologizer`](/api/morphologizer), + [`AttributeRuler`](/api/attributeruler), + [`SentenceRecognizer`](/api/sentencerecognizer), [`Pipe`](/api/pipe), + [`Corpus`](/api/corpus) + + + ## Backwards Incompatibilities {#incompat} As always, we've tried to keep the breaking changes to a minimum and focus on @@ -186,13 +342,13 @@ Note that spaCy v3.0 now requires **Python 3.6+**. [training config](/usage/training#config). - [`Language.add_pipe`](/api/language#add_pipe) now takes the **string name** of the component factory instead of the component function. -- **Custom pipeline components** now needs to be decorated with the +- **Custom pipeline components** now need to be decorated with the [`@Language.component`](/api/language#component) or [`@Language.factory`](/api/language#factory) decorator. - [`Language.update`](/api/language#update) now takes a batch of [`Example`](/api/example) objects instead of raw texts and annotations, or `Doc` and `GoldParse` objects. -- The `Language.disable_pipes` contextmanager has been replaced by +- The `Language.disable_pipes` context manager has been replaced by [`Language.select_pipes`](/api/language#select_pipes), which can explicitly disable or enable components. - The [`Language.update`](/api/language#update), @@ -212,15 +368,16 @@ Note that spaCy v3.0 now requires **Python 3.6+**. 
### Removed or renamed API {#incompat-removed} -| Removed | Replacement | -| ------------------------------------------------------ | ----------------------------------------------------------------------------------------- | -| `Language.disable_pipes` | [`Language.select_pipes`](/api/language#select_pipes) | -| `GoldParse` | [`Example`](/api/example) | -| `GoldCorpus` | [`Corpus`](/api/corpus) | -| `KnowledgeBase.load_bulk` `KnowledgeBase.dump` | [`KnowledgeBase.from_disk`](/api/kb#from_disk) [`KnowledgeBase.to_disk`](/api/kb#to_disk) | -| `spacy debug-data` | [`spacy debug data`](/api/cli#debug-data) | -| `spacy profile` | [`spacy debug profile`](/api/cli#debug-profile) | -| `spacy link` `util.set_data_path` `util.get_data_path` | not needed, model symlinks are deprecated | +| Removed | Replacement | +| -------------------------------------------------------- | ------------------------------------------------------------------------------------------ | +| `Language.disable_pipes` | [`Language.select_pipes`](/api/language#select_pipes) | +| `GoldParse` | [`Example`](/api/example) | +| `GoldCorpus` | [`Corpus`](/api/corpus) | +| `KnowledgeBase.load_bulk`, `KnowledgeBase.dump` | [`KnowledgeBase.from_disk`](/api/kb#from_disk), [`KnowledgeBase.to_disk`](/api/kb#to_disk) | +| `spacy init-model` | [`spacy init model`](/api/cli#init-model) | +| `spacy debug-data` | [`spacy debug data`](/api/cli#debug-data) | +| `spacy profile` | [`spacy debug profile`](/api/cli#debug-profile) | +| `spacy link`, `util.set_data_path`, `util.get_data_path` | not needed, model symlinks are deprecated | The following deprecated methods, attributes and arguments were removed in v3.0. Most of them have been **deprecated for a while** and many would previously @@ -236,7 +393,7 @@ on them. | `Language.tagger`, `Language.parser`, `Language.entity` | [`Language.get_pipe`](/api/language#get_pipe) | | keyword-arguments like `vocab=False` on `to_disk`, `from_disk`, `to_bytes`, `from_bytes` | `exclude=["vocab"]` | | `n_threads` argument on [`Tokenizer`](/api/tokenizer), [`Matcher`](/api/matcher), [`PhraseMatcher`](/api/phrasematcher) | `n_process` | -| `verbose` argument on [`Language.evaluate`] | logging | +| `verbose` argument on [`Language.evaluate`](/api/language#evaluate) | logging (`DEBUG`) | | `SentenceSegmenter` hook, `SimilarityHook` | [user hooks](/usage/processing-pipelines#custom-components-user-hooks), [`Sentencizer`](/api/sentencizer), [`SentenceRecognizer`](/api/sentenceregognizer) | ## Migrating from v2.x {#migrating} diff --git a/website/docs/usage/visualizers.md b/website/docs/usage/visualizers.md index f33340063..4ba0112b6 100644 --- a/website/docs/usage/visualizers.md +++ b/website/docs/usage/visualizers.md @@ -121,10 +121,10 @@ import DisplacyEntHtml from 'images/displacy-ent2.html' The entity visualizer lets you customize the following `options`: -| Argument | Description | -| -------- | -------------------------------------------------------------------------------------------------------------------------- | -| `ents` | Entity types to highlight (`None` for all types). Defaults to `None`. ~~Optional[List[str]]~~ | `None` | -| `colors` | Color overrides. Entity types in uppercase should be mapped to color names or values. Defaults to `{}`. ~~Dict[str, str]~~ | +| Argument | Description | +| -------- | ------------------------------------------------------------------------------------------------------------- | +| `ents` | Entity types to highlight (`None` for all types). Defaults to `None`. 
~~Optional[List[str]]~~ | `None` | +| `colors` | Color overrides. Entity types should be mapped to color names or values. Defaults to `{}`. ~~Dict[str, str]~~ | If you specify a list of `ents`, only those entity types will be rendered – for example, you can choose to display `PERSON` entities. Internally, the visualizer diff --git a/website/meta/sidebars.json b/website/meta/sidebars.json index c830619c5..94fbc2492 100644 --- a/website/meta/sidebars.json +++ b/website/meta/sidebars.json @@ -24,6 +24,11 @@ "tag": "new" }, { "text": "Training Models", "url": "/usage/training", "tag": "new" }, + { + "text": "Layers & Model Architectures", + "url": "/usage/layers-architectures", + "tag": "new" + }, { "text": "spaCy Projects", "url": "/usage/projects", "tag": "new" }, { "text": "Saving & Loading", "url": "/usage/saving-loading" }, { "text": "Visualizers", "url": "/usage/visualizers" } diff --git a/website/meta/type-annotations.json b/website/meta/type-annotations.json index 3cfcf5f75..b1d94403d 100644 --- a/website/meta/type-annotations.json +++ b/website/meta/type-annotations.json @@ -29,6 +29,8 @@ "Optimizer": "https://thinc.ai/docs/api-optimizers", "Model": "https://thinc.ai/docs/api-model", "Ragged": "https://thinc.ai/docs/api-types#ragged", + "Padded": "https://thinc.ai/docs/api-types#padded", + "Ints2d": "https://thinc.ai/docs/api-types#types", "Floats2d": "https://thinc.ai/docs/api-types#types", "Floats3d": "https://thinc.ai/docs/api-types#types", "FloatsXd": "https://thinc.ai/docs/api-types#types", diff --git a/website/src/components/table.js b/website/src/components/table.js index 3f41a587b..bd3d663f3 100644 --- a/website/src/components/table.js +++ b/website/src/components/table.js @@ -5,6 +5,8 @@ import Icon from './icon' import { isString } from './util' import classes from '../styles/table.module.sass' +const FOOT_ROW_REGEX = /^(RETURNS|YIELDS|CREATES|PRINTS|EXECUTES|UPLOADS|DOWNLOADS)/ + function isNum(children) { return isString(children) && /^\d+[.,]?[\dx]+?(|x|ms|mb|gb|k|m)?$/i.test(children) } @@ -43,7 +45,6 @@ function isDividerRow(children) { } function isFootRow(children) { - const rowRegex = /^(RETURNS|YIELDS|CREATES|PRINTS|EXECUTES)/ if (children.length && children[0].props.name === 'td') { const cellChildren = children[0].props.children if ( @@ -52,7 +53,7 @@ function isFootRow(children) { cellChildren.props.children && isString(cellChildren.props.children) ) { - return rowRegex.test(cellChildren.props.children) + return FOOT_ROW_REGEX.test(cellChildren.props.children) } } return false diff --git a/website/src/fonts/jetBrainsmono-italic.woff b/website/src/fonts/jetbrainsmono-italic.woff similarity index 100% rename from website/src/fonts/jetBrainsmono-italic.woff rename to website/src/fonts/jetbrainsmono-italic.woff diff --git a/website/src/styles/code.module.sass b/website/src/styles/code.module.sass index 2d213d001..aa1f499dd 100644 --- a/website/src/styles/code.module.sass +++ b/website/src/styles/code.module.sass @@ -67,7 +67,7 @@ border: 0 // Special style for types in API tables - td > &:last-child + td:not(:first-child) > &:last-child display: block border-top: 1px dotted var(--color-subtle) border-radius: 0 diff --git a/website/src/styles/readnext.module.sass b/website/src/styles/readnext.module.sass index 23aa7f016..aef91c09e 100644 --- a/website/src/styles/readnext.module.sass +++ b/website/src/styles/readnext.module.sass @@ -12,7 +12,7 @@ background: var(--color-subtle-light) color: var(--color-subtle-dark) border-radius: 50% - padding: 0.5rem + padding: 0.5rem 
0.65rem 0.5rem 0 transition: color 0.2s ease float: right margin-left: 3rem