title | new | menu | |||||||||
---|---|---|---|---|---|---|---|---|---|---|---|
Projects | 3 |
|
Project templates
Our projects repo includes various project templates for different tasks and
models that you can clone and run.
spaCy projects let you manage and share end-to-end spaCy workflows for
training, packaging and serving your custom models. You can start off by cloning
a pre-defined project template, adjust it to fit your needs, load in your data,
train a model, export it as a Python package and share the project templates
with your team. Under the hood, projects use Data Version Control (DVC) to track
and version inputs and outputs, and to make sure you're only re-running what's
needed. spaCy projects can be used via the new spacy project command. For an
overview of the available project templates, check out the projects repo.
Introduction and workflow
1. Clone a project template
The spacy project clone command clones an existing project template and copies
the files to a local directory. You can then run the project, e.g. to train a
model, and edit the commands and scripts to build fully custom workflows.
Cloning under the hood
To clone a project, spaCy calls into git and uses the "sparse checkout" feature to only clone the relevant directory or directories.
$ python -m spacy project clone some_example_project
By default, the project will be cloned into the current working directory. You
can specify an optional second argument to define the output directory. The
--repo option lets you define a custom repo to clone from, if you don't want to
use the spaCy projects repo. You can also use any private repo you have access
to with Git.
If you plan on making the project a Git repo, you can set the --git flag to set
it up automatically before initializing DVC, so DVC can integrate with Git. DVC
will then automatically add asset files to a .gitignore, so you never check the
assets themselves into the repo, only the asset meta files.
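To make the "sparse checkout" idea concrete, here is a minimal Python sketch of how a single template directory could be fetched from a larger repo. It is not spaCy's actual implementation: the function names are made up for illustration, the repo URL in the usage note is a placeholder, and the Git flags assume Git 2.25 or later.

```python
import subprocess
from pathlib import Path
from typing import List


def sparse_clone_commands(repo: str, subdir: str, dest: Path) -> List[List[str]]:
    """Build Git commands that fetch only `subdir` of `repo` into `dest`."""
    return [
        # --sparse starts with only the top-level files checked out (Git 2.25+)
        ["git", "clone", "--depth", "1", "--sparse", repo, str(dest)],
        # Limit the sparse checkout to the one template directory
        ["git", "-C", str(dest), "sparse-checkout", "set", subdir],
    ]


def sparse_clone(repo: str, subdir: str, dest: Path) -> None:
    """Run the commands in order; raises CalledProcessError if Git fails."""
    for cmd in sparse_clone_commands(repo, subdir, dest):
        subprocess.run(cmd, check=True)
```

Calling `sparse_clone("https://example.com/projects.git", "some_example_project", Path("tmp_clone"))` would leave only that directory (plus any top-level files) in the working tree, which can then be copied to the output directory.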
2. Fetch the project assets
Assets are data files your project needs – for example, the training and evaluation data or pretrained vectors and embeddings to initialize your model with.
$ cd some_example_project
$ python -m spacy project assets
3. Run the steps
$ python -m spacy project run-all
4. Run single commands
$ python -m spacy project run visualize
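If you want to script the four steps instead of typing them, the commands can be driven from Python. The sketch below only assembles the command lines shown in this section and runs them with the standard library. The helper names are hypothetical, and it assumes the clone step creates a directory named after the template in the current working directory.

```python
import subprocess
import sys
from pathlib import Path
from typing import List, Tuple


def workflow_steps(template: str) -> List[Tuple[List[str], Path]]:
    """Return (argv, working directory) pairs for the four steps above."""
    spacy_project = [sys.executable, "-m", "spacy", "project"]
    project_dir = Path(template)  # assumed location of the cloned project
    return [
        (spacy_project + ["clone", template], Path(".")),   # 1. clone
        (spacy_project + ["assets"], project_dir),          # 2. fetch assets
        (spacy_project + ["run-all"], project_dir),         # 3. run all steps
        (spacy_project + ["run", "visualize"], project_dir),  # 4. single command
    ]


def run_workflow(template: str) -> None:
    """Execute each step in order and stop at the first failure."""
    for argv, cwd in workflow_steps(template):
        subprocess.run(argv, cwd=cwd, check=True)
```

Because `check=True` raises on a non-zero exit code, a failed asset download stops the script before any training step runs.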
Project directory and assets
project.yml
The project config, project.yml, defines the assets a project depends on, like
datasets and pretrained weights, as well as a series of commands that can be run
separately or as a pipeline – for instance, to preprocess the data, convert it
to spaCy's format, train a model, evaluate it and export metrics, package it and
spin up a quick web demo. It looks pretty similar to a config file used to
define CI pipelines.
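As a rough illustration of that structure, the sketch below shows a project config as the Python dict a YAML parser might produce, together with a small consistency check. Every key name (`assets`, `commands`, `script`, `deps`, `outputs`, `run`), file path and script name here is an assumption made for the example, not the documented schema.

```python
from typing import Any, Dict, List

# Hypothetical config, shown as the dict a YAML parser would produce.
# All key names, file paths and script names are assumptions.
EXAMPLE_CONFIG: Dict[str, Any] = {
    "assets": ["assets/train.json", "assets/dev.json"],
    "commands": [
        {
            "name": "preprocess",
            "script": ["python scripts/preprocess.py assets/train.json corpus/train.spacy"],
            "deps": ["assets/train.json"],
            "outputs": ["corpus/train.spacy"],
        },
        {
            "name": "train",
            "script": ["python scripts/train.py corpus/train.spacy training/"],
            "deps": ["corpus/train.spacy"],
            "outputs": ["training/model"],
        },
    ],
    # Names of the commands that make up the pipeline, in order
    "run": ["preprocess", "train"],
}


def check_config(config: Dict[str, Any]) -> List[str]:
    """Return readable problems; an empty list means the config is consistent."""
    problems: List[str] = []
    names = [cmd["name"] for cmd in config.get("commands", [])]
    for name in sorted(set(names)):
        if names.count(name) > 1:
            problems.append(f"duplicate command: {name}")
    for step in config.get("run", []):
        if step not in names:
            problems.append(f"unknown step in run: {step}")
    return problems
```

A check like this catches a pipeline that refers to a command that was renamed or removed, before anything is executed.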
Files and directory structure
A project directory created by spacy project clone includes the following files
and directories. They can optionally be pre-populated by a project template
(most commonly used for metas, configs or scripts).
### Project directory
├── project.yml # the project configuration
├── dvc.yaml # auto-generated Data Version Control config
├── dvc.lock # auto-generated Data Version Control lock file
├── assets/ # downloaded data assets and DVC meta files
├── metrics/ # output directory for evaluation metrics
├── training/ # output directory for trained models
├── corpus/ # output directory for training corpus
├── packages/ # output directory for model Python packages
├── notebooks/ # directory for Jupyter notebooks
├── scripts/ # directory for scripts, e.g. referenced in commands
├── metas/ # model meta.json templates used for packaging
├── configs/ # model config.cfg files used for training
└── ... # any other files, like a requirements.txt etc.
When the project is initialized, spaCy will auto-generate a dvc.yaml based on
the project config. The file is updated whenever the project config changes and
includes all commands defined in the run section of the project config. This
allows DVC to track the inputs and outputs and know which steps need to be
re-run.
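The translation from project commands to DVC stages can be pictured roughly as follows. This is a sketch under assumptions, not the generator spaCy ships: the input keys (`name`, `script`, `deps`, `outputs`) are hypothetical, while the output uses the `stages` / `cmd` / `deps` / `outs` layout of a dvc.yaml file. The result is printed as JSON only to stay within the standard library.

```python
import json
from typing import Any, Dict, List


def to_dvc_stages(commands: List[Dict[str, Any]], run: List[str]) -> Dict[str, Any]:
    """Map the commands listed in `run` to DVC stages (cmd, deps, outs)."""
    by_name = {cmd["name"]: cmd for cmd in commands}
    stages: Dict[str, Any] = {}
    for name in run:
        cmd = by_name[name]  # KeyError if `run` names an undefined command
        stages[name] = {
            # Chain multi-line scripts so each stage has a single shell command
            "cmd": " && ".join(cmd["script"]),
            "deps": list(cmd.get("deps", [])),
            "outs": list(cmd.get("outputs", [])),
        }
    return {"stages": stages}


if __name__ == "__main__":
    # Hypothetical command definition, for illustration only
    commands = [
        {
            "name": "train",
            "script": ["python scripts/train.py corpus/train.spacy training/"],
            "deps": ["corpus/train.spacy"],
            "outputs": ["training/model"],
        }
    ]
    print(json.dumps(to_dvc_stages(commands, ["train"]), indent=2))
```

Only commands named in `run` become stages, which mirrors the idea that commands outside the pipeline are not tracked.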
Why Data Version Control (DVC)?
Data assets like training corpora or pretrained weights are at the core of any NLP project, but they're often difficult to manage: you can't just check them into your Git repo to version and keep track of them. And if you have multiple steps that depend on each other, like a preprocessing step that generates your training data, you need to make sure the data is always up-to-date, and re-run all steps of your process every time, just to be safe.
Data Version Control (DVC) is a standalone open-source tool that integrates into
your workflow like Git, builds a dependency graph for your data pipelines and
tracks and caches your data files. If you're downloading data from an external
source, like a storage bucket, DVC can tell whether the resource has changed. It
can also determine whether to re-run a step, depending on whether its inputs
have changed or not. All metadata can be checked into a Git repo, so you'll
always be able to reproduce your experiments. spacy project uses DVC under the
hood and you typically don't have to think about it if you don't want to. But if
you do want to integrate with DVC more deeply, you can. Each spaCy project is
also a regular DVC project.
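The "only re-run what changed" behaviour can be sketched in a few lines of Python. This is a simplified illustration of the idea rather than DVC's implementation: the function names are hypothetical, the lock is a plain dict instead of a lock file, and SHA-256 is an arbitrary choice of content hash.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Dict, Iterable


def file_hash(path: Path) -> str:
    """Content hash of a file; raises FileNotFoundError if the input is missing."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def snapshot(deps: Iterable[Path]) -> Dict[str, str]:
    """Record the current hash of every input, like a tiny lock file."""
    return {str(p): file_hash(p) for p in deps}


def needs_rerun(deps: Iterable[Path], outputs: Iterable[Path], lock: Dict[str, str]) -> bool:
    """A step is stale if an output is missing or any input hash differs from the lock."""
    if any(not p.exists() for p in outputs):
        return True
    return any(lock.get(str(p)) != file_hash(p) for p in deps)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        data, model = Path(tmp) / "train.json", Path(tmp) / "model.bin"
        data.write_text("v1")
        model.write_text("weights")
        lock = snapshot([data])                    # state after a successful run
        print(needs_rerun([data], [model], lock))  # inputs unchanged
        data.write_text("v2")
        print(needs_rerun([data], [model], lock))  # input edited since the run
```

Comparing content hashes rather than timestamps means that touching a file without changing it does not trigger a re-run.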