spaCy/projects.md at bb3ee38cf9a1e83cd1d50b7ddd6bf658566359c7


Projects

Introduction and workflow


1. Clone a project template

The spacy project clone command clones an existing project template and copies the files to a local directory. You can then run the project, e.g. to train a model, and edit the commands and scripts to build fully custom workflows.

Cloning under the hood

To clone a project, spaCy calls into git and uses the "sparse checkout" feature to only clone the relevant directory or directories.
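As a rough illustration of what a sparse checkout involves, the sketch below only builds the Git commands for cloning a single directory; it does not run them. This is not spaCy's actual implementation: the function name, the example repo URL and the choice of flags are assumptions for illustration, and the `git sparse-checkout` subcommand requires Git 2.25 or newer.

```python
from typing import List


def sparse_clone_commands(repo: str, subdir: str, dest: str, branch: str) -> List[List[str]]:
    """Build (but do not run) git commands that check out only `subdir` of `repo`.

    Hypothetical helper for illustration; not part of spaCy.
    """
    return [
        # Clone only the latest commit of the branch, defer downloading file
        # contents, and leave the working tree empty for now.
        ["git", "clone", "--no-checkout", "--depth", "1", "--filter=blob:none",
         "--branch", branch, repo, dest],
        # Restrict the working tree to the requested directory.
        ["git", "-C", dest, "sparse-checkout", "set", subdir],
        # Populate the working tree, which now only contains that directory.
        ["git", "-C", dest, "checkout", branch],
    ]


if __name__ == "__main__":
    # The repo URL and branch name below are placeholders.
    for cmd in sparse_clone_commands(
        "https://example.com/org/projects.git", "some_example_project", "out", "main"
    ):
        print(" ".join(cmd))
```

Building the commands as lists keeps the sketch easy to inspect and lets you pass each one to `subprocess.run` unchanged if you want to try it against a repo you control.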

$ python -m spacy project clone some_example_project

By default, the project will be cloned into the current working directory. You can specify an optional second argument to define the output directory. The --repo option lets you define a custom repo to clone from, if you don't want to use the spaCy projects repo. You can also use any private repo you have access to with Git.

If you plan on making the project a Git repo, you can set the --git flag to initialize the repo automatically before DVC is initialized, so that DVC can integrate with Git. DVC will then automatically add asset files to a .gitignore, so you never check the assets themselves into the repo, only the asset meta files.

2. Fetch the project assets

Assets are data files your project needs – for example, the training and evaluation data or pretrained vectors and embeddings to initialize your model with.

$ cd some_example_project
$ python -m spacy project assets

3. Run the steps

$ python -m spacy project run-all

4. Run single commands

$ python -m spacy project run visualize
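If you want to drive the workflow from a script rather than typing each command, the steps can be chained with `subprocess`. The sketch below is only one way to do this: the helper names are made up for illustration, and it assumes spaCy is installed for the current interpreter and that the template clones into a directory named after the project.

```python
import subprocess
import sys
from typing import List, Optional, Tuple


def project_command(*args: str) -> List[str]:
    """Return the argv for `python -m spacy project <args>` using the current interpreter."""
    return [sys.executable, "-m", "spacy", "project", *args]


def workflow_steps(name: str) -> List[Tuple[List[str], Optional[str]]]:
    """Return (argv, working directory) pairs for clone, assets and run-all."""
    return [
        (project_command("clone", name), None),  # runs in the current directory
        (project_command("assets"), name),       # runs inside the cloned project
        (project_command("run-all"), name),
    ]


def run_workflow(name: str) -> None:
    """Execute the steps in order and stop at the first failing command."""
    for argv, cwd in workflow_steps(name):
        subprocess.run(argv, cwd=cwd, check=True)


if __name__ == "__main__":
    # Print the commands instead of executing them; call run_workflow() to run.
    for argv, cwd in workflow_steps("some_example_project"):
        print(cwd or ".", "$", " ".join(argv[1:]))
```

Passing `check=True` makes a failed step raise an exception, so later steps never run against missing assets.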

Project directory and assets

project.yml

The project config, project.yml, defines the assets a project depends on, like datasets and pretrained weights, as well as a series of commands that can be run separately or as a pipeline – for instance, to preprocess the data, convert it to spaCy's format, train a model, evaluate it and export metrics, package it and spin up a quick web demo. It looks pretty similar to a config file used to define CI pipelines.

Files and directory structure

A project directory created by spacy project clone includes the following files and directories. They can optionally be pre-populated by a project template (most commonly used for metas, configs or scripts).

### Project directory
├── project.yml          # the project configuration
├── dvc.yaml             # auto-generated Data Version Control config
├── dvc.lock             # auto-generated Data Version Control lock file
├── assets/              # downloaded data assets and DVC meta files
├── metrics/             # output directory for evaluation metrics
├── training/            # output directory for trained models
├── corpus/              # output directory for training corpus
├── packages/            # output directory for model Python packages
├── notebooks/           # directory for Jupyter notebooks
├── scripts/             # directory for scripts, e.g. referenced in commands
├── metas/               # model meta.json templates used for packaging
├── configs/             # model config.cfg files used for training
└── ...                  # any other files, like a requirements.txt etc.

When the project is initialized, spaCy will auto-generate a dvc.yaml based on the project config. The file is updated whenever the project config changes and includes all commands defined in the run section of the project config. This allows DVC to track the inputs and outputs and to know which steps need to be re-run.
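To make the mapping from project commands to DVC stages more concrete, here is a minimal sketch of how such a conversion could look. The command fields (`name`, `script`, `deps`, `outputs`) and the file names are hypothetical and are not taken from spaCy's project config schema; only the output shape, with `stages`, `cmd`, `deps` and `outs`, follows the layout of a dvc.yaml file. spaCy's real generator may produce different stage commands.

```python
import json
from typing import Any, Dict, List


def commands_to_dvc_stages(commands: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Convert a list of command definitions into a dvc.yaml-shaped dict."""
    stages: Dict[str, Any] = {}
    for command in commands:
        stage: Dict[str, Any] = {"cmd": command["script"]}
        # Only emit the keys that are actually defined for this command.
        if command.get("deps"):
            stage["deps"] = list(command["deps"])
        if command.get("outputs"):
            stage["outs"] = list(command["outputs"])
        stages[command["name"]] = stage
    return {"stages": stages}


# Hypothetical commands: the scripts and paths are placeholders.
example_commands = [
    {
        "name": "preprocess",
        "script": "python scripts/preprocess.py",
        "deps": ["assets/raw.jsonl"],
        "outputs": ["corpus/train.spacy"],
    },
    {
        "name": "train",
        "script": "python scripts/train.py",
        "deps": ["corpus/train.spacy", "configs/config.cfg"],
        "outputs": ["training/model"],
    },
]

if __name__ == "__main__":
    print(json.dumps(commands_to_dvc_stages(example_commands), indent=2))
```

Because the output of `preprocess` is listed as a dependency of `train`, a tool like DVC can infer that the two stages form a pipeline and order them accordingly.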

Why Data Version Control (DVC)?

Data assets like training corpora or pretrained weights are at the core of any NLP project, but they're often difficult to manage: you can't just check them into your Git repo to version and keep track of them. And if you have multiple steps that depend on each other, like a preprocessing step that generates your training data, you need to make sure the data is always up to date, and you may end up re-running all steps of your process every time, just to be safe.

Data Version Control (DVC) is a standalone open-source tool that integrates into your workflow like Git, builds a dependency graph for your data pipelines, and tracks and caches your data files. If you're downloading data from an external source, like a storage bucket, DVC can tell whether the resource has changed. It can also determine whether to re-run a step, depending on whether its inputs have changed or not. All metadata can be checked into a Git repo, so you'll always be able to reproduce your experiments.

spacy project uses DVC under the hood and you typically don't have to think about it if you don't want to. But if you do want to integrate with DVC more deeply, you can. Each spaCy project is also a regular DVC project.
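The idea of re-running a step only when its inputs have changed can be sketched with content hashes. The example below is a simplified stand-in and not how DVC is implemented: the function names and the JSON lock file are assumptions for illustration, and DVC's own lock file format and hashing scheme differ.

```python
import hashlib
import json
import tempfile
from pathlib import Path
from typing import Dict, List


def snapshot(deps: List[Path]) -> Dict[str, str]:
    """Map each input file to a hash of its current contents."""
    return {str(p): hashlib.sha256(p.read_bytes()).hexdigest() for p in deps}


def needs_rerun(deps: List[Path], lock_file: Path) -> bool:
    """Return True if there is no recorded run or any input differs from the record."""
    if not lock_file.exists():
        return True
    recorded = json.loads(lock_file.read_text(encoding="utf8"))
    return recorded != snapshot(deps)


def record_run(deps: List[Path], lock_file: Path) -> None:
    """Store the current input hashes after a step has completed."""
    lock_file.write_text(json.dumps(snapshot(deps)), encoding="utf8")


def demo() -> List[bool]:
    """Walk through three checks: no record yet, unchanged input, changed input."""
    results = []
    with tempfile.TemporaryDirectory() as tmp:
        data = Path(tmp) / "train.txt"
        lock = Path(tmp) / "step.lock.json"
        data.write_text("first version", encoding="utf8")
        results.append(needs_rerun([data], lock))  # no record yet
        record_run([data], lock)
        results.append(needs_rerun([data], lock))  # input unchanged
        data.write_text("second version", encoding="utf8")
        results.append(needs_rerun([data], lock))  # input changed
    return results


if __name__ == "__main__":
    print(demo())  # [True, False, True]
```

Comparing hashes rather than modification times means a file that is rewritten with identical contents does not trigger a re-run.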
