lightning/docs/source/asr_tts.rst

#########
ASR & TTS
#########
These are amazing ecosystems to help with Automatic Speech Recognition (ASR) and Text to speech (TTS).

----

****
NeMo
****

`NVIDIA NeMo <https://github.com/NVIDIA/NeMo>`_ is a toolkit for building new State-of-the-Art
Conversational AI models. NeMo has separate collections for Automatic Speech Recognition (ASR),
Natural Language Processing (NLP), and Text-to-Speech (TTS) models. Each collection consists of
prebuilt modules that include everything needed to train on your data.
Every module can easily be customized, extended, and composed to create new Conversational AI
model architectures.

Conversational AI architectures are typically very large and require a lot of data  and compute
for training. NeMo uses PyTorch Lightning for easy and performant multi-GPU/multi-node
mixed-precision training.

.. note:: Every NeMo model is a LightningModule that comes equipped with all supporting infrastructure for training and reproducibility.

----------

NeMo Models
===========

NeMo Models contain everything needed to train and reproduce state of the art Conversational AI
research and applications, including:

- neural network architectures
- datasets/data loaders
- data preprocessing/postprocessing
- data augmentors
- optimizers and schedulers
- tokenizers
- language models

NeMo uses `Hydra <https://hydra.cc/>`_ for configuring both NeMo models and the PyTorch Lightning Trainer.
Depending on the domain and application, many different AI libraries will have to be configured
to build the application. Hydra makes it easy to bring all of these libraries together
so that each can be configured from .yaml or the Hydra CLI.

.. note:: Every NeMo model has an example configuration file and a corresponding script that contains all configurations needed for training.

The end result of using NeMo, Pytorch Lightning, and Hydra is that
NeMo models all have the same look and feel. This makes it easy to do Conversational AI research
across multiple domains. NeMo models are also fully compatible with the PyTorch ecosystem.

Installing NeMo
---------------

Before installing NeMo, please install Cython first.

.. code-block:: bash

    pip install Cython

For ASR and TTS models, also install these linux utilities.

.. code-block:: bash

    apt-get update && apt-get install -y libsndfile1 ffmpeg

Then installing the latest NeMo release is a simple pip install.

.. code-block:: bash

    pip install nemo_toolkit[all]==1.0.0b1

To install the main branch from GitHub:

.. code-block:: bash

    python -m pip install git+https://github.com/NVIDIA/NeMo.git@main#egg=nemo_toolkit[all]

To install from a local clone of NeMo:

.. code-block:: bash

    ./reinstall.sh # from cloned NeMo's git root

For Docker users, the NeMo container is available on
`NGC <https://ngc.nvidia.com/catalog/containers/nvidia:nemo>`_.

.. code-block:: bash

    docker pull nvcr.io/nvidia/nemo:v1.0.0b1

.. code-block:: bash

    docker run --runtime=nvidia -it --rm -v --shm-size=8g -p 8888:8888 -p 6006:6006 --ulimit memlock=-1 --ulimit stack=67108864 nvcr.io/nvidia/nemo:1.0.0b1

Experiment Manager
------------------

NeMo's Experiment Manager leverages PyTorch Lightning for model checkpointing,
TensorBoard Logging, and Weights and Biases logging. The Experiment Manager is included by default
in all NeMo example scripts.

.. code-block:: python

    exp_manager(trainer, cfg.get("exp_manager", None))

And is configurable via .yaml with Hydra.

.. code-block:: bash

    exp_manager:
        exp_dir: null
        name: *name
        create_tensorboard_logger: True
        create_checkpoint_callback: True

Optionally launch Tensorboard to view training results in ./nemo_experiments (by default).

.. code-block:: bash

    tensorboard --bind_all --logdir nemo_experiments

--------

Automatic Speech Recognition (ASR)
==================================

Everything needed to train Convolutional ASR models is included with NeMo.
NeMo supports multiple Speech Recognition architectures, including Jasper and QuartzNet.
`NeMo Speech Models <https://ngc.nvidia.com/catalog/models/nvidia:nemospeechmodels>`_
can be trained from scratch on custom datasets or
fine-tuned using pre-trained checkpoints trained on thousands of hours of audio
that can be restored for immediate use.

Some typical ASR tasks are included with NeMo:

- `Audio transcription <https://github.com/NVIDIA/NeMo/blob/main/tutorials/asr/01_ASR_with_NeMo.ipynb>`_
- `Byte Pair/Word Piece Training <https://github.com/NVIDIA/NeMo/blob/main/examples/asr/speech_to_text_bpe.py>`_
- `Speech Commands <https://github.com/NVIDIA/NeMo/blob/main/tutorials/asr/03_Speech_Commands.ipynb>`_
- `Voice Activity Detection <https://github.com/NVIDIA/NeMo/blob/main/tutorials/asr/06_Voice_Activiy_Detection.ipynb>`_
- `Speaker Recognition <https://github.com/NVIDIA/NeMo/blob/main/examples/speaker_recognition/speaker_reco.py>`_

See this `asr notebook <https://github.com/NVIDIA/NeMo/blob/main/tutorials/asr/01_ASR_with_NeMo.ipynb>`_
for a full tutorial on doing ASR with NeMo, PyTorch Lightning, and Hydra.

Specify ASR Model Configurations with YAML File
-----------------------------------------------

NeMo Models and the PyTorch Lightning Trainer can be fully configured from .yaml files using Hydra.

See this `asr config <https://github.com/NVIDIA/NeMo/blob/main/examples/asr/conf/config.yaml>`_
for the entire speech to text .yaml file.

.. code-block:: yaml

    # configure the PyTorch Lightning Trainer
    trainer:
        gpus: 0 # number of gpus
        max_epochs: 5
        max_steps: null # computed at runtime if not set
        num_nodes: 1
        distributed_backend: ddp
        ...
    # configure the ASR model
    model:
        ...
        encoder:
            _target_: nemo.collections.asr.modules.ConvASREncoder
            params:
            feat_in: *n_mels
            activation: relu
            conv_mask: true

            jasper:
                - filters: 128
                repeat: 1
                kernel: [11]
                stride: [1]
                dilation: [1]
                dropout: *dropout
                ...
        # all other configuration, data, optimizer, preprocessor, etc
        ...

Developing ASR Model From Scratch
---------------------------------

`speech_to_text.py <https://github.com/NVIDIA/NeMo/blob/main/examples/asr/speech_to_text.py>`_

.. code-block:: python

    # hydra_runner calls hydra.main and is useful for multi-node experiments
    @hydra_runner(config_path="conf", config_name="config")
    def main(cfg):
        trainer = Trainer(**cfg.trainer)
        asr_model = EncDecCTCModel(cfg.model, trainer)
        trainer.fit(asr_model)


Hydra makes every aspect of the NeMo model,
including the PyTorch Lightning Trainer, customizable from the command line.

.. code-block:: bash

    python NeMo/examples/asr/speech_to_text.py --config-name=quartznet_15x5 \
        trainer.gpus=4 \
        trainer.max_epochs=128 \
        +trainer.precision=16 \
        model.train_ds.manifest_filepath=<PATH_TO_DATA>/librispeech-train-all.json \
        model.validation_ds.manifest_filepath=<PATH_TO_DATA>/librispeech-dev-other.json \
        model.train_ds.batch_size=64 \
        +model.validation_ds.num_workers=16 \
        +model.train_ds.num_workers=16

.. note:: Training NeMo ASR models can take days/weeks so it is highly recommended to use multiple GPUs and multiple nodes with the PyTorch Lightning Trainer.


Using State-Of-The-Art Pre-trained ASR Model
--------------------------------------------

Transcribe audio with QuartzNet model pretrained on ~3300 hours of audio.

.. code-block:: python

    quartznet = EncDecCTCModel.from_pretrained('QuartzNet15x5Base-En')

    files = ['path/to/my.wav'] # file duration should be less than 25 seconds

    for fname, transcription in zip(files, quartznet.transcribe(paths2audio_files=files)):
        print(f"Audio in {fname} was recognized as: {transcription}")

To see the available pretrained checkpoints:

.. code-block:: python

    EncDecCTCModel.list_available_models()

NeMo ASR Model Under the Hood
-----------------------------

Any aspect of ASR training or model architecture design can easily be customized
with PyTorch Lightning since every NeMo model is a Lightning Module.

.. code-block:: python

    class EncDecCTCModel(ASRModel):
        """Base class for encoder decoder CTC-based models."""
    ...
        @typecheck()
        def forward(self, input_signal, input_signal_length):
            processed_signal, processed_signal_len = self.preprocessor(
                input_signal=input_signal, length=input_signal_length,
            )
            # Spec augment is not applied during evaluation/testing
            if self.spec_augmentation is not None and self.training:
                processed_signal = self.spec_augmentation(input_spec=processed_signal)
            encoded, encoded_len = self.encoder(audio_signal=processed_signal, length=processed_signal_len)
            log_probs = self.decoder(encoder_output=encoded)
            greedy_predictions = log_probs.argmax(dim=-1, keepdim=False)
            return log_probs, encoded_len, greedy_predictions

        # PTL-specific methods
        def training_step(self, batch, batch_nb):
            audio_signal, audio_signal_len, transcript, transcript_len = batch
            log_probs, encoded_len, predictions = self.forward(
                input_signal=audio_signal, input_signal_length=audio_signal_len
            )
            loss_value = self.loss(
                log_probs=log_probs, targets=transcript, input_lengths=encoded_len, target_lengths=transcript_len
            )
            wer_num, wer_denom = self._wer(predictions, transcript, transcript_len)
            tensorboard_logs = {
                'train_loss': loss_value,
                'training_batch_wer': wer_num / wer_denom,
                'learning_rate': self._optimizer.param_groups[0]['lr'],
            }
            return {'loss': loss_value, 'log': tensorboard_logs}

Neural Types in NeMo ASR
------------------------

NeMo Models and Neural Modules come with Neural Type checking.
Neural type checking is extremely useful when combining many different neural
network architectures for a production-grade application.

.. code-block:: python

        @property
        def input_types(self) -> Optional[Dict[str, NeuralType]]:
            if hasattr(self.preprocessor, '_sample_rate'):
                audio_eltype = AudioSignal(freq=self.preprocessor._sample_rate)
            else:
                audio_eltype = AudioSignal()
            return {
                "input_signal": NeuralType(('B', 'T'), audio_eltype),
                "input_signal_length": NeuralType(tuple('B'), LengthsType()),
            }

        @property
        def output_types(self) -> Optional[Dict[str, NeuralType]]:
            return {
                "outputs": NeuralType(('B', 'T', 'D'), LogprobsType()),
                "encoded_lengths": NeuralType(tuple('B'), LengthsType()),
                "greedy_predictions": NeuralType(('B', 'T'), LabelsType()),
            }

--------

Natural Language Processing (NLP)
=================================

Everything needed to finetune BERT-like language models for NLP tasks is included with NeMo.
`NeMo NLP Models <https://ngc.nvidia.com/catalog/models/nvidia:nemonlpmodels>`_
include `HuggingFace Transformers <https://github.com/huggingface/transformers>`_
and `NVIDIA Megatron-LM <https://github.com/NVIDIA/Megatron-LM>`_ BERT and Bio-Megatron models.
NeMo can also be used for pretraining BERT-based language models from HuggingFace.

Any of the HuggingFace encoders or Megatron-LM encoders can easily be used for the NLP tasks
that are included with NeMo:

- `Glue Benchmark (All tasks) <https://github.com/NVIDIA/NeMo/blob/main/tutorials/nlp/GLUE_Benchmark.ipynb>`_
- `Intent Slot Classification <https://github.com/NVIDIA/NeMo/tree/main/examples/nlp/intent_slot_classification>`_
- `Language Modeling (BERT Pretraining) <https://github.com/NVIDIA/NeMo/blob/main/tutorials/nlp/01_Pretrained_Language_Models_for_Downstream_Tasks.ipynb>`_
- `Question Answering <https://github.com/NVIDIA/NeMo/blob/main/tutorials/nlp/Question_Answering_Squad.ipynb>`_
- `Text Classification <https://github.com/NVIDIA/NeMo/tree/main/examples/nlp/text_classification>`_ (including Sentiment Analysis)
- `Token Classifcation <https://github.com/NVIDIA/NeMo/tree/main/examples/nlp/token_classification>`_ (including Named Entity Recognition)
- `Punctuation and Capitalization <https://github.com/NVIDIA/NeMo/blob/main/tutorials/nlp/Punctuation_and_Capitalization.ipynb>`_

Named Entity Recognition (NER)
------------------------------

NER (or more generally token classifcation) is the NLP task of detecting and classifying key information (entities) in text.
This task is very popular in Healthcare and Finance. In finance, for example, it can be important to identify
geographical, geopolitical, organizational, persons, events, and natural phenomenon entities.
See this `NER notebook <https://github.com/NVIDIA/NeMo/blob/main/tutorials/nlp/Token_Classification_Named_Entity_Recognition.ipynb>`_
for a full tutorial on doing NER with NeMo, PyTorch Lightning, and Hydra.

Specify NER Model Configurations with YAML File
-----------------------------------------------

..note NeMo Models and the PyTorch Lightning Trainer can be fully configured from .yaml files using Hydra.

See this `token classification config <https://github.com/NVIDIA/NeMo/blob/main/examples/nlp/token_classification/conf/token_classification_config.yaml>`_
for the entire NER (token classification) .yaml file.

.. code-block:: yaml

    # configure any argument of the PyTorch Lightning Trainer
    trainer:
        gpus: 1 # the number of gpus, 0 for CPU
        num_nodes: 1
        max_epochs: 5
        ...
    # configure any aspect of the token classification model here
    model:
        dataset:
            data_dir: ??? # /path/to/data
            class_balancing: null # choose from [null, weighted_loss]. Weighted_loss enables the weighted class balancing of the loss, may be used for handling unbalanced classes
            max_seq_length: 128
            ...
      tokenizer:
        tokenizer_name: ${model.language_model.pretrained_model_name} # or sentencepiece
        vocab_file: null # path to vocab file
        ...
    # the language model can be from HuggingFace or Megatron-LM
    language_model:
        pretrained_model_name: bert-base-uncased
        lm_checkpoint: null
        ...
    # the classifier for the downstream task
      head:
        num_fc_layers: 2
        fc_dropout: 0.5
        activation: 'relu'
        ...
    # all other configuration: train/val/test/ data, optimizer, experiment manager, etc
    ...

Developing NER Model From Scratch
---------------------------------

`token_classification.py <https://github.com/NVIDIA/NeMo/blob/main/examples/nlp/token_classification/token_classification.py>`_

.. code-block:: python

    # hydra_runner calls hydra.main and is useful for multi-node experiments
    @hydra_runner(config_path="conf", config_name="token_classification_config")
    def main(cfg: DictConfig) -> None:
        trainer = pl.Trainer(**cfg.trainer)
        model = TokenClassificationModel(cfg.model, trainer=trainer)
        trainer.fit(model)

After training, we can do inference with the saved NER model using PyTorch Lightning.

Inference from file:

.. code-block:: python

    gpu = 1 if cfg.trainer.gpus != 0 else 0
    trainer = pl.Trainer(gpus=gpu)
    model.set_trainer(trainer)
    model.evaluate_from_file(
        text_file=os.path.join(cfg.model.dataset.data_dir, cfg.model.validation_ds.text_file),
        labels_file=os.path.join(cfg.model.dataset.data_dir, cfg.model.validation_ds.labels_file),
        output_dir=exp_dir,
        add_confusion_matrix=True,
        normalize_confusion_matrix=True,
    )

Or we can run inference on a few examples:

.. code-block:: python

    queries = ['we bought four shirts from the nvidia gear store in santa clara.', 'Nvidia is a company in Santa Clara.']
    results = model.add_predictions(queries)

    for query, result in zip(queries, results):
        logging.info(f'Query : {query}')
        logging.info(f'Result: {result.strip()}\n')

Hydra makes every aspect of the NeMo model, including the PyTorch Lightning Trainer, customizable from the command line.

.. code-block:: bash

    python token_classification.py \
        model.language_model.pretrained_model_name=bert-base-cased \
        model.head.num_fc_layers=2 \
        model.dataset.data_dir=/path/to/my/data  \
        trainer.max_epochs=5 \
        trainer.gpus=[0,1]

-----------

Tokenizers
==========

Tokenization is the process of converting natural langauge text into integer arrays
which can be used for machine learning.
For NLP tasks, tokenization is an essential part of data preprocessing.
NeMo supports all BERT-like model tokenizers from
`HuggingFace's AutoTokenizer <https://huggingface.co/transformers/model_doc/auto.html#autotokenizer>`_
and also supports `Google's SentencePieceTokenizer <https://github.com/google/sentencepiece>`_
which can be trained on custom data.

To see the list of supported tokenizers:

.. code-block:: python

    from nemo.collections import nlp as nemo_nlp

    nemo_nlp.modules.get_tokenizer_list()

See this `tokenizer notebook <https://github.com/NVIDIA/NeMo/blob/main/tutorials/nlp/02_NLP_Tokenizers.ipynb>`_
for a full tutorial on using tokenizers in NeMo.

Language Models
---------------

Language models are used to extract information from (tokenized) text.
Much of the state-of-the-art in natural language processing is achieved
by fine-tuning pretrained language models on the downstream task.

With NeMo, you can either `pretrain <https://github.com/NVIDIA/NeMo/blob/main/examples/nlp/language_modeling/bert_pretraining.py>`_
a BERT model on your data or use a pretrained lanugage model from `HuggingFace Transformers <https://github.com/huggingface/transformers>`_
or `NVIDIA Megatron-LM <https://github.com/NVIDIA/Megatron-LM>`_.

To see the list of language models available in NeMo:

.. code-block:: python

    nemo_nlp.modules.get_pretrained_lm_models_list(include_external=True)

Easily switch between any language model in the above list by using `.get_lm_model`.

.. code-block:: python

    nemo_nlp.modules.get_lm_model(pretrained_model_name='distilbert-base-uncased')

See this `language model notebook <https://github.com/NVIDIA/NeMo/blob/main/tutorials/nlp/01_Pretrained_Language_Models_for_Downstream_Tasks.ipynb>`_
for a full tutorial on using pretrained language models in NeMo.

Using a Pre-trained NER Model
-----------------------------

NeMo has pre-trained NER models that can be used
to get started with Token Classification right away.
Models are automatically downloaded from NGC,
cached locally to disk,
and loaded into GPU memory using the `.from_pretrained` method.

.. code-block:: python

    # load pre-trained NER model
    pretrained_ner_model = TokenClassificationModel.from_pretrained(model_name="NERModel")

    # define the list of queries for inference
    queries = [
        'we bought four shirts from the nvidia gear store in santa clara.',
        'Nvidia is a company.',
        'The Adventures of Tom Sawyer by Mark Twain is an 1876 novel about a young boy growing '
        + 'up along the Mississippi River.',
    ]
    results = pretrained_ner_model.add_predictions(queries)

    for query, result in zip(queries, results):
        print()
        print(f'Query : {query}')
        print(f'Result: {result.strip()}\n')

NeMo NER Model Under the Hood
-----------------------------

Any aspect of NLP training or model architecture design can easily be customized with PyTorch Lightning
since every NeMo model is a Lightning Module.

.. code-block:: python

    class TokenClassificationModel(ModelPT):
        """
        Token Classification Model with BERT, applicable for tasks such as Named Entity Recognition
        """
        ...
        @typecheck()
        def forward(self, input_ids, token_type_ids, attention_mask):
            hidden_states = self.bert_model(
                input_ids=input_ids, token_type_ids=token_type_ids, attention_mask=attention_mask
            )
            logits = self.classifier(hidden_states=hidden_states)
            return logits

        # PTL-specfic methods
        def training_step(self, batch, batch_idx):
            """
            Lightning calls this inside the training loop with the data from the training dataloader
            passed in as `batch`.
            """
            input_ids, input_type_ids, input_mask, subtokens_mask, loss_mask, labels = batch
            logits = self(input_ids=input_ids, token_type_ids=input_type_ids, attention_mask=input_mask)

            loss = self.loss(logits=logits, labels=labels, loss_mask=loss_mask)
            tensorboard_logs = {'train_loss': loss, 'lr': self._optimizer.param_groups[0]['lr']}
            return {'loss': loss, 'log': tensorboard_logs}
        ...

Neural Types in NeMo NLP
------------------------

NeMo Models and Neural Modules come with Neural Type checking.
Neural type checking is extremely useful when combining many different neural network architectures
for a production-grade application.

.. code-block:: python

    @property
    def input_types(self) -> Optional[Dict[str, NeuralType]]:
        return self.bert_model.input_types

    @property
    def output_types(self) -> Optional[Dict[str, NeuralType]]:
        return self.classifier.output_types

--------

Text-To-Speech (TTS)
====================

Everything needed to train TTS models and generate audio is included with NeMo.
`NeMo TTS Models <https://ngc.nvidia.com/catalog/models/nvidia:nemottsmodels>`_
can be trained from scratch on your own data or pretrained models can be downloaded
automatically. NeMo currently supports  a two step inference procedure.
First, a model is used to generate a mel spectrogram from text.
Second, a model is used to generate audio from a mel spectrogram.

Mel Spectrogram Generators:

- `Tacotron 2 <https://github.com/NVIDIA/NeMo/blob/main/examples/tts/tacotron2.py>`_
- `Glow-TTS <https://github.com/NVIDIA/NeMo/blob/main/examples/tts/glow_tts.py>`_

Audio Generators:

- Griffin-Lim
- `WaveGlow <https://github.com/NVIDIA/NeMo/blob/main/examples/tts/waveglow.py>`_
- `SqueezeWave <https://github.com/NVIDIA/NeMo/blob/main/examples/tts/squeezewave.py>`_


Specify TTS Model Configurations with YAML File
-----------------------------------------------

..note NeMo Models and PyTorch Lightning Trainer can be fully configured from .yaml files using Hydra.

`tts/conf/glow_tts.yaml <https://github.com/NVIDIA/NeMo/blob/main/examples/tts/conf/glow_tts.yaml>`_

.. code-block:: yaml

    # configure the PyTorch Lightning Trainer
    trainer:
        gpus: -1 # number of gpus
        max_epochs: 350
        num_nodes: 1
        distributed_backend: ddp
        ...

    # configure the TTS model
    model:
        ...
        encoder:
            _target_: nemo.collections.tts.modules.glow_tts.TextEncoder
            params:
            n_vocab: 148
            out_channels: *n_mels
            hidden_channels: 192
            filter_channels: 768
            filter_channels_dp: 256
            ...
    # all other configuration, data, optimizer, parser, preprocessor, etc
    ...

Developing TTS Model From Scratch
---------------------------------

`tts/glow_tts.py <https://github.com/NVIDIA/NeMo/blob/main/examples/tts/glow_tts.py>`_

.. code-block:: python

    # hydra_runner calls hydra.main and is useful for multi-node experiments
    @hydra_runner(config_path="conf", config_name="glow_tts")
    def main(cfg):
        trainer = pl.Trainer(**cfg.trainer)
        model = GlowTTSModel(cfg=cfg.model, trainer=trainer)
        trainer.fit(model)

Hydra makes every aspect of the NeMo model, including the PyTorch Lightning Trainer, customizable from the command line.

.. code-block:: bash

    python NeMo/examples/tts/glow_tts.py \
        trainer.gpus=4 \
        trainer.max_epochs=400 \
        ...
        train_dataset=/path/to/train/data \
        validation_datasets=/path/to/val/data \
        model.train_ds.batch_size = 64 \

..note Training NeMo TTTs models from scratch take days/weeks so it is highly recommended to use multiple GPUs and multiple nodes with the PyTorch Lightning Trainer.

Using State-Of-The-Art Pre-trained TTS Model
--------------------------------------------

Generate speech using models trained on `LJSpeech <https://keithito.com/LJ-Speech-Dataset/>`,
around 24 hours of single speaker data.

See this `TTS notebook <https://github.com/NVIDIA/NeMo/blob/main/tutorials/tts/1_TTS_inference.ipynb>`_
for a full tutorial on generating speech with NeMo, PyTorch Lightning, and Hydra.

.. code-block:: python

    # load pretrained spectrogram model
    spec_gen = SpecModel.from_pretrained('GlowTTS-22050Hz').cuda()

    # load pretrained Generators
    vocoder = WaveGlowModel.from_pretrained('WaveGlow-22050Hz').cuda()

    def infer(spec_gen_model, vocder_model, str_input):
        with torch.no_grad():
            parsed = spec_gen.parse(text_to_generate)
            spectrogram = spec_gen.generate_spectrogram(tokens=parsed)
            audio = vocoder.convert_spectrogram_to_audio(spec=spectrogram)
        if isinstance(spectrogram, torch.Tensor):
            spectrogram = spectrogram.to('cpu').numpy()
        if len(spectrogram.shape) == 3:
            spectrogram = spectrogram[0]
        if isinstance(audio, torch.Tensor):
            audio = audio.to('cpu').numpy()
        return spectrogram, audio

    text_to_generate = input("Input what you want the model to say: ")
    spec, audio = infer(spec_gen, vocoder, text_to_generate)

To see the available pretrained checkpoints:

.. code-block:: python

    # spec generator
    GlowTTSModel.list_available_models()

    # vocoder
    WaveGlowModel.list_available_models()

NeMo TTS Model Under the Hood
-----------------------------

Any aspect of TTS training or model architecture design can easily
be customized with PyTorch Lightning since every NeMo model is a LightningModule.

`glow_tts.py <https://github.com/NVIDIA/NeMo/blob/main/nemo/collections/tts/models/glow_tts.py>`_

.. code-block:: python

    class GlowTTSModel(SpectrogramGenerator):
        """
        GlowTTS model used to generate spectrograms from text
        Consists of a text encoder and an invertible spectrogram decoder
        """
        ...
        # NeMo models come with neural type checking
        @typecheck(
            input_types={
                "x": NeuralType(('B', 'T'), TokenIndex()),
                "x_lengths": NeuralType(('B'), LengthsType()),
                "y": NeuralType(('B', 'D', 'T'), MelSpectrogramType(), optional=True),
                "y_lengths": NeuralType(('B'), LengthsType(), optional=True),
                "gen": NeuralType(optional=True),
                "noise_scale": NeuralType(optional=True),
                "length_scale": NeuralType(optional=True),
            }
        )
        def forward(self, *, x, x_lengths, y=None, y_lengths=None, gen=False, noise_scale=0.3, length_scale=1.0):
            if gen:
                return self.glow_tts.generate_spect(
                    text=x, text_lengths=x_lengths, noise_scale=noise_scale, length_scale=length_scale
                )
            else:
                return self.glow_tts(text=x, text_lengths=x_lengths, spect=y, spect_lengths=y_lengths)
        ...
        def step(self, y, y_lengths, x, x_lengths):
            z, y_m, y_logs, logdet, logw, logw_, y_lengths, attn = self(
                x=x, x_lengths=x_lengths, y=y, y_lengths=y_lengths, gen=False
            )

            l_mle, l_length, logdet = self.loss(
                z=z,
                y_m=y_m,
                y_logs=y_logs,
                logdet=logdet,
                logw=logw,
                logw_=logw_,
                x_lengths=x_lengths,
                y_lengths=y_lengths,
            )

            loss = sum([l_mle, l_length])

            return l_mle, l_length, logdet, loss, attn

        # PTL-specfic methods
        def training_step(self, batch, batch_idx):
            y, y_lengths, x, x_lengths = batch

            y, y_lengths = self.preprocessor(input_signal=y, length=y_lengths)

            l_mle, l_length, logdet, loss, _ = self.step(y, y_lengths, x, x_lengths)

            output = {
                "loss": loss,  # required
                "progress_bar": {"l_mle": l_mle, "l_length": l_length, "logdet": logdet},
                "log": {"loss": loss, "l_mle": l_mle, "l_length": l_length, "logdet": logdet},
            }

            return output
        ...

Neural Types in NeMo TTS
------------------------

NeMo Models and Neural Modules come with Neural Type checking.
Neural type checking is extremely useful when combining many different neural network architectures
for a production-grade application.

.. code-block:: python

    @typecheck(
        input_types={
            "x": NeuralType(('B', 'T'), TokenIndex()),
            "x_lengths": NeuralType(('B'), LengthsType()),
            "y": NeuralType(('B', 'D', 'T'), MelSpectrogramType(), optional=True),
            "y_lengths": NeuralType(('B'), LengthsType(), optional=True),
            "gen": NeuralType(optional=True),
            "noise_scale": NeuralType(optional=True),
            "length_scale": NeuralType(optional=True),
        }
    )
    def forward(self, *, x, x_lengths, y=None, y_lengths=None, gen=False, noise_scale=0.3, length_scale=1.0):
        ...

--------

Learn More
==========

Download pre-trained
`ASR <https://ngc.nvidia.com/catalog/models/nvidia:nemospeechmodels>`_,
`NLP <https://ngc.nvidia.com/catalog/models/nvidia:nemonlpmodels>`_,
and `TTS <https://ngc.nvidia.com/catalog/models/nvidia:nemospeechmodels>`_ models
on `NVIDIA NGC <https://ngc.nvidia.com/>`_ to quickly get started with NeMo.


Become an expert on Building Conversational AI applications with
our `tutorials <https://github.com/NVIDIA/NeMo#tutorials>`_,
and `example scripts <https://github.com/NVIDIA/NeMo/tree/main/examples>`_,

.. note:: Most NeMo tutorial notebooks can be run on `Google Colab <https://colab.research.google.com/notebooks/intro.ipynb>`_.

`NVIDIA NeMo <https://github.com/NVIDIA/NeMo>`_ is actively being developed on GitHub.
`Contributions <https://github.com/NVIDIA/NeMo/blob/main/CONTRIBUTING.md>`_ are welcome!

See our `developer guide <https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/>`_ for
more information on core NeMo concepts, ASR/NLP/TTS collections,
and the NeMo API.