Lightning Transformers¶
Quick Start¶
Installation¶
pip install lightning-transformers
Install bleeding-edge:
pip install git+https://github.com/PytorchLightning/lightning-transformers.git@master --upgrade
Install all optional dependencies as well:
pip install lightning-transformers["extra"]
Using Lightning-Transformers¶
Lightning Transformers provides LightningModules, LightningDataModules and Strategies to use 🤗 Transformers with the PyTorch Lightning Trainer, supporting tasks such as Language Modeling, Translation and more. To use, simply:
Pick a task to train (LightningModule)
Pick a dataset (LightningDataModule)
Here is an example of training bert-base-cased on the CARER emotion dataset using the Text Classification task.
import pytorch_lightning as pl
from transformers import AutoTokenizer
from lightning_transformers.task.nlp.text_classification import (
TextClassificationDataModule,
TextClassificationTransformer,
)
tokenizer = AutoTokenizer.from_pretrained(
pretrained_model_name_or_path="bert-base-cased"
)
dm = TextClassificationDataModule(
batch_size=1,
dataset_name="emotion",
max_length=512,
tokenizer=tokenizer,
)
model = TextClassificationTransformer(pretrained_model_name_or_path="bert-base-cased", num_labels=dm.num_classes)
trainer = pl.Trainer(accelerator="auto", devices="auto", max_epochs=1)
trainer.fit(model, dm)
Changing the Optimizer¶
Swapping to the RMSProp optimizer:
import pytorch_lightning as pl
import torch
import transformers
from transformers import AutoTokenizer
from lightning_transformers.task.nlp.text_classification import (
TextClassificationDataModule,
TextClassificationTransformer,
)
class RMSPropTransformer(TextClassificationTransformer):
def configure_optimizers(self):
optimizer = torch.optim.RMSprop(self.parameters(), lr=1e-5)
# automatically find the total number of steps we need!
num_training_steps, num_warmup_steps = self.compute_warmup(self.num_training_steps, num_warmup_steps=0.1)
scheduler = transformers.get_linear_schedule_with_warmup(
optimizer, num_warmup_steps=num_warmup_steps, num_training_steps=num_training_steps
)
return {
"optimizer": optimizer,
"lr_scheduler": {"scheduler": scheduler, "interval": "step", "frequency": 1},
}
tokenizer = AutoTokenizer.from_pretrained(
pretrained_model_name_or_path="bert-base-cased"
)
dm = TextClassificationDataModule(
batch_size=1,
dataset_name="emotion",
max_length=512,
tokenizer=tokenizer,
)
model = RMSPropTransformer(pretrained_model_name_or_path="bert-base-cased", num_labels=dm.num_classes)
trainer = pl.Trainer(accelerator="auto", devices="auto", max_epochs=1)
trainer.fit(model, dm)
Enabling DeepSpeed, Sharded Training, Mixed Precision and more is straightforward through the Lightning Trainer.
# enable DeepSpeed with 16bit precision
trainer = pl.Trainer(accelerator="auto", devices="auto", max_epochs=1, strategy='deepspeed', precision=16)
# enable DeepSpeed ZeRO Stage 3 with BFLOAT16 precision
trainer = pl.Trainer(accelerator="auto", devices="auto", max_epochs=1, strategy='deepspeed_stage_3_offload', precision="bf16")
# enable Sharded Training with 16bit precision
trainer = pl.Trainer(accelerator="auto", devices="auto", max_epochs=1, strategy='ddp_sharded', precision=16)
Inference¶
from transformers import AutoTokenizer
from lightning_transformers.task.nlp.text_classification import TextClassificationTransformer
model = TextClassificationTransformer(
pretrained_model_name_or_path="bert-base-uncased",
tokenizer=AutoTokenizer.from_pretrained(pretrained_model_name_or_path="bert-base-uncased"),
)
model.hf_predict("Lightning rocks!")
# Returns [{'label': 'LABEL_0', 'score': 0.545...}]
Language Modeling¶
The Task¶
Causal Language Modeling is the vanilla autoregressive pre-training method common to most language models such as GPT-3 or CTRL (excluding BERT-like models, which were pre-trained using the Masked Language Modeling objective).
During training, we minimize the negative log-likelihood of spans of text data (usually within some context window/block size). The model can only attend to the left context, i.e. tokens to the left of the current position. When trained on large quantities of text data, this yields strong language models such as GPT-3 that can be used for downstream tasks.
Datasets¶
Currently supports the wikitext2 dataset, or custom input files. Since this task is usually the pre-training objective for Transformers, it can be used to train new language models from scratch or to fine-tune a language model on your own unlabeled text data.
Usage¶
Language Models pre-trained or fine-tuned to the Causal Language Modeling task can then be used in generative predictions.
import pytorch_lightning as pl
from transformers import AutoTokenizer
from lightning_transformers.task.nlp.language_modeling import (
LanguageModelingDataModule,
LanguageModelingTransformer,
)
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path="gpt2")
model = LanguageModelingTransformer(pretrained_model_name_or_path="gpt2")
dm = LanguageModelingDataModule(
batch_size=1,
dataset_name="wikitext",
dataset_config_name="wikitext-2-raw-v1",
tokenizer=tokenizer,
)
trainer = pl.Trainer(accelerator="auto", devices="auto", max_epochs=1)
trainer.fit(model, dm)
We report the Cross Entropy Loss for validation.
Language Modeling Using Your Own Files¶
To use custom text files, the files should contain the raw data you want to train and validate on.
During data pre-processing the text is flattened, and the model is trained and validated on context windows (block size) made from the input text. We override the dataset files, allowing us to still use the data transforms defined with the base datamodule.
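For illustration only, the block-size windowing roughly corresponds to concatenating the tokenized text and slicing it into fixed-length chunks. A minimal sketch (the block size of 8 is arbitrary, and tokenizer is the same HF tokenizer used in the training example above):
# rough illustration of how flattened text becomes fixed-size training blocks
tokens = tokenizer("this is the first sentence this is the second sentence")["input_ids"]
block_size = 8
blocks = [tokens[i : i + block_size] for i in range(0, len(tokens) // block_size * block_size, block_size)]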
Below we have defined a CSV file to use as our input data.
text,
this is the first sentence,
this is the second sentence,
from lightning_transformers.task.nlp.language_modeling import (
LanguageModelingDataModule,
)
dm = LanguageModelingDataModule(
batch_size=1,
train_file="path/train.csv",
validation_file="/path/valid.csv",
tokenizer=tokenizer,
)
Language Modeling Inference Pipeline¶
By default we use the text generation pipeline, which requires a conditional input string and generates an output string.
from transformers import AutoTokenizer
from lightning_transformers.task.nlp.language_modeling import LanguageModelingTransformer
model = LanguageModelingTransformer(
pretrained_model_name_or_path="prajjwal1/bert-tiny",
tokenizer=AutoTokenizer.from_pretrained(pretrained_model_name_or_path="prajjwal1/bert-tiny"),
)
model.hf_predict("The house:")
Multiple Choice¶
The Task¶
The Multiple Choice task requires the model to choose between a set of options, given a question and optional context.
Similar to the text classification task, the model is fine-tuned on multi-class classification to provide probabilities across all possible answers. This is useful if the data you’d like the model to predict on requires selecting from a set of answers based on context or questions, where the answers can be variable. In contrast, use the text classification task if the answers remain static and do not need to be included during training.
Datasets¶
Currently supports the RACE and SWAG datasets, or custom input files.
Question: What color is the sky?
Answers:
A: Blue
B: Green
C: Red
Model answer: A
Training¶
import pytorch_lightning as pl
from transformers import AutoTokenizer
from lightning_transformers.task.nlp.multiple_choice import (
MultipleChoiceTransformer,
SwagMultipleChoiceDataModule,
)
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path="bert-base-uncased")
model = MultipleChoiceTransformer(pretrained_model_name_or_path="bert-base-uncased")
dm = SwagMultipleChoiceDataModule(
batch_size=1,
dataset_config_name="regular",
padding=False,
tokenizer=tokenizer,
)
trainer = pl.Trainer(accelerator="auto", devices="auto", max_epochs=1)
trainer.fit(model, dm)
We report Cross Entropy Loss, Precision, Recall and Accuracy for validation.
Multiple Choice Using Your Own Files¶
To use custom text files, the files should contain the data you want to train and validate on and be in CSV or JSON format as described below.
The format varies from dataset to dataset as input columns may differ, as well as pre-processing. To make our life easier, we use the RACE dataset format and override the files that are loaded.
Below we have defined a JSON file to use as our input data.
{
"article": "The man walked into the red house but couldn't see where the light was.",
"question": "What colour is the house?",
"options": ["White", "Red", "Blue"],
"answer": "Red"
}
We override the dataset files, allowing us to still use the data transforms defined with the RACE dataset.
from lightning_transformers.task.nlp.multiple_choice import (
RaceMultipleChoiceDataModule,
)
dm = RaceMultipleChoiceDataModule(
batch_size=1,
dataset_config_name="all",
padding=False,
train_file="path/train.json",
validation_file="/path/valid.json"
tokenizer=tokenizer,
)
Multiple Choice Inference¶
Currently there is no HF pipeline available for this task. Feel free to make an issue or PR if you require this functionality.
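As a workaround, you can run inference manually with the underlying 🤗 Transformers model. Below is a minimal sketch using a plain AutoModelForMultipleChoice; the checkpoint and inputs are illustrative only, and an un-fine-tuned head will give arbitrary answers.
import torch
from transformers import AutoModelForMultipleChoice, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
hf_model = AutoModelForMultipleChoice.from_pretrained("bert-base-uncased")
question = "What color is the sky?"
options = ["Blue", "Green", "Red"]
# encode the question against every option; inputs must be shaped (batch, num_choices, seq_len)
encoding = tokenizer([question] * len(options), options, return_tensors="pt", padding=True)
inputs = {k: v.unsqueeze(0) for k, v in encoding.items()}
with torch.no_grad():
    logits = hf_model(**inputs).logits  # shape (1, num_choices)
print(options[logits.argmax(dim=-1).item()])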
Question Answering¶
The Task¶
The Question Answering task requires the model to determine the start and end of a span within the given context that answers a given question. This allows the model to condition on contextual information when determining an answer.
Use this task when you would like to fine-tune on data where an answer can be extracted from context information. Since this is an extraction task, you can rely on most Transformer models as your backbone.
Datasets¶
Currently supports the SQuAD dataset or custom input text files.
Context: The ground is black, the sky is blue and the car is red.
Question: What color is the sky?
Model answer: {"answer": "the sky is blue", "start": 21, "end": 35}
Training¶
import pytorch_lightning as pl
from transformers import AutoTokenizer
from lightning_transformers.task.nlp.question_answering import (
QuestionAnsweringTransformer,
SquadDataModule,
)
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path="bert-base-uncased")
model = QuestionAnsweringTransformer(pretrained_model_name_or_path="bert-base-uncased")
dm = SquadDataModule(
batch_size=1,
dataset_config_name="plain_text",
max_length=384,
version_2_with_negative=False,
null_score_diff_threshold=0.0,
doc_stride=128,
n_best_size=20,
max_answer_length=30,
tokenizer=tokenizer,
)
trainer = pl.Trainer(accelerator="auto", devices="auto", max_epochs=1)
trainer.fit(model, dm)
Question Answering Using Your Own Files¶
To use custom text files, the files should contain newline-delimited JSON objects.
The format varies from dataset to dataset as input columns may differ, as well as pre-processing. To make our life easier, we use the SQuAD dataset format and override the files that are loaded.
{
"answers": {
"answer_start": [1],
"text": ["This is a test text"]
},
"context": "This is a test context.",
"id": "1",
"question": "Is this a test?",
"title": "train test"
}
We override the dataset files, allowing us to still use the data transforms defined with this dataset.
from lightning_transformers.task.nlp.question_answering import (
SquadDataModule,
)
dm = SquadDataModule(
batch_size=1,
dataset_config_name="plain_text",
max_length=384,
version_2_with_negative=False,
null_score_diff_threshold=0.0,
doc_stride=128,
n_best_size=20,
max_answer_length=30,
train_file="path/train.json",
validation_file="/path/valid.json",
tokenizer=tokenizer,
)
Question Answering Inference Pipeline¶
By default we use the question answering pipeline, which requires a context and a question as input.
from transformers import AutoTokenizer
from lightning_transformers.task.nlp.question_answering import QuestionAnsweringTransformer
model = QuestionAnsweringTransformer(
pretrained_model_name_or_path="sshleifer/tiny-gpt2",
tokenizer=AutoTokenizer.from_pretrained(pretrained_model_name_or_path="sshleifer/tiny-gpt2"),
)
model.hf_predict(dict(context="Lightning is great", question="What is great?"))
Summarization¶
The Task¶
The Summarization task requires the model to summarize a document into a shorter sentence.
Datasets¶
Currently supports the CNN/DailyMail and XSUM datasets, or custom input text files.
In the CNN/Daily Mail dataset, this involves taking long articles and summarizing them.
document: "The car was racing towards the tunnel, whilst blue lights were flashing behind it. The car entered the tunnel and vanished..."
Model answer: "Police are chasing a car entering a tunnel."
Training¶
To use this task, we must select a Seq2Seq Encoder/Decoder based model, such as T5 or BART. Single-stack models such as BERT (encoder-only) or GPT (decoder-only) will not work, as this task requires an encoder/decoder architecture.
import pytorch_lightning as pl
from transformers import AutoTokenizer
from lightning_transformers.task.nlp.summarization import (
SummarizationTransformer,
XsumSummarizationDataModule,
)
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path="t5-base")
model = SummarizationTransformer(
pretrained_model_name_or_path="t5-base",
use_stemmer=True,
val_target_max_length=142,
num_beams=None,
compute_generate_metrics=True,
)
dm = XsumSummarizationDataModule(
batch_size=1,
max_source_length=128,
max_target_length=128,
tokenizer=tokenizer,
)
trainer = pl.Trainer(accelerator="auto", devices=1, max_epochs=1)
trainer.fit(model, dm)
Summarization Using Your Own Files¶
To use custom text files, the files should contain newline-delimited JSON objects.
{
"source": "some-body",
"target": "some-sentence"
}
We override the dataset files, allowing us to still use the data transforms defined with this dataset.
from lightning_transformers.task.nlp.summarization import (
XsumSummarizationDataModule,
)
dm = XsumSummarizationDataModule(
batch_size=1,
max_source_length=128,
max_target_length=128,
train_file="path/train.json",
validation_file="/path/valid.json",
tokenizer=tokenizer,
)
Summarization Inference Pipeline¶
By default we use the summarization pipeline, which requires an input document as text.
from transformers import AutoTokenizer
from lightning_transformers.task.nlp.summarization import SummarizationTransformer
model = SummarizationTransformer(
pretrained_model_name_or_path="patrickvonplaten/t5-tiny-random",
tokenizer=AutoTokenizer.from_pretrained(pretrained_model_name_or_path="patrickvonplaten/t5-tiny-random"),
)
model.hf_predict(
"The results found significant improvements over all tasks evaluated",
min_length=2,
max_length=12,
)
Text Classification¶
The Task¶
The Text Classification Task fine-tunes the model to predict probabilities across a set of labels given input text. The task supports both binary and multi-class/multi-label classification.
Datasets¶
Currently supports the XNLI, GLUE and emotion datasets, or custom input files.
Input: I don't like this at all!
Model answer: {"label": "angry", "score": 0.8}
Training¶
Use this task when you would like to fine-tune Transformers on a labeled text classification task. For this task, you can rely on most Transformer models as your backbone.
import pytorch_lightning as pl
from transformers import AutoTokenizer
from lightning_transformers.task.nlp.text_classification import (
TextClassificationDataModule,
TextClassificationTransformer,
)
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path="bert-base-uncased")
dm = TextClassificationDataModule(
batch_size=1,
dataset_name="glue",
dataset_config_name="sst2",
max_length=512,
tokenizer=tokenizer,
)
model = TextClassificationTransformer(pretrained_model_name_or_path="bert-base-uncased", num_labels=dm.num_classes)
trainer = pl.Trainer(accelerator="auto", devices="auto", max_epochs=1)
trainer.fit(model, dm)
We report the Precision, Recall, Accuracy and Cross Entropy Loss for validation.
Text Classification Using Your Own Files¶
To use custom text files, the files should contain newline-delimited JSON objects.
The label mapping is automatically generated from the training dataset labels if no mapping is given.
{
"label": "sad",
"text": "I'm feeling quite sad and sorry for myself but I'll snap out of it soon."
}
from lightning_transformers.task.nlp.text_classification import (
TextClassificationDataModule,
TextClassificationTransformer,
)
dm = TextClassificationDataModule(
batch_size=1,
max_length=512,
train_file="path/train.json",
validation_file="/path/valid.json",
tokenizer=tokenizer,
)
Text Classification Inference Pipeline¶
By default we use the sentiment-analysis pipeline, which requires an input string.
from transformers import AutoTokenizer
from lightning_transformers.task.nlp.text_classification import TextClassificationTransformer
model = TextClassificationTransformer(
pretrained_model_name_or_path="prajjwal1/bert-tiny",
tokenizer=AutoTokenizer.from_pretrained(pretrained_model_name_or_path="prajjwal1/bert-tiny"),
)
model.hf_predict("Lightning rocks!")
Token Classification¶
The Task¶
The Token Classification task is similar to text classification, except each token within the text receives a prediction. A common use of this task is Named Entity Recognition (NER). Use this task if you require your data to be classified at the token level.
Training¶
import pytorch_lightning as pl
from transformers import AutoTokenizer
from lightning_transformers.task.nlp.token_classification import (
TokenClassificationDataModule,
TokenClassificationTransformer,
)
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path="bert-base-uncased")
dm = TokenClassificationDataModule(
batch_size=1,
task_name="ner",
dataset_name="conll2003",
preprocessing_num_workers=1,
label_all_tokens=False,
revision="master",
tokenizer=tokenizer,
)
model = TokenClassificationTransformer(pretrained_model_name_or_path="bert-base-uncased", labels=dm.num_classes)
trainer = pl.Trainer(accelerator="auto", devices="auto", max_epochs=1)
trainer.fit(model, dm)
We report the Precision, Recall, Accuracy and Cross Entropy Loss for validation.
Token Classification Using Your Own Files¶
To use custom text files, the files should contain newline-delimited JSON objects. For each token, there should be an associated label.
{
"label_tags": [11, 12, 12, 21, 13, 11, 11, 21, 13, 11, 12, 13, 11, 21, 22, 11, 12, 17, 11, 21, 17, 11, 12, 12, 21, 22, 22, 13, 11, 0],
"tokens": ["The", "European", "Commission", "said", "on", "Thursday", "it", "disagreed", "with", "German", "advice", "to", "consumers"]
}
from lightning_transformers.task.nlp.token_classification import TokenClassificationDataModule
dm = TokenClassificationDataModule(
batch_size=1,
task_name="ner",
dataset_name="conll2003",
preprocessing_num_workers=1,
label_all_tokens=False,
revision="master",
train_file="path/train.json",
validation_file="/path/valid.json",
tokenizer=tokenizer,
)
Token Classification Inference Pipeline¶
By default we use the NER pipeline, which requires an input sequence string and the number of labels.
from transformers import AutoTokenizer
from lightning_transformers.task.nlp.token_classification import TokenClassificationTransformer
model = TokenClassificationTransformer(
pretrained_model_name_or_path="prajjwal1/bert-tiny",
tokenizer=AutoTokenizer.from_pretrained(pretrained_model_name_or_path="prajjwal1/bert-tiny"),
labels=2,
)
model.hf_predict("Have a good day!")
Translation¶
The Task¶
The Translation task fine-tunes the model to translate text from one language to another.
Datasets¶
Currently supports the WMT16 dataset or custom input text files.
Input Text (English): "The ground is black, the sky is blue and the car is red."
Model Output (German): "Der Boden ist schwarz, der Himmel ist blau und das Auto ist rot."
Training¶
To use this task, select a Seq2Seq Encoder/Decoder based model, such as multi-lingual T5 or BART. Single-stack models such as BERT (encoder-only) or GPT (decoder-only) will not work, as this task requires an encoder/decoder architecture. In addition, you need a tokenizer that has been trained on multi-lingual text, which is the case for mT5 and mBART.
import pytorch_lightning as pl
from transformers import AutoTokenizer
from lightning_transformers.task.nlp.translation import (
TranslationTransformer,
WMT16TranslationDataModule,
)
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path="t5-base")
model = TranslationTransformer(
pretrained_model_name_or_path="t5-base",
n_gram=4,
smooth=False,
val_target_max_length=142,
num_beams=None,
compute_generate_metrics=True,
)
dm = WMT16TranslationDataModule(
# WMT translation datasets: ['cs-en', 'de-en', 'fi-en', 'ro-en', 'ru-en', 'tr-en']
dataset_config_name="ro-en",
source_language="en",
target_language="ro",
max_source_length=128,
max_target_length=128,
tokenizer=tokenizer,
)
trainer = pl.Trainer(accelerator="auto", devices="auto", max_epochs=1)
trainer.fit(model, dm)
Translation Using Your Own Files¶
To use custom text files, the files should contain newline-delimited JSON objects.
{
"source": "example source text",
"target": "example target text"
}
We override the dataset files, allowing us to still use the data transforms defined with this dataset.
from lightning_transformers.task.nlp.translation import WMT16TranslationDataModule
dm = WMT16TranslationDataModule(
# WMT translation datasets: ['cs-en', 'de-en', 'fi-en', 'ro-en', 'ru-en', 'tr-en']
dataset_config_name="ro-en",
source_language="en",
target_language="ro",
max_source_length=128,
max_target_length=128,
train_file="path/train.json",
validation_file="/path/valid.json",
tokenizer=tokenizer,
)
Translation Inference Pipeline¶
By default we use the translation pipeline, which requires a source text string.
from transformers import AutoTokenizer
from lightning_transformers.task.nlp.translation import TranslationTransformer
model = TranslationTransformer(
pretrained_model_name_or_path="patrickvonplaten/t5-tiny-random",
tokenizer=AutoTokenizer.from_pretrained(pretrained_model_name_or_path="patrickvonplaten/t5-tiny-random"),
)
model.hf_predict("¡Hola Sean!")
Image Classification¶
The Task¶
Image classification is the task of assigning a label or class to an image.
Usage¶
Models pre-trained or fine-tuned on image classification can then be used to predict labels for new images.
import pytorch_lightning as pl
from transformers import AutoFeatureExtractor
from lightning_transformers.task.vision.image_classification import (
ImageClassificationDataModule,
ImageClassificationTransformer,
)
feature_extractor = AutoFeatureExtractor.from_pretrained(pretrained_model_name_or_path="nateraw/vit-base-beans")
dm = ImageClassificationDataModule(
batch_size=8,
dataset_name="beans",
num_workers=8,
feature_extractor=feature_extractor,
)
model = ImageClassificationTransformer(pretrained_model_name_or_path="nateraw/vit-base-beans", num_labels=dm.num_classes)
trainer = pl.Trainer(accelerator="auto", devices="auto", max_epochs=5)
trainer.fit(model, dm)
We report the Cross Entropy Loss for validation.
Image Classification Inference Pipeline¶
from transformers import AutoFeatureExtractor
from lightning_transformers.task.vision.image_classification import ImageClassificationTransformer
model = ImageClassificationTransformer(
pretrained_model_name_or_path="nateraw/vit-base-beans",
tokenizer=AutoFeatureExtractor.from_pretrained(pretrained_model_name_or_path="nateraw/vit-base-beans"),
)
# predict on the logo
model.hf_predict(
"https://github.com/PyTorchLightning/lightning-transformers/blob/master/"
"docs/source/_static/images/logo.png?raw=true"
)
SparseML¶
SparseML provides GPU-class performance on CPUs through sparsification, pruning, and quantization. For more details, see SparseML docs.
With multiple machines, the command has to be run on all machines either manually, or using an orchestration system such as SLURM or TorchElastic. More information can be found in the PyTorch Lightning Computing Cluster documentation.
We provide out of the box configs to use SparseML. Just pass the SparseML Callback when training.
import pytorch_lightning as pl
from lightning_transformers.callbacks import TransformerSparseMLCallback
pl.Trainer(
callbacks=TransformerSparseMLCallback(
output_dir="/content/MODELS",
recipe_path="/content/recipe.yaml"
)
)
These commands are only useful when a recipe has already been created. Example recipes can be found in the SparseML repository.
After training, this will leave two ONNX models in the trainer.callbacks.output_dir folder: small_model.onnx and model.onnx. small_model.onnx is excellent for demos. For reliable inference, it is recommended to optimize model.onnx with your compression algorithm.
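As an illustration only, the exported model.onnx can be loaded with a generic ONNX runtime. The path below matches the callback configuration above, while the model type and input names are assumptions that depend on what was actually exported.
import onnxruntime as ort
from transformers import AutoTokenizer
session = ort.InferenceSession("/content/MODELS/model.onnx")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # assumption: a BERT-style text model was exported
encoding = tokenizer("Lightning rocks!", return_tensors="np")
# feed only the inputs that the exported graph actually declares
feeds = {i.name: encoding[i.name] for i in session.get_inputs() if i.name in encoding}
outputs = session.run(None, feeds)
print(outputs[0])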
HuggingFace Hub Checkpoints¶
By default, Lightning Transformers saves PyTorch-based checkpoints.
HuggingFace Transformers provides a separate API for saving checkpoints. Below we describe two ways to save HuggingFace checkpoints manually or during training.
To manually save checkpoints from your model:
from lightning_transformers.task.nlp.text_classification import TextClassificationTransformer
model = TextClassificationTransformer(pretrained_model_name_or_path="prajjwal1/bert-tiny")
# saves a HF checkpoint to this path.
model.save_hf_checkpoint("checkpoint")
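The directory written by save_hf_checkpoint is a standard 🤗 Transformers checkpoint, so it can be reloaded with the usual from_pretrained API. A sketch assuming the checkpoint above contains the model weights and config; the tokenizer is reloaded from the original model name:
from transformers import AutoModelForSequenceClassification, AutoTokenizer
hf_model = AutoModelForSequenceClassification.from_pretrained("checkpoint")
hf_tokenizer = AutoTokenizer.from_pretrained("prajjwal1/bert-tiny")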
To save an additional HF checkpoint every time the checkpoint callback saves, pass in the HFSaveCheckpoint plugin:
import pytorch_lightning as pl
from transformers import AutoTokenizer
from lightning_transformers.plugins.checkpoint import HFSaveCheckpoint
from lightning_transformers.task.nlp.text_classification import (
TextClassificationDataModule,
TextClassificationTransformer,
)
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path="prajjwal1/bert-tiny")
dm = TextClassificationDataModule(
batch_size=1,
dataset_name="glue",
dataset_config_name="sst2",
max_length=512,
tokenizer=tokenizer,
)
model = TextClassificationTransformer(pretrained_model_name_or_path="prajjwal1/bert-tiny")
trainer = pl.Trainer(plugins=HFSaveCheckpoint(model=model))
trainer.fit(model, dm)
Big Transformers Model Inference¶
Lightning Transformers provides out-of-the-box support for running inference with very large, billion-parameter models. Under the hood we use HF Transformers' large model support to auto-select devices for optimal throughput and memory usage.
Below is an example of how you can run generation with a large 6B parameter transformer model using Lightning Transformers. We've also managed to run bigscience/bloom, which has 176B parameters, on 8 A100s with the code below.
pip install accelerate
import torch
from transformers import AutoTokenizer
from lightning_transformers.task.nlp.language_modeling import LanguageModelingTransformer
model = LanguageModelingTransformer(
pretrained_model_name_or_path="EleutherAI/gpt-j-6B",
tokenizer=AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B"),
low_cpu_mem_usage=True,
device_map="auto",
)
output = model.generate("Hello, my name is", device=torch.device("cuda"))
print(model.tokenizer.decode(output[0].tolist()))
This will allow the model to be split across GPUs/CPUs and even offloaded to disk to optimize memory usage.
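As a sketch only: since the example above forwards low_cpu_mem_usage and device_map to the underlying from_pretrained call, you can likely constrain placement the same way. max_memory and offload_folder are standard 🤗 Transformers/Accelerate arguments, but whether LanguageModelingTransformer forwards them is an assumption.
# a sketch only: assumes extra kwargs are forwarded to from_pretrained
model = LanguageModelingTransformer(
    pretrained_model_name_or_path="EleutherAI/gpt-j-6B",
    tokenizer=AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B"),
    low_cpu_mem_usage=True,
    device_map="auto",
    max_memory={0: "20GiB", "cpu": "60GiB"},  # cap per-device usage
    offload_folder="offload",  # spill remaining weights to disk
)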
Inference with Manual Checkpoints¶
Download the sharded checkpoint weights that we’ll be using:
git clone https://huggingface.co/sgugger/sharded-gpt-j-6B
cd sharded-gpt-j-6B
git-lfs install
git pull
import torch
from accelerate import init_empty_weights
from transformers import AutoTokenizer
from lightning_transformers.task.nlp.language_modeling import LanguageModelingTransformer
# initializes an empty model for us to load the checkpoint into
with init_empty_weights():
model = LanguageModelingTransformer(
pretrained_model_name_or_path="EleutherAI/gpt-j-6B",
tokenizer=AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
)
# automatically selects the best devices (cpu/gpu) to load model layers based on available memory
model.load_checkpoint_and_dispatch("sharded-gpt-j-6B", device_map="auto", no_split_module_classes=["GPTJBlock"])
output = model.generate("Hello, my name is", device=torch.device("cuda"))
print(model.tokenizer.decode(output[0].tolist()))
To see more details about the API, see the Hugging Face Accelerate documentation.
DeepSpeed Training with Big Transformer Models¶
Below is an example of how you can train a 6B parameter transformer model using Lightning Transformers and DeepSpeed.
The script below was tested on a machine with 8 A100 GPUs.
import pytorch_lightning as pl
from transformers import AutoTokenizer
from lightning_transformers.task.nlp.language_modeling import LanguageModelingDataModule, LanguageModelingTransformer
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path="gpt2")
model = LanguageModelingTransformer(
pretrained_model_name_or_path="EleutherAI/gpt-j-6B",
tokenizer=AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B"),
deepspeed_sharding=True # defer initialization of the model to shard/load pre-train weights
)
dm = LanguageModelingDataModule(
batch_size=1,
dataset_name="wikitext",
dataset_config_name="wikitext-2-raw-v1",
tokenizer=tokenizer,
)
trainer = pl.Trainer(accelerator="gpu", devices="auto", strategy="deepspeed_stage_3", precision=16, max_epochs=1)
trainer.fit(model, dm)
If you have your own pl.LightningModule, you can use DeepSpeed ZeRO Stage 3 parameter sharding with Transformers as well; just add this code:
from typing import Optional
import pytorch_lightning as pl
from transformers import T5ForConditionalGeneration
from lightning_transformers.utilities.deepspeed import enable_transformers_pretrained_deepspeed_sharding
class MyModel(pl.LightningModule):
def setup(self, stage: Optional[str] = None) -> None:
if not hasattr(self, "ptlm"):
enable_transformers_pretrained_deepspeed_sharding(self)
self.ptlm = T5ForConditionalGeneration.from_pretrained("t5-11b")
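A sketch of how such a module could then be trained, mirroring the Trainer settings used above. MyDataModule stands in for your own LightningDataModule and is hypothetical:
model = MyModel()
dm = MyDataModule()  # hypothetical: your own LightningDataModule
trainer = pl.Trainer(accelerator="gpu", devices=8, strategy="deepspeed_stage_3", precision=16, max_epochs=1)
trainer.fit(model, dm)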
Custom Data Files¶
In most cases when training/validating/testing on custom files, you’ll be able to do so without modifying any code, using the general data module classes directly.
Below we show, per task, how to fine-tune/validate/test on your own files or modify the logic within the data classes. Some tasks are more involved than others, as they may require more data processing.
Custom Subset Names (Edge Cases such as MNLI)¶
Some datasets, such as MNLI when loaded from the Huggingface datasets library, have special subset names that don’t match the standard train/validation/test convention. Specifically, MNLI has two validation and two test sets, with flavors ‘matched’ and ‘mismatched’. When using such datasets, you must manually indicate which subset names you want to use for each of train/validation/test.
An example of how to train and validate on MNLI would be the following:
from lightning_transformers.task.nlp.text_classification import (
TextClassificationDataModule,
TextClassificationTransformer,
)
dm = TextClassificationDataModule(
batch_size=1,
dataset_name="glue",
dataset_config_name="mnli",
max_length=512,
validation_subset_name="validation_matched",
tokenizer=tokenizer,
)
Language Modeling Using Your Own Files¶
To use custom text files, the files should contain the raw data you want to train and validate on.
During data pre-processing the text is flattened, and the model is trained and validated on context windows (block size) made from the input text. We override the dataset files, allowing us to still use the data transforms defined with the base datamodule.
Below we have defined a CSV file to use as our input data.
text,
this is the first sentence,
this is the second sentence,
from lightning_transformers.task.nlp.language_modeling import (
LanguageModelingDataModule,
)
dm = LanguageModelingDataModule(
batch_size=1,
train_file="path/train.csv",
validation_file="/path/valid.csv",
tokenizer=tokenizer,
)
Multiple Choice Using Your Own Files¶
To use custom text files, the files should contain the data you want to train and validate on and be in CSV or JSON format as described below.
The format varies from dataset to dataset as input columns may differ, as well as pre-processing. To make our life easier, we use the RACE dataset format and override the files that are loaded.
Below we have defined a JSON file to use as our input data.
{
"article": "The man walked into the red house but couldn't see where the light was.",
"question": "What colour is the house?",
"options": ["White", "Red", "Blue"],
"answer": "Red"
}
We override the dataset files, allowing us to still use the data transforms defined with the RACE dataset.
from lightning_transformers.task.nlp.multiple_choice import (
RaceMultipleChoiceDataModule,
)
dm = RaceMultipleChoiceDataModule(
batch_size=1,
dataset_config_name="all",
padding=False,
train_file="path/train.json",
validation_file="/path/valid.json",
tokenizer=tokenizer,
)
Question Answering Using Your Own Files¶
To use custom text files, the files should contain newline-delimited JSON objects.
The format varies from dataset to dataset as input columns may differ, as well as pre-processing. To make our life easier, we use the SQuAD dataset format and override the files that are loaded.
{
"answers": {
"answer_start": [1],
"text": ["This is a test text"]
},
"context": "This is a test context.",
"id": "1",
"question": "Is this a test?",
"title": "train test"
}
We override the dataset files, allowing us to still use the data transforms defined with this dataset.
from lightning_transformers.task.nlp.question_answering import (
SquadDataModule,
)
dm = SquadDataModule(
batch_size=1,
dataset_config_name="plain_text",
max_length=384,
version_2_with_negative=False,
null_score_diff_threshold=0.0,
doc_stride=128,
n_best_size=20,
max_answer_length=30,
train_file="path/train.json",
validation_file="/path/valid.json",
tokenizer=tokenizer,
)
Summarization Using Your Own Files¶
To use custom text files, the files should contain newline-delimited JSON objects.
{
"source": "some-body",
"target": "some-sentence"
}
We override the dataset files, allowing us to still use the data transforms defined with this dataset.
from lightning_transformers.task.nlp.summarization import (
XsumSummarizationDataModule,
)
dm = XsumSummarizationDataModule(
batch_size=1,
max_source_length=128,
max_target_length=128,
train_file="path/train.json",
validation_file="/path/valid.json",
tokenizer=tokenizer,
)
Text Classification Using Your Own Files¶
To use custom text files, the files should contain newline-delimited JSON objects.
The label mapping is automatically generated from the training dataset labels if no mapping is given.
{
"label": "sad",
"text": "I'm feeling quite sad and sorry for myself but I'll snap out of it soon."
}
from lightning_transformers.task.nlp.text_classification import (
TextClassificationDataModule,
TextClassificationTransformer,
)
dm = TextClassificationDataModule(
batch_size=1,
max_length=512,
train_file="path/train.json",
validation_file="/path/valid.json",
tokenizer=tokenizer,
)
Token Classification Using Your Own Files¶
To use custom text files, the files should contain newline-delimited JSON objects. For each token, there should be an associated label.
{
"label_tags": [11, 12, 12, 21, 13, 11, 11, 21, 13, 11, 12, 13, 11, 21, 22, 11, 12, 17, 11, 21, 17, 11, 12, 12, 21, 22, 22, 13, 11, 0],
"tokens": ["The", "European", "Commission", "said", "on", "Thursday", "it", "disagreed", "with", "German", "advice", "to", "consumers"]
}
from lightning_transformers.task.nlp.token_classification import TokenClassificationDataModule
dm = TokenClassificationDataModule(
batch_size=1,
task_name="ner",
dataset_name="conll2003",
preprocessing_num_workers=1,
label_all_tokens=False,
revision="master",
train_file="path/train.json",
validation_file="/path/valid.json",
tokenizer=tokenizer,
)
Translation Using Your Own Files¶
To use custom text files, the files should contain newline-delimited JSON objects.
{
"source": "example source text",
"target": "example target text"
}
We override the dataset files, allowing us to still use the data transforms defined with this dataset.
from lightning_transformers.task.nlp.translation import WMT16TranslationDataModule
dm = WMT16TranslationDataModule(
# WMT translation datasets: ['cs-en', 'de-en', 'fi-en', 'ro-en', 'ru-en', 'tr-en']
dataset_config_name="ro-en",
source_language="en",
target_language="ro",
max_source_length=128,
max_target_length=128,
train_file="path/train.json",
validation_file="/path/valid.json",
tokenizer=tokenizer,
)
Customizing Datasets¶
You can use Lightning Transformers tasks on custom datasets by extending the base DataModule classes to implement your own data processing logic.
This is useful when you have specific data processing you’d like to apply when training/validating or testing using a task, or would like to modify how data is loaded/transformed for the model.
Currently we have examples for two tasks (one encoder, one encoder/decoder), more examples coming soon!
Language Modeling using Custom Data Processing¶
Below we show how to override data processing logic.
Extend the LanguageModelingDataModule base class¶
The base data module can be used to modify this code, and follows a simple pattern. Internally the dataset is loaded via HuggingFace Datasets, which returns an Apache Arrow backed Dataset. This data format is easy to transform and modify using map functions, which you’ll see within the class.
class LanguageModelingDataModule(HFDataModule):
def process_data(self, dataset: Dataset, stage: Optional[str] = None) -> Dataset:
# `process_data` converts the dataset into features.
# The dataset is pre-loaded using `load_dataset`.
...
return dataset
@staticmethod
def tokenize_function(
examples,
tokenizer: Union[PreTrainedTokenizerBase],
text_column_name: str = None,
):
# tokenizes the data in a specific column using the AutoTokenizer,
# called by `process_data`
return tokenizer(examples[text_column_name])
@staticmethod
def convert_to_features(examples, block_size: int = None):
# `process_data` calls this function to convert samples in the dataset into features
...
@property
def collate_fn(self) -> Callable:
# describes how to collate the samples into the batch given to the model
return default_data_collator
Extend LanguageModelingDataModule, like this:
from lightning_transformers.task.nlp.language_modeling import LanguageModelingDataModule
class MyLanguageModelingDataModule(LanguageModelingDataModule):
...
Make any changes you’d like to the dataset processing via the hooks.
Below we have the pseudo-code version to show where most of the changes happen within the hooks:
from functools import partial
from typing import Optional
from datasets import Dataset
from transformers import PreTrainedTokenizerBase
from lightning_transformers.task.nlp.language_modeling import LanguageModelingDataModule
class MyLanguageModelingDataModule(LanguageModelingDataModule):
def __init__(self, tokenizer: PreTrainedTokenizerBase, *args, **kwargs):
super().__init__(tokenizer, *args, **kwargs)
self.tokenized_condition_term = tokenizer("This is a story: ")
def process_data(self, dataset: Dataset, stage: Optional[str] = None) -> Dataset:
...
# Pass in our additional condition term when converting to features
convert_to_features = partial(
self.convert_to_features,
block_size=self.effective_block_size,
tokenized_condition_term=self.tokenized_condition_term
)
...
return dataset
@staticmethod
def convert_to_features(examples, block_size: int, **kwargs):
# Our argument is passed in via kwargs
tokenized_condition_term = kwargs['tokenized_condition_term']
...
# Add the term to the tokenized blocks of text
result = {
k: [tokenized_condition_term + t[i:i + block_size] for i in range(0, total_length, block_size)]
for k, t in concatenated_examples.items()
}
result["labels"] = result["input_ids"].copy()
return result
Translation using Custom Data Processing¶
Below we show how to override data processing logic.
Extend the TranslationDataModule base class¶
The base data module can be used to modify this code, and follows a simple pattern. Internally the dataset is loaded via HuggingFace Datasets, which returns an Apache Arrow backed Dataset. This data format is easy to transform and modify using map functions, which you’ll see within the class.
class TranslationDataModule(Seq2SeqDataModule):
@property
def source_target_column_names(self) -> Tuple[str, str]:
return self.cfg.source_language, self.cfg.target_language
...
class Seq2SeqDataModule(HFDataModule):
def process_data(self, dataset: Dataset, stage: Optional[str] = None) -> Dataset:
# `process_data` converts the dataset into features.
# The dataset is pre-loaded using `load_dataset`.
...
return dataset
@property
def source_target_column_names(self) -> Tuple[str, str]:
return 'source', 'target'
@staticmethod
def convert_to_features(examples, block_size: int = None):
# `process_data` calls this function to convert samples in the dataset into features
...
@property
def collate_fn(self) -> Callable:
# describes how to collate the samples into the batch given to the model
return default_data_collator
Extend TranslationDataModule, like this:
from lightning_transformers.task.nlp.translation import TranslationDataModule
class MyTranslationDataModule(TranslationDataModule):
...
Make any changes you’d like to the dataset processing via the hooks.
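For example, here is a minimal sketch (the column names are illustrative) that overrides source_target_column_names so the datamodule reads custom columns instead of the language codes:
from typing import Tuple
from lightning_transformers.task.nlp.translation import TranslationDataModule
class MyTranslationDataModule(TranslationDataModule):
    @property
    def source_target_column_names(self) -> Tuple[str, str]:
        # assumption: your dataset stores the source/target texts in these columns
        return "english_text", "romanian_text"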
Customizing Tasks¶
Below we describe how you can customize the Language Modeling Task. In our example, we add weight noise during training and freeze the backbone. For the purpose of this example we freeze the backbone within the Task, however we recommend doing this via a Callback, as seen in the Freeze Embeddings Callback.
Tasks are based on an AutoModel Transformer, which handles all the internal logic when running the forward pass through the model and the loss calculation for a specific task. Below are the steps to customize a task within the LightningModule.
Inherit from Lightning Transformers Base Class
Add custom task logic
1. Inherit from Lightning Transformers Base Class¶
For our example, we inherit from the Language Modeling base class.
from lightning_transformers.task.nlp.language_modeling import LanguageModelingTransformer
class MyLanguageModelingTransformer(LanguageModelingTransformer):
...
Typically you’d store the file within the lightning_transformers/task/ directory, in the appropriate task folder.
In our example, we’d store our file in lightning_transformers/task/language_modeling/custom_model.py.
2. Add Custom Task Logic¶
The class follows a standard pl.LightningModule, thus all hooks and logic can be overridden easily.
Below we override the training step to add our logic, as well as setup to freeze the model before training.
The LMHeadAutoModel task provides separate keys for the backbone and the fully connected layer.
import torch
from lightning_transformers.task.nlp.language_modeling import LanguageModelingTransformer
class MyLanguageModelingTransformer(LanguageModelingTransformer):
def setup(self, stage):
# Freeze BERT backbone
for param in self.model.bert.parameters():
param.requires_grad = False
def training_step(self, batch, batch_idx):
loss = super().training_step(batch, batch_idx)
# Add weight noise every training step
with torch.no_grad():
for param in self.model.parameters():
param.add_(torch.randn(param.size()) * 0.1)
return loss
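A sketch of how the customized task could then be trained, mirroring the Language Modeling training example above. The backbone is illustrative, and the setup hook above assumes the underlying HF model exposes a .bert attribute:
import pytorch_lightning as pl
from transformers import AutoTokenizer
from lightning_transformers.task.nlp.language_modeling import LanguageModelingDataModule
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path="prajjwal1/bert-tiny")
model = MyLanguageModelingTransformer(pretrained_model_name_or_path="prajjwal1/bert-tiny")
dm = LanguageModelingDataModule(
    batch_size=1,
    dataset_name="wikitext",
    dataset_config_name="wikitext-2-raw-v1",
    tokenizer=tokenizer,
)
trainer = pl.Trainer(accelerator="auto", devices="auto", max_epochs=1)
trainer.fit(model, dm)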