Language Modeling
The Task
Causal Language Modeling is the standard autoregressive pre-training method used by most generative language models, such as GPT-3 or CTRL. BERT-like models are the exception: they were pre-trained with the Masked Language Modeling method instead.
During training, we maximize the likelihood of each token given the tokens before it (equivalently, we minimize the cross-entropy loss of next-token prediction) across spans of text data, usually limited to a fixed context window or block size. At each position the model can attend only to its left context, that is, the tokens preceding the one being predicted. When trained on large quantities of text data, this gives us strong language models such as GPT-3 to use for downstream tasks.
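As a rough illustration of this objective, the sketch below computes the mean next-token negative log-likelihood for one sequence and builds the lower-triangular mask that limits each position to its left context. The names `causal_mask` and `causal_lm_loss` and the uniform toy "model" are hypothetical and not part of lightning_transformers; a real model would produce the probabilities with a Transformer.

```python
import math


def causal_mask(seq_len):
    """Lower-triangular mask: position i may attend to positions 0..i only."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]


def causal_lm_loss(token_ids, next_token_probs):
    """Mean negative log-likelihood of each token given the tokens before it.

    next_token_probs[t] is the (hypothetical) model's distribution over the
    vocabulary for the token at position t + 1, given token_ids[: t + 1].
    """
    num_predictions = len(token_ids) - 1
    nll = 0.0
    for t in range(num_predictions):
        target = token_ids[t + 1]
        nll -= math.log(next_token_probs[t][target])
    return nll / num_predictions


# Toy example: a "model" that is uniform over a 4-token vocabulary.
vocab_size = 4
tokens = [0, 2, 1, 3]
uniform = [[1.0 / vocab_size] * vocab_size for _ in range(len(tokens) - 1)]
loss = causal_lm_loss(tokens, uniform)  # equals ln(4) for a uniform model
mask = causal_mask(3)
```

A model that has learned nothing beyond a uniform guess scores ln(vocabulary size); training pushes the loss below that value.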
Datasets
The task currently supports the wikitext2 dataset or custom input files. Since causal language modeling is usually the pre-training task for Transformers, it can be used to train new language models from scratch or to fine-tune an existing language model on your own unlabeled text data.
Usage
Language models pre-trained or fine-tuned on the Causal Language Modeling task can then be used for generative predictions. The following example fine-tunes GPT-2 on wikitext2 for one epoch.
import pytorch_lightning as pl
from transformers import AutoTokenizer

from lightning_transformers.task.nlp.language_modeling import (
    LanguageModelingDataModule,
    LanguageModelingTransformer,
)

tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path="gpt2")
model = LanguageModelingTransformer(pretrained_model_name_or_path="gpt2")
dm = LanguageModelingDataModule(
    batch_size=1,
    dataset_name="wikitext",
    dataset_config_name="wikitext-2-raw-v1",
    tokenizer=tokenizer,
)
trainer = pl.Trainer(accelerator="auto", devices="auto", max_epochs=1)

trainer.fit(model, dm)
We report the Cross Entropy Loss for validation.
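Cross-entropy is often easier to interpret as perplexity. The sketch below assumes the reported loss is a mean per-token cross-entropy in nats (natural logarithm); the `perplexity` helper is hypothetical and not part of the library.

```python
import math


def perplexity(mean_cross_entropy):
    """Convert a mean per-token cross-entropy (in nats) into perplexity."""
    return math.exp(mean_cross_entropy)


# A loss of ln(50) means the model is, on average, as uncertain as a
# uniform choice among 50 tokens.
val_loss = math.log(50)
ppl = perplexity(val_loss)
```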
Language Modeling Using Your Own Files
To use custom text files, the files should contain the raw data you want to train and validate on.
During data pre-processing the text is flattened, and the model is trained and validated on context windows (block size) made from the input text. We override the dataset files, allowing us to still use the data transforms defined with the base datamodule.
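The flattening step can be pictured roughly as follows. This is a minimal sketch, not the library's actual transform: `group_into_blocks` is a hypothetical helper, the token ids are made up, and dropping the trailing remainder is an assumption (a pipeline could pad it instead).

```python
from itertools import chain


def group_into_blocks(tokenized_texts, block_size):
    """Concatenate token lists and cut the result into fixed-size blocks.

    Any trailing remainder shorter than block_size is dropped (an assumption).
    """
    flat = list(chain.from_iterable(tokenized_texts))
    usable = (len(flat) // block_size) * block_size
    return [flat[i : i + block_size] for i in range(0, usable, block_size)]


# Hypothetical token ids for three short documents.
docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
blocks = group_into_blocks(docs, block_size=4)
```

Because the documents are concatenated before splitting, a single block can span the boundary between two input rows.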
Below is a CSV file to use as our input data.
text,
this is the first sentence,
this is the second sentence,
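from transformers import AutoTokenizer

from lightning_transformers.task.nlp.language_modeling import (
    LanguageModelingDataModule,
)

# Same tokenizer as in the training example above.
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path="gpt2")

dm = LanguageModelingDataModule(
    batch_size=1,
    train_file="path/train.csv",
    validation_file="path/valid.csv",
    tokenizer=tokenizer,
)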
Language Modeling Inference Pipeline
By default we use the text generation pipeline, which takes a conditioning input string (the prompt) and generates an output string that continues it.
from transformers import AutoTokenizer

from lightning_transformers.task.nlp.language_modeling import LanguageModelingTransformer

model = LanguageModelingTransformer(
    pretrained_model_name_or_path="prajjwal1/bert-tiny",
    tokenizer=AutoTokenizer.from_pretrained(pretrained_model_name_or_path="prajjwal1/bert-tiny"),
)

model.hf_predict("The house:")
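Conceptually, generation repeatedly asks the model for the next token and appends it to the context. The sketch below shows only the simplest strategy, greedy decoding, with a toy scoring function standing in for a model. `greedy_generate` and `toy_scores` are hypothetical names; the real pipeline works on tokenized strings and may use other decoding strategies such as sampling or beam search.

```python
def greedy_generate(prompt_ids, next_token_scores, max_new_tokens, eos_id=None):
    """Append the highest-scoring next token until a limit or EOS is reached."""
    generated = list(prompt_ids)
    for _ in range(max_new_tokens):
        scores = next_token_scores(generated)
        next_id = max(range(len(scores)), key=scores.__getitem__)
        generated.append(next_id)
        if next_id == eos_id:
            break
    return generated


def toy_scores(context):
    """Hypothetical stand-in for a model: favours (last token + 1) mod 5."""
    favoured = (context[-1] + 1) % 5
    return [1.0 if i == favoured else 0.0 for i in range(5)]


output = greedy_generate([0, 1], toy_scores, max_new_tokens=3)
stopped = greedy_generate([3], toy_scores, max_new_tokens=10, eos_id=0)
```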