
Language Modeling

The Task

Causal Language Modeling is the vanilla autoregressive pre-training objective common to most language models, such as GPT-3 or CTRL (excluding BERT-like models, which were pre-trained with the Masked Language Modeling objective instead).

During training, we maximize the likelihood of the data (equivalently, minimize the cross-entropy loss) across spans of text, usually chunked into a fixed context window or block size. The model can only attend to the left context: a causal attention mask hides all tokens to the right of the current position. When trained on large quantities of text data, this objective yields strong language models such as GPT-3 that can be used for downstream tasks.
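As a rough illustration (not this project's training loop; the GPT-2 checkpoint and sample text are arbitrary choices), the snippet below computes the causal language modeling loss with the Hugging Face transformers library:

import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# Illustrative only: compute the causal LM loss (mean per-token cross entropy)
# for one span of text. GPT-2 shifts the labels internally, so each position
# is predicted from its left context only.
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

encodings = tokenizer("This is a span of training text.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**encodings, labels=encodings["input_ids"])
print(outputs.loss)  # mean negative log-likelihood over the span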

Datasets

This task currently supports the WikiText-2 dataset or custom input files. Since this is usually the pre-training task for Transformers, it can be used to train new language models from scratch or to fine-tune an existing language model on your own unlabeled text data.
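For reference, WikiText-2 itself can be loaded directly with the Hugging Face datasets library (which backs the datamodules here); the dataset and configuration names below are the standard Hub identifiers:

from datasets import load_dataset

# Load the raw WikiText-2 splits; each example is a line of text.
dataset = load_dataset("wikitext", "wikitext-2-raw-v1")
print(dataset["train"][10]["text"])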

Usage

Language models pre-trained or fine-tuned on the Causal Language Modeling task can then be used for generative prediction.

python train.py task=nlp/language_modeling dataset=nlp/language_modeling/wikitext

Swap to the GPT-2 backbone:

python train.py task=nlp/language_modeling dataset=nlp/language_modeling/wikitext backbone.pretrained_model_name_or_path=gpt2

We report the cross-entropy loss for validation. All options available for the task can be found in the task documentation.
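Since the validation metric is the mean cross-entropy loss, you can convert it to perplexity yourself; for example:

import math

# Perplexity is the exponential of the mean cross-entropy loss.
val_loss = 3.2  # example value reported during validation
print(math.exp(val_loss))  # ~24.5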

Language Modeling Using Your Own Files

To use custom text files, provide files containing the raw data you want to train and validate on.

During data pre-processing, the text is flattened and the model is trained and validated on context windows (block size) made from the input text, as sketched below. We override the dataset files, allowing us to still use the data transforms defined with the base datamodule.
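This pre-processing follows the usual recipe for causal language modeling. A minimal sketch, assuming the examples have already been tokenized into input_ids (the block size and helper name are illustrative, not this project's exact code):

from itertools import chain

block_size = 512  # illustrative context window

def group_texts(examples):
    # Flatten the tokenized examples into one long token stream, then split
    # it into fixed-size blocks; the labels are a copy of the inputs, since
    # the model shifts them internally during loss computation.
    concatenated = list(chain(*examples["input_ids"]))
    total_length = (len(concatenated) // block_size) * block_size
    input_ids = [concatenated[i : i + block_size]
                 for i in range(0, total_length, block_size)]
    return {"input_ids": input_ids, "labels": [ids[:] for ids in input_ids]}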

Below we have defined a CSV file to use as our input data.

text,
this is the first sentence,
this is the second sentence,
Then pass your files to the training command:

python train.py task=nlp/language_modeling dataset.cfg.train_file=train.csv dataset.cfg.validation_file=valid.csv
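If you need to generate such files programmatically, a minimal sketch (the file names match the command above; the row contents are placeholders):

import csv

# Write single-column CSV files with a "text" header, one sentence per row.
for name, rows in [("train.csv", ["this is the first sentence",
                                  "this is the second sentence"]),
                   ("valid.csv", ["a held-out sentence"])]:
    with open(name, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["text"])
        writer.writerows([[r] for r in rows])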

Language Modeling Inference Pipeline (experimental)

By default we use the text generation pipeline, which requires a conditional input string and generates an output string.
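This corresponds to the Hugging Face text-generation pipeline; a rough standalone equivalent (the gpt2 checkpoint and generation length are arbitrary choices, not what predict.py uses internally):

from transformers import pipeline

# Illustrative stand-in for the task's inference pipeline: condition on an
# input string and generate a continuation.
generator = pipeline("text-generation", model="gpt2")
print(generator("Condition sentence for the language model", max_new_tokens=20))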

If your input contains any special characters, you must either wrap the entire argument in single quotes, like '+x="my, sentence"', or escape the special characters so that Hydra parses the argument correctly. See Escaped characters in unquoted values.

python predict.py task=nlp/language_modeling +checkpoint_path=/path/to/model.ckpt +x="Condition sentence for the language model"

You can also run prediction using a default HuggingFace pre-trained model:

python predict.py task=nlp/language_modeling +x="Condition sentence for the language model"

Or run prediction on a specified HuggingFace pre-trained model:

python predict.py task=nlp/language_modeling backbone.pretrained_model_name_or_path=gpt2-medium +x="Condition sentence for the language model"