Causal Language Modeling is the vanilla autoregressive pre-training method common to most language models, such as GPT-3 or CTRL (excluding BERT-like models, which were pre-trained using the Masked Language Modeling training method).
During training, we maximize the likelihood of the text (equivalently, we minimize the negative log-likelihood) across spans of text data, usually within some context window (block size). When predicting a token, the model can attend only to the left context (the tokens to the left of the mask). When trained on large quantities of text data, this gives us strong language models such as GPT-3 to use for downstream tasks.
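To make the objective concrete, here is a minimal sketch of the next-token loss in plain Python. It is an illustration only, not the library's implementation: the function names `softmax` and `causal_lm_loss` and the toy logits are assumptions introduced for this example.

```python
import math


def softmax(logits):
    """Convert a list of raw scores into probabilities."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def causal_lm_loss(token_ids, logits_per_position):
    """Mean negative log-likelihood of each token given the tokens to its left.

    logits_per_position[t] holds the scores over the vocabulary produced
    after reading token_ids[: t + 1], so it is scored against the *next*
    token, token_ids[t + 1]. The last position has no target and is skipped.
    """
    losses = []
    for t in range(len(token_ids) - 1):
        probs = softmax(logits_per_position[t])
        target = token_ids[t + 1]
        losses.append(-math.log(probs[target]))
    return sum(losses) / len(losses)


# Toy example (hypothetical values): vocabulary of 3 tokens, window of 3 tokens.
token_ids = [0, 2, 1]
uniform_logits = [[0.0, 0.0, 0.0]] * 3  # a model that assigns equal scores
loss = causal_lm_loss(token_ids, uniform_logits)
print(round(loss, 4))  # ln(3), about 1.0986
```

A model that spreads its probability evenly over a vocabulary of size V scores ln(V); training pushes the loss below that by putting more probability on the token that actually comes next.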
The task currently supports the wikitext2 dataset or custom input files. Since this task is usually the pre-training task for Transformers, it can be used to train new language models from scratch or to fine-tune a language model on your own unlabeled text data.
Language models pre-trained or fine-tuned on the Causal Language Modeling task can then be used for generative predictions.
python train.py task=nlp/language_modeling dataset=nlp/language_modeling/wikitext
Swap to GPT backbone:
python train.py task=nlp/language_modeling dataset=nlp/language_modeling/wikitext backbone.pretrained_model_name_or_path=gpt2
We report the Cross Entropy Loss for validation. Find all options available for the task here.
Language Modeling Using Your Own Files
To use custom text files, the files should contain the raw data you want to train and validate on.
During data pre-processing the text is flattened, and the model is trained and validated on context windows (block size) made from the input text. We override the dataset files, allowing us to still use the data transforms defined with the base datamodule.
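The flattening and windowing step can be sketched as follows. This is a simplified illustration, not the datamodule's actual code: the helper name `build_blocks` and the token ids are hypothetical, and dropping the incomplete final window is an assumption about how leftovers are handled.

```python
def build_blocks(tokenized_lines, block_size):
    """Flatten tokenized lines and cut them into fixed-size windows.

    Trailing tokens that do not fill a whole block are dropped here, which
    is a common simplification; the real pipeline may treat them differently.
    """
    flat = [tok for line in tokenized_lines for tok in line]
    usable = (len(flat) // block_size) * block_size
    return [flat[i:i + block_size] for i in range(0, usable, block_size)]


# Hypothetical token ids for three short lines of text.
lines = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
blocks = build_blocks(lines, block_size=4)
print(blocks)  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

Note that a window can span line boundaries: the first block above ends with the first token of the second line.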
Below we have defined a csv file to use as our input data.
text,
this is the first sentence,
this is the second sentence,
python train.py task=nlp/language_modeling dataset.cfg.train_file=train.csv dataset.cfg.validation_file=valid.csv
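If your text lives in Python rather than in files, a short script can produce the inputs. The sketch below assumes the loader only needs a single `text` column; the helper `write_text_csv`, the scratch directory, and the sample sentences are made up for illustration.

```python
import csv
import tempfile
from pathlib import Path


def write_text_csv(path, sentences):
    """Write sentences to a CSV file with a single `text` column."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["text"])  # header row naming the column
        for sentence in sentences:
            # csv.writer quotes any sentence that itself contains a comma
            writer.writerow([sentence])


# Scratch directory for the example; use your own location in practice.
out_dir = Path(tempfile.mkdtemp())
train_path = out_dir / "train.csv"
valid_path = out_dir / "valid.csv"

write_text_csv(train_path, ["this is the first sentence", "this is the second sentence"])
write_text_csv(valid_path, ["this is a held-out sentence"])
print(train_path.read_text(encoding="utf-8"))
```

The resulting paths can then be passed as `dataset.cfg.train_file` and `dataset.cfg.validation_file`.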
Language Modeling Inference Pipeline (experimental)
By default we use the text generation pipeline, which requires a conditional input string and generates an output string.
For Hydra to parse your input argument correctly when it contains special characters, you must either wrap the entire argument in single quotes, like '+x="my, sentence"', or escape the special characters. See Escaped characters in unquoted values.
python predict.py task=nlp/language_modeling +checkpoint_path=/path/to/model.ckpt +x="Condition sentence for the language model"
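When the command is assembled from a script, the single-quote wrapping can be delegated to Python's standard `shlex.quote`. This is a sketch for POSIX shells only; the helper `hydra_override` is hypothetical, and it deliberately rejects values containing double quotes or backslashes rather than guessing at how they should be escaped.

```python
import shlex


def hydra_override(key, value):
    """Build a shell-safe `+key="value"` override for a POSIX shell."""
    if '"' in value or "\\" in value:
        # Escaping rules for these characters are out of scope for this sketch.
        raise ValueError("value must not contain double quotes or backslashes")
    return shlex.quote(f'+{key}="{value}"')


arg = hydra_override("x", "my, sentence")
print(arg)  # '+x="my, sentence"'
print("python predict.py task=nlp/language_modeling " + arg)
```

The shell strips the outer single quotes, so Hydra receives `+x="my, sentence"` with the inner double quotes intact.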
You can also run prediction using a default HuggingFace pre-trained model:
python predict.py task=nlp/language_modeling +x="Condition sentence for the language model"
Or run prediction on a specified HuggingFace pre-trained model:
python predict.py task=nlp/language_modeling backbone.pretrained_model_name_or_path=bert-base-cased +x="Condition sentence for the language model"