Custom Data Files

In most cases when training/validating/testing on custom files, you’ll be able to do so without modifying any code, using the general data module classes directly.

Below we show, per task, how to fine-tune/validate/test on your own files or how to modify the logic within the data classes. Some tasks are more involved than others, as they may require more data processing.

Language Modeling Using Your Own Files

To use custom text files, the files should contain the raw data you want to train and validate on.

During data pre-processing the text is flattened, and the model is trained and validated on fixed-size context windows (the block size) made from the input text. We override the dataset files, allowing us to still use the data transforms defined with the base datamodule.
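
The sketch below illustrates that flatten-and-chunk step. It is a simplified illustration assuming a HuggingFace tokenizer, not the datamodule's actual implementation:

# A simplified sketch of the flatten-and-chunk pre-processing described above;
# the datamodule's actual transform may differ in detail.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
block_size = 8  # illustrative; real runs typically use e.g. 512

lines = ["this is the first sentence", "this is the second sentence"]

# Tokenize each line and flatten everything into one long stream of token ids.
ids = [tok for line in lines for tok in tokenizer(line)["input_ids"]]

# Cut the stream into fixed-size blocks, dropping any ragged remainder.
blocks = [ids[i : i + block_size] for i in range(0, len(ids) - block_size + 1, block_size)]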

Below we have defined a CSV file to use as our input data.

text
this is the first sentence
this is the second sentence

python train.py task=nlp/language_modeling dataset.cfg.train_file=train.csv dataset.cfg.validation_file=valid.csv
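
If you want to sanity-check that the CSV parses the way the datamodule will read it, you can load it with the HuggingFace datasets library (which these datamodules typically use under the hood); a minimal sketch:

from datasets import load_dataset

ds = load_dataset("csv", data_files={"train": "train.csv", "validation": "valid.csv"})
print(ds["train"].column_names)  # expect: ['text']
print(ds["train"][0]["text"])    # 'this is the first sentence'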

Multiple Choice Using Your Own Files

To use custom text files, the files should contain the data you want to train and validate on and be in CSV or JSON format as described below.

The format varies from dataset to dataset, as input columns and pre-processing may differ. To make life easier, we use the RACE dataset format and override the files that are loaded.

Below we have defined a JSON file to use as our input data.

{
    "article": "The man walked into the red house but couldn't see where the light was.",
    "question": "What colour is the house?",
    "options": ["White", "Red", "Blue"],
    "answer": "Red"
}

We override the dataset files, allowing us to still use the data transforms defined with the RACE dataset.

python train.py task=nlp/multiple_choice dataset=nlp/multiple_choice/race dataset.cfg.train_file=train.json dataset.cfg.validation_file=valid.json
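
If you are generating these files programmatically, one straightforward approach is to write one JSON object per line (a sketch; the field names simply follow the RACE-style record above):

import json

records = [
    {
        "article": "The man walked into the red house but couldn't see where the light was.",
        "question": "What colour is the house?",
        "options": ["White", "Red", "Blue"],
        "answer": "Red",
    },
]

# Write newline-delimited JSON: one record per line.
with open("train.json", "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")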

Question Answering Using Your Own Files

To use custom text files, the files should contain newline-delimited JSON objects.

The format varies from dataset to dataset, as input columns and pre-processing may differ. To make life easier, we use the SQuAD dataset format and override the files that are loaded.

{
    "answers": {
        "answer_start": [1],
        "text": ["This is a test text"]
    },
    "context": "This is a test context.",
    "id": "1",
    "question": "Is this a test?",
    "title": "train test"
}

We override the dataset files, allowing us to still use the data transforms defined with this dataset.

python train.py task=nlp/question_answering dataset=nlp/question_answering/squad dataset.cfg.train_file=train.json dataset.cfg.validation_file=valid.json
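
Since SQuAD-style training recovers the answer span from the character offset in answer_start, it is worth validating each record before training. A minimal sketch:

import json

with open("train.json") as f:
    for line_no, line in enumerate(f, start=1):
        record = json.loads(line)
        for start, text in zip(record["answers"]["answer_start"], record["answers"]["text"]):
            # The answer text must appear in the context at the given offset.
            span = record["context"][start : start + len(text)]
            assert span == text, f"line {line_no}: expected {text!r} at offset {start}, found {span!r}"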

Summarization Using Your Own Files

To use custom text files, the files should contain newline-delimited JSON objects.

{
    "source": "some-body",
    "target": "some-sentence"
}

We override the dataset files, allowing us to still use the data transforms defined with this dataset.

python train.py task=nlp/summarization dataset.cfg.train_file=train.json dataset.cfg.validation_file=valid.json

Text Classification Using Your Own Files

To use custom text files, the files should contain newline-delimited JSON objects.

{
    "label": 0,
    "text": "I'm feeling quite sad and sorry for myself but I'll snap out of it soon."
}

We override the dataset files, allowing us to still use the data transforms defined with this dataset.

python train.py task=nlp/text_classification dataset.cfg.train_file=train.json dataset.cfg.validation_file=valid.json
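
As a quick sanity check, you can inspect the label distribution of your file with the HuggingFace datasets library (a sketch, assuming integer labels as above):

from collections import Counter

from datasets import load_dataset

ds = load_dataset("json", data_files={"train": "train.json"})
print(Counter(ds["train"]["label"]))  # e.g. Counter({0: 1})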

Token Classification Using Your Own Files

To use custom text files, the files should contain newline-delimited JSON objects. Each token should have an associated label.

{
    "label_tags": [11, 12, 12, 21, 13, 11, 11, 21, 13, 11, 12, 13, 11, 21, 22, 11, 12, 17, 11, 21, 17, 11, 12, 12, 21, 22, 22, 13, 11, 0],
    "tokens": ["The", "European", "Commission", "said", "on", "Thursday", "it", "disagreed", "with", "German", "advice", "to", "consumers"]
}

We override the dataset files, allowing us to still use the data transforms defined with this dataset.

python train.py task=nlp/token_classification dataset.cfg.train_file=train.json dataset.cfg.validation_file=valid.json
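
Because every token needs exactly one label, a quick alignment check over the file can save a confusing failure later (a sketch, not part of the library):

import json

with open("train.json") as f:
    for line_no, line in enumerate(f, start=1):
        record = json.loads(line)
        # Each entry in `tokens` must have a matching entry in `label_tags`.
        assert len(record["tokens"]) == len(record["label_tags"]), (
            f"line {line_no}: {len(record['tokens'])} tokens "
            f"but {len(record['label_tags'])} labels"
        )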

Translation Using Your Own Files

To use custom text files, the files should contain newline-delimited JSON objects.

{
    "source": "example source text",
    "target": "example target text"
}

We override the dataset files, allowing us to still use the data transforms defined with this dataset.

python train.py task=nlp/translation dataset.cfg.train_file=train.json dataset.cfg.validation_file=valid.json