# PG-19 Language Modelling Benchmark
This repository contains the PG-19 language modeling benchmark. It includes a
set of books extracted from the Project Gutenberg books library [1] that were
published before 1919. It also contains metadata of book titles and publication
dates.
Full dataset download link
PG-19 is over double the size of the Billion Word benchmark [2] and contains
documents that are 20X longer, on average, than the WikiText long-range language
modelling benchmark [3].
Books are partitioned into a `train`, `validation`, and `test` set. Book
metadata is stored in `metadata.csv`, which contains
`(book_id, short_book_title, publication_date)` records.
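As a rough illustration, the metadata can be read with the Python standard library. This is a minimal sketch only: it assumes `metadata.csv` sits at the repository root and has no header row, so adjust the path and parsing to match the actual release.

```
import csv

# Read (book_id, short_book_title, publication_date) records from metadata.csv.
# Assumes no header row; adjust if the released file differs.
with open("metadata.csv", newline="", encoding="utf-8") as f:
    for book_id, short_book_title, publication_date in csv.reader(f):
        print(book_id, publication_date, short_book_title)
        break  # print only the first record
```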
Unlike prior benchmarks, we do not constrain the vocabulary size ---
i.e. mapping rare words to an UNK token --- but instead release the data as an
open-vocabulary benchmark. The only processing of the text that has been applied
is the removal of boilerplate license text, and the mapping of offensive
discriminatory words as specified by Ofcom [4] to placeholder
tokens. Users
are free to model the data at the character-level, subword-level, or via any
mechanism that can model an arbitrary string of text.
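For concreteness, an open-vocabulary scheme needs no UNK token at all; a byte-level or character-level view of the raw text already covers any string. The snippet below is only an illustration of that point, not part of the released tooling.

```
# Illustration only: two trivially open-vocabulary views of the same text.
text = "It was a dark and stormy night."

byte_ids = list(text.encode("utf-8"))  # 256-symbol byte vocabulary, no UNK needed
chars = list(text)                     # character-level alternative

print(byte_ids[:8], chars[:8])
```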
To compare models, we propose to continue measuring word-level perplexity.
This can be computed under any chosen subword vocabulary or character-based
scheme by taking the model's total log-likelihood over the dataset and
normalising it by the number of tokens specified in the dataset statistics
table below.
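As a sketch of this evaluation (the function name and example numbers are purely illustrative), word-level perplexity is the exponential of the negative total log-likelihood divided by the token count from the table below:

```
import math

def word_level_perplexity(total_log_likelihood, num_tokens):
    """Perplexity from a model's total natural-log likelihood over a split,
    normalised by that split's token count from the statistics table."""
    return math.exp(-total_log_likelihood / num_tokens)

# e.g. a total log-likelihood of -2.1e7 nats on the validation split
# (3,007,061 tokens) gives a perplexity of about exp(6.98), roughly 1078.
```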
One could use this dataset for benchmarking long-range language models, or
use it to pre-train for other natural language processing tasks which require
long-range reasoning, such as LAMBADA [5] or NarrativeQA [6]. We would not
recommend using this dataset to train a general-purpose language model, e.g.
one intended for a production dialogue system, due to the dated linguistic
style of the texts and the biases inherent in historical writing.
### Dataset Statistics
| | Train | Validation | Test |
| --- | --- | --- | --- |
| Books | 28,602 | 50 | 100 |
| Num. Tokens | 1,973,136,207 | 3,007,061 | 6,966,499 |
### Bibtex
```
@article{raecompressive2019,
  author = {Rae, Jack W and Potapenko, Anna and Jayakumar, Siddhant M and
            Hillier, Chloe and Lillicrap, Timothy P},
  title = {Compressive Transformers for Long-Range Sequence Modelling},
  journal = {arXiv preprint},
  url = {https://arxiv.org/abs/1911.05507},
  year = {2019},
}
```
### Dataset Metadata
The following table is necessary for this dataset to be indexed by search
engines such as Google Dataset Search.
| property | value |
| --- | --- |
| name | The PG-19 Language Modeling Benchmark |
| alternateName | PG-19 |
| url | https://github.com/deepmind/pg19 |
| sameAs | https://github.com/deepmind/pg19 |
| description | This repository contains the PG-19 dataset. It includes a set of books extracted from the Project Gutenberg books project (https://www.gutenberg.org) that were published before 1919. It also contains metadata of book titles and publication dates. |
| provider | DeepMind (https://en.wikipedia.org/wiki/DeepMind) |
| license | Apache License, Version 2.0 (https://www.apache.org/licenses/LICENSE-2.0.html) |
| citation | https://identifiers.org/arxiv:1911.05507 |
### Contact
If you have any questions, please contact Jack Rae.
### References
- [1] https://www.gutenberg.org
- [2] Chelba et al. "One Billion Word Benchmark for Measuring Progress in Statistical Language Modeling" (2013)
- [3] Merity et al. "Pointer Sentinel Mixture Models" (2016)
- [4] Ofcom offensive language guide
- [5] Paperno et al. "The LAMBADA dataset: Word prediction requiring a broad discourse context" (2016)
- [6] Kočiský et al. "The NarrativeQA Reading Comprehension Challenge" (2018)