pg19

# PG-19 Language Modelling Benchmark

This repository contains the PG-19 language modeling benchmark. It includes a set of books extracted from the Project Gutenberg books library [1] that were published before 1919. It also contains metadata of book titles and publication dates.

Full dataset download link

PG-19 is over double the size of the Billion Word benchmark [2] and contains documents that are 20X longer, on average, than the WikiText long-range language modelling benchmark [3].

Books are partitioned into a `train`, `validation`, and `test` set. Book metadata is stored in `metadata.csv`, which contains `(book_id, short_book_title, publication_date)`.

Unlike prior benchmarks, we do not constrain the vocabulary size (i.e. by mapping rare words to an UNK token) but instead release the data as an open-vocabulary benchmark. The only processing applied to the text is the removal of boilerplate license text and the mapping of offensive discriminatory words, as specified by Ofcom [4], to placeholder tokens. Users are free to model the data at the character level, the subword level, or via any mechanism that can model an arbitrary string of text.

To compare models we propose to continue measuring the word-level perplexity. This is obtained by calculating the total log-likelihood of the dataset (via any chosen subword vocabulary or character-based scheme) and normalizing it by the number of tokens specified below in the dataset statistics table, rather than by the number of subword or character units the model itself uses.

One could use this dataset for benchmarking long-range language models, or use it to pre-train for other natural language processing tasks which require long-range reasoning, such as LAMBADA [5] or NarrativeQA [6]. We would not recommend using this dataset to train a general-purpose language model, e.g. for applications to a production-system dialogue agent, due to the dated linguistic style of old texts and the inherent biases present in historical writing.

### Dataset Statistics

|             | Train         | Validation | Test      |
|-------------|---------------|------------|-----------|
| Books       | 28,602        | 50         | 100       |
| Num. Tokens | 1,973,136,207 | 3,007,061  | 6,966,499 |

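
The word-level perplexity proposed above can be sketched as follows. This is a minimal illustration, not an official evaluation script: it assumes the model reports a negative log-likelihood summed over the whole split, in nats, over whatever subword or character units it uses. The names `word_level_perplexity` and `bits_to_nats`, and the example loss value, are hypothetical; only the token counts are taken from the statistics table above.

```python
import math

# Word-level token counts, copied from the dataset statistics table.
PG19_NUM_TOKENS = {
    "train": 1_973_136_207,
    "validation": 3_007_061,
    "test": 6_966_499,
}


def bits_to_nats(total_nll_bits: float) -> float:
    """Convert a summed negative log-likelihood from bits to nats."""
    return total_nll_bits * math.log(2.0)


def word_level_perplexity(total_nll_nats: float, split: str = "test") -> float:
    """Word-level perplexity from a summed negative log-likelihood in nats.

    The sum runs over the model's own units (characters, subwords, ...),
    but it is normalized by the fixed word-level token count of the split,
    so models with different vocabularies remain comparable.
    """
    num_tokens = PG19_NUM_TOKENS[split]
    return math.exp(total_nll_nats / num_tokens)


if __name__ == "__main__":
    # Hypothetical example: a model averaging 3.5 nats per word on the test set.
    example_total_nll = 3.5 * PG19_NUM_TOKENS["test"]
    print(word_level_perplexity(example_total_nll, "test"))
```

Normalizing by the published token count rather than by the model's own sequence length is what makes the number a word-level perplexity; a character-level model that divided by its character count would report a much lower, non-comparable figure.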
### Bibtex

```
@article{raecompressive2019,
  author = {Rae, Jack W and Potapenko, Anna and Jayakumar, Siddhant M and
            Hillier, Chloe and Lillicrap, Timothy P},
  title = {Compressive Transformers for Long-Range Sequence Modelling},
  journal = {arXiv preprint},
  url = {https://arxiv.org/abs/1911.05507},
  year = {2019},
}
```

### Dataset Metadata

The following table is necessary for this dataset to be indexed by search engines such as Google Dataset Search.

| property        | value |
|-----------------|-------|
| name            | The PG-19 Language Modeling Benchmark |
| alternateName   | PG-19 |
| url             |       |
| sameAs          | https://github.com/deepmind/pg19 |
| description     | This repository contains the PG-19 dataset. It includes a set of books extracted from the Project Gutenberg books project (https://www.gutenberg.org), that were published before 1919. It also contains metadata of book titles and publication dates. |
| provider.name   | DeepMind |
| provider.sameAs | https://en.wikipedia.org/wiki/DeepMind |
| license.name    | Apache License, Version 2.0 |
| license.url     |       |
| citation        | https://identifiers.org/arxiv:1911.05507 |

### Contact

If you have any questions, please contact Jack Rae.

### References

- [1] https://www.gutenberg.org
- [2] Chelba et al. "One Billion Word Benchmark for Measuring Progress in Statistical Language Modeling" (2013)
- [3] Merity et al. "Pointer Sentinel Mixture Models" (2016)
- [4] Ofcom offensive language guide
- [5] Paperno et al. "The LAMBADA Dataset: Word Prediction Requiring a Broad Discourse Context" (2016)
- [6] Kočiský et al. "The NarrativeQA Reading Comprehension Challenge" (2018)
