chainer.datasets.get_ptb_words

chainer.datasets.get_ptb_words()

Gets the Penn Tree Bank dataset as long word sequences.

The Penn Tree Bank is originally a corpus of English sentences annotated with linguistic structure. This function uses a variant distributed at https://github.com/wojzaremba/lstm, which omits the annotations and splits the dataset into three parts: training, validation, and test.

This function returns the training, validation, and test sets, each represented as one long array of word IDs. Within each set, all sentences are concatenated, with the end-of-sentence mark ‘<eos>’ separating them. The ‘<eos>’ mark is treated as a word in the vocabulary and therefore has its own word ID.
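The representation described above can be illustrated with a small sketch. It uses a tiny made-up corpus rather than the real dataset, and the helper name build_word_ids is hypothetical. Assigning IDs in order of first appearance is an assumption made for illustration; the actual IDs come from the mapping returned by get_ptb_words_vocabulary().

```python
# Illustrative sketch only: a tiny hypothetical corpus, not the real PTB data.
# build_word_ids is a made-up helper, not part of Chainer.

def build_word_ids(sentences):
    """Concatenate sentences into one word-ID list, ending each with '<eos>'."""
    vocab = {}  # word -> integer ID (assumed: assigned in order of first appearance)
    ids = []
    for sentence in sentences:
        # '<eos>' is appended to every sentence and gets an ID like any other word.
        for word in sentence.split() + ['<eos>']:
            if word not in vocab:
                vocab[word] = len(vocab)
            ids.append(vocab[word])
    return ids, vocab


sentences = ["the cat sat", "the dog ran"]
word_ids, vocab = build_word_ids(sentences)
print(word_ids)  # [0, 1, 2, 3, 0, 4, 5, 3]
print(vocab['<eos>'])  # 3
```

With the real dataset, the three sequences would be unpacked from the returned tuple, for example train, val, test = chainer.datasets.get_ptb_words(), and each would be an int32 numpy.ndarray rather than a Python list.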

Returns

The training, validation, and test sets, each an int32 vector of word IDs.

Return type

tuple of numpy.ndarray

See also

Use get_ptb_words_vocabulary() to get the mapping between the words and word IDs.