Gets the Penn Tree Bank dataset as long word sequences.
The Penn Tree Bank was originally a corpus of English sentences annotated with linguistic structure. This function uses the variant distributed at https://github.com/wojzaremba/lstm, which omits the annotations and splits the dataset into three parts: training, validation, and test.
This function returns the training, validation, and test sets, each represented as one long array of word IDs. All sentences in the dataset are concatenated, separated by the end-of-sentence mark ‘<eos>’, which is treated as a word in the vocabulary.
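The layout described above can be illustrated on a toy corpus. This is a minimal sketch, not the function's actual implementation: the helper `encode_sentences` and the two example sentences are hypothetical, and assigning IDs in order of first appearance is an assumption made only for illustration.

```python
import numpy as np


def encode_sentences(sentences, vocab=None):
    """Hypothetical helper: flatten sentences into one int32 array of word IDs."""
    vocab = {} if vocab is None else vocab
    ids = []
    for sentence in sentences:
        # Append '<eos>' after every sentence; it gets an ID like any other word.
        for word in sentence.split() + ['<eos>']:
            if word not in vocab:
                vocab[word] = len(vocab)  # assumption: IDs follow first appearance
            ids.append(vocab[word])
    return np.array(ids, dtype=np.int32), vocab


toy_corpus = ['the cat sat', 'the dog ran']
words, vocab = encode_sentences(toy_corpus)
print(words)           # one long vector covering both sentences
print(vocab['<eos>'])  # '<eos>' has its own word ID
```

Because sentence boundaries survive only as the ‘<eos>’ ID, a model trained on such an array sees the corpus as a single continuous stream.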
- Returns
Int32 vectors of word IDs.
- Return type
tuple of numpy.ndarray
Call get_ptb_words_vocabulary() to get the mapping between words and word IDs.
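To read a sequence of IDs back as text, the mapping can be inverted. The following is a minimal sketch, assuming the mapping is a dict from word to ID; the small `vocab` and `word_ids` values are made-up stand-ins for what the real functions return.

```python
import numpy as np

# Made-up stand-ins; the real values would come from
# get_ptb_words_vocabulary() and get_ptb_words().
vocab = {'the': 0, 'cat': 1, 'sat': 2, '<eos>': 3}
word_ids = np.array([0, 1, 2, 3], dtype=np.int32)

# Invert word -> ID into ID -> word (assumes every ID is unique).
id_to_word = {i: w for w, i in vocab.items()}

# Cast each NumPy integer to a plain int before the dict lookup.
decoded = ' '.join(id_to_word[int(i)] for i in word_ids)
print(decoded)
```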