Dataset examples

The most basic dataset implementation is an array. Both NumPy and CuPy arrays can be used directly as datasets.
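
For example, a plain NumPy array already provides everything a dataset needs: len() gives the number of examples, and indexing returns one example.

>>> import numpy as np
>>> data = np.arange(10, dtype=np.float32).reshape(5, 2)
>>> len(data)       # number of examples
5
>>> data[1]         # one example per row
array([2., 3.], dtype=float32)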

In many cases, though, simple arrays are not enough to write the training procedure. To cover most of these cases, Chainer provides many built-in dataset implementations.

These built-in datasets are divided into two groups. One is a group of general datasets. Most of them are wrappers around other datasets that introduce some structure (e.g., a tuple or dict) into each data point. The other is a group of concrete, popular datasets. These concrete implementations use the download utilities in the chainer.dataset module to cache the downloaded and converted datasets.

General datasets

General datasets are further divided into four types.

The first one consists of DictDataset and TupleDataset, both of which combine other datasets and introduce some structure to them.

The second one is ConcatenatedDataset and SubDataset. ConcatenatedDataset represents the concatenation of existing datasets and can be used to merge them into a larger dataset. SubDataset represents a subset of an existing dataset and can be used to split off data for hold-out validation or cross-validation. Convenience functions for making random splits are also provided.

The third one is TransformDataset, which wraps a dataset and applies a function to each data point indexed from the underlying dataset. It can be used to modify the behavior of an already-prepared dataset.

The last one is a group of domain-specific datasets. Currently, ImageDataset and LabeledImageDataset are provided for datasets of images.

DictDataset

chainer.datasets.DictDataset Dataset of a dictionary of datasets.
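
A minimal usage sketch; the keyword names x and t here are arbitrary choices for illustration:

>>> import numpy as np
>>> from chainer.datasets import DictDataset
>>> x = np.arange(5, dtype=np.float32)
>>> t = np.arange(5, dtype=np.int32)
>>> data = DictDataset(x=x, t=t)
>>> example = data[2]        # one example is a dict keyed by the keyword names
>>> example['x'], example['t']
(2.0, 2)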

TupleDataset

chainer.datasets.TupleDataset Dataset of tuples from multiple equal-length datasets.
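
A minimal usage sketch; the i-th example is a tuple of the i-th items of the underlying datasets:

>>> import numpy as np
>>> from chainer.datasets import TupleDataset
>>> x = np.arange(5, dtype=np.float32)
>>> t = np.arange(5, dtype=np.int32)
>>> data = TupleDataset(x, t)
>>> len(data)
5
>>> data[2]
(2.0, 2)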

ConcatenatedDataset

class chainer.datasets.ConcatenatedDataset(*datasets)[source]

Dataset which concatenates some base datasets.

This dataset wraps its base datasets and behaves as their concatenation. For example, if one base dataset has 10 samples and another has 20 samples, the concatenated dataset has 30 samples: indices 0-9 refer to the first dataset and indices 10-29 to the second.

Parameters: datasets – The underlying datasets. Each dataset must support __len__() and __getitem__().
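
A minimal sketch of the 10 + 20 = 30 behavior described above:

>>> import numpy as np
>>> from chainer.datasets import ConcatenatedDataset
>>> a = np.arange(10, dtype=np.float32)    # 10 samples
>>> b = np.arange(20, dtype=np.float32)    # 20 samples
>>> data = ConcatenatedDataset(a, b)
>>> len(data)
30
>>> data[10]    # indices past the first dataset continue into the second
0.0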

SubDataset

chainer.datasets.SubDataset Subset of a base dataset.
chainer.datasets.split_dataset Splits a dataset into two subsets.
chainer.datasets.split_dataset_random Splits a dataset into two subsets randomly.
chainer.datasets.get_cross_validation_datasets Creates a set of training/test splits for cross validation.
chainer.datasets.get_cross_validation_datasets_random Creates a set of training/test splits for cross validation randomly.
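
A minimal sketch of the two split helpers; the split position and seed are arbitrary values chosen for illustration:

>>> import numpy as np
>>> from chainer.datasets import split_dataset, split_dataset_random
>>> data = np.arange(10, dtype=np.float32)
>>> first, second = split_dataset(data, 7)     # deterministic split at index 7
>>> len(first), len(second)
(7, 3)
>>> train, valid = split_dataset_random(data, 7, seed=0)   # shuffled split
>>> len(train), len(valid)
(7, 3)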

TransformDataset

chainer.datasets.TransformDataset Dataset that indexes the base dataset and transforms the data.
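
A minimal sketch, using a trivial scaling function as a stand-in for real preprocessing such as data augmentation:

>>> import numpy as np
>>> from chainer.datasets import TupleDataset, TransformDataset
>>> x = np.arange(5, dtype=np.float32)
>>> t = np.arange(5, dtype=np.int32)
>>> base = TupleDataset(x, t)
>>> def transform(in_data):
...     feature, label = in_data
...     return feature * 2, label    # scale the feature, keep the label
...
>>> data = TransformDataset(base, transform)
>>> data[2]
(4.0, 2)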

ImageDataset

chainer.datasets.ImageDataset Dataset of images built from a list of paths to image files.
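
A minimal sketch; the file paths below are placeholders and must point to actual image files for indexing to succeed:

>>> from chainer.datasets import ImageDataset
>>> paths = ['images/cat.png', 'images/dog.png']   # hypothetical files
>>> dataset = ImageDataset(paths, root='.')
>>> # dataset[0] loads the first file as a float32 array in CHW order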

LabeledImageDataset

chainer.datasets.LabeledImageDataset Dataset of image and label pairs built from a list of paths and labels.
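
A minimal sketch; as above, the path/label pairs are placeholders:

>>> from chainer.datasets import LabeledImageDataset
>>> pairs = [('images/cat.png', 0), ('images/dog.png', 1)]   # hypothetical files
>>> dataset = LabeledImageDataset(pairs, root='.')
>>> # dataset[0] would load a (CHW float32 image, int32 label) tuple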

Concrete datasets

chainer.datasets.get_mnist Gets the MNIST dataset.
chainer.datasets.get_cifar10 Gets the CIFAR-10 dataset.
chainer.datasets.get_cifar100 Gets the CIFAR-100 dataset.
chainer.datasets.get_ptb_words Gets the Penn Tree Bank dataset as long word sequences.
chainer.datasets.get_ptb_words_vocabulary Gets the Penn Tree Bank word vocabulary.
chainer.datasets.get_svhn Gets the SVHN dataset.
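
For example, get_mnist downloads and caches the data on first use and returns the training and test splits:

>>> from chainer.datasets import get_mnist
>>> train, test = get_mnist()
>>> len(train), len(test)
(60000, 10000)
>>> x, t = train[0]     # a flattened float32 image and its integer label
>>> x.shape
(784,)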