Dataset examples

The most basic dataset implementation is an array. Both NumPy and CuPy arrays can be used directly as datasets.
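
For example, a plain NumPy array works as a dataset as-is: indexing gives examples and len() gives the number of examples. The values below are made up for illustration.

>>> import numpy as np
>>> data = np.arange(10, dtype=np.float32)  # a toy array used directly as a dataset
>>> len(data)                               # number of examples
10
>>> x = data[3]                             # the fourth example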

In many cases, though, simple arrays are not enough to write the training procedure. To cover most such cases, Chainer provides many built-in dataset implementations.

These built-in datasets are divided into two groups. One is a group of general datasets. Most of them are wrappers of other datasets that introduce some structure (e.g., tuple or dict) into each data point. The other is a group of concrete, popular datasets. These concrete examples use the downloading utilities in the chainer.dataset module to cache downloaded and converted datasets.

General datasets

General datasets are further divided into four types.

The first one is DictDataset and TupleDataset, both of which combine other datasets and introduce some structures on them.

The second one is SubDataset, which represents a subset of an existing dataset. It can be used to separate a dataset for hold-out validation or cross validation. Convenient functions to make random splits are also provided.

The third one is TransformDataset, which wraps a dataset by applying a function to data indexed from the underlying dataset. It can be used to modify the behavior of a dataset that is already prepared.

The last one is a group of domain-specific datasets. Currently, ImageDataset and LabeledImageDataset are provided for datasets of images.

DictDataset

class chainer.datasets.DictDataset(**datasets)[source]

Dataset of a dictionary of datasets.

It combines multiple datasets into one dataset. Each example is represented by a dictionary mapping a key to an example of the corresponding dataset.

Parameters:datasets – Underlying datasets. The keys are used as the keys of each example. All datasets must have the same length.
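
The following is a minimal usage sketch; the arrays and the key names x and t are made up for illustration:

>>> import numpy as np
>>> from chainer.datasets import DictDataset
>>> x = np.arange(5, dtype=np.float32)  # hypothetical inputs
>>> t = np.arange(5, dtype=np.int32)    # hypothetical labels
>>> d = DictDataset(x=x, t=t)
>>> len(d)
5
>>> example = d[1]                      # a dict with keys 'x' and 't'
>>> sorted(example.keys())
['t', 'x']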

TupleDataset

class chainer.datasets.TupleDataset(*datasets)[source]

Dataset of a tuple of datasets.

It combines multiple datasets into one dataset. Each example is represented by a tuple whose i-th item corresponds to the i-th dataset.

Parameters:datasets – Underlying datasets. The i-th one is used for the i-th item of each example. All datasets must have the same length.
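
The following is a minimal usage sketch with made-up toy arrays:

>>> import numpy as np
>>> from chainer.datasets import TupleDataset
>>> x = np.arange(5, dtype=np.float32)  # hypothetical inputs
>>> t = np.arange(5, dtype=np.int32)    # hypothetical labels
>>> d = TupleDataset(x, t)
>>> len(d)
5
>>> x2, t2 = d[2]                       # the tuple (x[2], t[2])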

SubDataset

class chainer.datasets.SubDataset(dataset, start, finish, order=None)[source]

Subset of a base dataset.

SubDataset defines a subset of a given base dataset. The subset is defined as an interval of indexes, optionally with a given permutation.

If order is given, then the i-th example of this dataset is the order[start + i]-th example of the base dataset, where i is a non-negative integer. If order is not given, then the i-th example of this dataset is the start + i-th example of the base dataset. Negative indexing is also allowed: in this case, the term start + i is replaced by finish + i.

SubDataset is often used to split a dataset into training and validation subsets. The training set is used for training, while the validation set is used to track the generalization performance, i.e. how well the learned model works on unseen data. We can tune hyperparameters (e.g. number of hidden units, weight initializers, learning rate, etc.) by comparing the validation performance. Note that we often use another set called the test set to measure the quality of the tuned hyperparameters, which can be made by nesting multiple SubDatasets.

There are two ways to make training-validation splits. One is a single split, where the dataset is split just into two subsets. It can be done by split_dataset() or split_dataset_random(). The other one is a \(k\)-fold cross validation, in which the dataset is divided into \(k\) subsets, and \(k\) different splits are generated using each of the \(k\) subsets as a validation set and the rest as a training set. It can be done by get_cross_validation_datasets().

Parameters:
  • dataset – Base dataset.
  • start (int) – The first index in the interval.
  • finish (int) – The stop index of the interval; the subset covers indexes from start (inclusive) to finish (exclusive).
  • order (sequence of ints) – Permutation of indexes in the base dataset. If this is None, then the ascending order of indexes is used.
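
The following is a minimal sketch of constructing a SubDataset directly; the base array is made up for illustration:

>>> import numpy as np
>>> from chainer.datasets import SubDataset
>>> base = np.arange(10, dtype=np.float32)
>>> sub = SubDataset(base, 2, 8)  # covers examples 2, 3, ..., 7 of the base dataset
>>> len(sub)
6
>>> first = sub[0]                # corresponds to base[2]
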
chainer.datasets.split_dataset(dataset, split_at, order=None)[source]

Splits a dataset into two subsets.

This function creates two instances of SubDataset. These instances do not share any examples, and they together cover all examples of the original dataset.

Parameters:
  • dataset – Dataset to split.
  • split_at (int) – Position at which the base dataset is split.
  • order (sequence of ints) – Permutation of indexes in the base dataset. See the document of SubDataset for details.
Returns:

Two SubDataset objects. The first subset represents the examples of indexes order[:split_at], while the second subset represents the examples of indexes order[split_at:].

Return type:

tuple
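
The following is a minimal sketch with a made-up toy dataset:

>>> import numpy as np
>>> from chainer.datasets import split_dataset
>>> base = np.arange(10, dtype=np.float32)  # toy base dataset
>>> train, valid = split_dataset(base, 7)
>>> len(train), len(valid)
(7, 3)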

chainer.datasets.split_dataset_random(dataset, first_size, seed=None)[source]

Splits a dataset into two subsets randomly.

This function creates two instances of SubDataset. These instances do not share any examples, and they together cover all examples of the original dataset. The split is automatically done randomly.

Parameters:
  • dataset – Dataset to split.
  • first_size (int) – Size of the first subset.
  • seed (int) – Seed for the random generator used to permute the indexes. If an integer convertible to a 32-bit unsigned integer is specified, each sample in the given dataset is guaranteed to always belong to the same subset. If None, the permutation is changed randomly.
Returns:

Two SubDataset objects. The first subset contains first_size examples randomly chosen from the dataset without replacement, and the second subset contains the rest of the dataset.

Return type:

tuple
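
The following is a minimal sketch with a made-up toy dataset; fixing seed makes the split reproducible:

>>> import numpy as np
>>> from chainer.datasets import split_dataset_random
>>> base = np.arange(10, dtype=np.float32)  # toy base dataset
>>> train, valid = split_dataset_random(base, 7, seed=0)
>>> len(train), len(valid)
(7, 3)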

chainer.datasets.get_cross_validation_datasets(dataset, n_fold, order=None)[source]

Creates a set of training/test splits for cross validation.

This function generates n_fold splits of the given dataset. The first part of each split corresponds to the training dataset, while the second part to the test dataset. No two test datasets share any examples, and all test datasets together cover the whole base dataset. Each test dataset contains almost the same number of examples (the sizes may differ by at most one).

Parameters:
  • dataset – Dataset to split.
  • n_fold (int) – Number of splits for cross validation.
  • order (sequence of ints) – Order of indexes with which each split is determined. If it is None, then no permutation is used.
Returns:

List of dataset splits.

Return type:

list of tuples
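
The following is a minimal sketch with a made-up toy dataset of 10 examples and 5 folds:

>>> import numpy as np
>>> from chainer.datasets import get_cross_validation_datasets
>>> base = np.arange(10, dtype=np.float32)
>>> splits = get_cross_validation_datasets(base, 5)
>>> len(splits)                 # one (training, test) pair per fold
5
>>> train, test = splits[0]
>>> len(train), len(test)
(8, 2)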

chainer.datasets.get_cross_validation_datasets_random(dataset, n_fold, seed=None)[source]

Creates a set of training/test splits for cross validation randomly.

This function acts almost the same as get_cross_validation_datasets(), except that it automatically generates a random permutation of indexes.

Parameters:
  • dataset – Dataset to split.
  • n_fold (int) – Number of splits for cross validation.
  • seed (int) – Seed for the random generator used to permute the indexes. If an integer convertible to a 32-bit unsigned integer is specified, each sample in the given dataset is guaranteed to always belong to the same subset. If None, the permutation is changed randomly.
Returns:

List of dataset splits.

Return type:

list of tuples

TransformDataset

class chainer.datasets.TransformDataset(dataset, transform)[source]

Dataset that indexes the base dataset and transforms the data.

This dataset wraps the base dataset by modifying the behavior of the base dataset’s __getitem__(). Arrays returned by __getitem__() of the base dataset with an integer as an argument are transformed by the given function transform. Also, __len__() returns the integer returned by the base dataset’s __len__().

The function transform takes, as an argument, in_data, which is the output of the base dataset’s __getitem__(), and returns the transformed arrays as output. Please see the following example.

>>> from chainer.datasets import get_mnist
>>> from chainer.datasets import TransformDataset
>>> dataset, _ = get_mnist()
>>> def transform(in_data):
...     img, label = in_data
...     img -= 0.5  # scale to [-0.5, 0.5]
...     return img, label
>>> dataset = TransformDataset(dataset, transform)
Parameters:
  • dataset – The underlying dataset. The index of this dataset corresponds to the index of the base dataset. This object needs to support functions __getitem__() and __len__() as described above.
  • transform (callable) – A function that is called to transform values returned by the underlying dataset’s __getitem__().

ImageDataset

class chainer.datasets.ImageDataset(paths, root='.', dtype=<type 'numpy.float32'>)[source]

Dataset of images built from a list of paths to image files.

This dataset reads an external image file on every call of the __getitem__() operator. The paths to the images to retrieve are given as either a list of strings or a text file that contains paths in distinct lines.

Each image is automatically converted to an array of shape (channels, height, width), where channels represents the number of channels in each pixel (e.g., 1 for grey-scale images, and 3 for RGB-color images).

Note

This dataset requires the Pillow package to be installed. In order to use this dataset, install Pillow (e.g. by using the command pip install Pillow). Be careful to prepare appropriate libraries for the image formats you want to use (e.g. libpng for PNG images, and libjpeg for JPG images).

Parameters:
  • paths (str or list of strs) – If it is a string, it is a path to a text file that contains paths to images in distinct lines. If it is a list of paths, the i-th element represents the path to the i-th image. In both cases, each path is a relative one from the root path given by another argument.
  • root (str) – Root directory to retrieve images from.
  • dtype – Data type of resulting image arrays.
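
The following is a minimal sketch; the file names under the images directory are hypothetical:

>>> from chainer.datasets import ImageDataset
>>> paths = ['train_0.png', 'train_1.png']  # hypothetical image file names
>>> d = ImageDataset(paths, root='images')
>>> img = d[0]                              # float32 array of shape (channels, height, width)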

LabeledImageDataset

class chainer.datasets.LabeledImageDataset(pairs, root='.', dtype=<type 'numpy.float32'>, label_dtype=<type 'numpy.int32'>)[source]

Dataset of image and label pairs built from a list of paths and labels.

This dataset reads an external image file like ImageDataset. The difference from ImageDataset is that this dataset also returns a label integer. The paths and labels are given as either a list of pairs or a text file that contains path/label pairs in distinct lines. In the latter case, each path and the corresponding label are separated by white spaces. This format is the same as the one used in Caffe.

Note

This dataset requires the Pillow package to be installed. In order to use this dataset, install Pillow (e.g. by using the command pip install Pillow). Be careful to prepare appropriate libraries for the image formats you want to use (e.g. libpng for PNG images, and libjpeg for JPG images).

Parameters:
  • pairs (str or list of tuples) – If it is a string, it is a path to a text file that contains path/label pairs in distinct lines. If it is a list of pairs, the i-th element represents a pair of the path to the i-th image and the corresponding label. In both cases, each path is a relative one from the root path given by another argument.
  • root (str) – Root directory to retrieve images from.
  • dtype – Data type of resulting image arrays.
  • label_dtype – Data type of the labels.
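
The following is a minimal sketch; the path/label pairs are hypothetical:

>>> from chainer.datasets import LabeledImageDataset
>>> pairs = [('cat_0.png', 0), ('dog_1.png', 1)]  # hypothetical path/label pairs
>>> d = LabeledImageDataset(pairs, root='images')
>>> img, label = d[0]                             # image array and its label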

Concrete datasets

MNIST

chainer.datasets.get_mnist(withlabel=True, ndim=1, scale=1.0, dtype=<type 'numpy.float32'>, label_dtype=<type 'numpy.int32'>)[source]

Gets the MNIST dataset.

MNIST is a set of hand-written digits represented by grey-scale 28x28 images. In the original images, each pixel is represented by a one-byte unsigned integer. This function scales the pixels to floating point values in the interval [0, scale].

This function returns the training set and the test set of the official MNIST dataset. If withlabel is True, each dataset consists of tuples of images and labels, otherwise it only consists of images.

Parameters:
  • withlabel (bool) – If True, it returns datasets with labels. In this case, each example is a tuple of an image and a label. Otherwise, the datasets only contain images.
  • ndim (int) –

    Number of dimensions of each image. The shape of each image is determined depending on ndim as follows:

    • ndim == 1: the shape is (784,)
    • ndim == 2: the shape is (28, 28)
    • ndim == 3: the shape is (1, 28, 28)
  • scale (float) – Pixel value scale. If it is 1 (default), pixels are scaled to the interval [0, 1].
  • dtype – Data type of resulting image arrays.
  • label_dtype – Data type of the labels.
Returns:

A tuple of two datasets. If withlabel is True, both datasets are TupleDataset instances. Otherwise, both datasets are arrays of images.
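
For example, with the default arguments each example of the training set is a tuple of a flat image vector and its label (the first call downloads and caches the dataset):

>>> from chainer.datasets import get_mnist
>>> train, test = get_mnist()
>>> len(train), len(test)
(60000, 10000)
>>> img, label = train[0]
>>> img.shape
(784,)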

CIFAR10/100

chainer.datasets.get_cifar10(withlabel=True, ndim=3, scale=1.0)[source]

Gets the CIFAR-10 dataset.

CIFAR-10 is a set of small natural images. Each example is an RGB color image of size 32x32, classified into 10 groups. In the original images, each pixel component is represented by a one-byte unsigned integer. This function scales the components to floating point values in the interval [0, scale].

This function returns the training set and the test set of the official CIFAR-10 dataset. If withlabel is True, each dataset consists of tuples of images and labels, otherwise it only consists of images.

Parameters:
  • withlabel (bool) – If True, it returns datasets with labels. In this case, each example is a tuple of an image and a label. Otherwise, the datasets only contain images.
  • ndim (int) –

    Number of dimensions of each image. The shape of each image is determined depending on ndim as follows:

    • ndim == 1: the shape is (3072,)
    • ndim == 3: the shape is (3, 32, 32)
  • scale (float) – Pixel value scale. If it is 1 (default), pixels are scaled to the interval [0, 1].
Returns:

A tuple of two datasets. If withlabel is True, both datasets are TupleDataset instances. Otherwise, both datasets are arrays of images.
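
For example, with the default arguments each image is returned in the (channels, height, width) format (the first call downloads and caches the dataset):

>>> from chainer.datasets import get_cifar10
>>> train, test = get_cifar10()
>>> img, label = train[0]
>>> img.shape
(3, 32, 32)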

chainer.datasets.get_cifar100(withlabel=True, ndim=3, scale=1.0)[source]

Gets the CIFAR-100 dataset.

CIFAR-100 is a set of small natural images. Each example is an RGB color image of size 32x32, classified into 100 groups. In the original images, each pixel component is represented by a one-byte unsigned integer. This function scales the components to floating point values in the interval [0, scale].

This function returns the training set and the test set of the official CIFAR-100 dataset. If withlabel is True, each dataset consists of tuples of images and labels, otherwise it only consists of images.

Parameters:
  • withlabel (bool) – If True, it returns datasets with labels. In this case, each example is a tuple of an image and a label. Otherwise, the datasets only contain images.
  • ndim (int) –

    Number of dimensions of each image. The shape of each image is determined depending on ndim as follows:

    • ndim == 1: the shape is (3072,)
    • ndim == 3: the shape is (3, 32, 32)
  • scale (float) – Pixel value scale. If it is 1 (default), pixels are scaled to the interval [0, 1].
Returns:

A tuple of two datasets. If withlabel is True, both are TupleDataset instances. Otherwise, both datasets are arrays of images.

Penn Tree Bank

chainer.datasets.get_ptb_words()[source]

Gets the Penn Tree Bank dataset as long word sequences.

Penn Tree Bank is originally a corpus of English sentences with linguistic structure annotations. This function uses a variant distributed at https://github.com/tomsercu/lstm, which omits the annotation and splits the dataset into three parts: training, validation, and test.

This function returns the training, validation, and test sets, each of which is represented as a long array of word IDs. All sentences in the dataset are concatenated and delimited by the end-of-sentence mark ‘<eos>’, which is treated as one of the vocabulary entries.

Returns:Int32 vectors of word IDs.
Return type:tuple of numpy.ndarray

See also

Use get_ptb_words_vocabulary() to get the mapping between the words and word IDs.
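
The following is a minimal sketch of loading the word sequences together with the vocabulary:

>>> from chainer.datasets import get_ptb_words, get_ptb_words_vocabulary
>>> train, valid, test = get_ptb_words()  # long int32 arrays of word IDs
>>> vocab = get_ptb_words_vocabulary()    # maps each word to its word ID
>>> train.dtype
dtype('int32')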

chainer.datasets.get_ptb_words_vocabulary()[source]

Gets the Penn Tree Bank word vocabulary.

Returns:Dictionary that maps words to corresponding word IDs. The IDs are used in the Penn Tree Bank long sequence datasets.
Return type:dict

See also

See get_ptb_words() for the actual datasets.