Datasets

Dataset Abstraction (chainer.dataset)

Chainer provides a common interface for datasets used in training and validation. The dataset support consists of three components: datasets, iterators, and batch conversion functions.

Dataset represents a set of examples. Its required interface is determined only by the iterators you want to use on it. The built-in iterators of Chainer require the dataset to support the __getitem__ and __len__ methods. In particular, the __getitem__ method should support indexing by both an integer and a slice. Slice indexing can easily be supported by inheriting DatasetMixin, in which case users only have to implement the get_example() method for indexing single examples. Datasets are treated as stateless objects, so they do not need to be saved as part of a training checkpoint.
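As an illustration, here is a minimal plain-Python sketch of the dataset protocol described above (it does not import Chainer; the class name is made up). It mirrors the DatasetMixin pattern: get_example() handles a single index, and __getitem__ dispatches integers and slices to it.

```python
class SquaresDataset:
    """Toy dataset whose i-th example is i squared (illustrative only)."""

    def __init__(self, n):
        self._n = n

    def __len__(self):
        return self._n

    def get_example(self, i):
        # One example per call; DatasetMixin subclasses implement only this.
        return i * i

    def __getitem__(self, index):
        if isinstance(index, slice):
            # Slice indexing returns a list of examples.
            return [self.get_example(i)
                    for i in range(*index.indices(len(self)))]
        return self.get_example(index)


ds = SquaresDataset(5)
print(len(ds))    # 5
print(ds[2])      # 4
print(ds[1:4])    # [1, 4, 9]
```

Any object supporting __getitem__ and __len__ in this way can be consumed by the built-in iterators.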

Iterator iterates over the dataset, and at each iteration, it yields a mini-batch of examples as a list. Iterators should support the Iterator interface, which includes the standard iterator protocol of Python. Iterators manage where to read next, which means they are stateful.
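The following is a hedged sketch of what such an iterator does, written in plain Python (this is not Chainer's Iterator class, and the class name is hypothetical): it walks over a dataset, yields mini-batches of examples as lists, and keeps the read position as internal state.

```python
class SimpleBatchIterator:
    """Illustrative mini-batch iterator over any indexable dataset."""

    def __init__(self, dataset, batch_size):
        self.dataset = dataset
        self.batch_size = batch_size
        self.current = 0  # stateful: where to read next

    def __iter__(self):
        return self

    def __next__(self):
        if self.current >= len(self.dataset):
            raise StopIteration
        end = min(self.current + self.batch_size, len(self.dataset))
        batch = [self.dataset[i] for i in range(self.current, end)]
        self.current = end
        return batch


batches = list(SimpleBatchIterator(list(range(7)), 3))
print(batches)  # [[0, 1, 2], [3, 4, 5], [6]]
```

Because the position is state, a real training loop would need to serialize the iterator (unlike the dataset) to resume training.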

Batch conversion function converts the mini-batch into arrays to feed to the neural nets. It is also responsible for sending each array to an appropriate device. Chainer currently provides two implementations: concat_examples() and ConcatWithAsyncTransfer, both described below.
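As an illustration, here is a minimal NumPy sketch of what a batch conversion function does for tuple examples, in the spirit of concat_examples (this is not Chainer's implementation, and the function name is made up): each component of the examples is stacked into one array.

```python
import numpy as np

def concat_tuple_examples(batch):
    # batch: list of (input, label) tuples -> (inputs array, labels array)
    xs, ys = zip(*batch)
    return np.stack(xs), np.asarray(ys)

batch = [(np.zeros(3), 0), (np.ones(3), 1)]
xs, ys = concat_tuple_examples(batch)
print(xs.shape, ys.shape)  # (2, 3) (2,)
```

A real converter additionally moves the resulting arrays to the requested device, which this sketch omits.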

These components are all customizable and designed with minimal interfaces so as not to restrict the types of datasets or the ways of handling them. In most cases, though, the implementations provided by Chainer itself are enough to cover typical usage.

Chainer also has a light system to download, manage, and cache concrete examples of datasets. All datasets managed through the system are saved under the dataset root directory, which is determined by the CHAINER_DATASET_ROOT environment variable, and can also be set by the set_dataset_root() function.
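A hedged sketch of how such a root directory can be resolved (the helper name here is hypothetical; Chainer's actual default root is under the user's home directory, and get_dataset_root()/set_dataset_root() below are the real API):

```python
import os

def resolve_dataset_root():
    # Assumption for illustration: fall back to ~/.chainer/dataset when
    # the CHAINER_DATASET_ROOT environment variable is not set.
    default = os.path.join(os.path.expanduser('~'), '.chainer', 'dataset')
    return os.environ.get('CHAINER_DATASET_ROOT', default)
```

Setting the environment variable before launching training redirects all managed dataset downloads and caches to that directory.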

Dataset Representation

See Dataset Examples (chainer.datasets) for dataset implementations.

chainer.dataset.DatasetMixin

Default implementation of dataset indexing.

Tabular Dataset Representation

chainer.dataset.TabularDataset

An abstract class that represents a tabular dataset.

Iterator Interface

See Iterator for dataset iterator implementations.

chainer.dataset.Iterator

Base class of all dataset iterators.

Batch Conversion Function

chainer.dataset.converter

Decorator to make a converter function.

chainer.dataset.concat_examples

Concatenates a list of examples into array(s).

chainer.dataset.ConcatWithAsyncTransfer

Interface to concatenate data and transfer them to GPU asynchronously.

chainer.dataset.to_device

Send an array to a given device.

Dataset Management

chainer.dataset.get_dataset_root

Gets the path to the root directory to download and cache datasets.

chainer.dataset.set_dataset_root

Sets the root directory to download and cache datasets.

chainer.dataset.cached_download

Downloads a file and caches it.

chainer.dataset.cache_or_load_file

Caches a file if it does not exist, or loads it otherwise.

Dataset Examples (chainer.datasets)

The most basic dataset implementation is an array. Both NumPy and CuPy arrays can be used directly as datasets.
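For example, a NumPy array already satisfies the dataset interface: it supports len() as well as integer and slice indexing, so it can be passed directly to the built-in iterators.

```python
import numpy as np

data = np.arange(12).reshape(6, 2)  # six examples of two features each
print(len(data))   # 6 examples
print(data[0])     # integer indexing: one example
print(data[1:3])   # slice indexing: a batch of examples
```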

In many cases, though, simple arrays are not enough to write the training procedure. To cover most such cases, Chainer provides many built-in dataset implementations.

These built-in datasets are divided into two groups. One is a group of general datasets. Most of them are wrappers around other datasets that introduce some structure (e.g., tuple or dict) to each data point. The other is a group of concrete, popular datasets. These concrete examples use the downloading utilities in the chainer.dataset module to cache downloaded and converted datasets.

General Datasets

General datasets are further divided into four types.

The first one is DictDataset and TupleDataset, both of which combine other datasets and introduce some structures on them.
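The idea behind TupleDataset can be sketched in plain Python (this is illustrative, not Chainer's implementation): combine several equal-length datasets so that the i-th example is the tuple of the i-th items of each.

```python
class ToyTupleDataset:
    """Illustrative stand-in for the TupleDataset concept."""

    def __init__(self, *datasets):
        assert all(len(d) == len(datasets[0]) for d in datasets)
        self._datasets = datasets

    def __len__(self):
        return len(self._datasets[0])

    def __getitem__(self, i):
        # The i-th example pairs up the i-th item of every base dataset.
        return tuple(d[i] for d in self._datasets)


xs = [0.0, 1.0, 2.0]  # e.g., inputs
ys = [0, 1, 0]        # e.g., labels
ds = ToyTupleDataset(xs, ys)
print(ds[1])  # (1.0, 1)
```

DictDataset follows the same idea but keys each component by name instead of position.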

The second one is ConcatenatedDataset and SubDataset. ConcatenatedDataset represents a concatenation of existing datasets. It can be used to merge datasets and make a larger dataset. SubDataset represents a subset of an existing dataset. It can be used to separate a dataset for hold-out validation or cross validation. Convenient functions to make random splits are also provided.
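A hold-out split in the spirit of split_dataset_random can be sketched as follows (the function name here is illustrative, not Chainer's API): shuffle the indices and carve the dataset into two disjoint subsets.

```python
import random

def holdout_split(dataset, first_size, seed=0):
    # Shuffle a copy of the index order, then slice it into two parts.
    order = list(range(len(dataset)))
    random.Random(seed).shuffle(order)
    first = [dataset[i] for i in order[:first_size]]
    second = [dataset[i] for i in order[first_size:]]
    return first, second

train, valid = holdout_split(list(range(10)), 7)
print(len(train), len(valid))  # 7 3
```

Chainer's real SubDataset additionally keeps the split lazy, indexing into the base dataset on demand rather than copying examples.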

The third one is TransformDataset, which wraps around a dataset by applying a function to data indexed from the underlying dataset. It can be used to modify behavior of a dataset that is already prepared.
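The TransformDataset idea can be sketched in plain Python (illustrative only, not Chainer's implementation): wrap a base dataset and apply a function to every example on access, e.g. for on-the-fly preprocessing or augmentation.

```python
class ToyTransformDataset:
    """Illustrative stand-in for the TransformDataset concept."""

    def __init__(self, dataset, transform):
        self._dataset = dataset
        self._transform = transform

    def __len__(self):
        return len(self._dataset)

    def __getitem__(self, i):
        # The transform runs lazily, each time an example is indexed.
        return self._transform(self._dataset[i])


base = [1, 2, 3]
doubled = ToyTransformDataset(base, lambda x: 2 * x)
print([doubled[i] for i in range(len(doubled))])  # [2, 4, 6]
```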

The last one is a group of domain-specific datasets. Currently, implementations for datasets of images (ImageDataset, LabeledImageDataset, etc.) and text (TextDataset) are provided.

DictDataset

chainer.datasets.DictDataset

Dataset of a dictionary of datasets.

TupleDataset

chainer.datasets.TupleDataset

Dataset of tuples from multiple equal-length datasets.

ConcatenatedDataset

chainer.datasets.ConcatenatedDataset

Dataset which concatenates some base datasets.

SubDataset

chainer.datasets.SubDataset

Subset of a base dataset.

chainer.datasets.split_dataset

Splits a dataset into two subsets.

chainer.datasets.split_dataset_random

Splits a dataset into two subsets randomly.

chainer.datasets.get_cross_validation_datasets

Creates a set of training/test splits for cross validation.

chainer.datasets.get_cross_validation_datasets_random

Creates a set of training/test splits for cross validation randomly.

TransformDataset

chainer.datasets.TransformDataset

Dataset that indexes the base dataset and transforms the data.

ImageDataset

chainer.datasets.ImageDataset

Dataset of images built from a list of paths to image files.

chainer.datasets.ZippedImageDataset

Dataset of images built from a zip file.

chainer.datasets.MultiZippedImageDataset

Dataset of images built from a list of paths to zip files.

LabeledImageDataset

chainer.datasets.LabeledImageDataset

Dataset of image and label pairs built from a list of paths and labels.

chainer.datasets.LabeledZippedImageDataset

Dataset of zipped image and label pairs.

TextDataset

chainer.datasets.TextDataset

Dataset of a line-oriented text file.

PickleDataset

chainer.datasets.PickleDataset

Dataset stored in a storage using pickle.

chainer.datasets.PickleDatasetWriter

Writer class that makes PickleDataset.

chainer.datasets.open_pickle_dataset

Opens a dataset stored in a given path.

chainer.datasets.open_pickle_dataset_writer

Opens a writer to make a PickleDataset.

Concrete Datasets

chainer.datasets.get_mnist

Gets the MNIST dataset.

chainer.datasets.get_kuzushiji_mnist

Gets the Kuzushiji-MNIST dataset.

chainer.datasets.get_kuzushiji_mnist_labels

Provides a list of labels for the Kuzushiji-MNIST dataset.

chainer.datasets.get_fashion_mnist_labels

Provides a list of the string names of the Fashion-MNIST labels.

chainer.datasets.get_fashion_mnist

Gets the Fashion-MNIST dataset.

chainer.datasets.get_cifar10

Gets the CIFAR-10 dataset.

chainer.datasets.get_cifar100

Gets the CIFAR-100 dataset.

chainer.datasets.get_ptb_words

Gets the Penn Tree Bank dataset as long word sequences.

chainer.datasets.get_ptb_words_vocabulary

Gets the Penn Tree Bank word vocabulary.

chainer.datasets.get_svhn

Gets the SVHN dataset.

Note

ChainerCV supports implementations of datasets that are useful for computer vision problems, which can be found in chainercv.datasets. Here is a subset of data loaders supported by ChainerCV: