Chainer supports a common interface for training and validation of datasets. The dataset support consists of three components: datasets, iterators, and batch conversion functions.
Dataset represents a set of examples. Its interface is determined only by the iterators you want to use with it. The built-in iterators of Chainer require the dataset to support the __getitem__ and __len__ methods. In particular, the __getitem__ method should support indexing by both an integer and a slice. Slice indexing is easy to support by inheriting from DatasetMixin, in which case users only have to implement the get_example() method for indexing. Datasets are generally treated as stateless objects, so the dataset does not need to be saved as part of a checkpoint of the training procedure.
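The indexing contract described above can be illustrated with a small pure-Python sketch. This is not Chainer's DatasetMixin: IndexableDataset and SquaresDataset are hypothetical names, and the sketch only assumes the behavior stated in the text (a __len__ method, plus a __getitem__ that accepts an integer or a slice and delegates each index to get_example()).

```python
class IndexableDataset:
    """Hypothetical stand-in for the mixin described above (not Chainer code)."""

    def __len__(self):
        raise NotImplementedError

    def get_example(self, i):
        raise NotImplementedError

    def __getitem__(self, index):
        if isinstance(index, slice):
            # slice.indices() clamps the slice to the dataset length.
            start, stop, step = index.indices(len(self))
            return [self.get_example(i) for i in range(start, stop, step)]
        if index < 0:
            index += len(self)  # allow negative integer indices
        if not 0 <= index < len(self):
            raise IndexError("dataset index out of range")
        return self.get_example(index)


class SquaresDataset(IndexableDataset):
    """Toy dataset: example i is the pair (i, i * i)."""

    def __init__(self, n):
        self._n = n

    def __len__(self):
        return self._n

    def get_example(self, i):
        return i, i * i


dataset = SquaresDataset(5)
first = dataset[0]     # an integer index returns one example
middle = dataset[1:4]  # a slice index returns a list of examples
```

Because the subclass only supplies __len__ and get_example(), and holds no read position, it stays stateless in the sense described above.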
Iterator iterates over the dataset, and at each iteration, it yields a mini-batch of examples as a list. Iterators should support the Iterator interface, which includes the standard iterator protocol of Python. Iterators manage where to read next, which means they are stateful.
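To make the statefulness concrete, here is a minimal sketch of an iterator that yields mini-batches as lists and tracks where to read next. SequentialBatchIterator is a hypothetical class written for illustration; it follows Python's iterator protocol but is not one of Chainer's iterators, and its state() method is an assumption of this sketch.

```python
class SequentialBatchIterator:
    """Hypothetical sketch of a stateful mini-batch iterator (not Chainer code)."""

    def __init__(self, dataset, batch_size):
        self.dataset = dataset
        self.batch_size = batch_size
        self.position = 0  # state: index of the next example to read
        self.epoch = 0     # state: number of completed passes over the data

    def __iter__(self):
        return self

    def __next__(self):
        n = len(self.dataset)
        if self.position >= n:
            raise StopIteration
        end = min(self.position + self.batch_size, n)
        batch = list(self.dataset[self.position:end])  # relies on slice indexing
        self.position = end
        if end == n:
            self.epoch += 1
        return batch

    def state(self):
        """Return the values a checkpoint would need to resume iteration."""
        return {"position": self.position, "epoch": self.epoch}


data = list(range(5))
batches = list(SequentialBatchIterator(data, batch_size=2))

# Resume from a saved position with a fresh iterator object.
running = SequentialBatchIterator(data, batch_size=2)
next(running)
saved = running.state()
resumed = SequentialBatchIterator(data, batch_size=2)
resumed.position = saved["position"]
resumed_batch = next(resumed)
```

The read position is the only thing that distinguishes the resumed iterator from a new one, which is why the iterator, unlike the dataset, has to be included in a checkpoint.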
Batch conversion function converts the mini-batch into arrays to feed to the neural nets. It is also responsible for sending each array to an appropriate device. Chainer currently provides two implementations:
concat_examples() is a plain implementation which is used as the default choice.
ConcatWithAsyncTransfer is a variant which is basically the same as concat_examples() except that it overlaps other GPU computations with the data transfer for the next iteration.
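The core idea of batch conversion, turning a list of examples into one container per field, can be sketched in pure Python. The function concat_columns below is hypothetical and deliberately simplified: it returns plain lists where a real converter would build arrays, and it omits the device transfer step entirely.

```python
def concat_columns(batch):
    """Group a list of examples by field (simplified, hypothetical sketch)."""
    if not batch:
        raise ValueError("batch is empty")
    first = batch[0]
    if isinstance(first, tuple):
        # One column per tuple position.
        return tuple([example[i] for example in batch] for i in range(len(first)))
    if isinstance(first, dict):
        # One column per key.
        return {key: [example[key] for example in batch] for key in first}
    # Examples without structure are kept as a single column.
    return list(batch)


batch = [([0.0, 1.0], 0), ([2.0, 3.0], 1)]
inputs, labels = concat_columns(batch)

named = concat_columns([{"x": 1, "t": 0}, {"x": 2, "t": 1}])
```

Each returned column lines up with one input of the network, so the training loop can pass them on without knowing how individual examples were structured.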
These components are all customizable, and designed to have a minimum interface to restrict the types of datasets and ways to handle them. In most cases, though, the implementations provided by Chainer itself are enough to cover typical use cases.
Chainer also has a lightweight system to download, manage, and cache concrete examples of datasets. All datasets managed through the system are saved under the dataset root directory, which is determined by the CHAINER_DATASET_ROOT environment variable and can also be set programmatically.
See Examples for dataset implementations.
|DatasetMixin|Default implementation of dataset indexing.|
See Iterator for dataset iterator implementations.
|Iterator|Base class of all dataset iterators.|
Batch Conversion Function
|concat_examples()|Concatenates a list of examples into array(s).|
|ConcatWithAsyncTransfer|Interface to concatenate data and transfer them to GPU asynchronously.|
||Send an array to a given device.|
||Gets the path to the root directory to download and cache datasets.|
||Sets the root directory to download and cache datasets.|
||Downloads a file and caches it.|
||Caches a file if it does not exist, or loads it otherwise.|
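The dataset root lookup and the cache-or-load behavior described in this section might be sketched as follows. Both functions are hypothetical illustrations rather than Chainer's own utilities. Only the CHAINER_DATASET_ROOT variable name comes from the text; the fallback directory and the creator/loader callback signatures are assumptions of this sketch.

```python
import os
import tempfile


def resolve_dataset_root(environ=None, default=None):
    """Return the root directory for downloaded datasets (hypothetical helper)."""
    environ = os.environ if environ is None else environ
    if default is None:
        # Assumption for this sketch: fall back to a directory under the home dir.
        default = os.path.join(os.path.expanduser("~"), ".chainer", "dataset")
    return environ.get("CHAINER_DATASET_ROOT", default)


def cache_or_load(path, creator, loader):
    """Load `path` if it exists; otherwise build it once with `creator`."""
    if os.path.exists(path):
        return loader(path)
    directory = os.path.dirname(path)
    if directory:
        os.makedirs(directory, exist_ok=True)
    return creator(path)


def write_answer(path):
    # Stand-in for an expensive download or conversion step.
    with open(path, "w") as f:
        f.write("42")
    return "42"


def read_answer(path):
    with open(path) as f:
        return f.read()


root_from_env = resolve_dataset_root({"CHAINER_DATASET_ROOT": "/data/chainer"})

with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "toy", "answer.txt")
    created = cache_or_load(target, write_answer, read_answer)  # creates the file
    loaded = cache_or_load(target, write_answer, read_answer)   # reads the cache
```

Passing the environment in as an argument keeps the lookup testable without modifying the real process environment.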
The most basic dataset implementation is an array.
Both NumPy and CuPy arrays can be used directly as datasets.
In many cases, though, simple arrays are not enough to write the training procedure. To cover most such cases, Chainer provides many built-in dataset implementations.
These built-in datasets are divided into two groups.
One is a group of general datasets.
Most of them are wrappers around other datasets that introduce some structure (e.g., tuple or dict) to each data point.
The other one is a group of concrete, popular datasets.
These concrete examples use the downloading utilities in the chainer.dataset module to cache downloaded and converted datasets.
General datasets are further divided into four types.
The second one is ConcatenatedDataset and SubDataset.
ConcatenatedDataset represents a concatenation of existing datasets. It can be used to merge datasets and make a larger dataset.
SubDataset represents a subset of an existing dataset. It can be used to separate a dataset for hold-out validation or cross validation. Convenient functions to make random splits are also provided.
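Concatenation and subsetting, as described above, can both be expressed as thin views over existing datasets. The classes ConcatView and SubsetView and the function split_random below are hypothetical pure-Python sketches, not Chainer's implementations. For brevity they support integer indexing only, and the seed parameter is an assumption added to make the split reproducible.

```python
import random


class ConcatView:
    """Presents several datasets as one longer dataset (hypothetical sketch)."""

    def __init__(self, *datasets):
        self._datasets = datasets

    def __len__(self):
        return sum(len(d) for d in self._datasets)

    def __getitem__(self, i):
        if i < 0:
            i += len(self)
        if not 0 <= i < len(self):
            raise IndexError("dataset index out of range")
        for d in self._datasets:
            if i < len(d):
                return d[i]
            i -= len(d)  # skip past this dataset


class SubsetView:
    """Presents positions [start, finish) of a dataset, optionally permuted."""

    def __init__(self, dataset, start, finish, order=None):
        self._dataset = dataset
        self._start = start
        self._finish = finish
        self._order = order

    def __len__(self):
        return self._finish - self._start

    def __getitem__(self, i):
        if i < 0:
            i += len(self)
        if not 0 <= i < len(self):
            raise IndexError("dataset index out of range")
        index = self._start + i
        if self._order is not None:
            index = self._order[index]  # map through the permutation
        return self._dataset[index]


def split_random(dataset, first_size, seed=None):
    """Split a dataset into two disjoint random subsets."""
    order = list(range(len(dataset)))
    random.Random(seed).shuffle(order)
    return (SubsetView(dataset, 0, first_size, order),
            SubsetView(dataset, first_size, len(dataset), order))


merged = ConcatView([0, 1, 2], [10, 11])
train, valid = split_random(merged, 3, seed=0)
train_items = [train[i] for i in range(len(train))]
valid_items = [valid[i] for i in range(len(valid))]
```

Since both subsets share one permutation, they never overlap, which is the property a hold-out validation split depends on.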
The third one is TransformDataset, which wraps a dataset and applies a function to each example indexed from the underlying dataset.
It can be used to modify the behavior of a dataset that is already prepared.
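The wrapping behavior just described can be sketched as a view that applies a function at indexing time. TransformView and scale_pixels are hypothetical names used for illustration only; the sketch assumes the base dataset supports integer and slice indexing.

```python
class TransformView:
    """Applies `transform` to each example read from the base dataset."""

    def __init__(self, dataset, transform):
        self._dataset = dataset
        self._transform = transform

    def __len__(self):
        return len(self._dataset)

    def __getitem__(self, index):
        if isinstance(index, slice):
            return [self._transform(x) for x in self._dataset[index]]
        return self._transform(self._dataset[index])


def scale_pixels(example):
    """Rescale integer pixel values in [0, 255] to floats in [0, 1]."""
    image, label = example
    return [p / 255.0 for p in image], label


raw = [([0, 255], 3), ([51, 102], 7)]
scaled = TransformView(raw, scale_pixels)
first_scaled = scaled[0]
all_scaled = scaled[:]
```

The transform runs lazily on each access and leaves the base dataset untouched, so the same prepared dataset can be wrapped with different preprocessing for training and evaluation.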
|ConcatenatedDataset|Dataset which concatenates some base datasets.|
|SubDataset|Subset of a base dataset.|
||Splits a dataset into two subsets.|
||Splits a dataset into two subsets randomly.|
||Creates a set of training/test splits for cross validation.|
||Creates a set of training/test splits for cross validation randomly.|
|TransformDataset|Dataset that indexes the base dataset and transforms the data.|
||Dataset of images built from a list of paths to image files.|
||Dataset of images built from a zip file.|
||Dataset of images built from a list of paths to zip files.|
||Dataset of image and label pairs built from a list of paths and labels.|
||Gets the MNIST dataset.|
||Gets the Fashion-MNIST dataset.|
||Gets the CIFAR-10 dataset.|
||Gets the CIFAR-100 dataset.|
||Gets the Penn Tree Bank dataset as long word sequences.|
||Gets the Penn Tree Bank word vocabulary.|
||Gets the SVHN dataset.|