chainer.datasets.TextDataset

class chainer.datasets.TextDataset(paths, encoding=None, errors=None, newline=None, filter_func=None)[source]

Dataset of a line-oriented text file.

This dataset reads each line of the text file(s) on every call of the __getitem__() operator. Positions of line boundaries are cached so that any line can be quickly accessed at random by its line number.
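The offset-caching idea can be sketched with the standard library alone. The `LineOffsetIndex` class below is a hypothetical helper for illustration, not Chainer's actual implementation: it scans the file once, records the byte offset where each line starts, and then seeks directly to a cached offset to fetch an arbitrary line without rescanning earlier lines.

```python
import tempfile


class LineOffsetIndex:
    """Random access to lines of a text file via cached byte offsets
    (a sketch of the caching technique, not Chainer's implementation)."""

    def __init__(self, path, encoding="utf-8"):
        self._encoding = encoding
        self._offsets = []
        # Build the cache: one pass over the file in binary mode,
        # recording the byte offset at which each line begins.
        with open(path, "rb") as f:
            pos = f.tell()
            while f.readline():
                self._offsets.append(pos)
                pos = f.tell()
        self._file = open(path, "rb")

    def __len__(self):
        return len(self._offsets)

    def __getitem__(self, i):
        # Seek straight to the cached offset; cost is independent of i.
        self._file.seek(self._offsets[i])
        return self._file.readline().decode(self._encoding)


# Demo: index a small throwaway file.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("alpha\nbeta\ngamma\n")
    path = tmp.name

ds = LineOffsetIndex(path)
```

After the single scan in the constructor, `ds[i]` is a seek plus one `readline()`, which is what makes random access by line number cheap.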

Note

The cache is built in the constructor. You can pickle and unpickle the dataset to reuse the cache, but in that case you are responsible for guaranteeing that the files are not modified after the cache has been built.

Parameters
  • paths (str or list of str) – Path to the text file(s). If it is a string, this dataset reads a line from the text file and emits it as str. If it is a list of strings, this dataset reads a line from each text file and emits them as a tuple of str. In this case, the number of lines in all files must be the same.

  • encoding (str or list of str) – Name of the encoding used to decode the file. See the description in open() for the supported options and how it works. When reading from multiple text files, you can also pass a list of str to use a different encoding for each file.

  • errors (str or list of str) – String that specifies how decoding errors are to be handled. See the description in open() for the supported options and how it works. When reading from multiple text files, you can also pass a list of str to use a different error-handling policy for each file.

  • newline (str or list of str) – Controls how universal newlines mode works. See the description in open() for the supported options and how it works. When reading from multiple text files, you can also pass a list of str to use a different newline mode for each file.

  • filter_func (callable) – Function to filter each line of the text file. It must take as many arguments as the number of files; the arguments are the corresponding lines loaded from each file. The filter function must return True to accept the line, or False to skip it.
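The filtering semantics over parallel files can be sketched without Chainer. The `load_parallel_lines` helper below is hypothetical, for illustration only: it reads the files in lockstep and keeps a tuple of lines only when the predicate, called with one line per file, returns True.

```python
import tempfile


def load_parallel_lines(paths, filter_func=None):
    """Read parallel text files line by line; keep a tuple of lines only
    when filter_func(*lines) returns True (keep everything if None).
    A sketch of TextDataset's filter_func semantics, not its implementation."""
    handles = [open(p, encoding="utf-8") for p in paths]
    try:
        samples = []
        for lines in zip(*handles):
            if filter_func is None or filter_func(*lines):
                samples.append(tuple(lines))
        return samples
    finally:
        for h in handles:
            h.close()


def _write(text):
    # Create a throwaway file for the demo.
    f = tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False)
    f.write(text)
    f.close()
    return f.name


# Two parallel files, e.g. source/target sentence pairs, with one empty pair.
src = _write("hello\n\nshort\n")
tgt = _write("bonjour\n\ncourt\n")

# Skip pairs where either side is an empty line.
pairs = load_parallel_lines(
    [src, tgt],
    filter_func=lambda s, t: bool(s.strip()) and bool(t.strip()),
)
```

The predicate receives one argument per file, mirroring how filter_func is called with the corresponding line from each text file.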

Methods

__getitem__(index)[source]

Returns an example or a sequence of examples.

It implements the standard Python indexing and one-dimensional integer array indexing. It uses the get_example() method by default, but it may be overridden by the implementation to, for example, improve the slicing performance.

Parameters

index (int, slice, list or numpy.ndarray) – An index of an example or indexes of examples.

Returns

If index is int, returns an example created by get_example. If index is either slice or one-dimensional list or numpy.ndarray, returns a list of examples created by get_example.

Example

>>> import numpy
>>> from chainer import dataset
>>> class SimpleDataset(dataset.DatasetMixin):
...     def __init__(self, values):
...         self.values = values
...     def __len__(self):
...         return len(self.values)
...     def get_example(self, i):
...         return self.values[i]
...
>>> ds = SimpleDataset([0, 1, 2, 3, 4, 5])
>>> ds[1]   # Access by int
1
>>> ds[1:3]  # Access by slice
[1, 2]
>>> ds[[4, 0]]  # Access by one-dimensional integer list
[4, 0]
>>> index = numpy.arange(3)
>>> ds[index]  # Access by one-dimensional integer numpy.ndarray
[0, 1, 2]
__len__()[source]

Returns the number of data points.

close()[source]

Manually closes all text files.

In most cases, you do not have to call this method, because files will automatically be closed after the TextDataset instance goes out of scope.

get_example(idx)[source]

Returns the example at the given index.

Implementations should override it. It should raise IndexError if the index is invalid.

Parameters

idx (int) – The index of the example.

Returns

The example at index idx.

__eq__(value, /)

Return self==value.

__ne__(value, /)

Return self!=value.

__lt__(value, /)

Return self<value.

__le__(value, /)

Return self<=value.

__gt__(value, /)

Return self>value.

__ge__(value, /)

Return self>=value.