Dataset¶
The dataset is an essential component of every machine learning problem; in fact, the dataset often is the problem itself. Regardless of size, complexity and format, all datasets are beautiful. At some point, though, we need to iterate through the data points, or examples if you wish.
As a matter of fact, the ability to provide example iterators is the only requirement emloop places on datasets.
A dataset used for training has to implement a train_stream method; in addition, any <stream_name>_stream method may be used for evaluation.
To make a dataset compatible with emloop, one must implement a class which complies with the emloop.datasets.AbstractDataset concept.
With that, emloop can create and manage the dataset for you.
To use the dataset in training, specify its fully-qualified name in the dataset section of the emloop configuration file. See the configuration section for more information.
Note
emloop datasets are configured from emloop configuration files. When creating a dataset, emloop encodes the parameters as a YAML string in order to ease interoperability in case the dataset is implemented in a different language, such as C++.
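For illustration only, the parameter round trip might look like the following minimal sketch using PyYAML; the exact encoding emloop performs internally may differ:

import yaml

# parameters taken from the `dataset` section of the configuration
params = {'batch_size': 16, 'augment': {'rotate': True}}

encoded = yaml.dump(params)        # a YAML string, easy to pass across languages
decoded = yaml.safe_load(encoded)  # the parsed arguments the dataset receives
assert decoded == params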
BaseDataset¶
To write your very first dataset in Python, we recommend inheriting from emloop.datasets.BaseDataset, as it parses the YAML string automatically. The parsed arguments are passed to emloop.datasets.BaseDataset._configure_dataset(), which your implementation is required to provide.
To give an example, we'll write a skeleton of MyDataset in datasets/my_dataset.py:
datasets/my_dataset.py¶
from emloop import BaseDataset


class MyDataset(BaseDataset):

    def _configure_dataset(self, batch_size: int, augment: dict, **kwargs) -> None:
        # remember the parsed arguments for later use
        self._batch_size = batch_size
        self._augment = augment
This class requires two arguments, batch_size and augment. Any other arguments are ignored and hidden in **kwargs.
Next, we define the dataset section in the config file:
MyDataset in emloop configuration¶
dataset:
  class: datasets.MyDataset
  batch_size: 16
  augment:
    rotate: true     # enable random rotations
    blur_prob: 0.05  # probability of blurring
Given this configuration, emloop can find, create and configure a new MyDataset instance seamlessly.
Data Streams¶
In most cases, datasets are quite large and cannot be fed to the model as a whole. For this reason, emloop operates with streams of so-called mini-batches, i.e., small portions of the dataset. In particular, emloop works with data on the following levels:
- a stream is an iterable of batches (emloop.Stream)
- a batch is a dictionary of stream sources (emloop.BatchData)
- a stream source is a list of example fields
For instance, imagine you are classifying images of animals. An example in this case would be a tuple of two fields, image and label.
Now, the stream would yield batches with image and label stream sources similar to this one:
{
'image': [img1, img2, img3, img4],
'label': ['cat', 'cat', 'dog', 'rabbit']
}
Implementing a <name>_stream method which returns a stream iterator allows emloop to use the respective stream.
When training, emloop requires the train stream to be provided by a train_stream method similar to the following one:
train_stream method example¶
def train_stream(self):
    for i in range(10):
        yield load_training_batch(num=i)
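For a more concrete picture, a train_stream slicing an in-memory dataset into mini-batches might look like the following sketch; the _images, _labels and _batch_size attributes are assumptions of this example, not emloop requirements:

def train_stream(self):
    # yield batches (dictionaries of stream sources) of self._batch_size examples
    for start in range(0, len(self._images), self._batch_size):
        yield {
            'image': self._images[start:start + self._batch_size],
            'label': self._labels[start:start + self._batch_size],
        }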
Analogously, additional methods such as valid_stream and test_stream can be easily implemented.
If they are registered in the config file under main_loop.extra_streams, they will be evaluated along with the train stream. The configuration may look as follows:
main_loop:
extra_streams: [valid, test]
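An extra stream is implemented just like the train stream. For instance, a hypothetical valid_stream over held-out data (again, the attribute names are assumptions of this sketch):

def valid_stream(self):
    # same batch format as train_stream, only the underlying data differs
    for start in range(0, len(self._valid_images), self._batch_size):
        yield {
            'image': self._valid_images[start:start + self._batch_size],
            'label': self._valid_labels[start:start + self._batch_size],
        }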
The extra streams, however, are not used for training; that is, the model won't be updated when it iterates through them.
Finally, emloop allows evaluation of any additional stream with the emloop eval <stream_name> ... command.
Additional Methods¶
Alongside providing the streams, the dataset may implement additional methods for downloading, validating or visualizing the data.
For example, it can contain a download method, which checks whether the dataset has all the data it requires.
If not, it downloads the data from the internet/database/drive. These methods may be easily invoked with
emloop dataset <method-name> <config>
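As an illustration, a download method of MyDataset might look like this sketch; the _data_root attribute and the URL are made up for the example:

import os
import urllib.request

def download(self):
    # fetch the data archive only when it is not on the disk yet
    archive = os.path.join(self._data_root, 'data.tar.gz')
    if not os.path.exists(archive):
        os.makedirs(self._data_root, exist_ok=True)
        urllib.request.urlretrieve('https://example.com/data.tar.gz', archive)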
Another useful method could be statistics, which would print various statistics of the provided data, plot some figures, etc.
Sometimes, we need to split the whole dataset into training, validation and testing sets.
For this purpose, we would implement a split method.
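A split method could, for example, shuffle the example indices and store the resulting partition; this is merely one possible approach:

import random

def split(self, valid_ratio: float = 0.1, test_ratio: float = 0.1):
    # randomly partition the example indices into train/valid/test sets
    indices = list(range(len(self._images)))
    random.shuffle(indices)
    n_valid = int(len(indices) * valid_ratio)
    n_test = int(len(indices) * test_ratio)
    self._valid_indices = indices[:n_valid]
    self._test_indices = indices[n_valid:n_valid + n_test]
    self._train_indices = indices[n_valid + n_test:]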
The suggested methods are completely arbitrary. The key concept is to keep data-related functions bundled together in the dataset object, so that one doesn’t need to implement several separate scripts for fetching/visualization/statistics etc.
A typical pipeline consists of the following commands. We leave them without further comment, as they are self-explanatory.
emloop dataset download config/my-data.yaml
emloop dataset validate config/my-data.yaml
emloop dataset print_statistics config/my-data.yaml
emloop dataset plot_histogram config/my-data.yaml
emloop train config/my-data.yaml
emloop eval test log/my-model
The Philosophy of Laziness¶
In our experience, the best practice is for the dataset to perform all initialization on demand.
This technique is sometimes called lazy initialization.
That is, the constructor should not perform any time-consuming operation such as loading and decoding the data.
Instead, the data should be loaded and decoded at the moment they are first truly needed (e.g., in the train_stream method).
The main reason for laziness is that the dataset doesn't know for which purpose it was constructed. It might be queried to provide the training data, or only to print some simple checksums. In the case of extremely big datasets, it is wasteful and annoying to spend time loading data that is never actually used.
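A common way to implement this in Python is a lazily evaluated property. A minimal sketch, assuming a hypothetical _load_and_decode helper:

from emloop import BaseDataset


class MyDataset(BaseDataset):

    def _configure_dataset(self, data_root: str, **kwargs) -> None:
        self._data_root = data_root
        self._data = None  # nothing is loaded in the constructor

    @property
    def data(self):
        # load and decode the data on first access only
        if self._data is None:
            self._data = self._load_and_decode(self._data_root)  # hypothetical helper
        return self._data

    def train_stream(self):
        for batch in self.data:  # triggers the loading when needed
            yield batch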