Configuration is a crucial part of emloop linking all the components together with just a few lines of YAML.

If you aren’t comfortable with YAML, JSON documents are supported as well (YAML is actually a superset of JSON). However, JSON does not support advanced features such as anchors or simple comments.

Each configuration file is divided into following sections:

  • dataset
  • model
  • hooks
  • main_loop
  • eval

wherein only dataset and model sections are mandatory.


The dataset section contains, unsurprisingly, the dataset configuration. First of all, one need to specify the fully qualified name of the dataset in the class entry.

All the remaining parameters are passed to the dataset constructor in the form of a string-encoded YAML.

The common dataset arguments can include augmentation setting for the training stream, the data root directory or a batch size, to name but a few.

example dataset configuration
  class: datasets.my_dataset.MyDataset

  data_root: /var/my_data
  batch_size: 16
    rotate: true     # enable random rotations
    blur_prob: 0.05  # probability of blurring

See dataset documentation for more information.


The model section specifies the configuration of the model to be trained. Again, it contains a class that emloop uses for constructing the model.

In addition, inputs and outputs entries are required as well. These arguments define what sources will be obtained from the dataset stream and which will be provided by the model. The remaining parameters are directly passed to the model constructor from where they might (or might not) be used.

For example, an image-processing deep neural network might require the input image resolution, the number of hidden neurons or a learning rate as in the following example:

model configuration with additional parameters
  class: models.my_model.MyModel
  inputs: [image, animal]
  outputs: [prediction, loss, accuracy]

  width: 800
  height: 600
  learning_rate: 0.003
  n_hidden_neurons: 512


Note that in addition to the passed variables from the configuration, the dataset object is automatically passed to the model. Therefore, the model might use attributes and methods of the dataset as well. For example, the dataset might compute the number of target classes and the network can build the classifier in a way that suits the given dataset. Another example from natural language processing would be a number of distinct tokens for which the network train their embeddings.

See model documentation for more information.


The hooks section is optional yet omnipresent. It contains a list of hooks to be registered in the main loop (see main loop documentation) to save your model, terminate the training, visualize results etc.

Hooks are specified by their fully qualified names as in the following example:

non-parametrized hooks
  - LogProfile  # emloop.log_profile.LogProfile
  - emloop_tensorflow.TensorBoardHook
  - my_hooks.my_hook.MyHook


Standard emloop hooks from emloop.hooks module may be referenced only by their names, e.g.: LogProfile instead of emloop.log_profile.LogProfile.

In some cases, we need to configure the hooks being created with additional parameters. To do so, simply define a dictionary of parameters which will be passed to the hook constructor. E.g.:

parametrized hooks
  - emloop_scikit.ClassificationInfoHook:
      predicted_variable: predictions
      gold_variable: labels

  - ComputeStats:
      variables: [loss]

Main Loop

The main_loop section is optional. Any parameter specified there is forwarded to the emloop.MainLoop constructor which takes the following arguments:

MainLoop.__init__(model, dataset, hooks=(), train_stream_name='train', extra_streams=(), buffer=0, on_empty_batch='error', on_empty_stream='error', on_unused_sources='warn', on_incorrect_config='error', fixed_batch_size=None, fixed_epoch_size=None, skip_zeroth_epoch=False, **kwargs)[source]
  • model (AbstractModel) – trained model
  • dataset (AbstractDataset) – loaded dataset
  • hooks (Iterable[AbstractHook]) – training hooks
  • train_stream_name (str) – name of the training stream
  • extra_streams (List[str]) – additional stream names to be evaluated between epochs
  • buffer (int) – size of the batch buffer, 0 means no buffer
  • on_empty_batch (str) – action to take when batch is empty; one of MainLoop.EMPTY_ACTIONS
  • on_empty_stream (str) – action to take when stream is empty; one of MainLoop.EMPTY_ACTIONS
  • on_unused_sources (str) – action to take when stream provides an unused sources; one of UNUSED_SOURCE_ACTIONS
  • on_incorrect_config (str) – action to take when mainloop config contains unexpected arguments; one of MainLoop.INCORRECT_CONFIG_ACTIONS
  • fixed_batch_size (Optional[int]) – if specified, main_loop removes all batches that do not have the specified size
  • fixed_epoch_size (Optional[int]) – if specified, cut the train stream to epochs of at most fixed_epoch_size batches
  • skip_zeroth_epoch (bool) – if specified, main loop skips the 0th epoch

AssertionError – in case of unsupported value of on_empty_batch, on_empty_stream or on_unused_sources

main loop configuration
  extra_streams: [valid, test]
  skip_zeroth_epoch: True


Naturally, the evaluation (sometimes referred as prediction or inference) of the model on new unannotated data differs from its training. In this phase, we don’t know the ground truth, hence the dataset sources are different. In such a situation, some of the metrics are impossible to measure, e.g. accuracy, which requires the ground truth. Most likely, we also need a different set of hooks to process the model outputs.

For this reason, one can override the configuration with a special eval section. For each data stream, a sub-section (e.g.: eval.my_stream) is expected to match the overall configuration structure, i.e. it may contain the model, dataset, hooks and/or main_loop sections.

In the following example, we use all the original settings but the model inputs and outputs are overridden. Furthermore, a different list of hooks is specified. Yet another example is available in our examples repository @GitHub.

eval section of emloop configuration
  predict:  # configuration for the predict_stream
      inputs: [images]
      outputs: [predictions]

      - hooks.inference_logging_hook.InferenceLoggingHook:
          variables: [ids, predictions]

Evaluation of the predict stream can then be invoked with:

` emloop eval predict path/to/model `


The main motivation for this type of configuration is its modularity. The developer might easily produce various general models that will be trained or evaluated on different datasets, just by changing a few lines in the configuration file.

With this approach, the whole process of developing machine learning models is modularized. Once the interface (the names and the types of the data sources) are defined, the development of the model and the dataset might be done separately. In addition, the individual components are resusable for further experiments.

Furthermore, the configuration is backed up to the log directory. Therefore, it is clear what combination of model and dataset was used in the experiment, including all the parameters.

By registering custom hooks, the training and inference process might be arbitrarily changed. For instance, the results may be saved into a file/database, or they can be streamed