Experiment Class of Fed-BioMed

Fed-BioMed provides a federated model training process over multiple nodes, where the datasets are stored and the models get trained. The experiment is in charge of orchestrating this training process on the available nodes. Orchestration means:

  • Searching for datasets on the existing nodes, based on specific tags given by the researcher and used by the nodes to identify their data.
  • Uploading the model file created by the researcher and sending the file URL to the nodes.
  • Sending the model and training arguments to the nodes.
  • Tracking the training process on the nodes during all training rounds.
  • Checking the nodes' responses to make sure that each round has been successfully completed on every node.
  • Downloading the local model parameters after every round of training.
  • Aggregating the local model parameters based on the specified federated approach, and eventually sending the aggregated parameters to the selected nodes for the next round.
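The steps above can be sketched as a high-level loop. This is an illustrative toy, not the actual Fed-BioMed implementation: the FakeNode class, its methods, and the unweighted fed_average helper are all hypothetical stand-ins.

```python
# Toy sketch of the orchestration loop; FakeNode and fed_average are
# hypothetical stand-ins, not part of the Fed-BioMed API.

class FakeNode:
    """Stand-in for a live node holding tagged datasets."""
    def __init__(self, tags):
        self.tags = set(tags)

    def has_dataset(self, tags):
        # A dataset matches when it carries every searched tag
        return set(tags) <= self.tags

    def train(self, model_url, params, training_args):
        # A real node downloads the model, trains locally and uploads its
        # parameters; here we just return a dummy successful reply.
        local = [p + 1.0 for p in (params or [0.0, 0.0])]
        return {"success": True, "params": local}


def fed_average(param_lists):
    # Plain (unweighted) mean of each parameter across nodes
    return [sum(values) / len(values) for values in zip(*param_lists)]


def run_experiment(nodes, tags, model_url, training_args, rounds):
    selected = [n for n in nodes if n.has_dataset(tags)]      # dataset search
    params = None
    for _ in range(rounds):
        replies = [n.train(model_url, params, training_args)  # send commands
                   for n in selected]
        assert all(r["success"] for r in replies)             # round check
        params = fed_average([r["params"] for r in replies])  # aggregation
    return params
```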

Arguments of The Experiment

The list above explains what an experiment does during federated model training. Several arguments have to be passed to the experiment so that it can manage these steps. The following code snippet shows a basic experiment initialization for federated training.

Experiment( tags=tags,
            nodes=None,
            model_path=model_file,
            model_args=model_args,
            model_class='Net',
            training_args=training_args,
            rounds=rounds,
            aggregator=FedAverage(),
            node_selection_strategy=None)

When you first initialize your experiment, it creates a FederatedDataSet by searching for datasets on the nodes based on the given list of tags, initializes a job based on model_file, training_args, rounds, model_args, and model_class, and selects nodes according to the node_selection_strategy.
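For intuition, the tag-based search that builds the FederatedDataSet can be pictured as filtering each node's dataset metadata. The following is a toy sketch: the node_datasets mapping, the IDs and the search function are made up for illustration.

```python
# Toy illustration of tag-based dataset search across nodes; the node and
# dataset IDs below are made-up placeholders.
node_datasets = {
    "node_1": [{"dataset_id": "dataset_a", "tags": ["#MNIST", "#dataset"]}],
    "node_2": [{"dataset_id": "dataset_b", "tags": ["#MNIST"]}],
    "node_3": [{"dataset_id": "dataset_c", "tags": ["#CIFAR"]}],
}

def search(tags):
    """Return, per node, the datasets whose tags contain all searched tags."""
    found = {}
    for node_id, datasets in node_datasets.items():
        matches = [d["dataset_id"] for d in datasets
                   if set(tags) <= set(d["tags"])]
        if matches:
            found[node_id] = matches
    return found
```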

The model file is the path of a file that contains the model class. When using Jupyter notebooks, it has to be uploaded to the file repository so that it is accessible to the nodes. The experiment sends its URL to the nodes during every round of training, so each node can construct the model and perform the training. The model_class argument gives the name of the class the nodes use to initialize the model.

model_args is a dictionary containing arguments related to the model (e.g. number of layers, features, etc.). It is passed to the model class during model initialization on the node side. For example, the number of features used in the network layers can be passed through model_args. An example is shown below.

{
    "in_features"   : 15,
    "out_features"  : 1
}

These parameters can then be used as in the example below:

class MyTrainingPlan(TorchTrainingPlan):
    def __init__(self, kwargs):
        super(MyTrainingPlan, self).__init__()
        # kwargs holds the model_args dictionary passed to the experiment
        self.in_features = kwargs['in_features']
        self.out_features = kwargs['out_features']
        self.fc1 = nn.Linear(self.in_features, 5)
        self.fc2 = nn.Linear(5, self.out_features)

training_args is also a dictionary, containing the arguments for the training routine (e.g. batch size, learning rate, number of epochs, etc.). It is passed to the training routine on the node side to indicate how the model should be trained. An example is shown below.

training_args = {
    'batch_size': 20, 
    'lr': 1e-3, 
    'epochs': 10, 
    'dry_run': False,  
    'batch_maxnum': 100 
}
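On the node side, the training routine consumes these values roughly as follows. This is a schematic sketch with a user-supplied train_step callback, not the actual Fed-BioMed routine:

```python
def training_routine(samples, training_args, train_step):
    """Schematic local training loop driven by training_args."""
    bs = training_args['batch_size']
    losses = []
    for _ in range(training_args['epochs']):
        batches = [samples[i:i + bs] for i in range(0, len(samples), bs)]
        # batch_maxnum caps how many batches are processed per epoch
        for batch in batches[:training_args['batch_maxnum']]:
            losses.append(train_step(batch, training_args['lr']))
            # dry_run stops after a single batch, for a quick sanity check
            if training_args['dry_run']:
                return losses
    return losses
```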

Other Arguments

The experiment also accepts nodes and tensorboard arguments.

  • The nodes argument is a list of node IDs. When it is set, the experiment sends the dataset search request only to these nodes. You can visit the listing datasets and selecting nodes documentation for more information.
  • The tensorboard argument is a boolean that activates TensorBoard during training. When it is True, the loss values received from each node are written to TensorBoard event files for display in TensorBoard. You can visit the Tensorboard documentation page for more information about using TensorBoard with Fed-BioMed.
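For example, restricting the dataset search to two known nodes and enabling TensorBoard looks like this (same constructor as above; the node IDs are made-up placeholders):

```python
Experiment( tags=tags,
            nodes=['node_1234', 'node_5678'],   # search only these nodes
            model_path=model_file,
            model_class='Net',
            training_args=training_args,
            rounds=rounds,
            aggregator=FedAverage(),
            node_selection_strategy=None,
            tensorboard=True)                   # write losses to event files
```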

Running an Experiment

As mentioned before, the experiment is an orchestration class, and running an experiment means starting the training process. The experiment publishes training commands to the MQTT server, to which each live node subscribes, and then waits for the responses sent by the nodes. Each training command is a JSON message that includes the following pieces of information.

{
   "researcher_id":"researcher id that sends training command",
   "job_id":"job id created by the experiment",
   "training_args":{
      "batch_size":32,
      "lr":0.001,
      "epochs":1,
      "dry_run":false,
      "batch_maxnum":100
   },
   "model_args":<args object>,
   "command":"train",
   "model_url":"<model url>",
   "params_url":"<model_paramater_url>",
   "model_class":"Net",
   "training_data":{
      "node_id":[
         "dataset_id"
      ]
   }
}
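Assembling and serializing such a command needs only the standard library. The following is an illustrative sketch; the helper name and argument order are made up, not Fed-BioMed internals.

```python
import json

def make_train_command(researcher_id, job_id, training_args, model_url,
                       params_url, model_class, training_data, model_args=None):
    """Assemble the JSON training command to be published over MQTT."""
    return json.dumps({
        "researcher_id": researcher_id,
        "job_id": job_id,
        "training_args": training_args,
        "model_args": model_args,
        "command": "train",
        "model_url": model_url,
        "params_url": params_url,
        "model_class": model_class,
        "training_data": training_data,
    })
```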

The experiment sends the training command to each node one by one, then waits for the reply published by each node. A training reply includes the URL of the parameter file that the node has saved to the file repository. The following code snippet shows an example of a reply from a node.

{
   "researcher_id":"researcher id that sent the training command",
   "job_id":"id of the job that created the training request",
   "success":true,
   "node_id":"ID of the node that completed the training",
   "dataset_id":"dataset_dcf88a68-7f66-4b60-9b65-db09c6d970ee",
   "params_url":"URL of the model parameters' file obtained after training",
   "timing":{
      "rtime_training":87.74385611899197,
      "ptime_training":330.388954968
   },
   "msg":"",
   "command":"train"
}
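Before downloading parameters, the experiment verifies that every expected node replied and succeeded. Conceptually, this check looks as follows (a simplified sketch, not the actual implementation):

```python
def check_round(replies, expected_nodes):
    """Return params_url per node if every expected node replied successfully."""
    by_node = {r["node_id"]: r for r in replies}
    missing = set(expected_nodes) - set(by_node)
    if missing:
        raise RuntimeError(f"no reply from nodes: {sorted(missing)}")
    failed = [n for n, r in by_node.items() if not r["success"]]
    if failed:
        raise RuntimeError(f"training failed on nodes: {sorted(failed)}")
    return {n: r["params_url"] for n, r in by_node.items()}
```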

The experiment waits until a response has been received from every node to which a training command was sent. After getting each reply, it downloads the model parameters indicated in the training reply. It then aggregates the model parameters using the given federated aggregator and saves the result. This process is repeated until every round is completed.
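FedAverage itself boils down to a weighted mean of the local parameters, typically weighted by each node's number of samples. A minimal sketch over plain Python lists (the real aggregator operates on model weight tensors):

```python
def weighted_fed_average(local_params, weights):
    """Weighted mean of each parameter across nodes (FedAvg-style)."""
    total = sum(weights)
    aggregated = []
    # zip(*...) pairs up the i-th parameter from every node
    for values in zip(*local_params):
        aggregated.append(sum(w * v for w, v in zip(weights, values)) / total)
    return aggregated
```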