Loading Dataset for Training

Datasets in the nodes are saved on the disk. Therefore, before the training, each node should load these datasets from the file system. Since the type of datasets (image, tabular, etc.) and the way of loading might vary from one to another, the user (researcher) should define a method called training_data. The method training_data is mandatory for each training plan (TrochTrainingPlan and SkLearnSGDModel). If it is not defined nodes will return an error at the very beginning of the first round.

Defining The Method Training Data

The method training_data is the method that should be defined while creating the training plan class. This method can take one input argument which is the batch_size. This argument represents the batch size that is going to be used for the data loader. Although this argument is not mandatory for training_data, if it exists, each node will pass the value batch_size defined in the training_args while calling the training_data method.

from fedbiomed.common.training_plans import TorchTrainingPlan

class MyTrainingPlan(TorchTrainingPlan):
    def __init__(self):
        pass
        # ....
    def training_data(self, batch_size):
        pass

You can also remove the batch_size argument and define it inside the method. In this case, the batch_sizedefined in the training arguments will no longer have an impact.

def training_data(self):
    batch_size = 48
    pass

training_data can have a maximum of one argument which is batch_size. Since training_data is executed by the node and the nodes are aware of only one argument, adding extra argument to this method will raise errors on the node side.

Reading Datafiles

As mentioned before, a node can store IMAGE and CSV dataset types. In this case, the way of loading these datasets might be different. The only information necessary for loading will be the file path where data files are stored. This path is accessible through self.dataset_path. If the dataset is a tabular CSV dataset self.dataset_path will address to a file. Otherwise, if it is an IMAGE dataset, self.dataset_path will address to a directory where all the images are stored. Therefore, before loading the dataset it is important to know what type of dataset is going to be loaded. It is possible to send list request to the nodes to get meta-data of the dataset there are deployed.

The following snippet shows an example of loading operation for a dataset of CSV type.

import pandas as pd
from fedbiomed.common.training_plans import TorchTrainingPlan
from fedbiomed.common.data import DataManager

class MyTrainingPlan(TorchTrainingPlan):
    def init_model(self):
        # ....
    def init_dependencies(self):
        # ....
    def init_optimizer(self):
        # ....

    def training_data(self, batch_size):
        dataset = pd.read_csv(self.dataset_path, header=None, delimiter=',')
        X = dataset.iloc[:,0:15].values
        y = dataset.iloc[:,15]
        return DataManager(dataset=X, target=y.values, batch_size=batch_size)

It is also possible to use some model arguments in the training data method. For example, if the following model argument is passed to the model by the experiment.

model_args = {
    'feature_cols' : 15
}

training_data can be configured as follows:

import pandas as pd
from fedbiomed.common.training_plans import TorchTrainingPlan
from fedbiomed.common.data import DataManager

class MyTrainingPlan(TorchTrainingPlan):
    def init_model(self):
        # ....
    def init_dependencies(self):
        # ....
    def init_optimizer(self):
        # ....

    def training_data(self, batch_size):
        feature_cols = self.model_args()["feature_cols"]
        dataset = pd.read_csv(self.dataset_path, header=None, delimiter=',')
        X = dataset.iloc[:,0:feature_cols].values
        y = dataset.iloc[:,feature_cols]
        return DataManager(dataset=X, target=y.values, batch_size=batch_size)

What `training_data` Should Return?

The method training_data should always return DataManager of Fed-BioMed defined in the module fedbiomed.common.data.DataManager. DataManager has been designed for managing different types of data objects for different types of training plans. It is also responsible for splitting a given dataset into training and validation if model validation is activated in the experiment.

DataManager takes two main input arguments as dataset and target. dataset should be an instance of one of PyTorch Dataset, Numpy ndarray, pd.DataFrame or pd.Series. The argument target should be an instance of one of Numpy ndarray, pd.DataFrame or pd.Series. By default, the argument target is None. If target is None the data manager considers that the dataset is an object that includes both input and target variables. This is the case where the dataset is an instance of the PyTorch dataset. If dataset is an instance of Numpy Array or Pandas DataFrame, it is mandatory to provide the target variable.

As it is mentioned, DataManager is capable of managing/configuring datasets/data-loaders based on the training plans that are going to be used for training. This configuration is necessary since each training plan requires different types of data loader/batch iterator.

Defining Training Data in Different Training Plans

Defining Training Data for PyTorch Based Training Plans

In the following code snippet, training_data of PyTorch-based training plan returns a DataManager object instantiated with dataset and target as pd.Series. Since PyTorch-based training requires a PyTorch DataLoader, DataManager converts pd.Series to a proper torch.utils.data.Dataset object and create a PyTorch DataLoader to pass it to the training loop on the node side.

import pandas as pd
from fedbiomed.common.training_plans import TorchTrainingPlan
from fedbiomed.common.data import DataManager

class MyTrainingPlan(TorchTrainingPlan):
    def init_model(self):
        # ....
    def init_dependencies(self):
        # ....
    def init_optimizer(self):
        # ....

    def training_data(self, batch_size):
        feature_cols = self.model_args()["feature_cols"]
        dataset = pd.read_csv(self.dataset_path, header=None, delimiter=',')
        X = dataset.iloc[:,0:feature_cols].values
        y = dataset.iloc[:,feature_cols]
        return DataManager(dataset=X, target=y.values, batch_size=batch_size)

It is also possible to define a custom PyTorch Dataset and use it in the DataManager without declaring the argument target.

import pandas as pd
from torch.utils.data import Dataset
from fedbiomed.common.training_plans import TorchTrainingPlan
from fedbiomed.common.data import DataManager

class MyTrainingPlan(TorchTrainingPlan):

    class CSVDataset(Dataset):
        """ Cusotm PyTorch Dataset """
        def __init__(self, dataset_path, features):
            self.input_file = pd.read_csv(dataset_path,sep=',',index_col=False)
            x_train = self.input_file.iloc[:,0:features].values
            y_train = self.input_file.iloc[:,features].values
            self.X_train = torch.from_numpy(x_train).float()
            self.Y_train = torch.from_numpy(y_train).float()

        def __len__(self):            
            return len(self.Y_train)

        def __getitem__(self, idx):
            return self.X_train[idx], self.Y_train[idx]

    def training_data(self, batch_size): 
        feature_cols = self.model_args()["feature_cols"]    
        dataset = self.CSVDataset(self.dataset_path, feature_cols)
        loader_kwargs = {'batch_size': batch_size, 'shuffle': True}
        return DataManager(dataset=dataset, **loader_kwargs)

In the code snippet above, loader_kwargs contains the arguments that are going to be used while creating a PyTorch DataLoader.

Defining Training Data for SkLearn Based Training Plans

The operations in the training_data for SkLearn based training plans are not much different thanTorchTrainingPlan. Currently, SkLearn based training plans do not require a data loader for training. This means that all samples will be used for fitting the model. That's why passing **loader_args does not make sense for SkLearn based training plans. These arguments will be ignored even if they are set.

import pandas as pd
from fedbiomed.common.training_plans import FedPerceptron
from fedbiomed.common.data import DataManager

class SGDRegressorTrainingPlan(FedPerceptron):

    def training_data(self, batch_size):
        num_cols = self.model_args()["number_cols"]
        dataset = pd.read_csv(self.dataset_path, header=None, delimiter=',')
        X = dataset.iloc[:,0:num_cols].values
        y = dataset.iloc[:,num_cols]
        return DataManager(dataset=X, target=y.values, batch_size)

Preprocessing for Data

Since the method training_data is defined by the user, it is possible to do preprocessing before creating the DataManager object. In the code snippet below, a preprocess for normalization is shown for the dataset MNIST.

def training_data(self, batch_size = 48):
    # Custom torch Dataloader for MNIST data
    transform = transforms.Compose([transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,))])
    dataset_mnist = datasets.MNIST(self.dataset_path, train=True, download=False, transform=transform)
    train_kwargs = {'batch_size': batch_size, 'shuffle': True}
    return DataManager(dataset=dataset_mnist, **train_kwargs)

Training and validation partitions are created on the node side using returned DataManager object. Therefore, preprocessing in training_data will be applied for both validation and training data.

Data Loaders

A DataLoader is a class that takes care of handling the logic of iterating over a certain dataset. Thus, while a Dataset is concerned with loading and preprocessing samples one-by-one, the DataLoader is responsible for:

calling the dataset's __getitem__ method when needed
collating samples in a batch
shuffling the data at every epoch
in general, managing the actions related to iterating over a certain dataset

Passing arguments to Data Loaders

Any keyword argument, except dataset and target, that was provided to the DataManager constructor in the training_data function will be passed as a keyword argument to the DataLoader. For PyTorch and scikit-learn experiments, the DataLoaders have been heavily inspired by the torch.utils.data.DataLoader class, so please refer to that documentation for the meaning of the supported keyword arguments. For example:

class MyTrainingPlan(SKLearnTrainingPlan):
    def training_data(self, batch_size):
        return DataManager(dataset=X, target=Y, 
                           # The following arguments will be passed to the DataLoader:
                           batch_size=batch_size, shuffle=True, drop_last=False)

The special case of batch size

As you may have noticed below, batch_size represents a special case since it can be provided as argument to the training_data function.

Recommended approach

For batch_size, we recommend that you always include it is argument to the training_data function, and forward that value as keyworkd argument to the DataManager, as such:

class MyTrainingPlan(SKLearnTrainingPlan):
    def training_data(self, batch_size):
        return DataManager(dataset=X, target=Y, batch_size=batch_size)

It is not useful to set a default value for the batch_size argument, as it will be ignored in favour of Fed-BioMed's internal default value set in the TrainingArgs class.

Conclusion

training_data should be provided in each training plan. The way it is defined is almost the same for each framework's training plan as long as the structure of the datasets is simple. Since the method is defined by users, it provides flexibility to load and pre-process complex datasets distributed on the nodes. However, this method will be executed on the node side. Therefore, typos and lack of arguments may cause errors in the nodes even if it does not create any errors on the researcher side.