The training_data Method in the Training Plan

The training_data method of the training plan is needed to process the dataset, and it has to be defined in every model class (training plan class). Fed-BioMed trains models on different nodes, each of which stores its own dataset: the training_data method manages the loading process before or during training. Therefore, training_data must be defined based on both the dataset that is going to be used for training and the training model (e.g., whether you intend to use a supervised or an unsupervised learning algorithm). In addition, training_data has to account for the framework used, since each framework has its own way of loading data: for instance, training_data must return a DataLoader object for PyTorch, while it might return Pandas series for Scikit-Learn.

The input arguments can also vary based on the framework, so you should define the training_args correctly when creating an experiment. For instance, you might need to define a batch size for a neural network, whereas you may not need this argument when training a regression model with Scikit-Learn (an illustrative sketch of such arguments follows the example below). The method can also change according to the kind of dataset you are working on. For example, suppose you are working on a CSV dataset. Then training_data should first load your CSV data, next select the feature variables and the target variable, and finally return the selected features and target. The following training_data is an example of that scenario.

    def training_data(self):
        NUMBER_COLS = 5  # number of feature columns in the CSV file
        # self.dataset_path is set by the node before training (see the note below)
        dataset = pd.read_csv(self.dataset_path, header=None, delimiter=',')
        X = dataset.iloc[:, 0:NUMBER_COLS].values   # feature columns
        y = dataset.iloc[:, NUMBER_COLS]            # target column
        return (X, y.values)
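
As mentioned above, the training arguments depend on the framework. The sketch below is illustrative only; the exact keys accepted depend on your Fed-BioMed version and model.

    # Hypothetical training arguments, for illustration only
    # (the accepted keys depend on your Fed-BioMed version):
    # a PyTorch neural network typically needs a batch size ...
    torch_training_args = {'batch_size': 48, 'lr': 1e-3, 'epochs': 1}
    # ... while a Scikit-Learn regression model may not
    sklearn_training_args = {'epochs': 1}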

Let's say that you want to pass NUMBER_COLS as a model argument. Then you can create your model class and training_data method as shown below.

Note: In such a case, you should pass number_cols as a model argument while initializing the experiment class (see the sketch after the class definition).

    class SGDRegressorTrainingPlan(SGDSkLearnModel):
        def __init__(self, model_args: dict = {}):
            super(SGDRegressorTrainingPlan, self).__init__(model_args)
            # number_cols is provided through model_args by the experiment
            self.number_cols = model_args['number_cols']
            self.add_dependency(["from sklearn.linear_model import SGDRegressor"])

        def training_data(self):
            dataset = pd.read_csv(self.dataset_path, header=None, delimiter=',')
            X = dataset.iloc[:, 0:self.number_cols].values
            y = dataset.iloc[:, self.number_cols]
            return (X, y.values)
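
As noted above, number_cols must then be supplied through model_args when the experiment is initialized. The following is a minimal sketch: the tags, the number of rounds, and the exact Experiment signature are assumptions that may vary across Fed-BioMed versions.

    from fedbiomed.researcher.experiment import Experiment
    from fedbiomed.researcher.aggregators.fedavg import FedAverage

    model_args = {'number_cols': 5}    # consumed in __init__ above
    training_args = {'epochs': 1}      # hypothetical, see the earlier sketch

    exp = Experiment(tags=['#csv-dataset'],  # hypothetical dataset tags
                     model_class='SGDRegressorTrainingPlan',
                     model_args=model_args,
                     training_args=training_args,
                     rounds=3,
                     aggregator=FedAverage())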

Note: self.dataset_path in the training_data method is set by the set_dataset method, which comes from SGDSkLearnModel and TorchTrainingPlan. This is done by the node after it receives the training command, which includes the dataset path, at every round of training.
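
Outside federated training, you can mimic what the node does in order to test training_data locally. This is only a sketch: the CSV path is hypothetical, and it assumes the parent class accepts the model_args shown above.

    # Local sanity check only: in federated training the node calls
    # set_dataset with the real path; here the attribute is set directly.
    plan = SGDRegressorTrainingPlan({'number_cols': 5})
    plan.dataset_path = '/path/to/a/local.csv'   # hypothetical path
    X, y = plan.training_data()
    print(X.shape, y.shape)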

For PyTorch models, things are different: you may need to use PyTorch's DataLoader to load your dataset, which means creating a custom PyTorch Dataset in your model class. Let's assume that you are going to use a CSV dataset to train your model. The following example shows how you can create your custom dataset and training_data.

    class MyTrainingPlan(TorchTrainingPlan):
        def __init__(self, model_args: dict = {}):
            super(MyTrainingPlan, self).__init__(model_args)
            # model_args should match the model arguments to be passed below to the experiment class
            self.in_features = model_args['in_features']
            self.out_features = model_args['out_features']

            # Network layers
            # .....

        # Other methods
        # .....

        class csv_Dataset(Dataset):
            # Here we define a custom Dataset class inherited from the general torch Dataset class
            # This class takes as argument a .csv file path and creates a torch Dataset
            def __init__(self, dataset_path, x_dim):
                self.input_file = pd.read_csv(dataset_path, sep=';', index_col=False)
                x_train = self.input_file.iloc[:, :x_dim].values
                y_train = self.input_file.iloc[:, -1].values
                self.X_train = torch.from_numpy(x_train).float()
                self.Y_train = torch.from_numpy(y_train).float()

            def __len__(self):
                return len(self.Y_train)

            def __getitem__(self, idx):
                return (self.X_train[idx], self.Y_train[idx])

        def training_data(self, batch_size=48):
            # training_data creates the DataLoader to be used for training in the general class TorchTrainingPlan of Fed-BioMed
            dataset = self.csv_Dataset(self.dataset_path, self.in_features)
            train_kwargs = {'batch_size': batch_size, 'shuffle': True}
            data_loader = DataLoader(dataset, **train_kwargs)
            return data_loader

As the code snippet above shows, you first define a custom dataset class and then create your DataLoader from that class.
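
To inspect what the training loop will receive, you could iterate over the returned DataLoader yourself. The sketch below is for local inspection only and uses hypothetical values; during federated training, TorchTrainingPlan consumes the DataLoader for you.

    # Hypothetical local check of training_data's output
    plan = MyTrainingPlan({'in_features': 15, 'out_features': 1})
    plan.dataset_path = '/path/to/a/local.csv'   # set by the node in practice
    loader = plan.training_data(batch_size=48)
    inputs, targets = next(iter(loader))
    print(inputs.shape, targets.shape)   # e.g. torch.Size([48, 15]) torch.Size([48])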

Conclusions

In this article, the training_data method was explained based on different requirements. This method is important because it is where the dataset gets loaded for training: typos or missing arguments may raise errors or degrade the model's performance. In the future, as support for new frameworks is added to Fed-BioMed, the way this method is defined may gain extra features.