
PyTorch model training using a GPU

Introduction

This example demonstrates how to use an NVIDIA GPU to train a model.

The nodes for this example need to run on a machine that provides an NVIDIA GPU with enough GPU memory. The GPU must also be recent enough to be supported by PyTorch.

If the GPU doesn't have enough memory, you will get an out-of-memory error at run time.

You can check the Fed-BioMed GPU documentation for some background about using GPUs with Fed-BioMed.
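
Before starting a node with GPU support, it can help to check what PyTorch sees on that machine. The following is a minimal sketch and not part of Fed-BioMed: the helper name `describe_gpus` is hypothetical, and the code relies only on standard `torch.cuda` query functions. It returns an empty list when PyTorch is not installed or when no CUDA device is usable.

```python
try:
    import torch
except ImportError:  # PyTorch is not installed on this machine
    torch = None


def describe_gpus():
    """Return one (index, name, total_memory_GiB) tuple per CUDA device visible to PyTorch.

    Hypothetical helper for illustration; not part of Fed-BioMed.
    """
    if torch is None or not torch.cuda.is_available():
        return []
    gpus = []
    for index in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(index)
        # total_memory is reported in bytes; convert to GiB for readability
        gpus.append((index, props.name, props.total_memory / 2**30))
    return gpus


for index, name, mem_gib in describe_gpus():
    print(f"cuda:{index}: {name}, {mem_gib:.1f} GiB")
```

If nothing is printed, a node started on this machine has no GPU to offer, whatever options it is launched with.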

Start the network

Before running this notebook, start the network with ./scripts/fedbiomed_run network

Set up the nodes

We need at least 1 node; let's test using 3 nodes.

  1. For each node, add the MNIST dataset:

    ./scripts/fedbiomed_run node config config1.ini add
    ./scripts/fedbiomed_run node config config2.ini add
    ./scripts/fedbiomed_run node config config3.ini add
    
    • Select option 2 (default) to add MNIST to the node
    • Confirm default tags by hitting "y" and ENTER
    • Pick the folder where MNIST is already downloaded (or where to download MNIST)
  2. Check that your data has been added by executing

    ./scripts/fedbiomed_run node config config1.ini list
    ./scripts/fedbiomed_run node config config2.ini list
    ./scripts/fedbiomed_run node config config3.ini list
    
  3. Run the first node using

    ./scripts/fedbiomed_run node config config1.ini run --gpu
    

    so that the node offers to use GPU for training, with the default GPU device.

  4. Run the second node using

    ./scripts/fedbiomed_run node config config2.ini run --gpu-only --gpunum 1
    

    so that the node enforces use of the GPU for training even if the researcher doesn't request it. The node requests the 2nd GPU (device 1) but will fall back to the default device if this machine doesn't have 2 GPUs.

  5. Run the third node using

    ./scripts/fedbiomed_run node config config3.ini run
    

    so that the node doesn't offer to use GPU for training (default behaviour).

  6. Wait until each node displays Starting task manager, which means the node is online.
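
The way the node options and the researcher's request combine can be sketched as a small decision function. This is an illustration of the rules described in the steps above, not Fed-BioMed's actual implementation: the function name `choose_device` and its parameters are hypothetical, and falling back to CPU when the machine has no GPU at all is an assumption.

```python
def choose_device(node_gpu, node_gpu_only, node_gpu_num,
                  researcher_use_gpu, cuda_device_count):
    """Return the device string a node would train on under the rules above.

    Hypothetical helper for illustration; not Fed-BioMed's actual implementation.
    """
    # The GPU is used if the node enforces it (--gpu-only), or if the node
    # offers it (--gpu) and the researcher asks for it (use_gpu=True)
    wants_gpu = node_gpu_only or (node_gpu and researcher_use_gpu)
    # Assumption: with no GPU on the machine, training falls back to CPU
    if not wants_gpu or cuda_device_count == 0:
        return "cpu"
    # Use the requested GPU number (--gpunum) if it exists, else the default device
    if node_gpu_num is not None and 0 <= node_gpu_num < cuda_device_count:
        return f"cuda:{node_gpu_num}"
    return "cuda"


# The three nodes of this example on a machine with a single GPU,
# when the researcher sets use_gpu=True
print(choose_device(True, False, None, True, 1))    # node 1 (--gpu): cuda
print(choose_device(False, True, 1, True, 1))       # node 2 (--gpu-only --gpunum 1): cuda
print(choose_device(False, False, None, True, 1))   # node 3 (no GPU option): cpu
```

On a machine with two GPUs, the second call would return `cuda:1` instead of the default device.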

Define the experiment model

This part is the same as when running a model on a CPU: the model is unchanged.

Declare a torch.nn MyTrainingPlan class to send for training on the node.

from fedbiomed.researcher.environ import environ
import tempfile
tmp_dir_model = tempfile.TemporaryDirectory(dir=environ['TMP_DIR']+'/')
model_file = tmp_dir_model.name + '/class_export_mnist.py'
%%writefile "$model_file"

import torch
import torch.nn as nn
import torch.nn.functional as F
from fedbiomed.common.torchnn import TorchTrainingPlan
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Here we define the model to be used.
# You can use any class name (here 'MyTrainingPlan')
class MyTrainingPlan(TorchTrainingPlan):
    def __init__(self, model_args: dict = {}):
        super(MyTrainingPlan, self).__init__(model_args)
        self.conv1 = nn.Conv2d(1, 32, 3, 1)
        self.conv2 = nn.Conv2d(32, 64, 3, 1)
        self.dropout1 = nn.Dropout(0.25)
        self.dropout2 = nn.Dropout(0.5)
        self.fc1 = nn.Linear(9216, 128)
        self.fc2 = nn.Linear(128, 10)
        
        # Here we define the custom dependencies that will be needed by our custom Dataloader
        # In this case, we need the torch DataLoader classes
        # Since we will train on MNIST, we need datasets and transform from torchvision
        deps = ["from torchvision import datasets, transforms",
               "from torch.utils.data import DataLoader"]
        self.add_dependency(deps)

    def forward(self, x):
        x = self.conv1(x)
        x = F.relu(x)
        x = self.conv2(x)
        x = F.relu(x)
        x = F.max_pool2d(x, 2)
        x = self.dropout1(x)
        x = torch.flatten(x, 1)
        x = self.fc1(x)
        x = F.relu(x)
        x = self.dropout2(x)
        x = self.fc2(x)
        
        
        output = F.log_softmax(x, dim=1)
        return output

    def training_data(self, batch_size = 48):
        # Custom torch Dataloader for MNIST data
        transform = transforms.Compose([transforms.ToTensor(),
        transforms.Normalize((0.1307,), (0.3081,))])
        dataset1 = datasets.MNIST(self.dataset_path, train=True, download=False, transform=transform)
        train_kwargs = {'batch_size': batch_size, 'shuffle': True}
        data_loader = torch.utils.data.DataLoader(dataset1, **train_kwargs)
        return data_loader
    
    def training_step(self, data, target):
        output = self.forward(data)
        loss   = torch.nn.functional.nll_loss(output, target)
        return loss
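
The input size 9216 of fc1 follows from the shapes produced by the layers before it. The short sketch below recomputes it, assuming the standard 28 x 28 single-channel MNIST images and the default padding of 0 for the convolutions. The helper name `conv_out` is hypothetical.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output width (or height) of a convolution or pooling layer, floor mode."""
    return (size + 2 * padding - kernel) // stride + 1


side = 28                                   # MNIST images are 28 x 28 pixels
side = conv_out(side, kernel=3)             # conv1: 3x3 kernel, stride 1 -> 26
side = conv_out(side, kernel=3)             # conv2: 3x3 kernel, stride 1 -> 24
side = conv_out(side, kernel=2, stride=2)   # max_pool2d(x, 2): stride equals kernel size -> 12
channels = 64                               # output channels of conv2
flat_features = channels * side * side      # size of the tensor after torch.flatten(x, 1)
print(flat_features)  # 9216
```

If you change the kernel sizes or the number of channels, the first argument of fc1 has to be updated accordingly.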

Define the experiment parameters

The researcher uses model_args to request that the nodes use a GPU for training. The request takes effect only if the node has a GPU and offers to use it.

model_args = {
    # Model wants to use GPU (or not) if available on node and proposed by node
    'use_gpu': True
}

training_args = {
    'batch_size': 48, 
    'lr': 1e-3, 
    'epochs': 1, 
    'dry_run': False,  
    'batch_maxnum': 100 # Fast pass for development : only use ( batch_maxnum * batch_size ) samples
}
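
The effect of batch_maxnum can be checked with a little arithmetic. The sketch below is illustrative only: `samples_used` is a hypothetical helper, and the dataset size of 60000 assumes that a node holds the full MNIST training set.

```python
def samples_used(batch_size, batch_maxnum, dataset_size):
    """Upper bound on the samples seen when training stops after batch_maxnum batches."""
    return min(batch_maxnum * batch_size, dataset_size)


print(samples_used(batch_size=48, batch_maxnum=100, dataset_size=60000))  # 4800
```

With the values above, training touches at most 4800 of the 60000 images, which keeps a development run short.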

Declare and run the experiment

This part is the same as when running a model on a CPU: experiment declaration and running are unchanged.

from fedbiomed.researcher.experiment import Experiment
from fedbiomed.researcher.aggregators.fedavg import FedAverage

tags =  ['#MNIST', '#dataset']
rounds = 2

exp = Experiment(tags=tags,
                 #nodes=None,
                 model_path=model_file,
                 model_args=model_args,
                 model_class='MyTrainingPlan',
                 training_args=training_args,
                 round_limit=rounds,
                 aggregator=FedAverage(),
                 node_selection_strategy=None)

Let's start the experiment.

By default, this function doesn't stop until all round_limit rounds are done for all the nodes.

exp.run()
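
At the end of each round, the aggregator combines the model parameters returned by the nodes, whether a node trained on GPU or on CPU. As a rough illustration of the idea only, here is a generic federated averaging sketch on plain Python lists. The function name and the weighting by sample count are assumptions, and this is not Fed-BioMed's FedAverage implementation.

```python
def federated_average(node_params, node_sample_counts):
    """Average per-node parameter vectors, weighted by each node's sample count."""
    total = sum(node_sample_counts)
    averaged = [0.0] * len(node_params[0])
    for params, count in zip(node_params, node_sample_counts):
        weight = count / total  # share of all training samples held by this node
        for i, value in enumerate(params):
            averaged[i] += weight * value
    return averaged


# Three nodes, each returning a toy 2-parameter "model"
print(federated_average([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]], [100, 100, 200]))  # [3.5, 4.5]
```

The third node holds half of the samples, so its parameters count for half of the average.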

You have completed training a TorchTrainingPlan using a GPU for acceleration.

Address:

2004 Rte des Lucioles, 06902 Sophia Antipolis

E-mail:

fedbiomed _at_ inria _dot_ fr

Fed-BioMed © 2021