Fed-BioMed to train a federated SGD regressor model

Data

This tutorial shows how to deploy a scikit-learn model in Fed-BioMed to solve a federated regression problem.

We use the Fed-BioMed wrapper for the scikit-learn SGD regressor. The goal of the notebook is to train a model on a realistic dataset of synthetic medical information mimicking the ADNI dataset.

Creating nodes

To proceed with the tutorial, we create 3 clients with corresponding dataframes of clinical information in .csv format. Each client has 300 data points composed of several features corresponding to clinical and medical imaging information. The data is entirely synthetic and randomly sampled to mimic the variability of the real ADNI dataset. The training partitions are available at the following link:

https://drive.google.com/file/d/1R39Ir60oQi8ZnmHoPz5CoGCrVIglcO9l/view?usp=sharing

The federated task we aim to solve is to predict a clinical variable (the mini-mental state examination, MMSE) from a combination of demographic and imaging features. The regressor variables are the following features:

['SEX', 'AGE', 'PTEDUCAT', 'WholeBrain.bl', 'Ventricles.bl', 'Hippocampus.bl', 'MidTemp.bl', 'Entorhinal.bl']

and the target variable is:

['MMSE.bl']
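As a minimal sketch of how these columns map to a design matrix and a target vector, the snippet below builds a tiny made-up table with the same column names and splits it with pandas. The values and the helper name split_features_target are illustrative assumptions and are not taken from the real partitions.

```python
import pandas as pd

REGRESSORS = ['SEX', 'AGE', 'PTEDUCAT', 'WholeBrain.bl', 'Ventricles.bl',
              'Hippocampus.bl', 'MidTemp.bl', 'Entorhinal.bl']
TARGET = ['MMSE.bl']


def split_features_target(df):
    """Return (X, y) as NumPy arrays: X is (n_samples, 8), y is (n_samples,)."""
    X = df[REGRESSORS].values
    y = df[TARGET].values.ravel()
    return X, y


# Two made-up rows, only to show the expected shapes (not real ADNI values).
toy = pd.DataFrame({
    'SEX': [0, 1], 'AGE': [70.0, 75.5], 'PTEDUCAT': [16, 12],
    'WholeBrain.bl': [0.70, 0.68], 'Ventricles.bl': [0.02, 0.03],
    'Hippocampus.bl': [0.005, 0.004], 'MidTemp.bl': [0.013, 0.012],
    'Entorhinal.bl': [0.002, 0.002], 'MMSE.bl': [29, 26],
})
X, y = split_features_target(toy)
print(X.shape, y.shape)
```

Flattening the target to one dimension matches what the training plan later returns from training_data.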

To create the federated dataset, we follow the standard Fed-BioMed procedure for creating and populating nodes. After activating the Fed-BioMed network with the commands

$ source ./scripts/fedbiomed_environment network

and

$ ./scripts/fedbiomed_run network

we create a first node with the commands

$ source ./scripts/fedbiomed_environment node

$ ./scripts/fedbiomed_run node start

We then populate the node with the data of the first client:

$ ./scripts/fedbiomed_run node add

We select option 1 (csv) and pick the .csv partition of client 1. We use adni as the tag for the selected dataset. We can then check that the data has been added by executing ./scripts/fedbiomed_run node list.

Following the same procedure, we create the other two nodes with the datasets of client 2 and client 3 respectively.

Fed-BioMed Researcher

We are now ready to start the researcher environment with the following command, which activates the researcher environment and starts Jupyter Notebook.

$ ./scripts/fedbiomed_run researcher

We can first query the network for the adni dataset. In this case, the nodes are sharing the respective partitions using the same tag adni:

In [1]:

%load_ext autoreload
%autoreload 2

In [2]:

from fedbiomed.researcher.requests import Requests
req = Requests()
req.list(verbose=True)

2022-01-10 15:03:20,966 fedbiomed INFO - Component environment:
2022-01-10 15:03:20,967 fedbiomed INFO - - type = ComponentType.RESEARCHER
2022-01-10 15:03:21,298 fedbiomed INFO - Messaging researcher_a7319768-5c08-43f6-a819-9a487cb1cc02 successfully connected to the message broker, object = <fedbiomed.common.messaging.Messaging object at 0x103a311c0>
2022-01-10 15:03:21,318 fedbiomed INFO - Listing available datasets in all nodes... 
2022-01-10 15:03:21,326 fedbiomed INFO - log from: node_c27b3141-213a-4221-9dcf-7e885a30738b / DEBUG - Message received: {'researcher_id': 'researcher_a7319768-5c08-43f6-a819-9a487cb1cc02', 'command': 'list'}
2022-01-10 15:03:21,327 fedbiomed INFO - log from: node_ff1ad308-d26a-4c73-8ffc-87df03618014 / DEBUG - Message received: {'researcher_id': 'researcher_a7319768-5c08-43f6-a819-9a487cb1cc02', 'command': 'list'}
2022-01-10 15:03:21,328 fedbiomed INFO - log from: node_0f5cb1d3-621f-45b9-9f45-4a38758e5ebf / DEBUG - Message received: {'researcher_id': 'researcher_a7319768-5c08-43f6-a819-9a487cb1cc02', 'command': 'list'}
2022-01-10 15:03:31,331 fedbiomed INFO - 
 Node: node_c27b3141-213a-4221-9dcf-7e885a30738b | Number of Datasets: 1 
+--------+-------------+----------+---------------+-----------+
| name   | data_type   | tags     | description   | shape     |
+========+=============+==========+===============+===========+
| adni   | csv         | ['adni'] | bla           | [300, 20] |
+--------+-------------+----------+---------------+-----------+

2022-01-10 15:03:31,332 fedbiomed INFO - 
 Node: node_ff1ad308-d26a-4c73-8ffc-87df03618014 | Number of Datasets: 1 
+--------+-------------+----------+---------------+-----------+
| name   | data_type   | tags     | description   | shape     |
+========+=============+==========+===============+===========+
| adni   | csv         | ['adni'] | bla           | [300, 20] |
+--------+-------------+----------+---------------+-----------+

2022-01-10 15:03:31,333 fedbiomed INFO - 
 Node: node_0f5cb1d3-621f-45b9-9f45-4a38758e5ebf | Number of Datasets: 1 
+--------+-------------+----------+---------------+-----------+
| name   | data_type   | tags     | description   | shape     |
+========+=============+==========+===============+===========+
| adni   | csv         | ['adni'] | bla           | [300, 20] |
+--------+-------------+----------+---------------+-----------+

Out[2]:

{'node_c27b3141-213a-4221-9dcf-7e885a30738b': [{'name': 'adni',
   'data_type': 'csv',
   'tags': ['adni'],
   'description': 'bla',
   'shape': [300, 20]}],
 'node_ff1ad308-d26a-4c73-8ffc-87df03618014': [{'name': 'adni',
   'data_type': 'csv',
   'tags': ['adni'],
   'description': 'bla',
   'shape': [300, 20]}],
 'node_0f5cb1d3-621f-45b9-9f45-4a38758e5ebf': [{'name': 'adni',
   'data_type': 'csv',
   'tags': ['adni'],
   'description': 'bla',
   'shape': [300, 20]}]}
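The returned dictionary can be post-processed with plain Python. The sketch below assumes the structure shown in Out[2] (node id mapped to a list of dataset descriptions) and counts the rows available under a given tag. The helper name samples_per_node is hypothetical and not part of Fed-BioMed.

```python
# Dictionary with the same structure as Out[2] above (node id -> list of datasets).
datasets = {
    'node_c27b3141-213a-4221-9dcf-7e885a30738b': [
        {'name': 'adni', 'data_type': 'csv', 'tags': ['adni'],
         'description': 'bla', 'shape': [300, 20]}],
    'node_ff1ad308-d26a-4c73-8ffc-87df03618014': [
        {'name': 'adni', 'data_type': 'csv', 'tags': ['adni'],
         'description': 'bla', 'shape': [300, 20]}],
    'node_0f5cb1d3-621f-45b9-9f45-4a38758e5ebf': [
        {'name': 'adni', 'data_type': 'csv', 'tags': ['adni'],
         'description': 'bla', 'shape': [300, 20]}],
}


def samples_per_node(listing, tag):
    """Map each node id to the number of rows of its datasets carrying `tag`."""
    return {
        node_id: sum(d['shape'][0] for d in node_datasets if tag in d['tags'])
        for node_id, node_datasets in listing.items()
    }


counts = samples_per_node(datasets, 'adni')
total = sum(counts.values())
print(len(counts), total)
```

With the three nodes listed above, this reports 3 nodes and 900 rows in total, which is consistent with 300 data points per client.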

In [3]:

import numpy as np
from fedbiomed.researcher.environ import environ
import tempfile
tmp_dir_model = tempfile.TemporaryDirectory(dir=environ['TMP_DIR']+'/')
model_file = tmp_dir_model.name + '/fedbiosklearn.py'

Create an experiment to train a model on the data found

The training plan wrapping the sklearn SGDRegressor, together with its data loading code, can now be deployed in Fed-BioMed. We import the SGDSkLearnModel class from fedbiomed and subclass it, defining two methods:

__init__ : here we add the sklearn dependencies needed by the model.

training_data : here you must return the (X, y) pair, whose types must match those accepted by the partial_fit method of the model.

Note that this training plan applies the same standardization on every node: features are centered and scaled with the same fixed parameters (scaling_mean and scaling_sd).

In [ ]:

%%writefile "$model_file"

from fedbiomed.common.fedbiosklearn import SGDSkLearnModel
from sklearn.linear_model import SGDRegressor


class SGDRegressorTrainingPlan(SGDSkLearnModel):
    def __init__(self, kwargs):
        super(SGDRegressorTrainingPlan, self).__init__(kwargs)
        self.add_dependency(["from sklearn.linear_model import SGDRegressor"])
    
    def training_data(self):
        # Load the local .csv partition registered on the node
        dataset = pd.read_csv(self.dataset_path,delimiter=',')
        regressors_col = ['SEX', 'AGE', 'PTEDUCAT', 'WholeBrain.bl', 'Ventricles.bl', 'Hippocampus.bl', 'MidTemp.bl', 'Entorhinal.bl']
        target_col = ['MMSE.bl']
        
        # Fixed standardization parameters, identical on every node
        scaling_mean = np.array([0.8, 72.3, 16.2, 0.7, 0.0, 0.0, 0.0, 0.0])
        scaling_sd = np.array([3.5e-01, 7.3e+00, 2.7e+00, 5.0e-02, 1.1e-02, 1.0e-03, 2.0e-03, 1.0e-03])
        
        X = (dataset[regressors_col].values-scaling_mean)/scaling_sd
        y = dataset[target_col]
        # Return the target as a 1-D array
        return (X,y.values.ravel())
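The effect of the shared standardization in training_data can be illustrated outside Fed-BioMed. In the sketch below, scaling_mean and scaling_sd are copied from the training plan, while the records and the helper name standardize are made up for illustration. Because the parameters are fixed, a given record is mapped to the same standardized features whatever other data a node holds.

```python
import numpy as np

# Fixed parameters shared by all nodes (values copied from the training plan).
scaling_mean = np.array([0.8, 72.3, 16.2, 0.7, 0.0, 0.0, 0.0, 0.0])
scaling_sd = np.array([3.5e-01, 7.3e+00, 2.7e+00, 5.0e-02,
                       1.1e-02, 1.0e-03, 2.0e-03, 1.0e-03])


def standardize(X):
    """Center and scale each column with the shared, fixed parameters."""
    return (np.asarray(X, dtype=float) - scaling_mean) / scaling_sd


# The same made-up record, seen alone on node A and next to another record on node B.
record = [1.0, 79.6, 16.2, 0.75, 0.011, 0.001, 0.002, 0.001]
other = [0.0, 65.0, 12.0, 0.65, 0.0, 0.0, 0.0, 0.0]
on_node_a = standardize([record])
on_node_b = standardize([record, other])[:1]

# Both nodes produce identical features for the shared record.
print(np.allclose(on_node_a, on_node_b))
```

Standardizing each node with its own local mean and standard deviation would instead give node-dependent features, which makes the averaged coefficients harder to interpret.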

model_args is a dictionary containing your model arguments; in the case of SGDRegressor these are max_iter and tol. n_features is provided to correctly initialize the SGDRegressor coef_ array.

training_args is a dictionary with parameters related to Federated Learning.

In [5]:

model_args = { 'max_iter':1000, 'tol': 1e-3 , 'model': 'SGDRegressor' , 'n_features': 8}

training_args = {
    'epochs': 5, 
}

The experiment can now be defined by providing the adni tag and running the local training on the nodes with the model defined in model_path, the standard aggregator (FedAvg) and the node_selection_strategy (left to None here, with all nodes used). Federated learning is performed over 10 optimization rounds.

In [6]:

from fedbiomed.researcher.experiment import Experiment
from fedbiomed.researcher.aggregators.fedavg import FedAverage

tags =  ['adni']
rounds = 10

# select nodes participating in this experiment
exp = Experiment(tags=tags,
                 model_path=model_file,
                 model_args=model_args,
                 model_class='SGDRegressorTrainingPlan',
                 training_args=training_args,
                 rounds=rounds,
                 aggregator=FedAverage(),
                 node_selection_strategy=None)

2022-01-10 15:03:33,031 fedbiomed INFO - Searching dataset with data tags: ['adni'] for all nodes
2022-01-10 15:03:33,038 fedbiomed INFO - log from: node_c27b3141-213a-4221-9dcf-7e885a30738b / DEBUG - Message received: {'researcher_id': 'researcher_a7319768-5c08-43f6-a819-9a487cb1cc02', 'tags': ['adni'], 'command': 'search'}
2022-01-10 15:03:33,039 fedbiomed INFO - log from: node_0f5cb1d3-621f-45b9-9f45-4a38758e5ebf / DEBUG - Message received: {'researcher_id': 'researcher_a7319768-5c08-43f6-a819-9a487cb1cc02', 'tags': ['adni'], 'command': 'search'}
2022-01-10 15:03:33,043 fedbiomed INFO - log from: node_ff1ad308-d26a-4c73-8ffc-87df03618014 / DEBUG - Message received: {'researcher_id': 'researcher_a7319768-5c08-43f6-a819-9a487cb1cc02', 'tags': ['adni'], 'command': 'search'}
2022-01-10 15:03:43,044 fedbiomed INFO - Node selected for training -> node_c27b3141-213a-4221-9dcf-7e885a30738b
2022-01-10 15:03:43,045 fedbiomed INFO - Node selected for training -> node_0f5cb1d3-621f-45b9-9f45-4a38758e5ebf
2022-01-10 15:03:43,045 fedbiomed INFO - Node selected for training -> node_ff1ad308-d26a-4c73-8ffc-87df03618014
2022-01-10 15:03:43,048 fedbiomed INFO - Checking data quality of federated datasets...
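The experiment above relies on federated averaging to combine the models trained on each node. As an illustration of the general idea only (this is not the code of Fed-BioMed's FedAverage class), the sketch below averages per-node linear model parameters with weights proportional to the local sample counts. The function name federated_average and all parameter values are assumptions made for the example.

```python
import numpy as np


def federated_average(node_params, node_sizes):
    """Weighted average of per-node parameters, weights proportional to sample counts."""
    weights = np.asarray(node_sizes, dtype=float)
    weights /= weights.sum()
    keys = node_params[0].keys()
    return {
        k: sum(w * np.asarray(p[k], dtype=float) for w, p in zip(weights, node_params))
        for k in keys
    }


# Made-up local results from three nodes holding 300 samples each.
local = [
    {'coef_': np.array([1.0, 2.0]), 'intercept_': np.array([0.0])},
    {'coef_': np.array([2.0, 2.0]), 'intercept_': np.array([3.0])},
    {'coef_': np.array([3.0, 5.0]), 'intercept_': np.array([0.0])},
]
global_params = federated_average(local, [300, 300, 300])
print(global_params['coef_'], global_params['intercept_'])
```

With equal node sizes, as in this tutorial, the weighted average reduces to a plain mean of the local parameters.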

In [ ]:

# start federated training
exp.run()

Testing

Once the federated model is obtained, it is possible to test it locally on an independent testing partition. The test dataset is available at this link:

https://drive.google.com/file/d/1zNUGp6TMn6WSKYVC8FQiQ9lJAUdasxk1/

In [ ]:

!pip install matplotlib
!pip install gdown

Download the testing dataset to the local temporary folder.

In [9]:

import os
import gdown
import zipfile

resource = "https://drive.google.com/uc?id=19kxuI146WA2fhcOU2_AvF8dy-ppJkzW7"
base_dir = tmp_dir_model.name 

test_file = os.path.join(base_dir, "test_data.zip")
gdown.download(resource, test_file, quiet=False)

# extract the archive into the temporary folder
with zipfile.ZipFile(test_file) as zf:
    zf.extractall(base_dir)

Downloading...
From: https://drive.google.com/uc?id=19kxuI146WA2fhcOU2_AvF8dy-ppJkzW7
To: /Users/mlorenzi/works/temp/fedbiomed/var/tmp/tmpgu33_tb6/test_data.zip
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 12.4k/12.4k [00:00<00:00, 7.47MB/s]

In [10]:

import pandas as pd
n_features = 8

test_data = pd.read_csv(os.path.join(base_dir,'adni_validation.csv'))

In [11]:

from sklearn.linear_model import SGDRegressor
import matplotlib.pyplot as plt

In [12]:

%matplotlib inline

Here we extract the relevant regressors and the target from the testing data.

In [13]:

regressors_col = ['SEX', 'AGE', 'PTEDUCAT', 'WholeBrain.bl', 'Ventricles.bl', 'Hippocampus.bl', 'MidTemp.bl', 'Entorhinal.bl']
target_col = ['MMSE.bl']
X_test = test_data[regressors_col].values
y_test = test_data[target_col].values.ravel()  # 1-D target, same shape as the predictions

To inspect the model evolution across FL rounds, we read exp.aggregated_params, which contains the model parameters collected at the end of each round. The testing MSE is expected to decrease over the rounds as the federated parameters improve.

In [14]:

scaling_mean = np.array([0.8, 72.3, 16.2, 0.7, 0.0, 0.0, 0.0, 0.0])
scaling_sd = np.array([3.5e-01, 7.3e+00, 2.7e+00, 5.0e-02, 1.1e-02, 1.0e-03, 2.0e-03, 1.0e-03])

testing_error = []

for i in range(rounds):
    fed_model = SGDRegressor(max_iter=1000, tol=1e-3)
    fed_model.coef_ = exp.aggregated_params[i]['params']['coef_'].copy()
    fed_model.intercept_ = exp.aggregated_params[i]['params']['intercept_'].copy()  
    mse = np.mean((fed_model.predict((X_test-scaling_mean)/scaling_sd) - y_test)**2)
    testing_error.append(mse)

plt.plot(testing_error)
plt.title('FL testing loss')
plt.xlabel('FL round')
plt.ylabel('testing loss')

Out[14]:

Text(0, 0.5, 'testing loss')
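The per-round error computation can also be sketched with plain NumPy, since a linear model predicts X @ coef_ + intercept_. The snippet assumes each round's parameters are available as a dictionary with 'coef_' and 'intercept_' entries, as accessed in the cell above. The helper name mse_per_round and the data are made up. The target is flattened explicitly, because subtracting an (n, 1) array from an (n,) array broadcasts to an (n, n) matrix and gives a misleading mean.

```python
import numpy as np


def mse_per_round(params_per_round, X, y):
    """MSE of the linear model X @ coef_ + intercept_ for each round's parameters."""
    y = np.asarray(y, dtype=float).ravel()  # force shape (n_samples,)
    errors = []
    for params in params_per_round:
        pred = X @ params['coef_'] + params['intercept_']  # shape (n_samples,)
        errors.append(float(np.mean((pred - y) ** 2)))
    return errors


# Made-up data generated by y = 2*x0 - x1 + 1, and two made-up rounds of parameters.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([[1.0], [3.0], [0.0], [2.0]])  # column vector on purpose
rounds_params = [
    {'coef_': np.array([0.0, 0.0]), 'intercept_': np.array([0.0])},
    {'coef_': np.array([2.0, -1.0]), 'intercept_': np.array([1.0])},
]
errors = mse_per_round(rounds_params, X, y)
print(errors)
```

In this toy case the second set of parameters reproduces the data exactly, so the error drops from 3.5 to 0.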

We finally inspect the predictions of the final federated model on the testing data.

In [15]:

plt.scatter(fed_model.predict((X_test-scaling_mean)/scaling_sd), y_test)
plt.xlabel('predicted')
plt.ylabel('target')
plt.title('Federated model testing prediction')

Out[15]:

Text(0.5, 1.0, 'Federated model testing prediction')
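The scatter plot can be complemented with numeric summaries of the agreement between predictions and targets. The sketch below computes the coefficient of determination and the Pearson correlation with NumPy on made-up values. The helper name r2_score_np is hypothetical, and the numbers are not results of this experiment.

```python
import numpy as np


def r2_score_np(y_true, y_pred):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


# Made-up targets and predictions, only to show the calls.
y_true = [30.0, 28.0, 25.0, 21.0]
y_pred = [29.0, 28.0, 26.0, 21.0]
r2 = r2_score_np(y_true, y_pred)
corr = np.corrcoef(y_true, y_pred)[0, 1]
print(round(r2, 3), round(corr, 3))
```

In the notebook, the same two calls could be applied to y_test and to the predictions of fed_model on the standardized test features.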
