Fed-BioMed to train a federated SGD regressor model¶
Data¶
This tutorial shows how to use Fed-BioMed to solve a federated regression problem with scikit-learn.
In this tutorial we are using the wrapper of Fed-BioMed for the SGD regressor. The goal of the notebook is to train a model on a realistic dataset of (synthetic) medical information mimicking the ADNI dataset.
Creating nodes¶
To proceed with the tutorial, we create 3 clients with corresponding dataframes of clinical information in .csv format. Each client has 300 data points composed of several features corresponding to clinical and medical imaging information. The data is entirely synthetic and randomly sampled to mimic the variability of the real ADNI dataset. The training partitions are available at the following link:
https://drive.google.com/file/d/1R39Ir60oQi8ZnmHoPz5CoGCrVIglcO9l/view?usp=sharing
The federated task we aim to solve is to predict a clinical variable (the mini-mental state examination, MMSE) from a combination of demographic and imaging features. The regressor variables are the following features:
['SEX', 'AGE', 'PTEDUCAT', 'WholeBrain.bl', 'Ventricles.bl', 'Hippocampus.bl', 'MidTemp.bl', 'Entorhinal.bl']
and the target variable is:
['MMSE.bl']
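Before adding a partition to a node, it can be useful to check that the file contains the regressor and target columns listed above. The following is a minimal sketch using only the Python standard library; the helper name `missing_columns` and the sample row are made up for illustration and are not part of Fed-BioMed or of the tutorial data.

```python
import csv
import io

REGRESSORS = ['SEX', 'AGE', 'PTEDUCAT', 'WholeBrain.bl', 'Ventricles.bl',
              'Hippocampus.bl', 'MidTemp.bl', 'Entorhinal.bl']
TARGET = ['MMSE.bl']


def missing_columns(csv_file, required):
    """Return the required column names that are absent from the CSV header."""
    header = next(csv.reader(csv_file))
    return [name for name in required if name not in header]


# Hypothetical in-memory CSV standing in for one client partition.
sample = io.StringIO(
    "SEX,AGE,PTEDUCAT,WholeBrain.bl,Ventricles.bl,Hippocampus.bl,"
    "MidTemp.bl,Entorhinal.bl,MMSE.bl\n"
    "1,71.2,16,0.71,0.02,0.005,0.013,0.002,29\n"
)
missing = missing_columns(sample, REGRESSORS + TARGET)
print(missing)  # an empty list means every expected column is present
```

With a real partition, the same helper could be called on a file opened with `open(path, newline='')`.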
To create the federated dataset, we follow the standard procedure for node creation/population of Fed-BioMed. After activating the fedbiomed network with the commands
$ source ./scripts/fedbiomed_environment network
and
$ ./scripts/fedbiomed_run network
we create a first node by using the commands
$ source ./scripts/fedbiomed_environment node
$ ./scripts/fedbiomed_run node start
We then populate the node with the data of the first client:
$ ./scripts/fedbiomed_run node add
We select option 1 (csv) and pick the .csv partition of client 1. We use adni as the tag to save the selected dataset. We can further check that the data has been added by executing ./scripts/fedbiomed_run node list
Following the same procedure, we create the other two nodes with the datasets of client 2 and client 3 respectively.
Fed-BioMed Researcher¶
We are now ready to start the researcher environment with the following command. This command will activate the researcher environment and start Jupyter Notebook.
$ ./scripts/fedbiomed_run researcher
We can first query the network for the adni dataset. In this case, the nodes are sharing the respective partitions using the same tag adni:
%load_ext autoreload
%autoreload 2
from fedbiomed.researcher.requests import Requests
req = Requests()
req.list(verbose=True)
2022-01-10 15:03:20,966 fedbiomed INFO - Component environment:
2022-01-10 15:03:20,967 fedbiomed INFO - - type = ComponentType.RESEARCHER
2022-01-10 15:03:21,298 fedbiomed INFO - Messaging researcher_a7319768-5c08-43f6-a819-9a487cb1cc02 successfully connected to the message broker, object = <fedbiomed.common.messaging.Messaging object at 0x103a311c0>
2022-01-10 15:03:21,318 fedbiomed INFO - Listing available datasets in all nodes...
2022-01-10 15:03:21,326 fedbiomed INFO - log from: node_c27b3141-213a-4221-9dcf-7e885a30738b / DEBUG - Message received: {'researcher_id': 'researcher_a7319768-5c08-43f6-a819-9a487cb1cc02', 'command': 'list'}
2022-01-10 15:03:21,327 fedbiomed INFO - log from: node_ff1ad308-d26a-4c73-8ffc-87df03618014 / DEBUG - Message received: {'researcher_id': 'researcher_a7319768-5c08-43f6-a819-9a487cb1cc02', 'command': 'list'}
2022-01-10 15:03:21,328 fedbiomed INFO - log from: node_0f5cb1d3-621f-45b9-9f45-4a38758e5ebf / DEBUG - Message received: {'researcher_id': 'researcher_a7319768-5c08-43f6-a819-9a487cb1cc02', 'command': 'list'}
2022-01-10 15:03:31,331 fedbiomed INFO - Node: node_c27b3141-213a-4221-9dcf-7e885a30738b | Number of Datasets: 1
+--------+-------------+----------+---------------+-----------+
| name   | data_type   | tags     | description   | shape     |
+========+=============+==========+===============+===========+
| adni   | csv         | ['adni'] | bla           | [300, 20] |
+--------+-------------+----------+---------------+-----------+
2022-01-10 15:03:31,332 fedbiomed INFO - Node: node_ff1ad308-d26a-4c73-8ffc-87df03618014 | Number of Datasets: 1
+--------+-------------+----------+---------------+-----------+
| name   | data_type   | tags     | description   | shape     |
+========+=============+==========+===============+===========+
| adni   | csv         | ['adni'] | bla           | [300, 20] |
+--------+-------------+----------+---------------+-----------+
2022-01-10 15:03:31,333 fedbiomed INFO - Node: node_0f5cb1d3-621f-45b9-9f45-4a38758e5ebf | Number of Datasets: 1
+--------+-------------+----------+---------------+-----------+
| name   | data_type   | tags     | description   | shape     |
+========+=============+==========+===============+===========+
| adni   | csv         | ['adni'] | bla           | [300, 20] |
+--------+-------------+----------+---------------+-----------+
{'node_c27b3141-213a-4221-9dcf-7e885a30738b': [{'name': 'adni', 'data_type': 'csv', 'tags': ['adni'], 'description': 'bla', 'shape': [300, 20]}], 'node_ff1ad308-d26a-4c73-8ffc-87df03618014': [{'name': 'adni', 'data_type': 'csv', 'tags': ['adni'], 'description': 'bla', 'shape': [300, 20]}], 'node_0f5cb1d3-621f-45b9-9f45-4a38758e5ebf': [{'name': 'adni', 'data_type': 'csv', 'tags': ['adni'], 'description': 'bla', 'shape': [300, 20]}]}
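The dictionary returned by req.list can be post-processed in plain Python, for example to count the samples available under a given tag. The sketch below assumes the return value has the structure printed above (a node id mapped to a list of dataset entries with 'tags' and 'shape' fields); the variable `datasets` is a hand-written stand-in for that value with placeholder node ids, and the helper name `samples_per_node` is hypothetical.

```python
# Stand-in for the value returned by req.list(verbose=True); node ids are placeholders.
datasets = {
    'node_1': [{'name': 'adni', 'data_type': 'csv', 'tags': ['adni'],
                'description': 'bla', 'shape': [300, 20]}],
    'node_2': [{'name': 'adni', 'data_type': 'csv', 'tags': ['adni'],
                'description': 'bla', 'shape': [300, 20]}],
    'node_3': [{'name': 'adni', 'data_type': 'csv', 'tags': ['adni'],
                'description': 'bla', 'shape': [300, 20]}],
}


def samples_per_node(listing, tag):
    """Map each node id to the number of rows of its datasets carrying `tag`."""
    return {
        node_id: sum(entry['shape'][0] for entry in entries if tag in entry['tags'])
        for node_id, entries in listing.items()
    }


counts = samples_per_node(datasets, 'adni')
total = sum(counts.values())
print(counts, total)
```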
import numpy as np
from fedbiomed.researcher.environ import environ
import tempfile
tmp_dir_model = tempfile.TemporaryDirectory(dir=environ['TMP_DIR']+'/')
model_file = tmp_dir_model.name + '/fedbiosklearn.py'
Create an experiment to train a model on the data found¶
The code for the model and data loader of the sklearn SGDRegressor can now be deployed in Fed-BioMed. We first import the necessary module SGDSkLearnModel from fedbiomed, then define a training plan with two methods:
__init__ : we add here the needed sklearn libraries
training_data : you must return here the (X, y) pair, which must be of the same types as the parameters of the partial_fit method.
We note that this model performs a common standardization across federated datasets by centering with respect to the same parameters.
%%writefile "$model_file"
from fedbiomed.common.fedbiosklearn import SGDSkLearnModel
from sklearn.linear_model import SGDRegressor

class SGDRegressorTrainingPlan(SGDSkLearnModel):
    def __init__(self, kwargs):
        super(SGDRegressorTrainingPlan, self).__init__(kwargs)
        # Declare the sklearn import needed by this training plan
        self.add_dependency(["from sklearn.linear_model import SGDRegressor"])

    def training_data(self):
        NUMBER_COLS = 5
        dataset = pd.read_csv(self.dataset_path, delimiter=',')
        regressors_col = ['SEX', 'AGE', 'PTEDUCAT', 'WholeBrain.bl', 'Ventricles.bl', 'Hippocampus.bl', 'MidTemp.bl', 'Entorhinal.bl']
        target_col = ['MMSE.bl']
        # Fixed mean and standard deviation, identical on every node
        scaling_mean = np.array([0.8, 72.3, 16.2, 0.7, 0.0, 0.0, 0.0, 0.0])
        scaling_sd = np.array([3.5e-01, 7.3e+00, 2.7e+00, 5.0e-02, 1.1e-02, 1.0e-03, 2.0e-03, 1.0e-03])
        X = (dataset[regressors_col].values - scaling_mean) / scaling_sd
        y = dataset[target_col]
        # Return (X, y) with a flattened 1-D target
        return (X, y.values.ravel())
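The effect of standardizing every partition with the same fixed parameters, rather than with statistics computed locally on each node, can be illustrated with a small NumPy sketch. The two arrays below are made-up values for a single feature (age) on two hypothetical nodes; only the constants 72.3 and 7.3 are taken from the training plan.

```python
import numpy as np

SHARED_MEAN, SHARED_SD = 72.3, 7.3  # AGE entries of scaling_mean and scaling_sd

# Made-up ages on two hypothetical nodes with different local distributions.
node_a = np.array([65.0, 70.0, 75.0])
node_b = np.array([75.0, 80.0, 85.0])


def standardize(x, mean, sd):
    """Center and scale x with the given mean and standard deviation."""
    return (x - mean) / sd


# Shared parameters: the same raw age maps to the same value on every node.
shared_a = standardize(node_a, SHARED_MEAN, SHARED_SD)
shared_b = standardize(node_b, SHARED_MEAN, SHARED_SD)

# Local parameters: each node centers on its own mean, so the same raw age
# (75 here) ends up with a different standardized value on each node.
local_a = standardize(node_a, node_a.mean(), node_a.std())
local_b = standardize(node_b, node_b.mean(), node_b.std())

print(shared_a[2], shared_b[0])  # identical: both correspond to age 75
print(local_a[2], local_b[0])    # different, although both correspond to age 75
```

Because a single set of coefficients is shared by all nodes, a common scale keeps each coefficient's meaning the same across partitions.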
model_args is a dictionary containing your model arguments; in the case of SGDRegressor these are max_iter and tol. n_features is provided to correctly initialize the SGDRegressor coef array.
training_args is a dictionary with parameters related to Federated Learning.
model_args = {'max_iter': 1000, 'tol': 1e-3, 'model': 'SGDRegressor', 'n_features': 8}
training_args = {
    'epochs': 5,
}
The experiment can now be defined by providing the adni tag and running the local training on the nodes with the model defined in model_path, the standard aggregator (FedAvg) and node_selection_strategy set to None (all nodes used). Federated learning is going to be performed through 10 optimization rounds.
from fedbiomed.researcher.experiment import Experiment
from fedbiomed.researcher.aggregators.fedavg import FedAverage
tags = ['adni']
rounds = 10
# select nodes participating in this experiment
exp = Experiment(tags=tags,
                 model_path=model_file,
                 model_args=model_args,
                 model_class='SGDRegressorTrainingPlan',
                 training_args=training_args,
                 rounds=rounds,
                 aggregator=FedAverage(),
                 node_selection_strategy=None)
2022-01-10 15:03:33,031 fedbiomed INFO - Searching dataset with data tags: ['adni'] for all nodes
2022-01-10 15:03:33,038 fedbiomed INFO - log from: node_c27b3141-213a-4221-9dcf-7e885a30738b / DEBUG - Message received: {'researcher_id': 'researcher_a7319768-5c08-43f6-a819-9a487cb1cc02', 'tags': ['adni'], 'command': 'search'}
2022-01-10 15:03:33,039 fedbiomed INFO - log from: node_0f5cb1d3-621f-45b9-9f45-4a38758e5ebf / DEBUG - Message received: {'researcher_id': 'researcher_a7319768-5c08-43f6-a819-9a487cb1cc02', 'tags': ['adni'], 'command': 'search'}
2022-01-10 15:03:33,043 fedbiomed INFO - log from: node_ff1ad308-d26a-4c73-8ffc-87df03618014 / DEBUG - Message received: {'researcher_id': 'researcher_a7319768-5c08-43f6-a819-9a487cb1cc02', 'tags': ['adni'], 'command': 'search'}
2022-01-10 15:03:43,044 fedbiomed INFO - Node selected for training -> node_c27b3141-213a-4221-9dcf-7e885a30738b
2022-01-10 15:03:43,045 fedbiomed INFO - Node selected for training -> node_0f5cb1d3-621f-45b9-9f45-4a38758e5ebf
2022-01-10 15:03:43,045 fedbiomed INFO - Node selected for training -> node_ff1ad308-d26a-4c73-8ffc-87df03618014
2022-01-10 15:03:43,048 fedbiomed INFO - Checking data quality of federated datasets...
# start federated training
exp.run()
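As a rough illustration of what an averaging aggregator does with the linear-model parameters returned by the nodes, the sketch below computes a sample-size-weighted average of coefficient vectors and intercepts. This is a generic federated averaging sketch, not the code of FedAverage in Fed-BioMed; the function name `federated_average` and all parameter values are hypothetical.

```python
import numpy as np


def federated_average(node_params, node_sizes):
    """Average per-node parameters, weighting each node by its sample count."""
    weights = np.asarray(node_sizes, dtype=float)
    weights = weights / weights.sum()
    keys = node_params[0].keys()
    return {
        key: sum(w * np.asarray(params[key], dtype=float)
                 for w, params in zip(weights, node_params))
        for key in keys
    }


# Hypothetical parameters from three nodes for a model with two features.
node_params = [
    {'coef_': np.array([1.0, 0.0]), 'intercept_': np.array([0.3])},
    {'coef_': np.array([2.0, 1.0]), 'intercept_': np.array([0.6])},
    {'coef_': np.array([3.0, 2.0]), 'intercept_': np.array([0.9])},
]
node_sizes = [300, 300, 300]  # equal partition sizes, as in this tutorial

aggregated = federated_average(node_params, node_sizes)
print(aggregated['coef_'], aggregated['intercept_'])
```

With equal partition sizes the weighted average reduces to a plain mean; the weights only matter when nodes hold different numbers of samples.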
Testing¶
Once the federated model is obtained, it is possible to test it locally on an independent testing partition. The test dataset is available at this link:
https://drive.google.com/file/d/1zNUGp6TMn6WSKYVC8FQiQ9lJAUdasxk1/
!pip install matplotlib
!pip install gdown
Download the testing dataset to the local temporary folder.
import os
import gdown
import zipfile
resource = "https://drive.google.com/uc?id=19kxuI146WA2fhcOU2_AvF8dy-ppJkzW7"
base_dir = tmp_dir_model.name
test_file = os.path.join(base_dir, "test_data.zip")
gdown.download(resource, test_file, quiet=False)
zf = zipfile.ZipFile(test_file)
for file in zf.infolist():
    zf.extract(file, base_dir)
Downloading...
From: https://drive.google.com/uc?id=19kxuI146WA2fhcOU2_AvF8dy-ppJkzW7
To: /Users/mlorenzi/works/temp/fedbiomed/var/tmp/tmpgu33_tb6/test_data.zip
100%|██████████| 12.4k/12.4k [00:00<00:00, 7.47MB/s]
import pandas as pd
n_features = 8
test_data = pd.read_csv(os.path.join(base_dir,'adni_validation.csv'))
from sklearn.linear_model import SGDRegressor
import matplotlib.pyplot as plt
%matplotlib inline
Here we extract the relevant regressors and target from the testing data
regressors_col = ['SEX', 'AGE', 'PTEDUCAT', 'WholeBrain.bl', 'Ventricles.bl', 'Hippocampus.bl', 'MidTemp.bl', 'Entorhinal.bl']
target_col = ['MMSE.bl']
X_test = test_data[regressors_col].values
# Flatten the target to 1-D so that it matches the shape of the model predictions
y_test = test_data[target_col].values.ravel()
To inspect the model evolution across FL rounds, we export exp.aggregated_params, which contains the model parameters collected at the end of each round. The MSE computed with the federated parameters should decrease over the rounds.
scaling_mean = np.array([0.8, 72.3, 16.2, 0.7, 0.0, 0.0, 0.0, 0.0])
scaling_sd = np.array([3.5e-01, 7.3e+00, 2.7e+00, 5.0e-02, 1.1e-02, 1.0e-03, 2.0e-03, 1.0e-03])
testing_error = []
for i in range(rounds):
    fed_model = SGDRegressor(max_iter=1000, tol=1e-3)
    fed_model.coef_ = exp.aggregated_params[i]['params']['coef_'].copy()
    fed_model.intercept_ = exp.aggregated_params[i]['params']['intercept_'].copy()
    mse = np.mean((fed_model.predict((X_test-scaling_mean)/scaling_sd) - y_test)**2)
    testing_error.append(mse)
plt.plot(testing_error)
plt.title('FL testing loss')
plt.xlabel('FL round')
plt.ylabel('testing loss')
Text(0, 0.5, 'testing loss')
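When computing an error such as the MSE by hand, the shapes of the prediction and target arrays matter: subtracting a column vector of shape (n, 1) from a 1-D array of shape (n,) broadcasts to an (n, n) matrix, and the mean is then taken over all pairs rather than over matching samples. The sketch below, with made-up numbers, shows the difference together with a helper (hypothetical name `mse`) that flattens both inputs first.

```python
import numpy as np


def mse(predictions, targets):
    """Mean squared error after flattening both inputs to 1-D arrays."""
    predictions = np.asarray(predictions, dtype=float).ravel()
    targets = np.asarray(targets, dtype=float).ravel()
    if predictions.shape != targets.shape:
        raise ValueError("predictions and targets must have the same number of elements")
    return float(np.mean((predictions - targets) ** 2))


pred = np.array([1.0, 2.0, 3.0])                 # shape (3,), like the output of predict
target_column = np.array([[1.0], [2.0], [3.0]])  # shape (3, 1), like DataFrame[[col]].values

pairwise = (pred - target_column) ** 2           # broadcasts to shape (3, 3)
print(pairwise.shape, np.mean(pairwise))         # nonzero although predictions are perfect
print(mse(pred, target_column))                  # 0.0
```

Flattening the target with ravel, as the training plan does for y, avoids the unintended broadcast.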
We finally inspect the predictions of the final federated model on the testing data.
plt.scatter(fed_model.predict((X_test-scaling_mean)/scaling_sd), y_test)
plt.xlabel('predicted')
plt.ylabel('target')
plt.title('Federated model testing prediction')
Text(0.5, 1.0, 'Federated model testing prediction')