🚀 How to Use This Guide
Hey there! This guide walks you through building federated analytics projects from start to finish. Each step builds on the previous one, but feel free to adapt things to fit your specific needs and constraints.
This guide builds on our FL Lifecycle framework and pulls in lessons from real-world FL Use Cases across healthcare and research domains. Want the academic background? Check out McMahan et al. (2021) and Zhang et al. (2021).
We pair those theoretical foundations with actionable implementation steps: each section includes real examples, common pitfalls, and specific deliverables you can start producing right away. Check out our domain-specific examples.
📚 Quick Start Paths
Start with our FL Lifecycle and glossary, then dive into real examples. Want hands-on learning? Try our interactive tutorials.
Check out framework selection and browse our tools & resources. Popular picks include Flower, PySyft, and TensorFlow Federated.
Join our community or check out training materials. Looking for role-specific guidance? Check our role pages.
📋 Every Project is Unique: This guide gives you a solid roadmap for federated analytics implementation, but your journey will be shaped by your specific domain, data characteristics, and organizational context. Healthcare projects face different regulatory challenges than finance or research applications. For regulatory guidance, check out GDPR, HIPAA, and other relevant frameworks in your jurisdiction.
📅 Realistic Timelines: The timelines you'll see are based on successful projects across different domains. Your actual schedule will depend on data complexity, team experience, regulatory requirements, and infrastructure readiness. Ethics approval alone can take months in many jurisdictions, so plan accordingly. Use these estimates as planning references, not rigid deadlines.
🎯 Customize Your Approach: This guide covers the complete implementation pipeline, but you might need to adapt, skip, or modify steps based on your unique constraints. A research prototype has different requirements than a production healthcare system. Check out our use case library for domain-specific examples and adaptations. The goal is to give you a structured framework that you can tailor to your needs.
🚀 Ready to Start? Here's Your Implementation Roadmap
Now that you know how to use this guide, let's dive into the four core steps that'll take you from initial concept to successful deployment. Each step builds on the previous one, with clear deliverables and checkpoints along the way.
💡 Pro tip: You can click on any step card below to jump straight to the detailed implementation guide, or scroll through them sequentially to get the complete picture.
🎯 Planning
- Problem definition and federation justification
- Ethics approval and legal compliance
- Stakeholder identification and team formation
- Project timeline and resource planning
🔧 Preparation
- Data exploration and quality assessment
- Infrastructure setup and security configuration
- Federation framework selection and testing
- Data preprocessing and standardization
⚙️ Training
- Algorithm design and model architecture
- Federated training implementation
- Privacy-preserving techniques (differential privacy, secure multi-party computation)
- Performance evaluation and validation
🚀 Deployment
- Reproducibility documentation and code sharing
- Model maturity assessment and validation
- Deployment planning and production readiness
- Results dissemination and knowledge transfer
1. Planning
What This Step Covers
This is where everything begins. The planning step combines problem definition, scope setting, governance, and ethics approval. Before writing code or contacting potential partners, you need crystal-clear answers to fundamental questions: What problem are you solving? Why does it matter? Why can't you just pool the data centrally? What exactly will you deliver, and what is out of scope?
📚 New to federated learning? Start with our FL Lifecycle overview and glossary to understand key concepts. Explore real-world use cases to see how others have approached similar challenges.
Key Questions to Answer
What specific clinical, scientific, or operational gap are you filling? Who benefits (patients, researchers, policymakers)? Why is federated analytics necessary rather than just preferable? What does success look like concretely? What assumptions underpin your approach, and how will you validate them?
💡 Pro tip: Use our FL Use Cases to see how others have justified federation for similar problems.
Who Is Involved
Principal investigators and domain experts define the problem and objectives. Legal and compliance officers identify regulatory barriers to centralization. Ethics committees review protocols. Project managers establish scope boundaries and timelines. Funding agencies assess whether the problem justifies the resources.
Timeline
Initial scoping: 1-2 weeks of stakeholder discussions and literature review. Ethics and governance: 3-6 months (often the longest part of this step). Refinement happens throughout the project as you learn more, but major scope changes after the next step are costly.
Common Mistakes
Vague problem statements ("improve healthcare"), unrealistic objectives (overpromising performance), weak federation justification (could actually centralize with proper agreements), underestimating ethics timeline, unclear authorship rules leading to publication conflicts.
Deliverables
One-page project summary, stakeholder list with commitments, signed consortium agreement or data use agreements, ethics approval letters with reference numbers, explicit list of in-scope and out-of-scope deliverables, documented assumptions with validation plan.
What to Document
- Problem and Domain: Describe the clinical, scientific, or operational challenge you are addressing in plain language. What gap in knowledge or capability are you filling? Who will benefit from solving this problem?
- Why Federated Analytics: Explain the specific barriers preventing centralized data pooling. These might be regulatory (GDPR, HIPAA), institutional policy, data ownership concerns, trust issues between organizations, technical infrastructure limitations, or competitive considerations.
- Primary Objective: State your single most important goal in concrete, measurable terms. Be specific about what success looks like. Examples: "Train a prognostic model achieving AUROC ≥0.80" or "Estimate treatment effect heterogeneity across 20 sites with 95% confidence intervals."
- Population and Setting: Describe your data subjects or sources. For clinical studies: patient populations, inclusion/exclusion criteria, geographic and temporal scope. For other domains: IoT devices, sensor networks, administrative databases, etc.
- Stakeholders and Roles: List all key participants with their roles and responsibilities: principal investigators, data custodians at each site, technical developers, funders, oversight bodies, and end users.
- Ethics and Approvals: Name all institutional review boards, ethics committees, or data protection impact assessments that reviewed and approved your work. Include approval reference numbers, dates, and key conditions or restrictions.
- Legal Basis and Agreements: Describe the legal foundation for data processing (GDPR Article citations, HIPAA provisions, institutional policies). List data use agreements, consortium contracts, or memoranda of understanding between participating organizations.
- Access and Sharing Policy: Specify who can access what. For raw data: controlled access procedures, credentialing requirements. For intermediate artifacts: sharing rules. For results: publication policies, embargoes, site-level result disclosure rules.
- Outputs: List all deliverables. For modeling projects: trained models, performance metrics, calibration curves. For analytics: summary statistics, dashboards, reports. For infrastructure: frameworks, APIs, deployment guides.
- Out of Scope: Explicitly state what this project will NOT address. This prevents misunderstandings and helps focus resources.
- Assumptions and Constraints: List key assumptions underlying your approach (e.g., "Sites use consistent diagnostic criteria," "Clients remain online during training"). Note technical, organizational, and regulatory constraints. Describe how you will validate assumptions.
Background: This example is based on a real-world federated learning study for multiple sclerosis (MS) disability progression prediction, published in Pirmani et al. (2025). The study demonstrates how federated learning can be applied to clinical prediction tasks while maintaining data privacy and regulatory compliance.
Problem: Early prediction of disability progression in multiple sclerosis (MS) remains challenging despite its critical importance for therapeutic decision-making. MS affects an estimated 2.8 million people worldwide (Walton et al., 2020), with each patient experiencing a unique disease course and varying responses to treatment. The primary challenge lies in capturing this heterogeneity to enable personalized, data-driven treatment strategies. While machine learning shows promise for improving our understanding of MS progression and predicting individual treatment responses, developing advanced ML models remains constrained by limited access to large-scale, high-quality datasets: the clinical data needed for precision modeling remain fragmented and siloed across healthcare institutions.
Why Federated: Aggregating MS clinical data across international institutions is complicated by legitimate but complex regulatory constraints (GDPR, national health data laws), data ownership concerns, and inconsistent data quality standards. Healthcare institutions are reluctant to share raw patient records due to privacy regulations, competitive concerns, and liability fears. Federated learning offers a decentralized learning paradigm that enables training ML models while preserving data localization, strongly aligned with data privacy and protection standards. This approach allows collaborative model development without requiring data centralization.
Objective: Assess whether personalized federated learning can match or exceed centralized model performance in predicting 2-year disability progression in multiple sclerosis patients, while maintaining data localization and privacy. Success defined as achieving comparable ROC-AUC to centralized baseline (~0.81) using federated approaches across multiple international sites.
Ethics Approval: Hasselt University and KU Leuven PRET Approval: G 2023 6771. Legal Basis: GDPR Article 6(1)(e) + Article 9(2)(j) for health data research with appropriate safeguards.
2. Preparation
What This Step Covers
This is where planning becomes implementation. The preparation step combines data understanding, infrastructure setup, and technical architecture. You will characterize what data exist at each site, understand data heterogeneity, set up federation infrastructure, and prepare data pipelines for federated training.
🛠️ Need framework guidance? Check our FL Framework Assembly and Tools and Resources for technical recommendations. For data harmonization challenges, explore our use case examples.
Key Questions to Answer
Where does your data live? How many clients will participate? How different are they from each other (data volume, feature distributions, outcome prevalence)? What clinical or technical vocabularies do you need to align? What data quality issues exist? How will you handle train/test splits in a distributed setting? What normalization strategy avoids information leakage?
Who Is Involved
Data engineers at each site implement local preprocessing pipelines. Machine learning engineers set up the federation framework (Flower, PySyft, TensorFlow Federated, etc.). IT administrators configure servers and network access. DevOps engineers handle monitoring and logging. System architects design the topology.
Timeline
Data characterization: 2-4 weeks. Infrastructure setup: 2-4 weeks. Integration and debugging: 2-4 weeks. Total: 4-12 weeks depending on the number of sites and technical complexity. More sites means more coordination overhead.
Common Mistakes
Assuming all sites have similar data quality or feature availability, underestimating harmonization effort, ignoring extreme heterogeneity that might make federation infeasible, inconsistent preprocessing across sites (subtle bugs multiply), ignoring data leakage (test statistics leaking into normalization), underestimating infrastructure complexity (firewalls, authentication, monitoring).
Deliverables
Client inventory with data volumes, data quality assessment reports from each site, harmonization mapping documents, validated preprocessing pipelines running at each site, federation infrastructure passing integration tests, monitoring dashboards operational, documented hardware and software specifications, runbooks for troubleshooting common issues.
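The data quality assessment reports listed above don't need heavy tooling. Here is a minimal sketch of a local quality summary, assuming pandas and a tabular CSV extract at each site; the file name and columns are placeholders, not any registry's actual schema. Each site runs it locally and shares only the aggregate report, never the underlying records.

```python
# Minimal per-site data quality summary -- a sketch, assuming a local tabular
# extract in CSV form. The file name and columns are illustrative placeholders.
import pandas as pd

def quality_report(path: str) -> pd.DataFrame:
    df = pd.read_csv(path)
    report = pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing_pct": df.isna().mean().round(3) * 100,
        "n_unique": df.nunique(),
        "min": df.min(numeric_only=True),
        "max": df.max(numeric_only=True),
    })
    report["n_rows"] = len(df)
    return report

if __name__ == "__main__":
    # Each site runs this locally and shares only the aggregate report.
    print(quality_report("site_data.csv"))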
What to Document
- Data Sources: List all participating data silos (hospitals, clinics, research cohorts, IoT deployments). Provide context on their systems: EHR vendors, data collection protocols, approximate data volumes. Describe how you defined "clients" (by institution, by country, by device type, etc.).
- Federation Mode: Specify whether you are using simulated federation (single infrastructure with partitioned data), live federation (truly distributed clients), or hybrid approaches. If simulated: describe data partitioning strategy and what real-world factors you are NOT capturing.
- Inclusion Criteria: Define eligibility for records, subjects, or data samples at both the dataset level (which sites participate) and record level (which observations are analyzed). Document any temporal restrictions, data quality filters, or minimum sample size requirements.
- Features and Data Dictionary: List all variables used in your analysis, grouped by type (demographics, clinical measures, lab values, etc.). Provide or link to a detailed data dictionary with units, permissible ranges, and definitions. Note any derived features (feature engineering).
- Outcome Definitions: For supervised learning: precisely define your target variable(s). For hypothesis testing: state your null and alternative hypotheses. Include operational definitions, time windows, confirmation requirements, and any exclusions.
- Dataset Characteristics: Report sample sizes (total and per client), class balance or outcome prevalence, known biases or selection effects, and missing data patterns. Quantify heterogeneity across clients if applicable.
- Data Quality and Harmonization: Note any data quality issues discovered during exploration (missingness, outliers, measurement errors). List vocabularies, ontologies, coding systems, and unit conventions used for harmonization across sites.
- Local Data Preparation: Describe your complete preprocessing pipeline: data transformations, filtering, feature engineering, temporal windowing, and quality checks. Explain how you create train/validation/test splits (temporal, random, stratified). Detail your normalization strategy (local statistics, global statistics, fixed references). A minimal sketch of a leakage-free pipeline follows this list.
- Federation Infrastructure: Describe your network topology (star, hierarchical, peer-to-peer), orchestration framework, scheduling approach, client participation policies, and hardware specifications. Include monitoring, failure recovery, and baseline security measures (transport encryption, authentication).
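To make the split and normalization guidance concrete, here is a minimal sketch of a leakage-free local preparation step: splits are made at the subject level so no patient appears in both partitions, and scaling statistics come from the training portion only. It assumes pandas and scikit-learn; `patient_id` and `feature_cols` are illustrative placeholders rather than an actual schema, and the single 80/20 split is a simplification of the 60/20/20 protocol used in the example below.

```python
# Sketch of a leakage-free local preprocessing step (assumes pandas and
# scikit-learn). Splits are made per subject; the scaler is fit on train only.
# `patient_id` and `feature_cols` are illustrative placeholders.
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit
from sklearn.preprocessing import StandardScaler

def split_and_normalize(df: pd.DataFrame, feature_cols: list,
                        group_col: str = "patient_id", seed: int = 0):
    # Hold out 20% of subjects so no subject spans both train and test.
    splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=seed)
    train_idx, test_idx = next(splitter.split(df, groups=df[group_col]))
    train, test = df.iloc[train_idx].copy(), df.iloc[test_idx].copy()

    # Fit normalization statistics on the local training set ONLY,
    # then apply them to both partitions.
    scaler = StandardScaler().fit(train[feature_cols])
    train[feature_cols] = scaler.transform(train[feature_cols])
    test[feature_cols] = scaler.transform(test[feature_cols])
    return train, test, scaler
```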
Data Source: MSBase international MS registry, a large prospective observational cohort collecting routine clinical data from MS patients worldwide. Clients defined at country level, resulting in multiple federated sites. This choice balances sufficient sample size per client (many countries have 1000+ patients) with meaningful clinical heterogeneity (MS care practices, genetic backgrounds, environmental factors vary by country).
Federation Mode: Simulated on the Flanders Supercomputer; this captures data heterogeneity but not real-world network challenges. Data were partitioned by country to create multiple virtual clients, but all data physically reside on the same computing cluster. Each virtual client has exclusive access to its country's data subset during training (enforced programmatically).
Features: 42 tabular features including demographics, Expanded Disability Status Scale (EDSS) scores, relapse history, treatment records. Outcome: Confirmed Disability Progression (CDP), sustained EDSS increase confirmed at 6 months. Final dataset: 283,115 episodes from 26,246 patients across multiple countries.
Infrastructure: Centralized star topology with Flower 1.5.0, 32 clients on Flanders Supercomputer. Preprocessing: Episode construction with 3.25-year observation windows, 60/20/20 train/validation/test splits at patient level. Normalization: Per-client using training set statistics to avoid information leakage.
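For orientation, here is a heavily simplified sketch of what one site's client can look like in a Flower star topology, using the Flower 1.x NumPyClient API (`fl.client.NumPyClient`, `fl.client.start_numpy_client`, `fl.server.start_server`). The toy linear model and synthetic data are placeholders; this is not the study's code or its actual architecture.

```python
# Heavily simplified Flower star-topology sketch (Flower 1.x NumPyClient API).
# The toy linear model and synthetic data are placeholders -- not the study's
# actual pipeline or architecture.
import numpy as np
import flwr as fl

class SiteClient(fl.client.NumPyClient):
    """One site's client: a toy logistic model on synthetic local data."""

    def __init__(self, n_samples: int = 500, n_features: int = 42, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.X = rng.normal(size=(n_samples, n_features)).astype(np.float32)
        self.y = (self.X[:, 0] + rng.normal(size=n_samples) > 0).astype(np.float32)
        self.w = np.zeros(n_features, dtype=np.float32)

    def get_parameters(self, config):
        return [self.w]

    def fit(self, parameters, config):
        self.w = parameters[0]
        # A few gradient steps stand in for the real local training loop.
        for _ in range(5):
            preds = 1.0 / (1.0 + np.exp(-self.X @ self.w))
            self.w = self.w - 0.1 * (self.X.T @ (preds - self.y)) / len(self.y)
        return [self.w], len(self.X), {}

    def evaluate(self, parameters, config):
        preds = 1.0 / (1.0 + np.exp(-self.X @ parameters[0]))
        loss = float(np.mean((preds - self.y) ** 2))
        acc = float(np.mean((preds > 0.5) == self.y))
        return loss, len(self.X), {"accuracy": acc}

if __name__ == "__main__":
    # The coordinator (hub of the star) runs a FedAvg server, e.g.:
    #   fl.server.start_server(server_address="0.0.0.0:8080",
    #                          config=fl.server.ServerConfig(num_rounds=50),
    #                          strategy=fl.server.strategy.FedAvg())
    # Each participating site then starts a client pointed at that address:
    fl.client.start_numpy_client(server_address="127.0.0.1:8080", client=SiteClient())
```

The server is the hub of the star: it broadcasts global parameters each round, and clients return updated parameters plus their local sample counts so the aggregation can be weighted by client size.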
3. Training
What This Step Covers
This is where the federated learning happens. The training step combines algorithm development, model training, evaluation, and privacy/security implementation. You will design your analytical approach, train models, rigorously evaluate results, and implement technical safeguards. This step generates your scientific findings and determines whether federated analytics successfully solved your problem.
Key Questions to Answer
What algorithms will you test? How do they handle data heterogeneity across clients? What does "success" look like (metrics, thresholds)? How does federated performance compare to centralized and local-only baselines? Does the model work fairly across all participating sites? What could go wrong (threat model)? What defenses have you implemented?
Who Is Involved
Data scientists and machine learning engineers design and implement methods. Domain experts (clinicians, scientists) validate that the approach makes sense for the problem. Statisticians ensure rigorous evaluation. Security engineers conduct threat modeling and implement controls. Privacy experts assess re-identification risks. Site coordinators support distributed execution.
Timeline
Algorithm development: Weeks to months depending on complexity. Training and evaluation: Weeks to months depending on number of experiments and computational resources. Privacy/security implementation: 2-8 weeks for basic controls, 4-8+ weeks for advanced privacy tech (differential privacy, secure multi-party computation). Total: This is typically the longest step.
Common Mistakes
Assuming "federated equals private" automatically (not true without additional safeguards), ignoring simulation versus production threat model differences, implementing differential privacy without understanding epsilon parameters, over-claiming privacy guarantees, no incident response plan, not including centralized baseline to quantify privacy-utility trade-off.
Deliverables
Trained models or computed statistics, comprehensive performance reports comparing multiple approaches, fairness analysis showing per-client results, documented hyperparameters and training configurations, threat model document, implemented security controls checklist, privacy impact assessment, incident response plan.
What to Document
- Algorithm Selection: List all algorithms tested, both federated (FedAvg, FedProx, etc.) and baselines (centralized, local-only). Describe your model architecture in detail. Explain hyperparameter choices and how you tuned them. Document training schedules (number of rounds, local epochs, batch sizes, early stopping). A sketch of the aggregation step appears after this list.
- Evaluation and Success Criteria: Define your primary metric (the ONE number you will use to judge success) and secondary metrics. Explain how evaluation happens: where are models tested (centrally or at each client), how do you aggregate local performance metrics, and how do you ensure test data never leaked into training. Compare federated performance to three baselines: (1) centralized, (2) local-only, (3) simpler methods.
- Fairness Analysis: Do results hold across all clients, or do some sites get worse predictions? Report performance stratified by client size, class balance, or other relevant factors.
- Threat Model and Controls: Identify potential adversaries (honest-but-curious server, malicious clients, external attackers, insiders) and attack vectors (model inversion, membership inference, gradient leakage, Byzantine updates). List implemented defenses (secure aggregation, encryption, differential privacy, access logging).
- Privacy Guarantees: What protection is actually provided? If using differential privacy: document privacy budget accounting. For simulations: note how threat model differs from live deployment. Be transparent about what is and is not protected.
- Reproducibility: Set and document random seeds for reproducibility. Note any sources of non-determinism (GPU operations, asynchronous updates, client ordering effects). Document software versions, hardware specs, dependencies.
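To make the aggregation, evaluation, and privacy items above concrete, here is a minimal NumPy sketch of FedAvg-style weighted averaging of client parameters, a size-weighted metric aggregator, and an optional Gaussian-noise step of the kind differential privacy mechanisms build on. It is illustrative only: production setups use the framework's built-in strategies, and real differential privacy requires per-update clipping and a privacy accountant, neither of which is shown here.

```python
# Sketch of FedAvg-style aggregation with an optional Gaussian-noise step.
# Illustrative only: real differential privacy also needs per-update clipping
# and a privacy accountant to track the epsilon/delta budget.
import numpy as np

def fedavg(client_weights, client_sizes):
    """FedAvg: size-weighted average of per-client parameter lists."""
    total = float(sum(client_sizes))
    n_layers = len(client_weights[0])
    return [
        sum(w[i] * (n / total) for w, n in zip(client_weights, client_sizes))
        for i in range(n_layers)
    ]

def add_gaussian_noise(params, sigma, seed=0):
    """Add isotropic Gaussian noise to aggregated parameters (DP-style step)."""
    rng = np.random.default_rng(seed)
    return [p + rng.normal(scale=sigma, size=p.shape) for p in params]

def weighted_metric(values, sizes):
    """Aggregate per-client evaluation metrics, weighted by client size."""
    return float(np.average(values, weights=sizes))

if __name__ == "__main__":
    # Tiny usage example with two fake "clients", each holding one weight matrix:
    clients = [[np.ones((3, 2))], [np.zeros((3, 2))]]
    print(fedavg(clients, client_sizes=[100, 300]))          # -> 0.25 everywhere
    print(weighted_metric([0.84, 0.78], sizes=[100, 300]))   # -> 0.795
```

The same `weighted_metric` helper can be reused when scoring the centralized, local-only, and simpler baselines, which keeps the three-way comparison consistent.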
Architecture: 5-layer multi-layer perceptron (MLP) with 512 neurons per layer, with AdaptiveDualBranchNet for personalization. Total parameters: ~330K. The centralized baseline achieved ROC-AUC ~0.81 and PR-AUC ~0.46 on the country-partitioned test set.
Key Finding: Personalized FL achieved ROC-AUC 0.84 ± 0.002, exceeding centralized baseline (0.81). Small clients benefit most from personalization, large clients show modest gains. 10 repeated runs with different seeds, low variance across experiments (SD ~0.001-0.003).
Threat Model: Honest-but-curious server attempting to infer patient information from model updates. Controls: Data localization, aggregation-based privacy, no raw data sharing, access controls. Limitations: No differential privacy (would degrade performance), no secure multi-party computation (complexity versus benefit).
Reproducibility: Seeds 0-9 for 10 runs, highly reproducible results. Environment: Python 3.9, PyTorch 1.12.1, Flower 1.5.0 on Linux with Intel Xeon CPUs.
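A small seeding helper along these lines (assuming PyTorch and NumPy, consistent with the environment above) is one way to make a "seeds 0-9" protocol explicit in code; note that GPU kernels and asynchronous client updates can still introduce some non-determinism.

```python
# Minimal seeding helper for repeated runs (e.g., seeds 0-9 as in the example).
# GPU kernels and asynchronous updates can still be non-deterministic.
import random
import numpy as np
import torch

def set_seed(seed: int) -> None:
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)  # no-op on CPU-only machines

if __name__ == "__main__":
    for seed in range(10):            # repeat the experiment with seeds 0-9
        set_seed(seed)
        # ... run one federated training + evaluation experiment here ...
```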
4. Deployment
What This Step Covers
This step honestly assesses where you are and what comes next. The deployment step combines reproducibility, sharing, and maturity assessment. Making your work reproducible is not optional; it is a scientific and ethical obligation. This step documents everything someone else needs to validate your findings or apply your methods to their own data. Not every federated learning project needs to reach production deployment; many generate valuable scientific insights while remaining research tools.
Key Questions to Answer
Can someone else reproduce your results with access to the same (or similar) data? Have you documented enough detail about your environment, hyperparameters, and preprocessing that results should be identical? Is this production-ready, or proof-of-concept? What evidence supports your maturity claim? What would it actually take (time, money, people, approvals) to reach the next level?
Who Is Involved
Research software engineers ensure code quality and documentation. Data stewards document data access procedures. Legal/ethics teams review what can be publicly shared. Principal investigators assess scientific maturity. Clinical champions evaluate deployment feasibility. IT leaders estimate infrastructure requirements. Funders consider return on investment.
Timeline
Reproducibility documentation: 2-4 weeks for code cleanup, documentation writing, and artifact archiving. Technology Readiness Level (TRL) assessment: 1-2 weeks after results finalize. Gap analysis and roadmap: 1-2 weeks if pursuing deployment. Total: 2-8 weeks depending on scope of sharing and deployment plans.
Common Mistakes
Waiting until submission deadline to organize code (leads to rushed, poor documentation), forgetting to document version numbers, releasing code that cannot run without undocumented dependencies, not explaining data access process clearly, overstating maturity (damages credibility), understating maturity (misses deployment opportunities).
Deliverables
Public code repository with documentation, environment specification files, archived artifacts with persistent identifiers (DOIs), data access guide, documented limitations, TRL assessment with evidence, gap analysis with resource estimates, deployment roadmap OR research-only justification, lessons learned document, recommendations for future work.
What to Document
- Code Repository: GitHub/GitLab repository URL with specific commit hash or release tag used for published results. README explaining how to run code. Requirements file with package versions. If using containers: Docker image tags or Conda environment files.
- Environment: Software versions (Python, R, libraries), operating system, hardware (CPU/GPU specs). Note: exact hardware may not be reproducible, but document what you used. A version-snapshot sketch appears after this list.
- Random Seeds: All seeds for data splits, initialization, sampling. Mention sources of non-determinism (GPU operations, async updates).
- Data Availability: Choose one path and explain clearly: (1) Public - provide dataset name and download link, (2) Restricted - explain application process, timeline, costs, (3) Synthetic - provide generator scripts or synthetic samples.
- Artifacts: List everything you are releasing: model weights (if allowed by governance), configuration files, evaluation outputs, figures. Use persistent identifiers (Zenodo DOI, Figshare).
- Limitations: Be honest about what does not work, what you could not validate, and where your approach might not generalize.
- TRL Assessment: Claim specific TRL (or range like 4-5) and justify with evidence: publications, pilot results, user feedback, regulatory interactions, deployment case studies.
- Gap Analysis: For each gap to next TRL, specify: (1) What needs to happen, (2) Who needs to do it, (3) Estimated time and cost, (4) Key risks or blockers.
- Lessons Learned: What worked well that you would recommend to others? What surprised you or turned out harder than expected? What would you do differently if starting over? Which assumptions turned out to be wrong?
- Next Steps: If continuing the work: prioritized action items with timelines and resource estimates. If wrapping up: suggestions for what future research should tackle.
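One lightweight way to capture those environment details is to snapshot package versions at run time and archive the file with your results, as in the sketch below. It assumes the listed packages (numpy, pandas, torch, flwr) are installed; adjust the list to whatever your project actually uses.

```python
# Sketch: snapshot the software environment and archive it with your results.
# Adjust the package list to whatever your project actually installs.
import json
import platform
import sys
from importlib.metadata import version

snapshot = {
    "python": sys.version.split()[0],
    "os": platform.platform(),
    "packages": {pkg: version(pkg) for pkg in ("numpy", "pandas", "torch", "flwr")},
}

with open("environment_snapshot.json", "w") as f:
    json.dump(snapshot, f, indent=2)

print(json.dumps(snapshot, indent=2))
```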
Technology Readiness Levels
- TRL 1-3: Basic research, proof of concept on toy data
- TRL 4: Technology validated in lab (realistic data, controlled environment)
- TRL 5: Technology validated in relevant environment (simulation with real data characteristics)
- TRL 6: Technology demonstrated in relevant environment (pilot with actual users)
- TRL 7-8: System prototype demonstration in operational environment
- TRL 9: Actual system proven in operational use
Code: GitHub repository with Apache 2.0 license, includes preprocessing, training, and evaluation scripts. Environment: Python 3.9, PyTorch 1.12.1, Flower 1.5.0 on Linux with Intel Xeon CPUs. Reproducibility: Seeds 0-9 for 10 runs, highly reproducible results (SD ~0.001-0.003).
Data Access: Restricted - researchers can apply at www.msbase.org with a 2-6 month approval timeline. Model weights: NOT released (privacy concerns even for aggregated models, plus MSBase data use restrictions).
Current TRL: 4-5 (validated in simulation with realistic data, published in a peer-reviewed journal). Gaps to TRL 6-7: live federation deployment, prospective validation, clinical usability studies, regulatory pathway. Timeline to TRL 7: 2-3 years.
Key Lessons: Personalization crucial for heterogeneous data, simulation valuable for algorithm development, stakeholder engagement needed earlier. Next steps: Apply for funding for live deployment pilot, identify 3-5 MSBase sites willing to pilot live FL infrastructure.