🚀 How to Use This Guide
Hey there! This guide walks you through building federated analytics projects from start to finish. Each step builds on the previous one, but feel free to adapt things to fit your specific needs and constraints.
This guide builds on our FL Lifecycle framework and pulls in lessons from real-world FL Use Cases across healthcare and research domains. Want the academic background? Check out McMahan et al. (2021) and Zhang et al. (2021).
We pair those theoretical foundations with actionable implementation steps: each section includes real examples, common pitfalls, and specific deliverables you can start producing right away. Check out our domain-specific examples.
📚 Quick Start Paths
Start with our FL Lifecycle and glossary, then dive into real examples. Want hands-on learning? Try our interactive tutorials.
Check out framework selection and browse our tools & resources. Popular picks include Flower, PySyft, and TensorFlow Federated.
Join our community or check out training materials. Looking for role-specific guidance? Check our role pages.
📋 Every Project is Unique: This guide gives you a solid roadmap for federated analytics implementation, but your journey will be shaped by your specific domain, data characteristics, and organizational context. Healthcare projects face different regulatory challenges than finance or research applications. For regulatory guidance, check out GDPR, HIPAA, and other relevant frameworks in your jurisdiction.
📅 Realistic Timelines: The timelines you'll see are based on successful projects across different domains. Your actual schedule will depend on data complexity, team experience, regulatory requirements, and infrastructure readiness. Ethics approval alone can take months in many jurisdictions, so plan accordingly. Use these estimates as planning references, not rigid deadlines.
🎯 Customize Your Approach: This guide covers the complete implementation pipeline, but you might need to adapt, skip, or modify steps based on your unique constraints. A research prototype has different requirements than a production healthcare system. Check out our use case library for domain-specific examples and adaptations. The goal is to give you a structured framework that you can tailor to your needs.
🚀 Ready to Start? Here's Your Implementation Roadmap
Now that you know how to use this guide, let's dive into the four core steps that'll take you from initial concept to successful deployment. Each step builds on the previous one, with clear deliverables and checkpoints along the way.
💡 Pro tip: You can click on any step card below to jump straight to the detailed implementation guide, or scroll through them sequentially to get the complete picture.
🎯 Planning
- Problem definition and federation justification
- Ethics approval and legal compliance
- Stakeholder identification and team formation
- Project timeline and resource planning
🔧 Preparation
- Data exploration and quality assessment
- Infrastructure setup and security configuration
- Federation framework selection and testing
- Data preprocessing and standardization
⚙️ Training
- Algorithm design and model architecture
- Federated training implementation
- Privacy-preserving techniques (differential privacy, secure multi-party computation)
- Performance evaluation and validation
🚀 Deployment
- Reproducibility documentation and code sharing
- Model maturity assessment and validation
- Deployment planning and production readiness
- Results dissemination and knowledge transfer
1. Planning
What This Step Covers
This is where everything begins. The planning step combines problem definition, scope setting, governance, and ethics approval. Before writing code or contacting potential partners, you need crystal-clear answers to fundamental questions: What problem are you solving? Why does it matter? Why can't you just pool the data centrally? What exactly will you deliver, and what is out of scope?
📚 New to federated learning? Start with our FL Lifecycle overview and glossary to understand key concepts. Explore real-world use cases to see how others have approached similar challenges.
Key Questions to Answer
What specific clinical, scientific, or operational gap are you filling? Who benefits (patients, researchers, policymakers)? Why is federated analytics necessary rather than just preferable? What does success look like concretely? What assumptions underpin your approach, and how will you validate them?
💡 Pro tip: Use our FL Use Cases to see how others have justified federation for similar problems.
Who Is Involved
Principal investigators and domain experts define the problem and objectives. Legal and compliance officers identify regulatory barriers to centralization. Ethics committees review protocols. Project managers establish scope boundaries and timelines. Funding agencies assess whether the problem justifies the resources.
Timeline
Initial scoping: 1-2 weeks of stakeholder discussions and literature review. Ethics and governance: 3-6 months (often the longest part of this step). Refinement happens throughout the project as you learn more, but major scope changes after the next step are costly.
Common Mistakes
Vague problem statements ("improve healthcare"), unrealistic objectives (overpromising performance), weak federation justification (could actually centralize with proper agreements), underestimating ethics timeline, unclear authorship rules leading to publication conflicts.
Deliverables
One-page project summary, stakeholder list with commitments, signed consortium agreement or data use agreements, ethics approval letters with reference numbers, explicit list of in-scope and out-of-scope deliverables, documented assumptions with validation plan.
What to Document
- Problem and Domain: Describe the clinical, scientific, or operational challenge you are addressing in plain language. What gap in knowledge or capability are you filling? Who will benefit from solving this problem?
- Why Federated Analytics: Explain the specific barriers preventing centralized data pooling. These might be regulatory (GDPR, HIPAA), institutional policy, data ownership concerns, trust issues between organizations, technical infrastructure limitations, or competitive considerations.
- Primary Objective: State your single most important goal in concrete, measurable terms. Be specific about what success looks like. Examples: "Train a prognostic model achieving AUROC ≥0.80" or "Estimate treatment effect heterogeneity across 20 sites with 95% confidence intervals."
- Population and Setting: Describe your data subjects or sources. For clinical studies: patient populations, inclusion/exclusion criteria, geographic and temporal scope. For other domains: IoT devices, sensor networks, administrative databases, etc.
- Stakeholders and Roles: List all key participants with their roles and responsibilities: principal investigators, data custodians at each site, technical developers, funders, oversight bodies, and end users.
- Ethics and Approvals: Name all institutional review boards, ethics committees, or data protection impact assessments that reviewed and approved your work. Include approval reference numbers, dates, and key conditions or restrictions.
- Legal Basis and Agreements: Describe the legal foundation for data processing (GDPR Article citations, HIPAA provisions, institutional policies). List data use agreements, consortium contracts, or memoranda of understanding between participating organizations.
- Access and Sharing Policy: Specify who can access what. For raw data: controlled access procedures, credentialing requirements. For intermediate artifacts: sharing rules. For results: publication policies, embargoes, site-level result disclosure rules.
- Outputs: List all deliverables. For modeling projects: trained models, performance metrics, calibration curves. For analytics: summary statistics, dashboards, reports. For infrastructure: frameworks, APIs, deployment guides.
- Out of Scope: Explicitly state what this project will NOT address. This prevents misunderstandings and helps focus resources.
- Assumptions and Constraints: List key assumptions underlying your approach (e.g., "Sites use consistent diagnostic criteria," "Clients remain online during training"). Note technical, organizational, and regulatory constraints. Describe how you will validate assumptions.
Background: This example is based on a real-world federated learning study for multiple sclerosis (MS) disability progression prediction, published in Pirmani et al. (2025). The study demonstrates how federated learning can be applied to clinical prediction tasks while maintaining data privacy and regulatory compliance.
Problem: Early prediction of disability progression in multiple sclerosis (MS) remains challenging despite its critical importance for therapeutic decision-making. MS affects an estimated 2.8 million people worldwide (Walton et al., 2020), with each patient experiencing a unique disease course and varying responses to treatment. The primary challenge lies in capturing this heterogeneity to enable personalized, data-driven treatment strategies. While machine learning shows promise for improving our understanding of MS progression and predicting individual treatment responses, developing advanced ML models remains constrained by limited access to large-scale, high-quality datasets: the clinical data needed for precision modeling remain fragmented and siloed across healthcare institutions.
Why Federated: Aggregating MS clinical data across international institutions is complicated by legitimate but complex regulatory constraints (GDPR, national health data laws), data ownership concerns, and inconsistent data quality standards. Healthcare institutions are reluctant to share raw patient records due to privacy regulations, competitive concerns, and liability fears. Federated learning offers a decentralized learning paradigm that enables training ML models while preserving data localization, strongly aligned with data privacy and protection standards. This approach allows collaborative model development without requiring data centralization.
Objective: Assess whether personalized federated learning can match or exceed centralized model performance in predicting 2-year disability progression in multiple sclerosis patients, while maintaining data localization and privacy. Success defined as achieving comparable ROC-AUC to centralized baseline (~0.81) using federated approaches across multiple international sites.
Ethics Approval: Hasselt University and KU Leuven PRET Approval: G 2023 6771. Legal Basis: GDPR Article 6(1)(e) + Article 9(2)(j) for health data research with appropriate safeguards.
2. Preparation
What This Step Covers
This is where planning becomes implementation. The preparation step combines data understanding, infrastructure setup, and technical architecture. You will characterize what data exist at each site, understand data heterogeneity, set up federation infrastructure, and prepare data pipelines for federated training.
🛠️ Need framework guidance? Check our FL Framework Assembly and Tools and Resources for technical recommendations. For data harmonization challenges, explore our use case examples.
Key Questions to Answer
Where does your data live? How many clients will participate? How different are they from each other (data volume, feature distributions, outcome prevalence)? What clinical or technical vocabularies do you need to align? What data quality issues exist? How will you handle train/test splits in a distributed setting? What normalization strategy avoids information leakage?
Who Is Involved
Data engineers at each site implement local preprocessing pipelines. Machine learning engineers set up the federation framework (Flower, PySyft, TensorFlow Federated, etc.). IT administrators configure servers and network access. DevOps engineers handle monitoring and logging. System architects design the topology.
Timeline
Data characterization: 2-4 weeks. Infrastructure setup: 2-4 weeks. Integration and debugging: 2-4 weeks. Total: 4-12 weeks depending on the number of sites and technical complexity. More sites means more coordination overhead.
Common Mistakes
Assuming all sites have similar data quality or feature availability, underestimating harmonization effort, ignoring extreme heterogeneity that might make federation infeasible, inconsistent preprocessing across sites (subtle bugs multiply), ignoring data leakage (test statistics leaking into normalization), underestimating infrastructure complexity (firewalls, authentication, monitoring).
Deliverables
Client inventory with data volumes, data quality assessment reports from each site, harmonization mapping documents, validated preprocessing pipelines running at each site, federation infrastructure passing integration tests, monitoring dashboards operational, documented hardware and software specifications, runbooks for troubleshooting common issues.
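The data quality assessment reports listed above don't need heavy tooling. Here is a minimal sketch of a local quality summary, assuming pandas and a tabular CSV extract at each site; the file name and columns are placeholders, not any registry's actual schema. Each site runs it locally and shares only the aggregate report, never the underlying records.

```python
# Minimal per-site data quality summary -- a sketch, assuming a local tabular
# extract in CSV form. The file name and columns are illustrative placeholders.
import pandas as pd

def quality_report(path: str) -> pd.DataFrame:
    df = pd.read_csv(path)
    report = pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing_pct": df.isna().mean().round(3) * 100,
        "n_unique": df.nunique(),
        "min": df.min(numeric_only=True),
        "max": df.max(numeric_only=True),
    })
    report["n_rows"] = len(df)
    return report

if __name__ == "__main__":
    # Each site runs this locally and shares only the aggregate report.
    print(quality_report("site_data.csv"))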
What to Document
- Data Sources: List all participating data silos (hospitals, clinics, research cohorts, IoT deployments). Provide context on their systems: EHR vendors, data collection protocols, approximate data volumes. Describe how you defined "clients" (by institution, by country, by device type, etc.).
- Federation Mode: Specify whether you are using simulated federation (single infrastructure with partitioned data), live federation (truly distributed clients), or hybrid approaches. If simulated: describe data partitioning strategy and what real-world factors you are NOT capturing.
- Inclusion Criteria: Define eligibility for records, subjects, or data samples at both the dataset level (which sites participate) and record level (which observations are analyzed). Document any temporal restrictions, data quality filters, or minimum sample size requirements.
- Features and Data Dictionary: List all variables used in your analysis, grouped by type (demographics, clinical measures, lab values, etc.). Provide or link to a detailed data dictionary with units, permissible ranges, and definitions. Note any derived features (feature engineering).
- Outcome Definitions: For supervised learning: precisely define your target variable(s). For hypothesis testing: state your null and alternative hypotheses. Include operational definitions, time windows, confirmation requirements, and any exclusions.
- Dataset Characteristics: Report sample sizes (total and per client), class balance or outcome prevalence, known biases or selection effects, and missing data patterns. Quantify heterogeneity across clients if applicable.
- Data Quality and Harmonization: Note any data quality issues discovered during exploration (missingness, outliers, measurement errors). List vocabularies, ontologies, coding systems, and unit conventions used for harmonization across sites.
- Local Data Preparation: Describe your complete preprocessing pipeline: data transformations, filtering, feature engineering, temporal windowing, and quality checks. Explain how you create train/validation/test splits (temporal, random, stratified). Detail your normalization strategy (local statistics, global statistics, fixed references). A minimal sketch of a leakage-free pipeline follows this list.
- Federation Infrastructure: Describe your network topology (star, hierarchical, peer-to-peer), orchestration framework, scheduling approach, client participation policies, and hardware specifications. Include monitoring, failure recovery, and baseline security measures (transport encryption, authentication).
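To make the split and normalization guidance concrete, here is a minimal sketch of a leakage-free local preparation step: splits are made at the subject level so no patient appears in both partitions, and scaling statistics come from the training portion only. It assumes pandas and scikit-learn; `patient_id` and `feature_cols` are illustrative placeholders rather than an actual schema, and the single 80/20 split is a simplification of the 60/20/20 protocol used in the example below.

```python
# Sketch of a leakage-free local preprocessing step (assumes pandas and
# scikit-learn). Splits are made per subject; the scaler is fit on train only.
# `patient_id` and `feature_cols` are illustrative placeholders.
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit
from sklearn.preprocessing import StandardScaler

def split_and_normalize(df: pd.DataFrame, feature_cols: list,
                        group_col: str = "patient_id", seed: int = 0):
    # Hold out 20% of subjects so no subject spans both train and test.
    splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=seed)
    train_idx, test_idx = next(splitter.split(df, groups=df[group_col]))
    train, test = df.iloc[train_idx].copy(), df.iloc[test_idx].copy()

    # Fit normalization statistics on the local training set ONLY,
    # then apply them to both partitions.
    scaler = StandardScaler().fit(train[feature_cols])
    train[feature_cols] = scaler.transform(train[feature_cols])
    test[feature_cols] = scaler.transform(test[feature_cols])
    return train, test, scaler
```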
Data Source: MSBase international MS registry, a large prospective observational cohort collecting routine clinical data from MS patients worldwide. Clients defined at country level, resulting in multiple federated sites. This choice balances sufficient sample size per client (many countries have 1000+ patients) with meaningful clinical heterogeneity (MS care practices, genetic backgrounds, environmental factors vary by country).
Federation Mode: Simulated on the Flanders Supercomputer; this captures data heterogeneity but not real-world network challenges. Data were partitioned by country to create multiple virtual clients, but all data physically reside on the same computing cluster. Each virtual client has exclusive access to its country's data subset during training (enforced programmatically).
Features: 42 tabular features including demographics, Expanded Disability Status Scale (EDSS) scores, relapse history, treatment records. Outcome: Confirmed Disability Progression (CDP), sustained EDSS increase confirmed at 6 months. Final dataset: 283,115 episodes from 26,246 patients across multiple countries.
Infrastructure: Centralized star topology with Flower 1.5.0, 32 clients on Flanders Supercomputer. Preprocessing: Episode construction with 3.25-year observation windows, 60/20/20 train/validation/test splits at patient level. Normalization: Per-client using training set statistics to avoid information leakage.
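For orientation, here is a heavily simplified sketch of what one site's client can look like in a Flower star topology, using the Flower 1.x NumPyClient API (`fl.client.NumPyClient`, `fl.client.start_numpy_client`, `fl.server.start_server`). The toy linear model and synthetic data are placeholders; this is not the study's code or its actual architecture.

```python
# Heavily simplified Flower star-topology sketch (Flower 1.x NumPyClient API).
# The toy linear model and synthetic data are placeholders -- not the study's
# actual pipeline or architecture.
import numpy as np
import flwr as fl

class SiteClient(fl.client.NumPyClient):
    """One site's client: a toy logistic model on synthetic local data."""

    def __init__(self, n_samples: int = 500, n_features: int = 42, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.X = rng.normal(size=(n_samples, n_features)).astype(np.float32)
        self.y = (self.X[:, 0] + rng.normal(size=n_samples) > 0).astype(np.float32)
        self.w = np.zeros(n_features, dtype=np.float32)

    def get_parameters(self, config):
        return [self.w]

    def fit(self, parameters, config):
        self.w = parameters[0]
        # A few gradient steps stand in for the real local training loop.
        for _ in range(5):
            preds = 1.0 / (1.0 + np.exp(-self.X @ self.w))
            self.w = self.w - 0.1 * (self.X.T @ (preds - self.y)) / len(self.y)
        return [self.w], len(self.X), {}

    def evaluate(self, parameters, config):
        preds = 1.0 / (1.0 + np.exp(-self.X @ parameters[0]))
        loss = float(np.mean((preds - self.y) ** 2))
        acc = float(np.mean((preds > 0.5) == self.y))
        return loss, len(self.X), {"accuracy": acc}

if __name__ == "__main__":
    # The coordinator (hub of the star) runs a FedAvg server, e.g.:
    #   fl.server.start_server(server_address="0.0.0.0:8080",
    #                          config=fl.server.ServerConfig(num_rounds=50),
    #                          strategy=fl.server.strategy.FedAvg())
    # Each participating site then starts a client pointed at that address:
    fl.client.start_numpy_client(server_address="127.0.0.1:8080", client=SiteClient())
```

The server is the hub of the star: it broadcasts global parameters each round, and clients return updated parameters plus their local sample counts so the aggregation can be weighted by client size.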
3. Training
What This Step Covers
This is where the federated learning happens. The training step combines algorithm development, model training, evaluation, and privacy/security implementation. You will design your analytical approach, train models, rigorously evaluate results, and implement technical safeguards. This step generates your scientific findings and determines whether federated analytics successfully solved your problem.
Key Questions to Answer
What algorithms will you test? How do they handle data heterogeneity across clients? What does "success" look like (metrics, thresholds)? How does federated performance compare to centralized and local-only baselines? Does the model work fairly across all participating sites? What could go wrong (threat model)? What defenses have you implemented?
Who Is Involved
Data scientists and machine learning engineers design and implement methods. Domain experts (clinicians, scientists) validate that the approach makes sense for the problem. Statisticians ensure rigorous evaluation. Security engineers conduct threat modeling and implement controls. Privacy experts assess re-identification risks. Site coordinators support distributed execution.
Timeline
Algorithm development: Weeks to months depending on complexity. Training and evaluation: Weeks to months depending on number of experiments and computational resources. Privacy/security implementation: 2-8 weeks for basic controls, 4-8+ weeks for advanced privacy tech (differential privacy, secure multi-party computation). Total: This is typically the longest step.
Common Mistakes
Assuming "federated equals private" automatically (not true without additional safeguards), ignoring simulation versus production threat model differences, implementing differential privacy without understanding epsilon parameters, over-claiming privacy guarantees, no incident response plan, not including centralized baseline to quantify privacy-utility trade-off.
Deliverables
Trained models or computed statistics, comprehensive performance reports comparing multiple approaches, fairness analysis showing per-client results, documented hyperparameters and training configurations, threat model document, implemented security controls checklist, privacy impact assessment, incident response plan.
What to Document
- Algorithm Selection: List all algorithms tested, both federated (FedAvg, FedProx, etc.) and baselines (centralized, local-only). Describe your model architecture in detail. Explain hyperparameter choices and how you tuned them. Document training schedules (number of rounds, local epochs, batch sizes, early stopping). A sketch of the aggregation step appears after this list.
- Evaluation and Success Criteria: Define your primary metric (the ONE number you will use to judge success) and secondary metrics. Explain how evaluation happens: where are models tested (centrally or at each client), how do you aggregate local performance metrics, and how do you ensure test data never leaked into training. Compare federated performance to three baselines: (1) centralized, (2) local-only, (3) simpler methods.
- Fairness Analysis: Do results hold across all clients, or do some sites get worse predictions? Report performance stratified by client size, class balance, or other relevant factors.
- Threat Model and Controls: Identify potential adversaries (honest-but-curious server, malicious clients, external attackers, insiders) and attack vectors (model inversion, membership inference, gradient leakage, Byzantine updates). List implemented defenses (secure aggregation, encryption, differential privacy, access logging).
- Privacy Guarantees: What protection is actually provided? If using differential privacy: document privacy budget accounting. For simulations: note how threat model differs from live deployment. Be transparent about what is and is not protected.
- Reproducibility: Set and document random seeds for reproducibility. Note any sources of non-determinism (GPU operations, asynchronous updates, client ordering effects). Document software versions, hardware specs, dependencies.
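To make the aggregation, evaluation, and privacy items above concrete, here is a minimal NumPy sketch of FedAvg-style weighted averaging of client parameters, a size-weighted metric aggregator, and an optional Gaussian-noise step of the kind differential privacy mechanisms build on. It is illustrative only: production setups use the framework's built-in strategies, and real differential privacy requires per-update clipping and a privacy accountant, neither of which is shown here.

```python
# Sketch of FedAvg-style aggregation with an optional Gaussian-noise step.
# Illustrative only: real differential privacy also needs per-update clipping
# and a privacy accountant to track the epsilon/delta budget.
import numpy as np

def fedavg(client_weights, client_sizes):
    """FedAvg: size-weighted average of per-client parameter lists."""
    total = float(sum(client_sizes))
    n_layers = len(client_weights[0])
    return [
        sum(w[i] * (n / total) for w, n in zip(client_weights, client_sizes))
        for i in range(n_layers)
    ]

def add_gaussian_noise(params, sigma, seed=0):
    """Add isotropic Gaussian noise to aggregated parameters (DP-style step)."""
    rng = np.random.default_rng(seed)
    return [p + rng.normal(scale=sigma, size=p.shape) for p in params]

def weighted_metric(values, sizes):
    """Aggregate per-client evaluation metrics, weighted by client size."""
    return float(np.average(values, weights=sizes))

if __name__ == "__main__":
    # Tiny usage example with two fake "clients", each holding one weight matrix:
    clients = [[np.ones((3, 2))], [np.zeros((3, 2))]]
    print(fedavg(clients, client_sizes=[100, 300]))          # -> 0.25 everywhere
    print(weighted_metric([0.84, 0.78], sizes=[100, 300]))   # -> 0.795
```

The same `weighted_metric` helper can be reused when scoring the centralized, local-only, and simpler baselines, which keeps the three-way comparison consistent.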
Architecture: 5-layer multi-layer perceptron (MLP) with 512 neurons per layer, with AdaptiveDualBranchNet for personalization. Total parameters: ~330K. The centralized baseline achieved ROC-AUC ~0.81 and PR-AUC ~0.46 on the country-partitioned test set.
Key Finding: Personalized FL achieved ROC-AUC 0.84 ± 0.002, exceeding centralized baseline (0.81). Small clients benefit most from personalization, large clients show modest gains. 10 repeated runs with different seeds, low variance across experiments (SD ~0.001-0.003).
Threat Model: Honest-but-curious server attempting to infer patient information from model updates. Controls: Data localization, aggregation-based privacy, no raw data sharing, access controls. Limitations: No differential privacy (would degrade performance), no secure multi-party computation (complexity versus benefit).
Reproducibility: Seeds 0-9 for 10 runs, highly reproducible results. Environment: Python 3.9, PyTorch 1.12.1, Flower 1.5.0 on Linux with Intel Xeon CPUs.
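A small seeding helper along these lines (assuming PyTorch and NumPy, consistent with the environment above) is one way to make a "seeds 0-9" protocol explicit in code; note that GPU kernels and asynchronous client updates can still introduce some non-determinism.

```python
# Minimal seeding helper for repeated runs (e.g., seeds 0-9 as in the example).
# GPU kernels and asynchronous updates can still be non-deterministic.
import random
import numpy as np
import torch

def set_seed(seed: int) -> None:
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)  # no-op on CPU-only machines

if __name__ == "__main__":
    for seed in range(10):            # repeat the experiment with seeds 0-9
        set_seed(seed)
        # ... run one federated training + evaluation experiment here ...
```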
4. Deployment
What This Step Covers
This step honestly assesses where you are and what comes next. The deployment step combines reproducibility, sharing, and maturity assessment. Making your work reproducible is not optional; it is a scientific and ethical obligation. This step documents everything someone else needs to validate your findings or apply your methods to their own data. Not every federated learning project needs to reach production deployment; many generate valuable scientific insights while remaining research tools.
Key Questions to Answer
Can someone else reproduce your results with access to the same (or similar) data? Have you documented enough detail about your environment, hyperparameters, and preprocessing that results should be identical? Is this production-ready, or proof-of-concept? What evidence supports your maturity claim? What would it actually take (time, money, people, approvals) to reach the next level?
Who Is Involved
Research software engineers ensure code quality and documentation. Data stewards document data access procedures. Legal/ethics teams review what can be publicly shared. Principal investigators assess scientific maturity. Clinical champions evaluate deployment feasibility. IT leaders estimate infrastructure requirements. Funders consider return on investment.
Timeline
Reproducibility documentation: 2-4 weeks for code cleanup, documentation writing, and artifact archiving. Technology Readiness Level (TRL) assessment: 1-2 weeks after results finalize. Gap analysis and roadmap: 1-2 weeks if pursuing deployment. Total: 2-8 weeks depending on scope of sharing and deployment plans.
Common Mistakes
Waiting until submission deadline to organize code (leads to rushed, poor documentation), forgetting to document version numbers, releasing code that cannot run without undocumented dependencies, not explaining data access process clearly, overstating maturity (damages credibility), understating maturity (misses deployment opportunities).
Deliverables
Public code repository with documentation, environment specification files, archived artifacts with persistent identifiers (DOIs), data access guide, documented limitations, TRL assessment with evidence, gap analysis with resource estimates, deployment roadmap OR research-only justification, lessons learned document, recommendations for future work.
What to Document
- Code Repository: GitHub/GitLab repository URL with specific commit hash or release tag used for published results. README explaining how to run code. Requirements file with package versions. If using containers: Docker image tags or Conda environment files.
- Environment: Software versions (Python, R, libraries), operating system, hardware (CPU/GPU specs). Note: exact hardware may not be reproducible, but document what you used. A version-snapshot sketch appears after this list.
- Random Seeds: All seeds for data splits, initialization, sampling. Mention sources of non-determinism (GPU operations, async updates).
- Data Availability: Choose one path and explain clearly: (1) Public - provide dataset name and download link, (2) Restricted - explain application process, timeline, costs, (3) Synthetic - provide generator scripts or synthetic samples.
- Artifacts: List everything you are releasing: model weights (if allowed by governance), configuration files, evaluation outputs, figures. Use persistent identifiers (Zenodo DOI, Figshare).
- Limitations: Be honest about what does not work, what you could not validate, and where your approach might not generalize.
- TRL Assessment: Claim specific TRL (or range like 4-5) and justify with evidence: publications, pilot results, user feedback, regulatory interactions, deployment case studies.
- Gap Analysis: For each gap to next TRL, specify: (1) What needs to happen, (2) Who needs to do it, (3) Estimated time and cost, (4) Key risks or blockers.
- Lessons Learned: What worked well that you would recommend to others? What surprised you or turned out harder than expected? What would you do differently if starting over? Which assumptions turned out to be wrong?
- Next Steps: If continuing the work: prioritized action items with timelines and resource estimates. If wrapping up: suggestions for what future research should tackle.
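One lightweight way to capture those environment details is to snapshot package versions at run time and archive the file with your results, as in the sketch below. It assumes the listed packages (numpy, pandas, torch, flwr) are installed; adjust the list to whatever your project actually uses.

```python
# Sketch: snapshot the software environment and archive it with your results.
# Adjust the package list to whatever your project actually installs.
import json
import platform
import sys
from importlib.metadata import version

snapshot = {
    "python": sys.version.split()[0],
    "os": platform.platform(),
    "packages": {pkg: version(pkg) for pkg in ("numpy", "pandas", "torch", "flwr")},
}

with open("environment_snapshot.json", "w") as f:
    json.dump(snapshot, f, indent=2)

print(json.dumps(snapshot, indent=2))
```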
Technology Readiness Levels
- TRL 1-3: Basic research, proof of concept on toy data
- TRL 4: Technology validated in lab (realistic data, controlled environment)
- TRL 5: Technology validated in relevant environment (simulation with real data characteristics)
- TRL 6: Technology demonstrated in relevant environment (pilot with actual users)
- TRL 7-8: System prototype demonstration in operational environment
- TRL 9: Actual system proven in operational use
Code: GitHub repository with Apache 2.0 license, includes preprocessing, training, and evaluation scripts. Environment: Python 3.9, PyTorch 1.12.1, Flower 1.5.0 on Linux with Intel Xeon CPUs. Reproducibility: Seeds 0-9 for 10 runs, highly reproducible results (SD ~0.001-0.003).
Data Access: Restricted - researchers can apply at www.msbase.org with a 2-6 month approval timeline. Model weights: NOT released (privacy concerns even for aggregated models, plus MSBase data use restrictions).
Current TRL: 4-5 (validated in simulation with realistic data, published in a peer-reviewed journal). Gaps to TRL 6-7: live federation deployment, prospective validation, clinical usability studies, regulatory pathway. Timeline to TRL 7: 2-3 years.
Key Lessons: Personalization crucial for heterogeneous data, simulation valuable for algorithm development, stakeholder engagement needed earlier. Next steps: Apply for funding for live deployment pilot, identify 3-5 MSBase sites willing to pilot live FL infrastructure.