Distributed Systems - what should be considered
Design in a Distributed World
Composition
Typical Architecture
- Load balancer with multiple backend replicas
- Server with multiple backends
- Server tree
Distributed State
The CAP Principle
- Consistency
- Availability
- Partition Tolerance
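The CAP trade-off becomes concrete when a network partition splits the replicas. The sketch below makes one labeled assumption that the CAP principle itself does not prescribe: a hypothetical store that commits writes through a majority quorum. On the minority side of a partition, a CP store refuses writes, while an AP store accepts them and reconciles later.

```python
# Hypothetical sketch: function name and the "CP"/"AP" modes are illustrative only.

def handle_write(reachable_replicas: int, total_replicas: int, mode: str) -> str:
    """Decide how to treat a write, given how many replicas (including this
    node) are reachable across a possible network partition."""
    if mode not in ("CP", "AP"):
        raise ValueError("mode must be 'CP' or 'AP'")
    majority = total_replicas // 2 + 1
    if reachable_replicas >= majority:
        # A majority is reachable, so the write can be committed consistently.
        return "committed"
    if mode == "CP":
        # Minority side of a partition: give up availability to stay consistent.
        return "rejected"
    # AP: stay available, accept locally, and reconcile after the partition heals.
    return "accepted-locally"


if __name__ == "__main__":
    # Five replicas split 2 | 3 by a partition; this node is on the side with 2.
    for mode in ("CP", "AP"):
        print(mode, handle_write(2, 5, mode))
```

Reconciliation (for example last-writer-wins or application-level merges) is left out. That step is where an AP design pays for its availability.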
Design for Operations
Operational Requirements
- Configuration
- Startup and shutdown
- Queue draining
- Software upgrades
- Backups and restores
- Redundancy
- Replicated databases
- Hot swaps
- Toggles for individual features
- Graceful degradation
- Access controls and rate limits
- Data import controls
- Monitoring
- Auditing
- Debug instrumentation
- Exception collection
- Documentation for Operations
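The "access controls and rate limits" requirement is often implemented with a token bucket. What follows illustrates that technique, not a prescribed design. The class name and the injectable `clock` parameter are assumptions, added so the behavior can be tested without sleeping.

```python
import time
from typing import Callable


class TokenBucket:
    """Token-bucket rate limiter: permits bursts of up to `capacity` requests
    and refills at `rate` tokens per second."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity  # start full so an initial burst is allowed
        self.clock = clock
        self.last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        """Consume `cost` tokens if available; False means 'throttle this request'."""
        now = self.clock()
        # Refill in proportion to elapsed time, never beyond capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


if __name__ == "__main__":
    limiter = TokenBucket(rate=5, capacity=10)  # hypothetical per-client limits
    allowed = sum(limiter.allow() for _ in range(15))
    print(f"{allowed} of 15 back-to-back requests allowed")
```

Keeping one bucket per client (or per API key) turns this into a per-client limit. A single shared bucket limits the service as a whole.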
Platform Selection
Platform Description
A platform may be described along three axes:
- Level of service abstraction: IaaS, PaaS, SaaS
- Type of machine: Physical, virtual, or process container
- Level of resource sharing: Shared or private
Selection Strategies
Common Strategies
- Default to Virtual
- Make a Cost-Based Decision
- Leverage Provider Expertise
- Get Started Quickly
- Implement Ephemeral Computing
- Use the Cloud for Overflow Capacity
- Leverage Superior Infrastructure
- Develop an In-House Service Provider
- Contract for an On-Premises, Externally Run Service
- Implement a Bare Metal Cloud
Application Architectures
General architecture categories
- Single-Machine Web Server
- Two-Tier Web Service
- Three-Tier Web Service
- Four-Tier Web Service
Load Balancer Types
- DNS Round Robin
- Layer 3 and 4 Load Balancers
- Layer 7 Load Balancers
Load Balancing Methods
- Round Robin (RR)
- Weighted RR
- Least Loaded (LL)
- Least Loaded with Slow Start
- Utilization Limit
- Latency
- Cascade
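Here is a minimal sketch of round robin, weighted RR, least loaded, and a slow-start ramp. It assumes the load balancer already knows its backend names, their configured weights, and the number of active requests per backend; all of these are hypothetical inputs. Slow start is approximated here by capping a new backend's concurrent requests while it warms up.

```python
import itertools
from typing import Dict, Iterator, List


def round_robin(backends: List[str]) -> Iterator[str]:
    """Round robin: hand out backends in a fixed rotating order."""
    return itertools.cycle(backends)


def weighted_round_robin(weights: Dict[str, int]) -> Iterator[str]:
    """Weighted RR (naive form): a backend with weight w is chosen w times per cycle."""
    expanded = [name for name, w in weights.items() for _ in range(w)]
    return itertools.cycle(expanded)


def least_loaded(active: Dict[str, int]) -> str:
    """Least loaded: pick the backend with the fewest active requests."""
    return min(active, key=active.get)


def slow_start_limit(seconds_since_added: float, ramp_seconds: float,
                     full_limit: int) -> int:
    """Slow start: a newly added backend's concurrent-request cap grows
    linearly from 1 to `full_limit` over `ramp_seconds`."""
    fraction = min(1.0, seconds_since_added / ramp_seconds)
    return max(1, int(full_limit * fraction))


if __name__ == "__main__":
    rr = round_robin(["web1", "web2", "web3"])
    print([next(rr) for _ in range(4)])  # ['web1', 'web2', 'web3', 'web1']
    print(least_loaded({"web1": 12, "web2": 3, "web3": 7}))  # web2
```

The naive weighted RR sends a heavy backend its share in one burst per cycle. Smoother interleavings exist, but they keep the same per-cycle proportions.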
Other architectures to consider
- Reverse Proxy Service
- Cloud-Scale Service
- Message Bus Architectures
- Service-Oriented Architecture
Design for Scaling
General Strategy
- Identify Bottlenecks
- Reengineer Components
- Measure Results
- Be Proactive
The AKF Scaling Cube
Developed by Abbott, Keeven, and Fisher
- x: Horizontal Duplication (also known as horizontal scaling or scaling out)
- y: Functional or Service Splits
- z: Lookup-Oriented Split
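The z-axis splits data by a lookup key, such as a customer ID, so that each shard holds only a subset. This sketch assumes hash-then-modulo placement, which is one simple option among several; `shard_for` is a hypothetical helper name.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a lookup key (e.g. a user ID) to one of `num_shards` shards.

    A cryptographic digest is used instead of Python's built-in hash(),
    which is randomized per process for strings and so is not stable
    across servers or restarts.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


if __name__ == "__main__":
    for user in ("alice", "bob", "carol"):
        print(user, "-> shard", shard_for(user, 4))
```

With plain modulo, changing `num_shards` moves most keys to a different shard. Consistent hashing or a lookup directory reduces that movement, at the cost of extra bookkeeping.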
Other techniques to consider
- Caching
- Data Sharding
- Threading
- Queueing
- CDN
Caching
- Cache Effectiveness
- Cache Placement
- Cache Persistence
- Cache Replacement Algorithms
- Cache Entry Invalidation
- Cache Size
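Several of the caching topics above (effectiveness, replacement, invalidation, size) fit in one small class. The sketch assumes an in-process cache with least-recently-used (LRU) replacement, which is only one of several replacement algorithms. The `load` callback stands in for a hypothetical slower backend.

```python
from collections import OrderedDict
from typing import Any, Callable, Hashable


class LRUCache:
    """Fixed-size cache with least-recently-used replacement that counts
    hits and misses so its effectiveness can be measured."""

    def __init__(self, size: int):
        self.size = size
        self.entries: OrderedDict = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, key: Hashable, load: Callable[[Hashable], Any]) -> Any:
        if key in self.entries:
            self.hits += 1
            self.entries.move_to_end(key)  # mark as most recently used
            return self.entries[key]
        self.misses += 1
        value = load(key)  # fetch from the slower backing store
        self.entries[key] = value
        if len(self.entries) > self.size:
            self.entries.popitem(last=False)  # evict the least recently used entry
        return value

    def invalidate(self, key: Hashable) -> None:
        """Drop an entry whose source data has changed."""
        self.entries.pop(key, None)

    def hit_ratio(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


if __name__ == "__main__":
    cache = LRUCache(size=2)
    for key in [1, 2, 1, 3, 2]:
        cache.get(key, load=lambda k: k * 10)
    print(f"hit ratio: {cache.hit_ratio():.2f}")  # 0.20: only the second lookup of 1 hits
```

The hit ratio (hits divided by total lookups) is the usual measure of cache effectiveness. If it stays low, a larger cache or a different key design may help more than a faster backend.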
Design for Resiliency
Key considerations for resiliency
- Everything Malfunctions Eventually
- Resiliency through Spare Capacity
- Failure Domains
- Software Failures
- Physical Failures
- Overload Failures
- Human Error
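Resiliency through spare capacity is often expressed as N + M: N replicas to carry peak load plus M spares. This worked sketch uses hypothetical numbers and helper names. It also checks whether losing an entire failure domain still leaves N replicas, assuming replicas are spread evenly across domains.

```python
import math


def replicas_needed(peak_qps: float, qps_per_replica: float, spares: int = 1) -> int:
    """N + M: replicas to carry peak load (N) plus `spares` (M) for failures."""
    n = math.ceil(peak_qps / qps_per_replica)
    return n + spares


def survives_domain_loss(replicas_per_domain: int, domains: int, n_required: int) -> bool:
    """True if losing any single failure domain (rack, datacenter, ...) leaves
    at least `n_required` replicas, assuming replicas are spread evenly."""
    return replicas_per_domain * (domains - 1) >= n_required


if __name__ == "__main__":
    # Hypothetical load: 4,500 QPS peak, each replica handles 1,000 QPS.
    print(replicas_needed(4500, 1000))  # 6 (N = 5, plus 1 spare)
    # Those 6 replicas as 2 per rack across 3 racks: losing a rack leaves 4 < 5.
    print(survives_domain_loss(2, 3, n_required=5))  # False
```

The second check shows why N + 1 counted across the whole fleet may not be enough. The spare capacity has to cover the largest failure domain, not just a single machine.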
Operations in a Distributed World
Distributed Systems Operations
- Defining SRE
- Change versus Stability
- Operations at Scale
Service Life Cycle
- Service Launch
- Emergency Tasks
- Nonemergency Tasks
- Upgrades
- Decommissioning
- Project Work
Organizing Strategy
Daily work categories
- Emergency Issues
- Normal Requests
- Project Work
Team Member Day Types
- Project-Focused Days
- Oncall Days
- Ticket Duty Days
DevOps
Three Ways of DevOps
- Workflow
  - Ensure each step is done in a repeatable way
  - Never pass defects to the next step
  - Ensure no local optimizations degrade global performance
  - Increase the flow of work
- Improve Feedback
  - Understand and respond to all customers, internal and external
  - Shorten feedback loops
  - Amplify all feedback
  - Embed knowledge where it is needed
- Continual Experimentation and Learning
  - Rituals are created that reward risk taking
  - Management allocates time for projects that improve the system
  - Faults are introduced into the system to increase resilience
  - You try “crazy” or audacious things
Common Technical DevOps Practices
- Same Development and Operations Toolchain
- Consistent Software Development Life Cycle (SDLC)
- Managed Configuration and Automation
- Infrastructure as Code
- Automated Provisioning and Deployment
- Artifact-Scripted Database Changes
- Automated Build and Release
- Release Vehicle Packaging
- Abstracted Administration
Service Delivery
Build-Phase Steps: Develop -> Commit -> Build -> Package -> Register
Benefits a good service delivery platform should provide:
- Confidence
- Reduced Risk
- Shorter Interval from Keyboard to Production
- Less Wait Time
- Less Rework
- Improved Execution
- A Culture of Continuous Improvement
- Improved Job Satisfaction
Deployment-Phase Steps: Promote -> Install -> Configure
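Both the build phase and the deployment phase are ordered sequences of steps. A failed step should stop the flow rather than pass a defect along. This minimal sketch assumes each step is a hypothetical callable that returns True on success. Real pipelines would invoke build tools, a package registry, and deployment tooling at these points.

```python
from typing import Callable, List, Tuple

Step = Tuple[str, Callable[[], bool]]


def run_pipeline(steps: List[Step]) -> List[str]:
    """Run steps in order and stop at the first failure, so a defect is never
    handed to the next step. Returns the names of the steps that were attempted."""
    attempted: List[str] = []
    for name, step in steps:
        attempted.append(name)
        if not step():
            print(f"step {name!r} failed; stopping the pipeline")
            break
    return attempted


if __name__ == "__main__":
    # Placeholder steps; 'package' is made to fail to show the early stop.
    steps: List[Step] = [
        ("commit", lambda: True),
        ("build", lambda: True),
        ("package", lambda: False),
        ("register", lambda: True),
    ]
    print(run_pipeline(steps))  # ['commit', 'build', 'package']
```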
Upgrading
There are many ways to upgrade a service:
- Taking the Service Down for Upgrading
- Rolling Upgrades
- Canary
- Phased Roll-outs
- Proportional Shedding
- Blue-Green Deployment
- Toggling Features
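A rolling upgrade replaces hosts one batch at a time while the rest keep serving. If the upgrade stops at the first unhealthy batch, that first batch effectively acts as a canary. This sketch assumes hypothetical `upgrade` and `healthy` callbacks supplied by the caller. Draining each host before upgrading it is noted only as a comment.

```python
from typing import Callable, List


def rolling_upgrade(
    hosts: List[str],
    upgrade: Callable[[str], None],
    healthy: Callable[[str], bool],
    batch_size: int = 1,
) -> List[str]:
    """Upgrade hosts a batch at a time and halt as soon as an upgraded host
    fails its health check. Returns the hosts upgraded successfully."""
    done: List[str] = []
    for i in range(0, len(hosts), batch_size):
        batch = hosts[i:i + batch_size]
        for host in batch:
            upgrade(host)  # a real system would drain the host's traffic first
        if not all(healthy(h) for h in batch):
            raise RuntimeError(f"health check failed in batch {batch}; halting roll-out")
        done.extend(batch)
    return done


if __name__ == "__main__":
    fleet = ["web1", "web2", "web3", "web4"]
    # Hypothetical callbacks: pretend every upgraded host passes its check.
    print(rolling_upgrade(fleet, upgrade=lambda h: None,
                          healthy=lambda h: True, batch_size=2))
```

Small batches limit the damage from a bad release, while large batches finish sooner. Phased roll-outs make the same trade-off at the level of whole user groups or datacenters.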
Taking feature toggles as an example, reasons to use flag flips include:
- Rapid Development
- Gradual Introduction of New Features
- Finely Timed Release Dates
- Dynamic Roll Backs
- Bug Isolation
- A/B Testing
- One Percent Testing
- Differentiated Services
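Gradual introduction, one percent testing, and A/B testing all need a stable way to decide which users see a flag. This minimal sketch assumes users are bucketed by hashing the flag name together with a user ID. `flag_enabled` and the flag names are hypothetical.

```python
import hashlib


def flag_enabled(flag: str, user_id: str, percent: float) -> bool:
    """Return True if `user_id` falls within the first `percent`% of buckets
    for `flag`. Buckets are stable, so a user's answer never flips randomly."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10000  # 0..9999, i.e. 0.01% steps
    return bucket < percent * 100


if __name__ == "__main__":
    users = [f"user{i}" for i in range(10000)]
    for pct in (1, 10, 50):
        enabled = sum(flag_enabled("new-checkout", u, pct) for u in users)
        print(f"{pct}% rollout -> {enabled} of {len(users)} users")
```

Because each user's bucket stays fixed, raising `percent` from 1 to 10 only adds users, and nobody who already saw the feature loses it. Including the flag name in the hash keeps different experiments from always selecting the same users.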
Continuous Deployment
Factors to take into consideration when deciding whether to pause continuous delivery:
- Build Health
- Test Comprehensiveness
- Test Reproducibility
- Production Health
- Schedule Permission
- Oncall Schedule
- Manual Stop
- Push Conflicts
- Intentional Delays
- Resource Contention
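Several of these factors can be combined into a single gate that a pipeline consults before each push. This sketch assumes a hypothetical `DeployState` snapshot that covers only a subset of the factors above. How each signal is measured is outside its scope.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DeployState:
    # Hypothetical snapshot of signals a pipeline might consult before a push.
    build_green: bool
    tests_passed: bool
    production_healthy: bool
    in_allowed_window: bool
    oncall_available: bool
    manual_stop: bool
    conflicting_push: bool


def pause_reasons(s: DeployState) -> List[str]:
    """Return the reasons to pause; an empty list means the push may proceed."""
    checks = [
        (not s.build_green, "build is not healthy"),
        (not s.tests_passed, "tests did not pass"),
        (not s.production_healthy, "production is unhealthy"),
        (not s.in_allowed_window, "outside the permitted schedule"),
        (not s.oncall_available, "no oncall coverage"),
        (s.manual_stop, "manual stop is set"),
        (s.conflicting_push, "another push is in progress"),
    ]
    return [reason for failed, reason in checks if failed]


if __name__ == "__main__":
    state = DeployState(True, True, True, True, True, manual_stop=True,
                        conflicting_push=False)
    print(pause_reasons(state) or "clear to push")
```

Returning the reasons instead of a bare boolean tells whoever is watching the pipeline why a push is paused.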
Terms to Know
- Server
- Service
- Machine
- QPS
- Traffic
- Performant: A neologism from merging “performance” and “conformant”.
- IaaS
- SaaS
- PaaS
- Oversubscribed
- Undersubscribed
- Static Content
- Dynamic Content
- Database-Driven Dynamic Content
- Control Panel
- Main Database
- Trend Server
- Link Redirect Servers
- Content Delivery Networks
- Outage
- Failure
- Malfunction
- MTBF
- Innovate
- Oncall
- Soft launch
- SRE
- Stakeholders
- Artifacts
- Service Delivery Flow
- Cycle Time
- Deployment
- Release Candidate
- Release
- Domain-Specific Language
- Toil
Reference
- The Practice of Cloud System Administration: DevOps and SRE Practices for Web Services, Volume 2 (Thomas A. Limoncelli, Strata R. Chalup, Christina J. Hogan)
- https://www.atlassian.com/microservices/microservices-architecture/distributed-architecture
- https://en.wikipedia.org/wiki/Distributed_computing
- https://aws.amazon.com/builders-library/challenges-with-distributed-systems/
- Understanding Distributed Systems (Roberto Vitillo)
- Foundations of Scalable Systems (Ian Gorton)
- Distributed Systems: Concepts and Design, 5th ed. (George Coulouris, Jean Dollimore, Tim Kindberg, Gordon Blair; Pearson, 2011)
- Designing Distributed Systems: Patterns and Paradigms for Scalable, Reliable Services (Brendan Burns)
- The Art of Scalability: Scalable Web Architecture, Processes, and Organizations for the Modern Enterprise (Martin L. Abbott, Michael T. Fisher, 2009)
- Scalability Rules: 50 Principles for Scaling Web Sites (Martin L. Abbott, Michael T. Fisher, 2011)
Disclaimer
- Licensed under CC BY-NC 4.0
- Copyright issue feedback: me#imzye.me (replace # with @)
- Not all the commands and scripts are tested in a production environment; use at your own risk
- No privacy information is collected here