Scaling Cloud and Distributed Applications: Lessons and Strategies

Key Takeaways

  • Design for unpredictable scale: handle traffic spikes of ten times normal load or more with reserved capacity and circuit breakers.
  • Classify infrastructure by criticality and focus effort strategically; not everything needs one hundred percent availability.
  • Automate everything: build self-healing systems that recover before human intervention.
  • Optimize performance at every layer: edge computing, traffic shaping, and content delivery networks (CDNs) for speed.
  • Contain the blast radius: multi-region architecture that isolates failures to small user percentages.

This article shares goals and strategies for scaling cloud and distributed applications, focusing on lessons learned from the cloud migration of Chase.com at JP Morgan Chase.

The discussion centers on three primary goals and the strategies that address them, concluding with how these approaches were achieved in practice. For those managing large-scale systems, these lessons provide guidance drawn from years of experience at our financial institution and others.

Planning typically accounts for load increases of two or three times, but when systems are deployed on the internet, control over incoming traffic, timing, and use patterns becomes impossible. Any event can trigger massive load increases, whether from legitimate business growth or malicious actors. Both scenarios present distinct challenges.

Security controls can block malicious traffic, but different considerations arise when genuine customer demand surges due to market volatility. Customers require access to financial transactions precisely when such situations occur. Multiple components can fail during system stress: network devices, load balancers, applications, and database connections can all break at the same time.

Goals

Our cloud migration focused on three primary objectives: scaling in a cost-effective and efficient manner; achieving high resilience, which is particularly important for financial institutions; and delivering strong performance so that slow systems do not drive users to alternative services.

Scaling Efficiently

Achieving efficiency requires analyzing customer use patterns and behavior. Organizations must develop predictive capabilities while maintaining adaptability through elastic scaling.

Traffic shaping provides a methodology for identifying frequently used functionality, allowing focused scaling of critical applications.

Overall capacity management represents another critical element. Simply adding servers does not guarantee success. Trade-offs with cost exist and require careful consideration.

Traffic Patterns and Sizing

Traffic patterns are fundamental to efficient scaling. Average traffic represents the baseline that systems manage regularly. Predictable patterns exist, driven by recurring events such as paycheck deposits, prompting customers to verify account balances. Seasonal peaks occur throughout the year, requiring proactive planning.

Unexpected events present a different challenge. DDoS attacks occur frequently, with traffic reaching ten times normal load or higher. Attackers utilize the same cloud resources available to legitimate users. Organizations must block these attacks while simultaneously maintaining service level agreements for legitimate customers executing genuine transactions.

Proper sizing based on established patterns helps prevent operational issues. However, elastic scaling presents limitations; during the scaling process, applications must start and establish connections to services and databases. Establishing connections requires time. By the time instances reach operational readiness, several minutes may have elapsed. When large volumes arrive during simultaneous instance startup, contention emerges across the system.

Rather than relying exclusively on elastic scaling, the complete operational picture must be considered, including patterns and related factors. Reserved compute capacity addresses these challenges. Reserved resources ensure availability when needed, particularly when contention exists across other organizations using shared service pools. Reserved compute also yields cost savings.

Cost management requires ongoing attention. FinOps processes should be applied regularly, on a monthly or weekly interval, rather than occasionally.

Scaling Beyond Server Addition

Scaling extends beyond simply adding servers. When scaling occurs, the fundamental question is whether the application requires scaling due to genuine customer demand or whether upstream services experiencing queuing issues slow system response. When threads wait for responses and cannot execute, pressure increases on CPU and memory resources, triggering elastic scaling even though actual demand has not grown.

This scenario requires designing for failure and integrating with scaling strategies. Circuit breakers provide a critical mechanism here. When upstream services slow down or fail, circuit breakers prevent applications from waiting indefinitely for responses. Instead, they enforce timeout limits, allowing the system to either receive a successful response within the defined window or fail fast and move on. This design prevents thread exhaustion, reduces unnecessary resource consumption, and stops false scaling triggers. Without circuit breakers, slow dependencies can cascade into system-wide performance degradation, causing elastic scaling to add more servers that cannot solve the underlying dependency problem.
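
As a minimal sketch of the circuit breaker pattern described above (not the production implementation; the class name, thresholds, and cool-down values are illustrative), the Python below counts consecutive failures, opens the circuit so callers fail fast instead of waiting, and allows a trial call after a cool-down period.

    import time

    class CircuitBreaker:
        """Minimal circuit breaker: fail fast while open, retry after a cool-down."""

        def __init__(self, failure_threshold=5, reset_timeout_s=30.0):
            self.failure_threshold = failure_threshold   # consecutive failures before opening
            self.reset_timeout_s = reset_timeout_s       # how long to stay open
            self.failures = 0
            self.opened_at = None                        # None means the breaker is closed

        def call(self, fn, *args, **kwargs):
            # While open, reject immediately instead of tying up a thread.
            if self.opened_at is not None:
                if time.monotonic() - self.opened_at < self.reset_timeout_s:
                    raise RuntimeError("circuit open: failing fast")
                self.opened_at = None                    # half-open: allow one trial call
            try:
                result = fn(*args, **kwargs)
            except Exception:
                self.failures += 1
                if self.failures >= self.failure_threshold:
                    self.opened_at = time.monotonic()    # open the circuit
                raise
            self.failures = 0                            # success resets the counter
            return result

A real implementation would also bound per-call latency, for example by passing a timeout to the HTTP client, so that slow responses count as failures rather than holding threads.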

Being Highly Resilient

Resiliency requires preparation for inevitable system failures. Early detection and readiness to execute failover procedures are critical. However, achieving one hundred percent availability for all components is neither practical nor necessary.

Infrastructure can be categorized into four tiers based on criticality. Components identified as "critical" must maintain availability as close to one hundred percent as possible. DNS exemplifies this category; regardless of how well-architected a site may be, DNS failure prevents all access.

The "manageable" tier contains components where failover permits continued operation if failures occur. This permits targeting four nines of availability (99.99 percent), translating to approximately fifty-two minutes of acceptable downtime annually.

The "tolerable" tier accommodates components with built-in resilience. Token services that cache data for extended periods exemplify this category. If the service becomes unavailable during the cache validity period, operations continue using cached data.

Finally, the "acceptable" tier includes components where limited data loss is permissible, such as certain logging systems. Impact severity defines resiliency targets.
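
To make the tiers concrete, the short calculation below converts an availability target into an annual downtime budget, which is how the roughly fifty-two minutes for four nines is derived. The tier-to-target mapping shown here is illustrative; actual targets depend on business impact.

    MINUTES_PER_YEAR = 365 * 24 * 60

    def annual_downtime_minutes(availability_pct):
        """Allowed downtime per year for a given availability target."""
        return MINUTES_PER_YEAR * (1 - availability_pct / 100)

    # Illustrative targets per tier (actual targets depend on business impact).
    for tier, target in [("critical", 99.999), ("manageable", 99.99),
                         ("tolerable", 99.9), ("acceptable", 99.0)]:
        print(f"{tier:10s} {target:7.3f}% -> {annual_downtime_minutes(target):8.1f} min/year")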

Performance

Performance significantly impacts both user experience and infrastructure costs. Not all applications function identically. Points of presence (PoPs) can be used to provide better customer experiences, because lag on websites is particularly problematic, especially on mobile devices.

Speed matters considerably because it builds trust with users who want better, faster experiences. Search engines like Google recognize this importance by incorporating speed into their ranking algorithms. Mobile performance becomes especially critical when network connectivity is constrained. From an infrastructure perspective, when customers spend less time on the infrastructure to achieve the same objectives, operational costs decrease.

Application of comprehensive performance strategies resulted in a seventy-one percent reduction in latency from initial implementation through full deployment of these architectural approaches. These strategies can be adapted to other business contexts.

The Power of Five: Key Strategies

Five focus areas guide the architectural approach: multi-region deployment, high-performance optimization, comprehensive automation, observability with self-healing capabilities, and robust security.

Multi-Region

Multi-region architecture creates isolation and segmentation for functional separation. This approach facilitates management of region failures, zone failures, and network failures while containing the blast radius. With a customer base of ninety-four million, zone-level failures can be constrained to impact only a small percentage of users rather than the entire population.

Multi-region implementation requires addressing DNS management, because different regions maintain separate load balancers requiring coordination. Traffic management between regions must be determined. Within regions containing multiple zones, options exist for traffic distribution.

Zonal Failures

Consider a scenario where a load balancer distributes traffic to two zones within a single region. Each application reports a healthy status, and zones appear functional, resulting in continued traffic flow to both zones. However, if one application experiences connectivity problems with backend systems in one zone while the other zone operates normally, traffic continues arriving at the impaired zone. If the application implements readiness and liveness probes but fails to incorporate dependent system status into health checks, problems occur. Without appropriate feedback mechanisms, the load balancer continues routing traffic, leading to application failures.

Resolution requires either propagating dependency health information through readiness and liveness probes back to the load balancer or implementing proxy-based rerouting to functional zones. Both internal and external failures need effective management to address application downtime.
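
A minimal sketch of the first option, assuming a Kubernetes-style readiness endpoint: the handler below folds dependency checks (database, cache, downstream API) into the readiness response, so the load balancer stops routing to a zone whose backends are unreachable. The dependency-check functions are hypothetical placeholders.

    def check_database():
        """Hypothetical dependency check; replace with a real connectivity probe."""
        return True

    def check_cache():
        return True

    def check_downstream_api():
        return True

    def readiness():
        """Readiness must reflect dependencies, not just the process being alive.

        Returning 503 removes this instance from the load balancer's pool,
        so an impaired zone stops receiving traffic instead of failing requests.
        """
        dependencies = {
            "database": check_database(),
            "cache": check_cache(),
            "downstream_api": check_downstream_api(),
        }
        healthy = all(dependencies.values())
        status_code = 200 if healthy else 503
        return status_code, dependencies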

Regional Failures

In a multi-region deployment, a single, uniform regional pulse check, executed every 10 seconds, ensures consistent and timely visibility into regional health. Critical decisions involve determining whether a failure necessitates complete failover to an alternate region or whether degraded service remains acceptable. Degraded service viability depends on application segmentation. If critical services fail, for example when the dashboard landing page is failing, failover may be necessary. If less critical elements fail, the application can continue running to avoid impact. Failover creates thundering herd effects (when an entire region fails, the sudden surge of redirected traffic can overwhelm the remaining regions, and auto-scaling may require time to provision the additional capacity) and other problems from traffic redistribution. Health check criteria, including failure and success thresholds for application health checks, determine the appropriate response.
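
The sketch below illustrates the threshold logic in code form: a region is declared unhealthy only after several consecutive failed pulse checks, failures of critical segments trigger failover, and non-critical failures leave the region serving in degraded mode. The segment names and thresholds are hypothetical, not the actual configuration.

    FAILURE_THRESHOLD = 3   # consecutive failed pulse checks before acting
    SUCCESS_THRESHOLD = 2   # consecutive successes before trusting the region again
    CRITICAL_SEGMENTS = {"login", "accounts", "payments"}

    class RegionHealth:
        def __init__(self):
            self.consecutive_failures = 0
            self.consecutive_successes = 0
            self.healthy = True

        def record_pulse(self, failed_segments):
            """Called on each pulse (e.g. every 10 seconds) with the set of failing segments."""
            critical_failure = bool(failed_segments & CRITICAL_SEGMENTS)
            if critical_failure:
                self.consecutive_failures += 1
                self.consecutive_successes = 0
            else:
                self.consecutive_successes += 1
                self.consecutive_failures = 0

            if self.healthy and self.consecutive_failures >= FAILURE_THRESHOLD:
                self.healthy = False
                return "failover"        # critical segments down: shift traffic away
            if not self.healthy and self.consecutive_successes >= SUCCESS_THRESHOLD:
                self.healthy = True
                return "restore"
            if failed_segments and not critical_failure:
                return "degraded"        # keep serving; only non-critical segments failed
            return "no_action"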

Multi-Region Challenges

Data replication across regions and ensuring data consistency represent primary concerns. Customer sharding offers one approach when data centers exist in only a few locations but customers are distributed nationwide. Sharding customers and serving them from geographically proximate locations can reduce or even avoid cross-region replication and simplify the architecture.
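
As a hedged sketch of the sharding idea (the region names and state mapping are hypothetical and would be a business decision in practice), customers can be assigned a home region based on geography so their data is served locally rather than replicated everywhere.

    # Hypothetical home regions per U.S. state; the real mapping is a business decision.
    REGION_BY_STATE = {"NY": "east", "NJ": "east", "CA": "west", "WA": "west"}
    DEFAULT_REGION = "east"

    def home_region(customer_state):
        """Serve each customer from the geographically closest region."""
        return REGION_BY_STATE.get(customer_state, DEFAULT_REGION)

    # A customer's data lives (and is served) in their home region,
    # which avoids replicating every record to every region.
    print(home_region("CA"))   # -> "west"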

State management requires strategic decisions. Managing state to maintain session affinity for active sessions, while retaining the ability to fail over when necessary, contributes to effective operation.

High Performance

High performance is essential for user experience. Effective performance can be compared to a reliable dial tone; users expect an immediate response without delay. Edge computing provides a primary method for achieving performance objectives. Modern websites with sophisticated user interfaces are content-heavy. Content can be offloaded to Point of Presence (PoP) locations proximate to customers, while origin servers handle only dynamic operations and critical services: login, accounts, and payments.

Traffic shaping allows traffic categorization. Critical traffic represents functions essential for business operations: daily customer activities such as login, balance checks, and payments. Resources allocated to critical services must remain continuously operational; degradation of other traffic under stress conditions may be acceptable.
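
As an illustrative sketch (the route-to-category mapping is hypothetical), traffic shaping can start with classifying requests by route and shedding only non-critical categories when the system is under stress.

    # Hypothetical route prefixes per category; real classification is richer.
    CRITICAL_PREFIXES = ("/login", "/accounts", "/payments")
    DEGRADABLE_PREFIXES = ("/offers", "/insights", "/marketing")

    def classify(path):
        if path.startswith(CRITICAL_PREFIXES):
            return "critical"      # must keep serving, even under stress
        if path.startswith(DEGRADABLE_PREFIXES):
            return "degradable"    # may be throttled or served stale during spikes
        return "standard"

    def admit(path, under_stress):
        """Shed only non-critical traffic when the system is under stress."""
        return classify(path) == "critical" or not under_stress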

Content Delivery

Geographic distribution significantly affects performance. If asset downloads require traversing long distances for each request, physical network barriers create substantial latency. When identical content is available at PoP locations with cached resources, retrieval occurs in less than 100 milliseconds rather than the longer response time (e.g., greater than 500ms) required for origin server access. Security benefits accompany performance improvements because malicious traffic can be blocked at the edge.

Last-mile connectivity warrants attention. Internet operations involve multiple network hops between endpoints. Edge computing alters this dynamic: traffic travels from the user to a nearby edge location, typically in one network hop, and then over optimized networks that operate more efficiently than standard ISP-to-ISP connections.

Mobile applications present optimization opportunities: they provide local storage where resources can be cached, including resolved network addresses, configuration settings, and prefetched content.

Automation

Automation represents a critical strategic element. Comprehensive automation across the entire pipeline at every stage provides substantial benefits, encompassing deployment, infrastructure provisioning, environment provisioning, health checks integrated with automated actions, and overall traffic management.

Architecture must extend beyond documentation. Creating opinionated architecture templates assists teams in building applications that automatically inherit architectural standards. Applications deploy automatically using manifest-based definitions, so that teams can focus on business functionality rather than infrastructure tooling complexities.
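
As a hedged illustration of the manifest-based approach (the field names are hypothetical, not the actual template schema), an opinionated template can merge team-supplied values over architectural defaults so every application automatically inherits standards such as multi-region deployment and health checks.

    # Defaults every application inherits from the opinionated template.
    ARCHITECTURE_DEFAULTS = {
        "regions": ["region-a", "region-b"],      # hypothetical region names
        "min_instances_per_zone": 2,
        "health_check_path": "/ready",
        "repave_interval_days": 14,
    }

    def render_manifest(app_overrides):
        """Team manifests declare only business-specific values; the rest is inherited."""
        manifest = dict(ARCHITECTURE_DEFAULTS)
        manifest.update(app_overrides)
        return manifest

    # A team declares only what is specific to its application.
    print(render_manifest({"name": "payments-api", "min_instances_per_zone": 4}))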

Repaving Infrastructure

Infrastructure repaving represents a highly effective practice of systematically rebuilding infrastructure each sprint. Automated processes clean up running instances regularly. This approach enhances security by eliminating configuration drift. When drift exists or patches require application, including zero-day vulnerability fixes, all updates can be systematically incorporated.

Extended operation periods create stale resources, performance degradation, and security vulnerabilities. Recreating environments at defined intervals (weekly or bi-weekly) occurs automatically. Traffic is gracefully removed from running systems, environments are rebuilt, and services are relaunched, providing operational stability.

Repaving implementation involves multiple components. Automated scripts monitor the lifecycle of running instances. Time-based validity triggers route removal, preventing new requests while allowing existing requests to complete. Instances are then shut down, nodes are cleaned, and new instances are created. During new instance creation, updated images can be deployed for zero-day vulnerabilities or security patches, or instances can simply be recreated. Policies determine specific actions. All processes are automated, with traffic removal preceding repaving to ensure zero customer impact.
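
The sketch below captures this lifecycle in code form. It is an outline of the sequence, not the actual automation: the helper functions are hypothetical stand-ins, and the two-week validity window mirrors the weekly or bi-weekly cadence mentioned above.

    from dataclasses import dataclass
    from datetime import datetime, timedelta, timezone

    MAX_INSTANCE_AGE = timedelta(days=14)   # repave cadence (weekly or bi-weekly)

    @dataclass
    class Instance:
        name: str
        created_at: datetime

    # The helpers below are hypothetical stand-ins for the real automation steps.
    def drain(instance): print(f"{instance.name}: route removed, no new requests")
    def wait_for_inflight(instance): print(f"{instance.name}: waiting for in-flight requests")
    def rebuild(instance): print(f"{instance.name}: node cleaned, patched image applied")
    def relaunch(instance): print(f"{instance.name}: new instance started")
    def verify_ready(instance): print(f"{instance.name}: health checks passed, route restored")

    def repave_if_expired(instance):
        """Repave once the instance's time-based validity has elapsed."""
        if datetime.now(timezone.utc) - instance.created_at < MAX_INSTANCE_AGE:
            return "still valid"
        drain(instance)
        wait_for_inflight(instance)
        rebuild(instance)
        relaunch(instance)
        verify_ready(instance)
        return "repaved"

    repave_if_expired(Instance("web-1", datetime.now(timezone.utc) - timedelta(days=20)))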

Automated Failover

Automated failover with graceful degradation requires consideration of active sessions. For customers with ongoing processing, session handling differs from that of new incoming sessions, which require routing decisions. Failover loops must be prevented; if both regions are unhealthy, continuous switching between them worsens problems. Latency tolerance varies by scenario; non-critical service failures may permit continued operation in the affected location.
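
A minimal sketch of the loop-prevention idea, under stated assumptions (region names and the cool-down interval are hypothetical): fail over only when the target region is actually healthier, and rate-limit how often a switch is allowed.

    import time

    MIN_SECONDS_BETWEEN_SWITCHES = 300   # hypothetical cool-down between failovers
    _last_switch = 0.0

    def choose_region(active, standby, health):
        """Return the region to serve from; avoid flapping between unhealthy regions."""
        global _last_switch
        if health[active]:
            return active                      # nothing to do
        if not health[standby]:
            return active                      # both unhealthy: switching only adds churn
        if time.monotonic() - _last_switch < MIN_SECONDS_BETWEEN_SWITCHES:
            return active                      # too soon after the previous switch
        _last_switch = time.monotonic()
        return standby

    # Example: active region unhealthy, standby healthy -> fail over.
    print(choose_region("east", "west", {"east": False, "west": True}))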

Observability and Self-Healing

Observability requires an automated response to observed events. Cloud environments generate numerous events across components: system events, infrastructure events, and application events. All observable events require automated action. Automation integrates with observability through serverless functions that trigger automatically upon event detection, executing regional switches based on defined criteria.

Database problems trigger separate functions for database switching. Maintenance activities can trigger functions to block specific regions or virtual private clouds (VPCs). These examples demonstrate automated actions that can be implemented while ensuring integration with observability. Dashboard monitoring provides supplementary value but should not serve as the primary response mechanism.
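
As an illustrative sketch (the event fields and action names are hypothetical), a handler of this kind, typically deployed as a serverless function triggered by the monitoring pipeline, maps observed events to automated actions such as a regional switch or a database failover.

    def handle_event(event):
        """Map an observability event to an automated action.

        In practice this runs as a serverless function triggered on event
        detection; the event schema here is a hypothetical example.
        """
        kind = event.get("type")
        if kind == "region_unhealthy":
            return {"action": "switch_region", "from": event["region"]}
        if kind == "database_failure":
            return {"action": "switch_database", "cluster": event["cluster"]}
        if kind == "maintenance_window":
            return {"action": "block_vpc", "vpc": event["vpc"]}
        return {"action": "none"}   # unknown events fall through to dashboards and alerts

    print(handle_event({"type": "region_unhealthy", "region": "east"}))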

Health Checks

Health monitoring must occur at multiple levels. At the application level, health determination may involve complex assessments: whether the application itself operates properly and whether connectivity to databases, caches, and other systems remains functional. Complex criteria can exist within the health checker, but the returned status must be simple: a Boolean value indicating healthy or unhealthy.

Within applications, health checks exist that propagate to the zonal level, which examines all instances. This information moves to the VPC level for overall VPC health assessment, then feeds into the global router. At each level, automated health assessment is achieved by using simple Boolean indicators for rapid decision-making. This approach achieves self-healing through systematic health check implementation.
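
A minimal sketch of that aggregation, with hypothetical names and thresholds: each level reduces whatever complexity it has to a single Boolean, and the next level only consumes Booleans, which keeps routing decisions fast.

    def zone_healthy(instance_statuses, min_healthy=1):
        """A zone is healthy if enough of its instances report healthy."""
        return sum(instance_statuses) >= min_healthy

    def vpc_healthy(zone_statuses):
        """A VPC is healthy if at least one of its zones is healthy."""
        return any(zone_statuses)

    def global_routing(vpc_status_by_region):
        """The global router only sees a Boolean per region."""
        return [region for region, ok in vpc_status_by_region.items() if ok]

    # Instance -> zone -> VPC -> global, all reduced to simple Booleans.
    zones = {"zone-a": [True, True], "zone-b": [False, False]}
    vpc_ok = vpc_healthy([zone_healthy(statuses) for statuses in zones.values()])
    print(global_routing({"east": vpc_ok, "west": True}))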

Decision Criteria

Consider two example scenarios. When alerts indicate node unavailability and capacity is compromised, traffic may need redirection away from the affected VPC due to provider issues. When application alerts indicate latency problems and performance is compromised, organizations must decide between continuing with degraded services or meeting service level agreement (SLA) requirements based on business demands. In such cases, choosing to continue with degraded services accepts slower performance rather than moving to another zone where the same issue may exist.

Gray failures represent ambiguous scenarios where failures are not deterministic, but connectivity exists. Network-related failures may be harder to diagnose. When a business function is compromised, rerouting to healthy zones may be an option. Various actions can be applied based on observability data.

Robust Security

Security requires a layered implementation following a zero-trust model. Each layer must function independently, assuming potential failure of other layers. Client devices may be compromised by malware. Perimeter security implements filtering and web application firewalls at the edge. Internal networks require segmentation and isolation. Container security involves image scanning and least privilege principles. Application security ensures proper authentication and authorization. Data security implements encryption and privacy controls. Each layer reinforces the others.

Migration

Cultural transformation is fundamental to successful migration because cloud operations differ significantly from traditional on-premises systems. Cloud providers continuously update services, network policies evolve, browsers change, and numerous factors require ongoing adaptation. The Well-Architected Framework and related principles provide guidance.

The ownership model, where teams own, build, and deploy their solutions, places responsibility with application teams. Human error and oversight are inevitable. Automation provides consistency.

Testing and Verification

Testing approaches vary in methodology. Tools like Chaos Monkey provide reactive testing by introducing failures into running systems. Failure Mode and Effects Analysis (FMEA) provides predictive analysis through systematic component evaluation to identify potential failures and develop mitigation strategies. Both approaches offer value, though FMEA is preferred for comprehensive testing at each application layer, ensuring analysis and mitigation strategy development.

TrueCD was developed as the company’s CI/CD methodology, a twelve-step automated process with comprehensive documentation available in published blog posts. This process functions analogously to pre-flight safety checks in aviation.

Abstraction Layer

Transitions from on-premises to cloud impact application architecture. Applications contain substantial business logic, and continuous changes introduce effects that can impact business operations. Abstraction layers minimize these impacts. This architectural approach uses best-in-class components across single clouds, multiple clouds, on-premises infrastructure, or hybrid combinations. Dapr is a well-regarded open-source framework that supports multi-cloud architectures.

Moving Customer Traffic

Large application migrations cannot be completed instantaneously. Systems can be validated initially with internal user populations, allowing applications to stabilize. Expedited timelines often prove problematic because certain issues and use patterns require extended observation periods for detection. Applications require adequate operational time for optimization.

With extensive functionality portfolios, completing all features within available timeframes may be infeasible. Segmenting systems into discrete application sets addresses this challenge. Throughout migration phases, customer populations can be transitioned incrementally in small percentages, ultimately achieving complete migration.
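
As a hedged sketch of incremental cutover (the bucketing scheme and percentages are illustrative), customers can be deterministically bucketed so that a configurable share is routed to the new platform while the rest remain on the existing one, and the share is raised gradually until migration is complete.

    import hashlib

    def routed_to_new_platform(customer_id, rollout_percent):
        """Deterministically send a fixed percentage of customers to the new platform.

        Hashing keeps each customer on the same side of the split across requests,
        so the rollout percentage can be raised gradually until it reaches 100.
        """
        digest = hashlib.sha256(customer_id.encode()).hexdigest()
        bucket = int(digest, 16) % 100          # stable bucket in [0, 100)
        return bucket < rollout_percent

    print(routed_to_new_platform("customer-123", rollout_percent=5))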

Results

Implementation of these strategies yielded measurable outcomes, including significant cost reductions. Performance metrics improved substantially, with the platform achieving top rankings in comparative analyses. Public Dynatrace reports comparing U.S. banks identify sites with sub-one-second response times as delivering optimal performance.

Conclusion

Several key considerations emerge from these strategies. Trade-offs are inevitable. Cost implications and performance must be considered without compromising other requirements. For example, when running multi-region architectures, cache replication decisions require examination: whether to maintain caches in one region or in multiple regions. Operational complexity increases because cloud architectures use many components. Reducing this complexity and minimizing manual effort in application monitoring are essential, and automation serves as the key mechanism for this reduction.

Blast radius containment remains critical. Sites will experience issues, and components will fail. When failures occur, the impact scope matters: whether all customers are affected or only a small subset. Ensuring action-oriented observability tied to automated actions is vital.

Customer focus must drive all decisions. Business operations serve customers. Consider the dial tone experience: When picking up a phone, users expect to hear the dial tone immediately. The same principle applies to applications. When users open mobile applications, they expect to see results immediately.

The fundamental principle: scale smart, stay reliable. When the inevitable next traffic surge occurs, system weaknesses will be revealed. The objective of these strategies is direct: When traffic surges, critical components must remain operational, core systems must maintain responsiveness, and customers must continue receiving immediate responses as expected.
