CLOUD DESIGN PATTERNS
PRESCRIPTIVE ARCHITECTURE GUIDANCE FOR CLOUD APPLICATIONS

Alex Homer
John Sharp
Larry Brader
Masashi Narumoto
Trent Swanson

For more information explore: microsoft.com/practices
Software Architecture and Software Development

patterns & practices
proven practices for predictable results

Save time and reduce risk on your software development projects by incorporating patterns & practices, Microsoft’s applied engineering guidance that includes both production-quality source code and documentation.

The guidance is designed to help software development teams:
• Make critical design and technology selection decisions by highlighting the appropriate solution architectures, technologies, and Microsoft products for common scenarios.
• Understand the most important concepts needed for success by explaining the relevant patterns and prescribing the important practices.
• Get started with a proven code base by providing thoroughly tested software and source that embodies Microsoft’s recommendations.

The patterns & practices team consists of experienced architects, developers, writers, and testers. We work openly with the developer community and industry experts on every project to ensure that some of the best minds in the industry have contributed to and reviewed the guidance as it is being developed. We also love our role as the bridge between the real-world needs of our customers and the wide range of products and technologies that Microsoft provides.

Cloud applications have a unique set of characteristics. They run on commodity hardware, provide services to untrusted users, and deal with unpredictable workloads. These factors impose a range of problems that you, as a designer or developer, need to resolve. Your applications must be resilient so that they can recover from failures, secure to protect services from malicious attacks, and elastic in order to respond to an ever-changing workload.
This guide demonstrates design patterns that can help you to solve the problems you might encounter in many different areas of cloud application development. Each pattern discusses design considerations and explains how you can implement it using the features of Windows Azure. The patterns are grouped into categories: availability, data management, design and implementation, messaging, performance and scalability, resiliency, management and monitoring, and security.

You will also see more general guidance related to these areas of concern. It explains key concepts such as data consistency and asynchronous messaging. In addition, there is useful guidance and explanation of the key considerations for designing features such as data partitioning, telemetry, and hosting in multiple datacenters.

These patterns and guidance can help you to improve the quality of the applications and services you create, and make the development process more efficient. Enjoy!

“This guide contains a wealth of useful information to help you design and build your applications for the cloud.”
Scott Guthrie, Corporate Vice President, Windows Azure

ISBN 978-1-62114-036-8

This document is provided “as-is”. Information and views expressed in this document, including URL and other Internet Web site references, may change without notice.

Some examples depicted herein are provided for illustration only and are fictitious. No real association or connection is intended or should be inferred.

This document does not provide you with any legal rights to any intellectual property in any Microsoft product. You may copy and use this document for your internal, reference purposes.

© 2014 Microsoft. All rights reserved.

Microsoft, MSDN, and Windows Azure are trademarks of the Microsoft group of companies. All other trademarks are property of their respective owners.
Contents

Preface 1
  Contents of this Guide 1
  The Design Patterns 3
  The Primer and Guidance Topics 5
  The Sample Applications 6
  More Information 8
  Feedback and Support 8
  The Team Who Brought You This Guide 8

PATTERNS
Cache-Aside Pattern 9
Circuit Breaker Pattern 14
Compensating Transaction Pattern 23
Competing Consumers Pattern 28
Compute Resource Consolidation Pattern 34
Command and Query Responsibility Segregation (CQRS) Pattern 42
Event Sourcing Pattern 50
External Configuration Store Pattern 58
Federated Identity Pattern 67
Gatekeeper Pattern 72
Health Endpoint Monitoring Pattern 75
Index Table Pattern 82
Leader Election Pattern 89
Materialized View Pattern 96
Pipes and Filters Pattern 100
Priority Queue Pattern 109
Queue-Based Load Leveling Pattern 116
Retry Pattern 120
Runtime Reconfiguration Pattern 126
Scheduler Agent Supervisor Pattern 132
Sharding Pattern 140
Static Content Hosting Pattern 150
Throttling Pattern 155
Valet Key Pattern 160

GUIDANCE
Asynchronous Messaging Primer 166
Autoscaling Guidance 174
Caching Guidance 179
Compute Partitioning Guidance 185
Data Consistency Primer 190
Data Partitioning Guidance 197
Data Replication and Synchronization Guidance 206
Instrumentation and Telemetry Guidance 214
Multiple Datacenter Deployment Guidance 220
Service Metering Guidance 228

Preface

This guide from the Microsoft patterns & practices group, produced with the help of many people within the developer community, provides solutions for common problems encountered when developing cloud-hosted applications. The guide:
• Articulates the benefit of applying patterns when implementing cloud applications, especially when they will be hosted in Windows Azure.
• Discusses the problems that the patterns address, and how these relate to Windows Azure applications.
• Shows how to implement the patterns using the features of Windows Azure, emphasizing benefits and considerations.
• Depicts the big picture by showing how these patterns fit into cloud application architectures, and how they relate to other patterns.

The majority of topics described in the guide are equally relevant to all kinds of distributed systems, whether hosted on Windows Azure or on other cloud platforms.

Our intention is not to provide a comprehensive collection of patterns. Instead, we chose what we think are useful patterns for cloud applications, taking into account the popularity of each one amongst users. Neither is this a detailed guide to the features of Windows Azure. To learn about Windows Azure, see http://windowsazure.com.

Contents of this Guide

In conjunction with feedback from a wide representation of the developer community, we identified eight categories that encompass the most common problem areas in cloud application development.

Availability
Availability defines the proportion of time that the system is functional and working. It will be affected by system errors, infrastructure problems, malicious attacks, and system load. It is usually measured as a percentage of uptime. Cloud applications typically provide users with a service level agreement (SLA), which means that applications must be designed and implemented in a way that maximizes availability.

Data Management
Data management is the key element of cloud applications, and influences most of the quality attributes. Data is typically hosted in different locations and across multiple servers for reasons such as performance, scalability, or availability, and this can present a range of challenges. For example, data consistency must be maintained, and data will typically need to be synchronized across different locations.
Design and Implementation
Good design encompasses factors such as consistency and coherence in component design and deployment, maintainability to simplify administration and development, and reusability to allow components and subsystems to be used in other applications and in other scenarios. Decisions made during the design and implementation phase have a huge impact on the quality and the total cost of ownership of cloud-hosted applications and services.

Messaging
The distributed nature of cloud applications requires a messaging infrastructure that connects the components and services, ideally in a loosely coupled manner in order to maximize scalability. Asynchronous messaging is widely used and provides many benefits, but it also brings challenges such as the ordering of messages, poison message management, idempotency, and more.

Management and Monitoring
Cloud applications run in a remote datacenter where you do not have full control of the infrastructure or, in some cases, the operating system. This can make management and monitoring more difficult than in an on-premises deployment. Applications must expose runtime information that administrators and operators can use to manage and monitor the system, and they must support changing business requirements and customization without requiring the application to be stopped or redeployed.

Performance and Scalability
Performance is an indication of the responsiveness of a system to execute any action within a given time interval, while scalability is the ability of a system either to handle increases in load without impact on performance or for the available resources to be readily increased. Cloud applications typically encounter variable workloads and peaks in activity. Predicting these, especially in a multi-tenant scenario, is almost impossible. Instead, applications should be able to scale out within limits to meet peaks in demand, and scale in when demand decreases.
Scalability concerns not just compute instances, but other elements such as data storage, messaging infrastructure, and more.

Resiliency
Resiliency is the ability of a system to gracefully handle and recover from failures. The nature of cloud hosting, where applications are often multi-tenant, use shared platform services, compete for resources and bandwidth, communicate over the Internet, and run on commodity hardware, means there is an increased likelihood that both transient and more permanent faults will arise. Detecting failures, and recovering quickly and efficiently, is necessary to maintain resiliency.

Security
Security is the capability of a system to prevent malicious or accidental actions outside of the designed usage, and to prevent disclosure or loss of information. Cloud applications are exposed on the Internet outside trusted on-premises boundaries, are often open to the public, and may serve untrusted users. Applications must be designed and deployed in a way that protects them from malicious attacks, restricts access to only approved users, and protects sensitive data.

For each of these categories, we created related guidance and documented common patterns designed to help developers solve problems they regularly encounter. The guide contains:
• Twenty-four design patterns that are useful in cloud-hosted applications. Each pattern is provided in a common format that describes the context and problem, the solution, issues and considerations for applying the pattern, and an example based on Windows Azure. Each pattern also includes links to other related patterns.
• Two primers and eight guidance topics that provide basic knowledge and describe good practice techniques for developing cloud-hosted applications. The format of each primer and guidance topic is designed to present this information in a relevant and informative way.
• Ten sample applications that demonstrate the usage of the design patterns described in this guide.
You can use and adapt the source code to suit your own specific requirements.

The Design Patterns

The design patterns are allocated to one or more of the eight categories described earlier. The full list of patterns is shown in the following table.

Cache-Aside
Load data on demand into a cache from a data store. This pattern can improve performance and also helps to maintain consistency between data held in the cache and the data in the underlying data store.

Circuit Breaker
Handle faults that may take a variable amount of time to rectify when connecting to a remote service or resource. This pattern can improve the stability and resiliency of an application.

Compensating Transaction
Undo the work performed by a series of steps, which together define an eventually consistent operation, if one or more of the operations fails. Operations that follow the eventual consistency model are commonly found in cloud-hosted applications that implement complex business processes and workflows.

Competing Consumers
Enable multiple concurrent consumers to process messages received on the same messaging channel. This pattern enables a system to process multiple messages concurrently to optimize throughput, to improve scalability and availability, and to balance the workload.

Compute Resource Consolidation
Consolidate multiple tasks or operations into a single computational unit. This pattern can increase compute resource utilization, and reduce the costs and management overhead associated with performing compute processing in cloud-hosted applications.

Command and Query Responsibility Segregation (CQRS)
Segregate operations that read data from operations that update data by using separate interfaces.
This pattern can maximize performance, scalability, and security; support evolution of the system over time through higher flexibility; and prevent update commands from causing merge conflicts at the domain level.

Event Sourcing
Use an append-only store to record the full series of events that describe actions taken on data in a domain, rather than storing just the current state, so that the store can be used to materialize the domain objects. This pattern can simplify tasks in complex domains by avoiding the requirement to synchronize the data model and the business domain; improve performance, scalability, and responsiveness; provide consistency for transactional data; and maintain full audit trails and history that may enable compensating actions.

External Configuration Store
Move configuration information out of the application deployment package to a centralized location. This pattern can provide opportunities for easier management and control of configuration data, and for sharing configuration data across applications and application instances.

Federated Identity
Delegate authentication to an external identity provider. This pattern can simplify development, minimize the requirement for user administration, and improve the user experience of the application.

Gatekeeper
Protect applications and services by using a dedicated host instance that acts as a broker between clients and the application or service, validates and sanitizes requests, and passes requests and data between them. This pattern can provide an additional layer of security, and limit the attack surface of the system.

Health Endpoint Monitoring
Implement functional checks within an application that external tools can access through exposed endpoints at regular intervals. This pattern can help to verify that applications and services are performing correctly.

Index Table
Create indexes over the fields in data stores that are frequently referenced by query criteria.
This pattern can improve query performance by allowing applications to retrieve data from a data store more quickly.

Leader Election
Coordinate the actions performed by a collection of collaborating task instances in a distributed application by electing one instance as the leader that assumes responsibility for managing the other instances. This pattern can help to ensure that tasks do not conflict with each other, cause contention for shared resources, or inadvertently interfere with the work that other task instances are performing.

Materialized View
Generate prepopulated views over the data in one or more data stores when the data is formatted in a way that does not favor the required query operations. This pattern can help to support efficient querying and data extraction, and improve application performance.

Pipes and Filters
Decompose a task that performs complex processing into a series of discrete elements that can be reused. This pattern can improve performance, scalability, and reusability by allowing task elements that perform the processing to be deployed and scaled independently.

Priority Queue
Prioritize requests sent to services so that requests with a higher priority are received and processed more quickly than those of a lower priority. This pattern is useful in applications that offer different service level guarantees to individual types of client.

Queue-Based Load Leveling
Use a queue that acts as a buffer between a task and a service that it invokes in order to smooth intermittent heavy loads that may otherwise cause the service to fail or the task to time out. This pattern can help to minimize the impact of peaks in demand on availability and responsiveness for both the task and the service.

Retry
Enable an application to handle temporary failures when connecting to a service or network resource by transparently retrying the operation in the expectation that the failure is transient. This pattern can improve the stability of the application.
Runtime Reconfiguration
Design an application so that it can be reconfigured without requiring redeployment or restarting the application. This helps to maintain availability and minimize downtime.

Scheduler Agent Supervisor
Coordinate a set of actions across a distributed set of services and other remote resources, attempt to transparently handle faults if any of these actions fail, or undo the effects of the work performed if the system cannot recover from a fault. This pattern can add resiliency to a distributed system by enabling it to recover and retry actions that fail due to transient exceptions, long-lasting faults, and process failures.

Sharding
Divide a data store into a set of horizontal partitions, or shards. This pattern can improve scalability when storing and accessing large volumes of data.

Static Content Hosting
Deploy static content to a cloud-based storage service that can deliver it directly to the client. This pattern can reduce the requirement for potentially expensive compute instances.

Throttling
Control the consumption of resources used by an instance of an application, an individual tenant, or an entire service. This pattern can allow the system to continue to function and meet service level agreements, even when an increase in demand places an extreme load on resources.

Valet Key
Use a token or key that provides clients with restricted direct access to a specific resource or service in order to offload data transfer operations from the application code. This pattern is particularly useful in applications that use cloud-hosted storage systems or queues, and can minimize cost and maximize scalability and performance.

The Primer and Guidance Topics

The primer and guidance topics are related to specific areas of application development, as shown in the following diagram.
[Figure: Areas of a cloud application (users, Web UI, background processing, compute, database/storage, external services or on-premises systems, external STS/IDP, DevOps, and multiple datacenter deployment) mapped to the related topics: asynchronous messaging primer, autoscaling, caching, compute partitioning, data consistency primer, data partitioning, data replication and synchronization, instrumentation and telemetry, and service usage metering.]

The guide contains the following primers and guidance topics.

Asynchronous Messaging Primer
Messaging is a key strategy employed in many distributed environments such as the cloud. It enables applications and services to communicate and cooperate, and can help to build scalable and resilient solutions. Messaging supports asynchronous operations, enabling you to decouple a process that consumes a service from the process that implements the service.

Autoscaling Guidance
Constantly monitoring performance and scaling a system to adapt to fluctuating workloads to meet capacity targets and optimize operational cost can be a labor-intensive process. It may not be feasible to perform these tasks manually. This is where autoscaling is useful.

Caching Guidance
Caching is a common technique that aims to improve the performance and scalability of a system by temporarily copying frequently accessed data to fast storage located close to the application. Caching is most effective when an application instance repeatedly reads the same data, especially if the original data store is slow relative to the speed of the cache, is subject to a high level of contention, or is far away, resulting in network latency.

Compute Partitioning Guidance
When deploying an application to the cloud, it may be desirable to allocate the services and components it uses in a way that helps to minimize running costs while maintaining the scalability, performance, availability, and security of the application.

Data Consistency Primer
Cloud applications typically use data that is dispersed across data stores.
Managing and maintaining data consistency in this environment can become a critical aspect of the system, particularly in terms of the concurrency and availability issues that can arise. You frequently need to trade strong consistency for performance. This means that you may need to design some aspects of your solutions around the notion of eventual consistency and accept that the data that your applications use might not be completely consistent all of the time.

Data Partitioning Guidance
In many large-scale solutions, data is divided into separate partitions that can be managed and accessed separately. The partitioning strategy must be chosen carefully to maximize the benefits while minimizing adverse effects. Partitioning can help to improve scalability, reduce contention, and optimize performance.

Data Replication and Synchronization Guidance
When you deploy an application to more than one datacenter, such as cloud and on-premises locations, you must consider how you will replicate and synchronize the data each instance of the application uses in order to maximize availability and performance, ensure consistency, and minimize data transfer costs between locations.

Instrumentation and Telemetry Guidance
Most applications will include diagnostics features that generate custom monitoring and debugging information, especially when an error occurs. This is referred to as instrumentation, and is usually implemented by adding event and error handling code to the application. The process of gathering remote information that is collected by instrumentation is usually referred to as telemetry.

Multiple Datacenter Deployment Guidance
Deploying an application to more than one datacenter can provide benefits such as increased availability and a better user experience across wider geographical areas. However, there are challenges that must be resolved, such as data synchronization and regulatory limitations.
Service Metering Guidance
You may need to meter the use of applications or services in order to plan future requirements; to gain an understanding of how they are used; or to bill users, organization departments, or customers. This is a common requirement, particularly in large corporations and for independent software vendors and service providers.

The Sample Applications

Ten example applications that demonstrate the implementation of some of the patterns in this guide are available for you to download and run on your own computer or in your own Windows Azure subscription. To obtain and run the applications:
1. Go to the “Cloud Design Patterns - Sample Code” page on the Microsoft Download Center at http://aka.ms/cloud-design-patterns-sample. Download the “Cloud Design Patterns Examples.zip” file.
2. In Windows Explorer open the Properties for the zip file and choose Unblock.
[...]

The samples provided for this guide are simplified to focus on and demonstrate the essential features of each pattern. They are not designed to be used in production scenarios.

More Information

All of the chapters include references to additional resources such as books, blog posts, and papers that will provide additional detail if you want to explore some of the topics in greater depth. For your convenience,...

[...]

Documentation: Alex Homer, John Sharp (Content Master Ltd)
Graphic Artists: Chris Burns (Linda Werner & Associates Inc), Kieran Phelan (Allovus Design Inc)
Editor: RoAnn Corbisier
Production: Nelly Delgado
Technical Review: Bill Wilder (Author, Cloud Architecture Patterns), Michael Wood (Cerebrata)
Contributors: Hatay Tuna, Chris Clayton, Amit Srivastava, Jason Wescott, Clemens Vasters, Abhishek Lal, Vittorio...