
The Art of Scalability: Scalable Web Architecture, Processes, and Organizations for the Modern Enterprise (Part 4)


CHAPTER 9: MANAGING CRISIS AND ESCALATIONS

The eBay Scalability Crisis

As proof that a crisis can change a company, consider eBay in 1999. In its early days, eBay was the darling of the Internet and up to the summer of 1999, few if any companies had experienced its exponential growth in users, revenue, and profits. Through the summer of 1999, eBay experienced many outages including a 20-plus hour outage in June of 1999. These outages were at least partially responsible for the reduction in stock price from a high in the mid $20s the week of April 26, 1999, to a low of $10.42 the week of August 2, 1999.

The cause of the outages isn’t really as important as what happened within the company after the outages. Additional executives were brought in to ensure that the engineering organization, the engineering processes, and the technology they produced could scale to the demand placed on them by the eBay community. Initially, additional capital was deployed to purchase systems and equipment (though eBay was successful in actually lowering both its technology expense and capital on an absolute basis well into 2001). Processes were put in place to help the company design systems that were more scalable, and the engineering team was augmented with engineers experienced in high availability and scalable designs and architectures. Most importantly, the company created a culture of scalability. The lessons from the summer of pain are still discussed at eBay, and scalability has become part of eBay’s DNA.

eBay continued to experience crises from time to time, but these crises were smaller in terms of their impact and shorter in terms of their duration as compared to the summer of 1999. The culture of scalability netted architectural changes, people changes, and process changes. One such change was eBay’s focus on managing each and every crisis in the fashion described in this chapter.
Order Out of Chaos

Bringing in and managing several different organizations within a crisis situation is difficult at best. Most organizations have their own unique subculture and oftentimes, even within a technology organization, those subcultures don’t even truly speak the same language. It is entirely possible that an application developer will use terms with which a systems engineer is not familiar, and vice versa.

Moreover, if not managed, the attendance of many people and multiple organizations within a crisis situation will create chaos. This chaos will feed on itself creating a vicious cycle that can actually prolong the crisis or worse yet aggravate the damage done in the crisis through someone taking an ill-advised action. Indeed, if you cannot effectively manage the force you throw at a crisis, you are better off using fewer people.

Your company may have a crisis management process that consists of both phone and chat (instant messaging or IRC) communications. If you listen on the phone or follow the chat session, you are very likely to see an unguided set of discussions and statements as different people and organizations go about troubleshooting or trying different activities in the hopes of finding something that will work. You may have questions asked that go unanswered or requests to try something that go without authorization.

You might as well be witnessing a grade school recess, with different groups of children running around doing different things with absolutely no coordination of effort. But a crisis situation isn’t a recess; it’s a war, and in war such a lack of coordination results in an increase in the rate of friendly casualties through “friendly fire.” In a technology crisis, these friendly casualties are manifested through prolonged outages, lost data, and increased customer impact. What you really want to see in such a situation is some level of control applied to the chaos.
Rather than a grade school recess, you hope to see a high school football game. Don’t get us wrong, you aren’t going to see an NFL style performance, but you do hope that you witness a group of professionals being led with confidence to identify a path to restoration and a path to identification of root cause.

Different groups should have specific objectives and guidelines unique to their expertise. There should be an expectation that they are reporting their progress clearly and succinctly in regular time intervals. Hypotheses should be generated, quickly debated, and either prioritized for analysis or eliminated as good initial candidates. These hypotheses should then be quickly restated as the tasks necessary to determine validity and handed out to the appropriate groups to work them with times for results clearly communicated.

Someone on the call or in the crisis resolution meeting should be in charge, and that someone should be able to paint an accurate picture of the impact, what has been tried, the best hypotheses being considered and the tasks associated with those hypotheses, and the timeline for completion of the current set of actions, as well as the development of the next set of actions. Other members should be managers of the technical teams assembled to help solve the crisis and one of the experienced (described in organizations as senior, principal, or lead) technical people from each manager’s teams. We will now describe these roles and positions in greater detail. Other engineers should be gathered in organizational or cross-functional groups to deeply investigate domain areas or services within the platform undergoing a crisis.

The Role of the “Problem Manager”

The preceding paragraphs have been leading up to a position definition. We can think of lots of names for such a position: outage commander, problem manager, incident manager, crisis commando, crisis manager, issue manager, and from the military, battle captain.
Whatever you call the person, you had better have someone capable of taking charge on the phone. Unfortunately, not everyone can fill this kind of a role. We aren’t arguing that you need to hire someone just to manage your major production incidents to resolution, though if you have enough of them you might consider that; rather, ensure you have at least one person on your staff who has the skills to manage such a chaotic environment.

The characteristics of someone capable of successfully managing chaotic environments are rather unique. As with leadership, some people are born with them and some people nurture them over time. The person absolutely needs to be technically literate but not necessarily the most technical person in the room. He should be able to use his technical base to form questions and evaluate answers relevant to the crisis at hand. He does not need to be the chief problem solver, but he needs to effectively manage the process of the chief problem solvers gathered within the crisis. The person also needs to be incredibly calm “inside” but be persuasive “outside.” This might mean that he has the type of presence to which people naturally are attracted or it may mean that he isn’t afraid to yell to get people’s attention within the room or on the conference call.

The crisis manager needs to be able to speak and think in business terms. She needs to be conversant enough with the business model to make decisions in the absence of higher guidance on when to force incident resolution over attempting to collect data that might be destroyed and would be useful in problem resolution (remember the differences in definitions from Chapter 8). The crisis manager also needs to be able to create succinct business relevant summaries from the technical chaos that is going on around her in order to keep the remainder of the business informed.
In the absence of administrative help to document everything said or done during the crisis, the crisis manager is responsible for ensuring that the actions and discussions are represented in a written state for future analysis. This means that the crisis manager will need to keep a history of the crisis as well as help ensure that others are keeping histories to be merged. A shared chat room with timestamps enabled is an excellent choice for this.

In terms of Star Trek characters and financial gurus, the person is 1/3 Scotty, 1/3 Captain Kirk, and 1/3 Warren Buffett. He is 1/3 engineer, 1/3 manager, and 1/3 business manager. He has a combat arms military background, an M.B.A., and a Ph.D. in some engineering discipline. Hopefully, by now, we’ve indicated how difficult it is to find someone with the experience, charisma, and business acumen to perform such a function.

To make the task even harder, when you find the person, she probably isn’t going to want the job as it is a bottomless pool of stress. You will either need to incent the person with the right merit based performance package or you will need to clearly articulate how it is that they have a future beyond managing crises in your organization. However you approach it, if you are lucky enough to be successful in finding such an individual, you should do everything possible to keep him or her for the “long term.”

Although we flippantly suggested the M.B.A., Ph.D., and military combat arms background, we were only half kidding. Such people actually do exist! As we mentioned earlier, the military has a role that they put such people in to manage their battles or what most of us would view as crises. The military combat arms branches attract many leaders and managers who thrive on chaos and are trained and have the personalities to handle such environments.
Although not all former military officers have the right personalities, the percentage within this class of individuals who have the right personalities is significantly higher than in the rest of the general population. Moreover, they have life experiences consistent with your needs and specialized training on how to handle such situations. Finally, as a group, they tend to be highly educated, with many of them having at least one and sometimes multiple graduate degrees. Ideally, you would want one who has been out of the military for a while and running engineering teams to give him the proper experience.

The Role of Team Managers

Within a crisis situation, a team manager is responsible for passing along action items to her teams and reporting progress, ideas, hypotheses, and summaries back to the crisis manager. Depending upon the type of organization, the team manager may also be the “senior” or “lead” engineer on the call for her discipline or domain.

A team manager functioning solely in a management capacity is expected to manage his team through the crisis resolution process. A majority of his team is going to be somewhere other than the crisis resolution (or “war”) room or on a call other than the crisis resolution call if a phone is being used. This means that the team manager must communicate and monitor the progress of his team as well as interacting with the crisis manager.

Although this may sound odd, the hierarchical structure with multiple communication channels is exactly what gives this process so much scale. This structured hierarchy affects scale in the following way: If every manager can communicate and control 10 or more subordinate managers or individual contributors, the capability in terms of manpower grows by one or more orders of magnitude.
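The span-of-control arithmetic above can be sketched in a few lines. This is only an illustration under stated assumptions: the function names and the uniform fan-out of 10 at every level are hypothetical, and the pairwise count n(n-1)/2 is a common rough model of how crowded one shared channel can become, not a figure taken from the text.

```python
def people_reachable(fan_out: int, levels: int) -> int:
    """People coordinated below one crisis manager.

    Assumes (hypothetically) that every manager directs exactly
    `fan_out` people at each level of the hierarchy.
    """
    return sum(fan_out ** level for level in range(1, levels + 1))


def pairwise_links(people: int) -> int:
    """Possible person-to-person conversations if everyone shares one channel."""
    return people * (people - 1) // 2


if __name__ == "__main__":
    for levels in (1, 2, 3):
        n = people_reachable(10, levels)
        # +1 counts the crisis manager as a participant on the channel.
        print(f"{levels} level(s): {n} people; "
              f"{pairwise_links(n + 1)} possible pairings on a single channel")
```

Under these assumptions, each added level multiplies the people who can be directed by roughly the fan-out, while each person still only talks on two channels: one up and one down.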
The alternative is to have everyone communicating in a single room or in a single channel, which obviously doesn’t scale well as communication becomes difficult and coordination of people becomes near impossible. People and teams would quickly drown each other out in their debates, discussions, and chatter. Very little would get done in such a crowded environment.

Furthermore, this approach to having managers listen and communicate on two channels has been very effective for many years in the military. Company commanders listen to and interact with their battalion commanders on one channel and issue orders and respond to multiple platoon leaders on another channel (the company commander is at the upper-left of Figure 9.1). The platoon leaders then do the same with their platoons; each platoon leader speaks to multiple squads on a frequency dedicated to the platoon in question (see the center of Figure 9.1 speaking to squads shown in upper-right). So although it may seem a bit awkward to have someone listening to two different calls or being in a room while issuing directions over the phone or in a chat room, the concept has worked well in the military since the advent of the radio and we have employed it successfully in several companies. It is not uncommon for military pilots to listen to four different radios at one time while flying the aircraft: two tactical channels and two air traffic control channels.

The Role of Engineering Leads

The role of a senior engineering professional on the phone can be filled by a deeply technical manager. Each engineering discipline or engineering team necessary to resolve the crisis should have someone capable of both managing that team and answering technical questions within the higher level crisis management team.
This person is the lead individual investigator for her domain experience on the crisis management call and is responsible for helping the higher-level team vet information, clear and prioritize hypotheses, and so on. This person can also be on both the calls of the organization she represents and the crisis management call or conference, but her primary responsibility is to interact with the other senior engineers and the crisis manager to help formulate appropriate actions to end the crisis.

[Figure 9.1 Military Communication. Panels: Company Commander to Multiple Platoon Leaders; Platoon Leader to Multiple Squads. Radio frequency labels shown: 40.50 and 50.25.]

The Role of Individual Contributors

Individual contributors within the teams assigned to the crisis management call or conference communicate on separate chat and phone conferences or reside in separate conference rooms. They are responsible for generating and running down leads within their teams and work with the lead or senior engineer and their manager on the crisis management team. Here, an individual contributor isn’t just responsible for doing work assigned by the crisis management team. The individual contributor and his teams are additionally responsible for brainstorming potential problems causing the incident, communicating them, generating hypotheses, and quickly proving or disproving those hypotheses. The teams should be able to communicate with the other domains’ teams either through the crisis management team or directly. All status, however, should be communicated to the team manager who is responsible for communicating it to the crisis management team.

Communications and Control

Shared communication channels are a must for effective and rapid crisis resolution. Ideally, the teams are moved to be located near each other at the beginning of a crisis.
That means that the lead crisis management team is in the same room and that each of the individual teams supporting the crisis resolution effort are located with each other to facilitate rapid brainstorming, hypothesis resolution, distribution of work, and status reporting. Too often, however, crises happen when people are away from work; because of this, both synchronous voice communication conferences (such as conference bridges on a phone) and asynchronous chat rooms should be employed.

The voice channel should be used to issue commands, stop harmful activity, and gain the attention of the appropriate team. It is absolutely essential that someone from each of the teams be on the crisis resolution voice channel and be capable of controlling her team. In many cases, two representatives, the manager and the senior (or lead) engineer, should be present from each team on such a call. This is the command and control channel in the absence of everyone being in the same room. All shots are called from here, and it serves as the temporary change control authority and system for the company. The authority to do anything other than perform non-destructive “read” activities like investigating logs is first “OK’d” within this voice channel or conference room to ensure that two activities do not compete with each other and either cause system damage or result in an inability to determine what action “fixed” the system.

The chat or IRC channel is used to document all conversations and easily pass around commands to be executed so that time isn’t wasted in communication. Commands that are passed around can be cut and pasted for accuracy. Additionally, the timestamps within the IRC or chat can be used in follow-up postmortems.
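The two rules above (log everything with timestamps, and get anything beyond a read-only action approved on the command channel first) can be sketched as follows. This is a minimal illustration, not a real chat or IRC integration: `CrisisLog`, `request_action`, and the example commands are hypothetical names introduced here, and approval is reduced to a single `approved_by` argument.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class CrisisLog:
    """Timestamped record of a crisis, standing in for the shared chat room."""
    entries: list = field(default_factory=list)

    def post(self, author: str, message: str) -> str:
        # UTC timestamps keep entries comparable across sites in a postmortem.
        stamp = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M:%SZ")
        line = f"[{stamp}] {author}: {message}"
        self.entries.append(line)
        return line


def request_action(log, author, command, read_only, approved_by=None):
    """Return True if the command may run; every request is logged.

    Read-only investigation is always allowed. Anything else must first
    be OK'd on the command channel, represented here by `approved_by`.
    """
    if read_only:
        log.post(author, f"READ {command}")
        return True
    if approved_by is None:
        log.post(author, f"BLOCKED (needs approval) {command}")
        return False
    log.post(author, f"APPROVED by {approved_by}: {command}")
    return True


if __name__ == "__main__":
    log = CrisisLog()
    request_action(log, "dba", "inspect slow query log", read_only=True)
    request_action(log, "dba", "restart primary database", read_only=False)
    request_action(log, "dba", "restart primary database", read_only=False,
                   approved_by="crisis manager")
    print("\n".join(log.entries))
```

Logging the blocked request as well as the approved one is a deliberate choice in this sketch: it leaves a trace of what was proposed and when, which helps when reconstructing which action actually fixed the system.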
The crisis manager is responsible for ensuring that he is not only putting his notes in the chat room and writing his decisions in the chat room for clarification, but for ensuring that status updates, summaries, hypotheses, and associated actions are put into the chat room.

It is absolutely essential in our minds that both the synchronous voice and asynchronous chat channels are open and available for any crisis. The asynchronous nature of chat allows activities to go on without interruption and allows individuals to monitor overall group activities between the tasks within their own assigned duties. Through this asynchronous method, scale is achieved while the voice allows for immediate command and control of different groups for immediate activities.

Should everyone be in one room, there is no need for a phone call or conference call other than to facilitate experts who might not be on site and updates for the business managers. But even with everyone in one room, a chat room should be opened and shared by all parties. In the case where a command is misunderstood, it can be buddy checked by all other crisis participants and even “cut and pasted” into the shared chat room for validation. The chat room allows actual system or application results to be shared in real time with the remainder of the group and an immediate log with timestamps is generated when such results are cut and pasted into the chat.

The War Room

Phone conferences are a poor but sometimes necessary substitute for the “war room” or crisis conference room we had previously mentioned. So much more can be communicated when people are in a room together, as body language and facial expressions can actually be meaningful in a discussion. How many times have you heard someone say something, but when you read or look at the person’s face you realize he is not convinced of the validity of his statement?
That isn’t to say that the person is lying, but rather that he is passing along something that he does not wholly believe. For instance, someone might say, “The team believes that the problem could be with the login code,” but she has a scowl on her face that shows that something is wrong. A phone conversation would not pick that up, but you have the presence of mind in person to say, “What’s wrong, Sue?” Sue might answer that she doesn’t believe it’s possible given that the login code hasn’t changed in months, which may lower the priority for investigation. Sue might also respond by saying, “We just changed that damn thing yesterday,” which would increase the prioritization for investigation.

In the ideal case, the war room is equipped with phones, a shared desk, terminals capable of accessing systems that might be involved in the crisis, plenty of work space, projectors capable of displaying key operating metrics or any person’s terminal, and lots of whiteboard space. Although the inclusion of a whiteboard might initially appear to be at odds with the need to log everything in a chat room, it actually supports chat activities by allowing graphics, symbols, and ideas best expressed in pictures to be drawn quickly and shared. Then, such things can be reduced to words and placed in chat, or a picture of the whiteboard can be taken and sent to the chat members. Many new whiteboards even have systems capable of reducing their contents to pictures immediately. Should you have an operations center, the war room should be close to that to allow easy access from one area to the next.

You may think that creating such a war room would be a very expensive proposition. “We can’t possibly afford to dedicate space to a crisis,” you might say. Our answer is that the war room need not be expensive or dedicated to crisis situations.
It simply needs to give priority to any crisis, and as such any conference room equipped with at least one and preferably two or more phone lines will do. Individual managers can use cell phones to communicate with their teams if need be, but in this case, you should consider the inclusion of low-cost cell phone chargers within the room. There are lots of low-cost whiteboard options available including special paint that “acts” like a whiteboard and is easily cleanable, and windows make a fine whiteboard in a pinch.

Moreover, the war room is useful for the “ride along” situation we described in Chapter 6. If you want to make a good case for why you should invest in creating a scalable organization, scalable processes, and a scalable technology platform, invite some business executives into a well-run war room to witness the work necessary to fix scale problems that result in a crisis. One word of caution here: If you can’t run a crisis well and make order out of its chaos, do not invite people into the conference. Instead, focus your time on finding a leader and manager who can run such a crisis and then invite other executives into it.

Tips for a Successful War Room

A good war room has the following:
• Plenty of whiteboard space
• Computers and monitors with access to the production systems and real-time data
• A projector for sharing information
• Phones for communication to teams outside the war room
• Access to IRC or chat
• Workspace for the number of people who will occupy the room

War rooms tend to get loud, and the crisis manager must maintain control within the room to ensure that communication is concise and effective. Brainstorming can and should be used, but limit communication during discussion to one individual at a time.

Escalations

Escalations during crisis events are critical for several reasons.
The first and most obvious is that the company’s job in maximizing shareholder value is to ensure that it isn’t destroyed in these events. As such, the CTO, CEO, and other execs need to hear quickly of issues that are likely to take significant time or have significant negative customer impact. In a public company, it’s all that much more important that the senior execs know what is going on as shareholders demand that they know about such things, and it is possible that public facing statements will need to be made. Moreover, executives have a better chance at helping to marshal all of the resources necessary to bring a crisis to resolution, including customer communications, vendor and partner relationships, and so on.

The natural tendency for engineering teams is to feel that they can solve the problem without outside help or help from their management teams. That may be true, but solving the problem isn’t enough; it needs to be resolved in the quickest and most cost-effective way possible. Often, that will require more than the engineering team can muster on their own, especially if third-party providers are at all to blame for some of the incident. Moreover, communication throughout the company is important as your systems are either supporting critical portions of the company or in the case of Web companies they are the company. Someone needs to communicate to shareholders, partners, customers, and maybe even the press. That job is best handled by people who aren’t involved in fighting the fire.

Think through your escalation policies and get buy-in from senior executives before you have a major crisis. It is the crisis manager’s job to adhere to those escalation policies and get the right people involved at the time defined in the policies regardless of how quickly the problem is likely to be solved after the escalation.
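One way to make "the time defined in the policies" concrete is to write the escalation policy down as data before any crisis occurs. The sketch below is hypothetical throughout: the roles, the minute thresholds, and the `roles_due` helper are illustrative assumptions, not values or names from the text, and a real policy would be whatever senior executives have agreed to.

```python
# Hypothetical policy: (minutes since crisis start, role to bring in).
# The thresholds are illustrative only.
ESCALATION_POLICY = [
    (0, "crisis manager"),
    (15, "engineering managers"),
    (30, "CTO"),
    (60, "CEO"),
]


def roles_due(minutes_elapsed, already_notified):
    """Roles the policy says must be involved by now but have not been yet.

    The check depends only on elapsed time, never on how close the team
    believes it is to a fix.
    """
    return [
        role
        for threshold, role in ESCALATION_POLICY
        if minutes_elapsed >= threshold and role not in already_notified
    ]


if __name__ == "__main__":
    notified = {"crisis manager"}
    for minute in (10, 30, 75):
        due = roles_due(minute, notified)
        print(f"t+{minute} min: escalate to {due or 'nobody new'}")
        notified.update(due)
```

Keeping the check purely time-based mirrors the point made above: escalation happens at the agreed time regardless of how soon a resolution is expected.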
Status Communications

Status communications should happen at predefined intervals throughout the crisis and should be posted or communicated in a somewhat secure fashion such that the organizations needing information on resolution time can get the information they need to take the appropriate actions. Status is different from escalation. Escalation is made to bring in additional help as time drags on during a crisis, and status communications are made to keep people informed. Using the RASCI framework, you escalate to Rs, As, Ss, and Cs, and you post status communication to Is.

A status should include start time, a general update of actions since the start time, and the expected resolution time if known. This resolution time is important for several reasons. Maybe you support a manufacturing center and the manufacturing manager needs to know if she should send home her hourly employees. Potentially, you provide sales or customer support software in a SaaS fashion, and those companies need to be able to figure out what to do with their sales and customer support staff.

Your crisis process should clearly define who is responsible for communicating to whom, but it is the crisis manager’s job to ensure that the timeline for communications is followed and that the appropriate communicators are properly informed. A sample status email is shown in Figure 9.2.

Crises Postmortems

Just as a crisis is an incident on steroids, so is a crisis postmortem a juiced-up postmortem. Treat this postmortem with extra special care. Bring in people outside of technology because you never know where you are going to get advice critical to making the whole process better. Remember, the systems that you helped create and manage have just caused a huge problem for a lot of people. This isn’t the time to get defensive; this is the time to be reborn.
This is the meeting that will fulfill or destroy the process of turning around your team, setting up the right culture, and fixing your processes.

Figure 9.2 Status Communication

To: Crisis Manager Escalation List
Subject: September 22 Login Failures

Issue: 100% of internet logins from our customers started failing at 9:00 AM on Thursday, 22 September. Customers who were already logged in could continue to work unless they signed out or closed their browsers.
Cause: Unknown at this time, but likely related to the 8:59 AM code push.
Impact: User activity metrics are off by 20% as compared to last week, and 100% of all logins from 9 AM have failed.
Update: We have isolated potential causes to one of three candidates within the code and we expect to find the culprit within the next 30 minutes.
Time to Restoration: We expect to isolate root cause in the code, build the new code and roll out to the site within 60 minutes.
Fallback Plan: If we are not live with a fix within 90 minutes we will roll the code back to the previous version within 75 minutes.

Johnny Onthespot
Crisis Manager
AllScale Networks

[...]

... include the exact time and date of the change, the system undergoing change, the expected results of the change, and the contact information of the person making the change
• The intent of change management is to limit the impact of changes by controlling them through their release into the production environment and logging them as they are introduced to production
• Change management consists of the ...

... The database, for example, would include the number of SQL transactions (based on the current query mix), the storage, and the server loads. These assignees should be the people responsible for the health and welfare of these components whenever possible. The database administrators are most likely the best candidates for the database analysis, the systems administrators for the application servers. The ...
... impact of changes within your platform, product, or system. ATC works to order aircraft landings and takeoffs based on the availability of the aircraft, its personal needs (does the aircraft have a declared emergency, is it low on fuel, and so on), and its order in the queue for takeoffs and landings. Queue order may be changed for a number of reasons including the aforementioned declaration of emergencies ...

... all of the required information necessary to “request” the change is indeed present, that the change proposal has all required fields filled out appropriately. To the extent that you’ve implemented some form of the RASCI model, you may also decide to require that the appropriate A, or owner of the system in question, has signed off on the change and is aware of it. The primary reason for the inclusion of ...

... to reduce the rate of change, but rather to allow the rate of change to increase while decreasing the number of change related incidents and their impact on shareholder wealth creation. Increasing the velocity and quantity of change while decreasing the impact and probability of change related incidents is how change management increases the scalability of your organization, service, or platform. Change ...

... contain information regarding risk, reward, and suggested or proposed dates for the change
• The change approval step validates that all information is correct and that the person requesting the change has the authorization to make the change
• The change scheduling step is the process of limiting risk by analyzing dependencies, rates of changes on subsystems and components, and attempting to minimize the ...
... determine the headroom of some common components found in systems. Lastly, we will discuss the ideal conditions that you want to look for in your components in terms of loads or performance.

Purpose of the Process

The purpose of determining the headroom of your application, as we started to discuss, is to understand where your system stands in terms of its capability to continue to serve the needs of your ...

• The crisis resolution team consists of the crisis manager, engineering managers, and senior engineers. In addition, teams of engineers reporting to the engineering managers are employed.
• The role of the crisis manager is to maintain order and follow the crisis resolution, escalation, and communication processes.
• The role of the engineering manager is to manage her team and provide ...

... identified within the change proposal and consistent with the limitations, restrictions, or requests identified within the change scheduling phase. This phase consists of two steps: starting and logging the start time of the change and completing and logging the completion time of the change. This is slightly more robust than the change identification process identified earlier in the chapter, but also will yield ...

... for many of us, the elimination of changes within a system, while potentially accomplishing stability, will limit the ability of our business to grow. Therefore, we must allow and enable changes with the intent of limiting impact and managing risk, thereby creating a stable platform or service. If unmanaged, a high rate of change will cause you significant problems and will result in the more modern definition ...

Posted: 14/08/2014, 17:21
