Building Scalable Web Sites
By Cal Henderson
Publisher: O'Reilly
Pub Date: May 2006
Print ISBN-10: 0-596-10235-6
Print ISBN-13: 978-0-59-610235-7
Pages: 348

Slow websites infuriate users. Lots of people can visit your web site or use your web application, but you have to be prepared for those visitors, or they won't come back. Your sites need to be built to withstand the problems success creates. Building Scalable Web Sites looks at a variety of techniques for creating sites that can keep users cheerful even when there are thousands or millions of them.

Flickr.com developer Cal Henderson explains how to build sites so that large numbers of visitors can enjoy them. Henderson examines techniques that go beyond sheer speed, exploring how to coordinate developers, support international users, and integrate with other services from email to SOAP to RSS to the APIs exposed by many Ajax-based web applications.

This book uncovers the secrets that you need to know for back-end scaling, architecture, and failover so your websites can handle countless requests. You'll learn how to take the "poor man's web technologies" (Linux, Apache, MySQL, and PHP or other scripting languages) and scale them to compete with established "store bought" enterprise web technologies. Toward the end of the book, you'll discover techniques for keeping web applications running with event monitoring and long-term statistical tracking for capacity planning.

If you're about to build your first dynamic website, then Building Scalable Web Sites isn't for you. But if you're an advanced developer who's ready to realize the cost and performance benefits of a comprehensive approach to scalable applications, then let your fingers do the walking through this convenient guide.

Table of Contents

Copyright
Preface
Chapter 1 Introduction
    Section 1.1 What Is a Web Application?
    Section 1.2 How Do You Build Web Applications?
    Section 1.3 What Is Architecture?
    Section 1.4 How Do I Get Started?
Chapter 2 Web Application Architecture
    Section 2.1 Layered Software Architecture
    Section 2.2 Layered Technologies
    Section 2.3 Software Interface Design
    Section 2.4 Getting from A to B
    Section 2.5 The Software/Hardware Divide
    Section 2.6 Hardware Platforms
    Section 2.7 Hardware Platform Growth
    Section 2.8 Hardware Redundancy
    Section 2.9 Networking
    Section 2.10 Languages, Technologies, and Databases
Chapter 3 Development Environments
    Section 3.1 The Three Rules
    Section 3.2 Use Source Control
    Section 3.3 One-Step Build
    Section 3.4 Issue Tracking
    Section 3.5 Scaling the Development Model
    Section 3.6 Coding Standards
    Section 3.7 Testing
Chapter 4 i18n, L10n, and Unicode
    Section 4.1 Internationalization and Localization
    Section 4.2 Unicode in a Nutshell
    Section 4.3 Unicode Encodings
    Section 4.4 The UTF-8 Encoding
    Section 4.5 UTF-8 Web Applications
    Section 4.6 Using UTF-8 with PHP
    Section 4.7 Using UTF-8 with Other Languages
    Section 4.8 Using UTF-8 with MySQL
    Section 4.9 Using UTF-8 with Email
    Section 4.10 Using UTF-8 with JavaScript
    Section 4.11 Using UTF-8 with APIs
Chapter 5 Data Integrity and Security
    Section 5.1 Data Integrity Policies
    Section 5.2 Good, Valid, and Invalid
    Section 5.3 Filtering UTF-8
    Section 5.4 Filtering Control Characters
    Section 5.5 Filtering HTML
    Section 5.6 Cross-Site Scripting (XSS)
    Section 5.7 SQL Injection Attacks
Chapter 6 Email
    Section 6.1 Receiving Email
    Section 6.2 Injecting Email into Your Application
    Section 6.3 The MIME Format
    Section 6.4 Parsing Simple MIME Emails
    Section 6.5 Parsing UU Encoded Attachments
    Section 6.6 TNEF Attachments
    Section 6.7 Wireless Carriers Hate You
    Section 6.8 Character Sets and Encodings
    Section 6.9 Recognizing Your Users
    Section 6.10 Unit Testing
Chapter 7 Remote Services
    Section 7.1 Remote Services Club
    Section 7.2 Sockets
    Section 7.3 Using HTTP
    Section 7.4 Remote Services Redundancy
    Section 7.5 Asynchronous Systems
    Section 7.6 Exchanging XML
    Section 7.7 Lightweight Protocols
Chapter 8 Bottlenecks
    Section 8.1 Identifying Bottlenecks
    Section 8.2 External Services and Black Boxes
Chapter 9 Scaling Web Applications
    Section 9.1 The Scaling Myth
    Section 9.2 Scaling the Network
    Section 9.3 Load Balancing
    Section 9.4 Scaling MySQL
    Section 9.5 MyISAM
    Section 9.6 MySQL Replication
    Section 9.7 Database Partitioning
    Section 9.8 Scaling Large Database
    Section 9.9 Scaling Storage
Chapter 10 Statistics, Monitoring, and Alerting
    Section 10.1 Tracking Web Statistics
    Section 10.2 Application Monitoring
    Section 10.3 Alerting
Chapter 11 APIs
    Section 11.1 Data Feeds
    Section 11.2 Mobile Content
    Section 11.3 Web Services
    Section 11.4 API Transports
    Section 11.5 API Abuse
    Section 11.6 Authentication
    Section 11.7 The Future
About the Author
Colophon
Index

Building Scalable Web Sites
by Cal Henderson

Copyright © 2006 O'Reilly Media, Inc. All rights reserved. Printed in the United States of America.

Published by O'Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O'Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (safari.oreilly.com). For more information, contact our corporate/institutional sales department: (800) 998-9938 or corporate@oreilly.com.

Editor: Simon St.Laurent
Production Editor: Adam Witwer
Copyeditor: Adam Witwer
Proofreader: Colleen Gorman
Indexer: John Bickelhaupt
Cover Designer: Karen Montgomery
Interior Designer: David Futato
Illustrators: Robert Romano and Jessamyn Read

Printing History:
May 2006: First Edition

Nutshell Handbook, the Nutshell Handbook logo, and the O'Reilly logo are registered trademarks of O'Reilly Media, Inc. Building Scalable Web Sites, the image of a carp, and related trade dress are trademarks of O'Reilly Media, Inc. Many of the designations used by manufacturers and sellers to
distinguish their products are claimed as trademarks. Where those designations appear in this book, and O'Reilly Media, Inc. was aware of a trademark claim, the designations have been printed in caps or initial caps.

While every precaution has been taken in the preparation of this book, the publisher and author assume no responsibility for errors or omissions, or for damages resulting from the use of the information contained herein.

ISBN: 0-596-10235-6 [M]

Preface

The first web application I built was called Terrania. A visitor could come to the web site, create a virtual creature with some customizations, and then track that creature's progress through a virtual world. Creatures would wander about, eat plants (or other creatures), fight battles, and mate with other players' creatures. This activity would then be reported back to players by twice-daily emails summarizing the day's events.

Calling it a web application is a bit of a stretch; at the time I certainly wouldn't have categorized it as such. The core of the game was a program written in C++ that ran on a single machine, loading game data from a single flat file, processing everything for the game "tick," and storing it all again in a single flat file. When I started building the game, the runtime was destined to become the server component of a client-server game architecture. Programming network data-exchange at the time was a difficult process that tended to involve writing a lot of rote code just to exchange strings between a server and client (we had no .NET in those days).

The Web gave application developers a ready-to-use platform for content delivery across a network, cutting out the trickier parts of client-server applications. We were free to build the server that did the interesting parts while building a client in simple HTML that was trivial in comparison. What would have traditionally been the client component of Terrania resided on the server, simply accessing the same flat file that the game server used.
For most pages in the "client" application, I simply loaded the file into memory, parsed out the creatures that the player cared about, and displayed back some static information in HTML. To create a new creature, I appended a block of data to the end of a second file, which the server would then pick up and process each time it ran, integrating the new creatures into the game. All game processing, including the sending of progress emails, was done by the server component. The web server "client" interface was a simple C++ CGI application that could parse the game datafile in a couple of hundred lines of source.

This system was pretty satisfactory; perhaps I didn't see the limitations at the time because I didn't come up against any of them. The lack of interactivity through the web interface wasn't a big deal as that was part of the game design. The only write operation performed by a player was the initial creation of the creature, leaving the rest of the game as a read-only process.

Another issue that didn't come up was concurrency. Since Terrania was largely read-only, any number of players could generate pages simultaneously. All of the writes were simple file appends that were fast enough to avoid spinning for locks. Besides, there weren't enough players for there to be a reasonable chance of two people reading or writing at once.

A few years would pass before I got around to working with something more closely resembling a web application. While working for a new media agency, I was asked to modify some of the HTML output by a message board powered by UBB (Ultimate Bulletin Board, from Groupee, Inc.).
UBB was written in Perl and ran as a CGI. Application data items, such as user accounts and the messages that comprised the discussion, were stored in flat files using a custom format. Some pages of the application were dynamic, being created on the fly from data read from the flat files. Other pages, such as the discussions themselves, were flat HTML files that were written to disk by the application as needed. This render-to-disk technique is still used in low-write, high-read setups such as weblogs, where the cost of generating the viewed pages on the fly outweighs the cost of writing files to disk (which can be a comparatively very slow operation).

The great thing about UBB was that it was written in a "scripting" language, Perl. Because the source code didn't need to be compiled, the development cycle was massively reduced, making it much easier to tinker with things without wasting days at a time. The source code was organized into three main files: the endpoint scripts that users actually requested and two library files containing utility functions (called ubb_library.pl and ubb_library2.pl, seriously).

After a little experience working with UBB for a few commercial clients, I got fairly involved with the message board "hacking" community, a strange group of people who spent their time trying to add functionality to existing message board software. I started a site called UBB Hackers with a guy who later went on to be a programmer for Infopop, writing the next version of UBB.

Early on, UBB had very poor concurrency because it relied on nonportable file-locking code that didn't work on Windows (one of the target platforms). If two users were replying to the same thread at the same time, the thread's datafile could become corrupted and some of the data lost. As the number of users on any single system increased, the chance for data corruption and race conditions increased. For really active systems, rendering HTML files to disk quickly bottlenecks on file I/O.

The next step
now seems like it should have been obvious, but at the time it wasn't.

MySQL 3 changed a lot of things in the world of web applications. Before MySQL, it wasn't as easy to use a database for storing web application data. Existing database technologies were either prohibitively expensive (Oracle), slow and difficult to work with (FileMaker), or insanely complicated to set up and maintain (PostgreSQL). With the availability of MySQL 3, things started to change. PHP 4 was just starting to get widespread acceptance and the phpMyAdmin project had been started. phpMyAdmin meant that web application developers could start working with databases without the visual design oddities of FileMaker or the arcane SQL syntax knowledge needed to drive things on the command line. I can still never remember the correct syntax for creating a table or granting access to a new user, but now I don't need to.

MySQL brought application developers concurrency: we could read and write at the same time and our data would never get inadvertently corrupted. As MySQL progressed, we got even higher concurrency and massive performance, miles beyond what we could have achieved with flat files and render-to-disk techniques. With indexes, we could select data in arbitrary sets and orders without having to load it all into memory and walk the data structure. The possibilities were endless.

And they still are. The current breed of web applications is still pushing the boundaries of what can be done in terms of scale, functionality, and interoperability. With the explosion of public APIs, the ability to combine multiple applications to create new services has made for a service-oriented culture. The API service model has shown us clear ways to architect our applications for flexibility and scale at a low cost.

The largest and most popular web applications of the moment, such as Flickr, Friendster, MySpace, and Wikipedia, handle billions of database queries per day, have huge datasets, and run on massive hardware
platforms comprised of commodity hardware. While Google might be the poster child of huge applications, these other smaller (though still huge) applications are becoming role models for the next generation of applications, now labeled Web 2.0. With increased read/write interactivity, network effects, and open APIs, the next generation of web application development is going to be very interesting.

What This Book Is About

This book is primarily about web application design: the design ... Perhaps as importantly, this book is about the development of web applications: the practice of building the hardware and implementing the software systems that we design. While the theory of application design is all well and good (and an...

... With applications like Google's Gmail and Microsoft's Office Live, the web application market is moving toward applications delivered over the Web with the features and benefits of desktop applications combined with the benefits of