The Automated Traffic Handbook
Managing Spiders, Bots, Scrapers, and Other Non-Human Traffic

Andy Still

Beijing • Boston • Farnham • Sebastopol • Tokyo

The Automated Traffic Handbook
by Andy Still

Copyright © 2018 O'Reilly Media, Inc. All rights reserved.
Printed in the United States of America.

Published by O'Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O'Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com/safari). For more information, contact our corporate/institutional sales department: 800-998-9938 or corporate@oreilly.com.

Editor: Virginia Wilson
Production Editor: Nicholas Adams
Copyeditor: Jasmine Kwityn
Interior Designer: David Futato
Cover Designer: Randy Comer
Illustrator: Rebecca Demarest
Tech Reviewers: Daniel Huddart, Andy Lole, and Jason Hand

February 2018: First Edition

Revision History for the First Edition
2018-02-02: First Release

The O'Reilly logo is a registered trademark of O'Reilly Media, Inc. The Automated Traffic Handbook, the cover image, and related trade dress are trademarks of O'Reilly Media, Inc.

While the publisher and the author have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the author disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.

This work is part of a collaboration between O'Reilly and Intechnica. See our statement of editorial independence.

978-1-492-02935-9

Table of Contents

Introduction

Part I. Background

1. What Is Automated Traffic?
   Key Characteristics of Automated Traffic
   Exclusions

2. Misconceptions of Automated Traffic
   Misconception: Bots Are Just Simple Automated Scripts
   Misconception: Bots Are Just a Security Problem
   Misconception: Bot Operators Are Just Individual Hackers
   Misconception: Only the Big Boys Need to Worry About Bots
   Misconception: I Have a WAF, I Don't Need to Worry About Bot Activity

3. Impact of Automated Traffic
   Company Interests
   Other Users
   System Security
   Infrastructure

Part II. Types of Automated Traffic

4. Malicious Bots
   Application DDoS

5. Data Harvesting
   Search Engine Spiders
   Content Theft
   Price Scraping
   Content/Price Aggregation
   Affiliates
   User Data Harvesting

6. Checkout Abuse
   Scalpers
   Spinners
   Inventory Exhaustion
   Snipers
   Discount Abuse

7. Credit Card Fraud
   Card Validation
   Card Cracking
   Card Fraud

8. User-Generated Content (UGC) Abuse
   Content Spammer

9. Account Takeover
   Credential Stuffing/Credential Cracking
   Account Creation
   Bonus Abuse

10. Ad Fraud
   Background to Internet Advertising
   Banner Fraud
   Click Fraud
   CPA Fraud
   Cookie Stuffing
   Affiliate Fraud
   Arbitrage Fraud

11. Monitors
   Availability
   Performance
   Other

12. Human-Triggered Automated Traffic

Part III. How to Effectively Handle Automated Traffic in Your Business

13. Identifying Automated Traffic
   Indications of an Automated Traffic Problem
   Challenges
   Generation 0: Genesis—robots.txt
   Generation 1: Simple Blocking—Blacklisting and Whitelisting
   Generation 2: Early Bot Identification—Symptom Monitoring
   Generation 3: Improved Bot Identification—Real User Validation
   Generation 4: Sophisticated Bot Identification—Behavioral Analysis

14. Managing Automated Traffic
   Blocking
   Validation Requests
   Alternative Servers/Caching
   Alternative Content

Conclusion

Introduction

Web traffic consists of more than just the human users who visit your site. In fact, recent reports show that human users are becoming a minority. The rest belongs to an ever-expanding group of traffic that can be grouped under the heading automated traffic.

Terminology
The terms automated traffic, bot traffic, and nonhuman traffic are equally common and are used interchangeably throughout this book.

As long ago as 2014, Incapsula estimated that human traffic accounted for as little as 39.5% of all the traffic they saw. This trend is predicted to continue, with Cisco estimating that automated traffic will grow by 37% year on year until 2022.

However, this is not simply growth in the quantity of automated traffic but also in its variety and sophistication. New paradigms for interaction with the internet, more complex business models and interdependence between sources of data, the evolution of shopping methods and habits, the increased sophistication of criminal activity, and the availability of cloud-based computing capacity are all converging to create an automated traffic environment that is ever more challenging for a website owner to control.

It's Not All Good or Bad
It is simplistic to think of automated traffic as being all goodies and baddies. The truth is much more nuanced than that. As we'll discuss, there are clear areas of good and bad traffic, but there is a gray area in between where you will need to assess the positives and negatives for your own situation.
This growth poses a number of fundamental questions for anyone with responsibility for maintaining the efficient operation or maximum profitability of a public-facing website:

• How much automated traffic is hitting my website?
• What is this traffic up to?
• How worried should I be about it?
• What can I do about it?

The rest of this book will help you understand how you can provide answers to these questions.

Terminology
The challenge of automated traffic applies to anyone who runs a public-facing web-based system, whether that is a traditional public website, a complex web-based application, a SaaS system, a web portal, or a web-based API. For simplicity I will use the generic term website when referring to any of these systems. Likewise, I will use website owner to refer to the range of people who will be responsible for identifying and managing this problem—from security and platform managers to ecommerce and marketing directors. I will use the term bot operator to identify the individual or group that is operating the automated traffic.

Generation 0: Genesis—robots.txt

The first generation of bot defense was the definition of a public policy, associated with a website, that defined how you would like bot traffic to behave. This is the robots.txt file that can be placed in the root directory of any website.

robots.txt
The robots.txt file was first proposed by Martijn Koster in 1994 after a web crawler caused an accidental denial-of-service attack on his server. It soon became a de facto standard.

robots.txt is a text file that contains a list of instructions telling automated traffic which files it shouldn't access. These instructions can be applied to all traffic or to traffic with a specified user agent. The majority of legitimate automated traffic (including all major search engines) observes robots.txt policies.

I would categorize this as Generation 0 because it is not a true form of bot defense: it has no means of enforcement and is simply a list of instructions that the requester may or may not choose to follow.

Generation 1: Simple Blocking—Blacklisting and Whitelisting

The first generation of actual bot identification focused on using the data that is made available within HTTP headers, such as the IP address and user agent, to create blacklists and whitelists of known "good bots" and "bad bots."

At its simplest level this is a purely reactive policy: reviewing logs and monitoring output to identify any unwanted activity, then adding the IP address or user agent associated with that activity to the relevant blacklist.

The reactive nature of this approach is limited, however, as it will always miss the first instance of a problem. Better practice, therefore, is to collect known blacklists and share them across multiple users, essentially crowdsourcing the knowledge of known good or bad actors. These shared blacklists are available as open source community projects or from commercial organizations. They formed the basis of most early bot defense services and still form an important part of them to this day.

However, this form of defense is easily bypassed. As mentioned, the user agent is simply a string that is defined by the client as part of the request, so any automated traffic can send whatever value it wants, including the same value sent by a legitimate browser. Most legitimate automated traffic will correctly identify itself using a user-agent string, although some operators (such as Google) will also send hidden requests to validate that the responses being sent to their search bot are the same as those being returned to human users.
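To make the mechanics concrete, here is a minimal sketch of a Generation 1 check, assuming hypothetical blacklist and whitelist contents; a real deployment would load shared, regularly updated lists rather than hardcoded values:

    # Generation 1: decide whether to serve a request using only
    # HTTP header data (IP address and user-agent string).
    IP_BLACKLIST = {"203.0.113.7", "198.51.100.23"}   # illustrative entries
    UA_BLACKLIST = {"BadBot/1.0"}
    UA_WHITELIST = {"Googlebot", "Bingbot"}           # known "good bots"

    def allow_request(ip_address: str, user_agent: str) -> bool:
        """Return True if the request passes the blacklist/whitelist check."""
        if any(good in user_agent for good in UA_WHITELIST):
            return True
        if ip_address in IP_BLACKLIST:
            return False
        if any(bad in user_agent for bad in UA_BLACKLIST):
            return False
        return True

As the surrounding discussion makes clear, a check like this is trivial for a determined bot operator to evade, since both inputs are under the client's control.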
The growth of cloud providers and other services that give access to wide ranges of rotating IP addresses, as well as the creation of botnets, has likewise made it much easier to send requests from a wide range of IP addresses. In this situation, simply blacklisting IP addresses can have unexpected consequences, as those addresses can turn out to be used in the future by legitimate users.

Generation 2: Early Bot Identification—Symptom Monitoring

To handle some of the limitations of early bot identification, a new approach was taken. This involved looking at the activity seen on the server and identifying symptoms that may be representative of bot activity.

A very simple example would be to say that a large number of requests from a single IP address is a symptom of bot activity. Therefore a rule should be applied saying that no session not identified on the whitelist of known "good bots" should be allowed to make more than x requests during a specified period, and any requests above that limit will be rejected (a sketch of this kind of threshold rule appears at the end of this section).

Symptoms can be related to server behavior, or can be obtained by augmenting the IP address or user agent data with additional information (e.g., the country the IP address is located within). Other examples of the sort of data that can be looked at include:

• Whether the IP address is located within a data center
• Whether the user agent relates to a recent version of browser software
• Whether users are downloading the full range of files (HTML, JavaScript, CSS, images, etc.) or just HTML files
• The regularity of the pattern of requests being made

The limitations of all these approaches are as follows:

• They provide no understanding of whether the automated traffic is good or bad, or what the intent of the bot operator is. This is reliant on the existing whitelists populated in simple blocking solutions.
• They do not validate that the user is not just a human who happens to be interacting in a similar manner to a bot.
• The settings being applied are easily identifiable, and therefore the bot operator can adjust their approach to bypass them. When looking at the interactions of more sophisticated bots, it is common to see them test systems to determine which mitigations are in place—for example, trying different rates of requests to check whether rate limiting is in place and, if it is, determining the optimal rate at which to make requests.
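Here is the minimal sketch promised above of such a threshold rule, using a sliding window of request timestamps; the window size and request limit are illustrative assumptions, not recommended values:

    import time
    from collections import defaultdict, deque

    WINDOW_SECONDS = 60    # illustrative values only
    MAX_REQUESTS = 100

    WHITELIST = {"Googlebot"}            # known "good bots"
    request_log = defaultdict(deque)     # client key -> recent request times

    def allow_request(client_key: str, user_agent: str) -> bool:
        """Reject clients that exceed the request threshold, unless whitelisted."""
        if any(good in user_agent for good in WHITELIST):
            return True
        now = time.time()
        timestamps = request_log[client_key]
        # Discard requests that have fallen outside the window
        while timestamps and now - timestamps[0] > WINDOW_SECONDS:
            timestamps.popleft()
        if len(timestamps) >= MAX_REQUESTS:
            return False                 # symptom threshold exceeded
        timestamps.append(now)
        return True

Note that a fixed threshold like this is exactly the kind of setting a sophisticated bot can probe for, and then stay just underneath.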
Generation 3: Improved Bot Identification—Real User Validation

The next generation of bot detection focused on putting methods in place to validate that there was a human involved, using a real browser, and to track responses from the same physical device regardless of variations in IP address and user agent. This generation was centered around two dominant approaches: real browser validation and fingerprinting.

Real Browser Validation

This is a group of techniques designed to ensure that the user is a human, using an actual browser, rather than an automated tool or a headless browser. It includes techniques such as the following:

Cookie injection
Injecting cookies into the response and validating that they exist in subsequent requests. This validates that the bot is not just executing a predefined series of requests.

JavaScript injection/honeypot traps
Injecting some JavaScript that will update cookies (or potentially other elements of an application) with the result of the JavaScript execution. This validates that the page has been processed by a platform capable of executing JavaScript.

Real-user tracking
Technology that validates that user interaction has occurred on the client side, by tracking mouse movements and other means.

Fingerprinting

Fingerprinting is the process of identifying a much wider range of attributes relating to the machine executing the request, and then recognizing when that same combination of attributes is associated with future requests. There are multiple ways of implementing this approach; one example is to use JavaScript to transmit a beacon, including the details of the fingerprint, to one or more remote services.

In this way, tracking can be put in place to see when people are potentially trying to hide their activity behind multiple IP addresses or anonymous proxy systems. This leads to more accuracy than relying on IP addresses alone. Some fingerprinting implementations identify hundreds of different attributes to make tracking more accurate.

This generation of improved bot detection made it a lot more difficult for unsophisticated bots to operate, requiring a higher level of sophistication to either execute requests from within an actual browser or employ sophisticated approaches to replicate operating within a browser. However, developments in browser automation, combined with low-cost cloud computing capacity on which to run such clients, have made this level of sophistication far more available to bot operators.
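The following is a minimal sketch of the server side of fingerprinting; the attribute list is an illustrative assumption, and real implementations collect far more signals (usually gathered by client-side JavaScript) before combining them into a single identifier:

    import hashlib

    # Illustrative set of client attributes; real fingerprints use many more.
    FINGERPRINT_KEYS = ["user_agent", "accept_language", "screen_resolution",
                        "timezone_offset", "installed_fonts"]

    def fingerprint(attributes: dict) -> str:
        """Combine client attributes into a single stable identifier."""
        material = "|".join(str(attributes.get(k, "")) for k in FINGERPRINT_KEYS)
        return hashlib.sha256(material.encode("utf-8")).hexdigest()

Because the identifier is derived from the device's characteristics rather than its network address, it survives IP rotation, which is precisely what makes it more accurate than IP-based tracking alone.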
Generation 4: Sophisticated Bot Identification—Behavioral Analysis

The identification of bots had advanced considerably by this time, but still faced two major problems:

• Bots were getting ever more sophisticated and were able to mimic human, browser-based interactions.
• The means of distinguishing between good and bad bots, as implemented by the bot detection tools of the time, was still a blunt tool, with little appreciation of how the definitions of good and bad vary from industry to industry and company to company.

The next generation of bot detection set out to address both of these limitations, and did so using behavioral analysis and machine learning.

This approach says that rather than looking only at the external data you have about a request, or the impact it is having on the server, you need to understand the activities that users are undertaking within your site. Only by understanding their behavior can you understand their intent. Once you understand the intent of the bot operator, you can effectively categorize automated traffic and take a much more nuanced management approach.

These machine learning, behavioral approaches will typically combine three levels of intelligence in order to make an effective assessment of whether users are automated and what their intent is. At the lowest level are patterns that can be applied universally across all websites. These are then augmented with patterns that are industry or bot-category specific, and lastly augmented again with site-specific patterns. All of these are then combined into a learning engine that is constantly evolving as new data, both human and automated, is processed.

The major difficulty in any effective bot identification process is that there is no foolproof way of determining, even after the event, exactly which requests are bot and which are human, especially when looking at the most sophisticated bots, which are actively trying to disguise their activity as human or even to integrate their activity with that of human users via botnets.

This means that as the industry has moved through the generations of bot detection, identification has increasingly become an assessment rather than a definitive identification. With machine learning this is even more the case, as essentially the analysis engine is forming an opinion such as, "Based on the range of evidence provided, and taking into account my experience of seeing all other users on this system, I would assess that this user is an automated process that is attempting to scrape content from your site."

For this reason, the leading bot detection systems will also provide an indication of the confidence they have in that assessment. This allows the site owner to vary the means of handling bot activity based on the confidence level, balancing the risk of the negative impact of bot activity against the risk of negatively affecting the experience of a legitimate human user.

The sophistication of modern automated traffic is such that only this generation of detection will be effective, especially as the bots making the most effort to stay hidden are the ones you are most likely to want to remove from your website.
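As a bridge to the next chapter, here is a minimal sketch of confidence-based handling; the thresholds and action names are illustrative assumptions rather than values from any particular product:

    def choose_action(bot_score: float) -> str:
        """Map a detector's confidence that a session is automated to an action.

        bot_score is assumed to be in [0.0, 1.0], where 1.0 means
        "certainly automated". The thresholds here are illustrative.
        """
        if bot_score >= 0.95:
            return "block"          # high confidence: reject outright
        if bot_score >= 0.70:
            return "challenge"      # medium confidence: ask the user to validate
        if bot_score >= 0.40:
            return "serve_cached"   # low confidence: reduce impact, avoid false positives
        return "allow"

Each of these actions (blocking, validation challenges, and serving cached or alternative content) is discussed in detail in the following chapter.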
CHAPTER 14
Managing Automated Traffic

This chapter will address the final step of the process. Once you have an understanding of the types of automated traffic, and you've successfully identified which ones are impacting your site, what action can you take against them?

As will be discussed in this chapter, there is a range of potential actions that can be taken against bots, and it is important that you select the most appropriate means of handling them while taking the following points into consideration:

• The impact of false positives (i.e., wrongly identifying a human user as a bot): Is there a way for a human user wrongly identified as a bot to resolve that error, and if so, could a bot use the same resolution? The consequences of allowing a bot through might be so negative that you are willing to risk a number of humans being impacted to preserve the security of your system.
• The ability of bot traffic to bypass that means of defense.
• The signal you want to give to the bot operator: Do you want them to be made aware that you have detected them, or will you choose instead to let them assume that their request was successful? Making them aware can have two disparate consequences: on the one hand, they might move on to someone else once they realize you are no longer an easy target; on the other, it might motivate them to increase the sophistication of their bot to try to bypass your defenses (this will largely depend on how targeted the particular attack is).

These considerations will allow you to define a bot-handling policy. Ideally this policy will vary depending on the type of bot traffic you are identifying, enabling you to take a different approach depending on the threat versus opportunity assessment of the bot identified.

Blocking

The obvious thing to do with any identified bot is just to drop the connection. This is common practice for firewalls and other security devices.

There are two ways in which blocking can be applied. The first option is to stop processing the request and return an HTTP error code (typically "403 Forbidden") to indicate to the requester why the request was not completed. The second option is to silently drop the request, closing the connection with no response returned.

Blocking is an effective policy when it comes to minimizing the impact of bot traffic and the potential risk of opening your system up to threats such as application-level DDoS, which is why it is usually employed by security devices. However, as part of a bot management strategy, blocking has some negative aspects.

The feedback given to users is very limited: they will see an error page with a limited (if any) explanation of why the error has occurred—in the worst case, this will just be a standard browser failed-connection error page. This is acceptable when you are confident that your identification of automated traffic is definitely accurate. If there is any element of uncertainty, and a chance that you may have incorrectly identified a human user, then an ideal policy will give the user a means of becoming aware of that mistake and rectifying it.

Blocking bots also sends an obvious message to bot operators that you have identified them and have put a management strategy into place. This can simply trigger the introduction of more sophisticated approaches to bypass your identification systems.
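A minimal sketch of the first blocking option, written as WSGI middleware, follows; the blacklist lookup is an illustrative stand-in for whatever detection logic is actually in use, and the second (silent-drop) option is typically implemented at the network layer rather than in application code:

    class BlockingMiddleware:
        """Reject requests from blacklisted IP addresses with a 403."""

        def __init__(self, app, blacklist):
            self.app = app
            self.blacklist = set(blacklist)   # stand-in for real detection logic

        def __call__(self, environ, start_response):
            client_ip = environ.get("REMOTE_ADDR", "")
            if client_ip in self.blacklist:
                # Option one: refuse the request with an explicit error code
                start_response("403 Forbidden", [("Content-Type", "text/plain")])
                return [b"Access denied."]
            return self.app(environ, start_response)

Wrapping an existing WSGI application is then a one-liner, for example: app = BlockingMiddleware(app, ["203.0.113.7"]).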
Validation Requests

Rather than simply blocking requests, a second approach, which removes the risk of false positives, is to replace the requested response with an alternative response that gives users the chance to identify themselves as human, and therefore as incorrectly identified automated traffic. Validation can be implemented offline or inline.

Offline Validation

One approach is to provide the user with a web-based form, allowing them to submit their details and information about themselves to a system administrator. The system administrator will then assess the details submitted and determine whether an update to the system rules is needed to allow further connections received from this user to access the system.

The limitation of an offline validation process is that it carries a large human overhead, but it does mean that human intelligence can be applied. This can be especially useful when managing large blacklists. A means of handling feedback about inaccurate bot identification is necessary, even if it is not provided via a dedicated web form.

Equally negative, from the human user's point of view, is that resolving the situation will be a slow process, meaning that this potential customer will likely not return to your website. Conversely, this is likely to be a deterrent to bot operators, as it is a more difficult process to bypass.

Inline Validation

Inline validation, by contrast, aims to give human users the ability to validate themselves as not being bots within the page that is returned, thereby gaining instant access to the system. The validation will typically involve carrying out an activity that cannot be completed by bot traffic. This usually means completing a CAPTCHA-type test.

CAPTCHA
CAPTCHA (Completely Automated Public Turing Test to tell Computers and Humans Apart) is a term that was coined in 2003 for tests designed to be passable by humans but beyond the ability of computers. CAPTCHA tests look to test the human intelligence abilities of variation recognition, segmentation, and understanding of context; to replicate these, a computer would have to use artificial intelligence.

The most common form of CAPTCHA displays a distorted string of characters and numbers and asks the human user to type in those characters. This type of test has been around since the late 1990s. Recently, artificial intelligence has advanced to the point that the majority of this type of CAPTCHA can now be solved by automated processes. This has led to a new generation of more advanced tests, such as answering questions about images (e.g., "Which one of these images contains a vehicle?") or completing tasks (e.g., orienting a picture so that it displays properly on the screen).

The advantages of this from a user-satisfaction point of view are evident. If users can identify themselves as human within a single page, they can instantly access the website they want, preventing the site from losing that customer. This means that a stricter approach can be taken, in which suspected bots are challenged.

However, it also means that it is easier for bot traffic to bypass the protection by identifying itself as human. This can be done in two ways: first, by using systems such as artificial intelligence to pass any tests that are put in place; and second, by using human users to pass the tests and allowing bot activity to continue on the back of this identification. There are third-party companies that provide banks of people who will pass CAPTCHA tests on demand in return for payment (typically fractions of a penny per successful CAPTCHA completion).
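The request flow for inline validation might look like the following minimal sketch; the in-memory session store and the trivial arithmetic challenge are illustrative stand-ins for a real CAPTCHA service:

    pending_challenges = {}   # session_id -> expected answer (illustrative store)

    def handle_request(session_id, answer=None, suspected_bot=False):
        """Serve a challenge page to suspect sessions; restore access on success."""
        if session_id in pending_challenges:
            if answer == pending_challenges[session_id]:
                del pending_challenges[session_id]   # validated: treat as human
                return "content"
            return "challenge_page"                  # wrong or missing answer
        if suspected_bot:
            pending_challenges[session_id] = "7"     # e.g., "What is 3 + 4?"
            return "challenge_page"
        return "content"

The key property is that a misidentified human loses only one page view: they answer the challenge and continue, rather than being locked out entirely.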
Alternative Servers/Caching

The bot management approach you employ will be determined by the problem that automated traffic is causing for you. In some cases there may be no requirement to stop the traffic, only to reduce the impact it is having on your infrastructure, and therefore on cost and performance for other users. In this case there are several approaches that can be employed to serve content to users suspected of being automated traffic, all of which look at identifying alternative sources for the data being requested.

One approach is simply to serve cached content to connections identified as automated. This could be from a caching server (e.g., a CDN) or from within your web application. The viability of this approach will depend on the nature of the data being requested and the impact of serving potentially stale data to legitimate users.

A second variation is to redirect all traffic identified as automated to a separate system (or a separate area of the main system), thereby ensuring minimal impact on human users from the overhead of handling large amounts of automated traffic.

This approach focuses on minimizing the impact on other users and on infrastructure costs—it makes no effort to mitigate any of the threats that may be present in automated traffic.

Alternative Content

The final approach that can be taken, and in some ways the most proactive way that actually takes the fight back to the bot operators, is to start serving alternative content to bot traffic.

The most obvious example would be serving alternative prices to traffic you have identified as competitor price scrapers or unapproved affiliates. This offers two benefits: first, the bot operators will be making business decisions based on inaccurate data; and second, it gives you the ability to track the data after it has been served. If you start to see the invalid content or data displayed on other systems, you can be confident they have been using your site as a source of data, and you can take action based on that knowledge.

Trap Street
In mapmaking this concept is known as a "trap street." Mapmakers would traditionally add fictitious items to their maps, meaning that if they saw such an item in any other map, the maker of that map had been copying theirs.

This approach can be risky: if users are misidentified, invalid content could be served to legitimate users. Past techniques for falsely improving search engine ranking included returning different content to search engine bots than that returned to real users, a tactic known as cloaking. Search engines were obviously not happy with this, as it undermined the quality of their search results. They therefore employed measures to detect cloaking (such as sending requests masquerading as non-search users and comparing the results) and penalize sites where cloaking is detected. Any alternative content-based strategy must therefore be careful not to have an unwanted negative SEO impact.
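A minimal sketch of decoy pricing, assuming a hypothetical upstream classifier has already flagged the session, is shown below; the offset is an illustrative choice, and, as noted above, misidentification means a real customer could see the decoy price:

    def price_for(real_price, suspected_scraper, decoy_factor=0.97):
        """Serve a slightly skewed decoy price to suspected scrapers.

        The skewed value doubles as a tracer: seeing it reproduced on a
        competitor's site confirms they scraped it from you.
        """
        if suspected_scraper:
            return round(real_price * decoy_factor, 2)
        return real_price
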
Conclusion

There is no doubt that automated traffic is a major factor on the modern internet, and estimates that it will continue to grow would seem to be accurate, as there are very large amounts of money involved—the value of ad fraud alone in 2015 was estimated at $6.3 billion.

Automated traffic activity is not conducted only by lone-wolf hackers—there are major, organized criminal groups dedicating large amounts of time and effort to targeting specific systems where they feel they can optimize revenue. They run large-scale automated activity scouring the internet for potential targets.

It is essential, then, that website owners take this problem seriously and start to get an understanding of the amount of automated traffic that is hitting their systems, and, much more importantly, what that traffic is doing. As has been shown in this book, there is a wide variety of different activities being undertaken for a wide variety of purposes, and only by understanding this can you assess the risk and opportunity this traffic presents.

However, identification of this traffic and its intent is complex, as automated traffic is actively trying to stay ahead of any defenses that have been put in place. Identifying automated traffic requires the latest generation of machine learning and behavioral-based analysis, and that process takes time and effort to do reliably.

Only by taking this seriously and ensuring that you take control of your traffic can you ensure that you maximize the value you get from automated traffic while minimizing the threat impact.

About the Author

Andy Still (@andy_still) is cofounder of Intechnica, a vendor-independent IT development and performance consultancy and creators of TrafficDefender, the industry-leading website traffic management tool, which specializes in the identification and management of automated traffic.

With over 15 years of experience in IT, Andy specializes in application architecture for high scalability and throughput. Andy currently works on building Intechnica's product range, which has successfully maintained service for a wide range of companies during peak events, and also advises Intechnica clients on how to build and maintain highly performant systems as well as how to optimize development processes and delivery.

Andy is the author of several O'Reilly books and is one of the organizers of the Web Performance Group North UK. He blogs irregularly at https://internetperformanceexpert.com.