Six Elements of Big Data Security

DOCUMENT INFORMATION

Structure

  • Cover

  • Table of Contents

  • Introduction

  • Chapter 1: Big Data Security Rationales

  • Chapter 2: Securing HeavyD

  • Chapter 3: How Does Big Data Change Security?

  • Chapter 4: Understanding Big Data Security Failures

  • Chapter 5: Framing the Big Data Security Challenge

  • Chapter 6: Six Elements of Big Data Security

Contents

Table of Contents

  • Introduction
  • Chapter 1: Big Data Security Rationales
      Finding Threats Faster Versus Trusting a Tool
      Big Data Potentially Can Change the Entire Architecture of Business and IT
  • Chapter 2: Securing HeavyD
      Why Big Data Security is Necessary
      Does Security Even Work?
  • Chapter 3: How Does Big Data Change Security?
      Frameworks and Distributions
      Shrink the Square Peg to Fit a Round Hole?
  • Chapter 4: Understanding Big Data Security Failures
      Scope of the Problem
      Can We Get Beyond CIA?
  • Chapter 5: Framing the Big Data Security Challenge
      Why Not Give Up and Wait?
      Can Privacy Help Us?
  • Chapter 6: Six Elements of Big Data Security
      Threat Model for a Hadoop Environment
      Six Elements
      Automation and Scale
      Bottom Line on Network and System Security
      Element 2: Data Protection
      Bottom Line on Data Protection
      Element 3: Vulnerability Management
      Bottom Line on Vulnerability Management
      Element 4: Access Control
      Bottom Line on Access Control
      Bottom Line on Policies
  • Conclusion

Introduction

You can't throw a stick these days without hitting a story about the future of Artificial Intelligence or Machine Learning. Many of those stories talk at a very high level about the ethics involved in giant automation systems. Should we worry about how we use newfound power in big data systems? While the abuse of tools is always interesting, behind the curtain lies another story that draws far less attention. These books are written with the notion that all tools can be used for good or bad, and ultimately what matters for engineers is to find a definition of reasonable measures of quality and reliability. Big data systems need a guide to be made safe, because ultimately they are a gateway to enhanced knowledge. When you think of the abuse that can be done with a calculator, looking across the vast landscape of fraud and corruption, imagine now if the calculator itself cannot be trusted. The faster a system can analyze data and provide a "correct" action or answer, the more competitive advantage there is to be harnessed in any industry. A complicated question emerges: how can we make automation tools reliable and predictable enough to be trusted with critical decisions?
The first book takes the reader through the foundations for engineering quality into big data systems. Although all technology follows a long arc with many dependencies, there are novel and interesting problems in big data that need special attention and solutions. This is similar to our book on "Securing the Virtual Environment," where we emphasize a new approach based on core principles of information security. The second book then takes the foundations and provides specific steps in six areas to architect, build, and assess big data systems. While industry rushes ahead to cross the bridges of data we are excitedly building, we might still have time to establish clear measurements of quality, as it relates to whether these bridges can be trusted.

Chapter 1: Big Data Security Rationales

This chapter aims to help security become an integral part of any big data systems discussion, whether it is before or after deployment. We all know security isn't embedded yet. Security is the exception to the deployment discussion, let alone the planning phase. "I'm sure someone is thinking about that" might end up being you. If you have a group talking about real security steps and delivery dates early in your big data deployment, you are likely the exception.

This gap between theory and reality exists partly because security practitioners lack perspective on why businesses are moving towards big data systems; the trained risk professionals are not at the table to think about how best to approach threats and vulnerabilities as technology is adopted. We faced a similar situation with cloud technology. The business jumped in early, occasionally bringing someone from the security community in to look around and scratch their head as to why this was even happening.

Definitions of a big data environment are not the point here (we'll get to that in a minute), although it's tempting to spend a lot of time on all the different descriptions and names that are floating around. That would be like debating what a cloud environment really is. The semantics and marketing are useful, yet ultimately not moving us along much in terms of engineering safety. Suffice it to say up front that this topic of security is about more than just a measure of data size; it is something less tangible, more sophisticated, and unknown in nature. We say data has become big because size matters to modes of operation, while really we also imply here a change in data rates and variations. In rough terms, the systems we ran in the past are like horses compared to these new steam engine discussions, so we need to take off our client-server cowboy hat and spurs in order to start thinking about the risks of trains and cars running on rails and roads. Together, the variables have become known as engines that run on 3V (Volume, Velocity, Variety), a triad which Gartner apparently coined first around 2001.

The rationale for security in this emerging world of 3V engines is really twofold. On the one hand, security is improved by running on 3V (you can't predict what you don't know); on the other hand, security has to protect 3V in order to ensure trust in these engines. Better security engines will result from 3V, assuming you can trust the 3V engines. Few things speak to this situation, faster and better risk knowledge from safe automation, than the Grover Shoe Factory disaster of 1905.

On the left you see the giant factory, almost an entire city block, before the disaster. On the right you see the factory and even neighboring buildings across the street turned into nothing more than rubble and ashes.
The background to this story comes from an automation technology rush. Around 1890 there were 100,000 boilers installed, as Americans could not wait to deploy steam engine technology throughout the country. During this great boom, in the years 1880 to 1890, over 2,000 boilers were known to have caused serious disasters. We are not just talking about factories in remote areas. Trains with their giant boilers up front were ugly, disfigured things that looked like Cthulhu himself was the engineer. Despite decades of death and destruction through the late 1800s, the Grover Shoe Factory still had a catastrophic explosion in 1905, with cascading failures that leveled its entire building, which burned to the ground with workers trapped inside.

This example helps illustrate why trusted 3V engines are at least as important as the performance benefits of a 3V engine, if not more so. Nobody wants to be the Grover Shoe Factory of big data, so that is why we look at the years before 1905 and ask how the rationale for security was presented. Who slept through safety class or ignored warnings when building a big data engine? We need to take the security issue very seriously, because these "engines" are being used for very important work, and security issues can have a cascading effect if not properly addressed. There is a clear relationship between the two sides of security: better knowledge from data and more trusted engines to process the data. I have found that most people in the security community are feverishly working on improving the former, generating lots of shoes as quickly as possible. The latter has mostly been left unexplored, leaving it unfinished or unknown how exactly to build a safe big data engine. That is why I am focusing primarily on the rationale for security in big data with a business perspective in mind, rather than just looking at security issues for security market or industry purposes.

Finding Threats Faster Versus Trusting a Tool

Don't get me wrong: there is much merit in the use of 3V systems for data collection in order to detect and respond to threats faster. The rationale is to use big data to improve the quality of security itself. Many people are actively working on better security paradigms and tools based on the availability of more data, which is being collected faster than ever before and in more detail. If you buy a modern security product, it is most likely running on a big data distribution; you could remove the fancy marketing material and slick interface and build one yourself. One might even argue this is just a natural evolution from the existing branches of detection, including IDS, SIEM, AV, and anti-spam. In all of these products, the collection and analysis of as much data as possible is justified by the need to address real threats and vulnerabilities more quickly.

Indeed, as the threat intelligence community progressed towards an overwhelming flow of shared data, it needed better tools. From collection and correlation to visualization to machine learning solutions, products have emerged that can sift through data and pull a better signal from the noise. For example, say three threat intelligence feeds carry versions of the same indicator of compromise that are only slightly altered, making it hard for humans to see the similarities; a big data engine can find these near-matches much more quickly. However, one wonders whether the engine itself is safe while it is being used to improve our knowledge of threats and vulnerabilities.
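As a rough illustration of that feed-matching idea, the sketch below compares indicators across feeds and flags near-duplicates. The feeds, indicator values, and the 0.9 similarity threshold are invented for the example; a real pipeline would normalize indicator types and score them at far larger scale.

    # Illustrative sketch (not from the book): flag near-duplicate indicators of
    # compromise (IOCs) across feeds. Exact matches are ignored; only near-misses
    # that a human would likely overlook are reported.
    from difflib import SequenceMatcher
    from itertools import combinations

    feeds = {
        "feed_a": ["evil-domain.example.com", "198.51.100.23"],
        "feed_b": ["evil-domaln.example.com", "203.0.113.77"],   # one character altered
        "feed_c": ["evil-domain.example.co", "198.51.100.23"],   # truncated ending
    }

    def similarity(a: str, b: str) -> float:
        """Ratio in [0, 1]; 1.0 means identical strings."""
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()

    # Compare every pair of indicators drawn from different feeds.
    for (name_a, iocs_a), (name_b, iocs_b) in combinations(feeds.items(), 2):
        for x in iocs_a:
            for y in iocs_b:
                score = similarity(x, y)
                if 0.9 <= score < 1.0:
                    print(f"{name_a}:{x} ~ {name_b}:{y} (score={score:.2f})")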
Big Data Potentially Can Change the Entire Architecture of Business and IT

It makes a lot of sense at face value that instead of doing analysis on multiple sources of information and disconnected warehouses, a centralized approach could be a faster path with better insights. The rationale for big data, meaning a centralized approach, can thus be business-driven, rather than driven by whatever reasons people had to keep the data separate, like privacy or accuracy.

Agriculture is an excellent example of how an industry can evolve with new technology. Replace the oxen with a tractor, and look how much more grain you have in the silos. Now consolidate silos with automobiles and elevators and measure again. Eventually we are reaching a world where every minute piece of data about inputs and outputs from a field could help improve the yield for the farmer. Fly a drone over orchards and collect thermal imagery that predicts crop yields or the need for water, fertilizer, and pesticides; these inexpensive bird's-eye views and collection systems are very attractive because they can significantly increase knowledge. Did the crop dusting work? Is one fertilizer more effective at less cost? Will almonds survive the drought? Answers to this myriad of business questions are increasingly being asked of big data systems. Drones can even collect soil data directly, not waiting for visuals or emissions from plants, predicting early in a season precisely what produce yields might look like at the end. Robots can roam amongst cattle with thermal monitors to assess health and report back like spies on the range. Predictive analysis using remote feeds from distributed areas, which is changing the whole business of agriculture and risk management, depends on 3V engines running reliably.

Today the traditional engines of agriculture (diesel-powered tractors) are being set up to monitor data constantly and provide feedback to both growers and their suppliers. In this context, there is so much money on the line, with entire markets depending on accurate prediction, that everyone has to trust the data environment is safe against compromise or tampering. A compromise in the new field of 3V engines is not always obvious or absolute. When growers upload their data into a supplier's system, such as a seed company, that data suddenly may be targeted by investors who want advance knowledge of yields so they can game the market. A central collection system would know crucial details about market-wide supply changes long before the food is harvested. Imagine having just one giant boiler in a factory, potentially failing and setting the whole business on fire, rather than having multiple redundant engines where one can be shut down at the first sign of trouble.

Chapter 6: Six Elements of Big Data Security

Element 2: Data Protection

Best practices in data protection may be translated most quickly from prior technology to where big data is going. I say this because protecting data is the centerpiece of the new environments. Although amassing more information has a resilience of its own, when "enough" data is corrupted or lost it destroys the value and forces all the data to be refreshed. The more refresh required, the slower everything goes. Already there are many discussions on how to reduce load times for data because of a failure to keep the data clean or safe, which is really a new chapter of the discussion on backup and recovery. In addition, there are many discussions about compression, which are leading us towards new encryption and key management ideas.
Concerns about stored data have been the foundation of data protection in the enterprise (encryption of data transmission inside "private" networks, other than credentials, is not yet pervasive), yet data in transit and in process becomes far more relevant with emerging technology. Instead of a spinning disk, many environments run entirely on flash without long-term storage considerations. In some cases, data nodes of Hadoop will run without local storage at all. This, like the old thin-client models, does not mean stored data is no longer a concern. Rather, it highlights that transit and process protections finally will be elevated to equal status in importance for managing risk.

Interestingly, we're seeing adversaries use a persistence model similar to this. They embed malware in the orchestration or administration servers and re-infect nodes as soon as they come online. It's a useful model for large distributed networks that require low-touch node provisioning. However, it also is being exploited to make it extremely difficult to determine whether you've eliminated the risk of compromise. If the malware resides only in memory and only on remote servers, then wiping and cycling nodes after infection just restarts the process.

A lack of protection in transit would mean results could not be trusted. Likewise, for processing, results would be worthless if tampering were found. Think about threat models for the largest environments doing the most complicated analytics. There surely will still be attackers focused on a subset or some small percentage of data; this is like someone wanting to steal private information out of a Facebook profile. There also will be attackers who want to change the results of the analytics performed on data. This means leaving the data in place and finding ways to poison it or manipulate the analysis being done on it. One example that comes to mind is huge investment firms doing market analytics. There are attackers who want to break in and steal secrets, such as which trade is next, so they can try to gain insider information. This is the simple evildoer profile often characterized in the press. However, the emerging area of threat is an investment firm being fed bad data out of its own systems to manipulate its movements and benefit whoever hired the attackers. A far more damaging threat, manipulation of data in transit or in process leads the data owners to stop trusting their own systems and staff. Unlike attacks on data at rest, which tend to leave audit trails and allow current data to be compared with other data at rest, data in transit and in process has far less chance of being measured after the fact for accuracy.
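One hedged illustration of making tampering in transit detectable, rather than a description of any Hadoop-native mechanism: attach a message authentication code to each record at the producer and verify it at the consumer. The shared key, field names, and values below are placeholders; a real deployment would pull keys from a key-management service.

    # A minimal sketch of tamper evidence for records in transit, assuming both
    # ends share a secret key (key management is deliberately simplified here).
    import hmac, hashlib, json

    SECRET_KEY = b"example-shared-secret"   # hypothetical; use a real KMS in practice

    def sign(record: dict) -> dict:
        payload = json.dumps(record, sort_keys=True).encode()
        tag = hmac.new(SECRET_KEY, payload, hashlib.sha256).hexdigest()
        return {"record": record, "mac": tag}

    def verify(envelope: dict) -> bool:
        payload = json.dumps(envelope["record"], sort_keys=True).encode()
        expected = hmac.new(SECRET_KEY, payload, hashlib.sha256).hexdigest()
        return hmac.compare_digest(expected, envelope["mac"])

    msg = sign({"field_id": 42, "yield_estimate": 8.7})
    msg["record"]["yield_estimate"] = 12.0   # tampering in transit
    print(verify(msg))                       # False: the change is detectable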
Encryption, mentioned briefly earlier, is really where things change in data protection. Big data environments have been loath to add encryption because of performance concerns. Some current products offering solutions play down the fact that they can drop performance in Hadoop clusters by as much as 30%. That is a shocking number to anyone expecting a performance boost from big data analytics; it's like building a new freeway to reduce wait times, and then telling people there need to be huge speed bumps for safety. In reality, encryption tends to have the greatest impact at the highest layers, where it provides the most protection. Running it at block level or even file level will be less performance-intensive, although it also covers the fewest risks. The further up the layers you go, the less you can take advantage of parallelism and other benefits of distributed architectures, because you are intentionally reducing the number of devices and processes that can see into the data.

Lowest-layer methods, with all their performance advantages, thus can be explained roughly in two parts: someone who directly accesses (without following process) or steals a physical storage device gets nothing but encrypted data, while someone who accesses that storage device through normal process or transmission paths gets the unencrypted data. So the highest layer, user-access-level encryption, is where we really should be headed, without losing too much of the lower-layer performance benefits. It is fairly easy to see why the highest layer brings trust to big data environments. Protection against root access on nodes in a cluster, mentioned above in terms of threat models, can be achieved through the encryption of RPC, block transfer protocols, MapReduce files, and HDFS.

Advances in one best practice tend to be interrelated and impact the other best practices. Fortunately, when it comes to big data protection, the pressure to improve compression opens the door to encryption. A direct financial benefit from reduced storage cost and increased performance means a compression codec already is supported in Hadoop. The community is looking at ways to use the data transformation position of compression and add encryption with keys as a simple path to also protect data (https://issues.apache.org/jira/browse/HADOOP-9331). Key distribution management and transparency of the encryption then become essential to the success of data protection.
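To make the "highest layer" idea concrete, here is a small sketch of application-level field encryption performed before data ever lands in the cluster, using the third-party Python `cryptography` package. The record fields are hypothetical, and key storage, rotation, and distribution, the hard parts just mentioned, are deliberately out of scope.

    # A rough sketch of application-level ("highest layer") field encryption
    # before data lands in a cluster. Only the sensitive field is encrypted;
    # lower layers and other processes only ever see ciphertext for that field.
    from cryptography.fernet import Fernet

    key = Fernet.generate_key()   # in practice this would come from a key-management service
    fernet = Fernet(key)

    record = {"patient_id": "12345", "reading": 98.6}
    record["patient_id"] = fernet.encrypt(record["patient_id"].encode()).decode()

    # ... record is stored or processed in the cluster ...

    original = fernet.decrypt(record["patient_id"].encode()).decode()
    print(original)   # "12345", recoverable only by holders of the key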
So much of distributed systems work has been done without encryption that I worry people take it for granted and consider it non-essential. We know it to be the best method to protect against loss or tampering. The pressure for cryptographic solutions has been mounting, not just within the context of big data environments for enterprise workloads, but also for individual nodes that have owners; these include devices such as phones, cameras, and PCs that are generating much of the data being collected. Digital currency, for example, is a widely distributed task with high-value data that is prone to theft. Another example, although slightly tangential, is data acquisition systems or sensors. Imagine a project meant to detect imminent signs of human rights abuse around the world by doing a real-time analysis of information uploaded from any kind of sensor. A really old example of this was the CIA and US Air Force HEXAGON project in 1963, which combined a wide-area project (CORONA) with high resolution (GAMBIT) to start taking 150,000 feet of film (200 inches per second) and bring intelligence analysts pictures with a resolution of a couple of feet from over 150 miles. Fast forward 50 years, and temperature, particle, sound, chemical, light, pressure, and many other inexpensive sensor types, more commonly known as smartphones, are deployed and taking a "picture" of things happening across a far greater area at far greater resolution. If this data is stored centrally, what kind of intelligence analysis can be performed, and by whom?

Sensors are the future of healthcare research and diagnosis for this reason. Without much stretch of the imagination, we soon should expect our personal devices to be instrumental in individual-level diagnosis based on group-level analysis; these devices will report more data, with more detail, and more often to data centers and care providers. I worked on some versions of this healthcare sensor and data protection model decades ago, as I mentioned in an earlier chapter. A hospital supported very remote locations connected to a radiology department. When a small girl was injured and rushed into a rural satellite office, we were able to remotely acquire images and upload them via VPN to our central servers. We then gave remote desktop access to a radiologist at home who could make a diagnosis immediately. It was a clumsy, complex, and expensive system justified by the promise of saving lives or preventing injury in more remote locations. Systems of that capability now are so much less expensive that even first responders can have commodity hand-held devices with applications on them, making a diagnosis based on network-based data analysis. And of course, with tens of millions of cameras making assessments of health, an entirely new model for data protection emerges.

Our past enterprise approaches have focused very much upon devices themselves, their connections, and the data acquired from them. This is important. Imagine each compute device not alone as an asset, but rather as a grain of sand, as discussed earlier, with a kitten walking across. Or perhaps here we should talk of a snowflake among others. Each snowflake, like the venerable PC or server, tells you something about the environment and has data to protect. Protecting a snowflake is fairly straightforward because you factor threats and build controls to regulate things like temperature or wind that would damage the information contained in each one. That is today's model. As more and more snowflakes appear, the security of each one quickly grows into an operational challenge beyond controlling them individually. Eventually, far before we reach advanced concepts of snow, we must start to look at more general and environmental levels of protection. How do we change the temperature for all the snowflakes we are responsible for and manage the threats of open spaces? This transition in thinking brings an entirely new perspective. The shift is like the old saying, "Seeing the forest through the trees." Watch the leaves blow, see the trunks bend, and describe what you really are "seeing."

Consider snowflakes in terms of a footprint pressed into them. Each time people move through the Internet of snowflakes, they may create an impression that tells us a lot, even though they have no idea they're being traced. Protecting meta information tends to show up in the news quite a lot for this reason. Who can see the imprints on tens of thousands of systems? What do the imprints really tell anyone? Can the imprints be masked by generating more snowflakes, which would protect snowflakes per our original security models, while removing the trace of impressions?
The questions are not meant to be theoretical. Write-Once Read Many (WORM) compliance standards are starting to take notice of the need for immutability in big data environments. When I worked with investment banks, they had to answer to the Securities and Exchange Commission (SEC) rule 17a-4(f) and the CFTC (Commodity Futures Trading Commission); our transactions and trades were expected to be held for a specified time in an immutable store. The idea was that investigations would be helped if we kept a tamper-proof record for them to review. Even these requirements and past efforts are challenged, however, if someone in big data systems is able to write a file system call that races and beats other file system calls (a real vulnerability found in Hadoop). Data protection after processing opens the door to a race condition where one call can wipe out or replace the evidence or results of another call, essentially letting someone tamper with results. Protecting data against threats becomes far more than just ensuring each system is healthy. We are evolving to where large environmental considerations are a reality, and we need to think of our controls in terms of personal or human safety as much as computer security.

Bottom Line on Data Protection

Key points:

  • Storage: Stored data has to account for unauthorized access attempts, by using encryption or tokens to minimize risk.

  • Transmission: Ideally, encryption would be implemented across all networks. At a minimum, all communication with the big data environment must be encrypted.

  • Automation and Scale: Rather than focus on individual events, a wider picture from all nodes can reveal patterns of compromise, impressions from attacks, or traces of adversaries.

Element 3: Vulnerability Management

"When we were children, we used to think that when we were grown up we would no longer be vulnerable. But to grow up is to accept vulnerability. To be alive is to be vulnerable." —Madeleine L'Engle

A simple way to focus on the broad topic of managing vulnerabilities is to break it into sections. First, specific controls for systems and applications need to prevent vulnerabilities from being exploited. This is rarely called inoculation, but with big data systems it could actually be a good way to describe new controls. Secondly, use detection systems to catch vulnerabilities and mitigate them as quickly as possible.

In order to execute the first part of the process, there are several sub-steps. It is essential to have a complete inventory of assets to begin with, meaning an inventory that includes a wide scope of all systems connecting into the big data environment, not only the infrastructure itself. A vulnerability management program needs classification of everything discovered to prioritize where to focus resources. Next, the data flow and connectivity of those systems should be mapped to understand their relationships. A system may be vulnerable, but its exposure to other systems is a major factor in deciding the priority of management. Next, a vulnerability assessment based on data collected in the first two steps should be done to identify where to focus remediation and build specific controls. These steps may seem familiar to anyone used to looking at quality assurance or IT management process charts. They are very similar to a typical "plan-do-check-act" Deming circle. The idea behind management, regardless of the vulnerability specifics, is to discover and understand the problems before rolling out fixes and generating reports.
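A toy sketch of how those sub-steps might feed a priority score, where exposure to other systems weighs as heavily as raw severity. The assets, classifications, weights, and formula are all invented for illustration, not taken from the book.

    # A toy sketch of the inventory -> classify -> map -> assess flow described
    # above. Exposure (connectivity) drives priority, not severity alone.
    assets = [
        {"name": "namenode-1", "class": "critical", "severity": 6.5, "exposed_to": 40},
        {"name": "edge-gw-2",  "class": "standard", "severity": 9.1, "exposed_to": 3},
        {"name": "worker-117", "class": "standard", "severity": 4.0, "exposed_to": 1},
    ]

    CLASS_WEIGHT = {"critical": 2.0, "standard": 1.0}

    def priority(asset: dict) -> float:
        # Severity scaled by how widely the asset is connected and how it is classified.
        return asset["severity"] * CLASS_WEIGHT[asset["class"]] * (1 + asset["exposed_to"] / 10)

    for a in sorted(assets, key=priority, reverse=True):
        print(f'{a["name"]:<12} priority={priority(a):.1f}')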
The part that is different with big data is the odd interrelationship of the many moving parts within a complete environment. Hadoop is really made up of numerous open source projects on different release schedules, with complex dependencies that are not easy to isolate or break apart. How soon can you migrate off an old version of Java, for example, when one or two components still require it to function? This has proved to be a very difficult problem to solve because Hadoop was not designed with any kind of rolling upgrade model or method to ensure vulnerabilities could be remediated across the various parts.

So what does patching look like in Hadoop? It essentially becomes a sliding window over different product levels, often open source community-driven projects, with dependencies on each other. Some distributions offer a value proposition in reducing the patching and configuration options, thereby ironing out vulnerabilities and testing before releasing products. Yet this also can introduce delays compared with grabbing your own code and creating fixes or building a patch specific to your environment. And then there also is the need to consider validating the integrity of patches, which really doesn't change much from existing best practices.

In a Hadoop environment, a detection system is where practices are most likely to change. No one will want to run performance-degrading anti-virus software on their nodes to protect against vulnerabilities. It also is questionable whether a blacklist approach is appropriate, given that the nodes have very routine profiles; there should be very little user interaction or randomness. Given the fairly regular state of nodes, why not make them all whitelist? They would not start if they did not pass a vulnerability management integrity check.
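A minimal sketch of that allowlist idea, assuming a node refuses to bring up services unless its binaries match known-good digests. The paths and digests below are placeholders, not real Hadoop values.

    # A minimal sketch of the allowlist idea: a node only starts if the files it
    # will run match known-good hashes. Paths and digests here are placeholders.
    import hashlib, sys

    KNOWN_GOOD = {
        "/opt/hadoop/bin/hdfs": "0f2a...e9",   # placeholder digests
        "/opt/hadoop/bin/yarn": "9c41...7b",
    }

    def sha256(path: str) -> str:
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(65536), b""):
                h.update(chunk)
        return h.hexdigest()

    def integrity_ok() -> bool:
        try:
            return all(sha256(p) == digest for p, digest in KNOWN_GOOD.items())
        except OSError:   # a missing file counts as a failed check
            return False

    if not integrity_ok():
        sys.exit("integrity check failed: refusing to start node services")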
An even more stateless and dynamic big data environment, where nodes have no local storage, also could mean pushing detection onto a centralized storage area. Scanning for vulnerabilities can be more efficient if done once on the storage for all the distributed nodes. Naturally, after any kind of vulnerability is found, the question comes up whether anyone knows what to do about it. Assuming the four sub-steps mentioned above are followed, the inventory can be used to generate a policy that states how fast remediation is required and what steps to take. The policy also might state under what conditions an exception could be made. False positives are a good example of a necessary exception.

Bottom Line on Vulnerability Management

Key points:

  • Protection: Enable a sensor or agent to send feedback on malicious software, based on known-bad. Ensure removal, not just detection.

  • Regular updates: Detect "known" types of malware, including rootkits, remote administration, and surveillance tools.

  • Management: Monitor for evolving threats and keep a detailed log with an analysis of any infections.

Element 4: Access Control

"Every wall is a door." —Ralph Waldo Emerson

The heart of big data security today is controlling access to data, which hopefully seems a little bit obvious. More often than not, a discussion about how someone expects to keep data safe will fall back to a description of a perimeter and reasons why someone cannot or should not be able to get past it without appropriate authority. You probably noticed that I did it when discussing network security earlier in this chapter. You also may have noticed it in the data protection section. That's because if you do one thing right, and only one thing, it should be the careful management of identities and access, so that the other best practices are more useful.

You will see in the next section why access control is such a cornerstone of big data security. Spoiler alert: without proper access control, even assuming identity management is in place, you can do only limited monitoring and response. Monitoring depends heavily upon being able to gather intelligence and make informed decisions. A lack of detail makes intelligence gathering less effective, which of course reduces the information available to make decisions. Sound like surveillance? That is exactly what it is, and access controls are an essential ingredient of monitoring, knowledge, or any of the other terms we say instead of just surveillance.

It is tempting to transfer enterprise access control theory straight into the big data realm. There are many reasons why it tempts me. If we believe environments are similar to the enterprise, then we can apply surveillance easily. Want to know who has access to what? No problem, you own the environment outright, so go take a look whenever you want. Move that environment into a hosting provider or up to public cloud, and suddenly the plot thickens. Do you get to know who has access, really, when you ask someone else to tell you? Do you trust them to tell you what you need to know or what you want to know?
Here's a perfect example: a client asked for a physical data center assessment. Expecting to see a moderate to small footprint, I instead was walked through the massive big data clusters of another customer because, simply, they were in the way due to rapid growth. When that other customer asked the data center provider who has access, I guarantee the data center did not tell them I was given access.

Sometimes this game can be annoying to IT operations staff. They will protest with reasons why people would not do this. It is a fair point, and I don't mind trying to argue against them or prove the risks. However, it raises a more fundamental problem about big data environments. Much of the work is being delegated, and system or process impersonation is not only common, it is expected. If I initiate a job with hundreds of tasks that have thousands of subtasks, the potential for abuse of access takes on a very different dimension than the world we are used to. Say, for example, a configuration file requires a delegated task to stop running after seven days. Who has access to that configuration file? What if the task itself can be set up to change the configuration file and give itself extended life, or in other words grant itself additional access?

Rather than try to make big data systems fit the enterprise access models, which is clearly a non-starter, I instead want to point out how that has been tried and failed, and how different and new ideas are better. The original team that developed Hadoop used some interesting assumptions about security inside the Yahoo! environment. The original Hadoop security paper assumed an enterprise solution for access controls as the basis for the initial control architecture. The reality is that Kerberos does not scale well and was designed for an entirely different threat model than is found in today's big data environments. Moreover, it is widely disliked for these reasons as well as other usability issues. Every big data access control meeting I have ever been to inevitably has at least one person who stops everyone to say, "Talking about building on what we have is great, but all I really want to know is, when can we get rid of Kerberos?"

Bottom Line on Access Control

Key points:

  • Restrict by role: Unique user and service accounts must be used, replacing generic and shared accounts, for authorization. Use deny unless allowed (see the sketch below).

  • Authentication: Require multi-factor authentication for remote access and enforce strong passwords.

  • Restrict physical access: Monitor for evolving threats and keep a detailed log with analysis of any infections.
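A small sketch of the "deny unless allowed" key point: access is granted only when an explicit grant exists for the role, action, and resource, and everything else is refused by default. The roles and grants are made up for illustration.

    # "Deny unless allowed": anything not explicitly granted is refused.
    GRANTS = {
        ("analyst",    "read",  "sales_events"),
        ("ingest_svc", "write", "raw_landing"),
        ("admin",      "read",  "audit_log"),
    }

    def is_allowed(role: str, action: str, resource: str) -> bool:
        # Default deny: only an exact (role, action, resource) match passes.
        return (role, action, resource) in GRANTS

    print(is_allowed("analyst", "read", "sales_events"))   # True
    print(is_allowed("analyst", "write", "sales_events"))  # False: no grant, so denied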
Element 5: Monitoring

"You can observe a lot just by watching." —Yogi Berra

The funny thing about monitoring is that it is balanced against, which sounds better than opposed to, the other best practices. All that work you have done to lock down data as it is transmitted, processed, and stored should prevent anyone from seeing it. Yet you obviously need to know that the controls are working, and you need to measure controls over time if you want to know trends. Given that big data becomes such a challenge in terms of the four control areas discussed so far, it seems logical for monitoring to take up the difference. Applying more monitoring, coupled with rapid response, would provide security without the need for the static controls we have been relying upon for decades.

The most secure environments, as I often say, are those with the fewest inconveniences. Think about a dangerous conflict zone and you may have in your mind checkpoints that force you to stop, identify yourself, and prove your purpose. This is the least secure environment, and so it requires a lot of controls, inconvenient and necessary, in order to establish and build trust within pockets of security. The most secure environment, in stark contrast, is open and free. People move about freely exchanging goods and services. A major factor in moving from the former to the latter is having a robust monitoring system. If you want to downplay the first four best practices and have an open and freely operating environment, then monitoring is your friend.

That being said, you might be wondering why everyone doesn't just implement the best monitoring possible and call it a day, forgetting the other controls. The simple answer is that monitoring has a negative impact of its own. Call it surveillance and it should be clear why. Once you put a camera in every home, people stop behaving freely. And so the secret to the effectiveness of monitoring lies in the word "home," because that is an example of a modern perimeter (the kind that will never go away). There will always be a social norm, rather than a technical one, that limits monitoring to preserve security as much as create it. Monitor up to a point and then depend on responsiveness for the parts that will not be monitored. You may see how the idea of a home perimeter significantly changes with companies like Google selling thermostats, or people sleeping next to their Apple smartphone.

Within the enterprise there were very, very few limits to monitoring. Creating a policy (the next best practice) that notified everyone that everything was open to surveillance was not impossible. Usually someone would argue, correctly, that innovation suffers when employees think they are constantly being monitored. And someone else would argue, correctly, that some employees feel more secure when they are constantly being monitored. Occasionally someone would suggest that Henry Ford invented the assembly line and used monitoring to increase output, to which I always had to interject and explain how Ford copied the British, who copied the Dutch, as explained in my last book.

These debates will be different in big data, albeit based on the very same concerns. We have mathematical proofs put on the table that show a compute node has lower performance as soon as you turn on monitoring. Enable logging on the task tracker, because you want to be able to respond quickly and stop bad tasks, and you run the risk of slowing down all tasks to an unacceptable level. Big data architects, due to the impact of monitoring, see it more like the checkpoint in a hostile zone than an invisible friend silently keeping neighborhoods clear of crime.

It is a challenge to find the right balance of monitoring within big data. One approach is to focus on external points of access. Assuming you have a good idea of where data flows and repositories exist, then monitoring can be increased at strategic points and left out of the "data bedroom." Unfortunately, if we put too much emphasis on this approach, we are liable to break the model we have been hoping to achieve. We want the quality of monitoring to improve everywhere so we can remove the need for static and slow documentation of inputs and outputs.
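As one hedged illustration of watching the fleet rather than policing every node, the sketch below flags nodes whose outbound traffic deviates strongly from the rest. The metric, numbers, and three-sigma threshold are invented for the example.

    # Flag nodes whose outbound traffic is a statistical outlier for the fleet.
    from statistics import mean, pstdev

    outbound_mb = {"node-%03d" % i: 120 + (i % 7) for i in range(60)}
    outbound_mb["node-042"] = 950          # one node behaving very differently

    values = list(outbound_mb.values())
    mu, sigma = mean(values), pstdev(values)

    suspects = [n for n, v in outbound_mb.items() if sigma and abs(v - mu) / sigma > 3]
    print(suspects)   # ['node-042']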
The best solution to the balance of monitoring with other controls seems to be emerging in two areas. First, as described in the last section with snowflakes, we can generate knowledge from meta-data views of behavior. Accumulating enough information at a small and somewhat unreliable level (admitting that commodity sensors tend to be faulty) can be useful if an appropriate amount of data is acquired. This becomes big data technology turned on itself, which is a fast-growing field full of many vendors. It has much promise, and solutions on the market continue to evolve. The reality today is that these first types of solutions are challenged to maintain anonymity or privacy. They hoover up as much as possible to present a coherent picture of what is happening across a huge scope of systems. This new level of visibility, in theory, lessens the need for monitoring at any one node or even job level. We have seen much progress in this area with regard to public cloud providers. They advertise services that say you don't need to keep track of things your devices are doing at the device level because, just by doing it on their service, everything is recorded. Getting that to function reliably is a challenge in itself and an emerging best practice. Add in a requirement of privacy, or even a simple request like "delete this event," and you will likely see engineers either give a blank stare or say things like preserving individual rights will destroy everything they have worked so hard to build (e.g., a permanent record of you that gives sometimes reliable insights, which you have no control over).

That brings us to a second approach to this problem. Protecting privacy in neighborhoods of compute nodes, while still achieving surveillance, means developing something like a neighborhood watch program instead of hiding secret police in every home. Getting the nodes to monitor at a convenient level, keeping things to themselves, and only reporting risk when probed seems like the more logical approach over the long term. When I say nodes could keep things to themselves, I really mean access control rather than local storage. Sensors today have relatively limited or no storage at all. They push their information to a central service or more long-term repositories. Some Hadoop environments even are architected from the start with no local storage at all on any of the nodes (spoiler alert: centralized storage with fast networks sometimes can outperform local storage). The neighborhood watch also gives an interesting new option: have nodes report to each other and deal with problems on their own. Imagine a group of 1,000 nodes no longer communicating with some percentage of 10,000 nodes because of clear indications of compromise. I am using a neighborhood analogy, but there are many others that fit this model. The human brain is not involved in healing every scrape or scratch of human skin, for example.

Monitoring big data environments adds even more technical complexity and ethics dilemmas to an already fascinating aspect of best practices in security. The reality is we know generally the direction that could work out best, yet we are not yet seeing the kind of demand necessary to really drive the market towards smart solutions. It would be a shame if we build only centralized surveillance for performance and cost reasons, push it top-down into every single device everywhere, and then discover we could have architected a better, more scalable, and secure solution that also preserves balance.

Bottom Line on Monitoring

Key points:

  • Network: Record transactions as well as flows, with packet capture to replay suspect activities.

  • Systems: Automate records that track user actions, filesystems, and services.

  • Immutability: Verify that network and system records are protected from destruction or tampering (see the sketch below).
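A bare-bones sketch of that immutability point: chain each record's hash to the previous one so that editing or deleting an earlier entry is detectable. This is only an illustration, not a compliance-grade WORM store, and the log entries are hypothetical.

    # Tamper-evident logging: each entry's hash covers the previous hash, so
    # rewriting history breaks the chain.
    import hashlib, json

    def chain(entries):
        prev, out = "0" * 64, []
        for e in entries:
            digest = hashlib.sha256((prev + json.dumps(e, sort_keys=True)).encode()).hexdigest()
            out.append({"entry": e, "prev": prev, "hash": digest})
            prev = digest
        return out

    def verify(log):
        prev = "0" * 64
        for item in log:
            expected = hashlib.sha256((prev + json.dumps(item["entry"], sort_keys=True)).encode()).hexdigest()
            if item["prev"] != prev or item["hash"] != expected:
                return False
            prev = item["hash"]
        return True

    log = chain([{"user": "svc_etl", "action": "read", "path": "/data/trades"},
                 {"user": "admin7", "action": "delete", "path": "/data/trades"}])
    log[1]["entry"]["action"] = "read"     # attempt to rewrite history
    print(verify(log))                     # False: tampering is detectable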
Element 6: Policies

"A people that values its privileges above its principles soon loses both." —Dwight D. Eisenhower

Last but not least are policies, the painted lines of big data. My favorite quote on this topic is from the 2014 CeBIT trade show in Germany, where the chairman of Volkswagen Group said, "The car must not become a data monster" (http://recode.net/2014/03/09/volkwsagen-big-data-doesnt-have-to-mean-big-brother/). We must always remember that until computers really think original thoughts, such as those predicted for decades by artificial intelligence experts, all computers are just following human directions. And humans create code for each other to follow in the form of policies. If I write down that you must stop at a red sign, it really is like coding for humans. Significantly different from computers, however, is the fact that humans often have original thoughts and behave unpredictably. Yet we still write policies for the humans to follow, and we adjust the policies as well as develop monitoring and response controls to deal with non-conformity.

There is a whole lot to discuss when it comes to policy best practices, but it is more about social contracts than technical limitations. What we really are trying to do is "pave cow paths" where possible, as long as the paths are not unreasonable or introducing harm. Writing rules based on patterns of behavior for humans who interact with big data should be fairly simple, once we establish norms of good behavior. An administrator-level role, for example, should never be able to touch data without leaving a trace of that touch. Nonetheless, I have seen some disappointing errors even in this space, where big data environments are operated with weak policies and leave no trace of compromise by evil administrators. Reliability of systems is sometimes so over-emphasized in big data that access may ironically be greater than necessary. The engineers are expected to never abuse their access because abuse is narrowly defined as an outage, which would be easily detected. In that sense, slowing down an engineer with a privacy-oriented policy that limits authority means potentially slowing down a fix that would restore services. Some organizations prioritize access in this manner because they have written poor policies on data protection and access controls. Their policies benefit them in terms of service levels and reputation while potentially allowing or even introducing harm to data owners.

Another important consideration is how best to provide awareness of policies. Through the mid-1980s, due to some high-profile attacks by teenagers in Milwaukee, the US government was worried about unauthorized access to systems, so it passed a law against "computer fraud" (CFAA). Within ten years, there were requirements to have official warning banners on systems. Even the IRS, in publication 1075, says system use notification (AC-8) is required, and it gives an exhibit (8) of examples of what to say to anyone who tries to access resources. The problem with the warning banner approach is easy to see in the commercial sector. Companies have moved away from warnings because they tend to generate a "hostile" work environment. Government agencies can tell people to follow strict rules. Yet fast-moving big data startups tend to open their doors to the street, install ping-pong tables, and have meetups with pizza and beer;
naturally this transfers to their systems, and they also tend not to lace user interfaces with ominous warnings about arrest and prosecution for abuse. I have had more than one C-level executive tell me banners were never going to happen because innovation drops and worker complaints escalate. This actually was not a big disappointment for me, because it has become apparent that technically it is becoming impossible to enforce the policy anyway.

Banners in the 1980s meant policy auditors had to check only a few connection touchpoints for users. Today, every interface between users and systems means a myriad of screen sizes and shapes, as well as a complete lack of user-prompt or interface options between systems communicating on behalf of users. Even if we could find a way to make banners possible everywhere, this pervasiveness has been shown to lead to policy fatigue, creating an inverse relationship between comprehension and coverage.

The answer to the move away from login banners can perhaps be expressed in two parts. First, some have said a "login" or "welcome" on an authentication screen is seen by an attacker as an invitation and gives an easy defense for unauthorized activity: all activity is welcome. Like a welcome mat outside a locked house or windows without bars, there is fear of giving attackers carte blanche by signaling a lack of intent to defend. Second, some argue, beyond removing misleading signals, that a lack of a clear warning or consequence statement means an attacker cannot be prosecuted. With this in mind, it turns out that no matter how hard you try, there will always be some place where you simply cannot place a conspicuous warning (e.g., APIs) or cannot fit a full-sized one (e.g., mobiles, wearables). This places a rather strange and unreasonable burden on operations to be default open unless specifically closed, with signs everywhere explaining what closed means. Furthermore, warning that sensor data (logs, video, packets) may be used in prosecution at some point becomes like warning that electricity may be used in prosecution. Monitoring of an environment for anomalies has to be reasonably assumed when inside any modern environment.

A reasonable solution to this situation, although I am obviously in no way a lawyer and I am not offering legal advice, is to look at how courts have handled unauthorized access prosecutions. What should be expected from any environment to prosecute violations? To my untrained eye, both United States v. John, 597 F.3d 263, 271 (5th Cir. 2010) and United States v. Rodriguez, 628 F.3d 1258 (11th Cir. 2010) are examples that show banners are not a requirement to achieve prosecution of unauthorized access, but a notice of policy is required at some point to claim it as official. Should a data scientist sign an acceptable use policy before being granted access, and is that sufficient for federated data or APIs? Can new automation systems integrate with big data tools so policies are made specific or relevant to the type of research being done on the largest repositories of data?
Interesting times lie ahead.

Bottom Line on Policies

Key points:

  • Establish: Set the tone of security for the scope of data, to inform all personnel what is expected of them.

  • Review: At least annually, update the policy to reflect changes in the environment.

  • Implement: Policy has to be published and should be conspicuous for anyone accessing systems.

Conclusion

The point of this book, culminating with these six elements, is to lay before you a simple and efficient structure with which to begin to evaluate any big data environment for security. The elements may seem overlapping because they have been simplified, but even more importantly, they are highly interrelated and work best when used all together. Hopefully the reasons for security changing and adapting to new technology have been made clear enough that you now are excited to jump into the details of each. We are moving in the right direction, yet we need more people demanding and creating better solutions, as well as helping write, test, or review the solutions being developed. I obviously do not claim to have all the answers here. Instead, I am trying to provide as much information as I can to increase collaboration and find community-based solutions. The earlier we all work together on this, the more likely we can trust big data environments to carry the workloads essential to improving knowledge.

... new term and see if it sticks. It didn't, but it helped find answers in how to get security into the definition. In the next chapter I will explain why there is a certain gravity to big data when...

... security solutions fit an absolute definition, or must they be relative, also?" Perhaps it is more like asking whether a watch can work in different time zones versus whether it can work with...

... early facial recognition systems, just like wearing flip-flops was known to throw off gait analysis. It's a huge topic that lends itself to integrity more than confidentiality at this point. My...
