and unordered, such as types of galaxies (spiral, elliptical, etc.). They are often nonnumeric. This book is centered on data organized in tables: each row corresponds to an object, and different columns correspond to various data values. These values are mostly real-valued measurements with uncertainties, though often ordinal and nominal data will be present too.

2.1.2 Data Management Systems

Relational databases (or Relational Database Management Systems, RDBMSs) represent a technologically and commercially mature way to store and manage tabular data. RDBMSs are systems designed to serve SQL queries quickly; such queries typically have the form of concatenated unidimensional constraints. We will cover basic ideas for making such computations fast in §2.5.1. This is to be distinguished from truly multidimensional queries, such as nearest-neighbor searches, which we will discuss in §2.5.2. In general, relational databases are not appropriate for multidimensional operations.
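To make the notion of "concatenated unidimensional constraints" concrete, the following minimal sketch expresses such a query with Python's built-in sqlite3 module. The table and column names (galaxies, ra, dec, r_mag) are hypothetical placeholders, not the schema of any real survey database.

import sqlite3

# Build a small in-memory table; the schema here is purely illustrative.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE galaxies (id INTEGER, ra REAL, dec REAL, r_mag REAL)")
con.executemany("INSERT INTO galaxies VALUES (?, ?, ?, ?)",
                [(1, 150.1, 2.2, 17.5),
                 (2, 150.4, 2.3, 19.1),
                 (3, 151.0, 2.1, 17.9)])

# A typical SQL query: a conjunction of one-dimensional range constraints,
# each of which can be served quickly by a standard one-dimensional index.
rows = con.execute("SELECT id FROM galaxies"
                   " WHERE ra BETWEEN 150.0 AND 150.5"
                   " AND dec BETWEEN 2.0 AND 2.5"
                   " AND r_mag < 18.0").fetchall()
print(rows)  # [(1,)]

Each WHERE clause above constrains a single column, which is why conventional one-dimensional indexes serve it well; a genuinely multidimensional request, such as "the five objects nearest this point on the sky," cannot be decomposed this way.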
More recently, so-called "NoSQL" data management systems have gained popular interest. The most popular is Hadoop (http://hadoop.apache.org/), an open-source system which was designed to perform text processing for the building of web search engines. Its basic representation of data is in terms of key-value pairs, which is a particularly good match for the sparsity of text data, but general enough to be useful for many types of data. Note especially that this approach is inefficient for tabular or array-based data. As of this writing, the data management aspect of Hadoop distributions is still fairly immature compared to RDBMSs. Hadoop distributions typically also come with a simple parallel computing engine.

Traditional database systems are not well set up for the needs of future large astronomical surveys, which involve the creation of large amounts of array-based data with complex multidimensional relationships. There is ongoing research with the goal of developing efficient database architectures for this type of scientific analysis. One fairly new system with the potential to make a large impact is SciDB (http://www.scidb.org/) [5, 43], a DBMS which is optimized for array-based data such as astronomical images. In fact, the creation of SciDB was inspired by the huge data storage and processing needs of LSST. The astronomical and astrophysical research communities are, at the time of this writing, just beginning to understand how this framework could enable more efficient research in their fields (see, e.g., [28]).

2.2 Analysis of Algorithmic Efficiency

A central mathematical tool for understanding and comparing the efficiency of algorithms is "big O" notation. This is simply a way to discuss the growth of an algorithm's runtime as a function of one or more variables of interest, often focused on N, the number of data points or records. See figure 2.1 for an example of the actual runtimes of two different algorithms which both compute the same thing, in this case a one-dimensional search. One exhibits growth in runtime which is linear in the number of data points, or O(N); the other uses a smarter algorithm whose runtime grows logarithmically in the number of data points, or O(log N).

[Figure 2.1 ("Scaling of Search Algorithms"): The scaling of two methods to search for an item in an ordered list: a linear method which performs a comparison on all N items (O[N]), and a binary search which uses a more sophisticated algorithm (O[log N]). Relative search time is plotted against the length of the array; the theoretical scalings are shown by dashed lines.]

Note that "big O" notation considers only the order of growth; that is, it ignores the constant factors in front of the function of N. It also measures "runtime" in terms of the number of certain designated key operations only, to keep things conveniently mathematically formalizable. Thus it is an abstraction which deliberately does not capture details such as whether the program was written in Python or any other language, or whether the computer was a fast one or a slow one: those factors are captured in the constants that "big O" ignores, for the sake of mathematical analyzability. Generally speaking, when comparing two algorithms with different growth rates, all such constants eventually become unimportant when N becomes large enough. The order of growth of things other than runtime, such as memory usage, can of course also be discussed, and we can consider variables other than N, such as the number of variables or dimensions, D. We have given an informal definition of "big O" notation; a formal but accessible definition can be found in [8], an excellent introductory-level text on algorithms written for computer science PhD students.

Some machine learning methods (MLMs) are more difficult to compute than others. In general, more accurate MLMs are more difficult to compute. The naive or straightforward ways to implement such methods are often O(N²) or even O(N³) in runtime. However, in recent years fast algorithms have been developed which can ...
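The contrast shown in figure 2.1 is easy to reproduce. The sketch below is a minimal illustration, not the code behind the figure: it compares a linear scan against binary search using Python's standard-library bisect module, which requires a sorted array.

import bisect
from time import perf_counter
import numpy as np

def linear_search(arr, target):
    # O(N): compare the target against items one by one
    for i, val in enumerate(arr):
        if val == target:
            return i
    return -1

def binary_search(arr, target):
    # O(log N): repeatedly halve the search interval; arr must be sorted
    i = bisect.bisect_left(arr, target)
    return i if i < len(arr) and arr[i] == target else -1

arr = np.sort(np.random.random(10 ** 6)).tolist()
target = arr[-1]  # worst case for the linear scan

t0 = perf_counter(); i_lin = linear_search(arr, target); t_lin = perf_counter() - t0
t0 = perf_counter(); i_bin = binary_search(arr, target); t_bin = perf_counter() - t0
print(i_lin == i_bin, t_lin / t_bin)  # same answer; speedup typically >> 1000

Both functions return the same index; only the number of comparisons needed to find it differs, which is precisely what "big O" notation measures.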
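As a sketch of how such speedups arise in the multidimensional setting (assuming scipy is available; space-partitioning trees of this kind are the subject of §2.5.2), consider the all-nearest-neighbors problem: computed naively it costs O(N²) time, while a k-d tree reduces it to roughly O(N log N).

import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
X = rng.random((2000, 2))  # N = 2000 points in D = 2 dimensions

# Naive approach: form all N^2 pairwise distances -- O(N^2) time and memory
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
np.fill_diagonal(d2, np.inf)      # exclude each point as its own neighbor
nn_naive = d2.argmin(axis=1)

# Tree-based approach: build in O(N log N), then ~O(log N) per query
tree = cKDTree(X)
_, idx = tree.query(X, k=2)       # column 0 is the point itself
nn_tree = idx[:, 1]

print(np.all(nn_naive == nn_tree))  # True (barring exact distance ties)

The two approaches return the same answer; only the time (and memory) required to obtain it differs, which is exactly the sense in which fast algorithms improve on naive ones.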