Signal processing means decomposing a signal into simpler components. Two categories of signal processing methods are in common usage: Harmonic Analysis and Singular Value Decomposition. They differ on the basis of their interpretability, that is, whether the building blocks of the original signal are known in advance. When a set of (simple) functions may be defined to decompose a signal into simpler components, this is Harmonic Analysis, e.g. Fourier analysis. When this is not possible and instead a set of generic (unknown) data-derived variables must be defined, this is Singular Value Decomposition, e.g. Principal Component Analysis (PCA). PCA is the most frequently used, so let us discuss this method first.
Singular Value Decomposition (e.g. PCA)
An observation (a state) in a multivariable space may be seen as a point in a multidimensional coordinate system where each dimension corresponds to one variable.
The values of each variable for a given observation are thereby the coordinates of this point along each of these dimensions. Linear algebra is the science that studies the properties of coordinate systems by leveraging the convenient matrix notation, where a set of coordinates for a given point is called a vector and the multivariable space a vector space. A vector is written (x1, x2, …, xn) and contains as many entries as there are variables (i.e. dimensions) considered in the multivariable space.
A fundamental concept in linear algebra is that of coordinate mapping (also called isomorphism), i.e. the one-to-one linear transformation from one vector space onto another that permits expressing a point in a different coordinate system without changing its geometric properties. To illustrate how useful coordinate mapping is in business analytics, consider a simple 2D scatter plot where one dimension (x-axis) is the customer income level and the second dimension (y-axis) is the education level. Since these two variables are correlated, there exists a direction of maximum variance (which in this example is the direction at about 45° between the x- and y-axes because income and education levels are highly correlated).
Therefore, rotating the coordinate system by something close to 45° will align the x-axis with the direction of maximum variance and the y-axis with the orthogonal direction of minimum variance. Doing so defines two new variables (i.e. dimensions), let us call them typical buyer profile (the higher the income, the higher the education) and atypical buyer profile (the higher the income, the lower the education). In this example the first new variable will deliver all the information needed in this 2D space while the second new variable shall be eliminated because its variance is much, much smaller. If your client is BMW, customer segments that might buy the new model are likely highly educated and financially comfortable, or poor and poorly educated, or in-between. But highly educated yet financially poor customers are rare and unlikely to buy it, and even though poorly educated yet rich customers are certainly interested in the BMW brand, this market is even smaller. So the second variable may be eliminated because it brings no new information except for
outsiders. By eliminating the second variable, we effectively reduced the number of dimensions and thus simplified the prediction problem.¹
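To make the rotation concrete, here is a minimal sketch in Python (numpy). The income and education data are synthetic and hypothetical, and the correlation strength is an assumption chosen purely for illustration.

```python
import numpy as np

# Hypothetical, synthetic data: income and education levels, strongly correlated.
rng = np.random.default_rng(0)
income = rng.normal(50, 10, size=500)                   # arbitrary units
education = 0.9 * income + rng.normal(0, 3, size=500)   # correlated with income
X = np.column_stack([income, education])

# Rotate the coordinate system onto the directions of maximum / minimum variance,
# i.e. the eigenvectors of the covariance matrix.
Xc = X - X.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
order = np.argsort(eigvals)[::-1]                       # decreasing variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

rotated = Xc @ eigvecs   # new variables: "typical" and "atypical" buyer profiles
print("share of variance per new variable:", eigvals / eigvals.sum())
# The first new variable carries nearly all the variance, so the second
# can be dropped, reducing the problem from two dimensions to one.
```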
More generally, a set of observations in a multivariable space can always² be expressed in an alternative set of coordinates (i.e. variables) by the process of Singular Value Decomposition. A common application of SVD is the Eigen-decomposition³ which, as in the example above, seeks the coordinate system along orthogonal (i.e. independent, uncorrelated) directions of maximum variance [177].
The new directions are referred to as eigenvectors and the magnitude of scaling along each eigenvector is referred to as its eigenvalue. In other words, eigenvalues indicate the amount of dilation of the original observations along each independent direction. Formally, each eigenvector and its eigenvalue satisfy
$$A v = \lambda v \qquad (7.1)$$
where A is a square n × n covariance matrix (i.e. the set of all covariances between n variables as obtained from Eq. 6.1), v is an unknown vector of dimension n, and λ is an unknown scalar. This equation, when satisfied, indicates that the transformation obtained by multiplying the vector v by A is equivalent to a simple scaling of v by a factor equal to λ, and that there exist n pairs of (v, λ).
This is useful because in this n-dimensional space, the matrix A may contain non-zero values in many of its entries and thereby imply a complex transformation, which Eq. 7.1 just reduced to a set of n simple scalings along fixed directions. These n vectors v are the characteristic vectors of the matrix A and are thus referred to as its eigenvectors.
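As a quick numerical check of Eq. 7.1, the sketch below (Python/numpy, using an arbitrary hypothetical covariance matrix) verifies that each computed pair (v, λ) satisfies Av = λv.

```python
import numpy as np

# Arbitrary, hypothetical 3 x 3 covariance matrix (symmetric).
A = np.array([[4.0, 2.0, 0.5],
              [2.0, 3.0, 1.0],
              [0.5, 1.0, 2.0]])

eigvals, eigvecs = np.linalg.eigh(A)     # the n pairs (v, lambda)

# Each pair satisfies Eq. 7.1: multiplying by A amounts to a simple scaling by lambda.
for lam, v in zip(eigvals, eigvecs.T):
    assert np.allclose(A @ v, lam * v)
```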
Once all n pairs of (v, λ) have been computed,⁴ the highest eigenvalues indicate the most important eigenvectors (directions with highest variance), hence a quick look at the spectrum of all eigenvalues plotted in decreasing order of magnitude enables the data scientist to easily select a subset of directions (i.e. new variables) that have the most impact in the dataset. Often the eigen-spectrum contains abrupt decays; these decays represent clear boundaries between more informative and less informative sets of variables. Leveraging the eigen-decomposition to create new variables and filter out less important variables reduces the number of variables and thus, once again, simplifies the prediction problem.
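The following sketch (Python/numpy) illustrates how the eigen-spectrum suggests how many new variables to keep. The dataset is hypothetical, built from three underlying factors; the factor count, mixing and noise level are assumptions made for the example.

```python
import numpy as np

# Hypothetical dataset: 1000 observations of 20 variables driven by 3 latent factors.
rng = np.random.default_rng(1)
latent = rng.normal(size=(1000, 3))
X = latent @ rng.normal(size=(3, 20)) + 0.1 * rng.normal(size=(1000, 20))

# Eigen-decomposition of the covariance matrix, eigenvalues in decreasing order.
Xc = X - X.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]

# The spectrum decays abruptly after the 3rd eigenvalue: 3 new variables suffice.
print("eigen-spectrum (share of variance):", np.round(eigvals / eigvals.sum(), 3))

k = 3
X_reduced = Xc @ eigvecs[:, :k]          # project onto the top-k eigenvectors
```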
1 Note that in this example the two variables are so correlated that one could have ignored one of them from the beginning and thereby bypassed the process of coordinate mapping altogether.
Coordinate mapping becomes useful when the trend is not just a 50/50 contribution of two variables (which corresponds to a 45° correlation line in the scatter plot) but some more subtle relationship where maximum variance lies along an asymmetrically weighted combination of the two variables.
2 This assertion is only true under certain conditions, but for most real-world applications where observations are made across a finite set of variables in a population, these conditions are fulfilled.
3 The word Eigen comes from the German for characteristic.
4 The equation used to find eigenvectors and eigenvalues for a given matrix when they exist is det(A − λI) = 0. Not surprisingly, it is referred to as the matrix’s characteristic equation.
The eigenvector-eigenvalue decomposition is commonly referred to as PCA (Principal Component Analysis [177]) and is available in most analytics software packages. PCA is widely used in signal processing, filtering, and noise reduction.
The major drawback of PCA concerns the interpretability of the results. The reason why I could name the new variables typical and atypical in the example above is that we expect income and education levels to be highly correlated. But in most projects, PCA is used to simplify a complex signal and the resulting eigenvectors (new variables) have no natural interpretation. By eliminating variables the overall complexity is reduced, but each new variable is now a composite variable born out of mixing the originals together. This does not pose any problem when the goal is to reconstruct a compound signal such as a speech recorded in a noisy conference room, because the different frequency waves of the original signal taken in isolation had no meaning to the audience in the first place. Only the original and reconstructed signals, taken as ensembles of frequencies, have meaning to the audience. But when the original components do have meanings (e.g. income levels, education levels), then the alternative dimensions defined by the PCA might lose interpretability, and at the very least demand new definitions before they may be interpreted.
Nevertheless, PCA analyses remain powerful in data science because they often entail a predictive modeling aspect which is akin to speech recognition in the noisy conference room: what matters is an efficient prediction of the overall response variable (the speech) rather than interpreting how a response variable relates to the original components.
Harmonic Analysis (e.g. FFT)
The SVD signal processing method (e.g. PCA) relies on a coordinate mapping defined in vector space. For this process to take place, a set of data-derived vectors (eigenvectors) and data-derived scaling magnitudes (eigenvalues) need to be stored in the memory of the computer. This approach is truly generic in the sense that it may be applied in all types of circumstances, but it becomes prohibitively computationally expensive when working with very large datasets. A second common class of signal processing methods, Harmonic Analysis [178], has a smaller scope but is ultra-fast in comparison to PCA. Harmonic analysis (e.g. Fourier analysis) uses a set of predefined functions that, when superposed all together, accurately re-construct or approximate the original signal. This technique works best when some localized features such as periodic signals can be detected at a macroscopic level⁵ (this condition is detailed in the footnote).
5 Quantum theory teaches us that everything in the universe is periodic! But describing the dynamics of any system except small molecules at a quantum level would require several years of computations even on last-generation supercomputers. And this is assuming we would know how to decompose the signal into a nearly exhaustive set of factors, which we generally don't. Hence a harmonic analysis in practice requires periodic features to be detected at a scale directly relevant to the analysis in question; this defines macroscopic in all circumstances. For example, a survey of customer behaviors may apply Fourier analysis if a periodic feature is detected in a behavior or any factor believed to influence a behavior.
In Harmonic analysis, an observation (a state) in a multivariable space is seen as the superposition of base functions called harmonic waves or frequencies. For example, the commonly used Fourier analysis [178] represents a signal by a sum of n trigonometric functions (sines and cosines), where n is the number of data points in the population. Each harmonic is defined by a frequency index k and magnitudes $a_k$ and $b_k$:
$$f(x) = a_0 + \sum_{k=1}^{n} \bigl( a_k \cos(\pi k c_0 x) + b_k \sin(\pi k c_0 x) \bigr) \qquad (7.2)$$
The coefficients of the harmonic components $(a_k, b_k)$ can easily be stored, which significantly reduces the total amount of storage/computational power required to encode a signal compared to PCA, where every component is coded by a pair of eigenvector and eigenvalue. Moreover, the signal components (i.e. harmonic waves) are easy to interpret, being homologous to the familiar notion of frequencies that compose a piece of music (this is literally what they are when processing audio signals).
Several families of functions that map the original signal into the frequency domain, referred to as transforms, have been developed to fit different types of applications. The most common are the Fourier Transform (Eq. 7.2), the FFT (Fast Fourier Transform), the Laplace Transform and the Wavelet Transform [179].
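As an illustration, the sketch below (Python/numpy) applies the FFT to a hypothetical signal made of two periodic components plus noise; the frequencies and amplitudes are assumptions chosen for the example. The handful of dominant coefficients is all that needs to be stored to describe the signal.

```python
import numpy as np

# Hypothetical signal: two periodic components plus a little noise.
n = 1024
x = np.arange(n)
signal = 2.0 * np.sin(2 * np.pi * 5 * x / n) + 0.5 * np.cos(2 * np.pi * 20 * x / n)
signal += 0.1 * np.random.default_rng(2).normal(size=n)

# FFT: the complex coefficients play the role of the (a_k, b_k) pairs of Eq. 7.2.
coeffs = np.fft.rfft(signal)
dominant = np.argsort(np.abs(coeffs))[::-1][:2]
print("dominant frequency bins:", sorted(dominant))   # the two periodic components
```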
The main drawback of Harmonic Analysis compared to PCA is that the components are not directly derived from the data. Instead, they rely on a predefined model, namely the chosen transform formula, and thus may only reasonably re-construct or approximate the original signal in the presence of macroscopically detectable periodic features (see footnote 5; such signals are referred to as smooth signals).
Re-constructing an original signal by summing up all its individual components is referred to as a synthesis [178], as opposed to an analysis (a.k.a. deconstruction of the signal). Note that Eq. 7.2 is a synthesis equation because of the summation in front, i.e. it is the equation used when reconstructing the signal. Synthesis may be leveraged in the same way as PCA by summing only the high-amplitude frequencies and filtering out the low-amplitude frequencies, which reduces the number of variables and thus simplifies the prediction problem.
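A minimal filtering-by-synthesis sketch follows (Python/numpy, reusing a hypothetical noisy signal like the one above; the 10% amplitude threshold is an arbitrary choice): only the high-amplitude harmonics are kept before reconstructing the signal with the inverse FFT.

```python
import numpy as np

# Hypothetical noisy signal with two periodic components.
n = 1024
x = np.arange(n)
clean = 2.0 * np.sin(2 * np.pi * 5 * x / n) + 0.5 * np.cos(2 * np.pi * 20 * x / n)
noisy = clean + 0.3 * np.random.default_rng(3).normal(size=n)

# Analysis: decompose, then zero out the low-amplitude frequencies.
coeffs = np.fft.rfft(noisy)
coeffs[np.abs(coeffs) < 0.1 * np.abs(coeffs).max()] = 0.0

# Synthesis: reconstruct the signal from the remaining high-amplitude harmonics.
denoised = np.fft.irfft(coeffs, n)
print("RMS error vs. clean signal:", np.sqrt(np.mean((denoised - clean) ** 2)))
```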
As with PCA, harmonic analysis and in particular the FFT is available in most analytics software packages, and is a widely used technique for signal processing, filtering and noise reduction.