The impact of high dimensionality on SVM when classifying ERP data - A solution from LDA

Nguyen Duy Du, Hanoi University of Science and Technology, Dai Co Viet Road, Hanoi, Vietnam, +8438544338 ext 3236, dund.hust@gmail.com
Nguyen Hoang Huy, Vietnam National University of Agriculture, Trauquy, Gialam, Hanoi, Vietnam, +84912328010, nhhuy@vnua.edu.vn
Nguyen Xuan Hoai, Hanoi University, Nguyen Trai Road, Thanh Xuan, Hanoi, Vietnam, +841669594068, nxhoai@hanu.edu.vn

ABSTRACT

Brain-computer interfaces (BCI) based on P300 event-related potentials (ERP) allow users to select characters from a visually presented character matrix and thereby provide a communication channel for users with neurodegenerative disease. A central problem in such BCI systems is to determine whether or not a P300 was actually produced in response to a stimulus. The design of this classification step involves choosing one or several classification algorithms from many alternatives. Support Vector Machines (SVM) and Linear Discriminant Analysis (LDA) have achieved acceptable results in numerous P300 BCI applications. However, both suffer from the problem of high dimensionality, which degrades their performance. In this paper, we introduce a novel approach that combines LDA and SVM to reduce the negative effect of high-dimensional data on both classifiers, and we investigate its performance. The results show that the new approach achieves similar or slightly better performance than state-of-the-art methods.

CCS Concepts
• Theory of computation ➝ Models of learning

Keywords
Support Vector Machines; Linear Discriminant Analysis; Regularized Linear Discriminant Analysis; Brain Computer Interface

1. INTRODUCTION

Brain-Computer Interface (BCI) is a human-computer interface that allows people to work directly with a computer using their brain signals [1]. By establishing a non-muscular channel for sending a person's intentions to external devices, BCI has great potential to help millions of individuals who suffer from severe physical disabilities. By satisfying their basic needs, such as communication and the ability to interact with the environment, BCI systems improve the quality of life of disabled people: they allow more independence and, at the same time, reduce the social cost. Apart from this most popular application for disabled people, BCI has also found many applications in entertainment, marketing, and other areas.

The most common signal used among the various BCI systems is EEG, because it is non-invasive and comparatively easy to set up. Different types of brain activity are reflected in EEG signals and have been used in BCI, such as ERP [2] and ERD/ERS [3]. The P300 is a positive component of the ERP which occurs over the parietal cortex approximately 300 ms after a rare stimulus is presented among a series of frequent stimuli (i.e., the oddball paradigm). In the past few years, P300 BCIs have clearly emerged as one of the main BCI categories. Among the applications of P300 BCI, the P300 Speller was first demonstrated for character spelling by Farwell and Donchin [2]. Today, visual P300 Spellers have become the most commonly used systems for BCI-based communication. The conventional P300 Speller is typically composed of a paradigm that presents characters flashing randomly on a computer screen, and a P300 classifier that is responsible for recognizing the target character.

In the field of BCI, while performing a pattern recognition task, classifiers may face several problems related to the properties of the features, such as noise and outliers, high dimensionality, time information, and small training sets [4]. Several classification techniques have demonstrated high performance for P300 classification, including SVM [5] and LDA [6]. However, these methods have not reached their optimal performance because of the impact of the high dimensionality of the training data.
LDA has been widely applied in numerous classification problems. It is simple to use and requires very low computational cost, but for a learning or classification task with very high-dimensional data the traditional LDA algorithm encounters several problems. First, it is difficult to handle computations on big matrices (such as computing eigenvalues). Second, those big matrices are almost always singular. To resolve these difficulties, regularized linear discriminant analysis (regularized LDA) was proposed in [7] and shown to be effective in reducing the negative impact of high dimensionality on LDA [8]. The idea is to add a constant to the diagonal elements of the total scatter matrix, where the amount added is controlled by the so-called regularization parameter. Regularization stabilizes the sample covariance matrix estimate and thereby improves the classification performance of LDA.

While Support Vector Machines (SVM) are supposed to be appropriate classifiers for feature vectors of high (even infinite) dimensionality, it has been shown in theory and practice that high-dimensional data harms the learning ability of SVM. As noted by Marron et al. [9], data piling is common in High-Dimensional, Low Sample Size (HDLSS) data settings, where the number of features p is greater than the sample size n; due to this property, SVM may suffer from a loss of generalizability.

When looking for EEG classification patterns, one commonly relies on the assumption of a normal distribution. In our preliminary experiments investigating the effect of high dimensionality on SVM (see Section 2), we used datasets randomly generated according to normal models. The results indicated that the performance of SVM tends to decrease when the dimensionality is increased and/or the Mahalanobis distance between the classes is reduced. Therefore, in this paper we propose a solution to the problem of training SVM on high-dimensional BCI data: LDA is applied as a preprocessing step, which keeps the Mahalanobis distance unchanged while reducing the dimensionality of the feature vectors; SVM is then trained on the resulting scores.
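The preliminary experiment mentioned above can be reproduced along the following lines. This is a minimal sketch, assuming Python with NumPy and scikit-learn (the original study was implemented in Matlab); the class means, sample sizes, and AUC scoring are illustrative choices, not values taken from the paper.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

def simulate_auc(p, d, n_train=50, n_test=200):
    """Linear SVM on two Gaussian classes N(+/-mu, I_p) whose
    Mahalanobis distance equals d (illustrative setup)."""
    mu = (d / (2 * np.sqrt(p))) * np.ones(p)   # ||2*mu|| = d since Sigma = I_p

    def sample(n_per_class):
        X = np.vstack([rng.normal(loc=+mu, size=(n_per_class, p)),
                       rng.normal(loc=-mu, size=(n_per_class, p))])
        y = np.concatenate([np.ones(n_per_class), -np.ones(n_per_class)])
        return X, y

    X_tr, y_tr = sample(n_train)
    X_te, y_te = sample(n_test // 2)
    clf = SVC(kernel="linear").fit(X_tr, y_tr)
    return roc_auc_score(y_te, clf.decision_function(X_te))

# performance drops as p grows and as the class separation d shrinks
for p in [4, 64, 1024]:
    for d in [0.5, 2.0, 8.0]:
        print(p, d, round(simulate_auc(p, d), 3))
```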
The rest of the paper is structured as follows. In Section 2, we introduce the general framework of a P300 speller system, define the P300 classification problem, and recall the concepts of SVM, LDA, and regularized LDA; an illustrative simulation of the effect of high dimensionality and of the Mahalanobis distance on the classification performance of SVM is detailed, and our proposed method combining LDA and SVM is described. Section 3 describes the experiments that compare the performance of the proposed method with SVM, LDA, and regularized LDA. The results of the experiments and a discussion are presented in Section 4. Finally, in Section 5, we conclude the paper with comments and perspectives on future work.

2. BACKGROUND

2.1 The Framework of P300 Speller Systems

The design of a P300 detection system requires several stages: signal acquisition, preprocessing, feature extraction, and classification, as depicted in Figure 1 [1].

Figure 1. General framework of a P300 Speller system.

The first stage is signal acquisition, in which the brain signals are recorded by the electrodes of an EEG device. The original EEG signal collected from the subject typically contains a lot of noise and artifacts caused by muscle and eye movements. Therefore, before entering the feature extraction phase, it is necessary to perform several preprocessing operations to eliminate or minimize the negative impact of these factors. Bandpass filtering is the most commonly used preprocessing method in P300 Speller systems; it is common to filter the raw EEG signals with a digital bandpass filter with edge frequencies of 0.1 and 30 Hz.

In the feature extraction stage, discriminative information is extracted from the filtered signals. Since ERP components are characterized by their temporal evolution and the corresponding spatial potential distributions [10], features are often extracted from several time segments and channels and then concatenated into a single spatio-temporal feature vector. Based on the feature vectors, the classification stage assigns the signals to the categories of interest (characters). Finally, the predicted character is displayed on the screen.

During the data collection process, a screen with characters flashing in random order is presented to the subject in order to elicit the P300 potential. Users are instructed to concentrate on a certain character and to mentally count the number of flashes that occur for it. In response to counting this oddball stimulus, the selected character elicits a P300 wave, while the others do not. Detection of a P300 response therefore corresponds to a binary classification (present/absent).
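As an illustration of the preprocessing and feature-extraction steps of Section 2.1, the following is a minimal sketch assuming Python with NumPy and SciPy (the original system was implemented in Matlab); the sampling rate, filter order, epoch length, and naive decimation are illustrative assumptions rather than the authors' exact processing chain.

```python
import numpy as np
from scipy.signal import butter, filtfilt

fs = 2048                                           # acquisition rate in Hz (assumed)
b, a = butter(4, [0.1, 30.0], btype="bandpass", fs=fs)

def extract_features(eeg, fs=fs, fs_out=64, t_max=0.6):
    """eeg: array of shape (n_channels, n_samples) for one stimulus epoch.
    Returns a spatio-temporal feature vector x(C;T) of length q1*q2."""
    filtered = filtfilt(b, a, eeg, axis=1)           # 0.1-30 Hz bandpass
    step = int(fs // fs_out)                         # naive downsampling to ~64 Hz
    n_points = int(t_max * fs_out)                   # q2 time points within 0..600 ms
    segment = filtered[:, : n_points * step : step]  # shape (q1 channels, q2 points)
    # concatenate the spatial vectors over time: [x(C;t1); x(C;t2); ...; x(C;t_q2)]
    return segment.T.reshape(-1)

# usage: x = extract_features(np.random.randn(32, int(0.6 * fs)))  # len(x) == 1216
```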
2.2 The P300 Classification Problem

We denote the EEG signal after preprocessing at channel $c \in C = \{c_1, c_2, \ldots, c_{q_1}\}$ and time point $t \in T = \{t_1, t_2, \ldots, t_{q_2}\}$ within single trial $i$ by $x_i(c, t)$ (the subscript $i$ is sometimes omitted). We define $x(C; t) = [x(c_1; t), \ldots, x(c_{q_1}; t)]^T$ as the spatial feature vector of the EEG signals over the set $C$ of channels at time point $t$. By concatenating these vectors for all time points $T = \{t_1, \ldots, t_{q_2}\}$ of one trial, one obtains the spatio-temporal feature vector $x(C; T)$, or briefly $x \in \mathbb{R}^p$ with $p = q_1 q_2$, used for classification, where $q_1$ and $q_2$ are the number of channels and the number of sampled time points, respectively:

$$x(C;T) = [x(C; t_1)^T, x(C; t_2)^T, \ldots, x(C; t_{q_2})^T]^T.$$

P300 detection can then be formulated as a binary classification problem: assign $x_i$ to one of two classes, with $y_i = +1$ if $x_i$ contains a P300 response and $y_i = -1$ if $x_i$ does not contain a P300 response.

2.3 Support Vector Machine

The Support Vector Machine (SVM) is a powerful classification scheme originally proposed by V. Vapnik [11]. SVM has been successfully applied to classification problems in general and to BCI applications in particular. The general setting of a two-class classification problem provides p-dimensional training data vectors, and the task of a classifier is to find a discriminative rule that assigns the label $+1$ or $-1$ to new data vectors. In distance-based classifiers, this label assignment relies on the similarity of the new vector to the training vectors of class $+1$ or class $-1$. Geometrically, the similarity can be measured via a separating hyperplane, described by a weight vector $w$ and a bias term $b$, based on a training set of $n$ examples with data vectors $x_i$ and corresponding class labels $y_i$:

$$(x_1, y_1), \ldots, (x_n, y_n) \in \mathbb{R}^p \times \{-1, +1\}.$$

In the testing stage, a new data vector $x$ is labelled by projecting $x$ onto the weight vector $w$ and determining on which side of the hyperplane $x$ lies:

$$f(x) = w^T x + b.$$

The key idea behind (linear) SVM is to find $w$ and $b$ such that the hyperplane separates the training data into two halves, with all data of the same class on the same side. Moreover, the separation should be maximal in the sense that the data are as far as possible from the separating hyperplane. The problem of finding such a maximal separating hyperplane can be formulated as a (convex) optimization problem, which only uses the data points closest to the separating hyperplane, called the support vectors [11].

SVM has been applied successfully to classification problems of small dimensionality, and in theory it should also excel in the high-dimensional (even infinite-dimensional) setting. However, it has been found in theory and practice that in the HDLSS setting a so-called "data piling" phenomenon is prominent for SVM. Data piling is described in [9] as the phenomenon that, after projection onto the direction vector $w$ given by a linear classifier, a large portion of the data vectors pile upon each other and concentrate at two points. This phenomenon is harmful for any discriminative classifier such as SVM, as it reflects severe overfitting in the HDLSS data setting. Data piling indicates that the direction is driven by artifacts, that is, by very particular aspects of the realization of the training data at hand. Consequently, both the direction and the classification performance may be stochastically volatile [12].

As a demonstration, we modified Marron's simulation in [9], using the Mahalanobis distance as the amount of separation. Training sets of size $n_{+1} = n_{-1} = 50$ and test sets of size 200, of dimension $p = 4, 8, 16, 32, 64, 128, 256, 512, 1024$, were generated from two multivariate normal distributions with common covariance matrix $\Sigma = I_p$ and with different means proportional to $(1, \ldots, 1)^T$, scaled so that the Mahalanobis distance between the two classes equals $d$. The dimensions are intended to cover the range from non-HDLSS to extreme HDLSS settings. The values of the Mahalanobis distance were selected between 0.1 and 20, and each experiment was replicated 100 times. The results indicate a noticeable decline in performance as the dimensionality increases and the Mahalanobis distance decreases; see Figure 2.

Figure 2. Illustrating the performance of SVM in HDLSS situations.
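To make the data-piling phenomenon concrete, the short sketch below trains an (approximately hard-margin) linear SVM on HDLSS Gaussian data and projects both training and test points onto the learned direction $w$; the collapse of the training projections onto two values, one per class, is the effect described by Marron et al. [9]. This is an illustrative Python/scikit-learn sketch, not the authors' Matlab code; the dimension, sample sizes, and the large value of C are assumptions.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
p, n, d = 1024, 50, 2.0                       # HDLSS: p >> n
mu = (d / (2 * np.sqrt(p))) * np.ones(p)

def sample(n_per_class):
    X = np.vstack([rng.normal(+mu, 1.0, (n_per_class, p)),
                   rng.normal(-mu, 1.0, (n_per_class, p))])
    y = np.r_[np.ones(n_per_class), -np.ones(n_per_class)]
    return X, y

X_tr, y_tr = sample(n)
X_te, y_te = sample(100)

clf = SVC(kernel="linear", C=1e6).fit(X_tr, y_tr)   # large C ~ hard margin
w = clf.coef_.ravel()
w /= np.linalg.norm(w)

proj_tr, proj_te = X_tr @ w, X_te @ w
# Data piling: training projections pile up near the two margin values
# (tiny within-class spread), while test projections spread out,
# showing that the direction overfits the particular training sample.
print("train spread per class:", proj_tr[y_tr == 1].std(), proj_tr[y_tr == -1].std())
print("test  spread per class:", proj_te[y_te == 1].std(), proj_te[y_te == -1].std())
```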
2.4 Linear Discriminant Analysis

Linear discriminant analysis (LDA) is still one of the most widely used techniques for data classification [13]. If the data to be classified actually come from two normal distributions with common covariance matrix $\Sigma$ and different means $\mu_{+1}$ and $\mu_{-1}$, the LDA classifier achieves the minimum classification error rate [13]. The LDA score, or discriminant function, of an observation $X$ is given by

$$\delta(X) = (X - \mu)^T \Sigma^{-1} \mu_\Delta,$$

where $\mu_\Delta = \mu_{+1} - \mu_{-1}$ and $\mu = (\mu_{+1} + \mu_{-1})/2$. The object $X$ is assigned to class $+1$ if $\delta(X) > \log(\pi_{-1}/\pi_{+1})$, and otherwise to class $-1$, where $\pi_k$ is the prior probability of being in class $k \in \{+1, -1\}$.

The error rate of LDA's discriminant function (for equal priors) is

$$e = \bar{\Phi}(d_X / 2),$$

where $d_X$ is the Mahalanobis distance between the two classes, given by

$$d_X = \big[\mu_\Delta^T \Sigma^{-1} \mu_\Delta\big]^{1/2},$$

and $\bar{\Phi}(t) = 1 - \Phi(t)$ is the tail probability of the standard Gaussian distribution. We prove the following theorem, which is the foundation of the approach proposed in the next section.

Theorem. The Mahalanobis distance between the two classes is invariant under the transform $\delta$.

Proof. If $X \sim N(\mu_{+1}, \Sigma)$ then

$$\mathrm{var}(\delta(X)) = E\big[\delta(X) - E\delta(X)\big]^2 = E\big[(X - \mu_{+1})^T \Sigma^{-1} \mu_\Delta\big]^2 = \mu_\Delta^T \Sigma^{-1} E\big[(X - \mu_{+1})(X - \mu_{+1})^T\big] \Sigma^{-1} \mu_\Delta = \mu_\Delta^T \Sigma^{-1} \mu_\Delta,$$

$$E(\delta(X)) = (\mu_{+1} - \mu)^T \Sigma^{-1} \mu_\Delta.$$

Similarly, if $X \sim N(\mu_{-1}, \Sigma)$ then

$$\mathrm{var}(\delta(X)) = \mu_\Delta^T \Sigma^{-1} \mu_\Delta, \qquad E(\delta(X)) = (\mu_{-1} - \mu)^T \Sigma^{-1} \mu_\Delta,$$

which implies that the Mahalanobis distance between the two classes of scores is

$$d_{\delta(X)} = \frac{(\mu_{+1} - \mu_{-1})^T \Sigma^{-1} \mu_\Delta}{\big[\mu_\Delta^T \Sigma^{-1} \mu_\Delta\big]^{1/2}} = \big[\mu_\Delta^T \Sigma^{-1} \mu_\Delta\big]^{1/2} = d_X.$$

The proof is complete.

In practice, we do not know the parameters of the Gaussian distributions and have to estimate them from the training data:

$$\hat{\pi}_k = n_k / n, \qquad \hat{\mu}_k = n_k^{-1} \sum_{y_i = k} X_i, \qquad \hat{\Sigma} = n^{-1} \sum_{k \in \{+1,-1\}} \sum_{y_i = k} (X_i - \hat{\mu}_k)(X_i - \hat{\mu}_k)^T,$$

where $k \in \{+1, -1\}$ and $n_k = |\{i : y_i = k,\ i = 1, \ldots, n\}|$ is the number of class-$k$ observations. The sample version of LDA works well when the number of observations is much greater than the dimensionality of each observation, i.e. $n \gg p$. Besides being easy to apply, it also has nice properties, such as robustness against deviations from the model assumptions. However, it turns out to be very difficult to use this method in HDLSS settings, where $p \gg n$ is always the case. When the sample size $n$ of the training data is smaller than the number of features $p$, the sample covariance matrix estimate is singular and cannot be inverted. The pseudo-inverse can be used, but this impairs the classification. Even when $n$ is larger, but of the same order of magnitude as $p$, the aggregated estimation error over the many entries of the sample covariance matrix significantly increases the error rate of sample LDA. In fact, the performance of sample LDA in HDLSS situations is far from optimal.

2.5 Regularized Linear Discriminant Analysis

One possible solution to the negative impact of high dimensionality on LDA is regularized linear discriminant analysis, proposed by Friedman in [7]. It is simple to implement, computationally inexpensive, and gives impressive results for high-dimensional data, for example brain-computer interface data [8]. Regularized LDA replaces the empirical covariance matrix $\hat{\Sigma}$ in the sample LDA discriminant function by

$$\hat{\Sigma}(\gamma) = (1 - \gamma)\hat{\Sigma} + \gamma \nu I \qquad (1)$$

for a regularization parameter $\gamma \in [0, 1]$, where $\nu$ is defined as the average eigenvalue $\mathrm{tr}(\hat{\Sigma})/p$ of $\hat{\Sigma}$. In this way, regularized LDA overcomes the diverging-spectra problem of the high-dimensional covariance matrix estimate: large eigenvalues of the original covariance matrix are estimated too large, and small eigenvalues are estimated too small.
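The shrinkage estimate of equation (1) and the resulting discriminant function can be written in a few lines. The sketch below is a simplified Python/NumPy illustration under the assumptions of equal class priors and a user-supplied regularization parameter gamma; it is not the authors' implementation.

```python
import numpy as np

def regularized_lda(X, y, gamma):
    """Fit a binary regularized-LDA scorer on X (n, p), y in {-1, +1}.
    Uses Sigma(gamma) = (1 - gamma) * Sigma_hat + gamma * nu * I, cf. eq. (1);
    equal class priors are assumed, so the decision threshold is 0."""
    n, p = X.shape
    mu_pos, mu_neg = X[y == 1].mean(axis=0), X[y == -1].mean(axis=0)
    Xc = X.astype(float).copy()
    Xc[y == 1] -= mu_pos                        # center each class at its mean
    Xc[y == -1] -= mu_neg
    sigma_hat = (Xc.T @ Xc) / n                 # pooled covariance estimate
    nu = np.trace(sigma_hat) / p                # average eigenvalue
    sigma_gamma = (1 - gamma) * sigma_hat + gamma * nu * np.eye(p)
    w = np.linalg.solve(sigma_gamma, mu_pos - mu_neg)
    mu_bar = (mu_pos + mu_neg) / 2
    return lambda X_new: (X_new - mu_bar) @ w   # LDA score delta(X)

# usage: score = regularized_lda(X_train, y_train, gamma=0.1); s = score(X_test)
```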
Using linear discriminant analysis with such a modified covariance matrix, we still have to handle the problem of tuning the regularization parameter. Recently, an analytic method to calculate the optimal regularization parameter for certain directions of regularization was found by Schäfer and Strimmer [14]. For regularization towards the identity, as defined by equation (1), the optimal parameter can be calculated as

$$\gamma^* = \frac{n}{(n-1)^2} \cdot \frac{\sum_{j_1, j_2 = 1}^{p} \mathrm{var}\big(z_{j_1 j_2}(i)\big)}{\sum_{j_1 \neq j_2} \hat{\sigma}_{j_1 j_2}^2 + \sum_{j_1} \big(\hat{\sigma}_{j_1 j_1} - \nu\big)^2}, \qquad (2)$$

where $x_{ij}$ and $\hat{\mu}_j$ are the $j$-th elements of the feature vector $x_i$ (the realization of observation $X_i$) and of the common mean $\hat{\mu}$, respectively, $\hat{\sigma}_{j_1 j_2}$ is the element in the $j_1$-th row and $j_2$-th column of $\hat{\Sigma}$, and

$$z_{j_1 j_2}(i) = (x_{i j_1} - \hat{\mu}_{j_1})(x_{i j_2} - \hat{\mu}_{j_2}).$$

We can also determine the regularization parameter by performing n-fold cross-validation on the training data: we define a grid on $[0, 1]$, estimate the area under the curve (AUC) for each grid point by n-fold cross-validation, and choose the value that gives the maximum; see Frenzel et al. [15].

2.6 The Proposed Approach

As can be seen in the previous section (and in Figure 2), if we could reduce the dimensionality of the feature vectors while maintaining the Mahalanobis distance, the performance would improve. In Section 2.4 we proved that the Mahalanobis distance between the two classes is unaltered by the LDA transformation. Therefore, for high-dimensional data classification problems such as the P300 Speller, we propose a combined approach in which LDA is applied as a preprocessing step before training SVM (or other distance-weighted classifiers). In this way, one reduces the dimensionality while maintaining the Mahalanobis distance of the data.

All $p$ features of an observation are divided into disjoint subgroups,

$$X = [X_1^T, \ldots, X_{q_2}^T]^T, \qquad X_j \in \mathbb{R}^{q_1}, \quad q_1 q_2 = p.$$

First, LDA is applied to obtain a score for each subgroup of features. In this step, the impact of high dimensionality on LDA is negligible if the number of features $q_1$ in each subgroup is chosen to be small in comparison with the sample size $n$ of the training data [16]. In the second step, SVM is applied to these scores, which gives the overall score used for classification. Thus, the discriminant function of the combined method is

$$f^*(X) = f\big(\delta(X_1), \delta(X_2), \ldots, \delta(X_{q_2})\big),$$

where $\delta$ and $f$ denote the LDA and SVM discriminant functions, respectively. Figure 3 illustrates the new approach.

Figure 3. Schematic illustration of the combined method.

Note that the proposed scheme is different from a simple combination of principal component analysis (PCA) and SVM. Here, LDA extracts the maximum discriminative information from each feature group in the first step, whereas PCA would only select the maximally variant eigendirections corresponding to linear combinations of the features in each group.
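As an illustration of the combined discriminant function $f^*(X) = f(\delta(X_1), \ldots, \delta(X_{q_2}))$, the following sketch fits one LDA scorer per feature subgroup (one subgroup per time point, i.e. one spatial vector of $q_1$ channels) and then a linear SVM on the resulting $q_2$-dimensional score vectors. It is a minimal Python/scikit-learn sketch under assumptions (plain LDA per subgroup, default SVM hyperparameters), not the authors' Matlab implementation.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.svm import SVC

class LdaThenSvm:
    """Two-step classifier: one LDA score per feature subgroup, then a linear SVM."""
    def __init__(self, q1, q2):
        self.q1, self.q2 = q1, q2          # features per subgroup, number of subgroups

    def _scores(self, X):
        # X has shape (n, q1*q2); subgroup j occupies columns [j*q1, (j+1)*q1)
        return np.column_stack([
            lda.decision_function(X[:, j * self.q1:(j + 1) * self.q1])
            for j, lda in enumerate(self.ldas_)
        ])

    def fit(self, X, y):
        self.ldas_ = [
            LinearDiscriminantAnalysis().fit(X[:, j * self.q1:(j + 1) * self.q1], y)
            for j in range(self.q2)
        ]
        self.svm_ = SVC(kernel="linear").fit(self._scores(X), y)
        return self

    def decision_function(self, X):        # overall score f*(X)
        return self.svm_.decision_function(self._scores(X))

# usage (shapes as in Section 3, assumed): clf = LdaThenSvm(q1=32, q2=38).fit(X_tr, y_tr)
# scores = clf.decision_function(X_te)
```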
3. EXPERIMENT

3.1 Experiment Design

The purpose of this experiment is to verify the effectiveness of the proposed method. We used the data of a BCI experiment given by Frenzel et al. in [15]. EEG signals were recorded using a Biosemi ActiveTwo system with 32 electrodes placed at the positions of the modified 10-20 system, with a sampling rate of 2048 Hz.

In Frenzel's experimental setup, the subjects sat in front of a computer screen presenting a 3-by-3 matrix of characters (see Figure 4) and had to fixate one of them and count the number of times the target character was highlighted. The fixated and the counted (target) characters could be identical (condition 1) or different (condition 2). The characters were dark grey on a light grey background and were set to black during the highlighting time. Each single trial contained one character highlighted for 600 ms followed by a randomized break lasting up to 50 ms.

Figure 4. Schematic representation of the stimulus paradigm (source: Frenzel et al. [15]).

In our experiment, we only examined the data recorded under one of the two conditions. Each time interval in which a character was highlighted is considered one sample. In total, we had nine datasets of m = 7290 samples measured with 32 electrodes. Each sample was assigned a label: +1 if the target character was presented and -1 otherwise. For each sample, the data in the time interval of about 600 ms were downsampled from the acquisition rate of the hardware to 64 Hz. All 38 time points ($q_2$) between 0 and 600 ms and all 32 electrodes ($q_1$) were used; thus there were $p = q_1 q_2 = 1216$ spatio-temporal features for classification.

For each dataset, classifiers were trained on the first n samples, with 200 ≤ n ≤ 2100, and the scores of the remaining m − n samples were calculated. The proposed method was compared with LDA, SVM, and regularized LDA. If the sample size n of the training data is smaller than p + 2, the inverse of $\hat{\Sigma}$ does not exist, and we replaced it with the Moore-Penrose pseudo-inverse of $\hat{\Sigma}$. The regularization parameter of regularized LDA was calculated using formula (2), because applying this formula requires far less computational time than cross-validation, especially for long training periods. For SVM classification, the L1 soft-margin loss function was used.

3.2 Collected Statistics

To assess the performance of all systems, the AUC (area under the curve) value is used. This is the relative frequency of target trials having a larger score than non-target ones. The reason for using AUC values rather than error rates is that an overall error rate is almost meaningless when target trials are rare [16]. The AUC value can be calculated as

$$\mathrm{AUC} = \frac{\mathrm{Rank}(\oplus) - |\oplus|\,(|\oplus| + 1)/2}{|\oplus| \cdot |\ominus|}, \qquad (3)$$

where $\mathrm{Rank}(\oplus)$ is the sum of the ranks of all examples that belong to class $+1$, $|\oplus| = |\{i : y_i = +1\}|$ is the number of examples in the dataset that belong to class $+1$ (contain a P300), and $|\ominus| = |\{i : y_i = -1\}|$ is the number of examples in the dataset that belong to class $-1$ (do not contain a P300).

3.3 Experiment Platform

The simulations were all performed on a personal computer (Intel Core i5, 2.5 GHz, Windows 7) using Matlab (The MathWorks, USA; version 2014b).
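Formula (3) is the rank-based (Mann-Whitney) form of the AUC. A minimal NumPy/SciPy sketch of this computation is given below; the handling of ties via average ranks is an assumption, since the paper does not state how ties are treated.

```python
import numpy as np
from scipy.stats import rankdata

def auc_from_scores(scores, labels):
    """AUC as in formula (3): relative frequency of target trials
    (label +1) receiving a larger score than non-target trials (label -1)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    ranks = rankdata(scores)                 # average ranks for ties (assumption)
    n_pos = np.sum(labels == 1)
    n_neg = np.sum(labels == -1)
    rank_pos = ranks[labels == 1].sum()      # Rank(+): sum of ranks of class +1
    return (rank_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

# usage: auc_from_scores([0.9, 0.1, 0.4, 0.8], [1, -1, -1, 1])  # -> 1.0
```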
4. RESULTS AND DISCUSSION

Figure 5 shows the learning curves of each method for all nine datasets; the average performance curves of the methods are provided in Figure 6. In these figures, the horizontal axis gives the size of the training set, ranging from 200 to 2000, and the vertical axis gives the AUC value calculated by formula (3).

Figure 5. Performance curves of each training method on all nine datasets.

Figure 6. The average performance curves over the nine datasets for each method.

The approach combining LDA and SVM showed similar or slightly better performance than both SVM and regularized LDA. Usually, in most online BCI systems, only small training sets are available. Since the combined approach gave reasonable performance even with small training sizes, it might offer a practically relevant advantage. In BCI systems, generally, the greater the distance between the training set and the test data, the lower the classification accuracy. Therefore, in the future we will apply techniques from stream data mining, such as transfer learning and lifelong learning, to EEG data.

5. CONCLUSIONS

The performance of the support vector machine is affected by the dimensionality and the Mahalanobis distance of the data. We introduced a method that addresses this problem by applying LDA as a preprocessing step, which keeps the Mahalanobis distance unchanged while reducing the dimensionality of the data. For our EEG data, the combined LDA and SVM approach performed better than both SVM and regularized LDA.

Nowadays there are many methods that adapt the design of SVM to HDLSS data settings, such as distance weighted discrimination and its improvements [17]. To the best of our knowledge, these algorithms have not yet been applied to single-trial classification of EEG data in BCI systems. In future work, the combined approach using these algorithms should be investigated carefully.

The way features are divided into subgroups in order to obtain the optimal performance of the combined LDA and SVM should also be considered carefully, although this is an NP-hard problem. In the future, we will give theoretical justifications for a feature division that is nearly optimal for EEG data; as a consequence, we will point out the high-dimensional data models for which our combined approach also works well.

ACKNOWLEDGMENTS

We especially thank Stefan Frenzel (University of Greifswald, Greifswald) for generously providing data. We also wish to acknowledge the partial financial support from Vietnam's National Foundation for Science and Technology Development (Nafosted) under grant number 102.01-2014.09.

REFERENCES

[1] Wolpaw, J. R., Birbaumer, N., McFarland, D. J., Pfurtscheller, G., and Vaughan, T. M. 2002. Brain-computer interfaces for communication and control. Clin. Neurophysiol. 113: 767-791.

[2] Farwell, L., and Donchin, E. 1988. Talking off the top of your head: toward a mental prosthesis utilizing event-related brain potentials. Electroencephalogr. Clin. Neurophysiol. 70, 510-523.

[3] Pfurtscheller, G., and Neuper, C. 2001. Motor imagery and direct brain-computer communication. Proc. IEEE 89, 1123-1134.

[4] Lotte, F., Congedo, M., Lécuyer, A., Lamarche, F., and Arnaldi, B. 2007. A review of classification algorithms for EEG-based brain-computer interfaces. Journal of Neural Engineering.

[5] Kaur, M., Soni, A. K., and Rafiq, M. Q. 2015. Offline detection of P300 in BCI speller systems. In Emerging ICT for Bridging the Future - Proceedings of the 49th Annual Convention of the Computer Society of India (CSI) (pp. 71-82). Springer International Publishing.

[6] Xu, M., Liu, J., Chen, L., Qi, H., He, F., Zhou, P., and Ming, D. 2015. Inter-subject information contributes to the ERP classification in the P300 speller. In Neural Engineering (NER), 2015 7th International IEEE/EMBS Conference on (pp. 206-209). IEEE.

[7] Friedman, J. H. 1989. Regularized discriminant analysis. J. Amer. Statist. Assoc. 84 (405): 165-175.

[8] Höhne, J., Blankertz, B., Müller, K. R., and Bartz, D. 2014. Mean shrinkage improves the classification of ERP signals by exploiting additional label information. In Pattern Recognition in Neuroimaging, 2014 International Workshop on (pp. 1-4). IEEE.

[9] Marron, J. S., Todd, M. J., and Ahn, J. 2007. Distance weighted discrimination. Journal of the American Statistical Association, 102, 1267-1271.

[10] Blankertz, B., Lemm, S., Treder, M., Haufe, S., and Müller, K. R. 2011. Single-trial analysis and classification of ERP components - A tutorial. NeuroImage, 56, 814-825.

[11] Vapnik, V. 1998. Statistical Learning Theory. NY, Wiley-Interscience.

[12] Qiao, X., and Zhang, L. 2015. Distance-weighted support vector machine. Statistics and Its Interface, 8, 3, pp. 331-345.

[13] Cai, T., and Shen, X. 2010. Frontiers of Statistics: High-Dimensional Data Analysis. World Scientific Publishing and Imperial College Press.

[14] Schäfer, J., and Strimmer, K. 2005. A shrinkage approach to large-scale covariance matrix estimation and implications for functional genomics. Statistical Applications in Genetics and Molecular Biology, 4(1).

[15] Frenzel, S., Neubert, E., and Bandt, C. 2011. Two communication lines in a 3x3 matrix speller. Journal of Neural Engineering, 8, 036021.

[16] Huy, N. H. 2013. Multi-step Linear Discriminant Analysis and Its Applications. PhD thesis, Department of Mathematics and Computer Science, University of Greifswald.

[17] Qiao, X., Zhang, H., Liu, Y., Todd, M. J., and Marron, J. S. 2010. Weighted distance weighted discrimination and its asymptotic properties. Journal of the American Statistical Association, 105, 489, pp. 401-414.