Neurocomputing 443 (2021) 131–146. Journal homepage: www.elsevier.com/locate/neucom

Robust hierarchical feature selection with a capped ℓ2-norm

Xinxin Liu a,b,c, Hong Zhao a,b,⇑

a School of Computer Science, Minnan Normal University, Zhangzhou, Fujian 363000, China
b Key Laboratory of Data Science and Intelligence Application, Fujian Province University, Zhangzhou, Fujian 363000, China
c Fujian Key Laboratory of Granular Computing and Application (Minnan Normal University), Zhangzhou, Fujian 363000, China

Article history: Received 11 June 2020; Revised 20 February 2021; Accepted March 2021; Available online 10 March 2021.

Keywords: Inter-level error propagation; Capped ℓ2-norm; Data outliers; Feature selection; Hierarchical classification.

Abstract: Feature selection methods face new challenges in large-scale classification tasks because massive categories are managed in a hierarchical structure. Hierarchical feature selection can take full advantage of the dependencies among hierarchically structured classes. However, most of the existing hierarchical feature selection methods are not robust for dealing with the inevitable data outliers, resulting in a serious inter-level error propagation problem in the following classification process. In this paper, we propose a robust hierarchical feature selection method with a capped ℓ2-norm (HFSCN), which can reduce the adverse effects of data outliers and learn relatively robust and discriminative feature subsets for the hierarchical classification process. Firstly, a large-scale global classification task is split into several small local sub-classification tasks according to the hierarchical class structure and the divide-and-conquer strategy, which makes feature selection modeling easier. Secondly, a capped ℓ2-norm based loss function is used in the feature selection process of each local sub-classification task to eliminate the data outliers, which can alleviate the negative effects of the outliers and improve the robustness of the learned feature weighted matrix. Finally, an inter-level relation constraint based on the similarity between the parent and child classes is added to the feature selection model, which can enhance the discriminative ability of the selected feature subset for each sub-classification task with the learned robust feature weighted matrix. Compared with seven traditional and state-of-the-art hierarchical feature selection methods, the superior performance of HFSCN is verified on 16 real and synthetic datasets.
© 2021 Elsevier B.V. All rights reserved.

1. Introduction

In this era of rapid information development, the scale of data in many domains increases dramatically, such as the number of samples, features, and classes [1,2]. Vast classes are usually arranged in hierarchical structures for classification tasks [3,4]. In addition, the data are often vulnerable to outliers, which usually decrease the density of valuable data for a specific task. These problems are challenging for machine learning and data mining tasks such as classification. On the one hand, high-dimensional data bring the curse of dimensionality problem to classification tasks [5–8]. Feature selection is considered an effective technique to alleviate this problem [9,10]. This method focuses on the features that relate to the classification task and excludes the irrelevant and redundant ones. On the other hand, data outliers usually disturb the learning models and reduce the relevance between
these selected features and ⇑ Corresponding author at: School of Computer Science, Minnan Normal University, Zhangzhou, Fujian, 363000, China E-mail address: hongzhaocn@tju.edu.cn (H Zhao) https://doi.org/10.1016/j.neucom.2021.03.002 0925-2312/Ó 2021 Elsevier B.V All rights reserved the corresponding classes This may lead to serious inter-level error propagation, particularly in the following hierarchical classification process [11–13] Therefore, how to deal with data outliers and how to exploit the hierarchical information of classes in feature selection processes are an interesting challenge Feature selection methods can be categorized into flat feature selection and hierarchical feature selection methods depending on whether the class hierarchy is considered The flat feature selection method selects one feature subset to distinguish all the classes Thus far, many flat feature selection methods based on different theories have been proposed Kira and Rendell [14] proposed the classical feature selection method Relief based on statistical methods, which selects a relevant feature subset by statistical analysis and uses few heuristics to avoid the complex heuristic search process Peng et al [15] proposed the mRMR algorithm based on the mutual information measure, which selects a feature subset based on the criteria of maximal dependency, maximal relevance, and minimal redundancy Cai and Zhu [16] proposed a feature selection method based on feature manifold learning and sparsity regularization for multi-label tasks Faeze and Ali [17] proposed an effective feature selection method based on the backward X Liu and H Zhao Neurocomputing 443 (2021) 131–146 and only discriminative features need to be retained for the current local sub-classification task Secondly, HFSCN excludes the data outliers for each sub-classification task by using a capped ‘2 norm based loss function according to the regression analysis In contrast to the existing hierarchical feature selection methods, HFSCN can improve the robustness of the selected local feature subsets and alleviate the error propagation problem in the classification process Finally, HFSCN selects a unique and compact feature subset for the current sub-classification task using an interlevel regularization of the parent–child relationship in the feature selection process of the current sub-classification task The dependency between the current child sub-classification task and its parent sub-classification task is emphasized to drop out the features related to the local sub-classification tasks sharing different parent classes with the current sub-classification task A series of experiments are conducted to compare HFSCN with seven of the existing hierarchical feature selection methods The experimental datasets consist of two protein sequence datasets, two image datasets, and their 12 corrupted datasets with three types of sample noise Six evaluation metrics are used to discuss the significant differences between our method and the compared methods The experimental results demonstrate that the feature subsets selected by the proposed HFSCN algorithm are superior to those selected by the compared methods for the classification tasks with hierarchical class structures The remainder of this paper is organized as follows In Section 2, we present the basic knowledge of hierarchical classification and feature selection and describe the modeling process of HFSCN in detail Then, Section introduces the experimental datasets, the compared methods, the parameter 
settings, and some evaluation criteria In addition, Section reports the experimental results and discusses the performance of the compared methods Finally, Section provides the main conclusions drawn from this work and ideas for further study elimination approach for web spam detection Meanwhile, some flat feature selection methods using different regularization terms have been proposed in recent years Nie et al [18] used joint ‘2;1 norm minimization on both the loss function and the regularization term to optimize the feature selection process Lan et al [19] exploited a capped norm on the loss function to decrease the effect of data outliers and optimize the flat feature selection process These flat feature selection methods perform well on selecting feature subsets for a two-class classification or a multiclass classification However, these methods fail to consider the ubiquitous and crucial information of local class relationships and not perform well when are applied directly to hierarchical classification tasks This has been verified by the series of experiments in [20–22] The hierarchical feature selection method selects several local feature subsets by taking full advantage of the dependency relationships among the hierarchically structured classes Relying on different feature subsets to discriminate among different classes can help achieve more significant effects in hierarchical classification tasks For example, texture and color features are suitable for distinguishing among different animals, while the edge feature is more appropriate for discriminating among various furniture items [20] Thus far, some feature selection methods using the hierarchical class structure have been proposed Freeman et al [23] combined the process of feature selection and hierarchical classifier design with genetic algorithms to improve the classification accuracy of the designed classifier Grimaudo et al [24] proposed a hierarchical feature selection algorithm based on mRMR [15] for internet traffic classification Zhao et al [25] proposed a hierarchical feature selection algorithm based on the fuzzy rough set theory These methods can achieve high classification accuracy but fail to use the dependencies in the hierarchical structure of the classes to optimize the feature selection process Zhao et al [20] proposed a hierarchical feature selection method with three penalty terms: an ‘2;1 -norm based regularization term to select the features with group sparsity across classes; an ‘F -norm based regularization term to select the common features shared by parent and child categories; and an independence constraint is added to maximize the uniqueness between sibling categories Following that work, they then proposed a recursive regularization based hierarchical feature selection framework in [21] with and without the parent–child relationship constraint and the sibling relationship constraint Compared with the above two approaches, Tuo et al [22] focused on the two-way dependence among different classes and proposed a hierarchical feature selection with subtree-based graph regularization These methods have a good performance in selecting feature subsets for large-scale classification tasks with hierarchical class structures However, these existing hierarchical feature selection methods are not robust to data outliers and suffer from a serious inter-level error propagation problem [26,27] There is no outlier filtering mechanism in these models, and the commonly used least-squares loss function squares the 
misclassification loss of these outliers, which will further aggravate the negative impacts of these outliers It makes these models achieve relatively low performance when dealing with practical tasks with ubiquitous outliers In this paper, we propose a robust hierarchical feature selection algorithm with a capped ‘2 -norm (HFSCN), which deals with the data outliers and selects unique and compact local feature subsets to control the inter-level error propagation in the following classification process Firstly, HFSCN decomposes a complex large-scale classification task into several simple sub-classification tasks according to the hierarchical structure of classes and the divideand-conquer strategy Compared with the initial classification task, these sub-classification tasks are small-scale and easy to handle, HFSCN method In this section, we present the proposed robust hierarchical feature selection method with a capped ‘2 -norm (HFSCN) in detail 2.1 Framework of the HFSCN method There are two motivations to design our robust hierarchical feature selection method Firstly, the hierarchical class structure in the large-scale classification task has to be taken into account for the prevailing hierarchical management of numerous classes Secondly, the adverse effects of noises such as data outliers, which may result in a serious inter-level error problem in the following hierarchical classification, have to be reduced in the optimization process A framework of HFSCN based on these considerations is designed, as shown in Fig The hierarchical feature selection process of HFSCN can be roughly decomposed into the following two steps: (1) Divide a complex large-scale classification task into a group of small sub-classification tasks according to the divide-andconquer strategy and the class’s hierarchical information (2) Develop a robust hierarchical feature selection for each subclassification task, considering the elimination of the outliers and the addition of the parent–child relationship constraints 132 Neurocomputing 443 (2021) 131–146 X Liu and H Zhao Fig Framework of HFSCN Firstly, a complex large-scale classification task is divided into some sub-classification tasks with small scales and different inputs and outputs according to the divide-and-conquer strategy and the hierarchy of the classes Then, a corresponding training dataset is grouped for these subtasks from bottom to top along the hierarchy of classes Finally, robust and discriminative feature subsets are selected recursively for those subtasks by the capped ‘2 -norm based noise filtering mechanism and the relation constraint between the parent class and its child classes object classes with hierarchical information is represented by a tree structure of the public VOC dataset [30] The root class Object, which contains all of the classes below it, is the only large node There are several internal class nodes, which have parent coarse-grained class nodes and child fine-grained class nodes For instance, the Furniture class has the child class set of Seating and Dining table Class nodes without a child node are termed ‘‘leaf class nodes” The root node and all of the internal nodes are called ‘‘non-leaf class nodes” Moreover, the classification process of all the samples stops at the leaf node in the experiments; i.e., leaf node classification is mandatory Several of the examples shown in Fig have been given to illustrate the asymmetry and the transmission properties of ‘‘IS-A” (1) The asymmetry property: Sofa is a type of Seating, but it 
is incorrect that all seating are Sofa (2) The transmission property implies that Chair belongs to Seating and Seating belongs to Furniture, so Chair belongs to Furniture as well In this case, class hierarchies in all hierarchical classification tasks satisfy the four properties mentioned above A classification task with object classes managed in a hierarchical tree structure is called hierarchical classification A sample is classified into one class node at each level in turn in a coarse-tofine fashion In the hierarchical tree structure of classes, the root 2.2 Hierarchical classification In most real-world and practical large-scale classification tasks, categories are usually managed in a hierarchical structure A tree structure and a directed acyclic graph structure are two common representations of the class hierarchical information In this study, we focus on the classes with a hierarchical tree structure The hierarchical tree structure of classes is usually defined by an ordered set (C T ; 0), where C T is a finite set of all the classes in a task, and (the ‘‘IS-A” relationship) indicates that the former is the subclass of the latter The ‘‘IS-A” relationship has the following four properties [28,29]: – the root class node is the only significant element in the tree structure of classes; – 8ci ; cj C T , if ci cj , then cj : ci ; – 8ci C T ; ci : ci ; – 8ci ; cj ; ck C T , if ci cj and cj ck , then ci ck ; where ci is the i-th class, and : denotes that the former is not a subclass of the latter That is, the former class is not a child of the latter class in the tree class hierarchy As shown in Fig 2, a set of Fig Hierarchical tree structure of object classes of the VOC dataset [30] 133 X Liu and H Zhao Neurocomputing 443 (2021) 131–146 number of non-leaf nodes in C T and Xi Rmi Âd (i ¼ 0; Á Á Á ; N), and mi is the number of samples described by d features in the i-th sub-classification task Further, the variable C max is the maximum number of child nodes in all the sub-classification tasks to facilitate the calculation The class label matrix set is redefined as follows: Y ¼ fY0 ; Á Á Á ; Yi g, where Yi ¼ f0; 1g Rmi ÂC max and internal nodes are abstract coarse-grained categories summarized from their child classes The class labels of the training samples correspond to the leaf classes, which are fine-grained categories The closer a node to the root node, the coarser the granularity of the category As a result, a sample belongs to several categories from the coarse-grained level to the fine-grained level However, most hierarchical classification methods have one general and serious problem, called inter-level error propagation; the classification errors at the parent class are easily transported to its child classes and propagated to the leaf classes along with the tree structure [26] 2.4 Robust hierarchical feature selection method One unique feature selection process for each small subclassification task is obtained by the aforementioned task decomposition process For the i-th sub-classification task, Xi and Yi are the feature matrix and the class label matrix Not all of the dimensions of object features are suitable for predicting a specific category We select a unique and compact feature subset and drop out the relatively irrelevant features for each subclassification task to alleviate the curse of the dimensionality problem Feature selection methods can be categorized into three groups according to different selection strategies: filter feature selection, wrapper feature selection, 
and embedding feature selection In this study, we focused on the third one, namely embedding feature selection Different norm-based regularization terms of the feature weighted matrix W are usually used as penalty terms in the embedding feature selection The feature selection model for the i-th sub-classification task can be formulated as a common penalized optimization problem as follows: 2.3 Decomposition of a complex classification task A divide-and-conquer strategy is used to divide a complex classification task with a hierarchical class structure A group of small sub-classification tasks corresponding to the non-leaf classes in the hierarchical class structure can be obtained according to the decomposition process The fine-grained classes under the child classes of a non-leaf class are ignored, and only these direct child classes of this non-leaf node are included in the searching space for the corresponding sub-classification task For example, the non-leaf classes in the hierarchical class structure C T of the VOC dataset are represented by f0; 1; 2; ; 9g, where denotes the root class For the sub-classification task corresponding to the non-leaf class 0, we only need to distinguish its four direct child classes (Vehicles, Animal, Household, and Person), and not discriminate the fine-grained classes under Vehicles, Animal, and Household Several small sub-classification tasks shown in Fig are obtained according to above task decomposition process Therefore, each sub-classification task’s searching space is significantly decreased, which makes it simple to model the feature selection and the classification process Meanwhile, the classification task is represented according to the non-leaf class nodes in the hierarchical class structure C T as follows: the feature matrix set is X ¼ fX0 ; Á Á Á ; XN g, where N is the minLXi ; Yi ; Wi ị ỵ kRWi Þ; Wi ð1Þ where Wi is a feature weighted matrix that has to be learned and Wi RdÂC max , and a feature weighted matrix set W= {W0 ; W1 ; Á Á Á ; WN } has to be learned for the initial hierarchical classification task; LðXi ; Yi ; Wi Þ represents an empirical loss item Further, RðWi Þ indicates a penalty term, and k is the trade-off parameter between the loss item and the penalty term (k > 0) Fig Classification task of VOC is divided into several small sub-classification tasks 134 Neurocomputing 443 (2021) 131–146 X Liu and H Zhao large weights across all the classes, and the features with small weights are not selected An inter-level relation regularization term defined according to the similarity between the parent and child class nodes in C T is also used to optimize the feature selection process of the i-th subclassification task A coarse-grained parent class is abstracted and generalized from the fine-grained child classes under it The samples in one child class have to first be classified into a coarsergrained category (the parent class) and then into the current class Therefore, it is reasonable to believe that the selected feature subset for a child sub-classification task is similar to that selected for its parent sub-classification task The features related to the classes sharing different parent classes with the current class need to be discarded Following the work in [20], we minimize the difference between the feature weighted matrix Wi of the current subclassification task and the feature weighted matrix Wpi of its parent sub-classification task: kWi À Wpi k2F , where ‘F -norm is used The common and traditional empirical loss functions 
include the least squares loss and the logistic loss Assume that the classical least squares loss kXi Wi À Yi k2F is used as the loss function in the feature selection model for the i-th sub-classification task, where k Á kF is the Frobenius norm of a matrix If some data outliers are exiting in the training set, the classification loss are particularly large since the residual kxji Wi À yji k22 of the j-th sample is squared This makes the learned Wi away from the ideal one and may lead to a serious inter-level error propagation problem in the following hierarchical classification process An outlier is a case that does not follow the same model as the rest of the data and appears as though it came from a different probability distribution [31] In order to reduce the adverse effects of data outliers, we use the capped ‘2 -norm based loss function [19,32,33] in our model to remove the outliers according to the regression analysis of the features and the class labels: 2 min xji Wi yji ; ei ỵ kRWi ị; Wi ð2Þ for the convenience of calculation This penalty term is called the inter-level constraint and added to the objective function Thus, the final objective function for the i-th sub-classification task can be expressed as follows: where ei is the limited maximum loss for the data outliers in the i-th sub-classification task No matter how serious classification error caused by a data point, the classification loss of this data point is at most i Therefore, the negative effects of data outliers on the learned feature weighted matrix are considerably reduced to obtain a robust and discriminative feature subset for the i-th subclassification task The following calculation determines the value of ei in the i-th sub-classification task Firstly, the losses of all the training samples in the i-th sub-classification task are calculated, which is the ‘2 norm of all the row vectors in ðXi Wi À Yi Þ Then, these losses are sorted in descending order The parameter ei is the corresponding value of e quantile (e P 0) in the aforementioned ordered sequence for the i-th sub-classification task For example, the classification losses of 20 samples are calculated according to the loss function 2 2 min xji Wi À yji ; ei ỵ kkWi k2;1 ỵ aWi Wpi F ; Wi Wi ð4Þ where a is the paternity regularization coefficient for the parent– child difference and a > Finally, the hierarchical feature selection objective function of the entire hierarchical classification task is written as follows: N 2 X min xji Wi yji ; ei ỵ kkWi k2;1 Wi iẳ0 N X 2 ỵ a Wi Wpi F : 5ị iẳ1 kxji Wi yji k2 of the j-th sample in the i-th sub-classification task Then, the obtained 20 losses are sorted, and the following descending-order sequence is obtained: [92.1, 87.4, 50.3, 29.9, , 13.1, 12.3] If e ẳ 0:10, then 0:10 20 ỵ ¼ 3, and ei is equal to the third value in the ordered sequence: ei ¼ 50:3 Therefore, the classification losses for the data outliers are limited to 50.3 at most; these serious data outliers are eliminated by the capped ‘2 -norm loss function It can ensure the selected feature subsets’ discriminative ability and decrease the inter-level errors in the hierarchical classification process The following are two penalty terms used in our feature selection model: a sparsity regularization term based on the ‘2;1 -norm of Wi , and an inter-level relation one defined according to the similarity between the coarse-grained parent class and the finegrained child class The ‘2;1 -norm regularization term for the feature selection model, proposed in [34], 
is convex and can be easily optimized The regularization term based on the ‘2;1 -norm of Wi can help the model to select a compact and unique local feature subset for discriminating the classes in the current i-th sub-classification task, and discard features that are suitable for distinguishing the categories in other sub-classification task This penalty term is called the structural sparsity for the classes in the current subtask For the i-th sub-classification task, the feature selection objective function with the sparsity regularization term is as follows: 2 min xji Wi À yji ; ei ỵ kkWi k2;1 ; 2.5 Optimization of HFSCN In this section, we describe the optimization process for solving the objective function of HFSCN Assume diagonal matrices Di RdÂd and Si Rmi Âmi with the j-th diagonal element as following values, respectively: jj di ẳ ; 2wji 6ị 1 1 sjji ¼ xji Wi À yji Ind xji Wi À yji ei ; 2 ð7Þ where xji and yji represent the feature vector and its corresponding class label vector of the j-th sample in the i-th feature selection process, respectively; wji is the j-th column of Wi ; and j j < 1; if xi Wi À yi ei ; Indị ẳ : 0; otherwise: 8ị The hierarchical feature selection objective function can be rewritten as: N N X X 2 Tr ðXi Wi À Yi ÞT Si ðXi Wi À Yi Þ þ kTr WTi Di Wi þ a Wi À Wpi F : Wi 3ị iẳ0 9ị iẳ1 The optimization objective function of the root node (the 0-th sub-classification task) needs to be updated separately because it has no parent class The objective function of the root node can be expressed as follows: where k is the sparsity regularization parameter and k > 0; the k Á k2;1 term is the ‘2;1 -norm of a matrix The selected features have 135 X Liu and H Zhao Neurocomputing 443 (2021) 131–146 ðX0 W0 À Y0 ÞT S0 ðX0 W0 Y0 ị ỵ kTr WT0 D0 W0 The feature selection process of the initial hierarchical classification is performed by updating Di ; Si , feature weighted matrices W0 , and Wi , according to Eqs (6), (7), (12), and (15), respectively Finally, the feature weighted matrix set W ¼ fW0 ; W1 ; Á Á Á ; WN g is obtained for the whole global hierarchical classification task The n features for the i-th sub-classification task are sorted according to wji (j ¼ 1; Á Á Á ; n) in the descending order, and some top- W0 ;Wi jC j X þ a Tr ðWi À W0 ÞT ðWi À W0 Þ ; ð10Þ i¼1 where C is the child class set of the root class, and jC j indicates the number of elements in C The derivative of Eq (10) with respect to W0 is set to zero as follows: Then, the following are the optimization processes of the other subclassification tasks that have parent classes The feature selection objective function of the i-th sub-classification task can be expressed as follows: ranked features are selected Based on the above analysis, the detailed algorithm to solve the hierarchical feature selection problem with the capped ‘2 -norm in Eq (5) is summarized in Algorithm Firstly, some data outliers are eliminated according to the least squares regression analysis and the outlier percentage e, as listed from Line to Line The feature weighted matrices are then calculated according to the sparse regularization and the parent–child relationship constraint, as listed from Line to Line 10 The iteration of the two aforementioned processes continue until the convergence of the objective function in Eq (5) Finally, the discriminative feature subsets are learned for the hierarchical classification task through the processes in Lines 13 and 14 It takes approximately six iterations before convergence in the experiments Wi 
;Wpi Experimental setup: jC j X XT0 S0 X0 ỵ kD0 ỵ ajC jI W0 XT0 S0 Y0 ỵ a Wi ! ẳ 0; 11ị iẳ1 where I is an identity matrix and I RdÂd The following result for W0 is thus obtained: ! jC j À1 X W0 ¼ XT0 S0 X0 ỵ kD0 ỵ ajC jI XT0 S0 Y0 ỵ a Wi : 12ị iẳ1 Tr ðXi Wi À Yi ÞT Si Xi Wi Yi ị ỵ kTr WTi Di Wi þ aTr À Wi À Wpi ÁT À Wi À Wpi Á : ð13Þ In this section, we introduce the experiment setup from the following four aspects: (1) the real and synthetic experimental datasets; (2) the compared methods; (3) the evaluation metrics used to discuss the performance of all methods; and (4) the parameter settings in the experiments All experiments are conducted on a PC with 3.40 GHz Intel Core i7-3770, 3.40 GHz CPU, 16 GB memory, and Windows operating system The code is accessible via the following link: https:// github.com/fhqxa/HFSCN The derivative with respect to the Wi matrix is set to zero as follows: XTi Si Xi ỵ kDi ỵ aI Wi XTi Si Yi ỵ aWpi ẳ 0: 14ị Thus, the following result for Wi is obtained: À1 Wi ẳ XTi Si Xi ỵ kDi ỵ aI XTi Si Yi ỵ aWpi : 15ị 3.1 Datasets Algorithm 1: Robust hierarchical feature selection with a capped ‘2 -norm (HFSCN) Input: Experiments are conducted on four practical datasets (including two protein sequence datasets, one object image dataset, and one fine-grained car image dataset) from the machine learning repository and their 12 corrupted synthetic datasets These datasets from different application scenarios provided a good test ground for the evaluation of the performance of different hierarchical feature selection methods Detailed information about the initial datasets is provided in Table 1, and the sources of these datasets are as follows: (1) Data matrices Xi Rmi Âd ; Yi Rmi ÂC max and CT ; (2) Sparsity regularization parameter k > 0; (3) Paternity regularization coefficient a > 0; (4) Outlier percentage e P Output: (1) Feature weighted matrix set W ¼ fW0 ; W1 ; Á Á Á ; WN g, where Wi RdÂC max ; (2) Selected feature subsets F ¼ fF0 ; F1 ; Á Á Á ; FN g ðtÞ 1: Set iteration number t ¼ 1; initialize WðtÞ with the each element of Wi is 1; 2: repeat 3: for i ¼ to N ðt Þ ðt Þ 4: Compute D and S according to Eqs (6) and (7), respectively; i 5: 6: 7: 8: 9: 10: 11: 12: 13: 14: i end for // Update the feature weighted matrix for the 0-th sub-classification task À1 PjC j ðtÞ tỵ1ị tị tị tị ẳ XT0 S0 X0 ỵ kD0 ỵ ajC jI XT0 S0 Y0 ỵ a i¼1 Wi ; Update W0 :W0 (1) DD and F194 are two protein sequence datasets from Bioinformatics Their samples are described by 473 features extracted by the method in [35] The first two layers of the protein hierarchy, namely class and fold, are used in the experiment The DD dataset contains 27 various protein folds, which originate four major structural classes: a; b; a=b, and a ỵ b [36] F194 includes 8,525 protein sequences from 194 folds which belong to seven classes [35] The class hierarchical tree structures of DD and F194 are shown in Figs and (2) VOC (PASCAL Visual Object Classes 2010) [30] is a benchmark dataset in visual object category recognition and detection It contains 12,283 annotated consumer photographs collected from the Flickr photo-sharing website, where 1,000 features are extracted to characterize each // Update the feature weighted matrix for the i-th sub-classification task for i ¼ to N À1 tỵ1ị ẳ XTi Si Xi ỵ kDi ỵ aI XTi Si Yi ỵ aWpi ; Update Wi : Wi end for n o tỵ1ị tỵ1ị tỵ1ị Update Wtỵ1ị ẳ W0 ; W1 ; Á Á Á ; WN ; t ẳ t ỵ 1; until Convergence; for i ẳ to N Rank wji (j ¼ 1; Á Á Á ; n) in descending order for the i-th sub-classification 15: 
16: 17: task; Select the top ranked feature subset Fi for the i-th sub-classification task; end for Return W and F; 136 Neurocomputing 443 (2021) 131–146 X Liu and H Zhao image sample Its class hierarchy is shown in Fig 2, and the leaf nodes correspond to the 20 fine-grained car categories (3) Car196 [37] is a large-scale and fine-grained image dataset of cars It consists of 196 classes and 15,685 images, covering sedans, SUVs, coupes, convertibles, pickups, hatchbacks, and station wagons Each image sample is characterized by 4,096 features which are extracted by the deep VGG model [38] pre-trained with the ImageNet dataset [39] Fig is the hierarchical class structure of Car196 and the leaf nodes correspond to the 196 fine-grained car classes For the corrupted datasets, some sample outliers are generated according to different distributions and then included in the initial training datasets The quantity of sample outliers in each corrupted training set is 10% of the number of samples in the corresponding initial training set But the test sets in all the corrupted datasets are the same as those in all the initial datasets and not corrupted with data outliers In addition, the feature dimensionality and hierarchical structures of the classes are not changed Three common distributions are used to obtain three new synthetic datasets for each initial dataset The detailed information about these 16 initial and corrupted datasets is summarized in Table Fig Hierarchical tree structure of the object classes in DD minimal-redundancy and maximal-relevance criteria and avoids the difficult multivariate density estimation in maximizing dependency (6) HFSRR-PS [20] exploits the parent–child relationships and sibling relationships in the class hierarchy to optimize the feature selection process (7) HFSRR-P [21] is modified from HFSRR-PS and considers the parent–child relationship only during the hierarchical feature selection process (8) HFS-O is a hierarchical feature selection method modified from the flat feature selection method proposed in [19] HFS-O uses the sparsity regularization to obtain the feature subset and relies on the capped ‘2 -norm regularization to filter out the data outliers The parent–child and sibling relationships among the classes are not taken into consideration in its feature selection process 3.2 Compared methods HFSCN is compared with the baselines and seven hierarchical feature selection methods in the hierarchical classification tasks to evaluate the effectiveness and the efficiency of HFSCN for hierarchical feature selection Top-down SVM classifiers are used in the classification experiments The compared methods are as follows: (1) Baselines, including F-Baseline and H-Baseline, classify the test samples with all the initial features F-Baseline assumes that the categories are independent of each other and directly distinguishes among all the categories Different from F-Baseline, H-Baseline considers the semantic hierarchy between the categories and classifies the samples from coarse to fine (2) HRelief is a hierarchical feature selection method that is a variant of the Relief method [14] and belongs to instancebased learning (3) HFisher uses the classical feature selection method Fisher score [40] to select the relevant features on the class hierarchy, where the Fisher score selects the features of the labeled training data on the basis of the criterion of the best discriminatory ability and can avoid the process of heuristic search (4) HFSNM is a hierarchical selection 
method that selects features based on FSNM [18] FSNM uses joint ‘2;1 -norm minimization on both the loss function and the regularization on all the data points In this method, a sparsity regularization parameter needs to be predefined (5) HmRMR [24] is a hierarchical feature selection algorithm modified from mRMR [15] The classical flat feature selection method mRMR selects a feature subset based on the There are three methods HFSRR-PS, HFSRR-P, and HFS-O related to the proposed HFSCN method Table presents the differences among these three compared methods and HFSCN in terms of the outlier filtering loss function, parent–child relationship constraint, and sibling relationship constraint We compare HFSCN with HFSRR-PS, HFSRR-P, and HFS-O as follow: (1) Compared with HFSRR-PS and HFSRR-P, HFSCN is robust to deal with the inevitable data outliers The outlier filtering mechanism realized by the capped ‘2 -norm based loss function can improve the discriminative ability of selected local feature subset and further alleviate the error propagation problem in the hierarchical classification process (2) The sibling relationship constraint between classes is not used in HFSCN given its relatively minor effect improvement in HFSRR-PS compared with HFSRR-P and the model complexity of HFSCN (3) In contrast to HFS-O, HFSCN uses the similarity constraint between the parent and child classes to select a unique and compact local feature subset recursively for a sub- Table Initial dataset description No Dataset Feature Node Leaf Height Training Test Type DD F194 VOC Car196 473 473 1000 4096 32 202 30 206 27 194 20 196 3 3,020 7105 7178 8144 605 1420 5105 7541 Protein Protein Object image Car image 137 X Liu and H Zhao Neurocomputing 443 (2021) 131–146 Fig Hierarchical tree structure of object classes in F194 Fig Hierarchical tree structure of object classes in Car196 (2) Feature test time (T test ) T test is the running time of the classifier learning and prediction process on the test set with the selected features (3) Classification accuracy (Acc ) Classification accuracy is the simplest evaluation metric for flat or hierarchical classification It is calculated as the ratio of the number of correctly predicted samples to the total number of test samples (4) Hierarchical F -measure (F H ) [41] The evaluation criteria for hierarchical classification models include hierarchical precision (P H ), hierarchical recall (RH ), and hierarchical F measure (F H ), defined as follows: classification task However, the relation among classes is not taken into consideration and local feature subsets are selected independently and respectively in HFS-O 3.3 Evaluation metrics Some evaluation metrics in [28] are used to validate in the experiments to evaluate the performance of our method and the compared methods On the one hand, the running time of the feature selection process and the classification time for testing the selected feature subsets are evaluated for these methods On the other hand, the effectiveness of the selected feature subsets for classification is evaluated in terms of the classification accuracy, the hierarchical F -measure, the tree-induced error, and the F measure based on the least common ancestor These metrics are described as follows: (1) Feature selection time (T sel ) T sel is the running time for selecting feature subsets by the different feature selection algorithms F-Baseline and H-Baseline input all the initial features directly in the classification without the feature selection process 
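The construction of the corrupted training sets summarized in Table 2 is described above only in prose (outlier samples amounting to 10% of the training set, with features drawn from random noise in [0, 10], from N(0, 1), or from U(0, 1)). The following minimal sketch illustrates one plausible way to build such a corrupted training split; the function name is illustrative, and the assignment of class labels to the injected outliers is an assumption, since the paper does not specify it. The test split is left untouched, matching the statement that only the training sets are corrupted.

```python
# Hypothetical sketch of the corrupted-dataset construction of Table 2.
import numpy as np

def corrupt_training_set(X, y, noise="random", ratio=0.10, seed=None):
    """Append outlier samples amounting to `ratio` of the training set size."""
    rng = np.random.default_rng(seed)
    n_out, d = int(ratio * X.shape[0]), X.shape[1]
    if noise == "random":        # random noise in [0, 10]
        X_out = rng.uniform(0.0, 10.0, size=(n_out, d))
    elif noise == "gaussian":    # noise from the Gaussian distribution N(0, 1)
        X_out = rng.standard_normal((n_out, d))
    elif noise == "uniform":     # noise from the uniform distribution U(0, 1)
        X_out = rng.uniform(0.0, 1.0, size=(n_out, d))
    else:
        raise ValueError(f"unknown noise type: {noise}")
    # Assumption: outlier labels are sampled uniformly from the existing labels.
    y_out = rng.choice(y, size=n_out)
    return np.vstack([X, X_out]), np.concatenate([y, y_out])
```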
Therefore, only the proposed method and the other seven hierarchical feature selection methods are compared based on this evaluation metric PH ¼ b aug \ Daug j jD ; b aug j jD FH ¼ Á PH Á RH ; P H ỵ RH where Daug ẳ D S RH ẳ b aug \ Daug j jD ; jDaug j ð16Þ ^ aug ¼ D b S An D b ; D is the real label AnðDÞ; D of the test sample, An(D) represents the parent node set of b indithe real class to which the sample really belongs, and D cates the predicted class label of the test sample 138 Neurocomputing 443 (2021) 131–146 X Liu and H Zhao LCACat; Dog ị ẳ fDomesticg; Daug ẳ fCat; Domesticg, and ^ aug ¼ fDog; Domesticg Finally, the following results can D and be obtained: PLCA ¼ 1=2 ¼ 0:5; RLCA ¼ 1=2 ¼ 0:5, F LCA ¼ Â 0:5 0:5=0:5 ỵ 0:5ị ẳ 0:5 However, for the calculation of F H ; Daug ¼ fCat; Domestic; Animal; Objectg, and ^ aug ¼ fDog;Domestic;Animal;Objectg Then, PH ¼3=4¼0:75; D Table Detailed information about 16 experimental datasets No Dataset DD DD-R DD-N DD-U Sample group Feature description of sample outliers (a) Initial and corrupted datasets for DD DD — DD + Random Random noise in ½0; 10 DD + Gaussian Random noise obeys Gaussian distribution N ð0; 1Þ DD + Uniform Random noise obeys uniform distribution U 0; 1ị RH ẳ3=4ẳ0:75, and F H ẳ20:750:75=0:75ỵ0:75ịẳ0:75 Therefore, the F LCA metric is more stable than F H with considering the least common ancestor (b) Initial and corrupted datasets for F194 F194 F194-R F194-N F194 F194 + Random F194 + Gaussian F194-U F194 + Uniform 3.4 Parameter settings — Random noise in ½0; 10 Random noise obeys Gaussian distribution N ð0; 1Þ Random noise obeys uniform distribution U ð0; 1Þ Some experimental parameters in the compared methods have to be set in advance Parameter k of HRelief is varied over the set f1; 2; 3; 4; 5; 6; 7g and set to the value which can lead to the best result for each classification task The parameters shared by HFSCN and the compared methods except HRelief are set to the same values of HFSCN: (1) the sparsity regularization parameters k in HFSNM, HFSRR-PO, HFSRR-P, HFS-O; (2) the paternity regularization coefficients a in the HFSNM, HFSRR-PO, HFSRR-P; and (3) the outlier percentage e for HFS-O For HFSCN, the three parameters are determined by a grid search based on a 10-fold cross validation: k and a are searched (c) Initial and corrupted datasets for VOC VOC VOC-R VOC-N VOC VOC + Random VOC + Gaussian VOC-U VOC + Uniform — Random noise in ½0; 10 Random noise obeys Gaussian distribution N ð0; 1Þ Random noise obeys uniform distribution U ð0; 1Þ (d) Initial and corrupted datasets for Car196 Car196 Car196R Car196N Car196U Car196 Car196 + Random — Random noise in ½0; 10 Car196 + Gaussian Random noise obeys Gaussian distribution N ð0; 1Þ Random noise obeys uniform distribution U ð0; 1Þ Car196 + Uniform from {10À2 ; 10À1 , 1, 10, 102 }, and e is tuned in the set {0%, 1%, 2%, 3%, 4%, 5%, 6%, 8%, 10%, 12%, 14%, 16%, 18%, 20%} Different parameter values with the best results are determined for different datasets A top-down C-SVC classifier modified from the classical support vector machine (SVM) with a linear kernel (t ¼ 0) and a penalty factor C ¼ 1, is used in the classification processes for testing the selected features In addition, all experiments are conducted using the 10-fold cross-validation strategy, and the average result is reported for each method (5) Tree induced error (TIE) In hierarchical classification, we should give different punishments on different types of classification errors In the proposed model, the penalty is defined 
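Eq. (16) above defines hierarchical precision, recall, and F-measure through the augmented label sets D_aug = D ∪ An(D) and D̂_aug = D̂ ∪ An(D̂). As a minimal sketch, the snippet below computes F_H for a single test sample from an illustrative ancestor mapping (the data structure and function name are not from the paper) and reproduces the Cat/Dog toy result of 0.75 discussed in the text.

```python
# Sketch of the hierarchical F-measure of Eq. (16) for one test sample.
def hierarchical_f_measure(true_label, pred_label, ancestors):
    """ancestors: dict mapping each class to the set of its ancestor classes
    (excluding the class itself) in the class tree."""
    d_aug = {true_label} | ancestors[true_label]      # D_aug = D ∪ An(D)
    d_hat_aug = {pred_label} | ancestors[pred_label]  # D̂_aug = D̂ ∪ An(D̂)
    overlap = len(d_aug & d_hat_aug)
    p_h = overlap / len(d_hat_aug)                    # hierarchical precision
    r_h = overlap / len(d_aug)                        # hierarchical recall
    return 0.0 if p_h + r_h == 0 else 2 * p_h * r_h / (p_h + r_h)

# Toy check from the text: real class Cat, predicted class Dog in Fig. 2.
ancestors = {
    "Cat": {"Domestic", "Animal", "Object"},
    "Dog": {"Domestic", "Animal", "Object"},
}
print(hierarchical_f_measure("Cat", "Dog", ancestors))  # 0.75, as in the text
```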
by the distance of nodes, which is termed the Tree Induced Error (TIE) [42] Assume the predicted class label cp and the real class label cr for one test sample; then, TIE is À Á À Á À Á computed: TIE cp ; cr ¼ EH cp ; cr , where EH cp ; cr is the edge set along the path from node cp to node cr in the class hierarchy C and jÁj calculates the number of elements That À Á is, TIE cp ; cr is the number of edges along the path from class cp to class cr in the class tree (6) F -measure based on the least common ancestor (F LCA ) This metric is based on but is different from the hierarchical F -measure [28] F H is susceptible to the number of shared ancestors To avoid this negative effect, F LCA is defined The least common ancestor (LCA) is derived from the graph theÀ Á ory [43], which denotes LCA cp ; cr as the least common ancestor between class nodes cp and cr Meanwhile, Daug is À Á changed to the nodes along the path from cr to LCA cp ; cr , ^ and D is changed to the nodes along the path from cp to À aug Á LCA cp ; cr Then, PLCA ; RLCA , and F LCA are calculated according to Eq (16) The following is a toy example to calculate F LCA Let cr and cp be the classes ‘‘Cat” and ‘‘Dog” in Fig 2, respectively Then, we can obtain the following: Experimental results and analysis In this section, we present the experimental results and discuss the performance of the proposed method from the following five aspects: (1) the classification results with different numbers of features selected by HFSCN; (2) the effects of the outlier filtering mechanism and the inter-level constraint in HFSCN; (3) two running time comparisons for feature selecting and feature testing; (4) performance comparisons using the flat and hierarchical metrics Acc ; F H ; F LCA , and TIE; in addition, the significant differences among HFSCN and the compared methods are evaluated statistically; and (5) the convergence analysis of HFSCN In all the tables presenting the experimental results, the best result is marked in bold, the next best result is underlined The mark ‘‘"” represents that the bigger the value, the better the performance; ‘‘#” indicates that the smaller the better 4.1 Classification results with different numbers of selected features An experiment on four initial real datasets is conducted to verify the changes of each evaluation metric when classifying samples with different numbers of features selected by HFSCN The experimental results show that the changes of the results on each metric Table Differences among HFSRR-PS, HFSRR-P, HFS-O, and HFSCN Different terms Outlier filtering Parent–child relationship Sibling relationship HFSRR-PS HFSRR-P No p p No p No 139 HFS-O p No No HFSCN p p No X Liu and H Zhao Neurocomputing 443 (2021) 131–146 set value of e, the worse was the classification effectiveness of HFSCN This result is likely because some samples that are important and discriminative to the classes are removed (2) The proposed algorithm’s performance is poor when the parameter e is set to smaller values This demonstrates that the data outliers can lower the learning ability of algorithms are consistent, so only the F H results of the proposed method and the compared methods on four initial datasets are presented in Fig From these results, we can obtain the following two observations: (1) On the two protein datasets DD and F194, the F H values are increasing as more and more features are selected In addition, the results no longer change or change little when HFSCN selecting more than 47 features (approximately 10%) 
(2) On the two image datasets VOC and Car196, the classification results are relatively good and stable when more than 20% of features are selected by the proposed method The second stage of the experiment explores the effects of the proposed algorithm on avoiding inter-level propagation The sparsity regularization parameter k and the outlier percentage e are fixed for different classification datasets The parent–child relationship constraint is emphasized in varying degrees The experimental results are shown in Fig 9, leading to the following conclusions: Based on these results, we select 10% of features for the protein datasets and 20% of features for the image datasets (1) The larger the value of a, the worse the classification effect Therefore, an appropriate paternity constraint can help HFSCN to achieve more effective feature subsets and avoid inter-level error propagation (2) An overemphasis on the shared common features of the parent class and child class will neglect the uniqueness between categories This conclusion is not very apparent for the VOCR dataset This could be attributed to the fact that the value of a is not sufficient enough 4.2 Effects of the outlier filtering mechanism and inter-level constraint The first phase of the experiment explores the effects of HFSCN on controlling the inter-level error propagation by removing the data outliers During the feature selection process, the values of both the sparsity regularization parameter k and the paternity regularization coefficient a are both fixed The outlier percentage parameter e is set to different values for each classification task From the results shown in Fig (where the exact outlier ratio in these synthetic datasets is shown as red dotted lines), we can obtain the following three observations: 4.3 Running time comparisons for selecting and testing features We discuss the performance of HFSCN and the compared methods on two metrics: the running time for selecting features and the running time for testing the discriminative ability of the selected features in classification processes (1) When the percentage of data outliers is set to or almost close to 10% (the exact outlier ratio in synthetic datasets), HFSCN obtains the best classification performance The larger the Fig F H results of HFSCN with different numbers of selected features: (a) DD; (b) F194; (c) VOC; (d) Car196 140 Neurocomputing 443 (2021) 131–146 X Liu and H Zhao Fig F H results of HFSCN with different percentages of outliers: (a) DD-R; (b) F194-R; (c) VOC-R; (d) Car196-R The exact outlier ratio in these synthetic datasets is shown on these figures as red dotted lines F H ; F LCA , and TIE In addition, we use two statistical evaluation methods to verify the significant differences among the proposed HFSCN method and the compared methods on the F H metric and all datasets Firstly, the classification results on four flat and hierarchical evaluation metrics are presented in Tables 6–9 From these results, the following four conclusions are drawn: According to the series of experiments, all the compared algorithms exhibited performance for the datasets with different noises in terms of the efficiency Thus, the efficiency of HFSCN and the other seven compared algorithms are only analyzed on the corrupted datasets with random noise in ½0; 10 Tables and present the experimental results, leading to the following two conclusions: (1) On the running time for feature selection, the algorithms that consider the relationships or filter the data outliers are more 
efficient than the HFSNM, HRelief, and HmRMR algorithms HFisher achieves comparable efficiency performance but is inferior to HFSCN in terms of the hierarchical classification effectiveness of the selected feature subsets; this will be discussed further in the following part (2) In terms of the running time for classifying test samples, the results of hierarchical methods with feature selection processes are comparable and significantly outperform the two baseline algorithms without any feature selection process Therefore, the time complexity of classification can be substantially decreased after the hierarchical feature selection The bigger the task, the more obvious the effect For example, the running time of classification after feature selection on Car196-R is two orders of magnitude faster than that of F-Baseline and H-Baseline (1) It is clear that the feature subset selected by HFSCN outperformed the other seven hierarchical feature selection algorithms on almost all the datasets With one exception of the Car196 dataset, HFSCN compared favorably with HFSRR-PS and HFSRR-P, and ranked third Therefore, the superiority of the proposed algorithm is proven (2) From the comparison of the HFSCN method and HFSRR-P, we concluded that the outlier constraint in HFSCN improved the quality of the selected feature subsets The comparison of HFSCN with HFS-O revealed that the parent–child relationship also enhanced the selected feature subsets’ quality Therefore, the positive effects of the two constraints in the proposed algorithm are verified (3) The classification effectiveness of the proposed method is equal to or slightly better than that of HFSRR-PS on all the experimental datasets This result illustrated that the constraint of eliminating the data outliers performed better than the sibling relationship constraint in the feature selection processes (4) On the F LCA and TIE metrics, the proposed algorithm HFSCN generally performed better than the other seven algorithms The results illustrated that HFSCN is more likely to classify 4.4 Performance comparisons using flat and hierarchical metrics We discuss the performance of the feature subsets selected using the proposed method and the seven compared methods on the flat evaluation metric Acc, and three hierarchical metrics 141 X Liu and H Zhao Neurocomputing 443 (2021) 131–146 Fig F H of HFSCN with different paternity regularization coefficient: (a) DD-R; (b) F194-R; (c) VOC-R; (d) Car196-R Table Running time for hierarchical feature selection (seconds, #) Algorithm DD-R F194-R VOC-R Car196-R HFSNM HRelief HFisher HmRMR HFSRR-PS HFSRR-P HFS-O HFSCN 87.4 68.6 0.5 32.5 1.3 4.4 4.2 4.5 1,444 405 71 30 25 30 1052 640 338 10 45 42 47 10,953 9751 22 10,387 1680 982 816 859 Table Running time for testing the feature subsets (seconds, #) Algorithm DD-R F194-R VOC-R Car196-R F-Baseline H-Baseline HFSNM HRelief HFisher HmRMR HFSRR-PS HFSRR-P HFS-O HFSCN 2.909 3.213 0.021 0.023 0.021 0.022 0.018 0.018 0.019 0.019 28.21 23.42 0.11 0.10 0.10 0.13 0.12 0.12 0.10 0.10 487.65 992.61 9.95 12.18 10.14 9.96 9.57 9.48 9.22 9.23 4024.6 4278.9 37.1 97.0 65.1 67.7 49.2 52.2 50.7 46.8 samples into the classes that shared closer ancestor nodes with the real classes and resulted in fewer wrongs Therefore, HFSCN could alleviate the problem of inter-level error propagation Secondly, the widely used statistical methods Friedman test [44] and Bonferroni-Dunn test [45] are used to obtain a clear analysis of the significant difference among our method and eight compared methods 
on multiple datasets The proposed method and 142 Neurocomputing 443 (2021) 131–146 X Liu and H Zhao Table Classification accuracies of all methods on sixteen datasets (%, ") Dataset HFSNM HRelief HFisher HmRMR HFSRR-PS HFSRR-P HFS-O HFSCN 68.61 66.13 68.44 68.94 66.30 68.60 68.61 68.61 69.10 68.61 68.94 33.24 33.87 DD 68.28 50.76 52.07 68.28 DD-R 59.34 24.79 59.67 63.96 DD-N 67.43 47.30 66.30 64.47 67.62 DD-U 61.67 35.55 67.11 62.49 68.44 68.11 68.44 F194 31.76 23.66 25.77 32.11 F194-R 25.92 18.17 17.25 31.97 33.66 33.31 33.66 33.24 F194-N 32.04 19.30 31.62 31.69 33.17 33.24 34.08 32.61 F194-U 28.31 23.80 32.89 32.82 33.17 33.03 VOC 40.33 40.29 39.34 40.84 33.31 41.53 41.51 VOC-R 41.16 41.70 40.27 40.94 42.15 42.37 41.68 41.72 VOC-N 40.90 41.31 40.53 41.17 41.21 41.14 VOC-U 41.47 41.24 39.04 40.00 42.23 43.35 Car196 66.05 66.33 66.38 66.22 42.47 66.95 41.49 42.06 66.54 66.98 Car196-R 64.79 66.72 66.29 66.53 66.76 66.97 66.72 66.16 64.49 66.58 66.54 66.60 66.66 66.82 66.33 67.09 Car196-N Car196-U 65.51 59.25 66.22 66.44 66.79 66.85 66.56 66.61 67.03 HFS-O HFSCN 86.06 86.17 69.26 34.58 33.59 33.73 42.31 42.35 41.74 Table F H results of all methods combined with SVM classifiers (%, ") Dataset HFSNM HRelief HFisher HmRMR HFSRR-PS HFSRR-P DD 86.12 79.51 79.56 85.79 86.01 85.95 86.01 DD-R 81.43 63.47 81.87 83.30 84.74 84.80 DD-N 76.26 84.80 83.53 85.46 85.73 DD-U 85.78 81.82 85.90 85.95 72.01 84.96 82.92 86.01 86.01 F194 69.72 67.72 68.43 70.26 F194-R 66.92 62.75 64.95 70.07 70.68 70.87 70.68 70.85 F194-N 70.14 64.77 70.21 70.00 70.56 70.59 71.10 70.85 71.34 67.24 67.48 67.78 86.23 70.54 71.20 70.19 85.95 86.39 70.92 71.60 70.89 F194-U 68.22 66.92 70.75 70.26 VOC 66.26 65.85 65.76 66.69 71.15 67.29 VOC-R 66.72 67.09 66.21 66.70 67.22 67.37 VOC-N 66.49 66.69 66.22 66.57 66.81 66.87 67.22 67.45 66.77 67.24 67.42 68.10 82.84 82.40 82.78 82.10 82.80 82.95 VOC-U 66.61 66.72 65.20 66.15 Car196 81.96 82.15 82.13 82.13 67.45 82.84 Car196-R 80.98 82.50 82.35 82.33 82.56 82.54 82.43 Car196-N 82.10 80.89 82.39 82.32 Car196-U 81.44 77.57 82.08 82.32 82.40 82.68 Ranks 6.094 6.344 5.969 67.76 82.43 82.60 82.79 3.063 82.74 3.063 3.219 1.250 Table TIE results of all hierarchical feature selection methods (Edges, #) Dataset HFSNM HRelief HFisher HmRMR HFSRR-PS HFSRR-P HFS-O HFSCN DD 0.833 1.229 1.227 0.853 0.840 0.843 0.839 DD-R 1.114 2.192 1.088 1.002 0.915 0.912 0.836 0.830 DD-N 1.424 0.912 0.988 0.872 0.856 DD-U 0.853 1.091 0.846 0.843 1.679 0.903 1.025 0.839 0.839 F194 1.817 1.937 1.894 1.785 1.759 1.749 1.765 1.734 F194-R 1.985 2.235 2.103 1.796 1.759 1.748 F194-N 1.792 2.114 1.787 1.800 1.766 1.731 2.158 F194-U 1.907 1.985 1.755 1.785 VOC 2.216 2.237 2.271 2.188 2.166 0.843 0.816 0.826 1.768 1.745 1.728 1.789 1.746 1.704 1.749 1.720 2.146 2.151 2.121 2.127 VOC-R 2.188 2.166 2.221 2.187 2.159 2.145 VOC-N 2.208 2.197 2.233 2.200 2.191 2.187 2.157 VOC-U 2.200 2.186 2.303 2.240 2.143 2.094 Car196 1.083 1.071 1.072 1.072 2.136 1.030 2.180 2.157 1.030 1.056 Car196-R 1.141 1.050 1.059 1.060 1.047 1.047 1.032 1.023 1.056 1.075 1.061 1.056 1.039 1.054 1.033 1.074 1.054 1.035 1.044 1.033 Car196-N 1.074 1.147 Car196-U 1.113 1.346 1.061 143 X Liu and H Zhao Neurocomputing 443 (2021) 131–146 Table F LCA results of all hierarchical feature selection methods (%, ") Dataset HFSNM HRelief HFisher HmRMR HFSRR-PS HFSRR-P HFS-O HFSCN DD 82.49 73.34 73.80 82.32 82.54 DD-R 77.16 56.67 77.49 79.64 81.08 82.46 82.65 81.17 82.48 82.51 82.57 82.79 DD-N 82.04 70.56 81.16 79.92 81.94 DD-U 78.13 64.52 81.52 78.96 82.49 82.24 82.49 
Secondly, the widely used Friedman test [44] and Bonferroni-Dunn test [45] are employed to analyze whether the differences between our method and the seven compared methods are statistically significant on multiple datasets. The proposed method and the compared methods are tested in terms of the classification results on F_H according to the following procedure of the Friedman test and the Bonferroni-Dunn test:

(1) All of the compared methods are ranked, and the average rank of each algorithm is computed over all the experimental datasets. The best-performing algorithm is given the rank of 1, the second best the rank of 2, and so on; in case of a tie, average ranks are assigned.

(2) A null hypothesis is formulated: all the algorithms are equivalent.

(3) We compute the Friedman statistic and obtain F_F = 40.7548 under the null hypothesis.

(4) The critical value of the F-distribution is calculated with α = 0.05: F = finv(0.95, K − 1, (K − 1)(N − 1)) = 2.0980, where K = 8 and N = 16 are the numbers of algorithms and datasets, respectively.

(5) If F_F > F, the null hypothesis is rejected and the algorithms are not considered equivalent. The following steps of the Bonferroni-Dunn test can be carried out after rejecting the null hypothesis.

(6) The critical distance (CD) is calculated: CD = q_α √(K(K + 1)/(6N)) = 2.3296, where the critical value q_α = q_0.05 = 2.690 is taken from Table 5(b) in [44]. If the difference between the average ranks of two algorithms exceeds the critical distance, the performance of the two algorithms is significantly different.

(7) The average ranks of all the algorithms, together with the critical distance (CD), are plotted as shown in Fig. 10; a computational sketch of these steps is given below.

Fig. 10. Average ranks of all algorithms with the critical distance (CD) for F_H.

Then, we can obtain the following observations. The F_H results of HFSCN, HFSRR-P, HFS-O, and HFSRR-PS are statistically better than those of HRelief, HFisher, HFSNM, and HmRMR. There is no consistent evidence to indicate statistical differences among HFSCN, HFSRR-PS, HFSRR-P, and HFS-O on the F_H metric.
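The ranking and threshold computations above follow Demšar [44] and can be reproduced with a few lines of code. In the sketch below, the score matrix is random placeholder data rather than the actual F_H results, and the Bonferroni-Dunn constant q_0.05 = 2.690 for eight classifiers is taken from the text above; only the critical value and the critical distance are data-independent and should reproduce the numbers reported in steps (4) and (6).

```python
import numpy as np
from scipy.stats import f as f_dist, rankdata


def friedman_bonferroni_dunn(scores, q_alpha=2.690):
    """scores: (N datasets, K algorithms), higher is better.
    Returns the average ranks, Friedman statistic F_F, F critical value, and CD."""
    n, k = scores.shape
    # Rank the algorithms on each dataset (rank 1 = best); ties receive average ranks.
    ranks = np.vstack([rankdata(-row) for row in scores])
    avg_ranks = ranks.mean(axis=0)
    # Friedman chi-square statistic and its F-distributed refinement.
    chi2 = 12 * n / (k * (k + 1)) * (np.sum(avg_ranks ** 2) - k * (k + 1) ** 2 / 4)
    f_f = (n - 1) * chi2 / (n * (k - 1) - chi2)
    f_crit = f_dist.ppf(0.95, k - 1, (k - 1) * (n - 1))        # alpha = 0.05
    # Bonferroni-Dunn critical distance for comparing average ranks.
    cd = q_alpha * np.sqrt(k * (k + 1) / (6 * n))
    return avg_ranks, f_f, f_crit, cd


rng = np.random.default_rng(0)
placeholder_scores = rng.random((16, 8))                       # N = 16 datasets, K = 8 methods
_, f_f, f_crit, cd = friedman_bonferroni_dunn(placeholder_scores)
print(f"F critical value = {f_crit:.4f}, CD = {cd:.4f}")       # ~2.098 and ~2.3296
```

Applied to the real F_H table, the resulting average ranks are compared pairwise: two methods whose ranks differ by more than CD = 2.3296 are judged significantly different, which is how the grouping reported above and plotted in Fig. 10 is obtained.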
4.5 Convergence analysis of the proposed method

This section provides results that demonstrate the convergence of the iterative optimization procedure of the HFSCN algorithm. The objective function value of Eq. (9) is taken as the evaluation criterion. Fig. 11 shows the convergence curves on the small dataset DD-R and the large dataset Car196-R. The results demonstrate that the objective function value decreases monotonically and converges within no more than ten iterations. Therefore, the efficiency of the proposed method is verified.

Fig. 11. Convergence curves of the HFSCN objective function value: (a) DD-R; (b) Car196-R.
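For intuition about why such iterations both converge quickly and suppress outliers, the sketch below shows one standard iteratively re-weighted treatment of a capped ℓ2-type loss with a per-sample diagonal weight matrix S, which is the role S plays in HFSCN. It is only a schematic example on synthetic data: the objective of Eq. (9) additionally contains the hierarchical parent–child regularization, and the exact update rules, parameter values (ε, λ), and function names used here are illustrative assumptions rather than the paper's algorithm.

```python
import numpy as np


def capped_l2_feature_weights(X, Y, eps=5.0, lam=0.1, n_iter=15):
    """Iteratively re-weighted solver for the illustrative objective
    min_W  sum_i min(||x_i W - y_i||_2, eps) + lam * ||W||_F^2."""
    m, d = X.shape
    W = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)      # plain ridge start
    history = []
    for _ in range(n_iter):
        residual = np.linalg.norm(X @ W - Y, axis=1)             # per-sample error
        history.append(np.sum(np.minimum(residual, eps)) + lam * np.sum(W ** 2))
        # Samples whose residual exceeds the cap get zero weight in S, so
        # suspected outliers stop influencing the next estimate of W.
        s = np.where(residual < eps, 1.0 / (2.0 * np.maximum(residual, 1e-8)), 0.0)
        S = np.diag(s)                                           # m x m diagonal matrix
        W = np.linalg.solve(X.T @ S @ X + lam * np.eye(d), X.T @ S @ Y)
    return W, history


rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
Y = X @ rng.normal(size=(10, 3)) + 0.1 * rng.normal(size=(200, 3))
Y[:5] += 50.0                                                    # a few gross outliers
W, history = capped_l2_feature_weights(X, Y)
print(np.round(history[:5], 2))   # monitored objective; it decreases and levels off quickly
```

Materializing the full m × m matrix S ties the per-iteration cost to the number of training samples, which is the scalability limitation discussed in the conclusions below.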
Conclusions and future work

In this paper, we proposed a robust hierarchical feature selection method with a capped ℓ2-norm (HFSCN). HFSCN first divided the complex hierarchical classification task into several simpler sub-classification tasks according to the class hierarchy information. Then, the capped ℓ2-norm based loss function and the parent–child relationship constraint regularization were exploited to exclude the data outliers and decrease the inter-level error propagation. Finally, HFSCN was compared with seven existing hierarchical feature selection methods on 16 datasets from different fields. The experimental results validated that HFSCN could efficiently select robust and effective feature subsets for hierarchical classification tasks.

In the HFSCN method, the parameters ε, λ, and α are not automatically adaptive to the task. In addition, the diagonal matrix S with m diagonal elements (where m equals the number of samples in the training set) used for updating the feature weighted matrix could limit the scale of the hierarchical classification task. We will focus on solving these problems in future work.

CRediT authorship contribution statement

Xinxin Liu: Methodology, Software, Validation, Writing - original draft. Hong Zhao: Conceptualization, Data curation, Supervision, Writing - review & editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

This work was supported by the National Natural Science Foundation of China under Grant No. 61703196, the Natural Science Foundation of Fujian Province under Grant No. 2018J01549, and the President's Fund of Minnan Normal University under Grant No. KJ19021.

References

[1] Y. Wang, Q. Hu, Y. Zhou, H. Zhao, Y. Qian, J. Liang, Local Bayes risk minimization based stopping strategy for hierarchical classification, in: IEEE International Conference on Data Mining, 2017, pp. 515–524.
[2] Q. Hu, Y. Wang, Y. Zhou, H. Zhao, Y. Qian, J. Liang, Review on hierarchical learning methods for large-scale classification task, Sci. Sin. 48 (2018) 7–20.
[3] W. Wang, G. Zhang, J. Lu, Hierarchy visualization for group recommender systems, IEEE Trans. Syst. Man Cybern. Syst. (2018) 1–12.
[4] J. Xuan, X. Luo, J. Lu, G. Zhang, Explicitly and implicitly exploiting the hierarchical structure for mining website interests on news events, Inf. Sci. 420 (2017) 263–277.
[5] M. Harandi, M. Salzmann, R. Hartley, Dimensionality reduction on SPD manifolds: the emergence of geometry-aware methods, IEEE Trans. Pattern Anal. Mach. Intell. 40 (2018) 48–62.
[6] C. Jie, J. Luo, S. Wang, Y. Sheng, Feature selection in machine learning: a new perspective, Neurocomputing 300 (2018) 70–79.
[7] L. Wang, Q. Mao, Probabilistic dimensionality reduction via structure learning, IEEE Trans. Pattern Anal. Mach. Intell. 41 (2019) 205–219.
[8] C. Zhang, H. Fu, Q. Hu, P. Zhu, X. Cao, Flexible multi-view dimensionality co-reduction, IEEE Trans. Image Process. 26 (2016) 648–659.
[9] A. Barbu, Y. She, L. Ding, G. Gramajo, Feature selection with annealing for computer vision and big data learning, IEEE Trans. Pattern Anal. Mach. Intell. 39 (2017) 272–286.
[10] J. Yu, Manifold regularized stacked denoising autoencoders with feature selection, Neurocomputing 358 (2019) 235–245.
[11] A. Sun, E.P. Lim, Hierarchical text classification and evaluation, in: IEEE International Conference on Data Mining, 2001.
[12] A. Sun, E.P. Lim, W.K. Ng, J. Srivastava, Blocking reduction strategies in hierarchical text classification, IEEE Trans. Knowl. Data Eng. 16 (2004) 1305–1308.
[13] Y. Qu, L. Lin, F. Shen, C. Lu, Y. Wu, Y. Xie, D. Tao, Joint hierarchical category structure learning and large-scale image classification, IEEE Trans. Image Process. (2017) 1–16.
[14] K. Kira, L.A. Rendell, A practical approach to feature selection, Mach. Learn. Proc. 48 (1992) 249–256.
[15] H. Peng, F. Long, C. Ding, Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy, IEEE Trans. Pattern Anal. Mach. Intell. 27 (2005) 1226–1238.
[16] Z. Cai, W. Zhu, Multi-label feature selection via feature manifold learning and sparsity regularization, Int. J. Mach. Learn. Cybern. (2018) 1321–1334.
[17] A. Faeze, S. Ali, An effective feature selection method for web spam detection, Knowl.-Based Syst. 166 (2019) 198–206.
[18] F. Nie, H. Huang, X. Cai, C. Ding, Efficient and robust feature selection via joint ℓ2,1-norms minimization, in: International Conference on Neural Information Processing Systems, 2010, pp. 1813–1821.
[19] G. Lan, C. Hou, F. Nie, T. Luo, D. Yi, Robust feature selection via simultaneous capped norm and sparse regularizer minimization, Neurocomputing (2018) 228–240.
[20] H. Zhao, P. Zhu, P. Wang, Q. Hu, Hierarchical feature selection with recursive regularization, in: International Joint Conference on Artificial Intelligence, 2017, pp. 3483–3489.
[21] H. Zhao, Q. Hu, P. Zhu, Y. Wang, P. Wang, A recursive regularization based feature selection framework for hierarchical classification, IEEE Trans. Knowl. Data Eng. (2020) 1–13.
[22] Q. Tuo, H. Zhao, Q. Hu, Hierarchical feature selection with subtree based graph regularization, Knowl.-Based Syst. 163 (2019) 996–1008.
[23] C. Freeman, D. Kulic, O. Basir, Joint feature selection and hierarchical classifier design, in: IEEE International Conference on Systems, 2011, pp. 1–7.
[24] L. Grimaudo, M. Mellia, E. Baralis, Hierarchical learning for fine grained internet traffic classification, in: International Wireless Communications and Mobile Computing Conference, 2012, pp. 463–468.
[25] H. Zhao, P. Wang, Q. Hu, P. Zhu, Fuzzy rough set based feature selection for large-scale hierarchical classification, IEEE Trans. Fuzzy Syst. 27 (2019) 1891–1903.
[26] J. Fan, J. Zhang, K. Mei, J. Peng, L. Gao, Cost-sensitive learning of hierarchical tree classifiers for large-scale image classification and novel category detection, Pattern Recognit. 48 (2015) 1673–1687.
[27] Z. Yu, J. Fan, Z. Ji, X. Gao, Hierarchical learning of multi-task sparse metrics for large-scale image classification, Pattern Recognit. 67 (2017) 97–109.
[28] A. Kosmopoulos, I. Partalas, E. Gaussier, G. Paliouras, I. Androutsopoulos, Evaluation measures for hierarchical classification: a unified view and novel approaches, Data Min. Knowl. Discov. 29 (2015) 820–865.
[29] C.N. Silla, A.A. Freitas, A survey of hierarchical classification across different application domains, Data Min. Knowl. Discov. 22 (2011) 31–72.
[30] M. Everingham, L.V. Gool, C.K.I. Williams, J.M. Winn, A. Zisserman, The PASCAL visual object classes (VOC) challenge, Int. J. Comput. Vis. 88 (2010) 303–338.
[31] X. Zhu, X. Wu, Class noise vs. attribute noise: a quantitative study of their impacts, Artif. Intell. Rev. 22 (2004) 177–210.
[32] F. Nie, X. Wang, H. Huang, Multiclass capped ℓp-norm SVM for robust classifications, in: AAAI Conference on Artificial Intelligence, 2017, pp. 2415–2421.
[33] F. Nie, Z. Huo, H. Huang, Joint capped norms minimization for robust matrix recovery, in: International Joint Conference on Artificial Intelligence, 2017, pp. 2557–2563.
[34] A. Argyriou, T. Evgeniou, M. Pontil, Multi-task feature learning, in: Neural Information Processing Systems, 2006, pp. 41–48.
[35] D. Li, Y. Ju, Q. Zou, Protein folds prediction with hierarchical structured SVM, Curr. Proteom. 13 (2016) 79–85.
[36] C. Ding, I. Dubchak, Multi-class protein fold recognition using support vector machines and neural networks, Bioinformatics 17 (2001) 349–358.
[37] J. Krause, M. Stark, D. Jia, F. Li, 3D object representations for fine-grained categorization, in: IEEE International Conference on Computer Vision Workshops, 2013, pp. 554–561.
[38] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, in: International Conference on Learning Representations, 2015, pp. 1–13.
[39] J. Deng, W. Dong, R. Socher, L. Li, K. Li, F. Li, ImageNet: a large-scale hierarchical image database, in: IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2009, pp. 248–255.
[40] R.O. Duda, P.E. Hart, D.G. Stork, Pattern Classification, 2nd ed., Wiley, 2001.
[41] J.C. Gomez, M.F. Moens, Hierarchical classification of web documents by stratified discriminant analysis, in: Information Retrieval Facility Conference, 2012, pp. 94–108.
[42] O. Dekel, J. Keshet, Y. Singer, Large margin hierarchical classification, in: International Conference on Machine Learning, 2004, pp. 27–36.
[43] A.V. Aho, J.E. Hopcroft, J.D. Ullman, On finding lowest common ancestors in trees, SIAM J. Comput. (1976) 115–132.
[44] J. Demsar, Statistical comparisons of classifiers over multiple data sets, J. Mach. Learn. Res. (2006) 1–30.
[45] O.J. Dunn, Multiple comparisons among means, J. Am. Stat. Assoc. 56 (1961) 52–64.

Xinxin Liu received the B.E. degree from Fujian University of Technology, China, in 2016, and the M.E. degree from Minnan Normal University, China, in 2020. She is currently pursuing the Ph.D. degree with the School of Computer Science and Engineering, Nanjing University of Science and Technology. Her current research interests include data mining and machine learning.

Hong Zhao received the M.S. and Ph.D. degrees from Liaoning Normal University and Tianjin University, China, in 2006 and 2019, respectively. She is a professor with the School of Computer Science, Minnan Normal University, Zhangzhou, China. She has authored over 40 journal and conference papers in the areas of granular computing based machine learning and cost-sensitive learning. Her current research interests include rough sets, granular computing, and data mining for hierarchical classification.