1 Department of Computer Science, ITER, Siksha ‘O’ Anusandhan University, Bhubaneswar, India
2 Department of Information and Communication Technology, Fakirmohan University, Vyasa Vihar, Balasore, Odisha, India
3 School of Computer Engineering, KIIT University, Bhubaneswar, Odisha, India
This paper studies the relevance of feature selection algorithms in microarray data for effective analysis. With no loss of generality, we present a list of feature selection algorithms and propose a generic categorizing framework that systematically groups algorithms into categories. The generic categorizing framework is based on search strategies and evaluation criteria. Further, it provides guidelines for selecting feature selection algorithms in general and in specific to the context of this study. In the context of microarray data analysis, the feature selection algorithms are classified into soft and non-soft computing categories. Their performance analysis with respect to microarray data analysis has been presented.
We summarize this study by highlighting pointers to recent trends and challenges of feature selection research and development in microarray data.
Keywords: Microarray data, Data mining, Data mining tasks, Feature selection, Search strategies, Soft computing technique, Non-soft computing technique.
open-access license: This is an open access article distributed under the terms of the Creative Commons Attribution 4.0 International Public License (CC-BY 4.0), a copy of which is available at: (https://creativecommons.org/licenses/by/4.0/legalcode). This license permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
* Address correspondence to this author at the Department of Computer Science, ITER, Siksha ‘O’ Anusandhan University, Bhubaneswar, India, Tel: +8895278059; E-mail: email@example.com
Microarray data is a high throughput technology used in cancer research for diagnosis and prognosis of disease [1Bosio M, Bellot P, Salembier P, Verges AO. Microarray classification with hierarchical data representation and novel feature selection criteria 2012; In: IEEE 12th International Conference on Bioinformatics & Bioengineering (BIBE); 2012; pp. 344-9.[http://dx.doi.org/10.1109/BIBE.2012.6399648] ]. It offers the ability to simultaneously measure thousands of gene expression values. The microarray data analysis is structured, based on basic four steps: First, the raw data generated from the instruments need to be imported and parsed using specific libraries. Secondly, such data are preprocessed partially eliminating the noise, and data are annotated with biological information stored in specific libraries. Finally, data-mining algorithms extract biological information from annotated data [2Guzzi PH, Cannataro M. Challenges in Microarray Data Management and Analysis. Computer-Based Medical Systems 2011; 24(3): 1-6.]. DNA microarray data, which provides gene expression values, can be used for prediction of disease outcome, classification of cancer types and identification of relevant genes [3Liang S, Ma A, Yang S, Wang Y, Ma Q. A Review of Matched-pairs Feature Selection Methods for Gene Expression Data Analysis. Comput Struct Biotechnol J 2018; 16: 88-97.[http://dx.doi.org/10.1016/j.csbj.2018.02.005] ]. Since Microarray data consists of hundreds of thousands of features in comparison to the number of samples, it undergoes the curse of dimensionality problem. However, a very small number of features contribute to the classifier or which is relevant to an outcome of interest. For example, one or two genes behavior may be responsible for a particular cancer type such as p53 which act as a biomarker in Lung cancer dataset and expressed differentially so instead of taking the uncorrelated genes with the biomarker we should make a subset of correlated genes which can give accurate classification accuracy. Thus, effective feature selection techniques are often needed in this case to aid to correctly classify different tumor types and consequently lead to a better understanding of genetic signatures as well as improve treatment strategies [4Jain I, Jain VK, Jain R. Correlation feature selection based improved-Binary Particle Swarm Optimization for gene selection and cancer classification. Appl Soft Comput 2018; 62: 203-15.[http://dx.doi.org/10.1016/j.asoc.2017.09.038] ].
A feature in a dataset is an individual measurable property of the process being observed [5Chandrashekar G, Sahin F. A survey on feature selection methods. Comput Electr Eng 2014; 40: 1-28.[http://dx.doi.org/10.1016/j.compeleceng.2013.11.024] ]. The feature (gene) selection process helps in understanding data, reducing computational requirement, reducing the effect of the curse of dimensionality and improving the performance of the classifier. The process of removing irrelevant, noisy and redundant features and finding the relevant feature subset based on which the learning algorithm improves its
performance is known as feature selection. Feature selection is primarily applied on the dataset comprises of thousands of features with small sample size (e.g., microarray data). Where feature selection is termed as gene selection or Biomarker selection. The intentions of feature selection are manifold: some of the important objectives are first to exclude overfitting and enhance model performance i.e., prediction accuracy in the case of supervised classification and better cluster detection in the case of clustering. Second, to provide faster and better cost-effective models and third to gain a deeper insight into the underlying processes that generated the data [6Liang S, Ma A, Yang S, Wang Y, Ma Q. A review of matched-pairs feature selection methods for gene expression data analysis. Comput Struct Biotechnol J 2018; 16: 88-97.[http://dx.doi.org/10.1016/j.csbj.2018.02.005] ]. Instead of the conventional method of feature selection, soft computing techniques (i.e. fuzzy logic, neural networks, and evolutionary algorithms) are also applied for feature selection on high dimensional data. In this case, the high dimensional dataset is converted into low dimensional dataset by selecting an optimal number of genes. To accomplish the optimal gene selection process, several evolutionary algorithms are widely used for high dimensional dataset such as genetic algorithm, ant colony optimization algorithm, particle optimization algorithm etc. Feature selection is a complex process in high dimensional data set as there is multifarious interaction among features for large search space. Therefore, exhaustive search technique is quite unfeasible for these situations. To resolve this issue conventional searching techniques are used such as sequential forward selection and sequential backward selection. However, these search techniques suffer from stagnation in local optima and high computational cost. In order to address the issues in the above searching techniques, global search methods like evolutionary algorithms are used. In genetic algorithm an evolutionary search using genetic operators combined with a fitness function searches and evaluates the selected features to get the optimal feature subset. Feature selection process in ant colony optimization is a pathfinding problem. Where the ants lay pheromone along the edges of the graph and the traces get thicker when the path leads towards the optimal path (optimal features). Particle swarm optimization adapts a random search strategy for feature selection. The particles move in a multidimensional space attracted towards the global best position. The movement of the particles is influenced by the previous experience and by other particles and this approach is used for feature selection. These evolutionary algorithms are associated with many hybrid approaches for reducing the number of features. The current study presents the fundamental taxonomy of feature selection, along with reviews the consortium of state-of-the-art feature selection methods present in the literature into two categories: non soft computing and soft computing approach of feature selection.
The rest of the paper is set out as follows. Section 2, describes the objective of feature selection and its relevance for model generation. Section 3, discusses the steps of feature selection and categorization of the algorithms into soft and non-soft computing methods along with their performance analysis. Section 4, deals with the feature selection methods used for microarray data. Section 5, highlights pointers to recent trends and challenges in feature selection.
2. FEATURE SELECTION AND ITS RELEVANCE
Traditionally manual management of the high dimensional data set is more challenging. With the advent of data mining and machine learning techniques, knowledge discovery and recognition of patterns from these data can be done automatically [7Alelyani S, Tang J, Liu H. Feature selection for clustering: A review Data Clustering: Algorithms and Applications 2013.]. However, the data in the database is filled with a high level of noise and redundancy. One of the reasons causing noise in these data is an imperfection in the technologies that collected the data and the source of the data itself is another reason. Dimensionality reduction is one of the famous techniques to remove noisy (i.e. irrelevant) and redundant features. For data mining techniques such as classification and clustering dimensionality reduction is treated as preprocessing task for better performance of the model. Dimensionality reduction techniques can be classified mainly into feature extraction and feature selection. Feature extraction approaches set features into a new feature space with lower dimensionality and the newly constructed features are usually combinations of original features. On the other hand, the objective of feature selection approaches is to select a subset of features that minimize redundancy and maximize relevance to the target such as the class labels in classification. Therefore, both feature extraction and feature selection are capable of improving learning performance, lowering computational complexity, building better-generalized models, and decreasing required storage. Fig. (1) shows the classification of dimension reduction process and the data set in which these are generally applied in the literature. Feature selection selects a group of features from the original feature set without any changeover and maintains the physical meanings of the original features. Therefore, feature selection is superior in terms of better readability and interpretability [8Masaeli M, Dy JG, Fung G. From transformation-based dimensionality reduction to feature selection. Proceedings of the 27th International
Conference on Machine Learning. pp. 751-8., 9Dash M, Liu H. Feature selection for classification. Intelligent data analysis 1997; 1(1- 4): 131-56.[http://dx.doi.org/10.1016/S1088-467X(97)00008-5] ]. One of the applications would be in gene microarray data analysis [10Hu L, Gao W, Zhao K, Zhang P, Wang F. Feature selection considering two types of feature relevancy and feature interdependency. Exp Sys Appl 2018; 93: 423-34.[http://dx.doi.org/10.1016/j.eswa.2017.10.016] -14Lazar C, Taminau J, Meganck S, Steenhoff D, Coletta A, Molter C. A survey on filter techniques for feature selection in gene expression microarray analysis IEEE/ACM Trans Computational Biology and Bioinformatics 2012; 9(4): 1106 -19.[http://dx.doi.org/10.1109/TCBB.2012.33] ]. Feature selection has its significance in many real-world applications such as finding relevant genes to a specific disease in Microarray data, analysis of written text, and analysis of medical images, analysis of the image for face recognition and for weather forecasting.
Taxonomy of dimension reduction techniques suitable for different datasets.
There are a number of different definitions in the machine learning literature for relevant feature selection. The feature is relevant which correlate to the target concept. The target concept (tc) depends upon the problem and the requirement. So, the simplest conviction of relevance is a perception of being “relevant to the tc”.
Definition 1: (Relevant to the target): A feature xi is relevant to a target concept (tc) if there exists a pair of examples A and B in the instance space such that A and B differ only in their assignment to x and tc(A)! = tc(B).
Definition 2: (Strongly Relevant to the sample)
A feature xi is strongly relevant to sample S if their present examples A and B in S differ only in their assignment to x and have different labels (or have several distributions of labels if they appear in S multiple times). Similarly, x is highly relevant to tc and distribution D if there exist examples A and B having non-zero probability over D that differ only in their assignment to x and satisfy tc(A)!=tc(B).
Definition 3: (Weakly Relevant to the sample):
A feature xi is weakly relevant to sample S (or target tc and distribution D) if there is a possibility of removal of a subset of the features so that xi becomes strongly relevant.
Definition 4: (Relevance as a complexity measure)
Given a paper of data D and a set of concept C, let r (D, C) be the number of features relevant using Definition 1 to a concept in C that out of all those whose error over D is least, has the fewest relevant features.
In this article different existing FS methods defined by many others are universal and compared. The subsequent list which is theoretically different and covers up a variety of definitions are given below.
General Definition: A large set of (D) items is given from which we need to find a small subset (d) being optimal in a certain sense [15Guzzi PH, Agapito G, Cannataro M. core SNP: Parallel Processing of Microarray Data. IEEE Trans Comput 2014; 63(12): 2961-74.[http://dx.doi.org/10.1109/TC.2013.176] ].
Generic Definition: FS consists of identifying the set of features whose expression levels are indicated to a particular target feature (clinical/biological annotation). Mathematically, this problem can be viewed as follows: Let
be a matrix containing‘m’ features and ‘n’ samples generating from different groups started by a target annotation. Selection of the most informative genes consists of identifying the subset of genes across the entire population of samples
which are the most discriminative for the outlined classes. This definition is only valid for classification problems where the groups are clearly identified beforehand [16Lazar C, Taminau J, Meganck S, et al. A survey on filter techniques for feature selection in gene expression microarray analysis. IEEE/ACM Trans Comput Biol Bioinformatics 2012; 9(4): 1106-19.[http://dx.doi.org/10.1109/TCBB.2012.33] [PMID: 22350210] ].
2.1. Feature Selection based on Relevance and Redundancy
Relevance definitions divide features into three categories such as strongly relevant, weakly relevant and irrelevant features. Redundancy definition further divides weakly relevant features into redundant and non-redundant ones. Relevance analysis determines the subset of relevant features by removing the irrelevant ones, and redundancy analysis determines and eliminates redundant features from relevant ones and thus produces the final subset [12Ding C, Peng H. Minimum redundancy feature selection from microarray gene expression data. J Bioinform Comput Biol 2005; 3(2): 185-205.[http://dx.doi.org/10.1142/S0219720005001004] [PMID: 15852500] ].
Idealized: uncover the minimally sized feature subset that is necessary and sufficient to the target concept [17Kira K, Rendell LA. The feature selection problem: Traditional methods and a new algorithm. Proceedings of Ninth National Conference on Artificial Intelligence 129-34.].
Classical: choose M features from a set of N features (where M < N), such that the value of a criterion function is optimized over all subsets of size M [18Koller D, Sahami M. Toward optimal feature selection. ICML’96 Proceedings of the 13th International Conference on International
Conference on Machine Learning. Bari, Italy. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc. 1996; pp. 284-92.].
2.2. Improving Prediction Accuracy
The objective of feature selection is to improve the prediction ability of the classifier by reducing the structure size. The final training feature subset is constructed using the selected features only [19Narendra PM, Fukunaga K. A branch and bound algorithm for feature selection. IEEE Trans Comput 1977; 9(26): 917-22.[http://dx.doi.org/10.1109/TC.1977.1674939] ].
3. FEATURE SELECTION STAGES AND CLASSIFICATIONS
There are four basic stages in feature selection method: (i) Generation Procedure (GP), to select candidate feature subset (ii) Evaluation Procedure (EP), to evaluate the generated candidate feature subset and output, a relevancy value (iii) Stopping Criteria (SC): To determine when to stop (iv) Validation Procedure (VP): To determine whether it is the optimal feature subset or not. The process of feature selection approach is given in (Fig. 2).
This procedure generates a subset of features that is relevant to the target concept. GP are of two types
3.1.1. Individual Ranking
Measures the relevance of each feature. The feature relevance is measured based on some evaluation function. In this case, each individual feature is evaluated by assigning some weight or score.
3.1.2. Subset Selection
A subset of features is selected based on some search strategy. If the size of the data set is N×M, then a total number of features in the data set is N. The possible number of subsets of features is 2N. This is even very large for a medium sized feature set. Therefore suitable search strategy is applied to this process. The search is classified as:
A. Complete: It traverses all the feasible solutions. This procedure does an exhaustive search for the best possible subset pertaining to the evaluation function. Example of complete search is a branch and bound best first search.
B Heuristic Deterministic: uses a greedy strategy to select features according to local change. There are many alternatives to this straightforward method, but the creation of subset is basically incremental. Examples of this procedure are sequential forward selection, sequential backward selection, sequential floating forward selection, and sequential floating backward selection.
C. Nondeterministic (Random): It attempts to find an optimal solution in a random fashion. This procedure is new in the field of feature selection methods compared to the above two categories. Optimality of the selected subset depends on the resources available.
3.2. Evaluation Procedure (EP)
An optimal subset is always relative to a certain evaluation function. An evaluation function tries to measure the discriminating ability of a feature or a subset to distinguish the different class labels. The evaluation function is categorized as distance, information (or uncertainty), dependence, consistency, and classifier error rate [25Guyon I. Gene selection for cancer classification using support vector machines. Mach Learn 2002; 46(1-3): 389-422.[http://dx.doi.org/10.1023/A:1012487302797] , 26Peng Y, Wu Z, Jiang J. A novel feature selection approach for biomedical data classification. J Biomed Inform 2010; 43(1): 15-23.[http://dx.doi.org/10.1016/j.jbi.2009.07.008] [PMID: 19647098] ].
3.2.1. Distance Measures
For a two-class problem say A and B are two features, then A and B are selected on the basis of their distance (e.g. Euclidian distance). If the distance is zero then the features are said to be redundant and ignored. The higher the distance the more the features are discriminating.
3.2.2. Information Measures
This determines the information gain for the feature. Feature A is preferred over feature B if the information gain of A is more than B (e.g. entropy measure).
3.2.3. Dependence Measures
Dependence or correlations of the ability to predict the value of one variable from the value of another. If the correlation of feature A with class C is higher than the correlation of feature B with class C then feature A is preferred to B.
This measure finds the minimally sized subset that satisfies the acceptable inconsistency rate that is usually set by the user.
3.2.4. Consistency Measure
This measure finds the minimally sized subset that satisfies the acceptable inconsistency rate that is usually set by the user.
3.2.5. Classifier Error Rate
The evaluation function is the classifier itself. It measures the accuracy of the classifier for different subsets of feature set and measures the error rate for the different subset.
We have classified the feature selection method as non-soft computing based and soft computing based. Based on the generation procedure and evaluation function, the feature selection methods are classified, where the generation procedure and evaluation functions are two dimensions.
3.2.6. Stopping Criteria
It indicates the end of the process. Commonly used stopping criteria are: (i) When the search completes (ii) When some given bound (minimum number of features or a maximum number of iterations) is reached. (iii) When a subsequent addition (or deletion) of any feature does not produce a better subset and (iv) When a sufficiently good subset (e.g. a subset if its classification error rate is less than the allowable error rate for a given task) is selected.
Consulting Table 1, the feature selection approaches are primarily categorized as a filter, wrapper, and embedded method. Recently other feature selection methods are gaining popularity i.e., hybrid and ensemble methods (Fig. 3).
Table 1 Classification of feature selection methods based on combination of GP and EF.
Classification of feature selection methods.
A. Filter Method
Filter method deals with individual ranking as well as subset selection. The individual ranking is based on the evaluation functions such as distance, information, dependence, and consistency excluding the classifier (Fig. 3). Filter techniques judge the relevance of genes by looking only at the intrinsic properties of the data. In microarray data, a gene relevance score is calculated, and low-scoring genes are removed. Afterward, this subset of genes is presented as input to the classification algorithm. The filtering technique can be used as a pre-processing step to reduce space dimensionality and overcome overfitting. The filter approach is commonly divided into two different sub-classes: individual evaluation and subset evaluation [20Yu L, Liu H. Efficient feature selection via analysis of relevance and redundancy. J Mach Learn Res 2004; 5: 1205-24.]. In individual evaluation method, the gene expression dataset is given as input. The method has an inbuilt evaluation process according to which a rank is provided to each individual gene based on which the selection is done. Different criteria can be adopted, like setting a threshold for the scores and selecting the genes which satisfy the threshold criteria, or sometimes the threshold can be chosen in such a way that a maximum number of genes can be selected. Then, the subset selected can be the final subset which is used as the input to the classifiers. In subset selection, all GP and evaluation function excluding the classifier can be taken for the combination. The model for the filter approach is given in Fig. (4).
Filter FS Method.
However, methods in this framework may suffer from an inevitable problem, which is caused by searching through the possible feature subsets. The subset generation process usually increases the computational time but gives more relevant feature subset. In literature, it is found that the subset evaluation approach outperformed the ranking methods [19Narendra PM, Fukunaga K. A branch and bound algorithm for feature selection. IEEE Trans Comput 1977; 9(26): 917-22.[http://dx.doi.org/10.1109/TC.1977.1674939] -22Ding C, Peng H. Minimum redundancy feature selection from microarray gene expression data. J Bioinform Comput Biol 2005; 3(2): 185-205.[http://dx.doi.org/10.1142/S0219720005001004] [PMID: 15852500] ]. The filter method is again classified into the ranking method and space search method. Fig. (5), describes the taxonomy of filter feature selection method.
Taxonomy of filter FS methods: Pros of Filter Feature Selection Method.
The method is simple and fast.
It scales well to high dimensional data.
It is independent of classifiers.
Cons of Filter Feature Selection Method
The method is generally univariate or low variate.
B. Wrapper Method
In the wrapper approach, all GP can be taken in combination with the classifier as evaluation function and generates the relevant feature subset. Wrappers are feedback methods, which incorporate the machine-learning algorithm in the feature selection process, i.e., they rely on the performance of a specific classifier to evaluate the quality of a set of features. Wrapper methods search through the space of feature subsets and calculate the estimated accuracy of a single learning algorithm for each feature that can be added to or removed from the feature subset. The search may be a GP and the evaluation function is a classifier. The model for the wrapper feature selection is given in Fig. (6). While building a wrapper algorithm one needs to find the answers for the following questions such as:
How to find all possible feature subsets for evaluation?
How to satisfy oneself with the classification performance of the chosen classifier in order to guide the search and what should be the stopping criteria?
Which predictor to use?
The wrapper approach applies a blind search to find a subset of features. It searches randomly for the best subset which cannot be made sure without getting all possible subsets. Therefore, feature selection in this approach is NP-hard and the search with each iteration tends to become intractable for the user. This is not a preferred approach for feature selection, as it is a crude force method and requires higher computational time for feature subset selection.
The feature space in case of wrapper approach can be searched with various strategies, e.g., forward (i.e., by adding attributes to an initially empty set of attributes) or backward (i e., by starting with the full set and deleting attributes one at a time). The correctness of a specific subset of features/genes based on our classifier is obtained by training and testing the subset against that specific classification model.
The advantage of wrapper approach is that it selects a near perfect subset and error rate in this method is less as compared to other methods. The major disadvantage of the method is that it is computationally very intensive and it is intended for the particular learning machine on which it has been tested. Therefore, there is a high risk of overfitting than filter techniques.
The wrapper approach of feature selection is classified as sequential search based and Heuristic search based. The taxonomy of the wrapper model is given in Fig. (7).
Taxonomy of wrapper FS method.
Usually, an exhaustive search is too expensive, and thus non-exhaustive, heuristic search techniques like genetic algorithms, greedy stepwise, best first or random search are often used. Here, feature selection occurs externally to the induction method using the method as a subroutine rather than as a post-processor. In this process, the induction algorithm is called for every subset of feature consequently inducing high computational cost [23Nakariyakul S, Casasent DP. An improvement on floating search algorithms for feature subset selection. Pattern Recognit 2009; 42: 1932-40.[http://dx.doi.org/10.1016/j.patcog.2008.11.018] , 24Hsu HH, Hsieh CW, Lu MD. Hybrid feature selection by combining filters and wrappers. Expert Syst Appl 2011.[http://dx.doi.org/10.1016/j.eswa.2010.12.156] ].
C. Embedded Method
Despite the lower time consumption of the filter method, a major limitation of the filter approach is that it is independent of the classifier, usually resulting in worse performance than the wrappers. However, the wrapper model comes with an expensive computational cost, which is particularly aggravated by the high dimensionality of microarray data. An intermediate solution for researchers is the use of hybrid or embedded methods, which use the core of the classifier to establish criteria to rank features. Embedded methods are more tractable and efficient in comparison to wrapper approach. This method has a lower risk of overfitting compared to wrapper approach. Probably the most famous embedded method is Support Vector Machine based on Recursive Feature Elimination (SVM-RFE) [25Guyon I. Gene selection for cancer classification using support vector machines. Mach Learn 2002; 46(1-3): 389-422.[http://dx.doi.org/10.1023/A:1012487302797] ]. Fig. (8), shows the schematic diagram of embedded feature selection Method.
Embedded FS method.
The embedded method is classified into three different categories. The taxonomy of embedded method is shown in Fig. (9).
Taxonomy of embedded method.
D. Hybrid Method
It is the combination of any number of same or different classical methods of feature selection such as filter and wrapper methods. The combination can be a filter-filter, filter-wrapper, and filter-filter-wrapper where the gene subset obtained from one method is served as the input to another selection algorithm. Generally, filter is used to select the initial gene subset or help to remove redundant genes. Any combination of several filter techniques can be applied vertically to select the preliminary feature subset. In the next phase, the selected features are given to the wrapper method for the optimal feature selection. This method uses different evaluation criteria. Therefore, it manages to improve the efficiency and prediction accuracy with the better computational cost for high dimensional data. The most common hybrid method is mentioned in the paper [26Peng Y, Wu Z, Jiang J. A novel feature selection approach for biomedical data classification. J Biomed Inform 2010; 43(1): 15-23.[http://dx.doi.org/10.1016/j.jbi.2009.07.008] [PMID: 19647098] ] (Fig. 10).
Hybrid method of feature selection.
E. Ensemble Method
Ensemble method is gaining popularity nowadays for feature selection in case of high dimensional data. This approach of feature selection produces a group of feature subset and either aggregated or intersected to produce the most relevant feature subset. This technique aggregates the significant features selected by different ranking approaches to formulate the most optimal feature subset. This method is therefore robust and stable while dealing with high dimensional data. A brief description of ensemble feature selection can be found in the paper [27Awada W, Khoshgoftaar TM, Dittman D, Wald R, Napolitano A. A review of the stability of feature selection techniques for bioinformatics data in Proc IEEE 13th Int Conf Inf Reuse Integr 2012; 2010; 356-63.] (Fig. 11).
Ensemble method of feature selection.
In this paper, we have classified the general feature selection approaches (filter wrapper and embedded) as non-soft computing approach and soft computing approach of feature selection.
3.3. Nonsoft Computing Methods of Feature Selection
All filter methods, some wrapper, embedded or hybrid methods are included in this category. Filter method ranks each feature according to some univariate metric keeping only the top-ranked features and eliminating low ranking features. There is no specific rule applied for the selection of the high ranking feature and elimination of the low ranking features. Some heuristic score is decided as high or low. Generally, it uses a scoring function to quantify the difference in expression between different groups of samples and rank features in decreasing order of the estimated scores. The topmost features are then selected as a good feature subset removing the others. Different categories of scoring functions for gene ranking are identified in the literature (Table 2).
Table 2 Scoring function family.
Univariate methods analyze a single variable whereas multivariate deal with more than one variable at a time. All filter individual ranking technique belongs to univariate methods and wrapper, embedded or hybrid methods belong to multivariate methods. Filter individual ranking univariate methods as well as bivariate methods are non-soft computing methods of feature selection (Table 3).
Wrapper FS technique uses subset generation strategy. It depends on the classifier to evaluate each subset. The subset generation process can happen in two different ways such as forward selection and backward elimination. First, the search starting point must be decided, which in turn influences the search direction. The search may start with an empty set and successively features can be added in the forward direction, or it may start with a full set and successively remove features i.e. elimination in the backward direction, or start with both ends and add and remove features simultaneously i.e. bidirectional. Non-soft computing Wrapper approaches of feature selections used in the literature are given below in (Table 4).
Table 4 Wrapper based feature selection(non-soft computing methods).
The embedded approach interacts with the learning algorithm at a lower computational cost than the wrapper approach. Feature dependencies are also captured by this method. It is not only the relationship between input features and the output feature but also searches features locally, which allows better local discrimination. It utilizes the independent criteria to choose the best subsets for a known cardinality. Then, the learning algorithm selects the ultimate optimal subset amongst the best possible subsets across different cardinality. The embedded subsets in soft computing methods in the literature are given in Table 5.
There are several soft computing methods of feature selection, which apply the subset generation strategy for feature selection or hybrid approach for feature selection. The search may also start with a randomly selected subset in order to avoid being trapped into local optima [27Awada W, Khoshgoftaar TM, Dittman D, Wald R, Napolitano A. A review of the stability of feature selection techniques for bioinformatics data in Proc IEEE 13th Int Conf Inf Reuse Integr 2012; 2010; 356-63.]. This search space is exponentially prohibitive for exhaustive search with even a moderate N. Therefore, different strategies have been explored: complete, sequential, and random search. These techniques use a dependency measure and a significant measure of a feature defined by rough, fuzzy set theory, soft set, Artificial Neural Network, Genetic algorithm, Particle Swarm Optimization, Ant Colony Optimization, metaheuristic optimization such as Bat algorithm etc. Further, the soft computing based feature selection approach is categorized into hybrid and nonhybrid approach (Tables 6 and 7).
Table 6 Some non soft computing methods and their performance analysis [Classification].
Microarrays are high dimensional data [78Wang G, Song Q, Xu B, Zhou Y. Selecting feature subset for high dimensional data via the propositional foil rules. Pattern Recognit 2013; 46(1): 199-214. [http://dx.doi.org/10.1016/j.patcog.2012.07.028].[http://dx.doi.org/10.1016/j.patcog.2012.07.028] ], which represent a matrix, where the row of the matrix represents the number of genes and columns represent a number of samples. Microarrays are used to measure the expression level of thousands of gene simultaneously. The major issue in Microarray data analysis is the curse of Dimensionality. In the literature, microarray data is widely used for cancer classification. Due to the curse of Dimensionality problem, it may lead to overfitting of the classifier. This issue can be resolved by dimensionality reduction in microarray data. For a few number of genes say 5, the performance of the classifier is poor, but gradual increase in the number of selected features up to a point improves the performance. However, when more features are included beyond the threshold, the performance gets worse. If all features are included then performance deteriorates markedly. This means that including too many irrelevant features can actually worsen the performance of a learning algorithm, and hence shows the need for feature selection or feature extraction in microarray data for supervised learning. Feature extraction involves various techniques such as PCA, various clustering technique such as c-means clustering, hierarchical clustering, etc.
The major objectives of feature selection technique in microarray data are
To remove noisy and irrelevant genes from the current data set.
Improve the computational cost.
Avoid overfitting of the classifier.
In the current paper, we have classified the feature selection techniques for microarray data as nonsoft computing based and soft computing based feature selection. From the above-mentioned feature selection methods, all statistical methods (filter methods), as well as sequential wrappers and some of the embedded methods, belong to the nation soft computing feature selection category, and most of the hybrid methods of feature selection come under soft computing based feature selection category.
In literature Univariate, filter methods have been extensively used in microarray data to identify biomarkers, which is a parametric technique. Beside parametric techniques, non-parametric techniques can also be applied for feature selection. In microarray data, feature selection becomes a critical aspect when tens of thousands of features are considered. The wrapper approach takes the attention as the filter method for feature selection in case of microarray data due to its high computational cost. It is due to the fact that, as the number of features grows, the space of feature subset grows exponentially. Furthermore, they have the risk of overfitting due to the small sample size of microarray data. Therefore, the wrapper approach has been listed and considered in the literature for feature selection. Hybrid and ensemble methods are widely used in the literature for microarray data analysis as it overcomes the limitations of both filter and wrapper approach. Table 8 shows the different feature selection techniques used for Microarray data
Table 8 Different feature selection techniques used for microarray data.
4.1. Nonsoft Computing Method Performance Analysis for Microarray Data
This section reviews some of the non-soft computing based feature selection methods used in microarray data listed in Table 8. Selection methods involving evaluation of individual features, sequential forward selection, sequential backward selection and many more statistical approach are included in this category (Table 9).
Table 9 Nonsoft computing methods performance analysis for microarray data.
4.2. Soft Computing Methods performance Analysis for Microarray Data
Feature subset selection in the context of many practical problems such as diagnosis of cancer-based on microarray data presents an instance of a multi-criteria optimization problem. The multi-criteria to be optimized include the accuracy of the classifier, cost, and risk associated with the classification which in turn depends on the selection of attributes used to describe the patterns. Evolutionary algorithms offer a significant approach to multi-criteria optimization (Table 10).
Table 10 Soft computing methods performance analysis for microarray data
From the study, the recent trend of feature selection has been shifted to hybrid and ensemble method from the classical feature selection method like a filter, wrapper and embedded. For microarray data, the hybrid and ensemble method of feature selections are extensively used. In the current review, we have categorized the classical feature selection techniques and other technique like hybrid and ensemble approach in soft computing and non soft computing technique of feature selection. From the study, it is apparent that researchers are more focused towards the soft computing approach of feature selection rather than non soft computing approaches for high dimensional data like microarray data. Because in supervised learning, the model efficiency is highly dependent on training, the model is trained with significant features to enhance the efficiency of the model. The soft computing feature selection techniques mostly use evolutionary algorithms for feature selection to get the optimal and discriminative features. Moreover, the soft computing technique with hybrid approach is more desirable to reduce the computational cost and overfitting of the model.
5. DISCUSSION AND FUTURE RESEARCH DIRECTION
Various methods of feature selection for microarray data are discussed in this paper. From the literature survey, non-soft computing approaches like the statistical approach of feature selection give accuracy to the classifier without considering the correlation of features, whereas soft computing based feature selection adopts different search strategy to select optimal feature subsets in association with non-soft computing based feature selection such as filter and wrapper methods. From the literature, it is observed that hybrid soft computing approaches of feature selection are used widely for different applications in comparison to non hybrid techniques. For high dimensional data like microarray data, the non hybrid non soft computing approaches (like filter technique) were used previously for most of the microarray dataset. But now a days, hybrid soft computing techniques of feature selection are mostly preferred to get the optimal feature subset over non hybrid non soft computing techniques of feature selection due to flexibility and efficiency in selecting features from high dimensional data. There can be future research to develop algorithms using sequential and random search strategies for clustering and classification tasks respectively. The research can be extended to identify the biological relevance of feature subsets after applying non-soft computing as well as soft computing technology instead of only considering the model performance. The performance needs to be evaluated not only based on classification accuracy but also evaluating the metrics like sensitivity and specificity.
CONSENT FOR PUBLICATION
CONFLICT OF INTEREST
The authors declare no conflict of interest, financial or otherwise.
Bosio M, Bellot P, Salembier P, Verges AO. Microarray classification with hierarchical data representation and novel feature selection criteria 2012; In: IEEE 12th International Conference on Bioinformatics & Bioengineering (BIBE); 2012; pp. 344-9.[http://dx.doi.org/10.1109/BIBE.2012.6399648]
Guzzi PH, Cannataro M. Challenges in Microarray Data Management and Analysis. Computer-Based Medical Systems 2011; 24(3): 1-6.
Jain I, Jain VK, Jain R. Correlation feature selection based improved-Binary Particle Swarm Optimization for gene selection and cancer classification. Appl Soft Comput 2018; 62: 203-15.[http://dx.doi.org/10.1016/j.asoc.2017.09.038]
Lazar C, Taminau J, Meganck S, Steenhoff D, Coletta A, Molter C. A survey on filter techniques for feature selection in gene expression microarray analysis IEEE/ACM Trans Computational Biology and Bioinformatics 2012; 9(4): 1106 -19.[http://dx.doi.org/10.1109/TCBB.2012.33]
Lazar C, Taminau J, Meganck S, et al. A survey on filter techniques for feature selection in gene expression microarray analysis. IEEE/ACM Trans Comput Biol Bioinformatics 2012; 9(4): 1106-19.[http://dx.doi.org/10.1109/TCBB.2012.33] [PMID: 22350210]
Kira K, Rendell LA. The feature selection problem: Traditional methods and a new algorithm. Proceedings of Ninth National Conference on Artificial Intelligence 129-34.
Koller D, Sahami M. Toward optimal feature selection. ICML’96 Proceedings of the 13th International Conference on International
Conference on Machine Learning. Bari, Italy. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc. 1996; pp. 284-92.
Yu L, Liu H. Efficient feature selection via analysis of relevance and redundancy. J Mach Learn Res 2004; 5: 1205-24.
Yu L, Liu H. Feature selection for high-dimensional data: A fast correlation-based filter solution. ICML’03 Proceedings of the 20th
International Conference on International Conference on Machine Learning. Washington, DC, USA. 2003; pp. 856-63.
Awada W, Khoshgoftaar TM, Dittman D, Wald R, Napolitano A. A review of the stability of feature selection techniques for bioinformatics data in Proc IEEE 13th Int Conf Inf Reuse Integr 2012; 2010; 356-63.
Deng L, Pei J, Ma J, Lee DL. A rank sum test method for informative gene discovery. Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 2004; 410-19.[http://dx.doi.org/10.1145/1014052.1014099]
Long AD, Mangalam HJ, Chan BYP, Tolleri L, Hatfield GW, Baldi P. Improved statistical inference from DNA microarray data using analysis of variance and a Bayesian statistical framework. Analysis of global gene expression in Escherichia coli K12. J Biol Chem 2001; 276(23): 19937-44.[http://dx.doi.org/10.1074/jbc.M010192200] [PMID: 11259426]
Ruijuan J, Chunxia X. Mechanical fault diagnosis and signal feature extraction based on the fuzzy neural network. 27th Chinese Control
Conference. Kunming, China: IEEE 2008; pp. 234-7.[http://dx.doi.org/10.1109/CHICC.2008.4605121]
Xuemei L, Lixing D, Jinhu L. Agriculture irrigation water demand forecasting based on rough set theory and weighted LS-SVM Second International Conference. Vol. 2: pp. 371-4.
Andrés PU, Héctor S, Miguel B, Damme Patrick V, Marco T. A survey of artificial neural network-based modeling in agroecology 2008; 247-69.
Sahu B, Mishra D. A Novel Feature Selection Algorithm using Particle Swarm Optimization for Cancer Microarray Data, International Conference on Modeling Optimization and Computing (ICMOC-2012). Procedia Eng 2012; 38: 27-31.[http://dx.doi.org/10.1016/j.proeng.2012.06.005]
Sahu B, Mishra D. “Performance of Feed Forward Neural Network for a Novel Feature Selection Approach”, (IJCSIT). International Journal of Computer Science and Information Technologies 2011; 2(4): 1414-9.
Chiang Y, Chiang H, Yilin S. The Application Of Ant Colony Optimization For Gene Selection In Microarray-Based Cancer Classification Proceedings of the Seventh. 2008; In: International Conference on Machine Learning and Cybernetics.; Kunming. 2008; pp. 12-5.
Prasad Y, Biswas KK, Jain CK. classifier based feature selection using GA ACO, PSO for siRNA design. In: Tan Y, Shi Y, Tan KC, Eds. International Conference in Swarm Intelligence; 2010; Springer: Berlin, Heidelberg; pp. 307-14.[http://dx.doi.org/10.1007/978-3-642-13498-2_40]
Barak S, Tichý T. Wrapper ANFIS-ICA method to do stock market timing and feature selection on the basis of Japanese Candlestick. Expert Syst Appl 2015; 13(3)
Wang A, An N, Chen G, Li L, Alterovitz G. Accelerating incremental wrapper based gene selection with K-Nearest-Neighbor IEEE International Conference on Bioinformatics and Biomedicine (BIBM) 21-3.
Shanab AA, Khoshtoftaar TM, Wald R. Evaluation of Wrapper-based Feature Selection using Hard, Moderate, and Easy Bioinformatics Data 2014; In: IEEE 14th International Conference on Bioinformatics and Bioengineering; 2014; pp. 2014; 149-55.
Krishnaveni V, Arumugam G. Harmony Search based Wrapper Feature Selection Method for 1-Nearest Neighbor Classifier International Conference on Pattern Recognition, Informatics and Mobile Engineering (PRIME) 2013; 24-9.
Osareh A, Shadgar B. Microarray Data Analysis for Cancer Classification 2010; 125-32.
Pour AF, Dalton LA. Optimal Bayesian feature selection on high dimensional gene expression data. 2014; In: Signal and Information Processing (Global SIP), IEEE Global Conference on.; 2014; pp. 2014; 1402 -5.[http://dx.doi.org/10.1109/GlobalSIP.2014.7032358]
Mishra D, Sahu B. A Signal-to-noise Classification Model for Identification of Differentially Expressed Genes from Gene Expression Data 2011; In: Electronics Computer Technology (ICECT), 3rd International Conference on; 2011; pp. 2011; 2(1): 204 -8.[http://dx.doi.org/10.1109/ICECTECH.2011.5941685]
Hsu H, Lu MD. Feature Selection for Cancer Classification on Microarray Expression Data Eighth International Conference on Intelligent Systems Design and Applications Vol.3: 153-8.2008; [http://dx.doi.org/10.1109/ISDA.2008.198]
Hu H, Li J, Wang H, Daggard G. Combined gene selection methods for microarray data analysis. In: Gabrys B, Howlett RJ, Jain LC, Eds. International Conference on Knowledge-Based and Intelligent Information and Engineering Systems 2006; 976-83.[http://dx.doi.org/10.1007/11892960_117]
Xing EP, Jordan MI, Karp RM. Feature Selection for High-Dimensional Genomic Microarray Data Proc 18th International Conf on Machine Learning. 23-41.
Hall M. Correlation-Based Feature Selection for Machine Learning. PhD thesis 1999.
Yu L, Liu H. Feature selection for high-dimensional data: A fast correlation-based filter solutionMachine Learning-International Workshop 2003; 20: p. 856.
Peng H, Long F, Ding C. Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy. IEEE Trans Pattern Anal Mach Intell 2005; 27(8): 1226-38.[http://dx.doi.org/10.1109/TPAMI.2005.159] [PMID: 16119262]
Inza I, Sierra B, Blanco R, Larrañaga P. Gene selection by sequential search wrapper approaches in microarray cancer class prediction. J Intell Fuzzy Syst 2002; 12(1): 25-33.
Wanderley M, Gardeux V, Natowicz R, Braga A. Ga-kde-Bayes: an evolutionary wrapper method based on non-parametric density estimation applied to bioinformatics problems 21st European Symposium on Artificial Neural Networks-ESANN. 155-60.
Wang G, Song Q, Xu B, Zhou Y. Selecting feature subset for high dimensional data via the propositional foil rules. Pattern Recognit 2013; 46(1): 199-214. [http://dx.doi.org/10.1016/j.patcog.2012.07.028].[http://dx.doi.org/10.1016/j.patcog.2012.07.028]
Canul-Reich J, Hall L, Goldgof D, Korecki J, Eschrich S. Iterative feature perturbation as a gene selector for microarray data. Int J Pattern Recognit Artif Intell 2012; 26(05)[http://dx.doi.org/10.1142/S0218001412600038]
Shah M, Marchand M, Corbeil J. Feature selection with conjunctions of decision stumps and learning from microarray data. IEEE Trans Pattern Anal Mach Intell 2012; 34(1): 174-86.[http://dx.doi.org/10.1109/TPAMI.2011.82] [PMID: 21576745]
Leung Y, Hung Y. A multiple-filter-multiple-wrapper approach to gene selection and microarray data classification IEEE/ACM Transaction Computational Biology Bioinformatics (TCBB) 2010; 7(1): 108-17.
Cho SB, Won HH. Machine Learning in DNA Microarray Analysis for Cancer Classification Conference in Research and Practice in Information Technology, Proceedings of the First Asia-Pacific bioinformatics conference on Bioinformatics.
Sivapriya TR, Banu N, Kamal AR. Hybrid Feature Reduction and Selection for Enhanced Classification of High Dimensional Medical Data IEEE International Conference on Computational Intelligence and Computing Research 327-30.2013; [http://dx.doi.org/10.1109/ICCIC.2013.6724237]
Xiao Z, Dellandrea E, Dou W, Chen L. A new embedded feature selection method based on SFS In: Proceedings of Advanced Concepts for Intelligent Vision Systems. 2009; pp. 1-10.
Kumara M, Rath NK, Swain A, Rath SK. Feature selection and classification of microarray data using MapReduce based ANOVA and KNearest neighbor. Procedia Comput Sci 2015; 54: 301-10.[http://dx.doi.org/10.1016/j.procs.2015.06.035]
Xing EP, Jordan MI, Karp RM. Feature Selection for High-Dimensional Genomic Microarray Data Proc 18th Int’l Conf Machine Learning (ICML ’01). 601-8.
Cho SB, Won HH. Machine learning in DNA microarray analysis for cancer classification. Proceedings of the 1st Asia-Pacific Bioinformatics Conference on Bioinformatics 2003; 189-98.
Fung BYM, Ng VTY. Classification of heterogeneous gene expression data. Article 2003; 5(2): 69-78.
Yu Y. SVM-RFE Algorithm for Gene Feature Selection. Computer Engineering 2008.
Nikumbh S, Ghosh S, Jayaraman VK. Biogeography-based informative gene selection and cancer classification using SVM and random forests IEEE Congress on Evolutionary Computation; pp. 1-6.[http://dx.doi.org/10.1109/CEC.2012.6256127]
Chen F, Zeng X, Li Q. Redundant Gene Selection based on particle swarm optimization. Sys Biol Intell Comput 2009; 8(5): 10-6.
Wang Z. Neuro-fuzzy modeling for microarray cancer gene expression data 2005.
Paul S, Maji P. Rough Sets and Support Vector Machine for Selecting Differentially Expressed mi-RNAs IEEE International Conference on Bioinformatics and Biomedicine Workshops (BIBMW). 864-71.[http://dx.doi.org/10.1109/BIBMW.2012.6470255]
Molodtsov D. The Theory of Soft Sets 2004.
Mary E, Yamany W, Hassanien AE. New approach for feature selection based on rough set and bat algorithm 2014; 346-53.