This is the companion website for the paper whose title is listed above.

 

Motivation:

Given the joint feature-label distribution, increasing the number of features never increases the classification error of the optimal classifier; however, this is not the case when a classifier is designed via a classification rule from sample data. Typically (but not always), for fixed sample size, the error of a designed classifier decreases and then increases as the number of features grows. The potential downside of using too many features is most critical for small samples, which are commonplace for gene-expression-based classifiers for phenotype discrimination. For fixed sample size and feature-label distribution, the issue is to find the optimal number of features.

 

Results:

Since the distribution of the error as a function of the number of features and sample size is known only in rare cases, this study uses simulation for various feature-label distributions and classification rules, across a wide range of sample and feature-set sizes. Finding the optimal number of features as a function of sample size in this way requires massively parallel computation. Seven classifiers are treated: 3-nearest-neighbor (3NN), Gaussian kernel, linear support vector machine, polynomial support vector machine, perceptron, regular histogram, and linear discriminant analysis (LDA). Three Gaussian-based models are considered: linear, nonlinear, and bimodal. In addition, real patient data from a large breast-cancer study are considered. To mitigate the combinatorial search for optimal feature sets, and to model the situation in which subsets of genes are co-regulated and correlation is internal to these subsets, we assume that the covariance matrix of the features is blocked, with each block corresponding to a group of correlated features. For detailed descriptions of the covariance structures tested, please refer to our paper.
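As a rough illustration of the blocked covariance assumption, the sketch below builds a block-diagonal covariance matrix in plain Python. The function name, its parameters, and the choice of a single shared variance and a single within-block correlation `rho` are assumptions made for this example only; the structures actually tested are described in the paper.

```python
def blocked_covariance(num_features, block_size, variance=1.0, rho=0.5):
    """Return a block-diagonal covariance matrix as a list of lists.

    Features in the same block have correlation `rho`; features in
    different blocks are uncorrelated. All names and default values are
    illustrative assumptions, not settings taken from the paper.
    """
    if block_size <= 0 or num_features % block_size != 0:
        raise ValueError("num_features must be a positive multiple of block_size")
    if not 0.0 <= rho < 1.0:
        raise ValueError("rho must lie in [0, 1) for this sketch")

    cov = [[0.0] * num_features for _ in range(num_features)]
    for i in range(num_features):
        for j in range(num_features):
            if i == j:
                cov[i][j] = variance          # diagonal: feature variance
            elif i // block_size == j // block_size:
                cov[i][j] = rho * variance    # same block: correlated
            # different blocks stay at 0.0 (uncorrelated)
    return cov


if __name__ == "__main__":
    # Four features in two blocks of two.
    for row in blocked_covariance(4, 2, variance=1.0, rho=0.5):
        print(row)
```

With this structure, correlation is confined to each block, so candidate feature sets can be described by how many features are drawn from each block rather than by enumerating every subset.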

Altogether, this companion website presents more than 500 error surfaces covering the cases studied. On each error surface, the black line with circular markers traces the points of lowest error rate and therefore indicates the optimal feature size.
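The optimal feature size marked on each surface can be thought of as a per-sample-size minimum over the estimated errors. The sketch below shows that lookup in plain Python. The function name, the dictionary layout, and the toy error values are all hypothetical and made up for illustration; they are not results from the study.

```python
def optimal_feature_sizes(error_surface, feature_sizes):
    """Map each sample size to the feature size with the lowest error.

    `error_surface` is a dict: sample size -> list of estimated error
    rates, one per entry of `feature_sizes`. This layout is an assumption
    made for the example.
    """
    optimal = {}
    for sample_size, errors in error_surface.items():
        if len(errors) != len(feature_sizes):
            raise ValueError("each error list must match feature_sizes in length")
        # Index of the smallest error; ties resolve to the fewest features.
        best = min(range(len(errors)), key=errors.__getitem__)
        optimal[sample_size] = feature_sizes[best]
    return optimal


if __name__ == "__main__":
    # Toy numbers, invented purely to demonstrate the lookup.
    toy_surface = {
        20: [0.30, 0.22, 0.25, 0.31],
        100: [0.28, 0.18, 0.15, 0.17],
    }
    print(optimal_feature_sizes(toy_surface, [1, 2, 3, 4]))  # {20: 2, 100: 3}
```

In this toy case the minimum moves to a larger feature size as the sample size grows, which is the kind of trend the black lines make visible on the surfaces.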

 

Click the links to enter the site:

Linear Model (equal variance)

Nonlinear Model (unequal variance)

Bimodal Model (equal variance)

Real Patient Data