Faculty of Computing, Health and Science
School of Computer and Security Science
Knowledge discovery from large data sets using classic data mining techniques has proved difficult because such data sets are large in both dimensionality and number of samples. In real applications, data sets often contain many noisy, redundant, and irrelevant features, which degrade classification accuracy and increase complexity exponentially. Because of this inherent nature, analyzing the quality of data sets is difficult, and very few approaches to this issue can be found in the literature. This paper presents a novel method for investigating the quality and structure of data sets, i.e., for analyzing whether noisy and irrelevant features are embedded in them. To do so, a wrapper-based feature selection method using a genetic algorithm and an external classifier is employed to select the discriminative features. The importance of each feature is ranked by how frequently it appears in the selected chromosomes. The effectiveness of the proposed idea is investigated and discussed using some sample data sets.
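The idea summarized above can be illustrated with a minimal sketch. It is not the authors' implementation: the synthetic data, the nearest-centroid classifier standing in for the external classifier, the GA settings (`pop_size`, `generations`, `top_k`, `p_mut`), and the choice to count feature frequency over the elite chromosomes of every generation are all assumptions made for illustration. All function names are hypothetical.

```python
import random
from collections import Counter

def make_data(rng, n_samples=120, n_informative=2, n_noise=6):
    """Synthetic two-class data (assumed): only the first n_informative features carry signal."""
    X, y = [], []
    for i in range(n_samples):
        label = i % 2
        informative = [rng.gauss(2.0 * label, 1.0) for _ in range(n_informative)]
        noise = [rng.gauss(0.0, 1.0) for _ in range(n_noise)]
        X.append(informative + noise)
        y.append(label)
    return X, y

def centroid_accuracy(X_train, y_train, X_test, y_test, mask):
    """Stand-in external classifier (assumed): nearest class centroid on the masked features."""
    cols = [j for j, bit in enumerate(mask) if bit]
    if not cols:
        return 0.0  # an empty feature subset cannot classify anything
    centroids = {}
    for label in set(y_train):
        rows = [x for x, t in zip(X_train, y_train) if t == label]
        centroids[label] = [sum(r[j] for r in rows) / len(rows) for j in cols]
    correct = 0
    for x, t in zip(X_test, y_test):
        pred = min(
            centroids,
            key=lambda c: sum((x[j] - m) ** 2 for j, m in zip(cols, centroids[c])),
        )
        correct += pred == t
    return correct / len(y_test)

def rank_features(X, y, rng, pop_size=20, generations=15, top_k=5, p_mut=0.1):
    """Wrapper-style GA: chromosomes are bit masks, fitness is held-out accuracy."""
    n_features = len(X[0])
    split = len(X) // 2
    X_tr, y_tr, X_te, y_te = X[:split], y[:split], X[split:], y[split:]

    def fitness(mask):
        return centroid_accuracy(X_tr, y_tr, X_te, y_te, mask)

    population = [[rng.randint(0, 1) for _ in range(n_features)] for _ in range(pop_size)]
    counts = Counter()
    for _ in range(generations):
        elite = sorted(population, key=fitness, reverse=True)[:top_k]
        # Count how often each feature appears in the selected (elite) chromosomes.
        for mask in elite:
            counts.update(j for j, bit in enumerate(mask) if bit)
        children = []
        while len(children) < pop_size - top_k:
            a, b = rng.sample(elite, 2)
            cut = rng.randint(1, n_features - 1)          # one-point crossover
            child = a[:cut] + b[cut:]
            child = [1 - bit if rng.random() < p_mut else bit for bit in child]  # bit-flip mutation
            children.append(child)
        population = elite + children
    # Features sorted from most to least frequently selected.
    ranking = sorted(range(n_features), key=lambda j: counts[j], reverse=True)
    return ranking, counts

rng = random.Random(0)
X, y = make_data(rng)
ranking, counts = rank_features(X, y, rng)
print("features ranked by selection frequency:", ranking)
```

In this sketch, features that rarely appear in the elite chromosomes are candidates for being noisy or irrelevant, which is the kind of data set quality signal the abstract describes. A real study would swap in the actual classifier and a proper validation scheme.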
This is an Author's Accepted Manuscript of: Leng, J., Valli, C., & Armstrong, L. (2010). A Wrapper-based Feature Selection for Analysis of Large Data Sets. Proceedings of 2010 3rd International Conference on Computer and Electrical Engineering (ICCEE 2010) (pp. 167-170). Chengdu, China. IEEE.
© 2010 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.