Title

Mining Outliers in Correlated Subspaces for High Dimensional Data Sets

Document Type

Journal Article

Publisher

IOS Press

Faculty

Computing, Health and Science

School

Computer & Security Science

RAS ID

10240

Comments

This article was originally published as: Leng, J., & Hong, T. (2010). Mining Outliers in Correlated Subspaces for High Dimensional Data Sets. Fundamenta Informaticae, 98(1), 71–86.

Abstract

Outlier detection in high dimensional data sets is a challenging data mining task. Mining outliers in subspaces is a promising solution, because outliers may be embedded in some interesting subspaces. Searching all possible subspaces, however, runs into the problem known as "the curse of dimensionality". Because high dimensional data sets contain many irrelevant dimensions, it is essential to eliminate the irrelevant or unimportant dimensions and identify interesting subspaces with strong correlation. Normally, the correlation among dimensions can be determined by traditional feature selection techniques or subspace-based clustering methods. Dimension-growth subspace clustering techniques can find interesting subspaces in relatively low dimensional spaces, while dimension-reduction approaches try to group interesting subspaces with larger dimensions. This paper investigates the possibility of detecting outliers in correlated subspaces and presents a novel approach for doing so. The degree of correlation among dimensions is measured in terms of the mean squared residue, and a dimension-reduction method is employed to find the correlated subspaces. Based on the correlated subspaces obtained, we introduce another criterion called "shape factor" to rank the most important subspaces among the projected subspaces. Finally, outliers are identified within the most important subspaces by using classical outlier detection techniques. Empirical studies show that the proposed approach can identify outliers effectively in high dimensional data sets.
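
The abstract names the mean squared residue as the measure of correlation among dimensions but does not define it. The sketch below assumes the standard definition from the biclustering literature (Cheng and Church), where the residue of an entry is its value minus its row mean and column mean, plus the overall mean of the submatrix; the paper's exact formulation may differ. The function name and its parameters are hypothetical and serve only as illustration.

```python
def mean_squared_residue(data, rows, cols):
    """Mean squared residue of the submatrix data[rows][cols].

    Assumes the Cheng-Church form: residue(i, j) =
    a_ij - row_mean_i - col_mean_j + overall_mean.
    A value near 0 means the selected dimensions vary coherently
    (up to additive shifts) across the selected objects.
    """
    n_rows, n_cols = len(rows), len(cols)
    if n_rows == 0 or n_cols == 0:
        raise ValueError("rows and cols must be non-empty")

    # Extract the submatrix as floats.
    sub = [[float(data[i][j]) for j in cols] for i in rows]

    row_means = [sum(r) / n_cols for r in sub]
    col_means = [sum(sub[i][j] for i in range(n_rows)) / n_rows
                 for j in range(n_cols)]
    overall_mean = sum(row_means) / n_rows

    total = 0.0
    for i in range(n_rows):
        for j in range(n_cols):
            residue = sub[i][j] - row_means[i] - col_means[j] + overall_mean
            total += residue * residue
    return total / (n_rows * n_cols)


# Illustrative data (made up): the first three objects follow an
# additive pattern across the three dimensions; the fourth does not.
data = [
    [1, 2, 3],
    [2, 3, 4],
    [4, 5, 6],
    [9, 1, 5],
]
coherent = mean_squared_residue(data, rows=[0, 1, 2], cols=[0, 1, 2])
with_outlier = mean_squared_residue(data, rows=[0, 1, 2, 3], cols=[0, 1, 2])
```

Under this assumed definition, a lower score indicates a more strongly correlated subspace, so `coherent` is approximately zero while `with_outlier` is strictly larger. The "shape factor" criterion is not sketched here, because the abstract gives no definition for it.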

DOI

10.3233/FI-2010-217
