Author Identifiers

A N M Bazlur Rashid
ORCID: 0000-0002-8672-5023

Date of Award

2021

Degree Type

Thesis

Degree Name

Doctor of Philosophy

School

School of Science

First Advisor

Dr Mohiuddin Ahmed

Second Advisor

Dr Leslie F Sikos

Third Advisor

Associate Professor Paul Haskell-Dowland

Fourth Advisor

Dr Tonmoy Choudhury

Abstract

The rapid progress of modern technologies generates a massive amount of high-throughput data, called Big Data, which provides opportunities to find new insights using machine learning (ML) algorithms. Big Data consist of many features (attributes). However, irrelevant features may degrade the classification performance of ML algorithms. Feature selection (FS) is a combinatorial optimisation technique used to select a subset of relevant features that represent the dataset. For example, FS is an effective preprocessing step for anomaly detection techniques in Big Cybersecurity Datasets. Evolutionary algorithms (EAs) are widely used search strategies for feature selection. A variant of EAs, called a cooperative co-evolutionary algorithm (CCEA) or simply cooperative co-evolution (CC), uses a divide-and-conquer approach and is a good choice for large-scale optimisation problems. The goal of this thesis is to investigate three key research problems, and develop solutions to them, related to feature selection in Big Data and anomaly detection using feature selection in Big Cybersecurity Data.

The first research problem of this thesis is to investigate and develop a feature selection framework using CCEA. The objective of feature selection is twofold: reducing the number of features to decrease computation, and improving classification accuracy. These two goals conflict, but both can be pursued using a single objective function. When only classification accuracy is used as the objective function for FS, EAs such as CCEA tend to achieve higher accuracy while retaining a larger number of features. Hence, this thesis proposes a penalty-based wrapper single objective function. This function has been used to evaluate the FS process using CCEA, henceforth called Cooperative Co-Evolutionary Algorithm-Based Feature Selection (CCEAFS). Experimental analysis was performed using six widely used classifiers on six different datasets, with and without FS. The experimental results indicate that the proposed objective function is efficient at reducing the number of features in the final feature subset without significantly reducing classification accuracy. Furthermore, the performance results have been compared with four other state-of-the-art techniques.
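
As a minimal sketch of what a penalty-based single objective can look like, the function below subtracts a penalty proportional to the fraction of features kept from a weighted accuracy term. The weighted form, the name penalised_fitness, and the default weight of 0.9 are assumptions made for illustration; the abstract does not state the formula used in the thesis. In a wrapper setting, the accuracy argument would come from training a classifier on the candidate subset.

```python
def penalised_fitness(accuracy, n_selected, n_total, weight=0.9):
    """Hypothetical penalty-based fitness (not the thesis's exact formula).

    accuracy   -- classification accuracy in [0, 1] for the candidate subset
    n_selected -- number of features in the candidate subset
    n_total    -- number of features in the full dataset
    weight     -- assumed trade-off between accuracy and subset size
    """
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    if n_selected == 0:
        # An empty subset cannot be classified, so give it the lowest score.
        return 0.0
    penalty = n_selected / n_total  # fraction of features kept
    return weight * accuracy - (1.0 - weight) * penalty


# Two subsets with the same accuracy: the smaller one scores higher.
small = penalised_fitness(accuracy=0.90, n_selected=10, n_total=100)
large = penalised_fitness(accuracy=0.90, n_selected=100, n_total=100)
print(small > large)
```

With accuracy held equal, the penalty term is what lets the search prefer the smaller subset, which an accuracy-only objective cannot do.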

CC decomposes a large and complex problem into several subproblems, optimises each subproblem independently, and combines the subproblem solutions only to build a complete solution of the problem. The existing decomposition solutions perform poorly because of limitations such as not considering feature interactions, dealing with only an even number of features, and decomposing the dataset statically. Moreover, for real-world problems without any prior information about how the features in a dataset interact, it is difficult to find a suitable problem decomposition technique for feature selection. Hence, the second research problem of this thesis is to investigate and develop a decomposition method that can decompose Big Datasets dynamically and can improve the probability of grouping interacting features into the same subcomponent. Accordingly, this thesis proposes random feature grouping (RFG), with three variants. RFG has been used in the CC-based FS process, henceforth called Cooperative Co-Evolution-Based Feature Selection with Random Feature Grouping (CCFSRFG). Experimental analysis performed using six widely used ML classifiers on seven different datasets, with and without FS, indicates that, in most cases, the proposed CCFSRFG-1 outperforms CCEAFS and CCFSRFG-2, and also outperforms classification using all features. Furthermore, the performance results have been compared with five other state-of-the-art techniques.
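
The general idea of random grouping can be sketched as follows: shuffle the feature indices and deal them into subcomponents, so that the grouping changes each time it is called and an odd number of features is handled without special cases. This is an illustrative assumption about how such a decomposition might work; it does not reproduce any of the three RFG variants, which the abstract does not describe. The function name and parameters are hypothetical.

```python
import random


def random_feature_grouping(n_features, n_groups, rng):
    """Randomly partition feature indices into n_groups subcomponents.

    Hypothetical sketch: group sizes differ by at most one, so the number
    of features does not need to divide evenly by the number of groups.
    """
    if n_groups <= 0 or n_groups > n_features:
        raise ValueError("n_groups must be between 1 and n_features")
    indices = list(range(n_features))
    rng.shuffle(indices)  # a fresh random order on every call
    # Deal the shuffled indices round-robin into the groups.
    return [indices[g::n_groups] for g in range(n_groups)]


# Regrouping at each generation gives interacting features repeated
# chances to land in the same subcomponent.
rng = random.Random(42)
for generation in range(3):
    print(generation, random_feature_grouping(7, 3, rng))
```

Calling the function once per generation, rather than once before the search starts, is what makes the decomposition dynamic rather than static.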

Anomaly detection from Big Cybersecurity Datasets is very important; however, it is a very challenging and computationally expensive task. Feature selection in cybersecurity datasets may improve and quantify the accuracy and scalability of both supervised and unsupervised anomaly detection techniques. The third research problem of this thesis is to investigate and develop an anomaly detection approach using feature selection that can improve the anomaly detection performance and also reduce the execution time. Accordingly, this thesis proposes Anomaly Detection Using Feature Selection (ADUFS) to deal with this research problem. Experiments were performed on five different benchmark cybersecurity datasets, with and without feature selection, and the performance of both supervised and unsupervised anomaly detection techniques was investigated using ADUFS. The experimental results indicate that, instead of using the original dataset, a dataset with a reduced number of features yields better performance in terms of true positive rate (TPR) and false positive rate (FPR) than the existing techniques for anomaly detection. In addition, all anomaly detection techniques require less computational time when using datasets with a suitable subset of features rather than entire datasets. Furthermore, the performance results have been compared with six other state-of-the-art techniques.
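
For reference, TPR and FPR can be computed from binary labels as in the sketch below, using the standard definitions TPR = TP / (TP + FN) and FPR = FP / (FP + TN). The function name and the convention that 1 marks an anomaly and 0 marks normal traffic are assumptions for illustration; the labels shown are made-up toy values, not results from the thesis.

```python
def tpr_fpr(y_true, y_pred):
    """Return (TPR, FPR) for binary labels where 1 = anomaly, 0 = normal."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # anomalies caught
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # anomalies missed
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # normal, unflagged
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return tpr, fpr


# Toy labels only: two anomalies and three normal records.
tpr, fpr = tpr_fpr([1, 1, 0, 0, 0], [1, 0, 1, 0, 0])
print(tpr, fpr)
```

A better detector raises TPR while lowering FPR, which is why the two are reported together.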

Available for download on Sunday, July 05, 2026

Included in

Data Science Commons
