Seven pitfalls of using data science in cybersecurity
Data Science in Cybersecurity and Cyberthreat Intelligence
School of Science
Machine learning, a subset of artificial intelligence, is used for many problems where a data-driven approach is required and the problem space involves either classification or prediction. The hype surrounding machine learning, coupled with the ease of use of machine learning tools can lead to a (mistaken) belief that machine learning is a panacea for all problems and simply feeding large volumes of data to an algorithm will generate a sensible and usable answer. In this chapter, we explore several pitfalls that a data scientist must evaluate in order to obtain some tangible meaning from the results provided by a machine learning algorithm. There is some evidence to suggest that algorithm choice is not a discriminator. In particular, we explore the importance of feature set selection and evaluate the inherent problems in relying on synthetic data.