Which method to use? An assessment of data mining methods in Environmental Data Science
Environmental Modelling & Software
School of Science
Data Mining (DM) is a fundamental component of the Data Science process. Over recent years a huge library of DM algorithms has been developed to tackle a variety of problems in fields such as medical imaging and traffic analysis. Many DM techniques are far more flexible than classical numerical simulation or statistical modelling approaches, and could be usefully applied to data-rich environmental problems. Certain techniques, such as artificial neural networks, clustering, case-based reasoning and Bayesian networks, have been applied in environmental modelling, while others, such as support vector machines, have yet to be taken up on a wide scale. There is considerable scope for many lesser-known techniques to be applied in environmental research, with the potential to help address some of the current open environmental challenges. However, selecting the best DM technique for a given environmental problem is not a simple decision, and there is a lack of guidelines and criteria to help data scientists and environmental scientists ensure effective knowledge extraction from data. This paper provides a broad introduction to the use of DM in Data Science processes for environmental researchers. The Data Science process comprises three main steps (pre-processing, data mining and post-processing). This paper provides a conceptualization of Environmental Systems and a conceptualization of DM methods, which constitute the core step of the Data Science process. These two elements define a conceptual framework that forms the basis of a new methodology for relating the characteristics of a given environmental problem to a family of Data Mining methods. The paper provides a general overview of DM techniques and guidelines for non-expert users, who can use this support to decide which technique is most suitable for the problem at hand.
The decision depends on the two-dimensional relationship between the type of environmental system and the type of DM method. An illustrative two-way table containing references for each pairing of Environmental System and Data Mining method is presented and discussed. Some examples of how the proposed methodology supports DM method selection are also presented, and challenges and future trends are identified.