Environmental monitoring and peat assessment using a multivariate analysis of regional-scale geochemical data

A compositional multivariate approach was used to analyse regional-scale soil geochemical data obtained as part of the Tellus Project generated by the Geological Survey of Northern Ireland. The multi-element total concentration data presented compriseX-rayﬂuorescence(XRF) analyses of 6862rural soil samples collectedat 20-cm depth on a non-aligned grid at one site per 2 km 2 . Censored data were imputed using published detection limits. Each soil sample site was assigned to the regional geology map, resulting in spatial data for one categorical variable and 35 continuous variables comprised of individual and amalgamated elements. This paper examines the extent to which soil geochemistry reﬂects the underlying geology or superﬁcial deposits. Since the soil geochemistry is compositional, log-ratios were computed to adequately evaluate the data using multivariate statistical methods. Principal component analysis (PCA) and minimum/maximum autocorrelation factors (MAF) were used to carry out linear discriminant analysis (LDA) as a means to discover and validate processes related to the geologic assemblages coded as age bracket. Peat cover was introduced as an additional category to measure the ability to predict and monitor fragile ecosystems. Overall prediction accuracies for the age bracket categories were 68.4% using PCA and 74.7% using MAF. With inclusion of peat, the accuracy for LDA classiﬁcation decreased to 65.0 and 69.9%, respectively. The increase in misclassiﬁcation due to the presence of peat may reﬂect degradation of peat-covered areas since the creation of superﬁcial deposit classiﬁcation. rocks basin of interbedded sandstones, mudstones and conglomerates. The spatial maps of co-kriging estimates of the posterior probabilities, age


Introduction
A diversity of rock types observed across Northern Ireland represents three basement terranes ( Fig. 1a after Mitchell 2004). The Grampian Terrane and associated rocks in the northwest have metamorphic igneous and sedimentary origins spanning the Proterozoic Era. Psammites and semi-pelites are the dominant rock type, with subordinate sandstones and conglomerates. The Midland Valley Terrane hosts Palaeozoic igneous formations and Late Devonian-Early Carboniferous sedimentary rocks. Rock types comprise red sandstones, limestone and mudstones with less common conglomerates. The Southern Uplands-Down-Longford Terrane consists of granitic igneous intrusives and Lower Palaeozoic Ordovician and Silurian marine sedimentary rocks (lithic arenites and sandstones). Palaeogene flood basalts and lava-derived sedimentary clays cover a large portion of the Midland Valley and Grampian basement rocks in the northeast of the country. The advance of ice sheets and their meltwaters over the last 100,000 years has left at least 80 % of the bedrock covered by superficial deposits, including glacial till and post-glacial alluvium and peat (Fig. 1b). Monitoring peat coverage has become important in calculating soil carbon stocks due to the relatively high carbon density of peat and organic-rich soils. This is particularly important for Ireland (and other Northern European countries), where some 16 % of the land surface is covered by peat bog. In Northern Ireland, previous work has estimated the total amount of carbon stored in vegetation to be 4.4 Mt compared with 386 Mt stored within soils such as peat (Cruickshank et al. 1998).
A number of comprehensive regional and national soil sampling programmes have been completed across the UK, including the British Geological Survey (BGS) G-BASE survey and regional soil surveys for Northern Ireland (Jordan et al. 2001). These have been used to assess baseline element concentrations in soils and normal background concentrations of contaminants (for example Ander et al. 2013). The Northern Ireland Tellus Survey (GSNI 2007;Young and Donald 2013) included a ground-based geochemical survey in which 6862 rural soil samples were collected between 1994 and 2006. Samples were collected on a grid of one sample site every 2 km 2 , with soils being collected at depths of 20 and 50 cm. The samples were analysed by X-ray fluorescence spectrometry (XRFS) for 60 elements and inorganic compounds. Tellus Survey field methods and analytical methodology are described in Smyth (2007) and Young and Donald (2013). Soils and parent geology across Northern Ireland are typical for the geological and pedological conditions across the UK (Jordan et al. 2001) as well as for several parts of Northern Europe, therefore the Tellus data set provides the basis for a comprehensive study with relevancy not only for the whole of the UK, but also Northern Europe. This paper explores the extent to which soil geochemistry can be used to classify the underlying geology and moreover differentiate superficial deposits such as peat.

Methods
Previous work by Grunsky et al. (2014) has demonstrated the effectiveness of applying log-ratio transforms, principal component analysis (PCA) and linear discriminant analysis (LDA) to differentiate lithologies and glacial processes. This study involves the use of both process discovery and validation methods within a compositional multivariate approach to analyse the rural Tellus soil data sampled at 20-cm depth. Many geochemical data sets contain values that are reported at less than the lower limit of detection, and these values are generally termed 'censored' (Grunsky 2010). For this study, published detection limits (Young and Donald 2013) were used to impute censored data; this resulted in 47 elements (Al, As, Ba, Bi, Br, Ca, Cd, Ce, Cl, Co, Cr, Cs, Cu, Fe, Ga, Ge, Hf, I, K, La, Mg, Mn, Mo, Na, Nb, Nd, Ni, P, Pb, Rb, Sc, Se, Si, Sm, Sn, Sr, Th, Ti, Tl, U, V, W, Y, Yb, Zn, Zr and total C) available for initial analysis. Elements that occupy the same sites and/or substitute for other elements in mineral structures can display collinearity when estimating statistical moments. In many cases, the elements behave in identical ways and these elements can be summed to a single variable without any loss of information or structure in the data. This is often the case for rare-earth elements, and the degree of collinearity can be observed in the loadings of the variables in a principal component biplot. An application of clr-based PCA showed that several elements were near-collinear and prompted an amalgamation of the data based on the following proxies: light rare-earth elements (LREE), La, Ce, Nd along with Th; heavy rare-earth elements (HREE), Yb and Hf; elements associated with mafic minerals (Mafic), Fe, V, Cr and Sc; alkali elements associated with feldspars (Ba_Na_K_Rb), Ba, Na, K and Rb; and lastly elements associated with a feldspar silicate framework (Al_Ga), Al and Ga. This reduced the total number of elements and amalgamated elements to 35, and these were the basis of all further analysis.
Each soil sample site was assigned to the regional geology map and then age brackets provided by Geological Survey Northern Ireland (GSNI) using the dominant  Table 1 age bracket for the map polygon ( Fig. 2; Table 1). A process discovery approach (Grunsky 2010) involves use of unsupervised multivariate methods such as principal component analysis (PCA). An essential part of the process discovery phase is a suitable choice of transform to overcome the problem of closure. Therefore, to account for the requirements of compositional data analysis and avoid closure issues, the geochemical data were transformed using centred log-ratios (clr, Aitchison 1986).
Analysis of variance (ANOVA) is a commonly used method for determining the statistical distinctiveness of groups based on the observations and variables (Grunsky and Kjarsgaard 2016). The results will determine whether or not there is sufficient statistical separation of the groups to undertake a classification such as linear discriminant analysis (LDA, Venables and Ripley 2002). Garrett (1989) suggests that the ratio of observations to the total number of variables and groups should be a minimum of 3:1 and ideally greater than 10:1 (Garrett 1989). To ensure that this requirement is fulfilled, data compression was applied. Two potential compression methods are the use of principal components and minimum/maximum autocorrelation (MAF) factors. The use of principal components to represent the variables is well established. However, in this approach, spatial relationships between the variables are ignored. One way to address this failure is to use minimum/maximum autocorrelation factors instead (Mueller and Grunsky 2016). These are derived from the given data by first performing a principal component analysis and then a rotation based on the spectral decomposition of a variance-covariance matrix of the PCA-transformed and scaled data at a lag (h), typically chosen to represent the sample spacing (Desbarats and Dimitrakopoulos 2000). The MAF transformation can also be seen as the solution of a generalised eigenvalue problem (Bandarian and Mueller 2008), and this is the approach taken here. For both PCA and MAF, the decision on the number of factors to use to represent the data is determined experimentally based on the behaviour of the F-values in the ANOVA. Following the data compression step, n-fold cross-validation LDA can be used to derive posterior probabilities of class membership. Since these probabilities are compositional (van den Boogaart and Tolosana-Delgado 2006; Egozcue and Pawlowsky-Glahn 2016), an appropriate transformation is required prior to their interpolation in order to ensure that estimates are non-negative and adhere to the constant sum constraint. In this study, ordinary co-kriging was applied to the isometric log-ratio (ilr)-scores of the posterior probabilities, and this step was followed by back-transformation to interpolated probabilities.

Results and Discussion
Based on the geological age bracket classification, without peat, PCA showed that 72 % of the variation was determined by the first four principal components (PCs), implying 'significant' structure in the data. Analysis of variance showed that only 10 PCs were necessary to classify the soil geochemical data. To consider an improvement over PCA that uses the spatial relationships of the data, a classification based on MAF analysis was undertaken using the first eight dominant factors (Fig. 3a). Twenty-fold cross-validation LDA, using PCA and MAF, resulted in overall classification accuracy of 68.4 % (PCA) and 74.7 % (MAF) for the geological age bracket classification. More specifically, except for PlDv and PlOr, the use of MAF and cross-validation LDA resulted in improvements in both recall (quotient of total number of correctly classified sites and total number of sites in the class) and precision (quotient of total number of correctly classified sites and total number of sites allocated to the class), particularly so for the classes with low number of sites (Table 2; Fig. 4).
The mapped posterior probabilities for six of the age bracket classes (CzPl, NeoP, PlCr, PlDv, PlOr and PlSi) are shown in Fig. 5. In the case of the CzPl and PlSi age brackets (Fig. 5a, f), the locations of non-zero probability are quite well constrained and the probabilities of occurrence are generally very high. The Palaeocene Antrim Lava Group, within the CzPl age bracket class, dominates the landscape of the North West of Ireland comprising the largest remnant of the North Atlantic Igneous Province (Cooper  and Mitchell 2013). Petrographical studies demonstrate the consistent geochemical signature of the fine-grained, olivine basalt lava formations. This helps to explain the well-constrained and high probabilities of occurrence for this age bracket. Likewise the PlSi class (Fig. 5f) closely constitutes the Caledonian granodiorite igneous complex. In contrast, the PlDv class (Fig. 5d), which is also quite well constrained, has much lower probabilities of occurrence. This class, which has 296 tagged samples, represents a broad range of Palaeozoic rocks representing a fault-bounded non-marine sedimentary basin sequence of interbedded sandstones, mudstones and conglomerates. The spatial maps of co-kriging estimates of the posterior probabilities, shown for the same age  Table 1 bracket classifications (Fig. 6), consolidate the findings from the mapped posterior probabilities (Fig. 5).

Peat Assessment
Since the aim of this study is to elucidate the relationship between soil geochemistry and post-glacial deposits for environmental monitoring, the next stage of the analysis explored whether peat cover could be predicted from the classification. For this, the geological age bracket designation was adapted to include the presence of peat based on GSNI superficial deposit polygons (Fig. 1b) and linear discriminant analysis (LDA) undertaken. As was the case for the classification based on the age brackets only, 10 PCA factors and 8 MAF factors were sufficient for the classification via LDA (Fig. 3b).  Table 1 Using LDA, the presence of peat was clearly differentiated from the other classifications (Fig. 7). The classification accuracies for the age bracket classes, including peat, were 65.0 % (PCA) and 69.9 % (MAF). The prediction accuracies for both data types are lower than for those where peat was not included as a class. The presence of peat results in more classification confusion, most likely because the peat composition retains geochemical characteristics of the underlying lithology and the organic composition of the peat itself. The plotted MAF posterior probabilities (Fig. 8) demonstrate a good match between the reported peat areas (Fig. 1b) and the highest probability for peat. However, there are areas of mapped peat where the estimated probabilities of occurrence of peat are low. The explanation for this misclassification of peat may be twofold: these areas may indicate degradation of peat-covered areas since the creation of the superficial deposit classification, or a further refinement in the classification of peat is required. Irish peatland is divided into blanket peatland (approximately 85 %)   Tomlinson and Davidson 2000). Blanket bogs typically form on gentle slopes within upland regions [>315 m above sea level (SL); Hamilton 1982]. The distribution of blanket bogs is more spatially continuous and associated with areas of high precipitation (rainfall exceeding 1200 mm). Raised bogs develop primarily in lowland areas (<200 m above SL; Wheeler and Shaw 1995) where accumulating peat in fens becomes isolated from the groundwater supply. This process of accumulation gradually forms a dome of ombrogenous peat above the fen, giving raised bogs a distinct topography, with steep margins to the main bog expanse. Raised bogs are more limited in extent and occur as isolated features. Prediction of peat-covered areas using MAF analysis methods, which use the spatial relationships of the data, has been more successful in predicting the more extensive upland blanket bogs than lowland raised bogs.

Conclusions
Compositional multivariate techniques, PCA and MAF analysis methods were used to determine the influence of underlying geology on the soil geochemistry signature. The approach was explored for environmental monitoring of peat to ascertain whether peat cover could be predicted from the classification. Using LDA, the presence of peat was clearly differentiated from the other lithological classifications. Moreover, the prediction accuracy for LDA classification improved using MAF analysis. In an attempt to reduce the number of areas of misclassification of peat, further work will examine the influence of underlying lithologies on elemental concentrations in peat composition and the effect of this in classification analysis.