Challenges in Big Data analysis arise due to the way the data are recorded, maintained, processed and stored. We demonstrate that a hierarchical, multivariate, statistical machine learning algorithm, namely Boosted Regression Tree (BRT) can address Big Data challenges to drive decision making. The challenge of this study is lack of interoperability since the data, a collection of GIS shapefiles, remotely sensed imagery, and aggregated and interpolated spatio-temporal information, are stored in monolithic hardware components. For the modelling process, it was necessary to create one common input file. By merging the data sources together, a structured but noisy input file, showing inconsistencies and redundancies, was created. Here, it is shown that BRT can process different data granularities, heterogeneous data and missingness. In particular, BRT ha s the advantage of dealing with missing data by default by allowing a split on whether or not a value is missing as well as what the value is. Most importantly, the BRT offers a wide range of possibilities regarding the interpretation of results and variable selection is automatically performed by considering how frequently a variable is used to define a split in the tree. A comparison with two similar regression models (Random Forests and Least Absolute Shrinkage and Selection Operator, LASSO) show s that BRT outperforms these in this instance. BRT can also be a starting point for sophisticated hierarchical modelling in real world scenarios. For example, a single or ensemble approach of BRT could be tested with existing models in order to improve results for a wide range of data-driven decisions and applications.
Data are typically stored in various ways and various formats, mostly in monolithic software architectures which do not allow for interoperability. Analysis of data across multiple data sources is thus difficult, since the functionality of the single data sources with respect to input and output, maintenance, data processing, error handling and user interface is all interwoven and acts as architecturally separate components. In order to create a basis for analysing the data considered here, it was required to extract the datasets from their original databases and combine them to form a common input file for the modelling process. It was therefore inevitable that this resulted in a data file structure which showed missing data, inconsistencies, duplicates and redundancies.
A case study is presented here to examine land use data sourced from a GIS, direct observations from an agricultural company, and remotely sensed data. The data were extracted from a relational database, Excel spreadsheets, remotely sensed imagery stored as raster data, and vector data from a Geographic Information System (GIS), directly observed and measured data in real-time and interpolated data. By combining these data sources to form one common basis for our analysis, issues of data volume, variety and veracity were encountered. Big Data research clearly deals with issues beyond volume and belongs not only to the ongoing digital revolution, but to the scientific revolution as well. The question posed of Big Data and illustrated in the case study presented here, is whether new knowledge can be extracted from various data sources that haven’t been analysed in combination before, and can thus assist in a better and more confident decision making.
There is an exponential increase in interest in the use of digital data to improve decision making in a range of areas such as human systems, urban environments, agriculture and national security. For example, decisions in the agricultural domain may require information based on vegetation or land use change, estimation of crops or biomass, distribution of native or exotic species, livestock or weed assessment and so on. One source of digital data that has generated intense interest over the past decades is remotely sensed imagery. These data are available from a wide range of sources, ranging from satellites to drones, and have been used for a very wide range of environmental applications [
The availability and resolution of these data, combined with improved computer storage and data management facilities, have greatly increased the opportunity for mathematicians and statisticians to utilise this information in their models and analyses. The challenge in linking remotely sensed data to decision- making is that there are multiple steps in the process. Here, we focus on an exemplar real-world problem in the livestock industry: deciding on the allocation of animals to different paddocks and potentially different grazing properties based on the predicted availability of grass over the year. This problem arose in the context of collaboration between statisticians at the Queensland University of Technology and a large livestock organisation in Australia. The specific aim of the project was to develop an ensemble of models to predict the carrying capacity, that is, the number of animals that can be sustained on a paddock. In order to achieve this goal we utilised remote sensing data and supporting information about climate and paddock characteristics. Further, it was important to present the results in a form that is useful for the agricultural decision makers.
Difficult or challenging decisions demand a thorough consideration and even then they imply uncertainty, complexity and different levels of risk. Making the right decisions at the right time can lead to success, increase of profit or minimisation of risk. It is thus important that thoughtful considerations are put into each decision.
In this article we focus on one component of the ensemble modelling approach employed in the project, namely the use of BRT to estimate so-called animal equivalents per paddock. Since calves, cows and bulls of different ages consume different amounts of grass, these animals are standardised to a reference animal which can then be used as a common response variable in the analysis. An interesting conundrum is that one of the major inputs into such a model is the amount of grass, or more generally the biomass, in a paddock. This can potentially be estimated directly from remote sensing, but is confounded by the fact that animals are on the paddock eating the very thing that is being measured by the sensor. Moreover, the decision maker may be interested in the biomass estimates themselves, either directly via the remotely sensed measurements or indirectly via the animal equivalents based on animal weight and metabolic formula.
A BRT is a popular statistical and machine learning approach that has not yet seen much application in the analysis of remotely sensed data. Indeed, although they were first defined two decades ago, BRT has only recently been extended to deal with the types of features that are characteristic of remotely sensed data, in particular its spatial and temporal dynamics. Most of the activity around the use of BRT for agricultural and environmental applications does not appear in the mainstream mathematical and statistical literature.
The study area is located in the Northern Territory, Australia. The main climate zone is identified as grassland with hot dry summers and mild winters [
There is a strong advantage in using remotely sensed Landsat imagery and applied spectroscopy for these types of analyses because the data are freely available, the imagery covers a wide geographical range, and it avoids expensive, extensive and often impractical in-situ measurement. However, the trade-off is in resolution: in-situ measurements provide highly localised accuracy whereas a pixel in a Landsat image covers an area of 30 ´ 30 meters. It is noted that other satellites are now able to provide higher resolution, but these are not yet freely available for the areas of interest in this case study.
Estimation of biomass using satellite data is of ongoing global interest. Grass biomass estimation is challenging since the phenological growing cycle of naturally existing grass is a dynamic process influenced by many complex parameters, including grass type, soil, climate, topography and land use. With the spectral information of remotely sensed imagery it is possible to detect green vegetation, which is driven by the photosynthetic biochemical process of grass biomass. However, since raster imagery is only a two dimensional representation of the land cover it is difficult to derive the quantity of the vertical grass biomass directly.
Fractional cover [
In addition to fractional cover Vegetation Indices (VI) are commonly used to extract meaningful information out of the imagery through image analysis techniques. To calculate VIs it is common to apply arithmetical methods in order to create additional artificial channels using existing spectral bands of the imagery. Other related data were also available to support the analyses. For example, SILO (Scientific Information for Land Owners) is a database of historical climate records for Australia. SILO provides daily datasets for a range of climate variables and in formats suitable for a variety of applications. In addition, SILO datasets are constructed from observational records provided by the Bureau of Meteorology (BOM). As another example, the AussieGRASS spatial framework includes inputs of key climate variables (rainfall, evaporation, temperature, vapour pressure and solar radiation), soil and pasture types, tree and shrub cover, domestic livestock and other herbivore numbers. The derived results of AussieGRASS data are spatially interpolated to construct gridded datasets on a regular grid (approximately 5 ´ 5 km) across Australia [
The analysis of relationships in ecological data sets is not trivial [
Another challenging characteristic of remotely sensed data is missing information. There are two major considerations in dealing with this issue. The first is dealing with the missing values. Common options are to filter them out [
For our case study we performed a number of pre-processing steps to prepare our data for the modelling process, namely data aggregation and data reduction for our predictor variables, as well as calculation of the response variable. Instead of working with single pixel values we reduced the volume of data by deriving descriptive statistics from Landsat, MODIS and SILO data, thereby obtaining paddock specific means, medians, first quartile, third quartile, variance and Shannon Entropy. With respect to our response variable, we aggregated real- time measurements to a monthly mean. In the next step we created a test and a training data set by partitioning the data to 20% and 80% respectively. The training set was used to estimate the model parameters. The test set was used for model performance evaluation on unseen data.
Boosted Regression Trees (BRT), also known as Gradient Boosted Machine (GBM) or Stochastic Gradient Boosting (SGB), are non-parametric regression techniques that combine a regression tree with a boosting algorithm [
A regression tree partitions the data with a hierarchy of binary splits that define regions of the covariate space in which the response variable has similar values. These splits are defined by rules, distance metrics or information gain. The choice of variables and the value at which the split point occurs is determined in a recursive manner at each stage of the tree construction. The segmentation can be depicted as a tree-like structure, comprising nodes representing the selected factors, branches acting as if-else connectors between the nodes, and leaves representing terminal nodes containing the subsets of responses [
Boosting improves the performance of a simple base-learner by reweighting observations that were misclassified or had large residual errors in the previous iteration. The deeper we grow the tree, the more segments we can accommodate and thus more variance can be explained. This results in higher model complexity and therefore higher risk of overfitting the model to the data.
The motivation behind Boosting is that each tree can be quite shallow (a weak classifier) and thus fast to estimate, but by combining the predictive power of many weak classifiers, a classifier of arbitrary accuracy and precision can be created [
In this section we give a brief summary of the method, following Friedman [
F * ( x ) = arg min F ( x ) E y , x ψ ( y , F ( x ) ) . (1)
Boosting approximates F * ( x ) by an “additive” expansion in the form of
F ( x ) ∑ m = 0 M β m h ( x ; a m ) , (2)
where the functions h ( x ; a ) are generally simple functions of x with parameters a = { a 1 , a 2 , ⋯ } . The parameters { a m } 0 M and the expansion coefficients { β m } 0 M are jointly fit to the training data. This is done in a forward stage wise manner. Gradient Boosting [
y ˜ i m = − [ ∂ ψ ( y i , F ( x i ) ) ∂ F ( x i ) ] F ( x ) = F m − 1 ( x ) (3)
which represent the residuals from the given stage of the tree building.
Then, given h ( x ; a m ) , the optimal value of the coefficient β m is calculated via
β m = arg min β ∑ i = 1 N ψ ( y i , F m − 1 ( x i ) + β h ( x i ; a m ) ) . (4)
Gradient Tree Boosting performs this with a base learner h ( x ; a ) of an L terminal node regression tree. A regression tree partitions the feature space into L disjoint regions { R l m } l − 1 L and predicts a separate constant value at each iteration m.
h ( x ; { R l m } ) 1 L = ∑ l − 1 L y ¯ l m 1 ( x ∈ R l m ) . (5)
The parameters of the base learner are the splitting variables and corresponding split points that define the tree, and this defines the corresponding regions { R l m } 1 L of the partition at each iteration. These are accomplished in a top-down “best-first” approach using a least squares splitting measure [
γ l m = arg min γ ∑ x i ∈ R l m ψ ( y i , F m − 1 ( x i ) + γ ) . (6)
Next, the current approximation F m − 1 ( x ) is individually updated in all of the corresponding regions
F m ( x ) = F m − 1 ( x ) + ν ⋅ γ l m 1 ( x ∈ R l m ) . (7)
Friedman [
replacement. This subsample is then used to fit the base learner and compute the model update for the current iteration. By adding randomness to the algorithm the performance of gradient boosting was improved and this resulted in the stochastic Gradient Boosting Machine (GBM) [
The data were presented as a set { ( x i , y i ) | 0 ≤ i < n samples } with feature vector x i ∈ ℝ n features , and the response y i ∈ ℝ . All the data we used for our case study were combined into a structured comma-separated values (CSV) file that consisted of 209 observations and 141 covariates. The machine friendly notation of our covariates are generated in the following manner. There are in total 5 different components for creating the covariate names. The first shows whether the calculated summary statistics are for monthly values of EOLW/D = end of last wet/dry, or WS = wet season; these are then followed by whether it is an aggregated mean, minimum red or maximum monthly values, followed by the nature of the descriptive statistic: first quartile, median, mean, third quartile, variance and Shannon Entropy; next comes the name of data source (e.g. rain = SILO data), and lastly the corresponding area in proximity to water (3 km, 5 km, 99 km = whole paddock). The covariate name of paha.99km/5km stores values for the whole paddock area measured in hectare and the proximity of water e.g. 5 km radius or 99 km for the whole extent of the paddock. As described in 2.2, the data set was partitioned by treating 80% as training data and the remaining 20% as test data, resulting in 167 training and 42 test observations.
The computational environment was the R statistical modelling software version 3.3.3 [
One way of showing the complex relationships of the joint probability and contribution of each covariate in describing the response is through a relative influence plot. Relative influence measures are calculated by averaging the number of times a covariate is used for splitting, weighted by the squared improvement to the model as the result of each split. It is then scaled so the values sum to 100.
In
Regularisation methods are used to constrain the fitting procedure so that it balances model fit and predictive performance [
The bias-variance trade-off goal is to find the optimal number of trees where the bias and the variance are balanced and the error is minimised, since both under- and overfitting will have a negative effect on the predictive performance of the model.
Histograms of the residuals for the test and training sets are shown in
One of the biggest advantages in using a BRT is that it can handle missing values in the predictors by default. As part of the model diagnostics, we can plot how the data have been split, to which node they have been assigned, and the reduction in error for this single iteration/tree. If the tree is challenged with data that are missing a variable, the split is decided based on a surrogate variable, typically one that has a high correlation with non-missing observations.
The R function pretty.gbm.tree() returns a data frame in which each row corresponds to a node in the tree (
to the missing node 3. If the node is a terminal node then this is the prediction. The error reduction (29.73) indicates the reduction in the loss function as a result of splitting this node and there were 466 weights (weights will be on each node) in the root node. The weight indicates the total weight of observations in the node. The last column prediction of −0.000019659 denotes the value assigned to all values at this node before the point was split. The prediction column refers to individual trees and they are fit to predict the gradient of the loss function evaluated in the current prediction and the response. This is the gradient part of Gradient Boosting.
In this case study we demonstrated that BRT is able to address Big Data challenges, produce satisfying results and can deal with missing values by default. In addition, we obtained in-depth knowledge of the diverse and heterogeneous data sources used in this study, and identified key covariates that were most influential in describing the response variable. Further, descriptive statistics has been used to quantitatively describe our data and basic features of it by providing summaries that enables us to present our results in a meaningful way and therefore allowed for a simpler interpretation. The histograms of training and test data showed us the underlying frequency distribution of our continuous data. In this case both histograms are left skewed and demonstrate that the majority of data can be found on the left hand side. Because histograms use bins to display data it is not possible to see exactly what the specific values are for the minimum and maximum. However, we can see an approximation of the range of values, see how spread out the data are and that there are not outliers that we need to take care of. One of the biggest disadvantages of BRT is, that they are prone to overfit the data, thus appropriate settings for the hyperparameters need to be set in order to control the model building process. It is therefore advisable to tune the model hyperparameters as part of a pre-processing step in an iterative manner prior to performing the final modelling.
There are many features of BRT that are advantageous for the problem considered here. In addition to computational speed and accuracy of estimation, they can describe complex nonlinearities and interactions between variables, accommodate missing data, include different types of input variables without the need for transformations, perform well in high-dimensional problems, and allow for different loss functions such as accurate identification of small areas of interest. Moreover, they can be visualised and interpreted easily, thus facilitating the translation of the analytic results to decision makers [
There are several challenges in using BRT for this case study. First, the volume of one single satellite imagery is quite high even without aggregating or combining them in a dense time series. One Landsat satellite scene covers 185 ´ 185 km of land and has a file size of about 300 MB. The temporal resolution of Landsat is on average 16 days; thus, in one year there are 22 scenes of the same area to computationally process, analyse and store, a data volume of about 6.6 GB. Examination of several years of satellite imagery yields in enormous geo-temporal datasets. Given these specifications, a substantive challenge is the storage, processing and management of massive volumes of raster data information. This challenge is exacerbated when the other input variables are also considered, especially since these are of different data formats, sources, structure and spatial granularities. In order to decrease the volume we calculated descriptive statistics based on individual paddock information instead of using pixel information for our analysis.
The second challenge is determining the geographic area to include in the statistical models. The region of interest is spread over multiple stations, with multiple paddocks per station. However, not all of the land in a paddock is grazable. Jansen [
The third challenge is incorporating spatial information with disjoint geographic areas (agricultural properties or stations), each of which comprises regions (paddocks) of varying sizes. In the case study, the provided information was typically in the form of summary values per paddock per month. Seasonal (wet and dry) indicators were also used to help quantify the biomass [
There is a large literature on the predictive, methodological and computational properties of decision trees, including the Random Forest (RF) and Boosted Regression Tree (BRT) models used in this paper. The predictive accuracy of these methods has been investigated both theoretically [
This article has focused on the use of a modern statistical machine learning technique, namely Boosted Regression Trees, to address a challenging real world problem in industry. We presented and demonstrated the efficiency of BRT for addressing Big Data properties with environmental data, specifically remotely sensed data for decision making. There are, of course, other methods that could be used for this type of problem. An appealing alternative that also deals with big, noisy and spatial data is the Bayesian additive regression model [
This research was undertaken through the Australian Cooperative Research Centre for Spatial Information with the Project number P4.101. We thank our colleagues for their insight and expertise, particularly Dr. Michael Schmidt (Department of Science, Information Technology and Innovation) for providing key publications of the Wambiana Grazing Trial and helpful discussions.
Colin, B., Clifford, S., Wu, P., Rathmanner, S. and Mengersen, K. (2017) Using Boosted Regression Trees and Remotely Sensed Data to Drive Decision-Making. Open Journal of Statistics, 7, 859-875. https://doi.org/10.4236/ojs.2017.75061