(This article was first published on Bart Rogiers - Sreigor, and kindly contributed to R-bloggers)
Various approaches exist to relate saturated hydraulic conductivity (Ks) to grain-size data. Most methods use a single grain-size parameter and hence omit the information encompassed by the entire grain-size distribution. This study compares two data-driven modelling methods—multiple linear regression and artificial neural networks—that use the entire grain-size distribution data as input for Ks prediction. Besides the predictive capacity of the methods, the uncertainty associated with the model predictions is also evaluated, since such information is important for stochastic groundwater flow and contaminant transport modelling.

Keywords: Early stopping - Cross-validation - Generalised likelihood uncertainty estimation - Artificial neural networks - Sedimentary aquifer - Data-driven modelling - Likelihood measures - Principal component analysis - GLUE-ANN
Rogiers, B., Mallants, D., Batelaan, O., Gedeon, M., Huysmans, M., Dassargues, A. 2012. Estimation of hydraulic conductivity and its uncertainty from grain-size data using GLUE and artificial neural networks. Mathematical Geosciences. doi: 10.1007/s11004-012-9409-2.
An example script for performing such analysis in the R language is provided below:(see also at http://goo.gl/xn0p4)
R script with examples of GLUE-ANN analyses:
R functions for GLUE-ANN analysis with the AMORE package (Castejón Limas et al 2010):
R script with examples of GLUE-ANN analyses:
################################################################################
### GLUE-ANN parameters + examples (Rogiers et al. 2012) ###
################################################################################
# A dataframe has to be provided with the independent variable(s), and a vector
# with the dependent variable. A list of parameters has to be set to define the
# a priori distributions of the stochastic parameters, and fixed values for the
# ones considered to be constant.
# Be sure to load the additional GLUE-ANN functions into R before trying
# the examples provided below. The AMORE package should be installed and loaded:
install.packages('AMORE')
library('AMORE')
### Parameter list #############################################################
# A list of all required parameters with their default values.
### Pre-processing parameters ################################################
# Orthogonalise input data; options are: 'Cov', 'Cor', and 'No'
orthogonal <- 'Cor'
# Number of input variables, provide vector for stochastic treatment
variableNumber <- c(1:ncol(inputData))
# T: Select always most variance explaining PCs; F: Choose PCs randomly
dimRed <- T
# Rescale output to 0-1 range
rescale.output <- T
### ANN parameters ###########################################################
# Enable early stopping
ES <- T
# Part of the data used for early stopping
randomESPart <- 0.5
# Early stopping samples should have a similar mean as the training samples
nonEqualMeans <- T
# Maximum difference between early stopping and training observation means
maxMeanDiff <- mean(outputData)*0.2
# maximum number of training cycles, provide a vector for stochastic treatment
nCycles <- 100
# Amount of hidden layers, provide vector for stochastic treatment
HLrange <- c(1:1)
# Amount of hidden nodes, provide vector for stochastic treatment
HNrange <- c(1:10)
# Maximum amount of hidden nodes per input variable
maxHNPerVar <- Inf
# Hidden layer activation function; options are 'tansig', 'sigmoid', 'purelin'
hidden.layer <- 'tansig'
# Output layer activation function; options are 'tansig', 'sigmoid', 'purelin'
output.layer <- 'sigmoid'
### GLUE parameters ##########################################################
# Number of models in the ensemble
nSets <- 100
# Number of random variates for weights based on Stedinger et al (2008)
nStedinger <- 50
# GLUE weighting procedure; options are 'MSE', 'MSEes' (for weights based on
# the early stopping subset performance), 'NSEff' and 'Stedinger'
weighting <- 'Stedinger'
### Cross-validation parameters ##############################################
# Use leave-one-out cross-validation
cv <- T
### Examples ###################################################################
### Example of training and prediction for fictional data ####################
# Create fictional data for x range [1,100]
inputData <- data.frame(x=seq(1,100,0.5))
# Dependent variable y = x^2
outputData <- inputData$x ^ 2
# add random noise to y, representing different sources of error
outputData <- jitter(outputData, amount=mean(outputData)*0.2)
# Run GLUE-ANN functions
glue.ann.ensemble <- train.glue.ann(inputData, outputData, nCycles=500)
training.results <- predict.glue.ann(inputData, glue.ann.ensemble)
# Visualise results for training data
plot(inputData$x, training.results$eMean, type='l', xlab='x', ylab=
expression(y = x^2 + epsilon), ylim=c(-2000,13000))
lines(inputData$x, glue.ann.ensemble$outputData, lty=2, col='red')
lines(inputData$x, training.results$eQ025, lty=3, col='blue')
lines(inputData$x, training.results$eQ975, lty=3, col='blue')
legend('topleft', c('GLUE-ANN ensemble mean prediction','Observations',
'95% prediction uncertainty'), lty=c(1,2,3),col=c('black','red','blue'),
bty='n')
# Predict new data
newData <- data.frame(x=c(-100:200)+0.5)
newData.predictions <- predict.glue.ann(newData, glue.ann.ensemble)
# Visualise results of prediction
plot(newData$x, newData.predictions$eMean, type='l', xlab='x', ylab=
expression(y = x^2 + epsilon), ylim=c(-2000,13000))
lines(newData$x, jitter(newData$x^2, amount=mean(outputData)*0.2) , lty=2,
col='red')
lines(newData$x, newData.predictions$eQ025, lty=3, col='blue')
lines(newData$x, newData.predictions$eQ975, lty=3, col='blue')
abline(v=c(0,100))
text(50,13000,'Training data range')
legend('bottomright', c('GLUE-ANN ensemble mean prediction','Observations',
'95% prediction uncertainty'), lty=c(1,2,3),col=c('black','red','blue'),
bty='n')
### Example of cross-validation with an R dataset ############################
# Read input and output data (ozone concentration in function of wind,
# temperature and solar radiation)
outputData <- na.omit(airquality)[,1]
inputData <- na.omit(airquality)[,2:4]
# Perform cross-validation
cross_validation <- train.glue.ann(inputData, outputData, cv=T, nCycles=1000
, nSets=10)
# Visualisation
plot(cross_validation$target, ylim=c(-100,200), xlab='Sample number', ylab=
'Ozone')
points(cross_validation$eMean, pch=2, col='red')
lines(cross_validation$eQ025, lty=3, col='blue')
lines(cross_validation$eQ975, lty=3, col='blue')
legend('bottomright', c('Observations','GLUE-ANN ensemble mean prediction',
'95% prediction uncertainty'), pch=c(1,2,NA), lty=c(NA,NA,3),col=c('black',
'red','blue'),bty='n')
################################################################################
### GLUE-ANN - functions (Rogiers et al. 2012) ###
################################################################################
### ANN training function ######################################################
nncalc <- function(Ptrain, Pval, targetTrain, targetVal, architecture, n, Stao=1
, es=T, hidden.layer="sigmoid", output.layer="purelin")
{
neuralnetwork <- newff(n.neurons=architecture, learning.rate.global=1e-2,
momentum.global=0.5, error.criterium = 'LMS', Stao = Stao, hidden.layer=
hidden.layer, output.layer=output.layer, method="ADAPTgdwm")
if(!es) { Pval <- NULL; targetVal <- NULL }
capture.output(results <- train(neuralnetwork, Ptrain, Stao=Stao, Pval=Pval,
Tval=targetVal,targetTrain, error.criterium = 'LMS', report=T, show.step=1,
n.shows=n ))
return(results)
}
### Sampling parameter vectors function ########################################
resample <- function(x, n, ...) x[sample.int(length(x), size=n, ...)]
### Model ensemble training function ###########################################
train.glue.ann <- function(inputData,outputData,orthogonal='Cor',variableNumber=
c(1:ncol(inputData)),dimRed=T,rescale.output=T,ES=T,randomESPart=0.5,
nonEqualMeans=T,maxMeanDiff=mean(outputData)*0.2,nCycles=100,HLrange=c(1:1),
HNrange=c(1:10),maxHNPerVar=Inf,hidden.layer='tansig',output.layer='sigmoid',
nSets=100,nStedinger=50,weighting='Stedinger',cv=F)
{
### Initialisation ###########################################################
parameters <- NULL
parameters$orthogonal <- orthogonal
parameters$variableNumber <- variableNumber
parameters$dimRed <- dimRed
parameters$rescale.output <- rescale.output
parameters$ES <- ES
parameters$randomESPart <- randomESPart
parameters$nonEqualMeans <- nonEqualMeans
parameters$maxMeanDiff <- maxMeanDiff
parameters$nCycles <- nCycles
parameters$HLrange <- HLrange
parameters$HNrange <- HNrange
parameters$maxHNPerVar <- maxHNPerVar
parameters$hidden.layer <- hidden.layer
parameters$output.layer <- output.layer
parameters$nSets <- nSets
parameters$nStedinger <- nStedinger
parameters$weighting <- weighting
parameters$cv <- cv
glue.ann <- NULL
cvPredictions <- NULL
glue.ann$parameters <- parameters
glue.ann$esNumber <- matrix(nrow=nrow(inputData), ncol=nSets)
### Training loop ############################################################
for(crossValidationSampleNumber in 1:ifelse(cv,nrow(inputData),1))
{
if(cv)
{
inputDataModellingCV <- as.data.frame(
inputData[crossValidationSampleNumber,])
inputDataModelling <- as.data.frame(
inputData[-crossValidationSampleNumber,])
outputDataModellingCV <- outputData[crossValidationSampleNumber]
outputDataModelling <- outputData[-crossValidationSampleNumber]
} else {
inputDataModelling <- inputData
outputDataModelling <- outputData
}
names(inputDataModelling) <- c(1:ncol(inputDataModelling))
if(cv) names(inputDataModellingCV) <- c(1:ncol(inputDataModellingCV))
glue.ann$inputData <- inputDataModelling
glue.ann$outputData <- outputDataModelling
P <- cbind(inputDataModelling)
if(orthogonal != 'No'){Ppca <- princomp(P, cor=ifelse(orthogonal=='Cor', T,
F));P <- Ppca$scores}
trans <- P
glue.ann$P <- P
for(i in 1:ncol(inputData)) trans[,i] <- (trans[,i]-mean(P[,i]))/sd(P[,i])
if(rescale.output) outputDataModelling <- (outputDataModelling - (
min(glue.ann$outputData)))/(((max(glue.ann$outputData))-(
min(glue.ann$outputData))))
if(rescale.output & cv) outputDataModellingCV <- (outputDataModellingCV -(
min(glue.ann$outputData)))/(((max(glue.ann$outputData))-(
min(glue.ann$outputData))))
variableslist <- list();networks <- list(); esNumber <- NULL
earlyStoppingMSE <- NULL; earlyStoppingCycle <- NULL; hn <- list()
totalVar <- NULL; trainingMSE <- NULL
for(j in 1:nSets)
{
if(cv) cat(paste('# Cross-validating sample',crossValidationSampleNumber,
' '))
esDataNumbers <- sample(1:nrow(inputDataModelling),
round(nrow(inputDataModelling)*randomESPart,0))
while(nonEqualMeans)
{
esDataNumbers <- sample(1:nrow(inputDataModelling),
round(nrow(inputDataModelling)*randomESPart,0))
esMean <- mean(outputDataModelling[esDataNumbers])
trainMean <- mean(outputDataModelling[-esDataNumbers])
if(abs(esMean-trainMean) < maxMeanDiff) nonEqualMeans <- F
}
trainingDataNumbers <- c(1:nrow(inputDataModelling))[-esDataNumbers]
if(!ES) {trainingDataNumbers <- c(1:nrow(inputDataModelling))
esDataNumbers <- trainingDataNumbers}
nvars <- resample(variableNumber,1)
variables <- resample(c(1:ncol(inputDataModelling)),nvars)
if(dimRed) variables <- c(1:length(variables))
variableslist[[j]] <- variables
hiddenLayers <- resample(HLrange,1, replace=T)
hiddenNodes <- resample(HNrange,hiddenLayers, replace=T)
for(i in 1:hiddenLayers) if(maxHNPerVar < hiddenNodes[i]/nvars) {
hiddenNodes[i] <- resample((min(HNrange):maxHNPerVar),1, replace=T)}
ncyc <- resample(nCycles,1)
cat('# Training ANN nr. ',j,', with architecture: ',sep='')
cat(c(nvars,hiddenNodes[1:hiddenLayers],1), sep='-')
Ptrain <- as.data.frame(trans[trainingDataNumbers,variables] )
Pval <- as.data.frame(trans[esDataNumbers,variables])
targetTrain <- outputDataModelling[trainingDataNumbers]
targetVal <- outputDataModelling[esDataNumbers]
networks[[j]] <- nncalc(Ptrain, Pval, targetTrain, targetVal,
architecture=c(nvars,hiddenNodes,1), ncyc, Stao=10, es=ES, hidden.layer=
hidden.layer, output.layer=output.layer)
if(ES) esNumber[j] <- which.min(networks[[j]]$Merror[,2])
predTrain <- sim(networks[[j]]$net, Ptrain)
trainingMSE[j] <- mean((targetTrain-predTrain)^2)
if(ES) predEarlyStopping <- sim(networks[[j]]$net, Pval)
if(ES) earlyStoppingMSE[j] <- mean((targetVal-predEarlyStopping)^2)
if(ES) totalVar[j] <- var(c(predTrain-targetTrain, predEarlyStopping-
targetVal))
if(!ES) totalVar[j] <- var(c(predTrain-targetTrain))
if(ES) earlyStoppingCycle[j] <- which.min(networks[[j]]$Merror[,2])
hn[[j]] <- hiddenNodes
if(ES) cat(paste(' # Training MSE: ',round(trainingMSE[j],2),
', early stopping MSE: ',round(earlyStoppingMSE[j],2),', cycles: ',
earlyStoppingCycle[j],'\n', sep=''))
if(!ES) cat(paste(' # Training MSE: ',round(trainingMSE[j],2),
', cycles: ',earlyStoppingCycle[j],'\n', sep=''))
}
glue.ann$networks <- networks
glue.ann$variables <- variableslist
glue.ann$trainMSE <- trainingMSE
if(ES) glue.ann$esMSE <- earlyStoppingMSE
if(ES) glue.ann$totalMSE <- (1-randomESPart) * glue.ann$trainMSE +
randomESPart * glue.ann$esMSE
if(!ES) glue.ann$totalMSE <- glue.ann$trainMSE
glue.ann$totalVar <- totalVar
glue.ann$hn <- hn
if(orthogonal != 'No') glue.ann$pca <- Ppca
### Predict cross-validated sample #########################################
if(cv)
{
cvPrediction <- predict.glue.ann(inputDataModellingCV, glue.ann)
cvPredictions$MSE[crossValidationSampleNumber] <- mean(
(outputData[crossValidationSampleNumber] - cvPrediction$eMean)^2)
cvPredictions$eMean[crossValidationSampleNumber] <- cvPrediction$eMean
cvPredictions$eMedian[crossValidationSampleNumber] <- cvPrediction$eMedian
cvPredictions$eMin[crossValidationSampleNumber] <- cvPrediction$eMin
cvPredictions$eMax[crossValidationSampleNumber] <- cvPrediction$eMax
cvPredictions$eQ025[crossValidationSampleNumber] <- cvPrediction$eQ025
cvPredictions$eQ975[crossValidationSampleNumber] <- cvPrediction$eQ975
cvPredictions$MLE[crossValidationSampleNumber] <- cvPrediction$MLE
cvPredictions$target[crossValidationSampleNumber] <-
outputData[crossValidationSampleNumber]
}
}
if(cv){return(cvPredictions)} else {return(glue.ann)}
}
### Model ensemble prediction function #########################################
predict.glue.ann <- function(inputData, glue.ann,
weighting=glue.ann$parameters$weighting)
{
names(inputData) <- c(1:ncol(inputData))
if(glue.ann$parameters$orthogonal != 'No') inputData <-
predict(glue.ann$pca, newdata=inputData) # orthogonalise
for(i in 1:ncol(inputData)) inputData[,i] <-
(inputData[,i]-mean(glue.ann$P[,i]))/sd(glue.ann$P[,i]) # standardise
### Predict model ensemble results for new data ##############################
resultsMatrix <- matrix(nrow=nrow(inputData), ncol=glue.ann$parameters$nSets)
if(nrow(inputData) > 1) {for(i in 1:glue.ann$parameters$nSets) {
resultsMatrix[,i] <- sim(glue.ann$networks[[i]]$net,
inputData[,glue.ann$variables[[i]]])}}
if(nrow(inputData) == 1) {for(i in 1:glue.ann$parameters$nSets) {
resultsMatrix[,i] <- sim(glue.ann$networks[[i]]$net, t(
inputData[,glue.ann$variables[[i]]]))}}
### Construct GLUE weights ###################################################
if(glue.ann$parameters$weighting=='MSEes'){weightsGLUE <- (max(glue.ann$esMSE)
- glue.ann$esMSE)
weightsGLUE[which(weightsGLUE < 0)] <- 0
weightsGLUE <- weightsGLUE/sum(weightsGLUE)}
if(glue.ann$parameters$weighting=='MSE'){weightsGLUE <- (
max(glue.ann$totalMSE) - glue.ann$totalMSE)^2
weightsGLUE[which(weightsGLUE < 0)] <- 0
weightsGLUE <- weightsGLUE/sum(weightsGLUE)}
if(glue.ann$parameters$weighting=='Stedinger')
{
se2 <- min(glue.ann$totalVar)
n <- length(glue.ann$outputData)
weightsGLUE <- exp((-n/2)* (glue.ann$totalMSE)/(se2) )
}
if(glue.ann$parameters$weighting=='NSEff')
{
weightsGLUE <- (1-(glue.ann$totalMSE/var(glue.ann$outputData)))
weightsGLUE[which(weightsGLUE < 0)] <- 0
}
### Weight model predictions #################################################
predictions <- NULL
predictions$MLE <- resultsMatrix[1,which.min(glue.ann$totalVar)]
if(glue.ann$parameters$weighting != 'Stedinger')
{
predictions$eMean <- apply(resultsMatrix, 1, "wtd.mean", weights =
weightsGLUE, normwt=T)
predictions$eMedian <- apply(resultsMatrix, 1, "wtd.quantile", weights =
weightsGLUE, probs = c(0.5), normwt=T)
predictions$eQ025 <- apply(resultsMatrix, 1, "wtd.quantile", weights =
weightsGLUE, probs = c(0.025), normwt=T)
predictions$eQ975 <- apply(resultsMatrix, 1, "wtd.quantile", weights =
weightsGLUE, probs = c(0.975), normwt=T)
predictions$eMin <- apply(resultsMatrix, 1, "min")
predictions$eMax <- apply(resultsMatrix, 1, "max")
predictions$eResults <- resultsMatrix
}
if(glue.ann$parameters$weighting=='Stedinger')
{
resultsMatrix_sted <- matrix(nrow=nrow(inputData), ncol=
glue.ann$parameters$nSets*glue.ann$parameters$nStedinger)
for(i in 1:nrow(inputData))
{
for(k in 1:glue.ann$parameters$nSets)
{
errorVector <- rnorm(glue.ann$parameters$nStedinger,mean=0,sd=sqrt(se2))
resultsMatrix_sted[i,(((k-1)*glue.ann$parameters$nStedinger)+c(
1:glue.ann$parameters$nStedinger))] <- errorVector + resultsMatrix[i,k]
}
}
weightsGLUE <- rep(weightsGLUE, each=glue.ann$parameters$nStedinger)
predictions$eMean <- apply(resultsMatrix_sted, 1, "wtd.mean", weights =
weightsGLUE, normwt=T)
for(i in 1:nrow(inputData))
{
quantiles <- wtd.quantile(resultsMatrix_sted[i,], weights = weightsGLUE,
probs = c(0.025, 0.5, 0.975), normwt=T)
predictions$eMedian[i] <- quantiles[2]
predictions$eQ025[i] <- quantiles[1]
predictions$eQ975[i] <- quantiles[3]
}
predictions$eResults <- resultsMatrix_sted
predictions$eMin <- apply(resultsMatrix_sted, 1, "min")
predictions$eMax <- apply(resultsMatrix_sted, 1, "max")
}
### Backtransform of predictions according to training data output ###########
if(glue.ann$parameters$rescale.output)
{
outputRange <- max(glue.ann$outputData)-min(glue.ann$outputData)
outputMin <- min(glue.ann$outputData)
predictions$MLE <- predictions$MLE * outputRange + outputMin
predictions$eMean <- predictions$eMean * outputRange + outputMin
predictions$eMedian <- predictions$eMedian * outputRange + outputMin
predictions$eQ025 <- predictions$eQ025 * outputRange + outputMin
predictions$eQ975 <- predictions$eQ975 * outputRange + outputMin
predictions$eMin <- predictions$eMin * outputRange + outputMin
predictions$eMax <- predictions$eMax * outputRange + outputMin
predictions$eResults <- predictions$eResults * outputRange + outputMin
}
return(predictions)
}
To leave a comment for the author, please follow the link and comment on his blog: Bart Rogiers - Sreigor.
R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series,ecdf, trading) and more...