Estimation of hydraulic conductivity and its uncertainty from grain-size data using GLUE and artificial neural networks.

(This article was first published on Bart Rogiers - Sreigor, and kindly contributed to R-bloggers)

Various approaches exist to relate saturated hydraulic conductivity (K_s) to grain-size data. Most methods use a single grain-size parameter and hence omit the information encompassed by the entire grain-size distribution. This study compares two data-driven modelling methods—multiple linear regression and artificial neural networks—that use the entire grain-size distribution data as input for K_sprediction. Besides the predictive capacity of the methods, the uncertainty associated with the model predictions is also evaluated, since such information is important for stochastic groundwater flow and contaminant transport modelling.

Artificial neural networks (ANNs) are combined with a generalised likelihood uncertainty estimation (GLUE) approach to predict K_s from grain-size data. The resulting GLUE-ANN hydraulic conductivity predictions and associated uncertainty estimates are compared with those obtained from the multiple linear regression models by a leave-one-out cross-validation. The GLUE-ANN ensemble prediction proved to be slightly better than multiple linear regression. The prediction uncertainty, however, was reduced by half an order of magnitude on average, and decreased at most by an order of magnitude. This demonstrates that the proposed method outperforms classical data-driven modelling techniques. Moreover, a comparison with methods from the literature demonstrates the importance of site-specific calibration. The data set used for this purpose originates mainly from unconsolidated sandy sediments of the Neogene aquifer, northern Belgium. The proposed predictive models are developed for 173 grain-size K_s-pairs. Finally, an application with the optimised models is presented for a borehole lacking K_sdata.

Keywords: Early stopping - Cross-validation - Generalised likelihood uncertainty estimation - Artificial neural networks - Sedimentary aquifer - Data-driven modelling - Likelihood measures - Principal component analysis - GLUE-ANN

Rogiers, B., Mallants, D., Batelaan, O., Gedeon, M., Huysmans, M., Dassargues, A. 2012. Estimation of hydraulic conductivity and its uncertainty from grain-size data using GLUE and artificial neural networks. Mathematical Geosciences. doi: 10.1007/s11004-012-9409-2.

An example script for performing such analysis in the R language is provided below:

(see also at http://goo.gl/xn0p4)

R script with examples of GLUE-ANN analyses:

################################################################################
### GLUE-ANN parameters + examples (Rogiers et al. 2012)                     ###
################################################################################
# A dataframe has to be provided with the independent variable(s), and a vector
# with the dependent variable. A list of parameters has to be set to define the
# a priori distributions of the stochastic parameters, and fixed values for the
# ones considered to be constant.
# Be sure to load the additional GLUE-ANN functions into R before trying
# the examples provided below. The AMORE package should be installed and loaded:
install.packages('AMORE')
library('AMORE')
 
### Parameter list #############################################################
# A list of all required parameters with their default values.
 
  ### Pre-processing parameters ################################################
  # Orthogonalise input data; options are: 'Cov', 'Cor', and 'No'
    orthogonal <- 'Cor'
  # Number of input variables, provide vector for stochastic treatment
    variableNumber <- c(1:ncol(inputData))
  # T: Select always most variance explaining PCs; F: Choose PCs randomly
    dimRed <- T
  # Rescale output to 0-1 range
    rescale.output <- T
 
  ### ANN parameters ###########################################################
  # Enable early stopping
    ES <- T
  # Part of the data used for early stopping
    randomESPart <- 0.5
  # Early stopping samples should have a similar mean as the training samples
    nonEqualMeans <- T
  # Maximum difference between early stopping and training observation means
    maxMeanDiff <- mean(outputData)*0.2
  # maximum number of training cycles, provide a vector for stochastic treatment
    nCycles <- 100
  # Amount of hidden layers, provide vector for stochastic treatment
    HLrange <- c(1:1)
  # Amount of hidden nodes, provide vector for stochastic treatment
    HNrange <- c(1:10)
  # Maximum amount of hidden nodes per input variable
    maxHNPerVar <- Inf
  # Hidden layer activation function; options are 'tansig', 'sigmoid', 'purelin'
    hidden.layer <- 'tansig'
  # Output layer activation function; options are 'tansig', 'sigmoid', 'purelin'
    output.layer <- 'sigmoid'
 
  ### GLUE parameters ##########################################################
  # Number of models in the ensemble
    nSets <- 100
  # Number of random variates for weights based on Stedinger et al (2008)
    nStedinger <- 50
  # GLUE weighting procedure; options are 'MSE', 'MSEes' (for weights based on
  # the early stopping subset performance), 'NSEff' and 'Stedinger'
    weighting <- 'Stedinger'
 
  ### Cross-validation parameters ##############################################
  # Use leave-one-out cross-validation
    cv <- T
 
### Examples ###################################################################
 
  ### Example of training and prediction for fictional data ####################
  # Create fictional data for x range [1,100]
    inputData <- data.frame(x=seq(1,100,0.5))
  # Dependent variable y = x^2                              
    outputData <- inputData$x ^ 2
  # add random noise to y, representing different sources of error
    outputData <- jitter(outputData, amount=mean(outputData)*0.2)  
  # Run GLUE-ANN functions   
    glue.ann.ensemble <- train.glue.ann(inputData, outputData, nCycles=500)
    training.results <- predict.glue.ann(inputData, glue.ann.ensemble)
  # Visualise results for training data
    plot(inputData$x, training.results$eMean, type='l', xlab='x', ylab=
    expression(y = x^2 + epsilon), ylim=c(-2000,13000))
    lines(inputData$x, glue.ann.ensemble$outputData, lty=2, col='red')
    lines(inputData$x, training.results$eQ025, lty=3, col='blue')
    lines(inputData$x, training.results$eQ975, lty=3, col='blue')
    legend('topleft', c('GLUE-ANN ensemble mean prediction','Observations',
    '95% prediction uncertainty'), lty=c(1,2,3),col=c('black','red','blue'),
    bty='n')
  # Predict new data
    newData <- data.frame(x=c(-100:200)+0.5)
    newData.predictions <- predict.glue.ann(newData, glue.ann.ensemble)
  # Visualise results of prediction
    plot(newData$x, newData.predictions$eMean, type='l', xlab='x', ylab=
    expression(y = x^2 + epsilon), ylim=c(-2000,13000))
    lines(newData$x, jitter(newData$x^2, amount=mean(outputData)*0.2) , lty=2,
    col='red')
    lines(newData$x, newData.predictions$eQ025, lty=3, col='blue')
    lines(newData$x, newData.predictions$eQ975, lty=3, col='blue')
    abline(v=c(0,100))
    text(50,13000,'Training data range')
    legend('bottomright', c('GLUE-ANN ensemble mean prediction','Observations',
    '95% prediction uncertainty'), lty=c(1,2,3),col=c('black','red','blue'),
    bty='n')
 
  ### Example of cross-validation with an R dataset ############################
  # Read input and output data (ozone concentration in function of wind,
  # temperature and solar radiation)
    outputData <- na.omit(airquality)[,1]
    inputData <- na.omit(airquality)[,2:4]
  # Perform cross-validation
    cross_validation <- train.glue.ann(inputData, outputData, cv=T, nCycles=1000
    , nSets=10)
  # Visualisation
    plot(cross_validation$target, ylim=c(-100,200), xlab='Sample number', ylab=
    'Ozone')
    points(cross_validation$eMean, pch=2, col='red')
    lines(cross_validation$eQ025, lty=3, col='blue')
    lines(cross_validation$eQ975, lty=3, col='blue')
    legend('bottomright', c('Observations','GLUE-ANN ensemble mean prediction',
    '95% prediction uncertainty'), pch=c(1,2,NA), lty=c(NA,NA,3),col=c('black',
    'red','blue'),bty='n')

R functions for GLUE-ANN analysis with the AMORE package (Castejón Limas et al 2010):

################################################################################
### GLUE-ANN - functions (Rogiers et al. 2012)                               ###
################################################################################
 
### ANN training function ######################################################
nncalc <- function(Ptrain, Pval, targetTrain, targetVal, architecture, n, Stao=1
, es=T, hidden.layer="sigmoid", output.layer="purelin")
{
  neuralnetwork <- newff(n.neurons=architecture, learning.rate.global=1e-2,
  momentum.global=0.5, error.criterium = 'LMS', Stao = Stao, hidden.layer=
  hidden.layer, output.layer=output.layer, method="ADAPTgdwm")
  if(!es) { Pval <- NULL; targetVal <- NULL }     
  capture.output(results <- train(neuralnetwork, Ptrain, Stao=Stao,  Pval=Pval,
  Tval=targetVal,targetTrain, error.criterium = 'LMS', report=T, show.step=1,
  n.shows=n ))
  return(results)
}
 
### Sampling parameter vectors function ########################################
resample <- function(x, n, ...) x[sample.int(length(x), size=n, ...)]
 
### Model ensemble training function ###########################################
train.glue.ann <- function(inputData,outputData,orthogonal='Cor',variableNumber=
c(1:ncol(inputData)),dimRed=T,rescale.output=T,ES=T,randomESPart=0.5,
nonEqualMeans=T,maxMeanDiff=mean(outputData)*0.2,nCycles=100,HLrange=c(1:1),
HNrange=c(1:10),maxHNPerVar=Inf,hidden.layer='tansig',output.layer='sigmoid',
nSets=100,nStedinger=50,weighting='Stedinger',cv=F)
{
  ### Initialisation ###########################################################
  parameters <- NULL
  parameters$orthogonal <- orthogonal
  parameters$variableNumber <- variableNumber
  parameters$dimRed <- dimRed
  parameters$rescale.output <- rescale.output
  parameters$ES <- ES
  parameters$randomESPart <- randomESPart
  parameters$nonEqualMeans <- nonEqualMeans
  parameters$maxMeanDiff <- maxMeanDiff
  parameters$nCycles <- nCycles
  parameters$HLrange <- HLrange
  parameters$HNrange <- HNrange
  parameters$maxHNPerVar <- maxHNPerVar
  parameters$hidden.layer <- hidden.layer
  parameters$output.layer <- output.layer
  parameters$nSets <- nSets
  parameters$nStedinger <- nStedinger
  parameters$weighting <- weighting
  parameters$cv <- cv        
  glue.ann <- NULL
  cvPredictions <- NULL
  glue.ann$parameters <- parameters      
  glue.ann$esNumber <- matrix(nrow=nrow(inputData), ncol=nSets)
 
  ### Training loop ############################################################
  for(crossValidationSampleNumber in 1:ifelse(cv,nrow(inputData),1))
  {
    if(cv)
    { 
        inputDataModellingCV <- as.data.frame(
        inputData[crossValidationSampleNumber,])
        inputDataModelling <- as.data.frame(
        inputData[-crossValidationSampleNumber,])
        outputDataModellingCV <- outputData[crossValidationSampleNumber]
        outputDataModelling <- outputData[-crossValidationSampleNumber] 
    } else {
        inputDataModelling <- inputData
        outputDataModelling <- outputData
    }
    names(inputDataModelling) <- c(1:ncol(inputDataModelling))
    if(cv) names(inputDataModellingCV) <- c(1:ncol(inputDataModellingCV))
    glue.ann$inputData <- inputDataModelling
    glue.ann$outputData <- outputDataModelling
    P <- cbind(inputDataModelling)    
    if(orthogonal != 'No'){Ppca <- princomp(P, cor=ifelse(orthogonal=='Cor', T,
    F));P <- Ppca$scores}
    trans <- P
    glue.ann$P <- P
    for(i in 1:ncol(inputData)) trans[,i] <- (trans[,i]-mean(P[,i]))/sd(P[,i])
    if(rescale.output) outputDataModelling <- (outputDataModelling - (
    min(glue.ann$outputData)))/(((max(glue.ann$outputData))-(
    min(glue.ann$outputData))))
    if(rescale.output & cv) outputDataModellingCV <- (outputDataModellingCV -(
    min(glue.ann$outputData)))/(((max(glue.ann$outputData))-(
    min(glue.ann$outputData))))
    variableslist <- list();networks <- list(); esNumber <- NULL
    earlyStoppingMSE <- NULL; earlyStoppingCycle <- NULL; hn <- list()
    totalVar <- NULL; trainingMSE <- NULL
    for(j in 1:nSets)
    {
      if(cv) cat(paste('# Cross-validating sample',crossValidationSampleNumber,
      ' '))
      esDataNumbers <- sample(1:nrow(inputDataModelling),
      round(nrow(inputDataModelling)*randomESPart,0))
      while(nonEqualMeans)
      {
        esDataNumbers <- sample(1:nrow(inputDataModelling),
        round(nrow(inputDataModelling)*randomESPart,0))
        esMean <- mean(outputDataModelling[esDataNumbers])
        trainMean <- mean(outputDataModelling[-esDataNumbers])
        if(abs(esMean-trainMean) < maxMeanDiff) nonEqualMeans <- F
      }        
      trainingDataNumbers <- c(1:nrow(inputDataModelling))[-esDataNumbers]
      if(!ES) {trainingDataNumbers <- c(1:nrow(inputDataModelling))
      esDataNumbers <- trainingDataNumbers}
      nvars <- resample(variableNumber,1)
      variables <- resample(c(1:ncol(inputDataModelling)),nvars)
      if(dimRed) variables <- c(1:length(variables))
      variableslist[[j]] <- variables
      hiddenLayers <- resample(HLrange,1, replace=T)
      hiddenNodes <- resample(HNrange,hiddenLayers, replace=T)
      for(i in 1:hiddenLayers) if(maxHNPerVar < hiddenNodes[i]/nvars) {
      hiddenNodes[i] <- resample((min(HNrange):maxHNPerVar),1, replace=T)}
      ncyc <- resample(nCycles,1)
      cat('# Training ANN nr. ',j,', with architecture: ',sep='')
      cat(c(nvars,hiddenNodes[1:hiddenLayers],1), sep='-')
      Ptrain <- as.data.frame(trans[trainingDataNumbers,variables]   )
      Pval <- as.data.frame(trans[esDataNumbers,variables])
      targetTrain <- outputDataModelling[trainingDataNumbers]
      targetVal <- outputDataModelling[esDataNumbers]
      networks[[j]] <- nncalc(Ptrain, Pval, targetTrain, targetVal,
      architecture=c(nvars,hiddenNodes,1), ncyc, Stao=10, es=ES, hidden.layer=
      hidden.layer, output.layer=output.layer)
      if(ES) esNumber[j] <- which.min(networks[[j]]$Merror[,2])
      predTrain <- sim(networks[[j]]$net, Ptrain)
      trainingMSE[j] <- mean((targetTrain-predTrain)^2)
      if(ES) predEarlyStopping <- sim(networks[[j]]$net, Pval)
      if(ES) earlyStoppingMSE[j] <- mean((targetVal-predEarlyStopping)^2)
      if(ES) totalVar[j] <- var(c(predTrain-targetTrain, predEarlyStopping-
      targetVal))
      if(!ES) totalVar[j] <- var(c(predTrain-targetTrain))
      if(ES) earlyStoppingCycle[j] <- which.min(networks[[j]]$Merror[,2])
      hn[[j]] <- hiddenNodes
      if(ES) cat(paste(' # Training MSE: ',round(trainingMSE[j],2),
      ', early stopping MSE: ',round(earlyStoppingMSE[j],2),', cycles: ',
      earlyStoppingCycle[j],'\n', sep=''))
      if(!ES) cat(paste(' # Training MSE: ',round(trainingMSE[j],2),
      ', cycles: ',earlyStoppingCycle[j],'\n', sep=''))
    }
    glue.ann$networks <- networks
    glue.ann$variables <- variableslist
    glue.ann$trainMSE <- trainingMSE
    if(ES) glue.ann$esMSE <- earlyStoppingMSE
    if(ES) glue.ann$totalMSE <- (1-randomESPart) * glue.ann$trainMSE +
    randomESPart * glue.ann$esMSE
    if(!ES) glue.ann$totalMSE <- glue.ann$trainMSE
    glue.ann$totalVar <- totalVar
    glue.ann$hn <- hn
    if(orthogonal != 'No') glue.ann$pca <- Ppca
 
    ### Predict cross-validated sample #########################################
    if(cv)
    {
      cvPrediction <- predict.glue.ann(inputDataModellingCV, glue.ann)
      cvPredictions$MSE[crossValidationSampleNumber] <- mean(
      (outputData[crossValidationSampleNumber] - cvPrediction$eMean)^2)
      cvPredictions$eMean[crossValidationSampleNumber] <- cvPrediction$eMean
      cvPredictions$eMedian[crossValidationSampleNumber] <- cvPrediction$eMedian
      cvPredictions$eMin[crossValidationSampleNumber] <- cvPrediction$eMin
      cvPredictions$eMax[crossValidationSampleNumber] <- cvPrediction$eMax
      cvPredictions$eQ025[crossValidationSampleNumber] <- cvPrediction$eQ025
      cvPredictions$eQ975[crossValidationSampleNumber] <- cvPrediction$eQ975
      cvPredictions$MLE[crossValidationSampleNumber] <- cvPrediction$MLE
      cvPredictions$target[crossValidationSampleNumber] <-
      outputData[crossValidationSampleNumber]
    }
  }
  if(cv){return(cvPredictions)} else {return(glue.ann)}
}
 
### Model ensemble prediction function #########################################
predict.glue.ann <- function(inputData, glue.ann,
weighting=glue.ann$parameters$weighting)
{
  names(inputData) <- c(1:ncol(inputData))
  if(glue.ann$parameters$orthogonal != 'No') inputData <-
  predict(glue.ann$pca, newdata=inputData) # orthogonalise
  for(i in 1:ncol(inputData)) inputData[,i] <-
  (inputData[,i]-mean(glue.ann$P[,i]))/sd(glue.ann$P[,i]) # standardise
 
  ### Predict model ensemble results for new data ##############################
  resultsMatrix <- matrix(nrow=nrow(inputData), ncol=glue.ann$parameters$nSets)
  if(nrow(inputData) > 1) {for(i in 1:glue.ann$parameters$nSets) {
  resultsMatrix[,i] <- sim(glue.ann$networks[[i]]$net,
  inputData[,glue.ann$variables[[i]]])}}
  if(nrow(inputData) == 1) {for(i in 1:glue.ann$parameters$nSets) {
  resultsMatrix[,i] <- sim(glue.ann$networks[[i]]$net, t(
  inputData[,glue.ann$variables[[i]]]))}}
 
  ### Construct GLUE weights ###################################################
  if(glue.ann$parameters$weighting=='MSEes'){weightsGLUE <- (max(glue.ann$esMSE)
  - glue.ann$esMSE)
  weightsGLUE[which(weightsGLUE < 0)] <- 0
  weightsGLUE <- weightsGLUE/sum(weightsGLUE)}
  if(glue.ann$parameters$weighting=='MSE'){weightsGLUE <- (
  max(glue.ann$totalMSE) - glue.ann$totalMSE)^2
  weightsGLUE[which(weightsGLUE < 0)] <- 0
  weightsGLUE <- weightsGLUE/sum(weightsGLUE)}
  if(glue.ann$parameters$weighting=='Stedinger')
  {
    se2 <- min(glue.ann$totalVar)
    n <- length(glue.ann$outputData)
    weightsGLUE <- exp((-n/2)* (glue.ann$totalMSE)/(se2) )
  }
  if(glue.ann$parameters$weighting=='NSEff')
  {
    weightsGLUE <- (1-(glue.ann$totalMSE/var(glue.ann$outputData)))
    weightsGLUE[which(weightsGLUE < 0)] <- 0
  }
 
  ### Weight model predictions #################################################
  predictions <- NULL
  predictions$MLE <- resultsMatrix[1,which.min(glue.ann$totalVar)]    
  if(glue.ann$parameters$weighting != 'Stedinger')
  {
    predictions$eMean <- apply(resultsMatrix, 1, "wtd.mean", weights =
    weightsGLUE, normwt=T)
    predictions$eMedian <- apply(resultsMatrix, 1, "wtd.quantile", weights =
    weightsGLUE, probs = c(0.5), normwt=T)
    predictions$eQ025 <- apply(resultsMatrix, 1, "wtd.quantile", weights =
    weightsGLUE, probs = c(0.025), normwt=T)
    predictions$eQ975 <- apply(resultsMatrix, 1, "wtd.quantile", weights =
    weightsGLUE, probs = c(0.975), normwt=T)
    predictions$eMin <- apply(resultsMatrix, 1, "min")
    predictions$eMax <- apply(resultsMatrix, 1, "max")
    predictions$eResults <- resultsMatrix
  }
  if(glue.ann$parameters$weighting=='Stedinger')
  {
    resultsMatrix_sted <- matrix(nrow=nrow(inputData), ncol=
    glue.ann$parameters$nSets*glue.ann$parameters$nStedinger)
    for(i in 1:nrow(inputData))
    {
      for(k in 1:glue.ann$parameters$nSets)
      {
        errorVector <- rnorm(glue.ann$parameters$nStedinger,mean=0,sd=sqrt(se2))
        resultsMatrix_sted[i,(((k-1)*glue.ann$parameters$nStedinger)+c(
        1:glue.ann$parameters$nStedinger))] <- errorVector + resultsMatrix[i,k]
      }
    }
    weightsGLUE <- rep(weightsGLUE, each=glue.ann$parameters$nStedinger)
    predictions$eMean <- apply(resultsMatrix_sted, 1, "wtd.mean", weights =
    weightsGLUE, normwt=T)
    for(i in 1:nrow(inputData))
    {
      quantiles <- wtd.quantile(resultsMatrix_sted[i,], weights = weightsGLUE,
      probs = c(0.025, 0.5, 0.975), normwt=T)
      predictions$eMedian[i] <- quantiles[2]
      predictions$eQ025[i] <- quantiles[1]
      predictions$eQ975[i] <- quantiles[3] 
    }
    predictions$eResults <- resultsMatrix_sted
    predictions$eMin <- apply(resultsMatrix_sted, 1, "min")
    predictions$eMax <- apply(resultsMatrix_sted, 1, "max")    
  }
 
  ### Backtransform of predictions according to training data output ###########
  if(glue.ann$parameters$rescale.output)
  {
    outputRange <- max(glue.ann$outputData)-min(glue.ann$outputData)
    outputMin <- min(glue.ann$outputData)
    predictions$MLE <- predictions$MLE * outputRange + outputMin
    predictions$eMean <- predictions$eMean * outputRange + outputMin
    predictions$eMedian <- predictions$eMedian * outputRange + outputMin
    predictions$eQ025 <- predictions$eQ025 * outputRange + outputMin
    predictions$eQ975 <- predictions$eQ975 * outputRange + outputMin
    predictions$eMin <- predictions$eMin * outputRange + outputMin
    predictions$eMax <- predictions$eMax * outputRange + outputMin
    predictions$eResults <- predictions$eResults * outputRange + outputMin
  }
  return(predictions)
}

Created by Pretty R at inside-R.org

To leave a comment for the author, please follow the link and comment on his blog: Bart Rogiers - Sreigor.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series,ecdf, trading) and more...

Estimation of hydraulic conductivity and its uncertainty from grain-size data using GLUE and artificial neural networks.

Trending Articles

Practice Sheet of Right form of verbs for HSC Students

Download: FK ft Shenky – Nakuyewa ”Prod by: Shenky”

How to win at Markstrat (Markstrat Tips and Tricks) – Vodites

Ominde Commission Report and Recommendations – Ominde Report of 1964

Bureau of Internal Revenue: Regional Offices (Directory)

GO 53 on Enhancement of Ex-gratia upto 5 Lakhs Toddy Tappers in Telangana

Cakewalk CA-2A Leveling Amplifier v2.0.1.97 WiN, v2.0.1.96 OSX Incl Keygen

Mp3 Download: Mdu - Kunjenjenjena

How the kill the job , when DTP request running for long hours.

Microsoft Intune から展開しているアプリのアップデートについて

18-year-old girl was beaten for half an hour by two Northampton men in 'an...

Car crash in Dunton Bassett leaves driver in critical condition

Macky 2, Two Others In Road Accident

Application log 00000000000000089514: Could not convert queue DLVST90CLNT

Detroit mafia: D’Anna Brothers agree to plea deal

Delivery block field greyed out using VA02

Muloraki Au

【個人撮影】スマホのプライベート映像♪「中に出さないで///」カラオケ屋での生ハメ撮りが流出ｗ【リベンジポルノ】＠PornHub

BREAKING NEWS: Diamond Platnumz Is Reported Dead After Ghastly Car Accident

FIAT 500 B0111 B0112