Search Results for "pca" – R-bloggers

PCA or Polluting your Clever Analysis


(This article was first published on Machine Master, and kindly contributed to R-bloggers)

When I learned about principal component analysis (PCA), I thought it would be really useful for big data analysis, but that is not true if you want to do prediction. I tried PCA in my first competition on Kaggle, and it delivered bad results. This post illustrates how PCA can pollute good predictors.
When I started examining this problem, I always had the following picture in mind, so I couldn't resist including it. The left picture shows how I feel about my raw data before PCA; the right side represents how I see my data after applying PCA to it.





Yes it can be that bad …



The basic idea behind PCA is pretty compelling:
Look for a (linear) combination of your variables that explains most of the variance in the data. This combination becomes your first component. Then search for the combination that explains the second-most variance and is orthogonal to the first component.
I won't go into the details and assume that you are familiar with the idea.
At the end of the procedure you have uncorrelated variables, which are linear combinations of your old variables, and they can be ordered by how much variance they explain. The idea for machine learning / predictive analysis is to use only the components with high variance, because variance equals information, right?
So we can reduce dimensions by dropping the components that do not explain much of the variance. You then have fewer variables, the dataset becomes manageable, your algorithms run faster, and you have the feeling that you have done something useful to your data, that you have aggregated it in a compact but effective way.
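For reference, the naive recipe usually looks something like the sketch below. This is my own illustration, not code from the post: X stands for a generic data frame of numeric covariates, and the 90 % cutoff is an arbitrary choice.

## a rough sketch of the "keep the high-variance components" recipe
pca <- prcomp(X, scale. = TRUE)

## cumulative share of variance explained by the first k components
cum_var <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
k <- which(cum_var >= 0.9)[1]

## reduced data set: only the first k component scores
X_reduced <- pca$x[, 1:k, drop = FALSE]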
It's not as simple as that.
Well, let's try it out on some toy data.
First we build a toy data set. We create some random "good" x values, simply drawn from a normal distribution. The response variable y is a linear transformation of good_x plus a random error, so we should be able to predict y well with the help of good_x.
The second covariable is a "noisy" x, which is highly correlated with good_x but has more variance itself.
To make the dataset bigger, I also included variables that are useless for predicting y: they are normally distributed with mean zero and unrelated to the response. Five of them are highly correlated with each other, and so are the other five, which the PCA will pick up later.
library("MASS")
set.seed(123456)


### building a toy data set ###

## number of observations
n <- 500

## good predictor
good_x <- rnorm(n = n, mean = 0.5, sd = 1)

## noisy variable, which is highly correlated with good predictor
noisy_x <- good_x + rnorm(n, mean = 0, sd = 1.2)

## response variable linear to good_x plus random error
y <- 0.7 * good_x + rnorm(n, mean = 0, sd = 0.11)

## 10 variables with no relation to the response; the first five are
## correlated with each other, and so are the last five
Sigma_diag <- matrix(0.6, nrow = 5, ncol = 5)
zeros <- matrix(0, nrow = 5, ncol = 5)
Sigma <- rbind(cbind(Sigma_diag, zeros), cbind(zeros, Sigma_diag))
diag(Sigma) <- 1
random_X <- mvrnorm(n = n, mu = rep(0, 10), Sigma = Sigma)
colnames(random_X) <- paste("useless", 1:10, sep = "_")

## binding all together
dat <- data.frame(y = y, good_x, noisy_x, random_X)
Let's look at the relationship between good_x and y, and noisy_x and y:

## relationship between y and good_x and noisy_x
par(mfrow = c(1, 2))
plot(y ~ good_x, data = dat)
plot(y ~ noisy_x, data = dat)
(Figure: scatterplots of y against good_x and against noisy_x)
Obviously, good_x is a much better predictor for y than noisy_x.
Now I run the principal component analysis. The first three components explain a lot, which is due to the way I constructed the data. The variables good_x and noisy_x are highly correlated (Component 3), the useless variables one to five are correlated with each other, and so are the useless variables six to ten (Components 1 and 2).
## pca
prin <- princomp(dat[-1], cor = TRUE)

## results visualized
par(mfrow = c(1, 2))
screeplot(prin, type = "l")
(Figure: scree plot of the principal components)
loadings(prin)
## 
## Loadings:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8 Comp.9
## good_x 0.700 0.148 0.459 -0.441
## noisy_x 0.700 0.148 -0.424 0.474
## useless_1 -0.183 -0.408 -0.290 -0.464 0.382 -0.184 0.180
## useless_2 -0.204 -0.390 0.170 0.490 0.549 0.114 -0.116
## useless_3 -0.228 -0.393 0.242 -0.434 0.176
## useless_4 -0.201 -0.394 -0.446 -0.278 0.309 -0.103 0.599
## useless_5 -0.172 -0.414 0.365 -0.271 0.105 -0.306 -0.484
## useless_6 0.413 -0.172 0.209 -0.527 -0.125
## useless_7 0.411 -0.171 -0.305 0.273 0.398 -0.311 0.413
## useless_8 0.368 -0.241 -0.424 0.340 -0.252 -0.403 -0.209 -0.447
## useless_9 0.398 -0.191 0.366 0.386 -0.166 0.427 0.262
## useless_10 0.405 -0.215 0.137 -0.157 0.122 -0.297
## Comp.10 Comp.11 Comp.12
## good_x -0.234 0.146
## noisy_x 0.239
## useless_1 0.144 -0.440 -0.252
## useless_2 0.272 -0.284 0.196
## useless_3 0.264 0.561 0.347
## useless_4 -0.237
## useless_5 -0.443 -0.220
## useless_6 0.186 -0.392 0.515
## useless_7 0.412 0.165
## useless_8 -0.135 0.150
## useless_9 -0.416 -0.166 -0.171
## useless_10 0.484 0.212 -0.596
##
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8
## SS loadings 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000
## Proportion Var 0.083 0.083 0.083 0.083 0.083 0.083 0.083 0.083
## Cumulative Var 0.083 0.167 0.250 0.333 0.417 0.500 0.583 0.667
## Comp.9 Comp.10 Comp.11 Comp.12
## SS loadings 1.000 1.000 1.000 1.000
## Proportion Var 0.083 0.083 0.083 0.083
## Cumulative Var 0.750 0.833 0.917 1.000
Let's look at the relationship between the response y and the component that mixes good_x and noisy_x. Component 3 is the only one that loads only on the good and the noisy x and on none of the useless variables. You can see that the relationship is still there, but by mixing the noise into the good predictor we now have a worse predictor than before.
dat_pca <- as.data.frame(prin$scores)
dat_pca$y <- y

plot(y ~ Comp.3, data = dat_pca)
(Figure: scatterplot of y against Comp.3)
Now we can compare predictions of y from the raw data with predictions from the data after PCA. The first method is a linear model. Since the true relationship between good_x and y is linear, this should be a good approach. First we take the raw data and include all variables: the good and the noisy x and the useless variables.
As expected, the linear model performs very well, with an adjusted R-squared of 0.975. The estimated coefficient of good_x is also very close to the true value, and the p-values show that the model correctly identifies good_x as the only covariable that matters.

## linear model detects good_x rightfully as only good significant
## predictor
mod <- lm(y ~ ., dat)
summary(mod)
## 
## Call:
## lm(formula = y ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.3205 -0.0750 -0.0048 0.0784 0.3946
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.001377 0.005681 -0.24 0.81
## good_x 0.705448 0.006411 110.03 <2e-16 ***
## noisy_x -0.002980 0.004288 -0.70 0.49
## useless_1 -0.010783 0.006901 -1.56 0.12
## useless_2 0.009326 0.006887 1.35 0.18
## useless_3 -0.001307 0.007572 -0.17 0.86
## useless_4 -0.005631 0.007037 -0.80 0.42
## useless_5 0.002154 0.007116 0.30 0.76
## useless_6 -0.001407 0.007719 -0.18 0.86
## useless_7 -0.003830 0.007513 -0.51 0.61
## useless_8 0.004877 0.007215 0.68 0.50
## useless_9 0.000769 0.007565 0.10 0.92
## useless_10 0.005247 0.007639 0.69 0.49
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.112 on 487 degrees of freedom
## Multiple R-squared: 0.976, Adjusted R-squared: 0.975
## F-statistic: 1.62e+03 on 12 and 487 DF, p-value: <2e-16
##
And now let's look at the mess the PCA made. If we included all components in our linear model, we would get the same R-squared value, because the new components are just linear combinations of the old variables. Only the p-values would be a mess, because several components would have a significant influence on y.
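You can verify this directly with the dat_pca data frame built above (this check is not in the original post):

## all 12 components carry the same information as the 12 raw covariables
mod_pca_full <- lm(y ~ ., data = dat_pca)
summary(mod_pca_full)$r.squared  # identical multiple R-squared to the raw-data model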
But we wanted to use PCA for dimensionality reduction. The scree plot above suggests keeping the first four components, because they explain most of the variance. Fitting the linear model on the reduced data gives us a worse model: the adjusted R-squared drops to 0.787.
dat_pca_reduced <- dat_pca[c("y", "Comp.1", "Comp.2", "Comp.3", "Comp.4")]
mod_pca_reduced <- lm(y ~ ., data = dat_pca_reduced)

summary(mod_pca_reduced)
## 
## Call:
## lm(formula = y ~ ., data = dat_pca_reduced)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.9634 -0.2188 -0.0147 0.2084 0.9651
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.35828 0.01463 24.50 < 2e-16 ***
## Comp.1 0.03817 0.00786 4.86 1.6e-06 ***
## Comp.2 -0.00778 0.00795 -0.98 0.33
## Comp.3 0.48733 0.01149 42.41 < 2e-16 ***
## Comp.4 0.11190 0.02138 5.24 2.4e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.327 on 495 degrees of freedom
## Multiple R-squared: 0.789, Adjusted R-squared: 0.787
## F-statistic: 463 on 4 and 495 DF, p-value: <2e-16
##
The same thing happens with other methods. I chose random forests here, but the pattern is the same for other algorithms.
The first random forest, on the raw data, gives the best result, with 93.15% of the variance explained. This is not as good as the linear model, because the true relationship is linear, but it is still a reasonable result.
library("randomForest")

## raw data
set.seed(1234)
forest <- randomForest(y ~ ., data = dat)
forest
## 
## Call:
## randomForest(formula = y ~ ., data = dat)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 4
##
## Mean of squared residuals: 0.03436
## % Var explained: 93.15
The next random forest uses all components from the PCA, which means all the information is still in the data, because we have only built linear combinations of the old variables. But the result is worse, with only 85.47% of the variance explained.

## pca data
set.seed(1234)
forest_pca <- randomForest(y ~ ., data = dat_pca)
forest_pca
## 
## Call:
## randomForest(formula = y ~ ., data = dat_pca)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 4
##
## Mean of squared residuals: 0.07293
## % Var explained: 85.47
Let's see what happens if we only take the first four components, which explain most of the variance of the covariables.
As suspected, we again lose predictive power and the explained variance drops to 70.98%.
That's a difference of about 22 percentage points of explained variance compared to the random forest on the raw data.
## reduced pca data
set.seed(1234)
forest_pca <- randomForest(y ~ ., data = dat_pca_reduced)
forest_pca
## 
## Call:
## randomForest(formula = y ~ ., data = dat_pca_reduced)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 1
##
## Mean of squared residuals: 0.1456
## % Var explained: 70.98
So PCA can do bad things to your data when used for prediction. This was of course a simple example, but my intuition tells me that the problem persists for other relationships between y and the covariables and for other covariation structures among the covariables.
Of course, PCA is still a useful statistical tool. It can help as a descriptive tool, and if you are looking for latent variables behind your observed features (which is very common in surveys) it does its job. But don't use it blindly in predictive models.
This blog entry was inspired by this one:
http://blog.explainmydata.com/2012/07/should-you-apply-pca-to-your-data.html
I am interested in your experience with PCA. Feel free to comment.


What’s the smallest amount you can’t make with 5 coins ?


(This article was first published on Computing / Geek, and kindly contributed to R-bloggers)

My amazing, awesome wife often comes up with little puzzles for our amazing children, and this one seemed destined to be solved in R. Using up to 5 coins (1p, 2p, 5p, 10p, 20p and 50p), she first asked our kids whether they could make every value up to 50p, and then what the smallest value they couldn't make was.

Here's my R solution (which took about 5 minutes less than our daughter took to answer the first question):

# What Amounts can't you make using up to 5 coins 1p to 50p
# 
# Author: Paul Hurley
library(ggplot2)
library(plyr)
# Define our coins
coins <- as.factor(c(0, 1, 2, 5, 10, 20, 50))
# build a list of all the possibilities
possibilities <- expand.grid(coin1 = coins, coin2 = coins, coin3 = coins, coin4 = coins, 
    coin5 = coins)
# calculate the result
possibilities$total <- as.numeric(as.character(possibilities$coin1)) + as.numeric(as.character(possibilities$coin2)) + 
    as.numeric(as.character(possibilities$coin3)) + as.numeric(as.character(possibilities$coin4)) + 
    as.numeric(as.character(possibilities$coin5))
# define our target values
targets <- 1:250
# what amounts aren't possible
targets[!targets %in% possibilities$total]
##  [1]  88  89  98  99 118 119 128 129 133 134 136 137 138 139 143 144 146
## [18] 147 148 149 158 159 163 164 166 167 168 169 173 174 176 177 178 179
## [35] 181 182 183 184 185 186 187 188 189 191 192 193 194 195 196 197 198
## [52] 199 203 204 206 207 208 209 211 212 213 214 215 216 217 218 219 221
## [69] 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238
## [86] 239 240 241 242 243 244 245 246 247 248 249

So, the smallest value we can't make is
88
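The same number can also be pulled out directly:

# smallest target that cannot be made with up to five coins
min(targets[!targets %in% possibilities$total])
## [1] 88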

We can even produce a table of the number of ways to make each
number, and a graph

tableofpossibilities <- ddply(.data = possibilities, .(total), nrow)
ggplot(data = possibilities, aes(x = total)) + geom_histogram(binwidth = 1)

(Figure: histogram of the number of ways to make each total with five coins)

Then, when I triumphantly told her, she asked, 'What about 4 coins?'

# How about 4 coins build a list of all the possibilities
fourpossibilities <- expand.grid(coin1 = coins, coin2 = coins, coin3 = coins, 
    coin4 = coins)
# calculate the result
fourpossibilities$total <- as.numeric(as.character(fourpossibilities$coin1)) + 
    as.numeric(as.character(fourpossibilities$coin2)) + as.numeric(as.character(fourpossibilities$coin3)) + 
    as.numeric(as.character(fourpossibilities$coin4))
# what values can't be made ?
targets[!targets %in% fourpossibilities$total]
##   [1]  38  39  48  49  68  69  78  79  83  84  86  87  88  89  93  94  96
##  [18]  97  98  99 108 109 113 114 116 117 118 119 123 124 126 127 128 129
##  [35] 131 132 133 134 135 136 137 138 139 141 142 143 144 145 146 147 148
##  [52] 149 153 154 156 157 158 159 161 162 163 164 165 166 167 168 169 171
##  [69] 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188
##  [86] 189 190 191 192 193 194 195 196 197 198 199 201 202 203 204 205 206
## [103] 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223
## [120] 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240
## [137] 241 242 243 244 245 246 247 248 249 250

So, the answer is 38. And here's a graph:


ggplot(data = fourpossibilities, aes(x = total)) + geom_histogram(binwidth = 1)

(Figure: histogram of the number of ways to make each total with four coins)


DESeq vs edgeR Comparison


(This article was first published on Getting Genetics Done, and kindly contributed to R-bloggers)
DESeq and edgeR are two methods and R packages for analyzing quantitative readouts (in the form of counts) from high-throughput experiments such as RNA-seq or ChIP-seq. After alignment, reads are assigned to a feature, where each feature represents a target transcript, in the case of RNA-Seq, or a binding region, in the case of ChIP-Seq. An important summary statistic is the count of the number of reads in a feature (for RNA-Seq, this read count is a good approximation of transcript abundance).

Methods used to analyze array-based data assume a normally distributed, continuous response variable. However, response variables for digital methods like RNA-seq and ChIP-seq are discrete counts. Thus, both DESeq and edgeR methods are based on the negative binomial distribution.

I see these two tools often used interchangeably, and I wanted to take a look at how they stack up to one another in terms of performance, ease of use, and speed. This isn't meant to be a comprehensive evaluation or "bake-off" between the two methods. This would require complex simulations, parameter sweeps, and evaluation with multiple well-characterized real RNA-seq datasets. Further, this is only a start - a full evaluation would need to be much more comprehensive.

Here, I used the newest versions of both edgeR and DESeq, using the well-characterized Pasilla dataset, available in the pasilla Bioconductor package. The dataset is from an experiment in Drosophila investigating the effect of RNAi knockdown of the splicing factor, pasilla. I used the GLM functionality of both packages, as recommended by the vignettes, for dealing with a multifactorial experiment (condition: treated vs. untreated; library type: single-end and paired-end).
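The full scripts live elsewhere, but the edgeR side of that GLM workflow is roughly the standard vignette recipe sketched below. This is my own illustration, not the exact code used for this post: counts is assumed to be a gene-by-sample count matrix, and condition and libType are the two experimental factors.

## rough sketch of the standard edgeR GLM workflow
library(edgeR)

design <- model.matrix(~ libType + condition)  # library type as a blocking factor

d <- DGEList(counts = counts)
d <- calcNormFactors(d)
d <- estimateGLMCommonDisp(d, design)
d <- estimateGLMTrendedDisp(d, design)
d <- estimateGLMTagwiseDisp(d, design)

fit <- glmFit(d, design)
lrt <- glmLRT(fit)   # by default tests the last coefficient, i.e. the condition effect
topTags(lrt)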



Both packages provide built-in functions for assessing overall similarity between samples using either PCA (DESeq) or MDS (edgeR), although these methods operate on the same underlying data and could easily be switched.

PCA plot on variance stabilized data from DESeq:

MDS plot from edgeR:


Per gene dispersion estimates from DESeq:

Biological coefficient of variation versus abundance (edgeR):


Now, let's see how many statistically significant (FDR<0.05) results each method returns:



In this simple example, DESeq finds 820 genes significantly differentially expressed at FDR<0.05, while edgeR finds those 820 plus an additional 371. Let's take a look at the fold changes detected by both methods:

Here, if genes were found differentially expressed by edgeR only, they're colored red; if found by both, colored green. What's striking here is that for a handful of genes, DESeq is (1) reporting massive fold changes, and (2) not calling them statistically significant. What's going on here?

It turns out that these genes have extremely low counts (usually one or two counts in only one or two samples). The DESeq vignette goes through the logic of independent filtering, showing that the likelihood of a gene being significantly differentially expressed is related to how strongly it's expressed, and advocates for discarding extremely lowly expressed genes, because differential expression is likely not statistically detectable.

Count-based filtering can be achieved in two ways. The DESeq vignette demonstrates how to filter based on quantiles, while I used the filtering method demonstrated in the edgeR vignette: removing genes without at least 2 counts per million in at least two samples. This filtering code is commented out above; uncomment it to filter.
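That edgeR-style filter amounts to something like the following sketch (my own illustration, assuming a raw count matrix named counts; the post's actual filtering code is in the linked script):

## keep only genes with at least 2 counts per million in at least two samples
library(edgeR)
keep <- rowSums(cpm(counts) > 2) >= 2
counts <- counts[keep, ]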

After filtering, all of the genes shown above with apparently large fold changes detected by DESeq are removed, and the fold changes correlate much better between the two methods. edgeR still detects ~50% more differentially expressed genes, and it's unclear to me (1) why this is the case, and (2) whether this is necessarily a good thing.


Conclusions:

Unfortunately, I may have oversold the title here - this is such a cursory comparison of the two methods that I would hesitate to draw any conclusions about which method is better than the other. In addition to finding more significantly differentially expressed genes (again, not necessarily a good thing), I can say that edgeR was much faster than DESeq for fitting GLM models, though it took slightly longer to estimate the dispersion. Further, without any independent filtering, edgeR gave me moderated fold changes for the extremely lowly expressed genes for which DESeq returned logFCs in the 20-30 range (but these transcripts were so lowly expressed anyway that they should have been filtered out before any evaluation).

If there's one thing that will make me use edgeR over DESeq (until I have time to do a more thorough evaluation), it's the fact that using edgeR seems much more natural than DESeq, especially if you're familiar with the limma package (pretty much the standard for analyzing microarray data and other continuously distributed gene expression data). Setting up the design matrix and specifying contrasts feels natural if you're familiar with using limma. Further, the edgeR user guide weighs in at 67 pages, filled with many case studies that will help you in putting together a design matrix for nearly any experimental design: paired designs, time courses, batch effects, interactions, etc. The DESeq documentation is still fantastic, but could benefit from a few more case studies / examples.

What do you think? Anyone want to fork my R code and help do this comparison more comprehensively (more examples, simulated data, speed benchmarking)? Is the analysis above fair? What do you find more easy to use, or is ease-of-use (and thus, reproducibility) even important when considering data analysis?


Long-term precipitation data for your location from the US Historical Climatology Network using R


(This article was first published on a modeler's tribulations, and kindly contributed to R-bloggers)

Abstract

Long-term daily precipitation records from the US Historical Climatology Network (USHCN) are processed and summarized using R. Below is R code which examines the data from Livermore, CA, the closest available USHCN station to my present home in Fremont, CA.

Introduction

Long-term daily precipitation (rain/snow, and also temperature) records for the United States are available from the United States Historical Climatology Network (USHCN). These are observations of precipitation typically spanning about 100 years or longer. The USHCN stations are a subset of the larger/denser (but shorter-record) observation network of the Global Historical Climatology Network (GHCN).

I obtained the data for California and other relevant data from here.

R code

Identify the station nearest to your location. For me it's Livermore, CA.

rm(list = ls())

workDir <- "C:/Rstuff/DATA/ushcn/"
setwd(workDir)

# id of my nearest station (Livermore, CA), identified from
# 'ushcn-stations.txt'
myId <- "044997"

Following is the format of the data.

# format of the data, from 'data_format.txt'
# (Each record in a file contains one month of daily data.)
#
# Variable   Columns   Type
# COOP ID    1-6       Character
# YEAR       7-10      Integer
# MONTH      11-12     Integer
# ELEMENT    13-16     Character
# VALUE1     17-21     Integer
# MFLAG1     22        Character
# QFLAG1     23        Character
# SFLAG1     24        Character
# VALUE2     25-29     Integer
# MFLAG2     30        Character
# QFLAG2     31        Character
# SFLAG2     32        Character
# . . .

Process precipitation data for your state and extract a subset of data corresponding to your station.

# read data for all stations in the state
allData <- readLines("state04_CA.txt")

# extract station ids from the data
idData <- substr(allData, 1, 6)
# create a new data frame, with ids as the first column of the frame
newData <- data.frame(idData, allData, stringsAsFactors = FALSE)
# extract data corresp to your nearest station
myData <- subset(newData, idData == myId)
myData <- myData[, 2]  #throw away the previously added first column

The function below is used later to determine the number of days in a month, accounting for leap years.

# function to compute the number of days in a month; inputs are year and month.
# This is the best I could do without downloading external R libraries.
FnDaysInMonth <- function(yr, mo) {
    date1 <- paste(yr, mo, "01", sep = "-")  #current month, day 1

    mo2 <- ifelse(mo < 12, mo + 1, 1)
    yr2 <- ifelse(mo < 12, yr, yr + 1)
    date2 <- paste(yr2, mo2, "01", sep = "-")  #next month, day 1

    return(as.numeric(difftime(as.Date(date2), as.Date(date1))))
}
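A couple of quick checks (these calls are my addition, not part of the original script):

FnDaysInMonth(2000, 2)   # 29 -- leap year
FnDaysInMonth(2001, 2)   # 28
FnDaysInMonth(2001, 12)  # 31 -- December rolls over to January of the next year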

Open an output file to store the extracted data; it is read back again later for plotting.

outFile <- file(paste(myId, ".txt", sep = ""), "wt")

Each line of the data file contains data corresponding to all the days in the month. Discard temperature records and read only “PRCP”. Also, check for the data quality flags.

## each line is 1 month of data
for (eachLine in 1:length(myData)) {
    yrVar <- as.numeric(substr(myData[eachLine], 7, 10))
    moVar <- as.numeric(substr(myData[eachLine], 11, 12))
    metVar <- substr(myData[eachLine], 13, 16)

    # only extract precipitation info
    if (metVar == "PRCP") {
        ## for each day of the month , check the data flags and get the data
        for (eachDay in 1:FnDaysInMonth(yrVar, moVar)) {
            dayOffset <- 17 + ((eachDay - 1) * 8)
            metVal <- as.numeric(substr(myData[eachLine], dayOffset, dayOffset + 
                4))
            mflag <- substr(myData[eachLine], dayOffset + 5, dayOffset + 5)  #is irrelevant
            qflag <- substr(myData[eachLine], dayOffset + 6, dayOffset + 6)  #should be blank
            sflag <- substr(myData[eachLine], dayOffset + 7, dayOffset + 7)  #should not be blank

            # write to output
            if (qflag == " " & sflag != " ") {
                writeLines(paste(yrVar, moVar, eachDay, metVal, sep = ","), 
                  outFile)
            }
        }
    }
}
close(outFile)

Read back data for summary graphs

prcp <- read.csv(paste(myId, ".txt", sep = ""), header = FALSE, sep = ",", 
    as.is = TRUE)
colnames(prcp) <- c("yr", "mo", "day", "val")
prcp$val <- prcp$val/100  #convert hundredths of inches to inches

Graphs …

# yearly total
yrtot <- aggregate(val ~ yr, data = prcp, FUN = sum)
png(filename = "fig1.png")
plot(yrtot$val ~ yrtot$yr, type = "h", main = "Annual Precipitation Total (inches), Livermore, CA", 
    ylab = "inches/year", xlab = "year")
garbage <- dev.off()



# monthly total
montot <- aggregate(val ~ yr + mo, data = prcp, FUN = sum)
png(filename = "fig2.png")
boxplot(montot$val ~ montot$mo, range = 0, main = "Monthly Precipitation Total (inches), Livermore, CA", 
    ylab = "inches/month", xlab = "calendar month")
garbage <- dev.off()

# number of rainy days per month; a rainy day has a rain amount > 0.01 inches
prcp$val <- ifelse(prcp$val <= 0.01, 0, 1)
raindays <- aggregate(val ~ yr + mo, data = prcp, FUN = sum)
png(filename = "fig3.png")
boxplot(raindays$val ~ raindays$mo, range = 0, main = "Monthly Rainy Days, Livermore, CA", 
    ylab = "days/month", xlab = "calendar month")
garbage <- dev.off()

(Figure 1: annual precipitation totals. Figure 2: monthly precipitation totals. Figure 3: monthly rainy days.)

Summary

This code, when modified slightly, could also be used to read GHCN daily precipitation data.


Color Palettes in HCL Space


(This article was first published on Trestle Technology, LLC » R, and kindly contributed to R-bloggers)

This is a quick follow-up to my previous post about Color Palettes in RGB Space. Achim Zeileis had commented that, perhaps, it would be more informative to evaluate the color palettes in HCL (polar LUV) space, as that spectrum more accurately describes how humans perceive color. Perhaps more clear trends would emerge in HCL space, or color palettes would more closely hug their principal component.

For the uninitiated, a good introduction to HCL color spaces is available at this site, or from Mr. Zeileis' own paper here.

We'll start by loading in some code written previously (or slightly modified to support HCL). We can plot out an image of the different color palettes we're evaluating once again.

source("../Correlation.R")
source("loadPalettes.R")
par(mfrow = c(6, 3), mar = c(2, 1, 1, 1))
for (i in 1:18) {
    palette <- getSequentialLuv(i, allSequential)
    image(mat, col = hex(palette), axes = FALSE, main = i)
}

HCL-Space

The fundamental question for this analysis was how these color palettes move through HCL space, as opposed to RGB space, which was considered last time.

(Interactive figure: a 3-dimensional visualization of palette #2 in HCL space)

R2 Values

As we did in the previous analysis, we can use the principal component of each palette to compute an R2 value quantifying how well the data aligns to that component.

(Interactive figure: palette #2 with its principal component)

We'll recycle some of the old functions, and create a new one that calculates the R2 values in HCL space.

#' Compute the proportion of variance accounted for by the given number of components
#'
#' @author Jeffrey D. Allen \email{Jeffrey.Allen@@UTSouthwestern.edu}
propVar <- function(pca, components=1){
  #proportion of variance explained by the first component
  pv <- pca$sdev^2/sum(pca$sdev^2)[1] 
  return(sum(pv[1:components]))
}

#' Calculate the R-squared values of all 18 color palettes
#'
#' @author Jeffrey D. Allen \email{Jeffrey.Allen@@UTSouthwestern.edu}
calcR2Sequential <- function(){
  library(rgl)
  R2 <- list()
  for (i in which(sequential[,2] == 9)){
    palette <- RGBToLuv(sequential[i:(i+8), 7:9])
    pca <-plotLuvCols(palette[,"L"],palette[,"C"], palette[,"H"], pca=1)    
    R2[[length(R2)+1]] <- propVar(pca,1)
  }
  return(R2)
}

calcR2RGBSequential <- function() {
    library(rgl)
    R2 <- list()
    for (i in which(sequential[, 2] == 9)) {
        palette <- sequential[i:(i + 8), 7:9]
        pca <- plotCols(palette$R, palette$G, palette$B, pca = 1)
        cat(i, ": ", propVar(pca, 1), "\n")
        R2[[length(R2) + 1]] <- propVar(pca, 1)
    }
    return(R2)
}

Comparison of R2 Values

Let's compare the R2 values computed in RGB space to those computed in HCL space:

# R2 Values in HCL Space
r2 <- calcR2Sequential()
r2 <- unlist(r2)
names(r2) <- 1:18

# R2 Values in RGB Space
rgbr2 <- calcR2RGBSequential()
## 34 :  0.981 
## 76 :  0.9097 
## 118 :  0.9541 
## 160 :  0.9847 
## 202 :  0.9552 
## 244 :  0.9663 
## 286 :  0.9674 
## 328 :  0.9273 
## 370 :  0.9344 
## 412 :  0.9593 
## 454 :  0.9049 
## 496 :  0.9007 
## 538 :  0.9954 
## 580 :  0.9752 
## 622 :  0.9846 
## 664 :  0.9292 
## 706 :  0.9311 
## 748 :  1
rgbr2 <- unlist(rgbr2)
names(rgbr2) <- 1:18

# plot the comparison
plot(rgbr2 ~ r2, ylab = "RGB R2 Values", xlab = "HCL R2 Values", main = "RGB vs. HCL R2 Values")
pv <- anova(lm(rgbr2 ~ r2))$"Pr(>F)"[1]

As seen clearly in the plot, these two variables are not correlated (p-value of 0.5232), so there's definitely a difference in how these palettes move through HCL vs. RGB space.

Color Ratings

One hypothesis of this analysis was that because HCL space corresponds more directly to our perception of color, perhaps a smoother or more linear path through HCL space would have greater consequence on the visual appeal of the color palette than it would in RGB space. To test this, we can repeat the same analysis as we had before to see if a small R2 value is significantly correlated with the visual appeal of a color palette.

colorPreference <- read.csv("../turk/output/Batch_790445_batch_results.csv", 
    header = TRUE, stringsAsFactors = FALSE)
colorPreference <- colorPreference[, 28:29]
colorPreference[, 1] <- substr(colorPreference[, 1], 44, 100)
colorPreference[, 1] <- substr(colorPreference[, 1], 0, nchar(colorPreference[, 
    1]) - 4)
colnames(colorPreference) <- c("palette", "rating")

prefList <- split(colorPreference[, 2], colorPreference[, 1])
prefList <- prefList[order(as.integer(names(prefList)))]

R2 Values

We can calculate the R2 values for each palette as previously discussed and compare to see if it's associated with the aesthetic appeal of a palette.

r2
##      1      2      3      4      5      6      7      8      9     10 
## 0.9614 0.9794 0.9799 0.9388 0.9708 0.8902 0.9651 0.9579 0.9725 0.9842 
##     11     12     13     14     15     16     17     18 
## 0.9325 0.9129 0.9603 0.9358 0.9568 0.9611 0.9785 1.0000
plot(avgs ~ r2, main = "Linearity of Color Palette vs. Aesthetic Appeal", xlab = "R-squared Value", 
    ylab = "Average Aesthetic Rating")
abline(lm(avgs ~ r2), col = 4)
pv <- anova(lm(avgs ~ r2))$"Pr(>F)"[1]

Oddly, this result contradicts the previous one; it seems that adhering to a linear path through HCL space is actually inversely related to the aesthetic appeal of these palettes. We should note, of course, that the p-value for this correlation is not significant as it stands (p-value = 0.6598).

Regardless, the negative trend seems fairly obvious on inspection and, if you were to exclude the two "outlier" palettes on the left side which have stark breaks in HCL space on either end of the spectrum, you get a significant (0.01) negative correlation. Indeed, these same two palettes were the same outliers in the previous plot comparing RGB to HCL R2 values.

Obviously, as it stands, the conclusion of the analysis is that there is no correlation in these palettes between linearity in HCL space and aesthetic appeal, but it is curious that the hinted correlation is actually negative.

Dimensional Spread

Another point of interest is the "spread" or coverage of a palette across one particular axis. For instance, it may be interesting to compare a color palette which varies only in luminance to one which varies only in chroma to see if one type of progression is more aesthetically appealing.

We can summarize such a phenomenon by calculating the distance between consecutive colors in a palette along a given axis (luminosity, for instance), which we'll label "ΔLuminosity". Extracting the median of these differences may provide some insight into the movement of a particular palette along that axis.

plotDimensionalScoring <- function(palettes, scoring, ...) {
    
    ranges <- data.frame(L = numeric(0), C = numeric(0), H = numeric(0))
    
    for (i in 1:length(palettes)) {
        colPal <- RGBToLuv(palettes[[i]])
        
        # get the differences between each color on each axis
        colPal <- diff(colPal)
        colPal <- abs(apply(colPal, 2, median))
        
        ranges[i, ] <- colPal
        
    }
    
    plot3d(0, 0, 0, xlab = "L", ylab = "C", zlab = "H", type = "n", xlim = c(-0.5, 
        max(ranges[, 1] + 1)), ylim = c(-0.5, max(ranges[, 2] + 1)), zlim = c(-0.5, 
        max(ranges[, 3] + 1)), ...)
    
    cols <- heat_hcl(n = 101)
    
    for (i in 1:nrow(ranges)) {
        plot3d(ranges[i, 1], ranges[i, 2], ranges[i, 3], type = "p", add = TRUE, 
            pch = 19, size = 10, col = cols[101 - (round(scoring[i], 2) * 100)], 
            ...)
    }
    ranges
}

normAvgs <- avgs - min(avgs)
normAvgs <- normAvgs/max(normAvgs)
ranges <- plotDimensionalScoring(allSequential, normAvgs)

Each palette now has one median Δ value for each axis. We can plot these in 3D space to see where the palettes fall. We'll go ahead and color-code these based on their average visual appeal to see if we can observe any trends or "hotspots" in which palettes are consistently rated as visually appealing based only on their median movement along an axis.

(Interactive figure: each color palette's median change in HCL space, color-coded by average visual appeal)

We're descending into fairly non-scientific analysis here, but I noticed something when viewing these points along the chroma and luminosity axes: there seems to be a pattern of consistently positioned high-ranked palettes.

I interpolated a heatmap based on these ratings on the C and L axes.

library(akima)

resolution = 64
cSeq <- seq(min(ranges[, 2]), max(ranges[, 2]), length.out = resolution)
lSeq <- seq(min(ranges[, 1]), max(ranges[, 1]), length.out = resolution)

a <- interp(x = ranges[, 1], y = ranges[, 2], z = normAvgs, xo = lSeq, yo = cSeq, 
    duplicate = "mean")

filled.contour(a, color.palette = heat_hcl, xlab = "Luminance", ylab = "Chroma", 
    main = "Visual Appeal Across Chroma & Luminance")
maxL <- lSeq[which.max(apply(a$z, 2, max, na.rm = TRUE))]
## Warning: no non-missing arguments to max; returning -Inf
maxC <- cSeq[which.max(apply(a$z, 1, max, na.rm = TRUE))]
## Warning: no non-missing arguments to max; returning -Inf

You can see the peak of visual appeal at luminance = 6.374 and chroma = 8.332. A few examples of palettes using this "optimized" chroma and luminance spacing...

par(mfrow = c(6, 3), mar = c(2, 1, 1, 1))
for (i in 1:18) {
    barplot(rep(1, 9), col = sequential_hcl(9, h = (20 * i), c. = c(100, 33.6), 
        l = c(40, 84.59), power = 1))
}

Individual Axis Analysis

There is one palette that seems to be a bit of an outlier in most regards. Of all the palettes, the greyscale palette has the lightest L value, and the lowest C and H values. If you calculate the Z-scores on each color axis, this palette has consistently higher scores; below is a plot of the average Z-score across the 3 axes.

barplot(apply(abs(scale(ranges)), 1, mean), col = c(rep("#BBBBBB", 17), "#CC7777"), 
    xlab = "Palette #", ylab = "Average Z-score Across {H, C, L} Axes")

Because of this, I decided to exclude it from all further calculations of significance. Where appropriate, I'll still plot it graphically but distinguish it from the other palettes.

Rather than relying on a 3D plot, we can examine each axis individually to see if there's a significant association between a palette's median variation on that axis and its visual appeal. (The greyscale palette is plotted with a triangle symbol in these plots and was excluded from the p-value and regression calculations.)

filRanges <- ranges[1:17, ]
filNormAvgs <- normAvgs[1:17]

plot(normAvgs ~ ranges[, "H"], pch = c(rep(1, 17), 2), main = "Hue vs. Visual Appeal", 
    xlab = expression("Median " * Delta * Hue), ylab = "Average Aesthetic Rating")
abline(lm(filNormAvgs ~ filRanges[, "H"]), col = 2)
pvH <- anova(lm(filNormAvgs ~ filRanges[, "H"]))$"Pr(>F)"[1]
text(3, 0.1, paste("p-value =", round(pvH, 3)))
plot(normAvgs ~ ranges[, "C"], pch = c(rep(1, 17), 2), main = "Chroma vs. Visual Appeal", 
    xlab = expression("Median " * Delta * Chroma), ylab = "Average Aesthetic Rating")
abline(lm(filNormAvgs ~ filRanges[, "C"]), col = 2)
pvC <- anova(lm(filNormAvgs ~ filRanges[, "C"]))$"Pr(>F)"[1]
text(3, 0.1, paste("p-value =", round(pvC, 3)))
plot(normAvgs ~ ranges[, "L"], pch = c(rep(1, 17), 2), main = "Luminance vs. Visual Appeal", 
    xlab = expression("Median " * Delta * Luminance), ylab = "Average Aesthetic Rating")
abline(lm(filNormAvgs ~ filRanges[, "L"]), col = 2)
pvL <- anova(lm(filNormAvgs ~ filRanges[, "L"]))$"Pr(>F)"[1]
text(5, 0.1, paste("p-value =", round(pvL, 3)))

As shown in the plots, there is no observable correlation between movement through hue and visual appeal; there is a significant negative correlation between median variation in chroma and visual appeal; and there seems to be a negative (though non-significant) correlation between median variation in luminance and visual appeal.

Conclusion

Comparison to Colorspace Palettes

It was mentioned that colorspace generates its palettes in HCL space, so the trends should emerge more obviously for such palettes. As it turns out, these palettes are not necessarily linear, even in HCL space. Many are fairly linear on at least one axis, but often have at least one color at either end which is completely non-linear, causing the R2 values to often be lower than they were for the ColorBrewer palettes. For instance:

csp <- as(hex2RGB(sequential_hcl(n = 9)), "polarLUV")@coords
pca <- plotLuvCols(csp[, "L"], csp[, "C"], csp[, "H"], 1)

(Interactive figure: the "sequential_hcl" color palette from the colorspace package in HCL space)

After inspecting the code, I found that the hue value in this color palette never changes. So it seems that we're encountering a substantial rounding error when we get to the dark/light edges of the color palette.

Summary

I'm still a bit perplexed by the results I observed, but I suppose this analysis hints at a few things.

First of all, I realized that it very much matters which color space you're working in when designing color palettes. As the lack of correlation between the R2 values from RGB space and HCL space shows, the movement of different palettes through these spaces is drastically different.

Second, it looks like there is some potential to optimize the spacing of colors in a given palette by finding these "hotspots" in the Luminance vs. Chroma plot. On the other hand, movement along the hue axis for a given palette doesn't seem to have much of an effect on the visual appeal.

Finally, it seems that the rounding error on hex codes and 255-value RGB values becomes significant for especially dark or bright ends of a palette. This may put into question the effectiveness of such analysis in 3D space. Of course, the hue value represents a complete circle, so a hue of 359 is very close to a hue of 0, which wouldn't be captured in a naive 3D analysis. Even accounting for this problem, there are other nuances of this color space which can't be captured in 3D space, and may jeopardize the effectiveness of PCA in such a space.
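For example, a distance measure on the hue axis would need to wrap around the circle, something like this hypothetical helper (my own sketch, not part of the analysis above):

## smallest angular difference between two hue values, in degrees
hueDist <- function(h1, h2) {
    d <- abs(h1 - h2) %% 360
    pmin(d, 360 - d)
}
hueDist(359, 0)  # 1, not 359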

Future Work

As mentioned previously, it will be interesting to analyze not just the visual appeal of a color palette, but the effectiveness of a palette at communicating information. Hopefully the next post will be able to capture some of that information.

As always, comments are welcome! I'm well outside the scope of my training and expertise, so I'd be happy to hear any critiques or concerns about methodology.



analyze the national survey on drug use and health (nsduh) with r


(This article was first published on united states government survey data by anthony damico, and kindly contributed to R-bloggers)
the national survey on drug use and health (nsduh) monitors illicit drug, alcohol, and tobacco use with more detail than any other survey out there.  if you wanna know the average age at first chewing tobacco dip, the prevalence of needle-sharing, the family structure of households with someone abusing pain relievers, even the health insurance coverage of peyote users, you are in the right place.  the substance abuse and mental health services administration (samhsa) contracts with the north carolinians over at research triangle institute to run the survey, but the university of michigan's substance abuse and mental health data archive (samhda) holds the keys to this data castle.

nsduh in its current form only goes back about a decade, when samhsa re-designed the methodology and started paying respondents thirty bucks a pop.  before that, look for its predecessor - the national household survey on drug abuse (nhsda) - with public use files available back to 1979 (included in these scripts).  be sure to read those changes in methodology carefully before you start trying to trend smokers' virginia slims brand loyalty back to 1999.

although (to my knowledge) only the national health interview survey contains r syntax examples in its documentation, the friendly folks at samhsa have shown promise.  since their published data tables were run on a restricted-access data set, i requested that they run the same sudaan analysis code on the public use files to confirm that this new r syntax does what it should.  they delivered, i matched, pats on the back all around.

if you need a one-off data point, samhda is overflowing with options to analyze the data online.  you even might find some restricted statistics that won't appear in the public use files.  still, that's no substitute for getting your hands dirty.  when you tire of menu-driven online query tools and you're ready to bark with the big data dogs, give these puppies a whirl.  the national survey on drug use and health targets the civilian, noninstitutionalized population of the united states aged twelve and older.  this new github repository contains three scripts:


1979-2010 - download all microdata.R
  • authenticate the university of michigan's "i agree with these terms" page
  • download, import, save each available year of data (with documentation) back to 1979
  • convert each pre-packaged stata do-file (.do) into r, run the damn thing, get NAs where they belong

2010 single-year - analysis examples.R
  • load a single year of data
  • limit the table to the variables needed for an example analysis
  • construct the complex sample survey object (a rough sketch of this step appears below)
  • run enough example analyses to make a kitchen sink jealous

replicate samhsa puf.R



click here to view these three scripts
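as a point of reference, the 'construct the complex sample survey object' step typically boils down to a svydesign() call from the survey package.  the sketch below is mine, not lifted from the scripts, and the design variable names (vestr, verep, analwt_c) and the cigmon variable are assumptions you should check against the codebook.

# a rough sketch only -- variable names are assumptions, check the nsduh codebook
library(survey)

nsduh.design <-
    svydesign(
        id = ~verep,          # replicate (cluster) identifier -- assumed name
        strata = ~vestr,      # stratum identifier -- assumed name
        weights = ~analwt_c,  # person-level analysis weight -- assumed name
        data = nsduh,         # the single-year data frame loaded earlier
        nest = TRUE
    )

# estimates then come from functions like svymean()
svymean(~cigmon, nsduh.design, na.rm = TRUE)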



for more detail about the national survey on drug use and health, visit:


notes:

the 'download all microdata' program intentionally breaks unless you complete the clearly-defined, one-step instruction to authenticate that you have read and agree with the download terms.  the script will download the entire public use file archive, but only after this step has been completed.  if you contact me for help without reading those instructions, i reserve the right to tease you mercilessly.  also: thanks to the great hadley wickham for figuring out how to authenticate in the first place.

confidential to sas, spss, stata, and sudaan users: did you know that you don't have to stop reading just because you've run out of candlewax?  maybe it's time to switch to r.  :D


Finding a pin in a haystack – PCA image filtering


(This article was first published on me nugget, and kindly contributed to R-bloggers)


I found the following post regarding the anomalous metal object observed in a Curiosity Rover photo to be fascinating - specifically, the clever ways that some programmers used for filtering the image for the object. The following answer on mathematica.stackexchange.com was especially illuminating for its use of a multivariate distribution to describe the color channels for a test region of "sand". This distribution was subsequently used to assess if the rest of the image colors belonged to the same distribution.

I tried a different approach, using a principal component analysis (PCA) filter (above), also based on a region of sand. I believe the PCs can be understood in the following way: the PCs represent dominant RGB colors (below), while the loadings indicate the intensity of the color.


The first PC is obviously the main color of the sand and explains 99.949% of the variance in the colors. Shadowed areas and sun-lit areas are masked out fairly equally by subtracting a reconstructed image based on PC1, since they are of similar color but differing intensity. What remains are the non-sand regions.
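The author's code to reproduce the figures is linked below. As a rough illustration of the general idea only (the file name "scene.png" and the sand rectangle are placeholders of mine, not from the original), a PC1-based filter could be sketched like this:

## a sketch of the general idea, not the original code
library(png)

img  <- readPNG("scene.png")              # height x width x 3 array of RGB values
rgbm <- apply(img, 3, as.vector)          # flatten to a (pixels x 3) matrix

## fit the PCA on a hand-picked patch of sand
sand <- img[200:300, 200:300, ]
pca  <- prcomp(apply(sand, 3, as.vector), center = TRUE)

## reconstruct the whole image from PC1 alone and subtract it
scores1 <- scale(rgbm, center = pca$center, scale = FALSE) %*% pca$rotation[, 1]
recon1  <- sweep(scores1 %*% t(pca$rotation[, 1]), 2, pca$center, "+")

## large residuals flag pixels whose color is not explained by the sand component
anomaly <- matrix(sqrt(rowSums((rgbm - recon1)^2)), nrow = dim(img)[1])
image(t(anomaly)[, dim(img)[1]:1], col = grey.colors(256))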


Code to reproduce:

Read more »


How to explore your heartbeat


(This article was first published on holtmeier.de » Rstat, and kindly contributed to R-bloggers)
A few months ago, I bought a really cool book: Exploring Everyday Things (with R and Ruby). I learned many interesting and mostly useless things from the author, Sau Sheong Chang. Chapter 6 for example explains how to build a stethoscope with nothing more than
  • a paper cup,
  • a foil,
  • an elastic band and
  • iphone-earplugs.
I took a day off yesterday. This was the ideal opportunity to realize this little project. I also needed to gain experience with knitr, a fantastic piece of software for writing reproducible reports (like the one you're reading now) in Markdown. If you're interested, you can find the source code and all necessary files for this complete blog post here. Last but not least, I re-wrote the original code (Ruby –> R). Why? Again: practicing my R skills…

First step: Homemade stethoscope

Before I start, let me show you what I've done to record a heartbeat. It wasn't as easy as I thought it would be. The first picture shows my attempt to build the stethoscope. The earphones have a built-in microphone, which I placed inside the paper cup; two little holes in the cup's side walls let the cables through. (Picture: how to place the microphone inside the paper cup.) You can see in the second picture that I've used a lot of crepe tape. There's no need to, and I must admit it doesn't look very elegant. But I don't care. The important trick in recording a heartbeat is the membrane: mine is made from a freezer bag and is attached by two elastic bands. (Picture: the final result.)

Second step: Record your heartbeat!

To record the heartbeat, I used an iPhone app called TwistedWave. You can try any other recording software, as long as it can save WAV files (mono). All my initial efforts to record the heartbeat were unsuccessful, and no solution for these problems could be found in the book! I thought about this challenge and came up with an easy solution: sports! After running the stairs down to the front door and back up to the 4th floor, I was able to record a heartbeat, because it was louder than before. I saved the heartbeat.wav in my working directory for further analysis.

Third step: Extracting data from sound

The WAV file format is described on page 160, and the explanation of the Ruby code used can be found on page 163. Buy the book, it's worth it! The following code is my (extremely short) R version of the original Ruby code. I make use of the tuneR package.
library(tuneR)
heartbeat <- readWave("heartbeat.wav", from = 55001, to = 275000, units = "samples")
print(heartbeat)

## Wave Object
##   Number of Samples:      220000
##   Duration (seconds):     4.99
##   Samplingrate (Hertz):   44100
##   Channels (Mono/Stereo): Mono
##   Bit (8/16/24/32):       16
My resulting sound file lasts 4.99 seconds; 220000 samples have been imported. Wow! Because of unwanted noise at the beginning and the end of heartbeat.wav, I selected only the samples between #55001 and #275000.

Fourth step: Visualizing your heartbeat

Because I recorded the heart sounds in mono, I plot the left channel (there is no right channel). The following graph suggests that the heart beats 10 times within the 5-second period.

plot(heartbeat@left, type = "n", main = "Channel 1 + Channel 2",
     xlab = "Time", ylab = "Frequency")
lines(heartbeat@left)

(Figure: waveform of the recorded heartbeat)
I multiply the 10 beats by 12 to get the heart rate: 120 bpm! Sounds plausible. If you're interested in a deeper analysis of the sound file, you'll find more information in the book.
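The same back-of-the-envelope calculation in code, using the Wave object from above:

beats    <- 10                                            # counted from the plot
duration <- length(heartbeat@left) / heartbeat@samp.rate  # about 4.99 seconds
beats * 60 / duration                                     # roughly 120 beats per minute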

Last step: Publish the knitted HTML file on my blog (WordPress) and on RPubs

So far so good. Now I want to publish the finished HTML report. What you read here is the result of my efforts.
  • RPubs: Very easy! In RStudio I first press the "Knit HTML" button and then the "Publish" button. Registration for RPubs is free! Here is the result: http://rpubs.com/stephan_cgn/3134
  • WordPress: Also very simple! At first I thought I would need to upload the images manually and adjust the paths. Surprise! This is not necessary: the generated HTML file already contains the binary data of the pictures, which I did not expect. So the HTML file can simply be opened in a text editor and the relevant markup copied into the HTML view of the WordPress editor. Done. Unfortunately, I lose the syntax highlighting that way. That can be resolved, but not in this article. Here is the result: http://holtmeier.de/heartbeat


A thermometer in R using Arduino and Java


(This article was first published on jean-robert.github.com, and kindly contributed to R-bloggers)

A couple of weeks ago, Markus Gesmann demonstrated the feasibility of connecting R and Arduino using Processing and Rserve in a nice and simple example. This rekindled my interest in Arduino, which I hadn't had time to satisfy…

Again I will show a very easy and simple example, but this time transferring the data between Arduino and R via Java. That way, we can use Arduino in a "pull" mode rather than a "push" mode. Indeed, the Arduino sketch is typically written so that the board continuously sends information to the computer (each time it goes through the loop() function), but maybe we don't want to listen to it all the time and only retrieve information when we need it.

One of the useful items provided with the basic Arduino starter kit is the temperature sensor (called TMP36), which can also be useful in everyday life. The idea of this exercise is to record the temperature and show it in an R chart. Practically, this means three steps: a) setting up the Arduino board to capture the temperature, b) building a Java class to "pull" the temperature from the Arduino board, and c) using the Java class from R to make our chart.

Set up the Arduino board

No need to do anything complicated, since the Arduino Experimentation Kit has example code doing almost what we need. All details on the physical setup can be found here; we just need to tweak the code a little for our purpose. As you will notice, the loop() function (which runs over and over again) prints the temperature to the serial port, ending with a carriage return character. Yet, when we read from the serial port (later, in Java), something interesting happens. For reasons that escape me, the data sent via serial is split into several packets of various sizes, making the retrieval in Java a bit more arduous. To solve that issue, we simply prepend a small header to our data, namely 'T'. The loop() function now looks like this:

void loop() {
     float temperature = getVoltage(temperaturePin);
     temperature = (temperature - .5) * 100;

     Serial.print("T");
     Serial.println(temperature);
     delay(1000);
}

We upload everything that is needed to the Arduino board, and now it should be sending the temperature with its header every second (in "push" mode). You can use the Serial Monitor in the Arduino IDE to check that this is working. Now let's use Java to get that information.

Build a Java class to interface

As suggested on the Arduino website, one possibility to connect Java to Arduino is through the RXTX library. You should follow the instructions here on how to set it up. Actually, the sample provided also does most of the job. Two things need to be taken care of.

  • First of all, the Java class originally works with a listener on the serial port such that when the Arduino sends data, the Java app will print it. Because we want to use it in a “pull” mode, we will instead save the data in a buffer variable (temperatureBuffer). Remember we also added a header to the data being sent, so we take care of that too in the serialEvent function.
  • Secondly, the original sample code works with a standard “main” function to be run on its own. We don’t need it here; instead we create a public function in our class that we will use from R to “read” the temperature from the buffer. We end up with the following code for the Java class, which you then need to compile and build into a jar file (javac SerialTemperature.java then jar cf SerialTemperature.jar SerialTemperature.class).
import java.io.InputStream;
import gnu.io.CommPortIdentifier; 
import gnu.io.SerialPort;
import gnu.io.SerialPortEvent; 
import gnu.io.SerialPortEventListener; 
import java.util.Enumeration;

public class SerialTemperature implements SerialPortEventListener {

    SerialPort serialPort;
    /** The port we're normally going to use. */
    private static final String PORT_NAMES[] = { 
	"/dev/ttyACM0", // Linux port, might be different on your PC...
    };
    /** Buffered input stream from the port */
    private InputStream input;

    /** Milliseconds to block while waiting for port open */
    private static final int TIME_OUT = 2000;
    /** Default bits per second for COM port. */
    private static final int DATA_RATE = 9600;
    
    private String temperatureBuffer;

    public void initialize() {
	CommPortIdentifier portId = null;
	Enumeration portEnum = CommPortIdentifier.getPortIdentifiers();

	// iterate through, looking for the port
	while (portEnum.hasMoreElements()) {
	    CommPortIdentifier currPortId = (CommPortIdentifier) portEnum.nextElement();
	    for (String portName : PORT_NAMES) {
		if (currPortId.getName().equals(portName)) {
		    portId = currPortId;
		    break;
		}
	    }
	}

	if (portId == null) {
	    System.out.println("Could not find COM port.");
	    return;
	}

	try {
	    // open serial port, and use class name for the appName.
	    serialPort = (SerialPort) portId.open(this.getClass().getName(),
						  TIME_OUT);

	    // set port parameters
	    serialPort.setSerialPortParams(DATA_RATE,
					   SerialPort.DATABITS_8,
					   SerialPort.STOPBITS_1,
					   SerialPort.PARITY_NONE);

	    // open the streams
	    input = serialPort.getInputStream();

	    // add event listeners
	    serialPort.addEventListener(this);
	    serialPort.notifyOnDataAvailable(true);
	} catch (Exception e) {
	    System.err.println(e.toString());
	}
    }

    /**
     * This should be called when you stop using the port.
     * This will prevent port locking on platforms like Linux.
     */
    public synchronized void close() {
	if (serialPort != null) {
	    serialPort.removeEventListener();
	    serialPort.close();
	}
    }

    /**
     * This will be used by R to retrieve the temperature value
     */
    public synchronized Float read() {
	return Float.valueOf(temperatureBuffer.substring(1)).floatValue();
    }

    /**
     * Handle an event on the serial port. Read the data and save it to the buffer
     */
    public synchronized void serialEvent(SerialPortEvent oEvent) {
	if (oEvent.getEventType() == SerialPortEvent.DATA_AVAILABLE) {
	    try {
		int available = input.available();
		byte chunk[] = new byte[available];
		input.read(chunk, 0, available);
		
		String s = new String(chunk);
		if(s.contains("T")) {
		    temperatureBuffer = s;
		} else {
		    temperatureBuffer += s;
		}
				
	    } catch (Exception e) {
		System.err.println(e.toString());
	    }
	}
    }

}

Read the temperature from R

The final step, and maybe the easiest one for R people, is to read the temperature from R using the Java interface. That’s where the rJava package comes in handy. Remember to initialize the JVM with your newly built jar in the classpath, and you’re ready to go. We first initialize the connection, then read the temperature whenever we want simply by using .jsimplify(.jcall(ardJava, returnSig='Ljava/lang/Float;', method='read')). The code below shows a simple application where we record the temperature and plot it using ggplot2. From there it’s pretty easy to do whatever you want with your temperature recorder. One could for instance use the statistical arsenal of R to make forecasts, and then display them with a cool visualization on your website using Shiny.

setwd('/data/R/ArduinoTemp/')
require(rJava)
require(ggplot2)

.jinit(classpath='SerialTemperature.jar')
ardJava <- .jnew('SerialTemperature')

.jcall(ardJava, returnSig='V', method='initialize')

tempCapture <- NULL
while(Sys.Date()<'2012-11-11') {
  system('sleep 30')
  try({
    ans <- .jsimplify(.jcall(ardJava, returnSig='Ljava/lang/Float;', method='read'))
    tempCapture <- rbind(tempCapture, data.frame(Time=Sys.time(), Temperature=ans))
    print(ggplot(tempCapture) + geom_line(aes(x=Time, y=Temperature)) + theme_bw())
  }, silent=T)
}

.jcall(ardJava, returnSig='V', method='close')


The chart shows the outdoor temperature during one day, sampled every 30 seconds. As you can see, my window faces south, since the temperature spikes up around noon. Otherwise it’s quite cold for a November in Paris, barely 10 degrees…
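
As a hedged sketch of the forecasting idea mentioned above (not part of the original post), one could fit an ARIMA model to the recorded series; this assumes the tempCapture data frame built in the recording loop and the forecast package, which is not used elsewhere in this post.

## hedged sketch: forecast the recorded temperature (assumes 'tempCapture' and the 'forecast' package)
library(forecast)
fit <- auto.arima(ts(tempCapture$Temperature))
plot(forecast(fit, h = 120))  # forecast the next 120 samples, i.e. about an hour at 30s intervals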

All files used in this Arduino/Java/R example are available here: https://gist.github.com/4055869.

To leave a comment for the author, please follow the link and comment on his blog: jean-robert.github.com.


Visualising Tourism Data using R with googleVis package


(This article was first published on Pairach Piboonrungroj » R, and kindly contributed to R-bloggers)

Inspired by Mages’s post on Accessing and plotting World bank data with R (using googleVis package), I created one visualising tourism receipts and international tourist  arrivals of various countries since 1995. The data used are from the World Bank’s country indicators.

To see the motion chart, double click a picture below.

[Motion chart thumbnail: Tourism googleVis]

 Code

install.packages("googleVis")
library('googleVis')

getWorldBankData <- function(id='SP.POP.TOTL', date='1960:2010',
 value="value", per.page=12000){
 require(RJSONIO)
 url <- paste("http://api.worldbank.org/countries/all/indicators/", id,
 "?date=", date, "&format=json&per_page=", per.page,
 sep="")

 wbData <- fromJSON(url)[[2]]

 wbData = data.frame(
 year = as.numeric(sapply(wbData, "[[", "date")),
 value = as.numeric(sapply(wbData, function(x)
 ifelse(is.null(x[["value"]]),NA, x[["value"]]))),
 country.name = sapply(wbData, function(x) x[["country"]]['value']),
 country.id = sapply(wbData, function(x) x[["country"]]['id'])
 )

 names(wbData)[2] <- value

 return(wbData)
}

getWorldBankCountries <- function(){
 require(RJSONIO)
 wbCountries <-
 fromJSON("http://api.worldbank.org/countries?per_page=12000&format=json")
 wbCountries <- data.frame(t(sapply(wbCountries[[2]], unlist)))
 wbCountries$longitude <- as.numeric(wbCountries$longitude)
 wbCountries$latitude <- as.numeric(wbCountries$latitude)
 levels(wbCountries$region.value) <- gsub(" \\(all income levels\\)",
 "", levels(wbCountries$region.value))
 return(wbCountries)
}

## Create a string 1960:this year, e.g. 1960:2011
years <- paste("1960:", format(Sys.Date(), "%Y"), sep="")

## International Tourism Arrivals
inter.tourist.arrivals<- getWorldBankData(id='ST.INT.ARVL',
 date=years, value="International tourism, number of arrivals")

## International Tourism Receipts
tourism.receipts <- getWorldBankData(id='ST.INT.RCPT.CD', date=years,
 value="International tourism, receipts (current US$)")

## Population
population <- getWorldBankData(id='SP.POP.TOTL', date=years,
 value="population")

## GDP per capita (current US$)
GDP.per.capita <- getWorldBankData(id='NY.GDP.PCAP.CD',
 date=years,
 value="GDP.per.capita.Current.USD")

## Merge data sets
wbData <- merge(tourism.receipts, inter.tourist.arrivals)
wbData <- merge(wbData, population)
wbData <- merge(wbData, GDP.per.capita)

## Get country mappings
wbCountries <- getWorldBankCountries()

## Add regional information
wbData <- merge(wbData, wbCountries[c("iso2Code", "region.value", "incomeLevel.value")],
 by.x="country.id", by.y="iso2Code")

## Filter out the aggregates and country id column
subData <- subset(wbData, !region.value %in% "Aggregates" , select=
 -country.id)

## Create a motion chart
M <- gvisMotionChart(subData, idvar="country.name", timevar="year",
 options=list(width=700, height=600))

## Display the chart in your browser
plot(M)

# save as a file
print(M, file="myGoogleVisChart.html")


Filed under: R, Tourism

To leave a comment for the author, please follow the link and comment on his blog: Pairach Piboonrungroj » R.


Visualizing Principal Components


(This article was first published on Systematic Investor » R, and kindly contributed to R-bloggers)

Principal Component Analysis (PCA) is a procedure that converts observations into linearly uncorrelated variables called principal components (Wikipedia). The PCA is a useful descriptive tool to examine your data. Today I will show how to find and visualize Principal Components.

Let’s look at the components of the Dow Jones Industrial Average index over 2012. First, I will download the historical prices and sector information for all of its components.

###############################################################################
# Load Systematic Investor Toolbox (SIT)
# http://systematicinvestor.wordpress.com/systematic-investor-toolbox/
###############################################################################
setInternet2(TRUE)
con = gzcon(url('http://www.systematicportfolio.com/sit.gz', 'rb'))
    source(con)
close(con)

	#*****************************************************************
	# Find Sectors for each company in DOW 30
	#****************************************************************** 
	tickers = spl('XLY,XLP,XLE,XLF,XLV,XLI,XLB,XLK,XLU')
	tickers.desc = spl('ConsumerCyclicals,ConsumerStaples,Energy,Financials,HealthCare,Industrials,Materials,Technology,Utilities')
	
	sector.map = c()
	for(i in 1:len(tickers)) {
		sector.map = rbind(sector.map, 
				cbind(sector.spdr.components(tickers[i]), tickers.desc[i])
			)
	}
	colnames(sector.map) = spl('ticker,sector')

	#*****************************************************************
	# Load historical data
	#****************************************************************** 
	load.packages('quantmod')	
	tickers = dow.jones.components()
	
	sectors = factor(sector.map[ match(tickers, sector.map[,'ticker']), 'sector'])
		names(sectors) = tickers
	
	data <- new.env()
	getSymbols(tickers, src = 'yahoo', from = '2000-01-01', env = data, auto.assign = T)
		for(i in ls(data)) data[[i]] = adjustOHLC(data[[i]], use.Adjusted=T)	
	
	bt.prep(data, align='keep.all', dates='2012')
	
	# re-order sectors, because bt.prep can change the order of tickers
	sectors = sectors[data$symbolnames]
	
	# save data for later examples
	save(data, tickers, sectors, file='bt.pca.test.Rdata')

Next, let’s run the Principal Component Analysis (PCA) on the companies returns during 2012 and plot percentage of variance explained for each principal component.

	#*****************************************************************
	# Principal component analysis (PCA), for interesting discussion
	# http://machine-master.blogspot.ca/2012/08/pca-or-polluting-your-clever-analysis.html
	#****************************************************************** 
	prices = data$prices	
	ret = prices / mlag(prices) - 1
	
	p = princomp(na.omit(ret))
	
	loadings = p$loadings[]
	p.variance.explained = p$sdev^2 / sum(p$sdev^2)

	# plot percentage of variance explained for each principal component	
	barplot(100*p.variance.explained, las=2, xlab='', ylab='% Variance Explained')


The first principal component, which usually corresponds to market returns, explains around 45% of the variance during 2012.
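
A small hedged addition (not in the original post): the cumulative variance explained, computed from the p.variance.explained vector above, makes this figure easy to verify.

	# cumulative % of variance explained by the first five components
	round(100 * cumsum(p.variance.explained)[1:5], 1)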

Next let’s plot all companies loadings on the first and second principal components and highlight points according to the sector they belong.

	#*****************************************************************
	# 2-D Plot
	#****************************************************************** 		
	x = loadings[,1]
	y = loadings[,2]
	z = loadings[,3]
	cols = as.double(sectors)
	
	# plot all companies loadings on the first and second principal components and highlight points according to the sector they belong
	plot(x, y, type='p', pch=20, col=cols, xlab='Comp.1', ylab='Comp.2')
	text(x, y, data$symbolnames, col=cols, cex=.8, pos=4)
	
	legend('topright', cex=.8,  legend = levels(sectors), fill = 1:nlevels(sectors), merge = F, bty = 'n') 


Please notice that the companies in the same sector tend to group together on the plot.

Next, let’s go one step further and create a 3D plot using the first, second, and third principal components.

	#*****************************************************************
	# 3-D Plot, for good examples of 3D plots
	# http://statmethods.wordpress.com/2012/01/30/getting-fancy-with-3-d-scatterplots/
	#****************************************************************** 				
	load.packages('scatterplot3d') 
	
	# plot all companies loadings on the first, second, and third principal components and highlight points according to the sector they belong
	s3d = scatterplot3d(x, y, z, xlab='Comp.1', ylab='Comp.2', zlab='Comp.3', color=cols, pch = 20)
		
	s3d.coords = s3d$xyz.convert(x, y, z)
	text(s3d.coords$x, s3d.coords$y, labels=data$symbolnames, col=cols, cex=.8, pos=4)
		
	legend('topleft', cex=.8,  legend = levels(sectors), fill = 1:nlevels(sectors), merge = F, bty = 'n') 


The 3D chart does add a bit of extra info, but I find the 2D chart easier to look at.

In the next post, I will demonstrate clustering based on the selected Principal components and after that I want to explore the interesting discussion presented in the using PCA for spread trading post.

Happy Holidays

To view the complete source code for this example, please have a look at the bt.pca.test() function in bt.test.r at github.


To leave a comment for the author, please follow the link and comment on his blog: Systematic Investor » R.


Clustering with selected Principal Components


(This article was first published on Systematic Investor » R, and kindly contributed to R-bloggers)

In the Visualizing Principal Components post, I looked at the Principal Components of the companies in the Dow Jones Industrial Average index over 2012. Today, I want to show how we can use Principal Components to create clusters (i.e. form groups of similar companies based on their distance from each other).

Let’s start by loading the historical prices for the companies in the Dow Jones Industrial Average index that we saved in the Visualizing Principal Components post.

###############################################################################
# Load Systematic Investor Toolbox (SIT)
# http://systematicinvestor.wordpress.com/systematic-investor-toolbox/
###############################################################################
setInternet2(TRUE)
con = gzcon(url('http://www.systematicportfolio.com/sit.gz', 'rb'))
    source(con)
close(con)

	#*****************************************************************
	# Load historical data
	#****************************************************************** 
	load.packages('quantmod')	
	
	# load data saved in the bt.pca.test() function
	load(file='bt.pca.test.Rdata')

	#*****************************************************************
	# Principal component analysis (PCA), for interesting discussion
	# http://machine-master.blogspot.ca/2012/08/pca-or-polluting-your-clever-analysis.html
	#****************************************************************** 
	prices = data$prices	
	ret = prices / mlag(prices) - 1
	
	p = princomp(na.omit(ret))
	
	loadings = p$loadings[]
	
	x = loadings[,1]
	y = loadings[,2]
	z = loadings[,3]	

To create clusters, I will use hierarchical cluster analysis, the hclust function in the stats package. The first argument of the hclust function is the distance (dissimilarity) matrix. To compute the distance matrix, let’s take the first 2 principal components and compute the Euclidean distance between each company:

	#*****************************************************************
	# Create clusters
	#****************************************************************** 		
	# create and plot clusters based on the first and second principal components
	hc = hclust(dist(cbind(x,y)), method = 'ward')
	plot(hc, axes=F,xlab='', ylab='',sub ='', main='Comp 1/2')
	rect.hclust(hc, k=3, border='red')


Similarly we can use the first three principal components:

	# create and plot clusters based on the first, second, and third principal components
	hc = hclust(dist(cbind(x,y,z)), method = 'ward')
	plot(hc, axes=F,xlab='', ylab='',sub ='', main='Comp 1/2/3')
	rect.hclust(hc, k=3, border='red')


Another option is to use the Correlation matrix as a proxy for a distance matrix:

	# create and plot clusters based on the correlation among companies
	hc = hclust(as.dist(1-cor(na.omit(ret))), method = 'ward')
	plot(hc, axes=F,xlab='', ylab='',sub ='', main='Correlation')
	rect.hclust(hc, k=3, border='red')


Please note that the clusters can be quite different, depending on the distance matrix you use.
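
As a hedged illustration of this point (not from the original post), the 3-cluster memberships from the PCA-based and correlation-based distances can be cross-tabulated directly, reusing the x, y, and ret objects defined above; the hc.pca and hc.cor names are illustrative.

	# compare 3-cluster memberships from the two distance definitions
	hc.pca = hclust(dist(cbind(x, y)), method = 'ward')
	hc.cor = hclust(as.dist(1 - cor(na.omit(ret))), method = 'ward')
	table(PCA = cutree(hc.pca, k = 3), Correlation = cutree(hc.cor, k = 3))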

To view the complete source code for this example, please have a look at the bt.clustering.test() function in bt.test.r at github.


To leave a comment for the author, please follow the link and comment on his blog: Systematic Investor » R.


100 most read R posts in 2012 (stats from R-bloggers) – big data, visualization, data manipulation, and other languages


(This article was first published on R-statistics blog » R, and kindly contributed to R-bloggers)

R-bloggers.com is now three years young. The site is an (unofficial) online journal of the R statistical programming environment, written by bloggers who agreed to contribute their R articles to the site.

Last year, I posted on the top 24 R posts of 2011. In this post I wish to celebrate R-bloggers’ third birthmonth by sharing with you:

  1. Links to the top 100 most read R posts of 2012
  2. Statistics on “how well” R-bloggers did this year
  3. My wishlist for the R community for 2013 (blogging about R, guest posts, and sponsors)

1. Top 100 R posts of 2012

R-bloggers’ success is thanks to the content submitted by the over 400 R bloggers who have joined r-bloggers.  The R community currently has around 245 active R bloggers (links to the blogs are clearly visible in the right navigation bar on the R-bloggers homepage).  In the past year, these bloggers wrote around 3200 posts about R!

Here is a list of the top visited posts on the site in 2012 (you can see the number of page views in parentheses):

  1. Select operations on R data frames (42,742)
  2. Julia, I Love You (22,405)
  3. R at 12,000 Cores (22,584)
  4. An R programmer looks at Julia (17,172)
  5. Adding a legend to a plot (16,413)
  6. Solving easy problems the hard way (13,201)
  7. The Best Statistical Programming Language is …Javascript? (11,047)
  8. Step up your R capabilities with new tools for increased productivity (9,758)
  9. How I cracked Troyis (the online flash game) (9,527)
  10. Setting graph margins in R using the par() function and lots of cow milk (9,549)
  11. Creating surface plots (8,705)
  12. Running R on an iPhone/iPad with RStudio (8,903)
  13. Drawing heatmaps in R (8,719)
  14. A big list of the things R can do (8,152)
  15. Two sample Student’s t-test #1 (8,112)
  16. Paired Student’s t-test (7,950)
  17. Installing R packages (7,999)
  18. Multiple Y-axis in a R plot (7,486)
  19. R Tutorial Series: Labeling Data Points on a Plot (7,375)
  20. Color Palettes in R (6,656)
  21. Plot maps like a boss (6,898)
  22. Model Validation: Interpreting Residual Plots (6,763)
  23. find | xargs … Like a Boss (7,001)
  24. Getting Started with Sweave: R, LaTeX, Eclipse, StatET, & TeXlipse (6,775)
  25. R Tutorial Series: R Beginner’s Guide and R Bloggers Updates (6,703)
  26. The R apply function – a tutorial with examples (6,764)
  27. Delete rows from R data frame (6,243)
  28. Polynomial regression techniques (6,396)
  29. Why R is Hard to Learn (6,281)
  30. Basic Introduction to ggplot2 (6,107)
  31. Trading using Garch Volatility Forecast (5,886)
  32. Will 2015 be the Beginning of the End for SAS and SPSS? (5,924)
  33. Fun with the googleVis Package for R (5,495)
  34. Creating beautiful maps with R (5,576)
  35. Tutorial: Principal Components Analysis (PCA) in R (4,907)
  36. Wilcoxon-Mann-Whitney rank sum test (or test U) (5,574)
  37. Introducing Shiny: Easy web applications in R (5,501)
  38. R is the easiest language to speak badly (5,583)
  39. R 2.15.0 is released (5,486)
  40. Basics on Markov Chain (for parents) (5,395)
  41. Pivot tables in R (5,320)
  42. Displaying data using level plots (4,942)
  43. R Tutorial Series: Basic Polynomial Regression (5,165)
  44. Merging Multiple Data Files into One Data Frame (5,083)
  45. Quick Introduction to ggplot2 (5,060)
  46. Summarising data using box and whisker plots (4,953)
  47. Make R speak SQL with sqldf (4,745)
  48. MySQL and R (4,595)
  49. ggheat : a ggplot2 style heatmap function (4,578)
  50. Aggregate Function in R: Making your life easier, one mean at a time (4,756)
  51. The role of Statistics in the Higgs Boson discovery (4,560)
  52. Plotting Time Series data using ggplot2 (4,543)
  53. The Kalman Filter For Financial Time Series (4,367)
  54. R 101: The Subset Function (4,626)
  55. Create your own Beamer template (4,569)
  56. Mining Facebook Data: Most “Liked” Status and Friendship Network (4,493)
  57. The Many Uses of Q-Q Plots (4,376)
  58. Social Network Analysis with R (4,307)
  59. 20 free R tutorials (and one reference card) (4,227)
  60. To attach() or not attach(): that is the question (4,439)
  61. add your blog! | R-bloggers (3,941)
  62. Learn R and Python, and Have Fun Doing It (4,205)
  63. Creating a Presentation with LaTeX Beamer – Using Overlays (4,319)
  64. Summarising data using dot plots (4,078)
  65. Google summer of code 2012 – and R – a call for students (4,180)
  66. nice ggplot intro tutorial. Just run the commands, about 6 pages… (3,902)
  67. Tracking Hurricane Sandy with Open Data and R (4,108)
  68. Time Series Analysis and Mining with R (3,874)
  69. Linear mixed models in R (3,846)
  70. A graphical overview of your MySQL database (3,919)
  71. Updating R but keeping your installed packages (3,317)
  72. Data.table rocks! Data manipulation the fast way in R (3,691)
  73. Generating graphs of retweets and @-messages on Twitter using R and Gephi (3,623)
  74. Amateur Mapmaking: Getting Started With Shapefiles (3,656)
  75. Datasets to Practice Your Data Mining (3,782)
  76. How to customize ggplot2 graphics (3,720)
  77. Interactive HTML presentation with R, googleVis, knitr, pandoc and slidy (3,599)
  78. The undiscovered country – a tutorial on plotting maps in R (3,560)
  79. polar histogram: pretty and useful (3,487)
  80. Classification Trees (3,545)
  81. Text Mining to Word Cloud App with R (3,388)
  82. Top 20 R posts of 2011 (and some R-bloggers statistics) (3,606)
  83. Combining ggplot Images (3,492)
  84. Integrating PHP and R (3,420)
  85. Tutorials for Learning Visualization in R (3,509)
  86. RStudio in the cloud, for dummies (3,402)
  87. London Olympics 100m men’s sprint results (3,460)
  88. Online resources for handling big data and parallel computing in R (3,383)
  89. The Higgs boson: 5-sigma and the concept of p-values (3,339)
  90. Interactive reports in R with knitr and RStudio (3,296)
  91. Maps with R (I) (3,283)
  92. ggplot2 Time Series Heatmaps (3,262)
  93. Simple Text Mining with R (3,174)
  94. Contingency Tables – Fisher’s Exact Test (3,250)
  95. An example of ROC curves plotting with ROCR (3,202)
  96. Great Maps with ggplot2 (3,155)
  97. Style your R charts like the Economist, Tableau … or XKCD (3,218)
  98. Simple Linear Regression (3,212)
  99. A practical introduction to garch modeling (3,158)
  100. Adding lines or points to an existing barplot (3,057)

 

2. Statistics – how well did R-bloggers do in 2012?

Short answer: quite well.

In 2012, R-bloggers reached around 11,000 regular subscribers (which you can also subscribe to: via RSS, or e-mail), serving the content of about 245 R bloggers.  In total, the site was visited around 2.7 million times, by over 1.1 million people.  Below you can see a few figures comparing the statistics of 2012 with those of 2011 (just click an image to enlarge it):

[Images: R-bloggers statistics, 2012 vs 2011]

3. My wishlist for 2013 – about the future of the R blogosphere

Well now, this has been an AMAZING year for the R-project in general, the R community, and consequently also for R-bloggers.  Here are a few things I wish for 2013:

Reproducible R blogging – make it easy to blog from R to WordPress and Blogger (via knitr, RStudio, etc.)

The past year has been wonderful regarding progress in making research with R reproducible, using Sweave, knitr, RStudio, and many new R packages.  For 2013 I wish someone (or some company, RStudio, cough cough) would take it upon themselves to make it as easy as possible to do Reproducible R blogging.  The seeds are already there: thanks to people like JJ Allaire, Jeffrey Horner, Vicent Marti, and Natacha Porte we now have the markdown package, which, combined with Yihui Xie’s knitr package and the wonderful RStudio IDE, allows us all to easily create HTML documents of R analysis.  Combine this with something like one of Duncan Temple Lang’s R packages (XMLRPC, RWordPress) and one can imagine the future.
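
In fact, a rough version of that workflow can already be sketched today. The snippet below is only a hedged illustration: the blog URL, credentials, and file name are placeholders, and the knit2wp() function shipped with knitr may behave differently across versions.

library(RWordPress)
library(knitr)
options(WordPressLogin = c(myuser = "mypassword"),                 # placeholder credentials
        WordPressURL = "https://myblog.wordpress.com/xmlrpc.php")  # placeholder URL
knit2wp("my-post.Rmd", title = "A post written in R", publish = FALSE)  # knit and upload as a draft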

The next step will be to have a “publish to your WordPress/blogger” button right from the RStudio console – allowing for the smoothest R blogging experience one could dream of.

I hope we’ll see this as early as possible in 2013.

Creating online interactive visualization using R

There can never be enough of this really.

So far, I should give props to Markus Gesmann and Diego de Castillo for authoring and maintaining the awesome googleVis R package.  This package is great for online publishing of interesting results.  For example, see the site StatIL.org – visualizing over 25,000 time series of Israel’s statistics using HTML files produced (also) with the googleVis package (example: population of Israel between 1950 and 2011).

The second promising project is Shiny, which makes it incredibly easy to build interactive web applications with R. Since they intend to release an open source server for Shiny, which can run on Apache, we can expect very interesting developments on that front this year.

More guest posts on R-bloggers

If you have valuable knowledge and insights to share with the R community, the best way I suggest is to start your own free blog on WordPress.com.  Create a dedicated R category for your R posts, and ask to join r-bloggers (make sure to read and follow the guidelines mentioned there).

This year I am considering allowing non-bloggers to also take part in the party.  The idea is to create a simple form which will allow you to write a guest article which (after review) will go live on r-bloggers (without the need to first start your own blog).  If you are interested in submitting such a guest article in the future (even if you are not sure exactly what you will write about), please fill out this form with your e-mail.  If I see people are interested, I will go ahead and create this service.

Your help in sharing/linking-to R-bloggers.com

Sharing: If you don’t already know, R-bloggers is not a company.  The site is run by just one guy (Tal Galili).  There is no marketing team, marketing budget, or any campaign.  The only people who know about the site are you and the people YOU send the link to (through Facebook, your personal website, blog, etc.).  So if you haven’t already – please help share r-bloggers.com in whatever way you can online.

Subscribe to R-bloggers.com

You can also subscribe to daily updates of new R posts via RSS, or by filling in your e-mail address (I don’t give it to strangers, I promise).  You can also join the R-bloggers Facebook page, but make sure (once liked) to press the “like” button and tick “get notifications” and “show in news feed” (see the image below).

 

[Image: rbloggers_stats_2012_4]

Sponsoring

If you are interested in sponsoring/placing ads/supporting R-bloggers, you are welcome to contact me.  Currently there is not much space left, but you can still contact me and I will update you once an ad placement frees up.

Stay in touch :)

As always, you are welcome to leave a comment on this blog, and/or contact me (keeping in mind it might take me some time to get back to you, but I promise I will).

 

Happy new year!
Yours truly,
Tal Galili

To leave a comment for the author, please follow the link and comment on his blog: R-statistics blog » R.


More Principal Components Fun


(This article was first published on Systematic Investor » R, and kindly contributed to R-bloggers)

Today, I want to continue with the Principal Components theme and show how the Principal Component Analysis can be used to build portfolios that are not correlated to the market. Most of the content for this post is based on the excellent article, “Using PCA for spread trading” by Jev Kuznetsov.

Let’s start by loading the components of the Dow Jones Industrial Average index over the last 5 years.

###############################################################################
# Load Systematic Investor Toolbox (SIT)
# http://systematicinvestor.wordpress.com/systematic-investor-toolbox/
###############################################################################
setInternet2(TRUE)
con = gzcon(url('http://www.systematicportfolio.com/sit.gz', 'rb'))
    source(con)
close(con)

	#*****************************************************************
	# Load historical data
	#****************************************************************** 
	load.packages('quantmod')
	tickers = dow.jones.components()

	data <- new.env()
	getSymbols(tickers, src = 'yahoo', from = '2009-01-01', env = data, auto.assign = T)
	bt.prep(data, align='remove.na')	

Next let’s compute the Principal Components based on the first year of price history.

	#*****************************************************************
	# Principal component analysis (PCA), for interesting discussion
	# http://machine-master.blogspot.ca/2012/08/pca-or-polluting-your-clever-analysis.html
	#****************************************************************** 
	prices = last(data$prices, 1000)
		n = len(tickers)  		
	ret = prices / mlag(prices) - 1
	
	p = princomp(na.omit(ret[1:250,]))
	
	loadings = p$loadings[]

	# look at the first 4 principal components 	
	components = loadings[,1:4]
	
	# normalize all selected components to have total weight = 1
	components = components / repRow(colSums(abs(components)), len(tickers))
	
	# note that first component is market, and all components are orthogonal i.e. not correlated to market
	market = ret[1:250,] %*% rep(1/n,n)
	temp = cbind(market, ret[1:250,] %*% components)
		colnames(temp)[1] = 'Market'	
		
	round(cor(temp, use='complete.obs',method='pearson'),2)

	# the variance of each component is decreasing
	round(100*sd(temp,na.rm=T),2)
Correlation:
       Market Comp.1 Comp.2 Comp.3 Comp.4
Market    1.0      1    0.2    0.1      0
Comp.1    1.0      1    0.0    0.0      0
Comp.2    0.2      0    1.0    0.0      0
Comp.3    0.1      0    0.0    1.0      0
Comp.4    0.0      0    0.0    0.0      1

Standard Deviation:
Market Comp.1 Comp.2 Comp.3 Comp.4
   1.8    2.8    1.2    1.0    0.8

Please note that the first principal component is highly correlated with the market, while the principal components have very low correlation with each other and (apart from the first) very low correlation with the market. Also, by construction, the volatility of the principal components is decreasing. An interesting observation that you might want to check yourself: the principal components are quite persistent in time (i.e. if you compute both correlations and volatilities using future prices, for example the next 4 years of prices, the principal components maintain their correlation and volatility profiles).
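
A hedged sketch of that check (not part of the original post), reusing the ret, n, and components objects created above: apply the in-sample loadings to the remaining out-of-sample returns and recompute the correlation and volatility profiles. The out and temp.out names are illustrative.

	# hedged sketch: out-of-sample check of component persistence
	out = ret[-(1:250),]
	temp.out = cbind(out %*% rep(1/n,n), out %*% components)
		colnames(temp.out)[1] = 'Market'
	round(cor(temp.out, use='complete.obs', method='pearson'), 2)
	round(100*apply(temp.out, 2, sd, na.rm=T), 2)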

Next, let’s check if any of the principal components are mean-reverting. I will use Augmented Dickey-Fuller test to check if principal components are mean-reverting. (small p-value => stationary i.e. mean-reverting)

	#*****************************************************************
	# Find stationary components, Augmented Dickey-Fuller test
	#****************************************************************** 	
	library(tseries)
	equity = bt.apply.matrix(1 + ifna(-ret %*% components,0), cumprod)
		equity = make.xts(equity, index(prices))
	
	# test for stationarity ( mean-reversion )
	adf.test(as.numeric(equity[,1]))$p.value
	adf.test(as.numeric(equity[,2]))$p.value
	adf.test(as.numeric(equity[,3]))$p.value
	adf.test(as.numeric(equity[,4]))$p.value

The Augmented Dickey-Fuller test indicates that the 4th principal component is stationary. Let’s have a closer look at its price history and its composition:

	#*****************************************************************
	# Plot securities and components
	#*****************************************************************
	layout(1:2)
	
	# add Bollinger Bands
	i.comp = 4
	bbands1 = BBands(repCol(equity[,i.comp],3), n=200, sd=1)
	bbands2 = BBands(repCol(equity[,i.comp],3), n=200, sd=2)
	temp = cbind(equity[,i.comp], bbands1[,'up'], bbands1[,'dn'], bbands1[,'mavg'],
				bbands2[,'up'], bbands2[,'dn'])
		colnames(temp) = spl('Comp. 4,1SD Up,1SD Down,200 SMA,2SD Up,2SD Down')
	
	plota.matplot(temp, main=paste(i.comp, 'Principal component'))
	
	barplot.with.labels(sort(components[,i.comp]), 'weights')		


The price history along with 200 day moving average and 1 and 2 Bollinger Bands are shown on the top pane. The portfolio weights of the 4th principal component are shown on the bottom pane.

So now you have a mean-reverting portfolio that is also uncorrelated to the market. There are many ways you can use this information. Please share your ideas and suggestions.
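
Just as one hedged illustration (not from the original post) of how this information might be used, a naive contrarian rule on the 4th component’s equity curve could go short above the 2-standard-deviation band and long below it; the ec and sig names are illustrative, and transaction costs and risk management are ignored.

	# hedged toy signal: short above the 2SD band, long below it, otherwise hold the last position
	ec  = as.numeric(equity[,i.comp])
	sig = ifelse(ec > as.numeric(bbands2[,'up']), -1,
		ifelse(ec < as.numeric(bbands2[,'dn']), 1, NA))
	sig = zoo::na.locf(sig, na.rm=FALSE)	# carry the last signal forward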

To view the complete source code for this example, please have a look at the bt.pca.trading.test() function in bt.test.r at github.


To leave a comment for the author, please follow the link and comment on his blog: Systematic Investor » R.


Factor Analysis of Baseball’s Hall of Fame Voters


(This article was first published on Statistically Significant, and kindly contributed to R-bloggers)
Factor Analysis of Baseball's Hall of Fame Voters
Recently, Nate Silver wrote a post which analyzed how voters who voted for and against Barry Bonds for Baseball's Hall of Fame differed. Not surprisingly, those who voted for Bonds were more likely to vote for other suspected steroids users (like Roger Clemens). This got me thinking that this would make an interesting case study for factor analysis to see if there are latent factors that drive hall of fame voting.

The Twitter user @leokitty has kept track of all the known ballots of the voters in a spreadsheet. The spreadsheet is a matrix that has one row for each voter and one column for each player being voted upon. (Players need 75% of the vote to make it to the hall of fame.) I removed all players that had no votes and all voters that had given a partial ballot.

(This matrix either has a 1 or a 0 in each entry, corresponding to whether a voter voted for the player or not. Note that this kind of matrix is similar to the data that is analyzed in information retrieval. I will be decomposing the (centered) matrix using singular value decomposition to run the factor analysis. This is the same technique used for latent semantic indexing in information retrieval.)

Starting with the analysis, there is a drop off of the variance after the first 2 factors, which means it might make sense to look only at the first 2 (which is good because I was only planning on doing so).

votes = read.csv("HOF votes.csv", row.names = 1, header = TRUE)
pca = princomp(votes)
screeplot(pca, type = "l", main = "Scree Plot")
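
As a hedged aside (not part of the original post), the singular value decomposition view mentioned above can be checked directly: up to sign, the princomp loadings equal the right singular vectors of the centered vote matrix. The X and sv names are illustrative.

X <- scale(votes, center = TRUE, scale = FALSE)   # center the 0/1 vote matrix
sv <- svd(X)
head(round(cbind(svd = sv$v[, 1], princomp = pca$loadings[, 1]), 3))  # agree up to sign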

Looking at the loadings, it appears that the first principal component corresponds strongly to steroid users, with Bonds and Clemens having large negative values and other suspected steroid users also being on the negative end. The players on the positive end have no steroid suspicions.

dotchart(sort(pca$loadings[, 1]), main = "First Principal Component")
The second component isn't as easy to decipher. The players at the negative end seem to players that are preferred by analytically minded analysts (think Moneyball). Raines, Trammell, and Martinez have more support among this group of voters. Morris, however, has less support among these voters and he isn't that far separated from them.

There may also be some secondary steroid association in this component, separating players with proof of steroid use from those with no proof who merely “look like” they took steroids. For example, there is no hard evidence that Bagwell or Piazza took steroids, but they were very muscular and hit a lot of home runs, so they are believed to have taken steroids. There is some evidence that the top five players of this component did take steroids.

dotchart(sort(pca$loadings[, 2]), main = "Second Principal Component")

Projecting the votes onto two dimensions, we can look at how the voters for Bonds and Clemens split up. You can see there is a strong horizontal split between the voters for and against Bonds/Clemens. There are also 3 voters that voted for Bonds, but not Clemens.

library(ggplot2)  # load ggplot2 for the plots below

ggplot(data.frame(cbind(pca$scores[, 1:2], votes))) + geom_point(aes(Comp.1,
Comp.2, colour = as.factor(Barry.Bonds), shape = as.factor(Roger.Clemens)),
size = 4) + coord_equal() + labs(colour = "Bonds", shape = "Clemens")

Similarly, I can look at how the voters split up on the issue of steroids by looking at both Bonds and Bagwell. The voters in the upper left do not care about steroid use, but believe that Bagwell wasn't good enough to make it to the hall of fame. The voters in the lower right do care about steroid use, but believe that Bagwell was innocent of any wrongdoing.

ggplot(data.frame(cbind(pca$scores[, 1:2], votes))) + geom_point(aes(Comp.1, 
Comp.2, colour = as.factor(paste(Barry.Bonds, "/", Jeff.Bagwell))), size = 4) +
geom_hline(aes(0), size = 0.2) + geom_vline(aes(0), size = 0.2) + coord_equal() +
labs(colour = "Bonds / Bagwell")

We can also look at a similar plot with Schilling instead of Bagwell. The separation here appears to be stronger.

ggplot(data.frame(cbind(pca$scores[, 1:2], votes))) + geom_point(aes(Comp.1, 
Comp.2, colour = as.factor(paste(Barry.Bonds, "/", Curt.Schilling))), size = 4) +
geom_hline(aes(0), size = 0.2) + geom_vline(aes(0), size = 0.2) + coord_equal() +
labs(colour = "Bonds / Schilling")

Finally, we can look at a biplot (using code from here).

PCbiplot <- function(PC = fit, x = "PC1", y = "PC2") {
    # PC being a prcomp object
    library(grid)
    data <- data.frame(obsnames = row.names(PC$x), PC$x)
    plot <- ggplot(data, aes_string(x = x, y = y)) +
        geom_text(alpha = 0.4, size = 3, aes(label = obsnames))
    plot <- plot + geom_hline(aes(0), size = 0.2) + geom_vline(aes(0), size = 0.2)
    datapc <- data.frame(varnames = rownames(PC$rotation), PC$rotation)
    mult <- min((max(data[, y]) - min(data[, y])/(max(datapc[, y]) - min(datapc[, y]))),
                (max(data[, x]) - min(data[, x])/(max(datapc[, x]) - min(datapc[, x]))))
    datapc <- transform(datapc, v1 = 0.7 * mult * (get(x)), v2 = 0.7 * mult * (get(y)))
    plot <- plot + coord_equal() +
        geom_text(data = datapc, aes(x = v1, y = v2, label = varnames),
                  size = 5, vjust = 1, color = "red")
    plot <- plot + geom_segment(data = datapc, aes(x = 0, y = 0, xend = v1, yend = v2),
                                arrow = arrow(length = unit(0.2, "cm")),
                                alpha = 0.75, color = "red")
    plot
}

fit <- prcomp(votes, scale = F)
PCbiplot(fit)
I could have also attempted to rotate the factors to make them more interpretable, but they appeared to have easy interpretation as is. Since we were looking at 2-d plots, rotation would not have made a difference in interpreting the plots. It is also common to use a likelihood approach to estimate factors. I chose to use the principal component method because the data are definitely not normal (being 0's and 1's).
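
For completeness, here is a hedged one-liner (not from the original post) showing what such a rotation would look like for the first two components:

varimax(pca$loadings[, 1:2])  # varimax rotation of the first two components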

To leave a comment for the author, please follow the link and comment on his blog: Statistically Significant.


Predictive Modeling using R and the OpenScoring-Engine – a PMML approach


(This article was first published on Data * Science + R, and kindly contributed to R-bloggers)

On November 27th, a special post caught my interest. Scott Mutchler presented a small framework for predictive analytics based on PMML (Predictive Model Markup Language) and a Java-based REST interface. PMML is an XML-based standard for the description and exchange of analytical models. The idea is that every piece of software which supports the corresponding PMML version (currently 4.1) can use such a model description to predict outcomes, regardless of how the model was created. Using PMML offers some big advantages, especially when R is used for modeling and the model then has to be brought into production. Often the IT staff is not willing to support yet another piece of software (like the open source R) beside the legacy systems, particularly when there are no service agreements, 24/7 support or guaranteed ‘High Availability’. R also lacks an “enterprise workflow” to send your model right into production for use as a “prediction engine”. This is one of the key points most commercial vendors of analytical software offer (e.g. Revolution Enterprise, TIBCO Spotfire, …). Scott Mutchler’s OpenScoring is a first draft of such an engine and definitely worth keeping an eye on. While he created mainly the server side of the engine, my intention was to create a simple sketch of the client side, so that you can call the servlet engine from within R. Therefore two main functions have to be provided:

  1. Exporting a predictive model into XML
  2. Predicting a target variable, given a dataset and a model (specified in PMML)

A scenario how this all could be used is depicted as an overview in the following picture:

Use Case Overview

So the first step was to get the engine running. I experienced some small problems but Scott fixed them very fast and offered a new war file ready for deployment inside tomcat. It can be obtained under http://code.google.com/p/openscoring/. If you still encounter some problems you could also try my war file. It was built with java version “1.6.0_26” and tested under tomcat 7. After installing the war file using the management console, you could test it by sending a POST request in XML like the following to ‘http://localhost:8080/OpenScoring/Scoring’:

<?xml version="1.0" encoding="UTF-8"?>
<scoring_request>
  <pmml_url>http://www.dmg.org/pmml_examples/knime_pmml_examples/IrisNeuralNet.xml
  </pmml_url>
  <model_name></model_name>
  <csv_input_rows><![CDATA[sepal_length,sepal_width,petal_length,petal_width|
      5.1,3.5,1.4,0.2|7,3.2,4.7,1.4|6.3,2.9,5.6,1.8|]]>
  </csv_input_rows>
</scoring_request>

A nice tool for first experiments is the poster plugin for Firefox.
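
Alternatively, the same test request can be sent directly from R. The snippet below is a hedged variant of the RCurl calls used later in this post; it assumes the XML above has been saved to a file named scoring_request.xml (the file name is illustrative).

library(RCurl)
# read the request XML shown above from a file (illustrative file name)
xmlRequest <- paste(readLines("scoring_request.xml"), collapse = "\n")
r <- dynCurlReader()
curlPerform(postfields = xmlRequest, url = "http://localhost:8080/OpenScoring/Scoring",
            verbose = TRUE, post = 1L, writefunction = r$update)
cat(r$value())  # raw XML response with the predictions appended as an extra column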

Now let’s turn to developing the client side. The implementation of the first functionality is pretty straightforward – build your model in the classical way and export it using the pmml package. You can find the supported methods inside the pmml documentation or here. The following code shows a small example using decision trees and artificial data consisting only of numerical attributes (including the target):

require(ggplot2)  # for visualization
require(pmml)  # for storing the model in xml
require(rpart)  # use decision trees

# ------------------------------------------------------- 
# Example 1 - categorical target with numeric inputs
# -------------------------------------------------------

# create artificial dataset
set.seed(1234)
gX <- rnorm(100, mean = 0, sd = 0.5)
gY <- rnorm(100, mean = 1, sd = 0.5)
rX <- rnorm(100, mean = 1, sd = 0.5)
rY <- rnorm(100, mean = 0, sd = 0.5)
dataset <- data.frame(X = c(gX, rX), Y = c(gY, rY), 
    Col = c(rep("red", 100), rep("green", 100)))

bp <- ggplot(dataset, aes(X, Y, colour = Col)) + geom_point()
bp + scale_colour_manual(name = "Col", values = c("green", "red"), 
        labels = c("GREEN","RED"), breaks = c("green", "red"))



# create decision tree model
treeModel <- rpart(Col ~ ., data = dataset)

## export as pmml
# place to store the xml file
localFilenameAndPath = "/tmp/treeModel_numericalInput_categTarget.xml"  
# save model using pmml
saveXML(pmml(treeModel, model.name = "TreeModel", app.name = "RR/PMML", 
     dataset = dataset), file = localFilenameAndPath) 

Most effort goes into developing the predict function. We first have to transform the input data frame and split it into strings, where every string represents one row of the data frame (with the header as one additional string). The fields have to be comma delimited. After that, all strings are combined into one string separated by the pipe (‘|’) symbol. The second step is creating the XML structure, consisting of at least the main node (<scoring_request>) and two subordinate nodes (<pmml_url> and <csv_input_rows>). The node <pmml_url> contains the URL, with scheme ‘http://’, ‘file://’ or ‘ftp://’, pointing to the location of your model. The node <csv_input_rows> contains the data formatted as mentioned above, wrapped inside a CDATA node. Last but not least, the whole XML document needs to be transformed into a string. As a third step, we send the POST request to the server using curlPerform from the RCurl package. The last part is to extract the results from the server response. The format is the same as in the request, but it contains the predictions in one additional column, so a combination of strsplit, gsub and sapply gets the job done. Because the prediction is always returned as a character string, we need to convert it properly; therefore the user has to specify a suitable transformation function as input to the prediction function.

predictPMMLModel <- function(dataset,   # dataset for prediction 
        transformTargetAttribute,       # type of target attribute
        modelURL,                       # place where the scoring 
                                        # engine could find the pmml model
        applServerURL                   # servlet url
){
    require(XML)
    require(RCurl)

    header <- paste(colnames(dataset), collapse=",") # extract header
    # transformation to characters is necessary to avoid some “bad surprise” 
    # from R's handling of factor attributes
    datasetTransf <- data.frame(lapply(dataset, as.character), 
            stringsAsFactors=FALSE)
    dataString <- paste(header,"|", paste(do.call("rbind",
                            by(datasetTransf, 1:nrow(datasetTransf), function(row){
                                        paste(row, collapse=",")
                                    }, simplify = FALSE)), collapse ="|"), 
                            "|", sep = "")

    # create xml document
    xmlHeader <- xmlPINode(sys = 'xml', value = 'version="1.0" encoding="UTF-8"')
    xmlRequest <- xmlNode("scoring_request", 
            xmlNode("pmml_url", modelURL), 
            xmlNode("model_name"),
            xmlNode("csv_input_rows",xmlCDataNode(dataString)))

    # xml request as string
    fullXMLRequest <- paste(toString(xmlHeader),"\n", 
        gsub(" ", "", toString(xmlRequest, sep=""), fixed = TRUE))

    # http post request
    r = dynCurlReader()
    curlPerform(postfields = fullXMLRequest, url = applServerURL, 
            verbose = TRUE, post = 1L, writefunction = r$update)
    r$value()

    # parse results - !!caution: currently no error checking!!
    tmp <- xmlTreeParse(r$value())
    predictionString <- xmlValue(tmp[[1]][[1]][[4]])
    # extract predictions line by line
    predictionLines <- strsplit(predictionString, split ="|", fixed = TRUE)[[1]][-1]
    predictions <- transformTargetAttribute(sapply(predictionLines, function(s){
                        gsub('\"','',
        tail(strsplit(s, ',', fixed = TRUE)[[1]], n=1))
                    }))
    names(predictions) <- NULL
    return(predictions)
}

Calling the final prediction function then is straightforward and the given examples show some cases with different combinations of categorical/numeric input/target attributes. Simple results for the training error are also given:

# prediction
prediction1 <- predictPMMLModel(dataset = dataset, 
    transformTargetAttribute = factor, 
    modelURL = paste("file://", localFilenameAndPath, sep = ""), 
    applServerURL = "http://localhost:8080/OpenScoring/Scoring")
table(dataset$Col, prediction1)  # tabulate results

# ------------------------------------------------------- 
# Example 2 - categorical target with mixed inputs
# -------------------------------------------------------

# create artificial dataset
set.seed(1234)
gX <- factor(sample(c("a", "b", "c"), size = 100, replace = TRUE, 
   prob = c(0.7, 0.2, 0.1)))
gY <- rnorm(100, mean = 1, sd = 0.5)
rX <- factor(sample(c("a", "b", "c"), size = 100, replace = TRUE, 
   prob = c(0.1, 0.2, 0.7)))
rY <- rnorm(100, mean = 0, sd = 0.5)

# http://stackoverflow.com/questions/8229904/r-combining-two-factors
dataset <- data.frame(X = unlist(list(gX, rX)), Y = c(gY, rY), 
   Col = c(rep("red", 100), rep("green", 100)))

bp <- ggplot(dataset, aes(X, Y, colour = Col)) + geom_point()
bp + scale_colour_manual(name = "Col", values = c("green", "red"), 
   labels = c("GREEN", "RED"), breaks = c("green", "red"))



# create decision tree model
treeModel <- rpart(Col ~ ., data = dataset)

# export as pmml
localFilenameAndPath = "/tmp/treeModel_mixedInput_categTarget.xml"
saveXML(pmml(treeModel, model.name = "TreeModel", app.name = "RR/PMML", 
    dataset = dataset), file = localFilenameAndPath)
# prediction
prediction2 <- predictPMMLModel(dataset = dataset, 
    transformTargetAttribute = factor, 
    modelURL = paste("file://", localFilenameAndPath, sep = ""), 
    applServerURL = "http://localhost:8080/OpenScoring/Scoring")

table(dataset$Col, prediction2)  # tabulate results

# ----------------------------------------------- 
# Example 3 - numerical target with mixed input 
# -----------------------------------------------

# create artificial dataset
set.seed(1234)
gX <- factor(sample(c("a", "b", "c"), size = 100, replace = TRUE, 
   prob = c(0.7, 0.2, 0.1)))
gY <- rnorm(100, mean = 1, sd = 0.5)
rX <- factor(sample(c("a", "b", "c"), size = 100, replace = TRUE, 
   prob = c(0.1, 0.2, 0.7)))
rY <- rnorm(100, mean = 0, sd = 0.5)

dataset <- data.frame(X = unlist(list(gX, rX)), Y = c(gY, rY), 
   Col = c(rnorm(100, mean = -5, sd = 1), rnorm(100, mean = 5, sd = 1)))

bp <- ggplot(dataset, aes(X, Y, colour = Col)) + geom_point()
bp



# create decision tree model
treeModel <- rpart(Col ~ ., data = dataset)

# export model as pmml
localFilenameAndPath = "/tmp/treeModel_mixedInput_numTarget.xml"
saveXML(pmml(treeModel, model.name = "TreeModel", app.name = "RR/PMML", 
   dataset = dataset), file = localFilenameAndPath)
# prediction
prediction3 <- predictPMMLModel(dataset = dataset, 
   transformTargetAttribute = as.numeric, 
   modelURL = paste("file://", localFilenameAndPath, sep = ""), 
   applServerURL = "http://localhost:8080/OpenScoring/Scoring")

modelResults <- data.frame(Y = dataset$Col, Y_hat = prediction3)
cor(modelResults$Y, modelResults$Y_hat)^2  # computing r squared

# ----------------------------------------------- 
# Example 4 - numerical target with numerical input
# -----------------------------------------------

# create first artificial dataset
set.seed(1234)
gX <- rnorm(100, mean = 0, sd = 0.5)
gY <- rnorm(100, mean = 1, sd = 0.5)
rX <- rnorm(100, mean = 1, sd = 0.5)
rY <- rnorm(100, mean = 0, sd = 0.5)
dataset <- data.frame(X = c(gX, rX), Y = c(gY, rY), 
   Col = c(rnorm(100, mean = -5, sd = 1), rnorm(100, mean = 5, sd = 1)))

bp <- ggplot(dataset, aes(X, Y, colour = Col)) + geom_point()
bp



# create decision tree model
treeModel <- rpart(Col ~ ., data = dataset)

# export model as pmml
localFilenameAndPath = "/tmp/treeModel_numericalInput_numTarget.xml"
saveXML(pmml(treeModel, model.name = "TreeModel", app.name = "RR/PMML", 
   dataset = dataset), file = localFilenameAndPath)
# prediction
prediction4 <- predictPMMLModel(dataset = dataset, 
   transformTargetAttribute = as.numeric, 
   modelURL = paste("file://", localFilenameAndPath, sep = ""), 
   applServerURL = "http://localhost:8080/OpenScoring/Scoring")

modelResults <- data.frame(Y = dataset$Col, Y_hat = prediction4)
cor(modelResults$Y, modelResults$Y_hat)^2  # computing r squared

Finally, here are some thoughts about what has been built. Let’s begin with a drawback: PMML (at the moment) restricts you to a set of standard modeling techniques (depending on the PMML version and the implementation in the software used for modeling and prediction). I know some projects where this would be a serious concern. But on the other side, I think a lot of analytic questions can be answered with the offered portfolio of approaches – looking at my own projects it may be 50 to 70 percent. On the pro side there are a lot more things to mention. First, even if restricted to some standard models, you can use the full power of R to build your model, to experiment with it, to optimize and tweak it, and to visualize the results. Second, developer and user no longer need to use the same kind of software – the server/client could be written in any language that understands XML and PMML (only the server, if the client is not used for modeling) and can communicate with web services. Third, as mentioned at the beginning of the post, you can now use “enterprise-proven” software (Tomcat) to run your model. This really is one of the points that matters in practice (“… thanks for your nice prototype, but now we have to bring it to Java/C#/…”). Although there are some limitations, I think there are even more positive aspects (e.g. documentation, versioning of models), and I am really interested in any complementary and helpful comments.


Decluttering ordination plots in vegan part 1: ordilabel()


(This article was first published on From the bottom of the heap » R, and kindly contributed to R-bloggers)

In an earlier post I showed how to customise ordination diagrams produced by our vegan package for R through use of colours and plotting symbols. In a series of short posts I want to cover some of the options available in vegan that can be used to help in producing better, clearer, less cluttered ordination diagrams.

First up we have ordilabel().

One of the problems that ordination results pose is that there is a lot of information that we want to convey using a relatively small number of pixels. What we often end up with is a jumbled mess, and because of the way the sample or species scores are plotted, the important observations could very well end up covered by all the rare species or odd samples just by virtue of their ordering in the data set.

The simplest tool that vegan provides to help in this regard is ordilabel(); it won’t produce a publication-ready, uncluttered ordination diagram but it will help you focus on the “important”1 things.

ordilabel() draws sample or species scores with their label (site ID or species name/code) taken from the dimnames of the data used to fit the ordination. To help their display, however, ordilabel() draws each label in a box with an opaque background, so that labels plotted later (i.e. above) cover earlier labels whilst remaining visible themselves. ordilabel() also allows you to specify the importance of the samples or species via the priority argument, which in effect controls which labels get drawn first or beneath all the others.

Here I’ll use a PCA of the famous Dune Meadow data2. First, we load vegan and the data and perform the ordination

require(vegan)
data(dune)
ord <- rda(dune) ## PCA of Dune data

In this example, I want to give plotting priority to those species or samples that are most abundant or most diverse, respectively. For this I will use Hill’s N2 for both the species and the samples, both of which can be computed via the diversity() function

## species priority; which species drawn last, i.e. on top
priSpp <- diversity(dune, index = "invsimpson", MARGIN = 2)
## sample priority
priSite <- diversity(dune, index = "invsimpson", MARGIN = 1)

The MARGIN argument refers to which dimension or margin of the data is used; 1 means rows, 2 means columns. Hill’s N2 is equal to the inverse (or reciprocal) of the Simpson diversity measure.
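
As a quick sanity check of that statement (a sketch relying on vegan's documented definitions: index = "simpson" returns the Gini-Simpson index 1 - sum(p^2), while "invsimpson" returns 1/sum(p^2)), the two indices should be related by N2 = 1/(1 - D):

## Hill's N2 (inverse Simpson) versus vegan's "simpson" index for the samples
simp <- diversity(dune, index = "simpson", MARGIN = 1)
all.equal(priSite, 1 / (1 - simp))   # should be TRUE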

Throughout I’m going to use symmetric scaling of the two sets of scores for use in the biplot. As it is important to make sure the same scaling is used at each stage it is handy to store the scaling in an object and then refer to that object throughout. That way you can easily change the scaling used by altering the value in the object. Here I use scl and symmetric scaling is indicated by the number 3

## scaling to use
scl <- 3

ordilabel() adds labels to an existing plot, so first set up the plotting region for the PCA biplot using the plot() method with type = "n" to not plot any of the data

plot(ord, type = "n", scaling = 3)

Now we are ready to add labels to the plot. ordilabel() takes the ordination object as the first argument and extracts the scores indicated by the display argument from the fitted object. There are a number of standard plotting arguments to control the look and feel of the labels, but the important argument is priority to control the plotting order. Here we set it to the Hill’s N2 values we computed earlier. The code chunk below adds both to the base plot we just generated
ordilabel(ord, display = "sites", font = 3, fill = "hotpink",
          col = "blue", priority = priSite, scaling = scl)
## You may prefer separate plots, but here add species as well
ordilabel(ord, display = "species", font = 2,
          priority = priSpp, scaling = scl)

The resulting biplot should look similar to the one below
PCA biplot of the dune meadow data with labels added by ordilabel()

Not perfect, but better than the standard plot() method in vegan.

Alternatively, one might wish to draw side by side biplots of the sample and species scores. This can be done simply with a call to layout() to split the current plot device into two plot regions, which we fill using very similar plotting commands as described above

layout(matrix(1:2, ncol = 2))
plot(ord, type = "n", scaling = scl)
ordilabel(ord, display = "sites", font = 3, fill = "hotpink",
          col = "blue", priority = priSite, scaling = scl)
plot(ord, type = "n", scaling = scl)
ordilabel(ord, display = "species", font = 2,
          priority = priSpp, scaling = scl)
layout(1)
Side-by-side PCA biplots of the dune meadow data with labels added by ordilabel()

You may notice some warnings about scaling not being a graphical parameter. These are harmless and arise because we pass scaling along as part of the ... argument, which we also pass on to the plotting functions used to build the plot. We've tried hard to stop these warnings in vegan using a technique I blogged about a while back, but it looks like we missed a few of these. This will be fixed in a later version of vegan and the warnings will go away.

Next time we’ll look at orditorp().

Notes:
1Whatever “important” means…
2Not that I think this is the best way to analyse these data, it is just for show!



Decluttering ordination plots in vegan part 2: orditorp()


(This article was first published on From the bottom of the heap » R, and kindly contributed to R-bloggers)

In the earlier post in this series I looked at the ordilabel() function to help tidy up ordination biplots in vegan. An alternative function vegan provides is orditorp(), the last four letters abbreviating the words text or points. That is a pretty good description of what orditorp() does: it draws sample or species labels as text where there is room and, where there isn't, draws a plotting character instead. Essentially it boils down to a one-stop shop for calls to text() or points() as needed. Let's see how it works…

As with last time out, I’ll illustrate how orditorp() works via a PCA biplot for the Dutch dune meadow data.

## load vegan and the data
require(vegan)
data(dune)
ord <- rda(dune) ## PCA of Dune data

## species priority; which species drawn last, i.e. on top
priSpp <- diversity(dune, index = "invsimpson", MARGIN = 2)
## sample priority
priSite <- diversity(dune, index = "invsimpson", MARGIN = 1)

## scaling to use
scl <- 3

I won’t explain any of the code above; it is the same as that used in the earlier post where an explanation was also provided.

orditorp() takes an ordination object as the first argument, and in addition the display argument controls which set of scores is displayed. Note that orditorp() can only plot one set of scores at a time, which as we'll see in a minute is not exactly ideal or foolproof. Like ordilabel(), you are free to specify the importance of each sample or species via argument priority. In ordilabel() the priority controlled the plotting order such that those samples or species with high priority were plotted last (uppermost). Instead, orditorp() draws labels for samples or species (if it can) for those with the highest priority first.

So that we have something to talk about, recreate the basic sample and species biplot from the previous post, updated to use orditorp()

plot(ord, type = "n", scaling = 3)
orditorp(ord, display = "sites", priority = priSite, scaling = scl,
         col = "blue", cex = 1, pch = 19)
## You may prefer separate plots, but here species as well
orditorp(ord, display = "species", priority = priSpp, scaling = scl,
         col = "forestgreen", pch = 2, cex = 1)
PCA biplot of the Dutch dune meadow data produced using orditorp()

The behaviour of orditorp() should now be reasonably clear; labels are drawn for samples or species only if there is room to do so, with a point used otherwise. orditorp() isn't perfect by any means. Because it can only draw one set of scores at a time, there is no easy way to stop the species labels plotting over the sample labels and vice versa.

How it works: first orditorp() calculates the heights and widths of the labels and adds a bit of space to them (more on this later). It then works out whether the box given by the current label's width and height, centred on the axis score coordinates, would obscure the boxes of any labels previously drawn. If not, the label is drawn at the sample or species score coordinates; if it would obscure an existing label, a point is drawn instead. orditorp() works through the labels in order of priority, checking each new label against those already drawn.

This process isn’t infallible of course; for example the second highest priority sample or species could lie very close to the highest priority one in ordination space and if so orditorp() would not draw a label for this second highest priority sample or species because it would obscure the label of the highest priority one.

The amount of spacing or padding around each label is specified via the air argument which has a default of 1. air is interpreted as the proportion of half the label width or height that the label occupies. The default of 1 therefore means that in fact there is no additional spacing beyond the confines of the box that encloses the label. If air is greater than 1 proportionally more padding is added whilst values less than 1 indicate that labels can overlap. The figure below shows the species scores only with two values for air. In the left hand panel air = 2 is used and the labels are padded either side of the label by the entire string width or height. The right hand panel uses air = 0.5 which allows labels to overlap by up to a quarter of the string width or height in any direction from the plotting coordinate (in other words, the box that cannot be obscured when plotting subsequent labels is half the string width wide and half the string height high, centred on the plotting coordinates for the label).

layout(matrix(1:2, ncol = 2))
op <- par(mar = c(5,4,4,1) + 0.1)
## species scores with generous padding (air = 2)
plot(ord, type = "n", scaling = 3, main = expression(air == 2), cex = 1)
orditorp(ord, display = "species", priority = priSpp, scaling = scl,
         col = "forestgreen", cex = 1, pch = 2, air = 2)
## species scores with overlap allowed (air = 0.5)
plot(ord, type = "n", scaling = 3, main = expression(air == 0.5), cex = 1)
orditorp(ord, display = "species", priority = priSpp, scaling = scl,
         col = "forestgreen", pch = 2, cex = 1, air = 0.5)
par(op)
layout(1)

PCA species plot of the Dutch dune meadow data produced using orditorp() showing the effect of changing argument air.


One point that should be noted is that orditorp() doesn't stop labels and points from overlaying one another, though as the labels are drawn after the points they shouldn't get obscured too much. We could improve the situation a bit by drawing an opaque, or even partially transparent, box around each label so that it always stands out from the plotted points, although we'd then run the risk of hiding points under labels and thus hiding information from the person looking at the figure.
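
If you wanted to experiment with that idea, a minimal sketch with base graphics (an illustration, not a vegan feature) could draw a semi-transparent rectangle behind each species label before writing the text:

## Illustration only: species labels on semi-transparent white boxes, so
## points underneath remain partly visible.
plot(ord, type = "n", scaling = scl)
points(ord, display = "sites", scaling = scl, pch = 19, col = "blue")
scrs <- scores(ord, display = "species", scaling = scl)
labs <- rownames(scrs)
w <- strwidth(labs, cex = 0.8) / 2    # half label widths in user coordinates
h <- strheight(labs, cex = 0.8) / 2   # half label heights
rect(scrs[, 1] - w, scrs[, 2] - h, scrs[, 1] + w, scrs[, 2] + h,
     col = adjustcolor("white", alpha.f = 0.7), border = NA)
text(scrs[, 1], scrs[, 2], labels = labs, cex = 0.8, col = "forestgreen")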

One additional point to make is that orditorp() returns a logical vector indicating which sample or species scores were drawn with labels (TRUE) or points (FALSE), which might be useful for further plotting or adding to the diagram.
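
For instance (a hypothetical follow-up to the biplot above), you could capture that vector and list the species that were reduced to points:

plot(ord, type = "n", scaling = scl)
shown <- orditorp(ord, display = "species", priority = priSpp, scaling = scl,
                  col = "forestgreen", pch = 2)
sum(!shown)              # how many species fell back to a plain point
## assuming the returned vector carries the species names:
names(shown)[!shown]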

So there we have orditorp(). Next time I'll take a look at ordipointlabel(), which tackles the problem of producing a tidy ordination diagram in a far more complex way than either ordilabel() or orditorp().



Rcpp reaches 100 dependents on CRAN


(This article was first published on Thinking inside the box , and kindly contributed to R-bloggers)
With the arrival earlier today of the stochvol package onto the CRAN network for R, our Rcpp project reached a new milestone: 100 packages have either a Depends:, Imports: or LinkingTo: statement on it.

The full list will always be at the bottom of the CRAN page for Rcpp; I also manually edit a list on my Rcpp page. But for the record as of today, here is the current list as produced by a little helper script I keep:

 acer                apcluster           auteur             
 bcp                 bfa                 bfp                
 bifactorial         blockcluster        ccaPP              
 cda                 classify            clusteval          
 ConConPiWiFun       EpiContactTrace     fastGHQuad         
 fdaMixed            forecast            fugeR              
 GeneticTools        gMWT                gof                
 gRbase              gRim                growcurves         
 GUTS                jaatha              KernSmoothIRT      
 LaF                 maxent              mets               
 minqa               mirt                mRMRe              
 multmod             mvabund             MVB                
 NetworkAnalysis     ngspatial           oem                
 openair             orQA                parser             
 pbdBASE             pbdDMAT             phom               
 phylobase           planar              psgp               
 quadrupen           Rchemcpp            Rclusterpp         
 RcppArmadillo       RcppBDT             rcppbugs           
 RcppClassic         RcppClassicExamples RcppCNPy           
 RcppDE              RcppEigen           RcppExamples       
 RcppGSL             RcppOctave          RcppRoll           
 RcppSMC             RcppXts             rforensicbatwing   
 rgam                RInside             Rmalschains        
 Rmixmod             robustgam           robustHD           
 rococo              RProtoBuf           RQuantLib          
 RSNNS               RSofia              rugarch            
 RVowpalWabbit       SBSA                sdcMicro           
 sdcTable            simFrame            spacodiR           
 sparseHessianFD     sparseLTSEigen      SpatialTools       
 stochvol            surveillance        survSNP            
 termstrc            tmg                 transmission       
 trustOptim          unmarked            VIM                
 waffect             WideLM              wordcloud          
 zic                
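
For the curious, here is a minimal sketch (not the author's actual helper script) of one way to pull such a list from the CRAN package database:

## Count CRAN packages mentioning Rcpp in Depends, Imports or LinkingTo.
## Note: this queries CRAN as it is today, not as it was when this was written.
ap <- available.packages()
fields <- c("Depends", "Imports", "LinkingTo")
usesRcpp <- apply(ap[, fields], 1, function(x)
  any(grepl("\\bRcpp\\b", x), na.rm = TRUE))
deps <- sort(rownames(ap)[usesRcpp])
length(deps)
deps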

And not to be forgotten is BioConductor which has another 10:

 ddgraph            GeneNetworkBuilder GOSemSim          
 GRENITS            mosaics            mzR               
 pcaMethods         Rdisop             Risa              
 rTANDEM  

As developers of Rcpp, we are both proud and also a little humbled. The packages using Rcpp span everything from bringing new libraries to R, to implementing faster ways of doing things we could already do, to doing completely new things. It is an exciting time to be using R, and to be connecting R to C++, especially with so many exciting things happening in C++ right now. Follow the Rcpp links for more, and come join us on the Rcpp-devel mailing list to discuss and learn.


For descriptive statistics, values below LLOQ set to …


(This article was first published on Wiekvoet, and kindly contributed to R-bloggers)
That is what I read the other day: for the calculation of descriptive statistics, values below the LLOQ (lower limit of quantification) were set to... Then I wondered: wasn't there a trick in JAGS to incorporate the presence of censored data while estimating the parameters of a distribution? How would that compare with standard methods such as imputing 0, half the LLOQ or the LLOQ itself, or just ignoring the missing data? A simulation seems to be called for.

Bayes estimation

A bit of searching gave me a way to estimate the mean and standard deviation, here wrapped in a function so the resulting data has a nice shape.
library(R2jags)
library(plyr)
library(ggplot2)

Bsensnorm <- function(yy, limit) {
  isCensored <- is.na(yy)   # TRUE where the value fell below the LLOQ
  model1 <- function() {
    for ( i in 1:N ) {
      # dinterval trick: the indicator is 1 when yy[i] > limit (observed)
      # and 0 when yy[i] is censored, so JAGS samples the missing yy below limit
      isCensored[i] ~ dinterval( yy[i] , limit )
      yy[i] ~ dnorm( mu , tau ) 
    }
    tau <- 1/pow(sigma,2)
    sigma ~ dunif(0,100)   # uninformative prior on sigma
    mu ~ dnorm(0,1E-6)
  }
  datain <- list(yy=yy,isCensored=1-as.numeric(isCensored),N=length(yy),limit=limit)
  params <- c('mu','sigma')
  inits <- function() {
    list(mu = rnorm(1, 0, 1),      # one random starting value each
        sigma = runif(1, 0, 1))
  }
  
  jagsfit <- jags(datain, model=model1, inits=inits, 
      parameters=params,progress.bar="gui",
      n.chains=1,n.iter=7000,n.burnin=2000,n.thin=2)
  data.frame( system='Bayes Uninf',
      mu_out=jagsfit$BUGSoutput$mean$mu,
      sigma_out=jagsfit$BUGSoutput$mean$sigma)
}

Somewhat informative prior

As the original model did not perform very well at some points, I created an informative prior on the standard deviation. It is not the best prior, but it is enough to give me an idea. To defend having a prior: once a method has an LLOQ, there should be some idea about precision; otherwise, how was the LLOQ determined? Also, in the real world there may be data points with complete data, which allows estimation of the standard deviation in a more hierarchical model.
BsensInf <- function(yy,limit) {
  isCensored <- is.na(yy)
  model2 <- function() {
    for ( i in 1:N ) {
      isCensored[i] ~ dinterval( yy[i] , limit )
      yy[i] ~ dnorm( mu , tau ) 
    }
    tau <- 1/pow(sigma,2)
    sigma ~ dnorm(1,1) %_% T(0.001,)
    mu ~ dnorm(0,1E-6) 
  }
  datain <- list(yy=yy,isCensored=1-as.numeric(isCensored),N=length(yy),limit=limit)
  params <- c('mu','sigma')
  inits <- function() {
    list(mu = abs(rnorm(1, 0, 1)) + .01,   # one positive starting value
        sigma = runif(1, 0.1, 1))
  }
  
  jagsfit <- jags(datain, model=model2, inits=inits, parameters=params,progress.bar="gui",
      n.chains=1,n.iter=7000,n.burnin=2000,n.thin=2)
  data.frame( system='Bayes Inf',
      mu_out=jagsfit$BUGSoutput$mean$mu,
      sigma_out=jagsfit$BUGSoutput$mean$sigma)
}

Simple imputation

These are the simple imputations, each in a separate function.
L0sensnorm <- function(yy,limit) {
  yy[is.na(yy)] <- 0
  data.frame(system='as 0',mu_out=mean(yy),sigma_out=sd(yy))
}
Lhsensnorm <- function(yy,limit) {
  yy[is.na(yy)] <- limit/2
  data.frame(system='as half LOQ',mu_out=mean(yy),sigma_out=sd(yy))
}
Llsensnorm <- function(yy,limit) {
  yy[is.na(yy)] <- limit
  data.frame(system='as LOQ',mu_out=mean(yy),sigma_out=sd(yy))
}
LNAsensnorm <- function(yy,limit) {
  yy <- yy[!is.na(yy)]
  data.frame(system='as missing',mu_out=mean(yy),sigma_out=sd(yy))
}

Simulations

The script below is used to run the methods with some settings I chose. Basically, the LLOQ is set at 1. The SD is 1. The mean ranges from 1 to 3.5. The data are simulated from a normal distribution and cut at a level of 1. If more than half of the data is below the LLOQ, new samples are drawn until more than half of the data is above the LLOQ.
limitest <- function(yy,limit) {
  yy[yy<limit] <- NA
  do.call(rbind,list(
    Bsensnorm(yy,limit),
    BsensInf(yy,limit),
    L0sensnorm(yy,limit),
    Lhsensnorm(yy,limit),
    Llsensnorm(yy,limit), 
    LNAsensnorm(yy,limit)))
}
simtest <- function(n,muin,sdin,limit) {
  numnotna<- 0
  while(numnotna<n/2) {
    yy <- rnorm(n,muin,sdin)
    numnotna <- sum(yy>limit)
  }
  ll <- limitest(yy,limit)
  ll$n_na <- sum(yy<limit)
  ll
}
datain <- expand.grid(n=c(5,6,8,10,12,15,20,50,80),muin=seq(1,3.5,.05),sdin=1,limit=1)
dataout <- mdply(datain,simtest)

Results for means

After all simulations, the difference between the estimated mean and the simulation mean is calculated. This makes interpreting the plots much easier.
dataout$dif <- dataout$mu_out-dataout$muin

The uninformed prior did not perform well with very little data

With very little data, the standard deviation gets large and the mean is low. This may just be because there is no information stating they are not large and small respectively, so the chain simply wanders off. Anyway, reason enough to conclude that with only a few samples it is not a feasible method. Adding information about the standard deviation improves this a lot.

ggplot(dataout[dataout$n %in% c(5,6,8) & dataout$system %in% c('Bayes Uninf','Bayes Inf'),],aes(x=muin,y=dif))  + #group=Source,
    geom_smooth(method='loess') +
    geom_point() + 
    facet_grid(system ~n  , drop=TRUE) +
    scale_x_continuous('simulation mu') +
    scale_y_continuous('difference observed and simulation mu') 

At a low number of observations variation of the mean swamps bias.

The variation in the estimates is rather large. Even so, there is a visible trend: removing the missing data and imputing the LLOQ give estimates that are too high when mu is low and hence the number of missing values is high. When mu is high, all methods give the same result.

ggplot(dataout[dataout$n %in% c(5,6,8) & dataout$system!='Bayes Uninf',],aes(x=muin,y=dif))  + #group=Source,
    geom_smooth(method='loess') +
    geom_point() + 
    facet_grid(system ~n  , drop=TRUE) +
    scale_x_continuous('simulation mu') +
    scale_y_continuous('difference observed and simulation mu') 

At an intermediate number of observations method choice gets more important

Treating the values as missing and imputing the LLOQ are clearly biased. Half LLOQ and 0 do much better. At 15 observations the informed prior is fairly similar to the uninformed prior.
ggplot(dataout[dataout$n %in% c(10,12,15) & dataout$system!='Bayes xUninf',],aes(x=muin,y=dif))  + #group=Source,
    geom_smooth(method='loess') +
    geom_point() + 
    facet_grid(system ~n  , drop=TRUE) +
    scale_x_continuous('simulation mu') +
    scale_y_continuous('difference observed and simulation mu') 

At a large number of observations the bias gets far too large

At this point the missing option is not plotted any more; it is just not competitive. Setting values to the LLOQ gives a big positive bias, while 0 gives a negative bias, especially for 80 observations. At these sample sizes there is no point in setting up an informed prior. Half LLOQ seems the least biased of the simple methods.

ggplot(dataout[dataout$n  %in% c(20,50,80) & dataout$system!='as missing',],aes(x=muin,y=dif))  + #group=Source,
    geom_smooth(method='loess') +
    geom_point() + 
    facet_grid(system ~n  , drop=TRUE) +
    scale_x_continuous('simulation mu') +
    scale_y_continuous('difference observed and simulation mu') 

Standard deviation is negatively biased for simple imputation

The uninformed prior estimates are too poor to plot while retaining reasonable axes (and setting limits on y means the data outside those limits are eliminated from the smoother too, so that is not an option). Setting missing values to the LLOQ or ignoring them in particular gives estimates of the standard deviation that are too low. The informed prior gives a slightly too large standard deviation.
ggplot(dataout[dataout$n%in% c(5,6,8) & dataout$system!='Bayes Uninf',],aes(x=muin,y=sigma_out))  + #group=Source,
    geom_smooth(method='loess') +
    geom_point() + 
    facet_grid(system ~n  , drop=TRUE) +
    scale_x_continuous('simulation mu') +
    scale_y_continuous('estimated sigma') 
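
As an aside (not something done in the original post), coord_cartesian() offers a way around that axis problem: it zooms the y axis without removing points from the data handed to the smoother, so even the wildly off-scale uninformed-prior estimates could stay in their panels.

## Zoom the y axis without dropping data from the loess fit, keeping the
## 'Bayes Uninf' panels in the plot; the cut-off at 5 is an arbitrary choice.
ggplot(dataout[dataout$n %in% c(5, 6, 8), ], aes(x = muin, y = sigma_out)) +
    geom_smooth(method = 'loess') +
    geom_point() +
    facet_grid(system ~ n, drop = TRUE) +
    scale_x_continuous('simulation mu') +
    scale_y_continuous('estimated sigma') +
    coord_cartesian(ylim = c(0, 5))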

ggplot(dataout[dataout$n%in% c(10,12,15) & dataout$system!='Bayes Uninf',],aes(x=muin,y=sigma_out))  + #group=Source,
    geom_smooth(method='loess') +
    geom_point() + 
    facet_grid(system ~n  , drop=TRUE) +
    scale_x_continuous('simulation mu') +
    scale_y_continuous('estimated sigma') 
With a large number of observations half LLOQ results in a negative bias on the standard deviation.

ggplot(dataout[dataout$n %in% c(20,50,80) & dataout$system!='as missing',],aes(x=muin,y=sigma_out))  + #group=Source,
    geom_smooth(method='loess') +
    geom_point() + 
    facet_grid(system ~n  , drop=TRUE) +
    scale_x_continuous('simulation mu') +
    scale_y_continuous('estimated sigma') 

Conclusion

At a low number of observations, using a Bayesian model with a good prior on the standard deviation is the best way to obtain descriptive statistics. Lacking that, setting <LLOQ values to 1/2 LLOQ for estimating the mean and to 0 for estimating the standard deviation seems best. But some care and more extensive simulations for the problem at hand are needed.
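
For a compact numerical backup of this conclusion, one could average the simulation results per method and sample size; a small sketch using the plyr package that is already loaded:

## Mean bias of mu and mean estimated sigma per method and sample size
bias_summary <- ddply(dataout, .(system, n), summarise,
                      mean_bias  = mean(dif),
                      mean_sigma = mean(sigma_out))
bias_summary[bias_summary$n == 8, ]   # e.g. inspect the n = 8 results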

Finally, reporting data as <LLOQ has implications. As a statistician, I think using <LLOQ to display data in listings and tables is a good thing. However, when statisticians need to do subsequent calculations, such values may cause more trouble than they resolve. I have seen a simple PCA go down the drain because of <LLOQ values, which is, truth be told, a shame given the effort that went into getting the data.
