Channel: Search Results for “pca” – R-bloggers

Clusters of Texts


(This article was first published on R-english – Freakonometrics, and kindly contributed to R-bloggers)

Another popular application of classification techniques is text mining (see e.g. an old post on French presidents' speeches). Consider the following example, inspired by Norbert Ryciak's post, with 12 Wikipedia pages on various topics,

> library(tm)
> library(stringi)
> library(proxy)
> titles = c("Boosting_(machine_learning)",
+            "Random_forest",
+            "K-nearest_neighbors_algorithm",
+            "Logistic_regression",
+            "Boston_Bruins",
+            "Los_Angeles_Lakers",
+            "Game_of_Thrones",
+            "House_of_Cards_(U.S._TV_series)",
+            "True_Detective_(TV_series)",
+            "Picasso",
+            "Henri_Matisse",
+            "Jackson_Pollock")
> wiki = "https://en.wikipedia.org/wiki/"
> articles = character(length(titles))
> for (i in 1:length(titles)) {
+   articles[i] = stri_flatten(readLines(stri_paste(wiki, titles[i])), col = " ")
+ }

Here, we store all the contents of the pages in a corpus (from the text mining package).

> docs = Corpus(VectorSource(articles))

This is what we have in that corpus

> a = stri_flatten(readLines(stri_paste(wiki, titles[1])), col = " ")
> a = Corpus(VectorSource(a))
> a[[1]]

Thoughts on Hypothesis Boosting</i></a>, Unpublished manuscript (Machine Learning class project, December 1988)</span></li> <li id="cite_note-4"><span class="mw-cite-backlink"><b><a href="#cite_ref-4">^</a></b></span> <span class="reference-text"><cite class="citation journal"><a href="/wiki/Michael_Kearns" title="Michael Kearns">Michael Kearns</a>; <a href="/wiki/Leslie_Valiant" title="Leslie Valiant">Leslie Valiant</a> (1989). <a rel="nofollow" class="external text" href="http://dl.acm.org/citation.cfm?id=73049">"Crytographic limitations on learning Boolean formulae and finite automata"</a>. <i>Symposium on T

This is because we read an HTML page.

> a = tm_map(a, function(x) stri_replace_all_regex(x, "<.+?>", " "))
> a = tm_map(a, function(x) stri_replace_all_fixed(x, "\t", " "))
> a = tm_map(a, PlainTextDocument)
> a = tm_map(a, stripWhitespace)
> a = tm_map(a, removeWords, stopwords("english"))
> a = tm_map(a, removePunctuation)
> a = tm_map(a, tolower)
> a 

can  set  weak learners create  single strong learner  a weak learner  defined    classifier    slightly correlated   true classification  can label examples better  random guessing in contrast  strong learner   classifier   arbitrarily wellcorrelated   true classification robert 

Now we have the text of the Wikipedia document. What we did was:

  • replace all “<…>” elements (i.e. HTML tags) with a space, because they are not part of the text document but HTML code.
  • replace all “\t” (tab) characters with a space.
  • convert the previous result (a character string) to a “PlainTextDocument”, so that we can apply the other functions from the tm package, which require this type of argument.
  • remove extra whitespaces from the documents.
  • remove punctuation marks.
  • remove from the documents words which we find redundant for text mining (e.g. pronouns, conjunctions). We use stopwords(“english”), a built-in list for the English language (this argument is passed to the function removeWords).
  • transform characters to lower case.

Now we can do it on the entire corpus

> docs2 = tm_map(docs, function(x) stri_replace_all_regex(x, "<.+?>", " "))
> docs3 = tm_map(docs2, function(x) stri_replace_all_fixed(x, "\t", " "))
> docs4 = tm_map(docs3, PlainTextDocument)
> docs5 = tm_map(docs4, stripWhitespace)
> docs6 = tm_map(docs5, removeWords, stopwords("english"))
> docs7 = tm_map(docs6, removePunctuation)
> docs8 = tm_map(docs7, tolower)

Now, we simply count words in each page,

> dtm <- DocumentTermMatrix(docs8)
> dtm2 <- as.matrix(dtm)
> dim(dtm2)
[1] 12 13683
> frequency <- colSums(dtm2)
> frequency <- sort(frequency, decreasing=TRUE)
> mots=frequency[frequency>20]
> s=dtm2[1,which(colnames(dtm2) %in% names(mots))]
> for(i in 2:nrow(dtm2)) s=cbind(s,dtm2[i,which(colnames(dtm2) %in% names(mots))])
> colnames(s)=titles

 

Once we have that dataset, we can use a PCA to visualise the ‘variables’, i.e. the pages

> library(FactoMineR)
> PCA(s)

We can also use non-supervised classification to group pages. But first, let us normalize the dataset

> s0=s/apply(s,1,sd)

Then, we can build a cluster dendrogram, using Ward's method

> h <- hclust(dist(t(s0)), method = "ward")
> plot(h, labels = titles, sub = "")

Groups are consistent with intuition: painters are in the same cluster, as well as TV series, sports teams, and statistical techniques.
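As a side note, the cluster memberships can also be extracted programmatically with cutree(); a minimal sketch, assuming we cut the dendrogram into the four thematic groups mentioned above:

> groups = cutree(h, k = 4)
> split(titles, groups)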


Clustering French Cities (based on Temperatures)

$
0
0

(This article was first published on R-english – Freakonometrics, and kindly contributed to R-bloggers)

In order to illustrate hierarchical clustering techniques and k-means, I borrowed François Husson's dataset, with monthly average temperatures in several French cities.

> temp=read.table(
+ "http://freakonometrics.free.fr/FR_temp.txt",
+ header=TRUE,dec=",")

We have 15 cities, with monthly observations

> X=temp[,1:12]
> boxplot(X)

Since the variance seems to be rather stable, we will not ‘normalize’ the variables here,

> apply(X,2,sd)
    Janv     Fevr     Mars     Avri 
2.007296 1.868409 1.529083 1.414820 
     Mai     Juin     juil     Aout 
1.504596 1.793507 2.128939 2.011988 
    Sept     Octo     Nove     Dece 
1.848114 1.829988 1.803753 1.958449

In order to get a hierarchical cluster analysis, use for instance

> h <- hclust(dist(X), method = "ward")
> plot(h, labels = rownames(X), sub = "")

An alternative is to use

> library(FactoMineR)
> h2=HCPC(X)
> plot(h2)

Here, we visualise the observations with a principal component analysis. The number of classes is also selected automatically, here 3. We can get the description of the groups using

> h2$desc.ind

or directly

> cah=hclust(dist(X))
> groups.3 <- cutree(cah,3)

We can also visualise those classes by ourselves,

> acp=PCA(X,scale.unit=FALSE)
> plot(acp$ind$coord[,1:2],col="white")
> text(acp$ind$coord[,1],acp$ind$coord[,2],
+ rownames(acp$ind$coord),col=groups.3)

It is possible to plot the centroids of those clusters

> PT=aggregate(acp$ind$coord,list(groups.3),mean)
> points(PT$Dim.1,PT$Dim.2,pch=19)

If we add Voronoi cells around those centroids, we do not really see them here (actually, we see the point, in the middle, that is exactly at the intersection of the three regions),

> library(tripack)
> V <- voronoi.mosaic(PT$Dim.1,PT$Dim.2)
> plot(V,add=TRUE)

To visualize those regions, use

> p=function(x,y){
+   which.min((PT$Dim.1-x)^2+(PT$Dim.2-y)^2)
+ }
> vx=seq(-10,12,length=251)
> vy=seq(-6,8,length=251)
> z=outer(vx,vy,Vectorize(p))
> image(vx,vy,z,col=c(rgb(1,0,0,.2),
+ rgb(0,1,0,.2),rgb(0,0,1,.2)))
> CL=c("red","black","blue")
> text(acp$ind$coord[,1],acp$ind$coord[,2],
+ rownames(acp$ind$coord),col=CL[groups.3])

Actually, those three groups (and those three regions) are also the ones we obtain using a k-means algorithm,

> km=kmeans(acp$ind$coord[,1:2],3)
> km
K-means clustering 
with 3 clusters of sizes 3, 7, 5

(etc.) But since we have spatial data here, it is also possible to visualize those clusters on a map

> library(maps)
> map("france")
> points(temp$Long,temp$Lati,col=groups.3,pch=19)

or, to visualize the regions, use e.g.

> library(car)
> for(i in 1:3) 
+ dataEllipse(temp$Long[groups.3==i],
+ temp$Lati[groups.3==i], levels=.7,add=TRUE,
+ col=i+1,fill=TRUE)

Those three regions actually make sense, geographically speaking.


Large scale eigenvalue decomposition and SVD with rARPACK


(This article was first published on Yixuan's Blog - R, and kindly contributed to R-bloggers)

In January 2016, I was honored to receive an “Honorable Mention” of the
John Chambers Award 2016.

A Short Story of rARPACK

Eigenvalue decomposition is a commonly used technique in
numerous statistical problems. For example, principal component analysis (PCA)
basically conducts eigenvalue decomposition on the sample covariance of a data
matrix: the eigenvalues are the component variances, and eigenvectors are the
variable loadings.
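To make this link concrete, here is a small base-R sketch (the random matrix is just an illustrative choice) showing that the eigenvalues and eigenvectors of the sample covariance matrix match the component variances and loadings returned by prcomp():

set.seed(42)
x = matrix(rnorm(200 * 5), 200, 5)

## Eigen decomposition of the sample covariance vs. prcomp()
e = eigen(cov(x))
p = prcomp(x)

## Eigenvalues are the component variances
all.equal(e$values, p$sdev^2)

## Eigenvectors are the variable loadings (up to a sign flip per column)
all.equal(abs(e$vectors), abs(unname(p$rotation)))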

In R, the standard way to compute eigenvalues is the eigen() function.
However, when the matrix becomes large, eigen() can be very time-consuming:
the complexity of calculating all eigenvalues of an $n \times n$ matrix is
$O(n^3)$.

In real applications, however, we usually only need to compute a few
eigenvalues or eigenvectors; for example, to visualize high-dimensional
data using PCA, we may only use the first two or three components to draw
a scatterplot. Unfortunately, eigen() has no option to limit the
number of eigenvalues to be computed. This means that we always need to do the
full eigen decomposition, which can be a huge waste of computation.

And this is why the rARPACK
package was developed. As the name indicates,
rARPACK was originally an R wrapper of the
ARPACK library, a FORTRAN package
used to calculate a few eigenvalues of a square matrix. However,
ARPACK has not been under development for a long time, and it has some compatibility
issues with the current version of LAPACK. Therefore, to keep rARPACK in a
good state, I wrote a new backend for it: the C++ library
Spectra.

The name of rARPACK was POORLY designed, I admit. Starting from version
0.8-0, rARPACK no longer relies on ARPACK, but due to CRAN policies and
reverse dependencies, I have to keep using the old name.

Features and Usage

The usage of rARPACK is simple. If you want to calculate some eigenvalues
of a square matrix A, just call the function eigs() and tell it how many
eigenvalues you want (argument k) and which eigenvalues to calculate
(argument which). By default, which = "LM" picks the eigenvalues
with the largest magnitude (modulus for complex numbers and absolute value
for real numbers). If the matrix is known to be symmetric, calling
eigs_sym() is preferred since it guarantees that the eigenvalues are real.

library(rARPACK)
set.seed(123)
## Some random data
x = matrix(rnorm(1000 * 100), 1000)
## If retvec == FALSE, we don't calculate eigenvectors
eigs_sym(cov(x), k = 5, which = "LM", opts = list(retvec = FALSE))

For really large data, the matrix is usually in sparse form. rARPACK
supports several sparse matrix types defined in the Matrix
package, and you can even pass an implicit matrix defined by a function to
eigs(). See ?rARPACK::eigs for details.

library(Matrix)
spmat = as(cov(x), "dgCMatrix")
eigs_sym(spmat, 2)

## Implicitly define the matrix by a function that calculates A %*% x
## Below represents a diagonal matrix diag(c(1:10))
fmat = function(x, args)
{
    return(x * (1:10))
}
eigs_sym(fmat, 3, n = 10, args = NULL)

From Eigenvalue to SVD

An extension to eigenvalue decomposition is the singular value decomposition
(SVD), which works for general rectangular matrices. Still take PCA as
an example. To calculate variable loadings, we can perform an SVD on the
centered data matrix, and the loadings will be contained in the right singular
vectors. This method avoids computing the covariance matrix, and is generally
more stable and accurate than using cov() and eigen().

Similar to eigs(), rARPACK provides the function svds() to conduct
partial SVD, meaning that only part of the singular pairs (values and vectors)
are to be computed. Below is an example that computes the first three PCs
of a 2000×500 matrix, comparing the timings of three different approaches:

library(microbenchmark)
set.seed(123)
## Some random data
x = matrix(rnorm(2000 * 500), 2000)
pc = function(x, k)
{
    ## First center data
    xc = scale(x, center = TRUE, scale = FALSE)
    ## Partial SVD
    decomp = svds(xc, k, nu = 0, nv = k)
    return(list(loadings = decomp$v, scores = xc %*% decomp$v))
}
microbenchmark(princomp(x), prcomp(x), pc(x, 3), times = 5)

The princomp() and prcomp() functions are the standard approaches in R
to do PCA, which will call eigen() and svd() respectively.
On my machine (Fedora Linux 23, R 3.2.3 with optimized single-threaded
OpenBLAS), the timing results are as follows:

Unit: milliseconds
        expr      min       lq     mean   median       uq      max neval
 princomp(x) 274.7621 276.1187 304.3067 288.7990 289.5324 392.3211     5
   prcomp(x) 306.4675 391.9723 408.9141 396.8029 397.3183 552.0093     5
    pc(x, 3) 162.2127 163.0465 188.3369 163.3839 186.1554 266.8859     5

Applications

SVD has some interesting applications, and one of them is image compression.
The basic idea is to perform a partial SVD on the image matrix, and then recover
it using the calculated singular values and singular vectors.
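A minimal sketch of this idea follows; the grayscale matrix img below is a random stand-in for the real image (with the true 622×1000 image, k singular pairs require storing k*(622 + 1000 + 1) numbers, i.e. 8115 for k = 5):

library(rARPACK)

## Stand-in for the real image: a 622 x 1000 matrix of grayscale intensities
img = matrix(runif(622 * 1000), 622, 1000)

## Rank-k reconstruction from the first k singular pairs
compress = function(img, k)
{
    s = svds(img, k)
    s$u %*% diag(s$d, nrow = k) %*% t(s$v)
}

img5  = compress(img, 5)    # stores 5 * (622 + 1000 + 1) = 8115 numbers
img50 = compress(img, 50)   # visually much closer to the original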

Below is an image of size 622×1000:

(Original image)

If we use the first five singular pairs to recover the image,
then we need to store 8115 elements, which is only 1.3% of the original data
size. The recovered image will look like below:

(5 singular pairs)

Even though the recovered image is quite blurred, it already reveals the main
structure of the original image. And if we increase the number of singular pairs
to 50, the difference becomes almost imperceptible, as shown below.

(50 singular pairs)

There is also a nice Shiny App
developed by Nan Xiao,
Yihui Xie and Tong He that
allows users to upload an image and visualize the effect of compression using
this algorithm. The code is available on
GitHub.

Performance

Finally, I would like to use some benchmark results to show the
performance of rARPACK. As far as I know, there are very few packages
available in R that can do the partial eigenvalue decomposition, so the results
here are based on partial SVD.

The first plot compares different SVD functions on a 1000×500 matrix,
with dense format on the left panel, and sparse format on the right.

The second plot shows the results on a 5000×2500 matrix.

The functions used corresponding to the axis labels are as follows:

  • svd: svd() from base R, which computes the full SVD
  • irlba: irlba() from irlba
    package, partial SVD
  • propack, trlan: propack.svd() and trlan.svd() from
    svd package, partial SVD
  • svds: svds() from rARPACK

The code for benchmark and the environment to run the code can be
found here.

This article was written for R-bloggers,
whose builder, Tal Galili, kindly invited me
to write an introduction to the rARPACK package.


Nairobi Data Science Meet Up: Finding deep structures in data with Chris Orwa


(This article was first published on R – Data Science Africa, and kindly contributed to R-bloggers)

I sat down with a former rugby school captain whose rugby career was cut short by a shoulder injury while playing for Black Blad at Kenyatta University. It is always a great pleasure to talk to someone who is extremely passionate about what he does, and his passion for Data Science was evident during my chat with “BlackOrwa” at the iHub Nairobi offices. He is the Data Lab Manager at iHub and in another life he would have been a military man. He avoids over-hyped movies (he has not watched any of the Star Wars films) and is more inclined to movies set in one location like “Phone Booth”, “Identity” and “Buried”. Read on to know more about his startup Kwetha, IBM's Watson and Bluemix, parallel processing, using GPUs, PCA, Mantel's test, among other things.

You have a very interesting mantra, “Nerdistic intent with delusions of grandeur”, tell us more about it.

This came about several years ago. I cofounded a data mining startup called Kwetha (means to find). My partner was the business guy and I was the crazy guy coming up with ideas and he used to say I am deluded and I was like yeah, but the ideas work, right? At the same time while watching music on YouTube, I came across a Kenyan rock band known as “Narcissistic Tendencies with Delusions of Grandeur” and being the nerd that I am, I flipped the phrase to simply explain what I do – mad scientist kind of feel.

Your blog is also quite interesting. What is the motivation behind it?

After campus I did not have a proper CV. I thought of starting a blog to showcase my data mining skills in addition to documenting my ideas and experiences. It was later on when I wrote a blog post on ‘Breaking Safaricom Scratch Card Code’ (more on this below) that I focused on data science posts. It was insane, I got about 2500 mentions on Twitter and the blog stats were on steroids – you can read about this here. I guess this was me living up to my mantra.

Data science is a relatively unknown field for most Kenyans. How did you venture into the field and what motivated you?

That is kind of an interesting story. At Kenyatta University, 4th year BSc Computer Science students are required to undertake specialization units. After checking the department’s notice board I opted for Advanced Artificial Intelligence. At the time, I used to play a lot of computer games and did a lot of computer modeling and I thought maybe if I did this course, then I could figure out how to make intelligent characters for my games. So I signed up for the class. Funny enough only 6 people signed up. So when we started the class, the lecturer said we were going to focus on data mining and initially I thought to myself, this not what I anticipated. At the time, I did not know what data mining was all about so I felt a bit let down. The lecturer had just come from the U.S and he explained how he used his data mining knowledge to predict stock prices and I was like, ‘so you can actually do that’. In short, I felt it was very interesting and started learning more about data mining algorithms. So after the course I knew I wanted to explore this more.

Tell us about your first Data science project.

My first data science project was studying consumer spending patterns on mobile phones based on reverse engineering scratch card serial numbers. I would go to town to Safaricom scratch card vendors and collect used cards from the bowls they kept to discard used cards. I would then manually enter the scratch card data into a spreadsheet and run various analyses to explore patterns and correlations. Later on, I came to iHub when the research arm was just being set up and met Angela Okune who was running a workshop on research methodologies. I pitched the idea of studying consumer spending patterns from scratch card data. She thought it was interesting and promised to seek funding for the project. I couldn’t wait for funding so I partnered with a friend Elvis Bando to push the analysis further. He cracked the serial number which made it possible to track how many scratch cards were produced, zonal spending patterns and profit projections. We wrote a blogpost about it and asked ourselves, what more could we do?

Tell us more about the work you do at iHub, what a typical day looks like and the major tools you use to do Data Science.

I started off as a consultant, later became a full-time data scientist, and I am currently the Data Lab manager. My work involves writing code, developing analysis methodologies, managing people and looking for business opportunities where Data Science can be used to solve problems.

The whole iHub ethos is about open communities and connecting people. In line with that, we embrace open source tools, for example our servers run Linux, our core analysis languages are R and Python, primarily because they are good and open source makes our projects accessible to more people.

iHub is a for-profit organization and we have a wide range of clients. Essentially we solve problems for our clients. One of the projects we did, during the elections, involved identifying newsworthy information on Twitter. Currently, our most interesting project is code-named “Umati” – it entails tracking hate speech online. Research done by an American professor came up with a 7-point methodology to identify offline hate speech. We are combining different Machine Learning methods to fit her methodology to the online world. The translation process poses challenging questions such as ‘how to measure intent’, ‘how to measure influence’, ‘how do we measure susceptibility’ et cetera. So we experimented a lot with sentiment analysis, subjectivity analysis, topic modelling, network analysis, classification and clustering, among others, to develop a perfect tool for identifying hate speech. These tools are open source and can be found here.

There are a lot of exciting things happening in the DS field like recently Google open sourced its ML system TensorFlow. What else has captured your eye?

At the bleeding edge I would say IBM's Watson, which has primarily been good at beating people at chess. I am looking forward to applications in health, transport and infrastructure. IBM also launched Bluemix, which is a cloud infrastructure platform similar to Microsoft's Azure, but they have added ML and data mining capabilities. So if you want to build an application to, say, track traffic, you don't have to worry about which algorithm to use or what the implementation factors are. It's a drag-and-drop setup where you have the data source and the problem and you just plug this in and it solves the problem for you.

Microsoft has also bought Revolution Analytics and is building a much better version of R. They are adding multi-threading capabilities to R and aim to make it easier to do data mining work. It is sort of like Dreamweaver, where you design a website and it generates code for you. Similarly here, you have the logical process on one side and on the other it generates R code, spins up servers and does most of the back-end processes. Good news for those who are afraid of R.

Something on Parallel Processing, Hadoop and GPUs

Hadoop is good but you can also use GPUs to do complex simulations, calculations, image manipulations etc. We do have GPUs here and we are finding ways to incorporate them in our analysis processes. A good example is when we were removing closely similar tweets from a corpus of 2.5 million tweets. This involves measuring each tweet against the rest while performing a ‘similarity’ test. The computer has to perform 6.25 trillion computations to exhaust the whole dataset, which might take months if not years. GPUs offer a higher level of parallelism than Hadoop, enabling more than 1 million computations per second and thereby reducing our problem to a 30-minute wait.

Let’s channel back to your presentation on finding deep structures in data. Why do we need to find deep structures in data and how do we go about it?

Clients normally give you data and require you to come up with interesting patterns and relationships. Standardization of methods and tools limits how much information can be squeezed from a dataset. So I was interested in deep structures to uncover non-obvious relationships between data points. I thought this would be useful when exploring questionnaire data to identify dynamic patterns. The presentation focuses on combining different statistical concepts to reveal deep structures. Read how he solved this here.

Any reading assignments before the presentation?

More Reference Material

iHub Data Lab Work

Background

Data Science Meet-ups

Summer Data Jam

The post Nairobi Data Science Meet Up: Finding deep structures in data with Chris Orwa appeared first on Data Science Africa.


Principal Component Analysis using R


(This article was first published on Data Perspective, and kindly contributed to R-bloggers)
Curse of Dimensionality:
One of the most common problems in data analytics applications such as recommendation engines and text analytics is high-dimensional, sparse data. We often face situations where we have a large set of features and relatively few data points, or data with very long feature vectors. In such scenarios, fitting a model to the dataset results in a model with low predictive power. This is often termed the curse of dimensionality. In general, adding more data points or shrinking the feature space, also known as dimensionality reduction, reduces the effects of the curse of dimensionality.
In this blog post, we will discuss principal component analysis, a popular dimensionality reduction technique. PCA is a useful statistical method that has found application in a variety of fields and is a common technique for finding patterns in high-dimensional data.

Principal component analysis:

Consider the following scenario:
The data we want to work with is a matrix A of dimension m×n, shown below, where Ai,j represents the value of the i-th observation of the j-th variable.

Thus the matrix can be viewed as m observations (rows), each of which is an n-dimensional vector of variable values. If n is very large, it is often desirable to reduce the number of variables to a smaller number, say k variables as in the image below, while losing as little information as possible.

Mathematically speaking, PCA is a linear orthogonal transformation that transforms the data to a new coordinate system such that the greatest variance by any projection of the data comes to lie on the first coordinate (called the first principal component), the second greatest variance on the second coordinate, and so on.

When applied, the algorithm linearly transforms the n-dimensional input space into a k-dimensional (k < n) output space, with the objective of minimizing the amount of information/variance lost by discarding the remaining (n - k) dimensions. PCA allows us to discard the variables/features that have less variance.
Technically speaking, PCA uses an orthogonal projection of highly correlated variables onto a set of values of linearly uncorrelated variables called principal components. The number of principal components is less than or equal to the number of original variables. The transformation is defined in such a way that the first principal component has the largest possible variance, accounting for as much of the variability in the data as possible. Each succeeding component in turn has the highest possible variance subject to being orthogonal to the preceding components.

In the above image, u1 & u2 are principal components wherein u1 accounts for highest variance in the dataset and u2 accounts for next highest variance and is orthogonal to u1.

PCA implementation in R:

For today's post we use the crimtab dataset available in R: data on 3000 male criminals over 20 years old undergoing their sentences in the chief prisons of England and Wales. The 42 row names ("9.4", "9.5", …) correspond to midpoints of intervals of finger lengths, whereas the 22 column names ("142.24", "144.78", …) correspond to (body) heights of the 3000 criminals; see also below.

head(crimtab)
142.24 144.78 147.32 149.86 152.4 154.94 157.48 160.02 162.56 165.1 167.64 170.18 172.72 175.26 177.8 180.34
9.4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
9.5 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0
9.6 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
9.7 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
9.8 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0
9.9 0 0 1 0 1 0 1 0 0 0 0 0 0 0 0 0
182.88 185.42 187.96 190.5 193.04 195.58
9.4 0 0 0 0 0 0
9.5 0 0 0 0 0 0
9.6 0 0 0 0 0 0
9.7 0 0 0 0 0 0
9.8 0 0 0 0 0 0
9.9 0 0 0 0 0 0
dim(crimtab)
[1] 42 22
str(crimtab)
'table' int [1:42, 1:22] 0 0 0 0 0 0 1 0 0 0 ...
- attr(*, "dimnames")=List of 2
..$ : chr [1:42] "9.4" "9.5" "9.6" "9.7" ...
..$ : chr [1:22] "142.24" "144.78" "147.32" "149.86" ...

sum(crimtab)
[1] 3000

colnames(crimtab)
[1] "142.24" "144.78" "147.32" "149.86" "152.4" "154.94" "157.48" "160.02" "162.56" "165.1" "167.64" "170.18" "172.72" "175.26" "177.8" "180.34"
[17] "182.88" "185.42" "187.96" "190.5" "193.04" "195.58"

Let us use apply() on the crimtab dataset column wise to calculate the variance and see how each variable varies.

apply(crimtab,2,var)

We observe that the column "165.1" contains the maximum variance in the data. We now apply PCA using prcomp().

pca = prcomp(crimtab)

pca


Note: the components of the pca object returned by the above code are the standard deviations and the rotation. From the standard deviations we can observe that the first principal component explains most of the variation, followed by the other components. Rotation contains the principal component loadings matrix, which gives the contribution of each variable along each principal component.

Let's plot all the principal components and see how much variance is accounted for by each component.

par(mar = rep(2, 4))
plot(pca)

Clearly the first principal component accounts for most of the information.
Let us interpret the results of the PCA using a biplot. A biplot shows the contribution of each variable along the two principal components.

# The code below flips the signs to change the direction of the biplot; without these two lines the plot would be a mirror image of the one shown below.
pca$rotation=-pca$rotation
pca$x=-pca$x
biplot (pca , scale =0)

The output of the preceding code is as follows:

In the preceding image, known as a biplot, we can see the two principal components (PC1 and PC2) of the crimtab dataset. The red arrows represent the loading vectors, which show how the feature space varies along the principal component directions.
From the plot, we can see that the first principal component, PC1, places roughly equal weight on three features: 165.1, 167.64, and 170.18. This means that these three features are more correlated with each other than with the 160.02 and 162.56 features.
The second principal component, PC2, places more weight on 160.02 and 162.56 than on the three features 165.1, 167.64, and 170.18, which are less correlated with them.
Complete Code for PCA implementation in R:


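Assembled from the snippets shown earlier, the complete script looks roughly as follows (a consolidated sketch rather than the original listing):

## crimtab: built-in table of finger lengths vs. body heights of 3000 criminals
data(crimtab)
dim(crimtab)
apply(crimtab, 2, var)

## Run PCA and inspect the standard deviations and loadings
pca = prcomp(crimtab)
pca
par(mar = rep(2, 4))
plot(pca)

## Flip the signs so the biplot is not mirrored, then draw it
pca$rotation = -pca$rotation
pca$x = -pca$x
biplot(pca, scale = 0)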


So by now we understand how to run PCA and how to interpret the principal components. Where do we go from here? How do we use the reduced-variable dataset? In our next post we shall answer these questions.


Nairobi Data Science Meetup: Paradigm Shift in Research with Samuel Kamande


(This article was first published on R – Data Science Africa, and kindly contributed to R-bloggers)

Samuel Kamande is a Data Scientist at Nielsen and his presentation will focus on “Paradigm Shift in Research”.
We caught up with him and he shared a lot about his work at Nielsen, some of the projects he has worked on like the "Digital Divide project in Trinidad and Tobago in 2013", thoughts on the future of Data Science and something on Baidu's deep-learning system, among other things.

We’d like to hear your story of how you got into data science. What motivated you to work in data science?
I am a statistician by training (MSc. Statistics, University of Nairobi). One person quipped that a Data Scientist is a statistician living in San Francisco or using a mac – By those definitions, I guess I am still a statistician. Transforming data into information, knowledge and insights has been the key source of motivation for me. In my 3 years of working with data, I have seen various clients across various industries and disciplines as well as internal management teams make important and inflectional decisions based on the insights deduced from small and large data sets. It has been fulfilling and has kept me going – solving problems, identifying opportunities, predicting the future in the midst of uncertainty – just to name a few. I have had the pleasure of working in a few positions that have exposed me to both practical statistics and programming, necessitated by the huge amounts of data involved. That has also constantly maintained my zeal. In the wide field that is Data Science, there is always something new to learn and try out every day.

How does a typical day as a Data Scientist at Nielsen look like? What tools and algorithms do you use often and which are your favorite?
A typical day for me involves supporting the Client Service teams in the provision of technical consultation to solve Data Science related client queries. It also involves provision of technical support on product enhancement and improvement for various clients in Africa. In addition to the in-house platforms, VBA, R, SAS and (increasingly) Python are often used. R is my personal favorite tool. I have over time delved deeper into its amazing capabilities and I am still exploring, ever since I ran those 15 lines of time-series modeling code in sophomore year.

How did you acquire these skills?
Aside from my university education, I have had to constantly work on my skills through various projects. I had the opportunity to work with amazing programmers on huge projects at mSurvey. These projects needed a lot of statistical rigor and algorithms. Additionally, my current job provides the opportunity to work with a very talented pool of Data Scientists from across the globe to solve problems, and I have done my best to leverage that. There is also freedom to explore, innovate and self-disrupt. Needless to mention, various online courses and a lot of practice have been and still are paramount for me.

What is the most interesting Data science project you have participated in?
There are many, but my most interesting would have to be the Digital Divide project in Trinidad and Tobago in 2013. To measure the Digital Divide in Trinidad and Tobago we administered the first survey of its magnitude on mobile, with inbuilt filters to ensure representativeness and efficiency. We further automated the calculation of the Digital Divide Indices to give real-time visibility to the stakeholders. I was still a very raw statistician, and this gave me exposure both to the design of big research studies and to working with algorithms and various tools.

Data science is very hot at the moment. The field has received considerable media attention lately. How in your opinion has data science and big data changed the world?
Decisions across companies and governments are increasingly being made based on data and not only gut intuitions, experience notwithstanding. That for me is the biggest stride. It cannot be said enough that we have so much data with us, and it is only right that we use it to better the world. Data Scientists are changing the world. Algorithm by algorithm, model by model.

There are a lot of exciting things happening in the field, like Google open-sourcing its TensorFlow machine learning library, as have other companies like IBM. What is exciting you most at the moment in the field? What problems look most promising to be solved using ML?
Personally, two fronts really excite me; one is around the accuracy of ad targeting, seeing as sufficient data is already available. The second is around the ability to better predict consumer behavior based on data from across platforms. For these, we'll also see the move to more prescriptive outputs, more recommendations from the data. There are other fascinating fronts as well, like Baidu's deep-learning system that could rival people at speech recognition.

With all this attention and interest, 5 years from now, how do you think the DS field WILL look like?
SEXIER (quoting Hal Varian, Chief Economist at Google). We'll take over the world. Seriously though, I think every single critical decision across fields and countries will be made based on data. And that will need Data Scientists. The likes of DJ Patil already set this in motion, and we'll reap generously from it going forward.
Has data science impacted your day to day life?
Yes – I now tend to approach problems differently due to the numerous tools at my disposal. The experience accrued from solving problems also ensures that the next problem is solved more efficiently, thus freeing up more time for innovation and skill improvement. Google, through their simple ML algorithms, have also made my life a bit easier.

We are very excited about your presentation. May I ask, what can we expect on 3rd March?
Coming from a statistical background, and still working in a heavily statistical environment, my main interest right now is the evident paradigm shift from traditional sample research and structured data, to an integrated approach of that and big data. Being the first meet-up, I will present my perspective around this, and then open it up for other points of view from across industries. This will provide the basis of the tangent on which we’ll take the meet-ups thereafter.

My high school teacher always gave us reading assignments before her next class. Any homework for those that plan to attend to ensure the meeting is very interactive?
I would like attendees to think about the various topics and areas they would like to leverage from the Data Science meet-ups, so that the organizers source speakers with this in mind going forward. The discussions will get more technical, and we'll have various subject matter experts making presentations. That is purely contingent on the recommendations of the attendees.

If you could give 1 piece of advice to your younger self about Machine Learning, what would you tell him?
Extend the statistical modeling concepts (regression, classification, PCA/FA, outlier detection etc.) to bigger data sets and use ML algorithms. I would not have to do it in retrospect like I am doing now. Then, the knowledge was raw and I had more time on my hands.

Advice to anyone who is interested in Data science?
Learn something new every day – statistics and programming. It does not just end there, put it into practice. There is sufficient material online (Coursera, Lynda etc).

Have you watched Star Wars?
Yes. The Force Awakens was my first. Hopefully I am not too late to the Star Wars party.

The post Nairobi Data Science Meetup: Paradigm Shift in Research with Samuel Kamande appeared first on Data Science Africa.


Are you doing parallel computations in R? Then use BiocParallel


(This article was first published on Fellgernon Bit - rstats, and kindly contributed to R-bloggers)

It's the morning of the first day of oral sessions at #ENAR2016. I feel like I have a spidey sense since I woke up 3 min after an email from Jeff Leek; just a funny coincidence. Anyhow, I promised Valerie Obenchain at #Bioc2014 that I would write a post about one of my favorite Bioconductor packages: BiocParallel (Morgan, Obenchain, Lang, and Thompson, 2016). By now it's in the top 5% of downloaded Bioconductor packages, so many people know about it, or use it without realizing it because their favorite package relies on it behind the scenes.


(Bioconductor logo)

While I haven’t blogged about BiocParallel yet, I did give a presentation about it at our computing club back in April 2nd, 2015. See it here (source). I’m going to follow its structure in this post.

Parallel computing

Before even thinking about using BiocParallel you have to decide whether parallel computing is the thing you need.


(Cloud computing joke)

While I’m not talking about cloud computing, I still find this picture funny.


(Parallel computing diagram)

There are different types of parallel computing, but what I'm referring to here is called embarrassingly parallel: you have a task to perform for a set of inputs, you split your inputs into subsets, and you perform the task on each subset. Performing the task for one input at a time is called serial programming, and it's what we do in most cases when using functions like lapply() or for loops.

plot(y = 10 / (1:10), 1:10, xlab = 'Number of cores', ylab = 'Time',
    main = 'Ideal scenario', type = 'o', col = 'blue',
    cex = 2, cex.axis = 2, cex.lab = 1.5, cex.main = 2, pch = 16)

(Plot: ideal scenario, time vs. number of cores)

You might be running a simulation for a different set of parameters (a parameter grid) and running each simulation could take some time. Parallel computing can help you speed up this problem. In the ideal scenario, the higher number of computing cores (units that evaluate subsets of your inputs) the less time you need to run your full analysis.

plot(y = 10 / (1:10), 1:10, xlab = 'Number of cores', ylab = 'Time',
    main = 'Reality', type = 'o', col = 'blue',
    cex = 2, cex.axis = 2, cex.lab = 1.5, cex.main = 2, pch = 16)
lines(y = 10 / (1:10) * c(1, 1.05^(2:10) ), 1:10, col = 'red',
    type = 'o', cex = 2)

(Plot: reality vs. the ideal scenario)

However, in reality parallel computing is not cost-free. It involves some communication costs, like sending the data to the cores, aggregating the results in a way that you can then easily use, among other things. So, it’ll be a bit slower than the ideal scenario but you can potentially still greatly reduce the overall time.

Having said all of the above, let's say that you now want to do some parallel computing in R. Where do you start? A pretty good place to start is the CRAN Task View: High-Performance and Parallel Computing with R. There you'll find a lot of information about different packages that enable you to do parallel computing with R.


(Word cloud of parallel computing terms)

But you’ll soon be lost in a sea of new terms.

Why use BiocParallel?

  • It’s simple to use.
  • You can try different parallel backends without changing your code.
  • You can use it to submit cluster jobs.
  • You’ll have access to great support from the Bioconductor developer team.

Those are the big reasons why I use BiocParallel. But let me go through them a bit more slowly.

Birthday example

I'm going to use as an example the birthday problem, where you want to find out empirically the probability that two people in a room share the same birthday.

birthday <- function(n) {
    m <- 10000
    x <- numeric(m)
    for(i in seq_len(m)) {
        b <- sample(seq_len(365), n, replace = TRUE)
        x[i] <- ifelse(length(unique(b)) == n, 0, 1)
    }
    mean(x)
}

Naive birthday code

Once you have written the code for it, you can then use lapply() or a for loop to calculate the results.

system.time( lapply(seq_len(100), birthday) )
##    user  system elapsed 
##  25.610   0.442  27.430

Takes around 25 seconds.

Via doMC

If you looked at the CRAN Task View: High-Performance and Parallel Computing with R you might have found the doMC package (Analytics and Weston, 2015).

It allows you to run computations in parallel as shown below.

library('doMC')
## Loading required package: foreach
## Loading required package: iterators
## Loading required package: parallel
registerDoMC(2)
system.time( x <- foreach(j = seq_len(100)) %dopar% birthday(j) )
##    user  system elapsed 
##  12.819   0.246  13.309

While it’s a bit faster, the main problem is that you had to change your code in order to be able to use it.

With BiocParallel

This is how you would run things with BiocParallel.

library('BiocParallel')
system.time( y <- bplapply(seq_len(100), birthday) )
##    user  system elapsed 
##   0.021   0.011  16.095

The only change here is using bplapply() instead of lapply(), so just 2 characters. Well, that and loading the BiocParallel package.

BiocParallel’s advantages

There are many computation backends and one of the strongest features of BiocParallel is that it’s easy to switch between them. For example, my computer can run the following options:

registered()
## $MulticoreParam
## class: MulticoreParam 
##   bpjobname:BPJOB; bpworkers:2; bptasks:0; bptimeout:Inf; bpRNGseed:; bpisup:FALSE
##   bplog:FALSE; bpthreshold:INFO; bplogdir:NA
##   bpstopOnError:FALSE; bpprogressbar:FALSE
##   bpresultdir:NA
## cluster type: FORK 
## 
## $SnowParam
## class: SnowParam 
##   bpjobname:BPJOB; bpworkers:2; bptasks:0; bptimeout:Inf; bpRNGseed:; bpisup:FALSE
##   bplog:FALSE; bpthreshold:INFO; bplogdir:NA
##   bpstopOnError:FALSE; bpprogressbar:FALSE
##   bpresultdir:NA
## cluster type: SOCK 
## 
## $SerialParam
## class: SerialParam 
##   bplog:FALSE; bpthreshold:INFO
##   bpcatchErrors:FALSE

If I were doing this on our computing cluster, I would see even more options.

Now let's say that I want to test different computation backends, or even run things in serial mode so I can trace a bug down more easily. Well, all I have to do is change the BPPARAM argument as shown below.

## Test in serial mode
system.time( y.serial <- bplapply(1:10, birthday,
    BPPARAM = SerialParam()) )
##    user  system elapsed 
##   2.577   0.033   2.733
## Try Snow
system.time( y.snow <- bplapply(1:10, birthday, 
    BPPARAM = SnowParam(workers = 2)) )
##    user  system elapsed 
##   0.027   0.006   2.436

Talking about computing clusters, you might be interested in using BatchJobs (Bischl, Lang, Mersmann, Rahnenführer, et al., 2015) just like Prasad Patil did for his PhD work. Well, with BiocParallel you can also choose to use the BatchJobs backend; I have code showing this in the presentation I referenced earlier, and a related sketch follows below.
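Relatedly, if you prefer not to pass BPPARAM on every call, you can register a default backend once with register() and let bplapply() pick it up; a minimal sketch (the choice of two workers is arbitrary):

library('BiocParallel')

## Register a default backend once; subsequent bplapply() calls will use it
register(MulticoreParam(workers = 2))
system.time( y.default <- bplapply(seq_len(100), birthday) )

## A cluster backend (e.g. BatchJobs, at the time of writing) can be
## registered the same way without changing birthday() at all.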

Where do I start?

If you are convinced about using BiocParallel, which I hope you are by now, check out the Introduction to BiocParallel vignette available at BiocParallel's landing page. It explains in more detail how to use it and its rich set of features. But if you just want to jump right in and start playing around with it, install it by running the following code:

## try http:// if https:// URLs are not supported
source("https://bioconductor.org/biocLite.R")
biocLite("BiocParallel")

Conclusions

Like I said earlier, BiocParallel is simple to use and has definite advantages over other solutions.

  • You can try different parallel backends without changing your code.
  • You can use it to submit cluster jobs.
  • You’ll have access to great support from the Bioconductor developer team. See the biocparallel tag at the support website.

Have fun using it!

Reproducibility

## Reproducibility info
library('devtools')
session_info()
## Session info --------------------------------------------------------------
##  setting  value                       
##  version  R version 3.2.2 (2015-08-14)
##  system   x86_64, darwin13.4.0        
##  ui       X11                         
##  language (EN)                        
##  collate  en_US.UTF-8                 
##  tz       America/Chicago             
##  date     2016-03-07
## Packages ------------------------------------------------------------------
##  package        * version  date       source        
##  bibtex           0.4.0    2014-12-31 CRAN (R 3.2.0)
##  BiocParallel   * 1.4.3    2015-12-16 Bioconductor  
##  bitops           1.0-6    2013-08-17 CRAN (R 3.2.0)
##  codetools        0.2-14   2015-07-15 CRAN (R 3.2.2)
##  devtools       * 1.10.0   2016-01-23 CRAN (R 3.2.3)
##  digest           0.6.9    2016-01-08 CRAN (R 3.2.3)
##  doMC           * 1.3.4    2015-10-13 CRAN (R 3.2.0)
##  evaluate         0.8      2015-09-18 CRAN (R 3.2.0)
##  foreach        * 1.4.3    2015-10-13 CRAN (R 3.2.0)
##  formatR          1.2.1    2015-09-18 CRAN (R 3.2.0)
##  futile.logger    1.4.1    2015-04-20 CRAN (R 3.2.0)
##  futile.options   1.0.0    2010-04-06 CRAN (R 3.2.0)
##  httr             1.1.0    2016-01-28 CRAN (R 3.2.3)
##  iterators      * 1.0.8    2015-10-13 CRAN (R 3.2.0)
##  knitcitations  * 1.0.7    2015-10-28 CRAN (R 3.2.0)
##  knitr          * 1.12.3   2016-01-22 CRAN (R 3.2.3)
##  lambda.r         1.1.7    2015-03-20 CRAN (R 3.2.0)
##  lubridate        1.5.0    2015-12-03 CRAN (R 3.2.3)
##  magrittr         1.5      2014-11-22 CRAN (R 3.2.0)
##  memoise          1.0.0    2016-01-29 CRAN (R 3.2.3)
##  plyr             1.8.3    2015-06-12 CRAN (R 3.2.1)
##  R6               2.1.2    2016-01-26 CRAN (R 3.2.3)
##  Rcpp             0.12.3   2016-01-10 CRAN (R 3.2.3)
##  RCurl            1.95-4.7 2015-06-30 CRAN (R 3.2.1)
##  RefManageR       0.10.6   2016-02-15 CRAN (R 3.2.3)
##  RJSONIO          1.3-0    2014-07-28 CRAN (R 3.2.0)
##  snow             0.4-1    2015-10-31 CRAN (R 3.2.0)
##  stringi          1.0-1    2015-10-22 CRAN (R 3.2.0)
##  stringr          1.0.0    2015-04-30 CRAN (R 3.2.0)
##  XML              3.98-1.3 2015-06-30 CRAN (R 3.2.0)

References

Citations made with knitcitations (Boettiger, 2015).

[1]
R. Analytics and S. Weston.
doMC: Foreach Parallel Adaptor for ‘parallel’.
R package version 1.3.4.
2015.
URL: http://CRAN.R-project.org/package=doMC.

[2]
B. Bischl, M. Lang, O. Mersmann, J. Rahnenführer, et al.
“BatchJobs and BatchExperiments: Abstraction Mechanisms for Using R in Batch Environments”.
In: Journal of Statistical Software 64.11 (2015), pp. 1–25.
URL: http://www.jstatsoft.org/v64/i11/.

[3]
C. Boettiger.
knitcitations: Citations for ‘Knitr’ Markdown Files.
R package version 1.0.7.
2015.
URL: http://CRAN.R-project.org/package=knitcitations.

[4]
M. Morgan, V. Obenchain, M. Lang and R. Thompson.
BiocParallel: Bioconductor facilities for parallel evaluation.
R package version 1.4.3.
2016.

Want more?

Check other @jhubiostat student blogs at Bmore Biostats as well as topics on #rstats.


Perform co-operations with the coop package


(This article was first published on R – librestats, and kindly contributed to R-bloggers)

About

The coop package does co-operations: covariance, correlation, and cosine, and it does them quickly. The package is available on CRAN and GitHub, and has two vignettes:

Incidentally, the vignettes don’t render correctly on CRAN’s end for some reason; if any of you rmarkdown people have suggestions, I’m all ears.

The package is licensed under the permissive 2-clause BSD, and all the C code except for the custom NA handling stuff can easily be built as a standalone shared library if R’s not your thing.

To get the most out of the package, you need a good BLAS library and a modern compiler that supports OpenMP, preferably version 4. Both of these issues have been written about endlessly, but if this is the first you’ve heard of them, see the package vignettes.

History

The package was born out of needing to compute cosine similarities for a homework assigned to me last fall semester. I grabbed a package from CRAN to perform the operation. I have a very good nose for when something is taking longer than it should, and was getting annoyed at how slow it was. So I wrote a version myself that was significantly faster for my purposes. Then I implemented a sparse version, because why not. At some point I wrote a custom na.omit() function for matrices which is ~3x faster than R's.

While implementing the cosine stuff, I realized it was basically the same calculation as covariance and (Pearson) correlation, which I had already done in the pcapack package. So I tossed those in as well and put it all under one interface.
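To see that connection in code: cosine, covariance and correlation are all column-wise crossproducts after different centerings and scalings. A rough sketch (the manual formulas are only for illustration; coop's cosine() does the first computation natively):

library(coop)

set.seed(1)
x = matrix(rnorm(100 * 4), 100, 4)

## Cosine similarity between columns: dot products scaled by the column norms
cos_manual = crossprod(x) / tcrossprod(sqrt(colSums(x^2)))
all.equal(cos_manual, cosine(x), check.attributes = FALSE)

## Pearson correlation is the same calculation applied to centered columns
xc = scale(x, center = TRUE, scale = FALSE)
all.equal(crossprod(xc) / tcrossprod(sqrt(colSums(xc^2))), cor(x),
          check.attributes = FALSE)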

My enthusiasm for this problem also nicely explains why I haven’t had a date in 4 years.

Benchmarks

It’s fast:

library(coop)
library(rbenchmark)
cols <- c("test", "replications", "elapsed", "relative")
reps <- 25
 
m <- 20000
n <- 500
x <- matrix(rnorm(m*n), m, n)

benchmark(cov(x), covar(x), replications=reps, columns=cols)
##      test replications elapsed relative
## 2 covar(x)           25   2.974    1.000
## 1   cov(x)           25  69.960   23.524

More benchmarks are available in the package readme and one of the package vignettes

These benchmarks use the default “non-inplace” method, which for covariance and correlation will require a copy of the input data matrix. This is part of the usual time/space tradeoff, but if you’re crunched for space and can wait, there’s also an in-place method which uses considerably less extra storage at the cost of being way slower. Them’s the breaks, kiddo. But it’s still 3x or more faster than R’s implementations on my hardware.

Future Directions

The package is being actively maintained with new features added pretty regularly. I expect the package to mostly mature after a few more rounds of updates, namely:

  • Finish row-wise deletion for NA’s with COO (sparse) storage.
  • Support use="pairwise.complete.obs" once I figure out what the hell that actually means.
  • Some C stuff you don’t care about.

Not on the agenda: implementing other cov/cor methods. The pcaPP package purportedly does Kendall's tau very efficiently. Spearman's rank also doesn't really interest me at the moment.

That’s it! If you try the package and discover any issues, please let me know on GitHub.


R benchmark for High-Performance Analytics and Computing (I)


(This article was first published on Blog – ParallelR, and kindly contributed to R-bloggers)

 

Objectives of Experiments

R is more and more popular in various fields, including high-performance analytics and computing (HPAC). Nowadays, HPC system architectures can be classified as pure CPU systems, CPU + accelerator (GPGPU/FPGA) heterogeneous systems, and CPU + coprocessor systems. On the software side, high-performance scientific libraries, such as the basic linear algebra subprograms (BLAS), significantly influence the performance of R for HPAC applications.

So, in this first post of the R benchmark series, the experiments cover two aspects:

(1) Performance on different HPC system architectures,

(2) Performance with different BLAS libraries.

 

Benchmark and Testing Goals

In this post, we choose the R-25 benchmark (available here), which includes the most popular, widely acknowledged functions in the high-performance analytics field. The testing script includes fifteen common computationally intensive tasks (Table-1), grouped into three categories:

(1) Matrix Calculation (1-5)

(2) Matrix function (6-10)

(3) Programming (11-15)

Table-1 R-25 Benchmark Description

Task  Description
1     Creation, transposition and deformation of a 2500x2500 matrix
2     2400x2400 normally distributed random matrix
3     Sorting of 7,000,000 random values
4     2800x2800 cross-product matrix
5     Linear regression over a 3000x3000 matrix
6     FFT over 2,400,000 random values
7     Eigenvalues of a 640x640 random matrix
8     Determinant of a 2500x2500 random matrix
9     Cholesky decomposition of a 3000x3000 matrix
10    Inverse of a 1600x1600 random matrix
11    Calculation of 3,500,000 Fibonacci numbers (vector calculation)
12    Creation of a 3000x3000 Hilbert matrix (matrix calculation)
13    Greatest common divisors of 400,000 pairs (recursion)
14    Creation of a 500x500 Toeplitz matrix (loops)
15    Escoufier's method on a 45x45 matrix (mixed)
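To make the table concrete, here is a minimal sketch (not the benchmark script itself) of what two of these tasks look like in plain R, timed with system.time(); timings of tasks like these are dominated by the BLAS library R is linked against.

# Task 4: 2800x2800 cross-product matrix (t(a) %*% a)
a <- matrix(rnorm(2800 * 2800), 2800, 2800)
system.time(b <- crossprod(a))

# Task 9: Cholesky decomposition of a 3000x3000 positive-definite matrix
m <- crossprod(matrix(rnorm(3000 * 3000), 3000, 3000)) + diag(3000)
system.time(ch <- chol(m))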

 

In our benchmark, we measured the performance of the R-25 benchmark on various hardware platforms, including Intel Xeon CPUs, NVIDIA GPGPU cards and Intel Xeon Phi coprocessors. Since R built against different BLAS libraries delivers different performance, we tested R with the self-contained (reference) BLAS, OpenBLAS, Intel MKL and CUDA BLAS. Because the reference BLAS is much slower than the other libraries, and in practice HPAC users of R almost always build R against a high-performance BLAS, we omit the reference BLAS results.

Moreover, in order to investigate the performance of the functions and algorithms that HPC users rely on most, such as GEMM, we explore the speedup while varying the matrix sizes and number of elements, also known as scalability.

 

System Descriptions

To evaluate the applicability of different methods for improving R performance in an HPC environment, the hardware and software of the platforms we used are listed in Table-2 and Table-3.

[Table-2: hardware configuration]

[Table-3: software configuration]

 

Results and Discussions

(1) General Comparisons

Fig. 1 shows the speedup of R using different BLAS libraries on different hosts. The default R running with OpenBLAS, shown in red, is our baseline for comparison, so its speedup is constantly equal to one.

The Intel Xeon E5-2670 has eight physical cores per socket, so there are 16 physical cores in one server node. The Intel MKL library supports a single-thread (sequential) mode and an OpenMP threading mode; with OpenMP threading, MKL by default uses all physical cores in the node (here, 16). Fig.1 shows the results of using Intel MKL with 1 thread and with 16 threads (automatic parallel execution), shown in blue. Five subtasks show a significant benefit from either the optimized sequential math library or automatic parallelization with MKL: cross-product (matrix size 2800x2800), linear regression, matrix decomposition, matrix inverse and matrix determinant. The other, less compute-intensive tasks gain very little from parallel execution with MKL.

Fig.1 Performance comparison of Intel MKL and NVIDIA BLAS against R + OpenBLAS (speedup relative to OpenBLAS)

We also exploited parallelism with CUDA BLAS (libnvblas.so) on the NVIDIA GPU platform. Since the drop-in library (nvblas) only accelerates the level-3 BLAS functions and incurs preloading overhead, the result (green column) in Fig.2 shows little benefit, and even worse performance for some computing tasks, compared with Intel MKL acceleration.

Fig.2 Performance comparison for CPU and GPU with NVIDIA BLAS and Intel MKL (speedup relative to Xeon)

(2) Scalability on NVIDIA GPU

The performance using two GPU devices (green column) is not uniformly superior to using one GPU device (blue column); on some subtasks a single GPU even does better. To explain the difference between one and two GPU devices, take the compute-intensive cross-product function as an example, as shown in Fig. 3. The advantage of two cards gradually appears as the matrix size increases. The secondary vertical axis shows the ratio of the elapsed time on one device to the elapsed time on two devices; a ratio greater than 1 indicates that two cards outperform one, and the larger the ratio, the greater the advantage of the second card.

 

Fig.3 Scalability for 1X and 2X NVIDIA K40m GPUs for the 'crossprod' function

(3) Heterogeneous Parallel Models on Intel Xeon Phi (MIC)

To compare the parallelism offered by a pure CPU (Intel Xeon processor) and the Intel Xeon Phi coprocessor, we conducted batch runs (10 repetitions, averaging the elapsed time) of matrix multiplication at different matrix sizes. MKL supports automatically offloading computation to the Intel Xeon Phi card, but before using it you should know:

Automatic offload functions in MKL

  • Level-3 BLAS: GEMM, TRSM, TRMM, SYMM
  • LAPACK 3 amigos : LU, QR, Cholesky

Matrix size for offloading

  • GEMM: M, N >2048, K>256
  • SYMM: M, N >2048
  • TRSM/TRMM: M, N >3072
  • LU: M, N>8192

Here, we substitute `a %*% a` for the `crossprod` call used in R-benchmark-25.R, because `crossprod` cannot be auto-offloaded to the Intel Xeon Phi. We compared the elapsed time running on CPU + Xeon Phi with running on the pure CPU. In Fig.4, the vertical axis is the ratio of the elapsed time in CPU + Xeon Phi mode to the elapsed time in pure-CPU mode. The results show that the larger the matrix, the more the CPU + Xeon Phi combination gains; for matrix sizes below 4000, the pure CPU gives the best performance.
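For illustration, a minimal sketch (not the authors' exact script) of such a GEMM scaling test in R: with MKL automatic offload enabled (MKL_MIC_ENABLE=1, as set in the appendix below), sufficiently large multiplications are sent to the Xeon Phi, while a crossprod() call would stay on the CPU.

# Time a %*% a for growing matrix sizes; averaging over repeated runs,
# as in the text above, reduces noise in the measurements.
sizes <- c(1000, 2000, 4000, 8000)
elapsed <- sapply(sizes, function(n) {
  a <- matrix(rnorm(n * n), n, n)
  mean(replicate(3, system.time(a %*% a)[["elapsed"]]))
})
data.frame(size = sizes, elapsed = elapsed)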

 

Fig.4 Heterogeneous computing with Intel Xeon and Intel Xeon Phi

Fig.5 shows that running 80% of the computation on the Xeon Phi gives the best performance as the matrix size grows, while 70% on the Xeon Phi gives steadily better performance once the matrix size exceeds 2000.

Fig.5 Different computation ratios on the Intel Xeon Phi result in different performance

(4) Comparison of NVIDIA GPU with Intel Xeon Phi

Here, we plot the results of the NVIDIA GPU and Intel Xeon Phi relative to the Intel Xeon in Fig.6. In general, running 80% of the work on Xeon Phi (2X 7110P) + Xeon CPU (2X E5-2670) gives performance similar to 1X K40m + 2X E5-2670 (so 2X 7110P is roughly comparable to 1X K40m). When the matrix size is less than 12000, the GPU performs better than the Xeon Phi; beyond that, the Xeon Phi shows performance similar to the NVIDIA K40m. For this benchmark, NVIDIA's Tesla GPUs (2X K40m) clearly outperform the rest: at a matrix size of 16000 they are nearly 3.9x faster than the 8-core dual E5-2670 (Sandy Bridge CPUs) and 2.3x faster than the 80% Xeon Phi configuration. The Xeon Phi is 2.8x faster than the Sandy Bridge.

 

Fig.6 Comparison of the NVIDIA GPU with the Intel Xeon Phi

Conclusions

In this article, we tested the R-benchmark-25.R script on different hardware platforms with different BLAS libraries. From our analysis, we conclude that:

(1) R built with Intel MKL (either sequential or threaded) accelerates many computationally intensive HPAC algorithms, such as linear regression, PCA and SVD, and gives the best overall performance.

(2) R performs matrix multiplication (GEMM) faster on the GPU, since it is a highly compute-intensive operation and the GPU has more computing cores than the Intel Xeon or Xeon Phi.

(3) R executed on heterogeneous platforms (CPU + GPU or CPU + MIC) can gain further performance improvements.

(4) R gets more benefit from multiple GPUs, especially for large GEMM operations.

 

In the next post, we will further investigate benchmark performance with different R parallel packages and commercial distributions of R.

 


Appendix: How to build R with different BLAS libraries

Stock R

(1) Stock R build

Download the base R package from the R project website; the current version is R-3.2.3.

Enter the R source directory and execute:

  > $./configure --with-readline=no --with-x=no --prefix=$HOME/R-3.2.3-ori

  > $make -j4

  > $make install

(2) Add the R bin directory and library directory to the environment variables PATH and LD_LIBRARY_PATH respectively, for example:

  > export PATH=$HOME/R-3.2.3-ori/bin:$PATH

  > export LD_LIBRARY_PATH=$HOME/R-3.2.3-ori/lib64/R/lib:$LD_LIBRARY_PATH

R with OpenBLAS

(1) OpenBLAS build

Download OpenBlas-0.2.15.tar.gz from http://www.openblas.net/

Change directory to OpenBLAS Home directory, and execute

  > $make

  > $make PREFIX=$OPENBLAS_INSTALL_DIRECTORY install

(2) Set the OpenBLAS library environment

(3) Run benchmark

  > $LD_PRELOAD=$OPENBLAS_HOME/lib/libopenblas.so R

R with Intel MKL

(1) Obtain the Intel Parallel Studio software from the Intel website

(2) Install the parallel studio

(3) Set the Intel compiler and MKL library environment

(4) Build R with MKL

Set up the MKL library link configuration (e.g. in a file mkl.conf) as follows:

a. Sequential MKL (single-threaded)

# make sure the Intel compiler is installed and loaded, e.g. in .bashrc:
source /opt/intel/bin/compilervars.sh intel64
MKL_LIB_PATH=/opt/intel/mkl/lib/intel64

## Use the Intel compiler
CC='icc -std=c99'
CFLAGS='-g -O3 -wd188 -ip'
F77='ifort'
FFLAGS='-g -O3'
CXX='icpc'
CXXFLAGS='-g -O3'
FC='ifort'
FCFLAGS='-g -O3'

## MKL sequential, ICC
MKL=" -L${MKL_LIB_PATH}
      -Wl,--start-group
      -lmkl_intel_lp64
      -lmkl_sequential
      -lmkl_core
      -Wl,--end-group"

b. OpenMP-threaded MKL

# make sure the Intel compiler is installed and loaded, e.g. in .bashrc:
source /opt/intel/bin/compilervars.sh intel64
MKL_LIB_PATH=/opt/intel/mkl/lib/intel64

## Use the Intel compiler
CC='icc -std=c99'
CFLAGS='-g -O3 -wd188 -ip'
F77='ifort'
FFLAGS='-g -O3'
CXX='icpc'
CXXFLAGS='-g -O3'
FC='ifort'
FCFLAGS='-g -O3'

## MKL with Intel OpenMP threading, ICC
MKL=" -L${MKL_LIB_PATH}
      -Wl,--start-group
      -lmkl_intel_lp64
      -lmkl_intel_thread
      -lmkl_core
      -Wl,--end-group
      -liomp5 -lpthread"

Then build R with the following command:

  > $./configure --prefix=$HOME/R-3.2.3-mkl-icc --with-readline=no --with-x=no --with-blas="$MKL" --with-lapack CC='icc -std=c99' CFLAGS='-g -O3 -wd188 -ip' F77='ifort' FFLAGS='-g -O3' CXX='icpc' CXXFLAGS='-g -O3' FC='ifort' FCFLAGS='-g -O3'

  > $make -j 4; make install

(5) Set $HOME/R-3.2.3-mkl-icc environment

 R with CUDA BLAS 

(1) Install the NVIDIA driver and CUDA toolkit (version 6.5 or later) for the NVIDIA Tesla cards

(2) Set the CUDA environment

(3) Edit the nvblas.conf file

# This is the configuration file to use NVBLAS Library
# Setup the environment variable NVBLAS_CONFIG_FILE to specify your own config file.
# By default, if NVBLAS_CONFIG_FILE is not defined,
# NVBLAS Library will try to open the file "nvblas.conf" in its current directory
# Example : NVBLAS_CONFIG_FILE /home/cuda_user/my_nvblas.conf
# The config file should have restricted write permissions

# Specify which output log file (default is stderr)
NVBLAS_LOGFILE nvblas.log

# Put here the CPU BLAS fallback Library of your choice
# It is strongly advised to use full path to describe the location of the CPU Library
NVBLAS_CPU_BLAS_LIB /opt/R-3.2.3-ori/lib64/R/lib/libRblas.so
#NVBLAS_CPU_BLAS_LIB <mkl_path_installation>/libmkl_rt.so

# List of GPU devices Id to participate to the computation
# Use ALL if you want all your GPUs to contribute
# Use ALL0, if you want all your GPUs of the same type as device 0 to contribute
# However, NVBLAS consider that all GPU have the same performance and PCI bandwidth
# By default if no GPU are listed, only device 0 will be used
#NVBLAS_GPU_LIST 0 2 4
#NVBLAS_GPU_LIST ALL
NVBLAS_GPU_LIST ALL

# Tile Dimension
NVBLAS_TILE_DIM 2048

# Autopin Memory
NVBLAS_AUTOPIN_MEM_ENABLED

# List of BLAS routines that are prevented from running on GPU (use for debugging purpose)
# The current list of BLAS routines supported by NVBLAS are
# GEMM, SYRK, HERK, TRSM, TRMM, SYMM, HEMM, SYR2K, HER2K
#NVBLAS_GPU_DISABLED_SGEMM
#NVBLAS_GPU_DISABLED_DGEMM
#NVBLAS_GPU_DISABLED_CGEMM
#NVBLAS_GPU_DISABLED_ZGEMM

# Computation can be optionally hybridized between CPU and GPU
# By default, GPU-supported BLAS routines are run fully on GPU
# The option NVBLAS_CPU_RATIO_<BLAS_ROUTINE> gives the ratio [0,1]
# of the amount of computation that should be done on CPU
# CAUTION : this option should be used wisely because it can actually
# significantly reduce the overall performance if too much work is given to CPU
#NVBLAS_CPU_RATIO_CGEMM 0.07

Set NVBLAS_CONFIG_FILE to the nvblas.conf location

(4) Run the benchmark

  > LD_PRELOAD=/opt/cuda-7.5/lib64/libnvblas.so R

 R with MKL on Intel Xeon Phi

(1) Build R with MKL

Building R with MKL is the same as the threaded MKL build described above.

(2) Enable MKL  MIC Automatic Offload Mode

  > export MKL_MIC_ENABLE=1

  > export MIC_KMP_AFFINITY=compact

Alternatively, you can set the workload division between the host CPU and the MIC cards explicitly. If a host has two MIC cards, you could set:

  > export MKL_HOST_WORKDIVISION=0.2

  > export MKL_MIC_0_WORKDIVISION=0.4

  > export MKL_MIC_1_WORKDIVISION=0.4

 

 

To leave a comment for the author, please follow the link and comment on their blog: Blog – ParallelR.


“Data Mining with R” Course | May 17-18


(This article was first published on MilanoR, and kindly contributed to R-bloggers)

The two-day course Data Mining with R is organized by the R training and consulting company Quantide. The next live class is on May 17-18 in Legnano (Milan).

If you want to know more about Quantide, check out Quantide’s website.

If you wish to attend the class, reserve a seat on the course ticket page.

 


R Live Class – Data Mining with R

May 17-18, Legnano (Milan)

 

This course introduces some of the most important and popular techniques in data-mining applications with R.
Data mining is the computational process of discovering patterns in large data sets.
During the two-day course we will review a wide variety of techniques for extracting information from large amounts of data: dimensionality reduction, clustering, classification and prediction examples will be presented and explored in depth.

Course organization

The course will start with an introduction to basic methods for data description. After that, we will review the most popular techniques for data/dimensionality reduction, such as Multidimensional Scaling, Principal Components Analysis and Correspondence Analysis. Next, we will focus on methods for finding "natural subgroups" within data, such as hierarchical/non-hierarchical Cluster Analysis and Gaussian Mixture Models.

The end of the first day and the beginning of the second day will cover techniques for classification analysis (Linear/Quadratic Discriminant Analysis, Logistic Regression, K-Nearest Neighbors, ...).

Finally, in the remaining part of the second day, we will review techniques for variable selection, collinearity reduction and best prediction for regression models (PCA regression, Ridge Regression, Lasso Regression, Elastic-Net Regression, ...).

Price 

Euro 800 + VAT

Outline

  • Univariate Descriptive Statistics
  • Reduction of Data Dimensions (MDS, PCA and EFA, CA)
  • Clustering (HC, NHC, GMM)
  • Classification (LDA, CLASS, KNN) 
  • Prediction (Several techniques to model data)

FAQ

Should I take this course?

This class will be a good fit for you if you are already using R and want to get an overview of data-mining techniques with R. Some background in theoretical statistics, probability, and linear and logistic regression is required.

What does the cost include?

The cost includes lunch, comprehensive course materials + 1 hour of individual online post course support for each student within 30 days from course date.

Is there a student discount?

We offer an academic discount for those engaged in full-time studies or research. Please contact us for further information at training[at]quantide[dot]com

What should I bring?

A laptop with the latest version of R and R-Studio.

Who will I learn from?

Enrico Pegoraro works in R training and consulting activities, with a special focus on Six Sigma, industrial statistical analysis and corporate training courses. Enrico graduated in Statistics from the University of Padua.
He has taught statistical models and R for hundreds of hours during specialized and applied courses, in universities, masters and companies.

What language is the course taught?

This course is taught in Italian. Course materials are in English.

How can I reach your place?

Legnano is about 30 minutes by train from Milano. Trains from Milano to Legnano run every 30 minutes, and Quantide's premises are a 3-minute walk from the Legnano train station.

How can I contact you if I have further questions?

You can contact us at training[at]quantide[dot]com

The post “Data Mining with R” Course | May 17-18 appeared first on MilanoR.

To leave a comment for the author, please follow the link and comment on their blog: MilanoR.


Principal Components Regression, Pt.1: The Standard Method


(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

In this note, we discuss principal components regression and some of the issues with it:

  • The need for scaling.
  • The need for pruning.
  • The lack of “y-awareness” of the standard dimensionality reduction step.

 

The purpose of this article is to set the stage for presenting dimensionality reduction techniques appropriate for predictive modeling, such as y-aware principal components analysis, variable pruning, L2-regularized regression, supervised PCR, or partial least squares. We do this by working detailed examples and building the relevant graphs. In our follow-up article we describe and demonstrate the idea of y-aware scaling.

Note we will try to say “principal components” (plural) throughout, following Everitt’s The Cambridge Dictionary of Statistics, though this is not the only common spelling (e.g. Wikipedia: Principal component regression). We will work all of our examples in R.

Principal Components Regression

In principal components regression (PCR), we use principal components analysis (PCA) to decompose the independent (x) variables into an orthogonal basis (the principal components), and select a subset of those components as the variables to predict y. PCR and PCA are useful techniques for dimensionality reduction when modeling, and are especially useful when the independent variables are highly colinear.

Generally, one selects the principal components with the highest variance — that is, the components with the largest singular values — because the subspace defined by these principal components captures most of the variation in the data, and thus represents a smaller space that we believe captures most of the qualities of the data. Note, however, that standard PCA is an “x-only” decomposition, and as Jolliffe (1982) shows through examples from the literature, sometimes lower-variance components can be critical for predicting y, and conversely, high variance components are sometimes not important.

Mosteller and Tukey (1977, pp. 397-398) argue similarly that the components with small variance are unlikely to be important in regression, apparently on the basis that nature is “tricky, but not downright mean”. We shall see in the examples below that without too much effort we can find examples where nature is “downright mean”. — Jolliffe (1982)

The remainder of this note presents principal components analysis in the context of PCR and predictive modeling in general. We will show some of the issues in using an x-only technique like PCA for dimensionality reduction. In a follow-up note, we’ll discuss some y-aware approaches that address these issues.

First, let’s build our example. In this sort of teaching we insist on toy or synthetic problems so we actually know the right answer, and can therefore tell which procedures are better at modeling the truth.

In this data set, there are two (unobservable) processes: one that produces the output yA and one that produces the output yB. We only observe the mixture of the two: y = yA + yB + eps, where eps is a noise term. Think of y as measuring some notion of success and the x variables as noisy estimates of two different factors that can each drive success. We’ll set things up so that the first five variables (x.01, x.02, x.03, x.04, x.05) have all the signal. The odd numbered variables correspond to one process (yB) and the even numbered variables correspond to the other (yA).

Then, to simulate the difficulties of real world modeling, we’ll add lots of pure noise variables (noise*). The noise variables are unrelated to our y of interest — but are related to other “y-style” processes that we are not interested in. As is common with good statistical counterexamples, the example looks like something that should not happen or that can be easily avoided. Our point is that the data analyst is usually working with data just like this.

Data tends to come from databases that must support many different tasks, so it is exactly the case that there may be columns or variables that are correlated to unknown and unwanted additional processes. The reason PCA can’t filter out these noise variables is that without use of y, standard PCA has no way of knowing what portion of the variation in each variable is important to the problem at hand and should be preserved. This can be fixed through domain knowledge (knowing which variables to use), variable pruning and y-aware scaling. Our next article will discuss these procedures; in this article we will orient ourselves with a demonstration of both what a good analysis and what a bad analysis looks like.

All the variables are also deliberately mis-scaled to model some of the difficulties of working with under-curated real world data.
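The code below relies on a few helpers that are not defined in this excerpt: gather() comes from tidyr, the plotting functions (dotplot_identity, barbell_plot, ScatterHist, ScatterHistN) come from the authors' WVPlots package and/or the R markdown accompanying the original post, and extractProjection() and rsq() are small helpers from that accompanying code. A minimal stand-in sketch of the setup, assuming extractProjection() simply returns the first ndim columns of the rotation matrix (the authors' version may also normalize the signs of the columns):

library(ggplot2)
library(tidyr)     # gather()
library(WVPlots)   # ScatterHist(), ScatterHistN(), ...

# Stand-in: the loadings (rotation) matrix restricted to the first ndim components.
extractProjection <- function(ndim, princ) {
  princ$rotation[, 1:ndim, drop = FALSE]
}

# Stand-in: R-squared of an estimate against the observed outcome.
rsq <- function(estimate, y) {
  1 - sum((y - estimate)^2) / sum((y - mean(y))^2)
}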

# build example where even and odd variables are bringing in noisy images
# of two different signals.
mkData <- function(n) {
  for(group in 1:10) {
    # y is the sum of two effects yA and yB
    yA <- rnorm(n)
    yB <- rnorm(n)
    if(group==1) {
      d <- data.frame(y=yA+yB+rnorm(n))
      code <- 'x'
    } else {
      code <- paste0('noise',group-1)
    }
    yS <- list(yA,yB)
    # these variables are correlated with y in group 1,
    # but only to each other (and not y) in other groups
    for(i in 1:5) {
      vi <- yS[[1+(i%%2)]] + rnorm(nrow(d))
      d[[paste(code,formatC(i,width=2,flag=0),sep='.')]] <- ncol(d)*vi
    }
  }
  d
}

Notice the copy of y in the data frame has additional "unexplainable variance": y is the sum of yA, yB and a noise term, each with unit variance, so only about two thirds (66%) of the variation in y is predictable.

Let’s start with our train and test data.

# make data
set.seed(23525)
dTrain <- mkData(1000)
dTest <- mkData(1000)

Let’s look at our outcome y and a few of our variables.

summary(dTrain[, c("y", "x.01", "x.02",
                   "noise1.01", "noise1.02")])
##        y                 x.01               x.02        
##  Min.   :-5.08978   Min.   :-4.94531   Min.   :-9.9796  
##  1st Qu.:-1.01488   1st Qu.:-0.97409   1st Qu.:-1.8235  
##  Median : 0.08223   Median : 0.04962   Median : 0.2025  
##  Mean   : 0.08504   Mean   : 0.02968   Mean   : 0.1406  
##  3rd Qu.: 1.17766   3rd Qu.: 0.93307   3rd Qu.: 1.9949  
##  Max.   : 5.84932   Max.   : 4.25777   Max.   :10.0261  
##    noise1.01          noise1.02       
##  Min.   :-30.5661   Min.   :-30.4412  
##  1st Qu.: -5.6814   1st Qu.: -6.4069  
##  Median :  0.5278   Median :  0.3031  
##  Mean   :  0.1754   Mean   :  0.4145  
##  3rd Qu.:  5.9238   3rd Qu.:  6.8142  
##  Max.   : 26.4111   Max.   : 31.8405

Usually we recommend doing some significance pruning on variables before moving on — see here for possible consequences of not pruning an over-abundance of variables, and here for a discussion of one way to prune, based on significance. For this example, however, we will deliberately attempt dimensionality reduction without pruning (to demonstrate the problem). Part of what we are trying to show is to not assume PCA performs these steps for you.

Ideal situation

First, let’s look at the ideal situation. If we had sufficient domain knowledge (or had performed significance pruning) to remove the noise, we would have no pure noise variables. In our example we know which variables carry signal and therefore can limit down to them before doing the PCA as follows.

goodVars <-  colnames(dTrain)[grep('^x.',colnames(dTrain))]
dTrainIdeal <- dTrain[,c('y',goodVars)]
dTestIdeal <-  dTest[,c('y',goodVars)]

Let’s perform the analysis and look at the magnitude of the singular values.

# do the PCA
dmTrainIdeal <- as.matrix(dTrainIdeal[,goodVars])
princIdeal <- prcomp(dmTrainIdeal,center = TRUE,scale. = TRUE)

# extract the principal components
rot5Ideal <- extractProjection(5,princIdeal)

# prepare the data to plot the variable loadings
rotfIdeal = as.data.frame(rot5Ideal)
rotfIdeal$varName = rownames(rotfIdeal)
rotflongIdeal = gather(rotfIdeal, "PC", "loading",
                       starts_with("PC"))
rotflongIdeal$vartype = ifelse(grepl("noise", 
                                     rotflongIdeal$varName),
                               "noise", "signal")

# plot the singular values
dotplot_identity(frame = data.frame(pc=1:length(princIdeal$sdev), 
                            magnitude=princIdeal$sdev), 
                 xvar="pc",yvar="magnitude") +
  ggtitle("Ideal case: Magnitudes of singular values")

The magnitudes of the singular values tell us that the first two principal components carry most of the signal. We can also look at the variable loadings of the principal components. The plot of the variable loadings is a graphical representation of the coordinates of the principal components. Each coordinate corresponds to the contribution of one of the original variables to that principal component.

dotplot_identity(rotflongIdeal, "varName", "loading", "vartype") + 
  facet_wrap(~PC,nrow=1) + coord_flip() + 
  ggtitle("x scaled variable loadings, first 5 principal components") + 
  scale_color_manual(values = c("noise" = "#d95f02", "signal" = "#1b9e77"))

We see that we recover the even/odd loadings of the original signal variables. PC1 has the odd variables, and PC2 has the even variables. The next three principal components complete the basis for the five original variables.

Since most of the signal is in the first two principal components, we can look at the projection of the data into that plane, using color to code y.

# signs are arbitrary on PCA, so instead of calling predict we pull out
# (and alter) the projection by hand
projectedTrainIdeal <-
  as.data.frame(dmTrainIdeal %*% extractProjection(2,princIdeal),
                                 stringsAsFactors = FALSE)
projectedTrainIdeal$y <- dTrain$y
ScatterHistN(projectedTrainIdeal,'PC1','PC2','y',
               "Ideal Data projected to first two principal components")

Notice that the value of y increases both as we move up and as we move right. We have recovered two orthogonal features that each correlate with an increase in y (in general the signs of the principal components — that is, which direction is “positive” — are arbitrary, so without precautions the above graph can appear flipped). Recall that we constructed the data so that the odd variables (represented by PC1) correspond to process yB and the even variables (represented by PC2) correspond to process yA. We have recovered both of these relations in the figure.

This is why you rely on domain knowledge, or barring that, at least prune your variables. For this example variable pruning would have gotten us to the above ideal case. In our next article we will show how to perform the significance pruning.

X-only PCA

To demonstrate the problem of x-only PCA on unpruned data in a predictive modeling situation, let’s analyze the same data without limiting ourselves to the known good variables. We are pretending (as is often the case) we don’t have the domain knowledge indicating which variables are useful and we have neglected to significance prune the variables before PCA. In our experience, this is a common mistake in using PCR, or, more generally, with using PCA in predictive modeling situations.

This example will demonstrate how you lose modeling power when you don’t apply the methods in a manner appropriate to your problem. Note that the appropriate method for your data may not match the doctrine of another field, as they may have different data issues.

The wrong way: PCA without any scaling

We deliberately mis-scaled the original data when we generated it. Mis-scaled data is a common problem in data science situations, but perhaps less common in carefully curated scientific situations. In a messy data situation like the one we are emulating, the best practice is to re-scale the x variables; however, we’ll first naively apply PCA to the data as it is. This is to demonstrate the sensitivity of PCA to the units of the data.

vars <- setdiff(colnames(dTrain),'y')

duTrain <- as.matrix(dTrain[,vars])
prinU <- prcomp(duTrain,center = TRUE,scale. = FALSE) 

dotplot_identity(frame = data.frame(pc=1:length(prinU$sdev), 
                            magnitude=prinU$sdev), 
                 xvar="pc",yvar="magnitude") +
  ggtitle("Unscaled case: Magnitudes of singular values")

There is no obvious knee in the magnitudes of the singular values, so we are at a loss as to how many variables we should use. In addition, when we look at the variable loading of the first five principal components, we will see another problem:

rot5U <- extractProjection(5,prinU)
rot5U = as.data.frame(rot5U)
rot5U$varName = rownames(rot5U)
rot5U = gather(rot5U, "PC", "loading",
                       starts_with("PC"))
rot5U$vartype = ifelse(grepl("noise", 
                                     rot5U$varName),
                               "noise", "signal")

dotplot_identity(rot5U, "varName", "loading", "vartype") + 
  facet_wrap(~PC,nrow=1) + coord_flip() + 
  ggtitle("unscaled variable loadings, first 5 principal components") + 
  scale_color_manual(values = c("noise" = "#d95f02", "signal" = "#1b9e77"))

The noise variables completely dominate the loading of the first several principal components. Because of the way we deliberately mis-scaled the data, the noise variables are of much larger magnitude than the signal variables, and so the true signal is masked when we decompose the data.

Since the magnitudes of the singular values don’t really give us a clue as to how many components to use in our model, let’s try using all of them. This actually makes no sense, because using all the principal components is equivalent to using all the variables, thus defeating the whole purpose of doing PCA in the first place. But let’s do it anyway (as many unwittingly do).

# get all the principal components
# not really a projection as we took all components!
projectedTrain <- as.data.frame(predict(prinU,duTrain),
                                 stringsAsFactors = FALSE)
vars = colnames(projectedTrain)
projectedTrain$y <- dTrain$y

varexpr = paste(vars, collapse="+")
fmla = paste("y ~", varexpr)

model <- lm(fmla,data=projectedTrain)
summary(model)
## 
## Call:
## lm(formula = fmla, data = projectedTrain)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.1748 -0.7611  0.0111  0.7821  3.6559 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  8.504e-02  3.894e-02   2.184 0.029204 *  
## PC1          1.492e-04  4.131e-04   0.361 0.717983    
## PC2          1.465e-05  4.458e-04   0.033 0.973793    
## PC3         -7.372e-04  4.681e-04  -1.575 0.115648    
## PC4          6.894e-04  5.211e-04   1.323 0.186171    
## PC5          7.529e-04  5.387e-04   1.398 0.162577    
## PC6         -2.382e-04  5.961e-04  -0.400 0.689612    
## PC7          2.555e-04  6.142e-04   0.416 0.677546    
## PC8          5.850e-04  6.701e-04   0.873 0.382908    
## PC9         -6.890e-04  6.955e-04  -0.991 0.322102    
## PC10         7.472e-04  7.650e-04   0.977 0.328993    
## PC11        -7.034e-04  7.839e-04  -0.897 0.369763    
## PC12         7.062e-04  8.039e-04   0.878 0.379900    
## PC13         1.098e-04  8.125e-04   0.135 0.892511    
## PC14        -8.137e-04  8.405e-04  -0.968 0.333213    
## PC15        -5.163e-05  8.716e-04  -0.059 0.952776    
## PC16         1.945e-03  9.015e-04   2.158 0.031193 *  
## PC17        -3.384e-04  9.548e-04  -0.354 0.723143    
## PC18        -9.339e-04  9.774e-04  -0.955 0.339587    
## PC19        -6.110e-04  1.005e-03  -0.608 0.543413    
## PC20         8.747e-04  1.042e-03   0.839 0.401494    
## PC21         4.538e-04  1.083e-03   0.419 0.675310    
## PC22         4.237e-04  1.086e-03   0.390 0.696428    
## PC23        -2.011e-03  1.187e-03  -1.694 0.090590 .  
## PC24         3.451e-04  1.204e-03   0.287 0.774416    
## PC25         2.156e-03  1.263e-03   1.707 0.088183 .  
## PC26        -6.293e-04  1.314e-03  -0.479 0.631988    
## PC27         8.401e-04  1.364e-03   0.616 0.538153    
## PC28        -2.578e-03  1.374e-03  -1.876 0.061014 .  
## PC29         4.354e-04  1.423e-03   0.306 0.759691    
## PC30         4.098e-04  1.520e-03   0.270 0.787554    
## PC31         5.509e-03  1.650e-03   3.339 0.000875 ***
## PC32         9.097e-04  1.750e-03   0.520 0.603227    
## PC33         5.617e-04  1.792e-03   0.314 0.753964    
## PC34        -1.247e-04  1.870e-03  -0.067 0.946837    
## PC35        -6.470e-04  2.055e-03  -0.315 0.752951    
## PC36         1.435e-03  2.218e-03   0.647 0.517887    
## PC37         4.906e-04  2.246e-03   0.218 0.827168    
## PC38        -2.915e-03  2.350e-03  -1.240 0.215159    
## PC39        -1.917e-03  2.799e-03  -0.685 0.493703    
## PC40         4.827e-04  2.820e-03   0.171 0.864117    
## PC41        -6.016e-05  3.060e-03  -0.020 0.984321    
## PC42         6.750e-03  3.446e-03   1.959 0.050425 .  
## PC43        -3.537e-03  4.365e-03  -0.810 0.417996    
## PC44        -4.845e-03  5.108e-03  -0.948 0.343131    
## PC45         8.643e-02  5.456e-03  15.842  < 2e-16 ***
## PC46         7.882e-02  6.267e-03  12.577  < 2e-16 ***
## PC47         1.202e-01  6.693e-03  17.965  < 2e-16 ***
## PC48        -9.042e-02  1.163e-02  -7.778 1.92e-14 ***
## PC49         1.309e-01  1.670e-02   7.837 1.23e-14 ***
## PC50         2.893e-01  3.546e-02   8.157 1.08e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.231 on 949 degrees of freedom
## Multiple R-squared:  0.5052, Adjusted R-squared:  0.4791 
## F-statistic: 19.38 on 50 and 949 DF,  p-value: < 2.2e-16
estimate <- predict(model,newdata=projectedTrain)
trainrsq <- rsq(estimate,projectedTrain$y)

Note that most of the variables that achieve significance are the very last ones! We will leave it to the reader to confirm that using even as many as the first 25 principal components — half the variables — explains little of the variation in y. If we wanted to use PCR to reduce the dimensionality of the problem, we have failed. This is an example of what Jolliffe would have called a "downright mean" modeling problem, which we caused by mis-scaling the data. Note the r-squared of 0.5052; we will use it for comparison later.

So now let’s do what we should have done in the first place: scale the data.

A better way: Preparing the training data with x-only scaling

Standard practice is to center the data at mean zero and scale it to unit standard deviation, which is easy with the scale command.

dTrainNTreatedUnscaled <- dTrain
dTestNTreatedUnscaled <- dTest

# scale the data
dTrainNTreatedXscaled <- 
  as.data.frame(scale(dTrainNTreatedUnscaled[,colnames(dTrainNTreatedUnscaled)!='y'],
                      center=TRUE,scale=TRUE),stringsAsFactors = FALSE)
dTrainNTreatedXscaled$y <- dTrainNTreatedUnscaled$y
dTestNTreatedXscaled <- 
  as.data.frame(scale(dTestNTreatedUnscaled[,colnames(dTestNTreatedUnscaled)!='y'],
                      center=TRUE,scale=TRUE),stringsAsFactors = FALSE)
dTestNTreatedXscaled$y <- dTestNTreatedUnscaled$y

# get the variable ranges
ranges = vapply(dTrainNTreatedXscaled, FUN=function(col) c(min(col), max(col)), numeric(2))
rownames(ranges) = c("vmin", "vmax") 
rframe = as.data.frame(t(ranges))  # make ymin/ymax the columns
rframe$varName = rownames(rframe)
varnames = setdiff(rownames(rframe), "y")
rframe = rframe[varnames,]
rframe$vartype = ifelse(grepl("noise", rframe$varName),
                        "noise", "signal")

summary(dTrainNTreatedXscaled[, c("y", "x.01", "x.02", 
                                  "noise1.01", "noise1.02")])
##        y                 x.01               x.02         
##  Min.   :-5.08978   Min.   :-3.56466   Min.   :-3.53178  
##  1st Qu.:-1.01488   1st Qu.:-0.71922   1st Qu.:-0.68546  
##  Median : 0.08223   Median : 0.01428   Median : 0.02157  
##  Mean   : 0.08504   Mean   : 0.00000   Mean   : 0.00000  
##  3rd Qu.: 1.17766   3rd Qu.: 0.64729   3rd Qu.: 0.64710  
##  Max.   : 5.84932   Max.   : 3.02949   Max.   : 3.44983  
##    noise1.01          noise1.02       
##  Min.   :-3.55505   Min.   :-3.04344  
##  1st Qu.:-0.67730   1st Qu.:-0.67283  
##  Median : 0.04075   Median :-0.01098  
##  Mean   : 0.00000   Mean   : 0.00000  
##  3rd Qu.: 0.66476   3rd Qu.: 0.63123  
##  Max.   : 3.03398   Max.   : 3.09969
barbell_plot(rframe, "varName", "vmin", "vmax", "vartype") +
  coord_flip() + ggtitle("x scaled variables: ranges") + 
  scale_color_manual(values = c("noise" = "#d95f02", "signal" = "#1b9e77"))

Note that the signal and noise variables now have commensurate ranges.

The principal components analysis

vars = setdiff(colnames(dTrainNTreatedXscaled), "y")

dmTrain <- as.matrix(dTrainNTreatedXscaled[,vars])
dmTest <- as.matrix(dTestNTreatedXscaled[,vars])
princ <- prcomp(dmTrain,center = TRUE,scale. = TRUE) 
dotplot_identity(frame = data.frame(pc=1:length(princ$sdev), 
                            magnitude=princ$sdev), 
                 xvar="pc",yvar="magnitude") +
  ggtitle("x scaled variables: Magnitudes of singular values")

Now the magnitudes of the singular values suggest that we can try to model the data with only the first twenty principal components. But first, let’s look at the variable loadings of the first five principal components.

rot5 <- extractProjection(5,princ)
rotf = as.data.frame(rot5)
rotf$varName = rownames(rotf)
rotflong = gather(rotf, "PC", "loading", starts_with("PC"))
rotflong$vartype = ifelse(grepl("noise", rotflong$varName), 
                          "noise", "signal")

dotplot_identity(rotflong, "varName", "loading", "vartype") + 
  facet_wrap(~PC,nrow=1) + coord_flip() + 
  ggtitle("x scaled variable loadings, first 5 principal components") + 
  scale_color_manual(values = c("noise" = "#d95f02", "signal" = "#1b9e77"))

The signal variables now have larger loadings than they did in the unscaled case, but the noise variables still dominate the projection, in aggregate swamping out the contributions from the signal variables. The two processes that produced y have diffused amongst the principal components, rather than mostly concentrating in the first two, as they did in the ideal case. This is because we constructed the noise variables to have variation and some correlations with each other — but not be correlated with y. PCA doesn’t know that we are interested only in variable correlations that are due to y, so it must decompose the data to capture as much variation, and as many variable correlations, as possible.

In other words, PCA must represent all processes present in the data, regardless of whether we are trying to predict those particular processes or not. Without the knowledge of the y that we are trying to predict, PCA is forced to prepare for any possible future prediction task.

Modeling

Let’s build a model using only the first twenty principal components, as our above analysis suggests we should.

# get all the principal components
# not really a projection as we took all components!
projectedTrain <- as.data.frame(predict(princ,dmTrain),
                                 stringsAsFactors = FALSE)
projectedTrain$y <- dTrainNTreatedXscaled$y

ncomp = 20
# here we will only model with the first ncomp principal components
varexpr = paste(paste("PC", 1:ncomp, sep=''), collapse='+')
fmla = paste("y ~", varexpr)

model <- lm(fmla,data=projectedTrain)
summary(model)
## 
## Call:
## lm(formula = fmla, data = projectedTrain)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.2612 -0.7939 -0.0096  0.7898  3.8352 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.085043   0.039391   2.159 0.031097 *  
## PC1          0.107016   0.025869   4.137 3.82e-05 ***
## PC2         -0.047934   0.026198  -1.830 0.067597 .  
## PC3          0.135933   0.026534   5.123 3.62e-07 ***
## PC4         -0.162336   0.026761  -6.066 1.87e-09 ***
## PC5          0.356880   0.027381  13.034  < 2e-16 ***
## PC6         -0.126491   0.027534  -4.594 4.92e-06 ***
## PC7          0.092546   0.028093   3.294 0.001022 ** 
## PC8         -0.134252   0.028619  -4.691 3.11e-06 ***
## PC9          0.280126   0.028956   9.674  < 2e-16 ***
## PC10        -0.112623   0.029174  -3.860 0.000121 ***
## PC11        -0.065812   0.030564  -2.153 0.031542 *  
## PC12         0.339129   0.030989  10.943  < 2e-16 ***
## PC13        -0.006817   0.031727  -0.215 0.829918    
## PC14         0.086316   0.032302   2.672 0.007661 ** 
## PC15        -0.064822   0.032582  -1.989 0.046926 *  
## PC16         0.300566   0.032739   9.181  < 2e-16 ***
## PC17        -0.339827   0.032979 -10.304  < 2e-16 ***
## PC18        -0.287752   0.033443  -8.604  < 2e-16 ***
## PC19         0.297290   0.034657   8.578  < 2e-16 ***
## PC20         0.084198   0.035265   2.388 0.017149 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.246 on 979 degrees of freedom
## Multiple R-squared:  0.4776, Adjusted R-squared:  0.467 
## F-statistic: 44.76 on 20 and 979 DF,  p-value: < 2.2e-16
projectedTrain$estimate <- predict(model,newdata=projectedTrain)
ScatterHist(projectedTrain,'estimate','y','Recovered 20 variable model versus truth (train)',
            smoothmethod='identity',annot_size=3)

trainrsq <- rsq(projectedTrain$estimate,projectedTrain$y)

This model explains 47.76% of the variation in the training set. We do about as well on test.

projectedTest <- as.data.frame(predict(princ,dmTest),
                                 stringsAsFactors = FALSE)
projectedTest$y <- dTestNTreatedXscaled$y
projectedTest$estimate <- predict(model,newdata=projectedTest)
testrsq <- rsq(projectedTest$estimate,projectedTest$y)
testrsq
## [1] 0.5033022

This is pretty good; recall that we had about 33% unexplainable variance in the data, so we would not expect any modeling algorithm to get better than an r-squared of about 0.67.

We can confirm that this performance is as good as simply regressing on all the variables without the PCA, so we have at least not lost information via our dimensionality reduction.

# fit a model to the original data
vars <- setdiff(colnames(dTrain),'y')
formulaB <- paste('y',paste(vars,collapse=' + '),sep=' ~ ')
modelB <- lm(formulaB,data=dTrain)
dTrainestimate <- predict(modelB,newdata=dTrain)
rsq(dTrainestimate,dTrain$y)
## [1] 0.5052081
dTestestimate <- predict(modelB,newdata=dTest)
rsq(dTestestimate,dTest$y)
## [1] 0.4751995

We will show in our next article how to get a similar test r-squared from this data using a model with only two variables.

Are we done?

Scaling the variables improves the performance of PCR on this data relative to not scaling, but we haven’t completely solved the problem (though some analysts are fooled into thinking thusly). We have not explicitly recovered the two processes that drive y, and recovering such structure in the data is one of the purposes of PCA — if we did not care about the underlying structure of the problem, we could simply fit a model to the original data, or use other methods (like significance pruning) to reduce the problem dimensionality.

It is a misconception in some fields that the variables must be orthogonal before fitting a linear regression model. This is not true. A linear model fit to collinear variables can still predict well; the only downside is that the coefficients of the model are not necessarily as easily interpretable as they are when the variables are orthogonal (and ideally, centered and scaled, as well). If your data has so much collinearity that the design matrix is ill-conditioned, causing the model coefficients to be inappropriately large or unstable, then regularization (ridge, lasso, or elastic-net regression) is a good solution. More complex predictive modeling approaches, for example random forest or gradient boosting, also tend to be more immune to collinearity.
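For completeness, here is a hedged sketch (not from the original post) of the regularization option mentioned above: ridge regression fit directly to the original, collinear variables using glmnet, where alpha = 0 selects the ridge penalty and the penalty strength is chosen by cross-validation.

library(glmnet)

xTrain <- as.matrix(dTrain[, setdiff(colnames(dTrain), "y")])
xTest  <- as.matrix(dTest[, setdiff(colnames(dTest), "y")])

# Cross-validated ridge regression (alpha = 0) on the untransformed variables.
cvfit <- cv.glmnet(xTrain, dTrain$y, alpha = 0)
ridgeEstimate <- as.numeric(predict(cvfit, newx = xTest, s = "lambda.min"))
rsq(ridgeEstimate, dTest$y)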

So if you are doing PCR, you presumably are interested in the underlying structure of the data, and in this case, we haven’t found it. Projecting onto the first few principal components fails to show much of a relation between these components and y.

We can confirm the first two x-scaled principal components are not informative with the following graph.

proj <- extractProjection(2,princ)
# apply projection
projectedTrain <- as.data.frame(dmTrain %*% proj,
                      stringsAsFactors = FALSE)
projectedTrain$y <- dTrainNTreatedXscaled$y
# plot data sorted by principal components
ScatterHistN(projectedTrain,'PC1','PC2','y',
               "x scaled Data projected to first two principal components")

We see that y is not well ordered by PC1 and PC2 here, as it was in the ideal case, and as it will be with the y-aware PCA.

In our next article we will show that we can explain almost 50% of the y variance in this data using only two variables. This is quite good, as even the "all variable" model only picks up about that much of the relation, and y by design has about 33% unexplainable variation. In addition to showing the standard methods (including variable pruning), we will introduce a technique we call "y-aware scaling."

References

Everitt, B. S. The Cambridge Dictionary of Statistics, 2nd edition, Cambridge University Press, 2005.

Jolliffe, Ian T. “A Note on the Use of Principal Components in Regression,” Journal of the Royal Statistical Society. Series C (Applied Statistics), Vol. 31, No. 3 (1982), pp. 300-303

To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog.


Principal Components Regression in R, an operational tutorial


(This article was first published on Revolutions, and kindly contributed to R-bloggers)

John Mount Ph. D.
Data Scientist at Win-Vector LLC

Win-Vector LLC's Dr. Nina Zumel has just started a two part series on Principal Components Regression that we think is well worth your time. You can read her article here. Principal Components Regression (PCR) is the use of Principal Components Analysis (PCA) as a dimension reduction step prior to linear regression. It is one of the best known dimensionality reduction techniques and a staple procedure in many scientific fields. PCA is used because:

  • It can find important latent structure and relations.
  • It can reduce overfitting.
  • It can ease the curse of dimensionality.
  • It is used in a ritualistic manner in many scientific disciplines. In some fields it is considered ignorant and uncouth to regress using original variables.

We often find ourselves having to remind readers that this last reason is not actually a positive. The standard derivation of PCA involves trotting out the math and showing the determination of eigenvector directions. It yields visually attractive diagrams such as the following.

[Figure: GaussianScatterPCA (source: Wikipedia: PCA)]

And this leads to a deficiency in much of the teaching of the method: glossing over the operational consequences and outcomes of applying the method. The mathematics is important to the extent it allows you to reason about the appropriateness of the method, the consequences of the transform, and the pitfalls of the technique. The mathematics is also critical to the correct implementation, but that is what one hopes is already supplied in a reliable analysis platform (such as R).

And this leads to a deficiency in much of the teaching of the method: glossing over the operational consequences and outcomes of applying the method. The mathematics is important to the extent it allows you to reason about the appropriateness of the method, the consequences of the transform, and the pitfalls of the technique. The mathematics is also critical to the correct implementation, but that is what one hopes is already supplied in a reliable analysis platform (such as R).

To leave a comment for the author, please follow the link and comment on their blog: Revolutions.


IP string to integer conversion with Rcpp


(This article was first published on Opiate for the masses, and kindly contributed to R-bloggers)

IP address conversion

At work I recently had to match data on IP addresses with some fuzzy timestamp matching thrown in – a mess, to say the least. But before I could even tackle that problem, I found that one dataset had the IPs stored as character strings (e.g. 10.0.0.0), while the other dataset had the IP addresses converted to integers (e.g. 167772160).

Storing IPs as integers has the advantage of saving some space and making calculations easier. This page goes into detail on how this conversion is made. You split the IP address into the four octets and then shift each octet by sets of 8 bits:

(first octet * 256³) + (second octet * 256²) + (third octet * 256) + (fourth octet)

Using 10.0.0.0 as an example:

(10*256^3) + (0*256^2) + (0*256^1) + (0*256^0) = 167772160

Converting in R

Since it's a simple mathematical conversion, it's easy to write a function that converts an IP to an integer, and back. Stack Overflow has an answer here.
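For reference, a minimal base-R sketch (not the Stack Overflow answer itself) of the two conversions might look like this; note that addresses above 2^31 - 1 overflow R's 32-bit integers, so the arithmetic is kept in doubles.

ip_to_int <- function(ip) {
  # split the dotted quad and shift each octet by the appropriate power of 256
  sapply(strsplit(ip, ".", fixed = TRUE), function(octets) {
    sum(as.numeric(octets) * 256^(3:0))
  })
}

int_to_ip <- function(x) {
  # recover the four octets by integer division and modulo
  sapply(x, function(v) paste((v %/% 256^(3:0)) %% 256, collapse = "."))
}

ip_to_int("10.0.0.0")   # 167772160
int_to_ip(167772160)    # "10.0.0.0"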

Converting in Rcpp

During my googling, I stumbled across this blogpost, which solved the problem with some C++ code using the magic of boost, a C++ library with lots of nice functions.

Since my data had several million rows, anything that speeds up conversions is generally a good idea! I tried the code available at their site, which threw some errors on comment characters. After removing the comments, everything worked nicely (don't forget to install the boost libraries: sudo apt-get install libboost-dev):

#include <Rcpp.h> 
#include <boost/asio/ip/address_v4.hpp>

using namespace Rcpp;

// [[Rcpp::export]]
unsigned long rinet_pton (CharacterVector ip) { 
  return(boost::asio::ip::address_v4::from_string(ip[0]).to_ulong());
}

// [[Rcpp::export]]
CharacterVector rinet_ntop (unsigned long addr) {
  return(boost::asio::ip::address_v4(addr).to_string());
}

Running the code in R is simple, and you’ll get the result without any problems:

library(Rcpp)
library(inline)
Rcpp::sourceCpp("iputils.cpp")

# test convert an IPv4 string to integer
rinet_pton("10.0.0.0")
#[1] 167772160

# test conversion back
rinet_ntop(167772160)
#[1] "10.0.0.0"

Unfortunately, the result returned is a scalar. Running the command in a mutate() only returns the first IP for all rows.
So, I took to vectorising the code. Time to grab Hadley's excellent Advanced R website/book, specifically the Rcpp section. Checking the C++ code, I noticed that rinet_pton returns a scalar (unsigned long), even though a vector is used as the input (CharacterVector). Moreover, it will always return the conversion of the first IP in the character vector input: from_string(ip[0]).

Going by the Rcpp documentation, I changed the inputs and return values to vectors and wrote a quick C++ loop to vectorise the functions.

// [[Rcpp::export]]
IntegerVector rinet_pton (CharacterVector ip) { 
  int n = ip.size();
  IntegerVector out(n);
  
  for(int i = 0; i < n; ++i) {
    out[i] = boost::asio::ip::address_v4::from_string(ip[i]).to_ulong();
  }
  return out;
}


// [[Rcpp::export]]
CharacterVector rinet_ntop (IntegerVector addr) {
  int n = addr.size();
  CharacterVector out(n);
  
  for(int i = 0; i < n; ++i) {
    out[i] = boost::asio::ip::address_v4(addr[i]).to_string();
  }
  return out;
}

With the functions now vectorised, it’s easy to pass vectors, and run the function on a dataframe column.

rinet_pton(c("10.0.0.0", "192.168.0.1"))
# [1]   167772160 -1062731775

I should note that I know nothing of C++ programming, and this was fully hacked together by following the Rcpp examples. The rinet_ntop() function throws an error on negative numbers (it expects an unsigned long), so you can't reconvert the 192.168.0.1 IP to an integer and back. This was not a problem for me, since all I needed was to match the IPs, and my integer IPs in the one dataset were created via boost in the first place.

The code is available on github as a gist.

IP string to integer conversion with Rcpp was originally published by Kirill Pomogajko at Opiate for the masses on May 19, 2016.

To leave a comment for the author, please follow the link and comment on their blog: Opiate for the masses.


RTCGA factory of R packages – Quick Guide


(This article was first published on r-addict.com, and kindly contributed to R-bloggers)

Yesterday the new version of R was released – R 3.3.0 (codename Supposedly Educational). This enabled Bioconductor (yes, not all packages are distributed on CRAN) to release its new version, 3.3. This means that all packages hosted on Bioconductor that were under rapid and vivid development have been moved to stable-release versions and can now be easily installed. This happens once or twice a year. With that release I have finished work on the RTCGA package and released the RTCGA Factory of R Packages on Bioconductor. Read this quick guide to find out more about this R toolkit for biostatistics based on data from The Cancer Genome Atlas study.

About The Cancer Genome Atlas

The Cancer Genome Atlas (TCGA) is a comprehensive and coordinated effort to accelerate our understanding of the molecular basis of cancer through the application of genome analysis technologies, including large-scale genome sequencing – http://cancergenome.nih.gov/.

Our team converted selected datasets from this study into a few separate packages that are hosted on Bioconductor. These R packages make selected datasets easier to access and manage. The data sets in the RTCGA packages are large and cover complex relations between clinical outcomes and genetic background.

To use RTCGA, install the package following the instructions from its Bioconductor home page:

## try http:// if https:// URLs are not supported
source("https://bioconductor.org/biocLite.R")
biocLite("RTCGA")

Check, Download and Read Data

Packages from the RTCGA factory will be useful for at least three audiences: biostatisticians who work with cancer data; researchers working on large-scale algorithms, for whom RTCGA data will be a perfect proving ground; and teachers presenting data analysis methods on real data problems.

library(RTCGA)

TCGA releases various datasets over time for different cohorts, which are determined by cancer type. One can check:

  • infoTCGA() – what are cohort codes and counts for each cohort from TCGA project,
  • checkTCGA('Dates') – what are TCGA datasets’ dates of release,
  • checkTCGA('DataSets', cancerType = "BRC") – what are TCGA datasets’ names for current release date and cohort.

With that knowledge we are able to download specific datasets from the TCGA study. The following command downloads datasets that have the string Merge_Clinical.Level_1 in their names, for the BRCA cohort (breast carcinoma) and the 2015-11-01 release date.

downloadTCGA(cancerTypes = "BRCA",
             dataSet = "Merge_Clinical.Level_1",
             destDir = "output_dir",
             date = "2015-11-01")

For specific datasets (8 types) we have prepared the readTCGA function, which reads a dataset into tidy format using the data.table::fread function. For expression datasets we also convert column types to their natural numeric values.

readTCGA(path = file.path("output_dir",
                          grep("clinical_clin_format.txt",
                               list.files("output_dir/",
                                          recursive = TRUE),
                               value = TRUE)
                          ),
         dataType = "clinical") -> BRCA.clinical.20151101
dim(BRCA.clinical.20151101)
[1] 1098 1494

Prepared Available Datasets

For the most popular dataset types we have prepared data packages that provide various genetic information for the 2015-11-01 TCGA release date. You can read about those datasets and install them with

?datasetsTCGA
?installTCGA

Those datasets can be converted to Bioconductor formats with the convertTCGA function. You can find the full documentation, prepared with staticdocs, here – http://rtcga.github.io/RTCGA/staticdocs/.
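A hedged sketch of that workflow (the exact arguments are my assumption; check ?installTCGA and ?convertTCGA for the precise signatures in your version):

library(RTCGA)
installTCGA("RTCGA.clinical")                    # install one of the prepared data packages
library(RTCGA.clinical)
BRCA.clinical.bioc <- convertTCGA(BRCA.clinical) # convert to a Bioconductor representation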

Manipulate and Visualize Data

For the prepared datasets we provide functions to manipulate and visualize the results of statistical procedures such as Principal Component Analysis (based on ggbiplot) or estimates of Kaplan-Meier survival curves (based on the elegant survminer package). Check a few examples below.

Survival Curves

library(RTCGA.clinical)
survivalTCGA(BRCA.clinical,
             OV.clinical,
             extract.cols = "admin.disease_code") -> BRCAOV.survInfo
## Kaplan-Meier Survival Curves
kmTCGA(BRCAOV.survInfo,
       explanatory.names = "admin.disease_code",
       pval = TRUE,
       xlim = c(0,2000),
       break.time.by = 500)

(figure: Kaplan-Meier survival curves for the BRCA and OV cohorts)

PCA Biplot

library(dplyr)
## RNASeq expressions
library(RTCGA.rnaseq)
expressionsTCGA(BRCA.rnaseq, OV.rnaseq, HNSC.rnaseq) %>%
   rename(cohort = dataset) %>%  
   filter(substr(bcr_patient_barcode, 14, 15) == "01") -> 
   BRCA.OV.HNSC.rnaseq.cancer

pcaTCGA(BRCA.OV.HNSC.rnaseq.cancer,
        group.names = "cohort",
        title = "Genes expressions vs cohort types")

(figure: PCA biplot of gene expressions by cohort)

For more visualization examples visit the RTCGA project website. If you have noticed any bugs or have any reflections, please open an issue in the project’s repository or post a comment in the Disqus panel below.

To leave a comment for the author, please follow the link and comment on their blog: r-addict.com.


Installing WVPlots and “knitting R markdown”


(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

Some readers have been having a bit of trouble using devtools to install WVPlots. I thought I would write a note with a few instructions to help. These are things you should not have to do often, and things those of us already running R have stumbled through and forgotten about.

First you will need install (likely admin) privileges on your machine and a network connection that is not blocking any of CRAN, RStudio or GitHub.

Make sure you have up to date copies of both R and RStudio. We have to assume you are somewhat familiar with R and RStudio (if you are not, we suggest working through a tutorial first).

Once you have these we will try to “knit” or render an R Markdown document. To do this start RStudio and select File->"New File"->"R Markdown" as shown below (menus may differ between systems, so you may have to look around a bit).

(screenshot: creating a new R Markdown document in RStudio)

Then click “OK”. Then press the “Knit HTML” button as shown in the next figure.

(screenshot: the “Knit HTML” button in RStudio)

This will ask you to pick a filename to save as (anything ending in “.Rmd” will do). If RStudio asks to install anything let it. In the end you should get a rendered copy of RStudio’s example document. If any of this doesn’t work you can look to RStudio documentation.

Assuming the above worked paste the following commands into RStudio’s “Console” window (entering a “return” after the paste to ensure execution).
[Note any time we say paste or type, watch out for any errors caused by conversion of normal machine quotes to insidious smart quotes.]

install.packages(c('RCurl','ggplot2','tidyr',
                    'devtools','knitr'))

The set of packages you actually need can usually be found by looking at the R code you wish to run and noting any library() or :: calls. R scripts and worksheets tend not to install packages on their own, as that would be a bit invasive.
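As a small illustrative sketch (my own, not from this post), you can list the packages a script references by scanning it for library() or require() calls:

listDeps <- function(path) {
  txt <- readLines(path, warn = FALSE)
  # find library(...) / require(...) calls, then strip the wrappers to get package names
  calls <- regmatches(txt, gregexpr("(library|require)\\(['\"]?[A-Za-z0-9._]+['\"]?\\)", txt))
  unique(gsub("(library|require)\\(['\"]?|['\"]?\\)", "", unlist(calls)))
}
# listDeps("XonlyPCA.Rmd")   # e.g. "WVPlots" "ggplot2" ...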

If the above commands execute without error (messages and warnings are okay) you can then try the command below to install WVPlots:

devtools::install_github('WinVector/WVPlots',
                        build_vignettes=TRUE)

If the above fails (some Windows users are seeing “curl” errors) it can be a problem with your machine (perhaps permissions, or no curl library installed), network, anti-virus, or firewall software. If it does fail you can try to install WVPlots yourself by doing the following:

  1. Navigate a web browser to http://winvector.github.io/WVPlots/.
  2. From there download the file WVPlots_0.1.tar.gz.
  3. In the RStudio “Console” window type: install.packages('~/Downloads/WVPlots_0.1.tar.gz',repos=NULL) (replacing '~/Downloads/WVPlots_0.1.tar.gz' with wherever you downloaded WVPlots_0.1.tar.gz to).

If the above worked you can test the WVPlots package by typing library("WVPlots").

Now you can try knitting one of our example worksheets.

  1. Navigate a web browser to https://github.com/WinVector/Examples/blob/master/PCR/XonlyPCA.Rmd
  2. Download the file XonlyPCA.Rmd by right-clicking on the “Raw” button (towards the top right).
  3. Rename the downloaded file from XonlyPCA.Rmd.txt to XonlyPCA.Rmd.
  4. In Rstudio use File->"Open File" to open XonlyPCA.Rmd.
  5. Press the “Knit HTML” button (top middle of the editor pane) and this should produce the rendered result.

If this isn’t working then something is either not installed or configured correctly, or something is blocking access (such as anti-virus software or firewall software). The best thing to do is find another local R user and debug together.

To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog.


Tutorial: GitHub for Data Scientists without the Terminal


(This article was first published on R – Modern Data, and kindly contributed to R-bloggers)

Git and GitHub are indispensable tools for anyone analysing data, developing software or disseminating results. Originally designed for software engineers, GitHub is now widely used in many disciplines, especially by researchers in academia. Having source code management software such as GitHub to host your code, along with detailed project documentation, is a huge step towards ensuring research is reproducible. It also makes it easier for others to build upon the work you have already done, which leads to more efficient use of research time, not to mention that your citation count will increase.

Learning Git and GitHub can be a daunting task, especially if you’re not familiar or used to working with the command line (a.k.a terminal). With this in mind we created a new introductory tutorial, catered towards data scientists using R, titled:

GitHub for Data Scientists without the terminal

We provide step-by-step instructions and detailed screenshots to guide you along the way. You will learn about:

  1. Installing Git
  2. Signing up for a GitHub account and a Hello World tutorial
  3. Installing GitHub Desktop
  4. Version controlling R code using an example of PCA
  5. Creating a branch, pull request and merge
  6. Introduction to Git functionality in RStudio
  7. Creating and publishing an R Markdown document
  8. Creating an online CV

It is not uncommon now for employers to prioritize your GitHub portfolio over your CV. This tutorial demonstrates how simple it is to get up and running with GitHub. In addition to having an easy-to-use interface, it allows you to easily create websites and host dynamic documents. I encourage you to adopt this workflow, whether you work in industry or academia, to showcase your work, increase efficiency and ensure reproducibility.

To leave a comment for the author, please follow the link and comment on their blog: R – Modern Data.


Principal Components Regression, Pt. 2: Y-Aware Methods


(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

In our previous note, we discussed some problems that can arise when using standard principal components analysis (specifically, principal components regression) to model the relationship between independent (x) and dependent (y) variables. In this note, we present some dimensionality reduction techniques that alleviate some of those problems, in particular what we call Y-Aware Principal Components Analysis, or Y-Aware PCA. We will use our variable treatment package vtreat in the examples we show in this note, but you can easily implement the approach independently of vtreat.

What is Y-Aware PCA?

As with other geometric algorithms, principal components analysis is sensitive to the units of the data. In standard ("x-only") PCA, we often attempt to alleviate this problem by rescaling the x variables to their "natural units": that is, we rescale x by its own standard deviation. By individually rescaling each x variable to its "natural unit," we hope (but cannot guarantee) that all the data as a group will be in some "natural metric space," and that the structure we hope to discover in the data will manifest itself in this coordinate system. As we saw in the previous note, if the structure that we hope to discover is the relationship between x and y, we have even less guarantee that we are in the correct space, since the decomposition of the data was done without knowledge of y.

Y-aware PCA is simply PCA with a different scaling: we rescale the x data to be in y-units. That is, we want scaled variables x’ such that a unit change in x’ corresponds to a unit change in y. Under this rescaling, all the independent variables are in the same units, which are indeed the natural units for the problem at hand: characterizing their effect on y. (We also center the transformed variables x’ to be zero mean, as is done with standard centering and scaling).

It’s easy to determine the scaling for a variable x by fitting a linear regression model between x and y:

y = m * x + b

The coefficient m is the slope of the best fit line, so a unit change in x corresponds (on average) to a change of m units in y. If we rescale (and recenter) x as

x' := m * x - mean(m * x)

then x’ is in y units. This y-aware scaling is both complementary to variable pruning and powerful enough to perform well on its own.
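As a minimal sketch (assuming numeric x and y; this is not the vtreat implementation), the scaling for a single variable can be written as:

# y-aware scale one numeric variable: slope of the one-variable fit, then center
yAwareScale <- function(x, y) {
  m <- coef(lm(y ~ x))[["x"]]   # a unit change in x corresponds to m units of y
  xs <- m * x
  xs - mean(xs)
}
# e.g. apply to every input column (xvars is a hypothetical vector of column names):
# dYScaled <- as.data.frame(lapply(dTrain[, xvars], yAwareScale, y = dTrain$y))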

In vtreat, the treatment plan created by designTreatmentsN() will store the information needed for y-aware scaling, so that if you then prepare your data with the flag scale=TRUE, the resulting treated frame will be scaled appropriately.

An Example of Y-Aware PCA

First, let’s build our example. We will use the same data set as our earlier "X only" discussion.

In this data set, there are two (unobservable) processes: one that produces the output yA and one that produces the output yB. We only observe the mixture of the two: y = yA + yB + eps, where eps is a noise term. Think of y as measuring some notion of success and the x variables as noisy estimates of two different factors that can each drive success.

We’ll set things up so that the first five variables (x.01, x.02, x.03, x.04, x.05) have all the signal. The odd numbered variables correspond to one process (yB) and the even numbered variables correspond to the other (yA). Then, to simulate the difficulties of real world modeling, we’ll add lots of pure noise variables (noise*). The noise variables are unrelated to our y of interest — but are related to other "y-style" processes that we are not interested in. We do this because in real applications, there is no reason to believe that unhelpful variables have limited variation or are uncorrelated with each other, though things would certainly be easier if we could so assume. As we showed in the previous note, this correlation undesirably out-competed the y induced correlation among signaling variables when using standard PCA.

All the variables are also deliberately mis-scaled to model some of the difficulties of working with under-curated real world data.

Let’s start with our train and test data.

# make data
set.seed(23525)
dTrain <- mkData(1000)
dTest <- mkData(1000)

Let’s look at our outcome y and a few of our variables.

summary(dTrain[, c("y", "x.01", "x.02", "noise1.01", "noise1.02")])
##        y                 x.01               x.02        
##  Min.   :-5.08978   Min.   :-4.94531   Min.   :-9.9796  
##  1st Qu.:-1.01488   1st Qu.:-0.97409   1st Qu.:-1.8235  
##  Median : 0.08223   Median : 0.04962   Median : 0.2025  
##  Mean   : 0.08504   Mean   : 0.02968   Mean   : 0.1406  
##  3rd Qu.: 1.17766   3rd Qu.: 0.93307   3rd Qu.: 1.9949  
##  Max.   : 5.84932   Max.   : 4.25777   Max.   :10.0261  
##    noise1.01          noise1.02       
##  Min.   :-30.5661   Min.   :-30.4412  
##  1st Qu.: -5.6814   1st Qu.: -6.4069  
##  Median :  0.5278   Median :  0.3031  
##  Mean   :  0.1754   Mean   :  0.4145  
##  3rd Qu.:  5.9238   3rd Qu.:  6.8142  
##  Max.   : 26.4111   Max.   : 31.8405

Next, we’ll design a treatment plan for the frame, and examine the variable significances, as estimated by vtreat.

# design treatment plan
treatmentsN <- designTreatmentsN(dTrain,setdiff(colnames(dTrain),'y'),'y',
                                 verbose=FALSE)

scoreFrame = treatmentsN$scoreFrame
scoreFrame$vartype = ifelse(grepl("noise", scoreFrame$varName), "noise", "signal")

dotplot_identity(scoreFrame, "varName", "sig", "vartype") + 
  coord_flip()  + ggtitle("vtreat variable significance estimates")+ 
  scale_color_manual(values = c("noise" = "#d95f02", "signal" = "#1b9e77")) 

Note that the noise variables typically have large significance values, denoting statistical insignificance. Usually we recommend doing some significance pruning on variables before moving on — see here for possible consequences of not pruning an over-abundance of variables, and here for a discussion of one way to prune, based on significance. For this example, however, we will attempt dimensionality reduction without pruning.

Y-Aware PCA

Prepare the frame with y-aware scaling

Now let’s prepare the treated frame, with scaling turned on. We will deliberately turn off variable pruning by setting pruneSig = 1. In real applications, you would want to set pruneSig to a value less than one to prune insignificant variables. However, here we turn off variable pruning to show that you can recover some of pruning’s benefits via scaling effects, because the scaled noise variables should not have a major effect in the principal components analysis. Pruning by significance is in fact a good additional precaution complementary to scaling by effects.

# prepare the treated frames, with y-aware scaling
examplePruneSig = 1.0 
dTrainNTreatedYScaled <- prepare(treatmentsN,dTrain,pruneSig=examplePruneSig,scale=TRUE)
dTestNTreatedYScaled <- prepare(treatmentsN,dTest,pruneSig=examplePruneSig,scale=TRUE)

# get the variable ranges
ranges = vapply(dTrainNTreatedYScaled, FUN=function(col) c(min(col), max(col)), numeric(2))
rownames(ranges) = c("vmin", "vmax") 
rframe = as.data.frame(t(ranges))  # make ymin/ymax the columns
rframe$varName = rownames(rframe)
varnames = setdiff(rownames(rframe), "y")
rframe = rframe[varnames,]
rframe$vartype = ifelse(grepl("noise", rframe$varName), "noise", "signal")

# show a few columns
summary(dTrainNTreatedYScaled[, c("y", "x.01_clean", "x.02_clean", "noise1.02_clean", "noise1.02_clean")])
##        y              x.01_clean         x.02_clean      
##  Min.   :-5.08978   Min.   :-2.65396   Min.   :-2.51975  
##  1st Qu.:-1.01488   1st Qu.:-0.53547   1st Qu.:-0.48904  
##  Median : 0.08223   Median : 0.01063   Median : 0.01539  
##  Mean   : 0.08504   Mean   : 0.00000   Mean   : 0.00000  
##  3rd Qu.: 1.17766   3rd Qu.: 0.48192   3rd Qu.: 0.46167  
##  Max.   : 5.84932   Max.   : 2.25552   Max.   : 2.46128  
##  noise1.02_clean      noise1.02_clean.1   
##  Min.   :-0.0917910   Min.   :-0.0917910  
##  1st Qu.:-0.0186927   1st Qu.:-0.0186927  
##  Median : 0.0003253   Median : 0.0003253  
##  Mean   : 0.0000000   Mean   : 0.0000000  
##  3rd Qu.: 0.0199244   3rd Qu.: 0.0199244  
##  Max.   : 0.0901253   Max.   : 0.0901253
barbell_plot(rframe, "varName", "vmin", "vmax", "vartype") +
  coord_flip() + ggtitle("y-scaled variables: ranges") + 
  scale_color_manual(values = c("noise" = "#d95f02", "signal" = "#1b9e77"))

Notice that after the y-aware rescaling, the signal carrying variables have larger ranges than the noise variables.

The Principal Components Analysis

Now we do the principal components analysis. In this case it is critical that the scale parameter in prcomp is set to FALSE so that it does not undo our own scaling. Notice the magnitudes of the singular values fall off quickly after the first two to five values.

vars <- setdiff(colnames(dTrainNTreatedYScaled),'y')
# prcomp defaults to scale. = FALSE, but we already scaled/centered in vtreat- which we don't want to lose.
dmTrain <- as.matrix(dTrainNTreatedYScaled[,vars])
dmTest <- as.matrix(dTestNTreatedYScaled[,vars])
princ <- prcomp(dmTrain, center = FALSE, scale. = FALSE)
dotplot_identity(frame = data.frame(pc=1:length(princ$sdev), 
                            magnitude=princ$sdev), 
                 xvar="pc",yvar="magnitude") +
  ggtitle("Y-Scaled variables: Magnitudes of singular values")

When we look at the variable loadings of the first five principal components, we see that we recover the even/odd loadings of the original signal variables. PC1 has the odd variables, and PC2 has the even variables. These two principal components carry most of the signal. The next three principal components complete the basis for the five original signal variables. The noise variables have very small loadings, compared to the signal variables.

proj <- extractProjection(2,princ)
rot5 <- extractProjection(5,princ)
rotf = as.data.frame(rot5)
rotf$varName = rownames(rotf)
rotflong = gather(rotf, "PC", "loading", starts_with("PC"))
rotflong$vartype = ifelse(grepl("noise", rotflong$varName), "noise", "signal")

dotplot_identity(rotflong, "varName", "loading", "vartype") + 
  facet_wrap(~PC,nrow=1) + coord_flip() + 
  ggtitle("Y-Scaled Variable loadings, first five principal components") + 
  scale_color_manual(values = c("noise" = "#d95f02", "signal" = "#1b9e77"))

Let’s look at the projection of the data onto its first two principal components, using color to code the y value. Notice that y increases both as we move up and as we move right. We have recovered two features that correlate with an increase in y. In fact, PC1 corresponds to the odd signal variables, which correspond to process yB, and PC2 corresponds to the even signal variables, which correspond to process yA.

# apply projection
projectedTrain <- as.data.frame(dmTrain %*% proj,
                      stringsAsFactors = FALSE)
# plot data sorted by principal components
projectedTrain$y <- dTrainNTreatedYScaled$y
ScatterHistN(projectedTrain,'PC1','PC2','y',
               "Y-Scaled Training Data projected to first two principal components")

Now let’s fit a linear regression model to the first two principal components.

model <- lm(y~PC1+PC2,data=projectedTrain)
summary(model)
## 
## Call:
## lm(formula = y ~ PC1 + PC2, data = projectedTrain)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.3470 -0.7919  0.0172  0.7955  3.9588 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.08504    0.03912   2.174     0.03 *  
## PC1          0.78611    0.04092  19.212   <2e-16 ***
## PC2          1.03243    0.04469  23.101   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.237 on 997 degrees of freedom
## Multiple R-squared:  0.4752, Adjusted R-squared:  0.4742 
## F-statistic: 451.4 on 2 and 997 DF,  p-value: < 2.2e-16
projectedTrain$estimate <- predict(model,newdata=projectedTrain)
trainrsq = rsq(projectedTrain$estimate,projectedTrain$y)

ScatterHist(projectedTrain,'estimate','y','Recovered model versus truth (y aware PCA train)',
            smoothmethod='identity',annot_size=3)

This model, with only two variables, explains 47.52% of the variation in y. This is comparable to the variance explained by the model fit to twenty principal components using x-only PCA (as well as a model fit to all the original variables) in the previous note.

Let’s see how the model does on hold-out data.

# apply projection
projectedTest <- as.data.frame(dmTest %*% proj,
                      stringsAsFactors = FALSE)
# plot data sorted by principal components
projectedTest$y <- dTestNTreatedYScaled$y
ScatterHistN(projectedTest,'PC1','PC2','y',
               "Y-Scaled Test Data projected to first two principal components")

projectedTest$estimate <- predict(model,newdata=projectedTest)
testrsq = rsq(projectedTest$estimate,projectedTest$y)
testrsq
## [1] 0.5063724
ScatterHist(projectedTest,'estimate','y','Recovered model versus truth (y aware PCA test)',
            smoothmethod='identity',annot_size=3)

We see that this two-variable model captures about 50.64% of the variance in y on hold-out — again, comparable to the hold-out performance of the model fit to twenty principal components using x-only PCA. These two principal components also do a much better job of capturing the internal structure of the data — that is, the relationship of the signaling variables to the yA and yB processes — than the first two principal components of the x-only PCA.

Is this the same as caret::preProcess?

In this note, we used vtreat, a data.frame processor/conditioner that prepares real-world data for predictive modeling in a statistically sound manner, followed by principal components regression. One could instead use caret. The caret package, as described in the documentation, "is a set of functions that attempt to streamline the process for creating predictive models."

caret::preProcess is designed to implement a number of sophisticated x alone transformations, groupings, prunings, and repairs (see caret/preprocess.html#all, which demonstrates "the function on all the columns except the last, which is the outcome" on the schedulingData dataset). So caret::preProcess is a super-version of the PCA step.

We could use it as follows either alone or before vtreat design/prepare as a initial pre-processor. Using it alone is similar to PCA for this data set, as our example doesn’t have some of the additional problems caret::preProcess is designed to help with.

library('caret')
origVars <- setdiff(colnames(dTrain),'y')
# can try variations such adding/removing non-linear steps such as "YeoJohnson"
prep <- preProcess(dTrain[,origVars],
                     method = c("center", "scale", "pca"))
prepared <- predict(prep,newdata=dTrain[,origVars])
newVars <- colnames(prepared)
prepared$y <- dTrain$y
print(length(newVars))
## [1] 44
modelB <- lm(paste('y',paste(newVars,collapse=' + '),sep=' ~ '),data=prepared)
print(summary(modelB)$r.squared)
## [1] 0.5004569
print(summary(modelB)$adj.r.squared)
## [1] 0.4774413
preparedTest <- predict(prep,newdata=dTest[,origVars])
testRsqC <- rsq(predict(modelB,newdata=preparedTest),dTest$y)
testRsqC
## [1] 0.4824284

The 44 caret-chosen PCA variables are designed to capture 95% of the in-sample explainable variation of the variables. The linear regression model fit to the selected variables explains about 50.05% of the y variance on training and 48.24% of the y variance on test. This is quite good, comparable to our previous results. However, note that caret picked more than the twenty principal components that we picked visually in the previous note, and needed far more variables than we needed with y-aware PCA.

Because caret::preProcess is x-only processing, the first few variables capture much less of the y variation. So we can’t model y without using a lot of the derived variables. To show this, let’s try fitting a model using only five of caret‘s PCA variables.

model5 <- lm(paste('y',paste(newVars[1:5],collapse=' + '),sep=' ~ '),data=prepared)
print(summary(model5)$r.squared)
## [1] 0.1352
print(summary(model5)$adj.r.squared)
## [1] 0.1308499

The first 5 variables only capture about 13.52% of the in-sample variance; without being informed about y, we can’t know which variation to preserve and which we can ignore. We certainly haven’t captured the two subprocesses that drive y in an inspectable manner.

Other Y-aware Approaches to Dimensionality Reduction

  • If your goal is regression, there are other workable y-aware dimension reducing procedures, such as L2-regularized regression or partial least squares. Both methods are also related to principal components analysis (see Hastie, et al. 2009).

Bair, et al. proposed a variant of principal components regression that they call Supervised PCR. In supervised PCR, as described in their 2006 paper, a univariate linear regression model is fit to each variable (after scaling and centering), and any variable whose coefficient (what we called m above) has a magnitude less than some threshold (theta) is pruned. PCR is then done on the remaining variables. Conceptually, this is similar to the significance pruning that vtreat offers, except that the pruning criterion is "effects-based" (that is, it’s based on the magnitude of a parameter, or the strength of an effect) rather than probability-based, such as pruning on significance.

One issue with an effects-based pruning criterion is that the appropriate pruning threshold varies from problem to problem, and not necessarily in an obvious way. Bair, et al. find an appropriate threshold via cross-validation. Probability-based thresholds are in some sense more generalizable from problem to problem, since the score is always in probability units — the same units for all problems. A simple variation of supervised PCR might prune on the significance of the coefficient m, as determined by its t-statistic. This would be essentially equivalent to significance pruning of the variables via vtreat before standard PCR.
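As a tiny illustration (using the dTrain frame from the example above), both quantities come out of the same one-variable fit: the coefficient m is the effect size, and its t-statistic's p-value is the significance.

fit1 <- lm(y ~ x.01, data = dTrain)
summary(fit1)$coefficients["x.01", ]  # Estimate (m), Std. Error, t value, Pr(>|t|)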

Note that vtreat uses the significance of the one-variable model fits, not coefficient significance to estimate variable significance. When both the dependent and independent variables are numeric, the model significance and the coefficient significance are identical (see Weisberg, Applied Linear Regression). In more general modeling situations where either the outcome is categorical or the original input variable is categorical with many degrees of freedom, they are not the same, and, in our opinion, using the model significance is preferable.

In general modeling situations where you are not specifically interested in the structure of the feature space, as described by the principal components, then we recommend significance pruning of the variables. As a rule of thumb, we suggest setting your significance pruning threshold based on the rate at which you can tolerate bad variables slipping into the model. For example, setting the pruning threshold at (p=0.05) would let pure noise variables in at the rate of about 1 in 20 in expectation. So a good upper bound on the pruning threshold might be 1/nvar, where nvar is the number of variables. We discuss this issue briefly here in the vtreat documentation.
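A short sketch of that rule of thumb, using the score frame computed earlier:

nvar <- length(setdiff(colnames(dTrain), 'y'))
pruneThreshold <- 1/nvar
sf <- treatmentsN$scoreFrame
keepVars <- sf$varName[sf$sig < pruneThreshold]   # candidate variables that survive pruning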

vtreat does not supply any joint variable dimension reduction as we feel dimension reduction is a modeling task. vtreat is intended to limit itself to only necessary "prior to modeling" processing and includes significance pruning reductions because such pruning can be necessary prior to modeling.

Conclusion

In our experience, there are two camps of analysts: those who never use principal components regression and those who use it far too often. While principal components analysis is a useful data conditioning method, it is sensitive to distances and geometry. Therefore it is only to be trusted when the variables are curated, pruned, and in appropriate units. Principal components regression should not be used blindly; it requires proper domain aware scaling, initial variable pruning, and posterior component pruning. If the goal is regression many of the purported benefits of principal components regression can be achieved through regularization.

The general principles are widely applicable, and often re-discovered and re-formulated in useful ways (such as autoencoders).

In our next note, we will look at some ways to pick the appropriate number of principal components procedurally.

References

  • Bair, Eric, Trevor Hastie, Debashis Paul and Robert Tibshirani, "Prediction by Supervised Principal Components", Journal of the American Statistical Association, Vol. 101, No. 473 (March 2006), pp. 119-137.

  • Hastie, Trevor, Robert Tibshirani, and Jerome Friedman, The Elements of Statistical Learning, 2nd Edition, 2009.

  • Weisberg, Sanford, Applied Linear Regression, Third Edition, Wiley, 2005.

To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog.


Principal Components Regression, Pt. 3: Picking the Number of Components


(This article was first published on R – Win-Vector Blog, and kindly contributed to R-bloggers)

In our previous note we demonstrated Y-Aware PCA and other y-aware approaches to dimensionality reduction in a predictive modeling context, specifically Principal Components Regression (PCR). For our examples, we selected the appropriate number of principal components by eye. In this note, we will look at ways to select the appropriate number of principal components in a more automated fashion.

Before starting the discussion, let’s quickly redo our y-aware PCA. Please refer to our previous post for a full discussion of this data set and this approach.

#
# make data
#
set.seed(23525)
dTrain <- mkData(1000)
dTest <- mkData(1000)

#
# design treatment plan
#
treatmentsN <- designTreatmentsN(dTrain,
                                 setdiff(colnames(dTrain),'y'),'y',
                                 verbose=FALSE)

#
# prepare the treated frames, with y-aware scaling
#
examplePruneSig = 1.0 
dTrainNTreatedYScaled <- prepare(treatmentsN,dTrain,
                                 pruneSig=examplePruneSig,scale=TRUE)
dTestNTreatedYScaled <- prepare(treatmentsN,dTest,
                                pruneSig=examplePruneSig,scale=TRUE)

#
# do the principal components analysis
#
vars <- setdiff(colnames(dTrainNTreatedYScaled),'y')
# prcomp defaults to scale. = FALSE, but we already 
# scaled/centered in vtreat- which we don't want to lose.
dmTrain <- as.matrix(dTrainNTreatedYScaled[,vars])
dmTest <- as.matrix(dTestNTreatedYScaled[,vars])
princ <- prcomp(dmTrain, center = FALSE, scale. = FALSE)

If we examine the magnitudes of the resulting singular values, we see that we should use from two to five principal components for our analysis. In fact, as we showed in the previous post, the first two singular values accurately capture the two unobservable processes that contribute to y, and a linear model fit to these two components captures most of the explainable variance in the data, both on training and on hold-out data.

We picked the number of principal components to use by eye; but it’s tricky to implement code based on the strategy "look for a knee in the curve." So how might we automate picking the appropriate number of components in a reliable way?

X-Only Approaches

Jackson (1993) and Peres-Neto, et al. (2005) are two excellent surveys and evaluations of the different published approaches to picking the number of components in standard PCA. Those methods include:

  1. Look for a "knee in the curve" — the approach we have taken, visually.
  2. Only for data that has been scaled to unit variance: keep the components corresponding to singular values greater than 1.
  3. Select enough components to cover some fixed fraction (generally 95%) of the observed variance. This is the approach taken by caret::preProcess.
  4. Perform a statistical test to see which singular values are larger than we would expect from an appropriate null hypothesis or noise process.

The papers also cover other approaches, as well as different variations of the above.
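For example, approach (3) from the list above is easy to express directly from the PCA fit (a quick sketch using the princ object computed earlier):

varianceFraction <- cumsum(princ$sdev^2) / sum(princ$sdev^2)
nComp95 <- which(varianceFraction >= 0.95)[[1]]   # components needed to cover 95% of the variance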

Kabacoff (R In Action, 2nd Edition, 2015) suggests comparing the magnitudes of the singular values to those extracted from random matrices of the same shape as the original data. Let’s assume that the original data has k variables, and that PCA on the original data extracts the k singular values si and the k principal components PCi. To pick the appropriate number of principal components:

  1. For a chosen number of iterations, N (choose N >> k):
  • Generate a random matrix of the correct size
  • Do PCA and extract the singular values
  2. Then, for each of the k principal components:
  • Find the mean of the ith singular value over the N iterations, ri
  • If si > ri, then keep PCi

The idea is that if there is more variation in a given direction than you would expect at random, then that direction is probably meaningful. If you assume that higher variance directions are more useful than lower variance directions (the usual assumption), then one handy variation is to find the first i such that si < ri, and keep the first i-1 principal components.
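A hedged sketch of that comparison (my own, not the article’s code): estimate the ri by repeatedly doing PCA on random matrices of the same shape, then keep components whose singular values exceed those thresholds.

randomSV <- function(nrows, ncols, niter = 100) {
  svs <- replicate(niter,
                   prcomp(matrix(rnorm(nrows * ncols), nrows, ncols))$sdev)
  rowMeans(svs)   # r_i: mean i-th singular value under pure noise
}
# rthresh <- randomSV(nrow(dmTrain), ncol(dmTrain))
# numToKeep <- which(princ$sdev < rthresh)[[1]] - 1   # keep the first i-1 components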

This approach is similar to what the authors of the survey papers cited above refer to as the broken-stick method. In their research, the broken-stick method was among the best performing approaches for a variety of simulated and real-world examples.

With the proper adjustment, all of the above heuristics work as well in the y-adjusted case as they do with traditional x-only PCA.

A Y-Aware Approach: The Permutation Test

Since in our case we know y, we can — and should — take advantage of this information. We will use a variation of the broken-stick method, but rather than comparing our data to a random matrix, we will compare our data to alternative datasets where x has no relation to y. We can do this by randomly permuting the y values. This preserves the structure of x — that is, the correlations and relationships of the x variables to each other — but it changes the units of the problem, that is, the y-aware scaling. We are testing whether or not a given principal component appears more meaningful in a metric space induced by the true y than it does in a random metric space, one that preserves the distribution of y, but not the relationship of y to x.

You can read a more complete discussion of permutation tests and their application to variable selection (significance pruning) in this post.

In our example, we’ll use N=100, and rather than using the means of the singular values from our experiments as the thresholds, we’ll use the 98th percentiles. This represents a threshold value that is likely to be exceeded by a singular value induced in a random space only 1/(the number of variables) (1/50=0.02) fraction of the time.

#
# Resample y, do y-aware PCA, 
# and return the singular values
#
getResampledSV = function(data,yindices) {
  # resample y
  data$y = data$y[yindices]
  
  # treatment plan
  treatplan = vtreat::designTreatmentsN(data, 
                                setdiff(colnames(data), 'y'), 
                                'y', verbose=FALSE)
  # y-aware scaling
  dataTreat = vtreat::prepare(treatplan, data, pruneSig=1, scale=TRUE)
  
  # PCA
  vars = setdiff(colnames(dataTreat), 'y')
  dmat = as.matrix(dataTreat[,vars])
  princ = prcomp(dmat, center=FALSE, scale=FALSE)
  
  # return the magnitudes of the singular values
  princ$sdev
}

#
# Permute y, do y-aware PCA, 
# and return the singular values
#
getPermutedSV = function(data) {
  n = nrow(data)
  getResampledSV(data,sample(n,n,replace=FALSE))
}

#
# Run the permutation tests and collect the outcomes
#
niter = 100 # should be >> nvars
nvars = ncol(dTrain)-1
# matrix: 1 column for each iter, nvars rows
svmat = vapply(1:niter, FUN=function(i) {getPermutedSV(dTrain)}, numeric(nvars))
rownames(svmat) = colnames(princ$rotation) # rows are principal components
colnames(svmat) = paste0('rep',1:niter) # each col is an iteration

# plot the distribution of values for the first singular value
# compare it to the actual first singular value
ggplot(as.data.frame(t(svmat)), aes(x=PC1)) + 
  geom_density() + geom_vline(xintercept=princ$sdev[[1]], color="red") +
  ggtitle("Distribution of magnitudes of first singular value, permuted data")

Here we show the distribution of the magnitude of the first singular value on the permuted data, and compare it to the magnitude of the actual first singular value (the red vertical line). We see that the actual first singular value is far larger than the magnitude you would expect from data where x is not related to y. Let’s compare all the singular values to their permutation test thresholds. The dashed line is the mean value of each singular value from the permutation tests; the shaded area represents the 98th percentile.

# transpose svmat so we get one column for every principal component
# Get the mean and empirical confidence level of every singular value
as.data.frame(t(svmat)) %>% dplyr::summarize_each(funs(mean)) %>% as.numeric() -> pmean
confF <- function(x) as.numeric(quantile(x,1-1/nvars))
as.data.frame(t(svmat)) %>% dplyr::summarize_each(funs(confF)) %>% as.numeric() -> pupper

pdata = data.frame(pc=seq_len(length(pmean)), magnitude=pmean, upper=pupper)

# we will use the first place where the singular value falls 
# below its threshold as the cutoff.
# Obviously there are multiple comparison issues on such a stopping rule,
# but for this example the signal is so strong we can ignore them.
below = which(princ$sdev < pdata$upper)
lastSV = below[[1]] - 1

This test suggests that we should use 5 principal components, which is consistent with what our eye sees. This is perhaps not the "correct" knee in the graph, but it is undoubtedly a knee.

Bootstrapping

Empirically estimating the quantiles from the permuted data so that we can threshold the non-informative singular values will have some undesirable bias and variance, especially if we do not perform enough experiment replications. This suggests that instead of estimating quantiles ad-hoc, we should use a systematic method: The Bootstrap. Bootstrap replication breaks the input to output association by re-sampling with replacement rather than using permutation, but comes with built-in methods to estimate bias-adjusted confidence intervals. The methods are fairly technical, and on this dataset the results are similar, so we don’t show them here, although the code is available in the R markdown document used to produce this note.

Significance Pruning

Alternatively, we can treat the principal components that we extracted via y-aware PCA simply as transformed variables — which is what they are — and significance prune them in the standard way. As our article on significance pruning discusses, we can estimate the significance of a variable by fitting a one variable model (in this case, a linear regression) and looking at that model’s significance value. You can pick the pruning threshold by considering the rate of false positives that you are willing to tolerate; as a rule of thumb, we suggest one over the number of variables.

In regular significance pruning, you would take any variable with estimated significance value lower than the threshold. Since in the PCR situation we presume that the variables are ordered from most to least useful, you can again look for the first position i where the variable appears insignificant, and use the first i-1 variables.

We’ll use vtreat to get the significance estimates for the principal components. We’ll use one over the number of variables (1/50 = 0.02) as the pruning threshold.

# get all the principal components
# not really a projection as we took all components!
projectedTrain <- as.data.frame(predict(princ,dTrainNTreatedYScaled),
                                 stringsAsFactors = FALSE)
vars = colnames(projectedTrain)
projectedTrain$y = dTrainNTreatedYScaled$y

# designing the treatment plan for the transformed data
# produces a data frame of estimated significances
tplan = designTreatmentsN(projectedTrain, vars, 'y', verbose=FALSE)

threshold = 1/length(vars)
scoreFrame = tplan$scoreFrame
scoreFrame$accept = scoreFrame$sig < threshold

# pick the number of variables in the standard way:
# the number of variables that pass the significance prune
nPC = sum(scoreFrame$accept)

Significance pruning picks 2 principal components, again consistent with our visual assessment. This time, we picked the correct knee: as we saw in the previous post, the first two principal components were sufficient to describe the explainable structure of the problem.

Conclusion

Since one of the purposes of PCR/PCA is to discover the underlying structure in the data, it’s generally useful to examine the singular values and the variable loadings on the principal components. However, an analysis should also be repeatable, and hence automatable, and it’s not straightforward to automate something as vague as "look for a knee in the curve" when selecting the number of principal components to use. We’ve covered two ways to programmatically select the appropriate number of principal components in a predictive modeling context.

To conclude this entire series, here is our recommended best practice for principal components regression:

  1. Significance prune the candidate input variables.
  2. Perform a Y-Aware principal components analysis.
  3. Significance prune the resulting principal components.
  4. Regress.

Thanks to Cyril Pernet, who blogs at NeuroImaging and Statistics, for requesting this follow-up post and pointing us to the Jackson reference.

References

  • Jackson, Donald A. "Stopping Rules in Principal Components Analysis: A Comparison of Heuristical and Statistical Approaches", Ecology Vol 74, no. 8, 1993.

  • Kabacoff, Robert I. R In Action, 2nd edition, Manning, 2015.

  • Efron, Bradley and Robert J. Tibshirani. An Introduction to the Bootstrap, Chapman and Hall/CRC, 1998.

  • Peres-Neto, Pedro, Donald A. Jackson and Keith M. Somers. "How many principal components? Stopping rules for determining the number of non-trivial axes revisited", Computational Statistics & Data Analysis, Vol 49, no. 4, 2005.

To leave a comment for the author, please follow the link and comment on their blog: R – Win-Vector Blog.


Building the Data Matrix for the Task at Hand and Analyzing Jointly the Resulting Rows and Columns


(This article was first published on Engaging Market Research, and kindly contributed to R-bloggers)

Someone decided what data ought to go into the matrix. They placed the objects of interest in the rows and the features that differentiate among those objects into the columns. Decisions were made either to collect information or to store what was gathered for other purposes (e.g., data mining).

A set of mutually constraining choices determines what counts as an object and a feature. For example, in the above figure, the beer display case in the convenience store contains a broad assortment of brands in both bottles and cans grouped together by volume and case sizes. The features varying among these beers are the factors that consumers consider when making purchases for specific consumption situations: beers to bring on an outing, beers to have available at home for self and guests, and beers to serve at a casual or formal party.

These beers and their associated attributes would probably not have made it into a taste test at a local brewery. Craft beers come with their own set of distinguishing features: spicy, herbal, sweet and burnt (see the data set called beer.tasting.notes in the ExPosition R package). Yet preference still depends on the consumption occasion, for a craft beer needs to be paired with food, and desires vary by time of day, ongoing activities, and who is drinking with you. What gets measured and how it is measured depends on the situation and the accompanying reasons for data collection (the task at hand).

It should be noted that data matrix construction may evolve over time and may change with context. Rows and columns are added and deleted in order to maintain a type of synchronization with one beer suggesting the inclusion of other beers in the consideration set and at the same time inserting new columns into the data matrix in order to differentiate among the additional beers.

Given such mutual dependencies, why would we want to perform separate analyses of the rows (e.g., cluster analysis) and the columns (e.g., factor analysis), as if row (column) relationships were invariant regardless of the columns (rows)? That is, the similarity of the rows in a cluster analysis depends on the variables populating the columns, and the correlations among the variables are calculated as shared covariation over the rows. The correlation between two variables changes with the individuals sampled (e.g., restriction of range), and the similarity between two observations fluctuates with the basis of comparison (e.g., features included in the object representation). Given the shared determination of what gets included as the rows and the columns of our data matrix, why wouldn’t we turn to a joint scaling procedure such as correspondence analysis (when the cells are counts) or biplots (when the cells are ratings or other continuous measures)?

We can look at such a joint scaling using that beer.tasting.notes data set mentioned earlier in this post. Some 38 craft beers were rated on 16 different flavor profiles. Although the ExPosition R package contains functions that will produce a biplot, I will run the analysis with FactoMineR because it may be more familiar and I have written about this R package previously. The biplot below could have been shown in a single graph, but it might have been somewhat crowded with 38 beers and 16 ratings on the same page. Instead, by default, FactoMineR plots the rows on the Individuals factor map and the columns on the Variables factor map. Both plots come from a single principal component analysis with rows as points and columns as arrows. The result is a vector map with directional interpretation, so that points (beers) have perpendicular projections onto arrows (flavors) that reproduce as closely as possible the beer’s rating on the flavor variable.

Beer tasting provides a valuable illustration of the process by which we build the data matrix for the task at hand. Beer drinkers are not as likely as those drinking wine to discuss the nuances of taste so that you might not know the referents for some of these flavor attributes. As a child, you learned which citrus fruits are bitter and sweet: lemons and oranges. How would you become a beer expert? You would need to participate in a supervised learning experience with beer tastings and flavor labels. For instance, you must taste a sweet and a bitter beer and be told which is which in order to learn the first dimension of sweet vs. bitter as shown in the above biplot.

Obviously, I cannot serve you beer over the Internet, but here is a description of Wild Devil, a beer positioned on the Individuals Factor Map in the direction pointed to by the labels Bitter, Spicy and Astringent on the Variables Factor Map.

It’s arguable that our menacingly delicious HopDevil has always been wild. With bold German malts and whole flower American hops, this India Pale Ale is anything but prim. But add a touch of brettanomyces, the unruly beast responsible for the sharp tang and deep funk found in many Belgian ales, and our WildDevil emerges completely untamed. Floral, aromatic hops still leap from this amber ale, as a host of new fermentation flavor kicks up notes of citrus and pine.

Hopefully, you can see how the descriptors in our columns acquire their meaning through an understanding of how beers are brewed. It is an acquired taste, with the acquisition made easier by knowledge of the manufacturing process. What beers should be included as rows? We will need to sample all those variations in the production process that create flavor differences. The result is a relatively long feature list. Constraints are needed and supplied by the task at hand. Thankfully, the local brewery has a limited selection, as shown in the very first figure, so that all we need are ratings that will separate the five beers in the tasting tray.

In this case the data matrix contains ratings, which seem to be the averages of two researchers. Had there been many respondents using a checklist, we could have kept counts and created a similar joint map with correspondence analysis. I showed how this might be done in an earlier post mapping the European car market. That analysis demonstrated how the Prius has altered market perceptions. Specifically, Prius has come to be so closely associated with the attributes Green and Environmentally Friendly that our perceptual map required a third dimension anchored by Prius pulling together Economical with Green and Environmental. In retrospect, had the Prius not been included in the set of objects being rated, would we have included Green and Environmentally Friendly in the attribute list?

Now, we can return to our convenience store with a new appreciation of the need to analyze jointly the rows and the columns of our data matrix. After all, this is what consumers do when they form consideration sets and attend only to the differentiating features for this smaller number of purchase alternatives. Human attention enables us to find the cooler in the convenience store and then ignore most of the stuff in that cooler. Preference learning involves knowing how to build the data matrix for the task at hand and thus simplifying the process by focusing only on the most relevant combination of objects and features.

Finally, I have one last caution. There is no reason to insist on homogeneity over all the cells of our data matrix. In the post referenced in the last paragraph, Attention is Preference, I used nonnegative matrix factorization (NMF) to jointly partition the rows and columns of the cosmetics market into subspaces of differing familiarity with brands (e.g., brands that might be sold in upscale boutiques or department stores versus more downscale drug stores). You should not be confused because the rows are consumers and not objects such as car models and beers. The same principles apply whatever goes in the rows. The question is whether all the rows and columns can be represented in a single common space or whether the data matrix is better described as a number of subspaces where blocks of rows and columns have higher density with sparsity across the rest of the cells. Here are these attributes that separate these beers, and there are those attributes that separate those beers (e.g., the twist-off versus pry-off cap tradeoff applies only to beer in bottles).

R Code to Run the Analysis:

# ExPosition must be installed
library(ExPosition)
 
# load data set and examine its structure
data("beer.tasting.notes")
str(beer.tasting.notes)
head(beer.tasting.notes$data)
 
# how ExPosition runs the biplot
pca.beer <- epPCA(beer.tasting.notes$data)
 
# FactoMiner must be installed
library(FactoMineR)
 
# how FactoMineR runs the biplot
fm.beer<-PCA(beer.tasting.notes$data)


To leave a comment for the author, please follow the link and comment on their blog: Engaging Market Research.


What are the Best Machine Learning Packages in R?


Guest post by Khushbu Shah

The most common question asked by prospective data scientists is – “What is the best programming language for Machine Learning?” The answer always results in a debate over whether to choose R, Python or MATLAB for machine learning. Nobody can, in reality, say whether Python or R is the best language for machine learning. The programming language one should choose depends on the requirements of the given data problem, the preferences of the data scientist, and the context of the machine learning activities they want to perform. According to a survey on Kagglers’ favourite tools, the open source R programming language turned out to be the favourite among 543 of the 1,714 Kagglers listing their data science tools.

R is the preeminent choice among data professionals who want to understand and explore data using statistical methods and graphs. It has several machine learning packages and advanced implementations of the top machine learning algorithms, which every data scientist should be familiar with in order to explore, model and prototype data. R is an open source language to which people from around the world have contributed. From data collection and cleaning through to reproducible research, you will often find a black box written by someone else that you can use directly in your program. This black box is known as a package in R. A package in R is simply a collection of pre-written code that can be reused.

As per CRAN, there are around 8,341 packages currently available. Apart from CRAN, there are other repositories which contribute multiple packages. The simple, straightforward syntax to install any of these machine learning packages is: install.packages("Name_Of_R_Package").

A few basic packages without which your life as a data scientist will be tough include dplyr, ggplot2 and reshape2. In this article we will be more focused on packages used in the field of machine learning.

Enrol now for a hands-on project based Data Science in R Programming Course

  1. MICE Package – Takes care of your Missing Values

    If missing values are something that haunts you, then the MICE package is a real friend.

    When we face an issue of missing values, we generally go ahead with basic imputations such as replacing with 0, replacing with the mean, or replacing with the mode, but each of these methods is inflexible and can distort the data.

    MICE package helps you to impute missing values by using multiple techniques, depending on the kind of data you are working with.

    Let’s take an example using the MICE package.

        # create a small data frame and deliberately introduce some missing values
        dataset <- data.frame(var1 = rnorm(20, 0, 1), var2 = rnorm(20, 5, 1))
        dataset[c(2, 5, 7, 10), 1] <- NA   # the original indices were lost; any cells will do
        dataset[c(4, 8, 19), 2] <- NA
        summary(dataset)


    So far we have created a random data frame and deliberately introduced a few missing values. Now it’s time to see MICE at work.

        install.packages("mice")
        library(mice)
        # impute the missing values, then extract a completed dataset
        dataset2 <- mice(dataset)
        dataset2 <- complete(dataset2)
        summary(dataset2)


    We have used the default parameters of mice() for this example, but you can read the documentation and change them to suit your data.
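
    As a hedged sketch (the argument values below are illustrative, not recommendations), the main knobs are the number of imputations m, the per-column method and the number of iterations maxit:

        # illustrative settings only; see ?mice for the full list of arguments
        imp <- mice(dataset, m = 10, method = "pmm", maxit = 20, seed = 123)
        imp$method                     # imputation method used for each column
        dataset3 <- complete(imp, 1)   # extract the first completed dataset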

  2. rpart package: Let’s partition your data

    The rpart package in R is used to build classification or regression models via a two-stage procedure, and the resulting models are represented as binary trees. The basic way to plot a tree built with rpart is to call the plot() function, but the result is rarely pretty. A powerful and flexible alternative is the prp() function in the rpart.plot package, often described as the Swiss army knife of tree plotting.

    The rpart() function helps establish a relationship between a dependent variable and independent variables so that a business can understand how the dependent variable varies with the independent ones. For instance, if an eLearning company wants to find out how its sales (the dependent variable) have been impacted by promotions on social media, word of mouth, newspapers, referral sites, etc., the rpart package has several functions that can help with this analysis.

    rpart stands for Recursive Partitioning and Regression Trees. Using rpart you can run regression as well as classification. The syntax is pretty simple:

    rpart(formula, data=, method=,control=)

    • formula contains the combination of dependent and independent variables; data is the name of your dataset; method depends on the objective, e.g. "class" for a classification tree; and control holds requirement-specific settings, for example the minimum number of observations needed to split a node (see the sketch below).
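
    As a hedged sketch, a control object can be built with rpart.control() and passed to rpart(); the values below are close to rpart's usual defaults and shown only for illustration:

        library(rpart)
        # require 20 observations to attempt a split, keep at least 7 per leaf,
        # and prune with a complexity parameter of 0.01
        ctrl <- rpart.control(minsplit = 20, minbucket = 7, cp = 0.01)
        ctrl_tree <- rpart(Species ~ ., data = iris, method = "class", control = ctrl)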

    Let’s consider the familiar iris dataset: 150 observations of sepal and petal measurements plus a Species label.


    Assuming our objective is to predict Species using a decision tree, it can be achieved by a simple line of code

        library(rpart)
        rpart_tree <- rpart(formula = Species ~ ., data = iris, method = "class")
        summary(rpart_tree)
        plot(rpart_tree)
        text(rpart_tree)   # add split and class labels to the plot
    

    The plot() call displays the fitted tree.


    You can see the split made at each node and the predicted class at each leaf.

    To predict for a new dataset, there is the simple function predict(tree_name, new_data), which returns the predicted classes.
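
    For instance, a minimal sketch predicting back on the training data (new data would normally be held-out observations):

        # type = "class" returns factor levels rather than class probabilities
        rpart_preds <- predict(rpart_tree, newdata = iris, type = "class")
        table(predicted = rpart_preds, true = iris$Species)   # quick confusion matrix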

  3. PARTY: Let’s again partition your data

    The party package in R is used for recursive partitioning and reflects the continuing development of ensemble methods.

    party is yet another package for building decision trees, this time based on the conditional inference framework. Its main, and heavily used, function is ctree(), which reduces training time and variable-selection bias.

    Like the other predictive modelling functions in R, ctree() has a familiar syntax:

    ctree(formula, data)
    which builds your decision tree using the default values of its various arguments; these can be tweaked as required.

    Let’s build a tree using the same example discussed above.

        library(party)
        party_tree <- ctree(formula = Species ~ ., data = iris)
        plot(party_tree)
    

    The plot() call displays the fitted conditional inference tree.


    This package also has a predict() method for classifying new data, as sketched below.
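
    A minimal sketch, again predicting on the training data purely for illustration:

        # predict class labels from the conditional inference tree
        party_preds <- predict(party_tree, newdata = iris)
        table(predicted = party_preds, true = iris$Species)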

  4. CARET: Classification And REgression Training

    The caret (Classification And REgression Training) package was developed to unify model training and prediction. With it, data scientists can run many different algorithms on a given business problem through one interface. Since it is rarely obvious in advance which algorithm will work best, caret helps investigate the optimal tuning parameters through controlled experiments: its grid search combines candidate parameter values, estimates the performance of each combination by resampling, and, after looking at all the trial combinations, picks the one that gives the best results (see the sketch below).
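
    As a hedged illustration (the model, grid values and resampling scheme are arbitrary choices, not recommendations), a grid search over k for a k-nearest-neighbours model might look like:

        library(caret)
        fit_control <- trainControl(method = "cv", number = 5)   # 5-fold cross-validation
        knn_grid <- expand.grid(k = c(3, 5, 7, 9))               # candidate tuning values
        knn_fit <- train(Species ~ ., data = iris, method = "knn",
                         trControl = fit_control, tuneGrid = knn_grid)
        knn_fit$bestTune   # the value of k that performed best under resampling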

    Data scientists can streamline the process of building predictive models with the help of specialised built-in functions for data splitting, feature selection, data pre-processing, variable importance estimation, model tuning through resampling and model visualization.

    The caret package is one of the best packages in R. Its developers understood that it is hard to know in advance which algorithm best suits a given problem: there are situations where you are using a particular model and doubting your data, when the real problem lies in the algorithm you have chosen.

    After installing CARET package, you can run names(getModelInfo()) and see that there are 217 possible methods which can be run through a single package.

    To build a predictive model, caret uses the train() function, whose syntax looks like:

    train(formula, data, method)

    Here method is the predictive model you are trying to build. Let’s use the iris dataset and fit a linear regression model to predict Sepal.Length:

        lm_model <- train(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width,
                          data = iris, method = "lm")
        summary(lm_model)
    


    The caret package is not just for building models; it also takes care of splitting your data into training and test sets, applying transformations, and so on, as sketched below.
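
    A minimal sketch of a stratified train/test split plus a centring/scaling transformation (the 80/20 split is an arbitrary choice):

        # stratified 80/20 split on the outcome
        idx <- createDataPartition(iris$Species, p = 0.8, list = FALSE)
        train_set <- iris[idx, ]
        test_set  <- iris[-idx, ]
        # estimate centring/scaling on the training predictors, apply to both sets
        pp <- preProcess(train_set[, 1:4], method = c("center", "scale"))
        train_scaled <- predict(pp, train_set[, 1:4])
        test_scaled  <- predict(pp, test_set[, 1:4])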

    In short, this is the go-to package in R for all your predictive modelling needs.

  5. randomForest: Let’s combine multiple trees to build our own forest

    The Random Forest algorithm is one of the most widely used machine learning algorithms. The randomForest package grows a large number of decision trees, each observation is run down every tree, and the class predicted most often across the trees is taken as the final output. When using randomForest, data scientists have to ensure that every variable is numeric or a factor, and a factor cannot have more than 32 levels.

    As you may be aware, a Random Forest takes random samples of observations as well as random subsets of variables and builds many trees. These trees are then combined and their votes aggregated to predict the class of the response variable.

    Let’s use the iris data example to build a Random Forest using randomForest package.

    library(randomForest)
    Rf_fit <- randomForest(formula = Species ~ ., data = iris)

    You run one line, much as with the other packages, and your Random Forest is ready to use. Let’s see how the fitted forest performs:

        print(Rf_fit)
        importance(Rf_fit)
    


    You might need to play with different control parameters in randomForest, e.g. the number of variables tried at each split or the number of trees to grow. Generally, data scientists run multiple iterations and select the best combination.
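
    A hedged sketch with arbitrary illustrative values for the two main knobs:

        # ntree: number of trees; mtry: number of predictors sampled at each split
        Rf_tuned <- randomForest(Species ~ ., data = iris,
                                 ntree = 1000, mtry = 2, importance = TRUE)
        print(Rf_tuned)
        varImpPlot(Rf_tuned)   # visualise variable importance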

  6. nnet: It’s all about hidden layers

    This is the most widely used and easiest to understand neural network package, but it is limited to a single hidden layer of nodes. Several studies suggest that, for many problems, more layers are not required: they tend not to improve performance but do increase calculation time and model complexity.

    The package does not provide any specific method for choosing the number of nodes in the hidden layer, so practitioners implementing nnet are usually advised to set it to a value between the number of input and output nodes. The nnet package implements artificial neural networks, which are loosely inspired by how a human brain maps input signals to output signals. ANNs are widely applied in forecasting, for example by airlines; in such settings neural network models fitted with nnet can produce better forecasts than traditional methods like exponential smoothing or regression.

    R has several packages for building neural networks, e.g. nnet, neuralnet and RSNNS. Let’s use the iris example again (I know you are bored of iris) and try to predict Species with nnet:

    library(nnet)
    nnet_model <- nnet(Species ~ ., data = iris, size = 10)


    The output reports 10 units in the hidden layer; this is because we set size = 10 when building the network.
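
    As a quick hedged check of the fit (predicting back on the training data):

        # type = "class" returns predicted factor levels instead of probabilities
        nnet_preds <- predict(nnet_model, iris, type = "class")
        table(predicted = nnet_preds, true = iris$Species)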

    Unfortunately, there is no direct way to plot the fitted network, but there are plenty of custom plotting functions contributed on GitHub that you can use.

    To draw the network we used this one – https://gist.githubusercontent.com/fawda123/7471137/raw/466c1474d0a505ff044412703516c34f1a4684a5/nnet_plot_update.r

  7. e1071: Let the vectors support your model

    Wondering whether this package is of any real value? Very much so. It is a vital package in R, with specialised functions for Naïve Bayes (conditional probability), SVMs, Fourier transforms, bagged clustering, fuzzy clustering and more. In fact, the first R interface to an SVM implementation was in the e1071 package. As an example of the conditional-probability side, a data scientist might want to estimate the probability that a person who buys an iPhone 6S also buys an iPhone 6S case.

    That kind of analysis rests on conditional probability, so data scientists can use the specialised Naive Bayes classifier functions in e1071.
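
    A minimal sketch of the Naive Bayes classifier, using iris purely for illustration (the iPhone example above would require purchase data we do not have here):

        library(e1071)
        nb_model <- naiveBayes(Species ~ ., data = iris)
        nb_preds <- predict(nb_model, iris)
        table(predicted = nb_preds, true = iris$Species)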

    Support Vector Machines come to the rescue when your dataset is not separable in its original dimensions and the data need to be mapped into a higher-dimensional space in order to be classified or regressed.

    A Support Vector Machine (SVM) uses kernel functions to keep those mathematical operations tractable and maximises the margin between the two classes.

    The syntax for svm() follows the same pattern as the functions discussed above:

    svm_model <- svm(Species ~ Sepal.Length + Sepal.Width, data = iris)

    To visualise the fitted SVM we call plot(), passing the relevant columns of the data as well:

    plot(svm_model, data = iris[,c(1,2,5)])


    In the resulting plot you can clearly see the decision boundaries obtained by applying the SVM to the iris data.

    There are multiple parameters you may have to change to get the best accuracy, e.g. the kernel, cost, gamma and kernel coefficients.

    To get a better classifier with an SVM you will have to experiment with these factors; the kernel, for example, can be linear, radial (Gaussian), polynomial or sigmoid, and cost and gamma can be searched over a grid, as in the sketch below.
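
    A hedged sketch of a grid search over cost and gamma with e1071's tune() (the grids are arbitrary):

        # cross-validated search over a small grid of cost and gamma values
        tuned <- tune(svm, Species ~ ., data = iris,
                      ranges = list(cost = 2^(0:4), gamma = 2^(-4:0)))
        summary(tuned)
        tuned$best.model   # the SVM refitted with the best parameter combination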

  8. kernlab: the kernel trick, packaged well

    kernlab takes advantage of R’s S4 object model so that data scientists can use kernel-based machine learning algorithms. It provides implementations of SVMs, kernel feature analysis, dot-product primitives, a ranking algorithm, Gaussian processes and a spectral clustering algorithm. Kernel-based methods are used when clustering, classification and regression problems are hard to solve in the space in which the observations were made.

    The kernlab package is widely used for SVM work, where it eases pattern recognition to a great extent. It provides various kernel functions, such as tanhdot (hyperbolic tangent kernel), polydot (polynomial kernel), laplacedot (Laplacian kernel) and many more.

    By now you should have a sense of the power of kernel functions in SVMs; without them, an SVM can only draw linear decision boundaries.

    SVMs are not the only technique that uses the kernel trick; there are plenty of other popular and useful kernel-based algorithms, e.g. relevance vector machines, kernel PCA and other kernel-based dimensionality-reduction methods (see the sketch below).
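
    For instance, a minimal sketch of kernel PCA on the four iris measurements with kernlab (the sigma value is an arbitrary illustration):

        library(kernlab)
        # kernel PCA on the numeric columns with an RBF kernel, keeping 2 components
        kpc <- kpca(~., data = iris[, 1:4], kernel = "rbfdot",
                    kpar = list(sigma = 0.2), features = 2)
        head(rotated(kpc))   # observations projected onto the first two kernel PCs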

    The kernlab package houses roughly 20 such algorithms that run on the power of kernels.

    kernlab has its own predefined kernels, but users also have the flexibility to build and use their own kernel functions.

    Let’s initialise a Radial Basis Function kernel with a sigma value of 0.01:

        Myrbf <- rbfdot(sigma = 0.01)
        Myrbf   
    

    If you want to see the class of Myrbf, simply run the class() function on the created object.
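
    A one-line check (the class name comes from kernlab's S4 hierarchy):

        class(Myrbf)   # an "rbfkernel" object, which extends the virtual class "kernel"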


    Every kernel object accepts two vectors and returns their dot product in the feature space induced by the kernel. Let’s create two vectors and evaluate the kernel on them:

        x <- rnorm(10)
        y <- rnorm(10)
        Myrbf(x, y)   # kernel evaluation k(x, y)
    

    Here we created two random normal vectors, x and y, with 10 values each, and evaluated the Myrbf kernel on them.

    Let’s see an example of SVM using Myrbf kernel

    We’ll use the iris data again to understand how SVMs work in kernlab.

        Kernlab_svm <- ksvm(Species ~ Sepal.Length + Sepal.Width, data = iris, kernel = Myrbf, C=4)
        Kernlab_svm    
    


    Let’s use the fitted support vector machine and see how well it predicts:

        predicted <- predict(Kernlab_svm, iris)
        table(predicted = predicted, true = iris$Species)

Closing Note:

Every package or function in R has default values associated with it. Before applying any algorithm, you should know the options available: the defaults will give you some result, but you cannot be sure it is the most optimised or accurate one.

There are many other machine learning packages in the CRAN repository, such as igraph, glmnet, gbm, tree, CORElearn and mboost, which are used across industries to build efficient, high-performing models. We have seen scenarios where changing a single parameter completely changes the output, so do not rely on parameter defaults: understand your data and requirements before applying any algorithm.

Enrol now for a hands-on project based Data Science in R Programming Course
