Yesterday Hadley’s functional programming package purrr was published to CRAN. It is designed to bring convenient functional programming paradigms and another data manipulation framework to R.
“Where dplyr focusses on data frames, purrr focusses on vectors” – Hadley Wickham in a blog post
The core of the package consists of the map functions, which operate similarly to the base apply functions. So I tried to rebuild the first example used to present the map function in the blog post and GitHub repo. Rebuilt with apply, the example looks like this:
mtcars %>%
  split(.$cyl) %>%
  lapply(function(x) lm(mpg ~ wt, data = x)) %>%
  lapply(summary) %>%
  sapply(function(x) x$r.squared)
## 4 6 8
## 0.5086326 0.4645102 0.4229655
The map function works with 3 different inputs – function names (exactly like apply), anonymous functions written as formulas, and character strings to select elements. The apply functions, on the other hand, only accept functions, but with a little piping voodoo we can also shortcut the anonymous functions like this:
mtcars %>%
  split(.$cyl) %>%
  lapply(. %>% lm(mpg ~ wt, data = .)) %>%
  lapply(summary) %>%
  sapply(. %>% .$r.squared)
## 4 6 8
## 0.5086326 0.4645102 0.4229655
Which is almost as appealing as the map alternative. Checking the speed of both approaches also reveals no significant time differences:
library(microbenchmark)
map <- function() {
  mtcars %>% split(.$cyl) %>% map(~ lm(mpg ~ wt, data = .)) %>% map(summary) %>% map_dbl("r.squared")
}
apply <- function() {
  mtcars %>% split(.$cyl) %>% lapply(. %>% lm(mpg ~ wt, data = .)) %>% lapply(summary) %>% sapply(. %>% .$r.squared)
}
microbenchmark(map, apply)
## Unit: nanoseconds
## expr min lq mean median uq max neval
## map 14 27 31.69 27 27 513 100
## apply 14 27 281.34 27 27 24556 100
Now don’t get me wrong, I don’t want to say that purrr is worthless or redundant. I just picked the most basic function of the package and explained what it does by rewriting it with apply functions. In my eyes, the map function and its derivatives are convenient alternatives to apply without computational overhead. Furthermore, the package offers very interesting functions like zip_n and lift. If you love apply and work with lists, you should definitely check out purrr.
Following the great success of the EARL conference in London earlier this month, our attention now turns to EARL Boston, which will take place between 2-4th November just across the pond; the competition is already on to see which will come out top in 2015! With presentations, speakers and venues as different and diverse as the cities themselves, our Boston conference is well worth a visit, even if you also visited our EARL London event. Limited tickets are available online for this unique event, so why not buy yours now?
As EARL is all about applications of the R programming language, used for statistical analysis and visualisation, we thought it would be interesting to put the two cities head-to-head statistically, to present you with some interesting visual comparisons of London vs Boston.
The first plot shows the number of residents in London and Boston, compared to the number of residents in London and Boston including the metropolitan area that surrounds each city. This is the area from which it is practical to commute to work in the city. In London, this is known as the “commuter belt”.
The next plot shows the number of jobs in each city, for a selection of sectors.
Whilst it is obvious that London is a lot larger than Boston, it is interesting to see that they both have similar shapes, showing that the proportion of workers in each industry is comparable.
Both conferences are held at popular tourist attractions. But which is more popular on social media?
Both are fantastic conference venues; however, it is to be expected that the number of visits to London’s Tower Bridge is a lot higher than to the Science Museum in Boston, since the city itself is much larger and attracts a higher number of tourists per year. Furthermore, visits to Tower Bridge are likely to be shorter, whereas Science Museum visits tend to take up a whole day, so it is probable that the flow of visitors over a day boosts the stats for London over Boston in this particular comparison. With this in mind, it is quite impressive that the number of likes for the Boston Science Museum is actually in line with those of Tower Bridge.
Can you guess what’s trending on Twitter in the two cities? Hover over each word to find out! Orange words come from Boston and grey words come from London.
London and Boston are in a celebratory mood, with tweets about Eid, the Rugby World Cup and #diverseauthorday. However, other trends are more serious: for example, #mecca refers to the Mecca stampede, and ‘#cypresswood’ and ‘#Edna’ refer to a traffic accident. London is also engaged with many serious issues such as affordable housing (‘#nfs15’) and First Minister’s Questions (‘#fmqs’). This contrasts with the tweets about Big Brother (‘#BB17’)! It’s also worth checking out the great puns for #MakeAMovieMoreChildFriendly.
Both cities are famed for their many breweries. The next infographic shows precisely how many!
…and you can explore them yourself here:
And finally let’s compare the weather. Hopefully it will be less rainy in Boston than it was in London – although the stats do not appear to be weighted in our favour…
Back in the mists of time, whilst programming early versions of Canoco, Cajo ter Braak decided to allow users to specify how species and site ordination scores were scaled relative to one another via a simple numeric coding system. This was fine for the DOS-based software that Canoco was at the time; you entered 2 when prompted and you got species scaling, -1 got you site or sample scaling and Hill’s scaling or correlation-based scores depending on whether your ordination was a linear or unimodal method. This system persisted; even in the Windows era of Canoco these numeric codes can be found lurking in the .con files that describe the analysis performed. This use of numeric codes for scaling types was so pervasive that it was logical for Jari Oksanen to include the same system when the first cca() and rda() functions were written and in doing so Jari perpetuated one of the most frustrating things I’ve ever had to deal with as a user and teacher of ordination methods. But, as of last week, my frustration is no more…
…because we released a patch update to the CRAN version of vegan. Normally we don’t introduce new functionality in patch releases but the change I made to the way users can request ordination scores was pretty trivial and maintained backwards compatibility.
Previously, different scalings could be requested using the scaling argument. scaling is an argument of the scores() function; any function using scores() would either have scaling as a formal argument too, or would pass scaling on to scores() internally. Until now, the different scalings were specified, as in DOS-era Canoco, as numeric values. Now, scores() accepts either those same old numeric values or a character string for scaling, coupled with a second logical argument. Vegan accepts the following character values to select the type of scaling:
“sites”, which gives site-focussed scaling, equivalent to numeric value 1
“species” (the default), which gives species- (variable-) focused scaling, equivalent to numeric value 2
“symmetric”, which gives a so-called symmetric scaling, and is equivalent to numeric value 3.
To get negative versions of these values, the correlation or hill argument should be set to TRUE as follows
correlation (default FALSE) for correlation-like scores for PCA/RDA/CAPSCALE models, or
hill (default FALSE) for Hill’s scaling for CA/CCA models
Whilst this requires the setting of two different arguments, it’s certainly a lot easier to remember these two arguments than what the numerical codes mean.
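For example, assuming ord_ca is a fitted unimodal ordination of your own (say ord_ca <- cca(dune); the object name is mine, not from the post), the following two calls should request the same Hill's-scaled site scores, one via the old numeric code and one via the new arguments – a minimal sketch:

ord_ca <- cca(dune)
scores(ord_ca, display = "sites", scaling = -2)                      # old numeric code
scores(ord_ca, display = "sites", scaling = "species", hill = TRUE)  # new equivalent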
Obligatory Dutch dune meadows example
Here’s a quick example of the new usage showing a PCA of the classic Dutch dune meadow data set.
library("vegan")
Loading required package: permute
Loading required package: lattice
This is vegan 2.3-1
data(dune)
ord <- rda(dune)  # fit the PCA
layout(matrix(1:2, ncol = 2))
plot(ord, scaling = "species")
plot(ord, scaling = "species", correlation = TRUE)
layout(1)
PCA of the Dutch dune meadow data set. Both biplots are drawn using species scaling, but the one on the right standardizes the species scores.
The two biplots are based on the same underlying ordination and both focus the scaling on best representing the relationships between species (scaling = “species”), but the biplot on the right uses correlation-like scores. This has the effect of making the species have equal representation on the plot without doing the PCA with standardized species data (all species having unit variance).
The Statistical Disclosure Limitation (SDL) problem involves modifying a data set in such a manner that statistical analysis on the modified data is reasonably close to that performed on the original data, while preserving the privacy of individuals in the data set. For instance, we might have a medical data set on which we want to allow researchers to do their statistical analyses but not violate the privacy of the patients in the study.
In this posting, I’ll briefly explain what SDL is, and then describe a new method that Pat Tendick and I are proposing. Our paper is available as arxiv:1510.04406 and R code to implement the method is available on GitHub. See the paper for details.
This is a very difficult problem, one that arguably has not been fully solved, in spite of decades of work by some really sharp people. Some of the common methods are: adding mean-0 noise to each variable; finding pairs of similar records and then swapping their values of the sensitive variables; and (in the case in which all variables are categorical), suppressing cells that contain just 1 or a few cases.
As an example of the noise addition method, consider a patient data set that includes the variables Age and Income. Suppose a nefarious user of the data happens to have external knowledge that James is the oldest patient in the study. The Bad Guy can then issue a query asking for the income of the oldest patient (not mentioning James), thus revealing James’ salary. But if the public version of the data has had noise added, James’s listed income will not be his real one, and he may well not be the oldest listed patient anymore anyway.
Given the importance of this topic — JSM 2015 had 3 separate sessions devoted to it — it is surprising that rather little public-domain software is available. The only R package I know of is sdcMicro on CRAN (which by the way includes an excellent vignette from which you can learn a lot about SDL). NISS has the Java-based WebSwap (from whose docs you can also learn about SDL).
Aside from the availability of software, one big concern with many SDL methods is that the multivariate structure of the data may be distorted in the modification process. This is crucial, since most statistical analyses are multivariate in nature, e.g. regression, PCA etc., and thus a major distortion in the multivariate structure can result in seriously misleading estimates.
In the noise addition method, this can be achieved by setting the noise covariance matrix to that of the original data, but for the other methods maintaining the proper multivariate structure is a challenge.
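As a rough illustration of that idea (a sketch of generic noise addition on my part, not the method proposed in our paper), one could draw mean-0 noise whose covariance is proportional to that of the original continuous data; dat and eps are hypothetical names:

library(MASS)

add_noise <- function(dat, eps = 0.1) {
  sigma <- cov(dat)  # covariance of the original (continuous) data
  # mean-0 multivariate normal noise with covariance proportional to sigma
  noise <- mvrnorm(n = nrow(dat), mu = rep(0, ncol(dat)), Sigma = eps * sigma)
  dat + noise
}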
While arguably noise addition works well for data consisting only of continuous variables, and data swapping and cell suppression are often acceptable for the purely-categorical case, the mixed continuous-categorical setting is tough.
Our new method achieves both of the above goals. It (a) is applicable to any kind of data, including the mixed continuous-categorical case, and (b) maintains the correct multivariate structure. Rather counterintuitively, our method achieves (b) while actually treating the variables as (conditionally) independent.
The method has several tuning parameters. In some modern statistical methods, tuning parameters are a real pain, but in SDL, the more tuning parameters the better! The database administrator needs to have as many ways as possible to develop a public form of the database that has both good statistical accuracy and good privacy protection.
As an example, I took some Census data for the year 2000 (5% PUMS), involving programmers in Silicon Valley. In order to simulate an employee database, I sampled 5000 records, keeping the variables WageIncome, Age, Gender, WeeksWorked, MSDegree and PhD. See our paper for details, but here is a quick overview.
First, to see that goal (b) above has been maintained reasonably well, I ran a linear regression analysis, predicting WageIncome from the other variables. I did this twice, once for the original data and once for the modified set, for a given combination of values of the tuning parameters. Here are the estimated regression coefficients:
data     Age     Gender    WeeksWorked   MS        PhD
orig.    447.2   -9591.7   1286.4        17333.0   21291.3
modif.   466.1   -8423.2   1270.7        18593.9   22161.4
This is not bad. Each pair of coefficients is within one original standard error of the other (not shown). The database administrator could try lots of other combinations of the tuning parameters, and likely get even closer. But what about privacy?
In the original data set, there was exactly one female worker with a PhD and age under 31:
> p1[p1$sex==2 & p1$phd==1 & p1$age < 31,]
age sex wkswrkd ms phd wageinc
7997 30.79517 2 52 0 1 100000
Which such workers, if any, are listed in the modified data?
> p1pc[p1pc$sex==2 & p1pc$phd==1 & p1pc$age < 31,]
age sex wkswrkd ms phd wageinc
12522 30.5725 2 52 0 1 50000
There is only one person listed in the released data of the given description (female, PhD, age under 31). But she is listed as having an income of $50,000 rather than $100,000. In fact, it is a different person, worker number 12522, not 7997. (Of course, ID numbers would be suppressed.)
So what happened to worker number 7997?
> which(rownames(p1p) == 7997)
[1] 3236
> p1p[3236,]
age sex wkswrkd ms phd wageinc
7997 31.9746 1 52 0 1 100000
Ah, she became a man! That certainly hides her. With a different random draw, her record might have become all NA values.
In this way, the database administrator can set up a number of statistical analysis test cases, and a number of records at high risk of identification, and then try various combinations of the tuning parameters in order to obtain a modified data set that achieves a desired balance between statistical fidelity and privacy.
If you have ever retrieved data from Twitter, Facebook or Instagram with R, you might have noticed a strange phenomenon. While R seems to be able to display some emoticons properly, many other times it doesn’t, making any further analysis impossible unless you get rid of them. With a little hack, I decoded these emoticons and put them all in a dictionary for further use. I’ll explain how I did it and share the decoder with you.
The meaning of Emoticons and why you should analyze them
As far as I remember, all the sentiment analysis code I came across dealt with emoticons by simply getting rid of them. Now, that might be OK if you’re interested in analyzing someone’s vocabulary or making some fancy wordclouds. But if you want to perform sentiment analysis, then these emoticons are probably the most meaningful part of your data! Sentiments, emotions, emoticons – you know, there’s a link. Not only are they full of meaning by themselves, they also have the virtue of changing the meaning of the sentences they are appended to. Think of a tweet like “I’m going to bed”. It has a dramatically different meaning depending on whether a happy smiley or a sad smiley is attached to it. Long story short: if you’re interested in sentiment, you MUST capture emoticons!
Emoticons and R
As mentioned earlier, R seems to be totally capable of properly displaying some emoticons, while it fails at displaying others. Try inputting "\xE2\x9D\xA4" (heavy black heart) into the console, and this is what the output will look like:
## [1] "❤"
Just as expected. So far, so good.
Now try inputting "\xF0\x9F\x98\x8A" (smiling face with smiling eyes) and R won’t display it as an emoticon:
## [1] "U0001f60a"
The display problem seems related to the length of the code. Apparently, if an emoticon’s UTF-8 code is longer than 12 characters, it won’t be displayed properly. Find a list of emoticons with their respective encodings here.
Now, the real problem occurs when you retrieve data from social media. This is what tweets look like when retrieved with the userTimeline() function and parsed to a data frame with the twListToDF() function from the twitteR package:
For demonstration purposes, I needed to find a Twitter user who integrates a lot of different emoticons in his or her tweets. And who is better suited than emoticon queen Paris Hilton herself? Exactly, nobody. As you can see, many emoticons aren’t displayed correctly but appear as strange question mark symbols instead, while others are displayed as emoticons.
Printed to the console, the strange question mark symbols aka emoticons look like Unicode, except that it is not Unicode and doesn’t match the actual UTF-8 encoding I expected. Take this tweet, for example:
## [1] "At sound check last night for the ports1961womenswear After Party.
xedxa0xbcxedxbexb6xedxa0xbdxedxb1xb8xedxa0xbcxedxbfxbc
xedxa0xbcxedxbexb6 @ Shanghai, China https://t.co/Y9lE9mXjIL"
The first emoticon consists of multiple musical notes; its corresponding UTF-8 encoding is \xF0\x9F\x8E\xB6, but it is being rendered as \xed\xa0\xbc\xed\xbe\xb6.
Also, try to pass the tweet to the str() function and you will get an error message:
## Error in strtrim(e.x, nchar.max): invalid multibyte string at
'<a0><bc>\xed<b7><a8>\xed<a0><bc>\xed<b7><b3> https://t.co/0p90p8Rrhu"'
If you don’t need the emoticons, the most reasonable thing to do is to get rid of them. But given that I wanted to include them in my analysis, I was facing two problems. First, it was difficult to handle the tweets because their format messed up some functions and led to error messages; I needed to change their encoding. Second, I couldn’t identify the emoticons, as the way R encodes them doesn’t correspond to their actual UTF-8 encoding. It’s a real pain in the neck, especially (but not only) if you’re analyzing Paris Hilton’s tweets, because she just LOVES her musical notes emoticons. The solution that came to my mind was to decode them.
First things first: I used the iconv() function with the following parameters to be able to further handle the tweets. As you can see, the multiple musical notes emoticon \xF0\x9F\x8E\xB6 now appears as <ed><a0><bc><ed><be><b6>. This is an encoding I could work with, and it was going to be the basis of the decoder.
## [1] "At sound check last night for the ports1961womenswear After Party.
<ed><a0><bc><ed><be><b6><ed><a0><bd><ed><b1><b8><ed><a0><bc><ed><bf><bc>
<ed><a0><bc><ed><be><b6> @ Shanghai, China https://t.co/Y9lE9mXjIL"
Building an emoticons decoder for R
This is how I did it.
1. Scrape the code
The first step consisted of scraping the emoticons, their UTF-8 codes and their descriptions from this website. I believe this is a more or less full list of them. Here is the R code I used to do so – no problem thanks to the rvest package.
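The linked script isn’t reproduced here, but the scraping step would look roughly like this (the URL and CSS selector are placeholders, not the ones actually used):

library(rvest)

# placeholder URL for the emoji reference page linked above
page <- read_html("http://www.example.com/emoji-list")
# grab the table of emoticons, UTF-8 codes and descriptions
emoji_table <- html_table(html_node(page, "table"))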
2. Post code
Now that I had a list of the emoticons and their codes, I could post them on twitter. I first created a twitter account for this purpose and then posted the emoticons in batches until I reached the rate limit. I continued the procedure the next days until I had posted them all. I used this code to post to twitter.
3. Retrieve tweets
The third step consisted of retrieving the tweets to get R’s strange encoding for every emoticon on the list.
4. Build the decoder
The last step consisted of matching the emoticons list from step 1 with the retrieved emoticons from step 3, based on their description. And with that, my (R) emoticons decoder was ready! It includes the image, the UTF-8 code, the description and the encoding R gives to every emoticon. You will find the decoder as a csv file here.
With this decoder, I’m finally able to identify and match emoticons retrieved with R from social media. Possible use cases could be to give them a score for sentiment analysis (e.g. +1 for positive emoticons, -1 for negative emoticons or even weighted scores according to their level of positivity and negativity) or to put them into categories for semantic analysis (animals, activities, emotions, etc.). There are no boundaries to your imagination.
Let me know if you find an easier way to decode emoticons in R or to solve this problem!
A few weeks ago I gave a talk at BARUG (and wrote a post) about blogging with the excellent knitr-jekyll repo. Yihui’s system is fantastic, but it does have one drawback: None of those fancy new htmlwidgets packages seem to work…
A few people have run into this. I recently figured out how to fix it for this blog (which required a bit of time reading through the rmarkdown source), so I thought I’d write it up in case it helps anyone else, or my future self.
TL;DR
You can add a line to build.R which calls a small wrapper function I cobbled together (brocks::htmlwidgets_deps), add a snippet of liquid syntax to ./_layouts/post.html, and you’re away.
What’s going on?
Often, when you ‘knit’ an .Rmd file to html, (perhaps without knowing it) you’re doing it via the rmarkdown package, which adds its own invisible magic to the process. Behind the scenes, rmarkdown uses knitr to convert the file to markdown format, and then uses pandoc to convert the markdown to HTML.
While knitr executes R code and embeds results, htmlwidgets packages (such as leaflet, DiagrammeR, threejs, and metricsgraphics) also have js and css dependencies. These are handled by rmarkdown’s second step, and so don’t get included when using knitr alone.
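In other words (a sketch with a hypothetical file name), these two calls do different amounts of work:

# knitr alone: executes the R chunks and writes markdown, but the js/css
# dependencies of any htmlwidgets are not written into the output
knitr::knit("post.Rmd", output = "post.md")

# rmarkdown: knit + pandoc, which is the step that injects those dependencies
rmarkdown::render("post.Rmd", output_format = "html_document")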
The rmarkdown invisible magic works as follows:
It parses the .Rmd for special dependency objects linking to the js/css source (by calling knitr::knit_meta — see the sketch after this list)
It then (by default) combines their source-code into a huge data:uri blob, which it writes to a temp-file
This is injected into the final HTML file, by passing it to pandoc’s --include-in-header argument
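You can peek at the first step yourself: straight after knitting a post that uses an htmlwidget, the collected dependency objects are still available from knitr (a small sketch; clean = FALSE just avoids wiping them):

deps <- knitr::knit_meta(class = "html_dependency", clean = FALSE)
str(deps, max.level = 1)  # one html_dependency entry per js/css library used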
A fix: htmlwidgets_deps
Happily, including bits of HTML in other bits of HTML is one of Jekyll’s strengths, and it’s possible to hijack the internals of rmarkdown to do something appropriate. I did this with a little function, htmlwidgets_deps, which:
Copies the js/css dependencies from the R packages, into a dedicated assets folder within your blog
Writes a little HTML file, containing the links to the source code above
With a small tweak to the post.html file, Jekyll’s liquid templating system can be used to pull in that little HTML file, if htmlwidgets are detected in your post.
If you’re using knitr-jekyll, all that’s needed to make everything work as you’d expect, is to call the function from your build.R file, like so:
local({
# Your existing configurations...
# See https://github.com/yihui/knitr-jekyll/blob/gh-pages/build.R
brocks::htmlwidgets_deps(a)
})
(The parameter a refers to the input file — if you’re using a build file anything like Yihui’s example, this will work fine.)
If you’d like to have a look at the internals of htmlwidgets_deps yourself, it’s in my personal package up on GitHub. Long story short, it hijacks rmarkdown:::html_dependencies_as_string. The rest of this post walks through what it actually does.
1. Copying dependencies to your site
To keep things transparent, the dependency source files are kept in their own folder (./htmlwidgets_deps). If it doesn’t exist, it’ll be created. This behaviour is different to the rmarkdown default of compressing everything into huge in-line data:uri blobs. While that works great for keeping everything in one big self-contained file (e.g. to email to someone), it makes for a very slow web page. For a blog, having separate files is preferable, as it allows the browser to load files asynchronously, reducing the load time.
After compiling your site, if you’ve used htmlwidgets you’ll have an extra directory within your blog, containing the source for all the dependencies, a bit like this:
2. Adding the HTML pointers
Once all the dependencies are ready to be served from your site, you still need to add HTML pointers to your blog post, so that it knows where to find them. htmlwidgets_deps automates this by adding a file for each htmlwidgets post to the ./_includes directory (which is where Jekyll goes to look for HTML files to include). For each post which requires it, the extra HTML file will be generated in the htmlwidgets sub-directory, like this:
The HTML comes pre-wrapped in the usual liquid syntax.
3. Including the extra HTML
Now you have a little file to include, you just need to get it into the HTML of the blog post. Jekyll’s templating system liquid is all about doing this.
Because htmlwidgets_deps gives the dependency file the same name as your .Rmd input (and thus the post), it’s quite easy to write a short {% include %} statement, based on the name of the page itself. However, things get tricky if the file doesn’t exist. By default, htmlwidgets_deps only produces files when necessary (e.g. when you are actually using htmlwidgets). To handle this, I used a plugin, providing the file_exists function.
Adding the following to the bottom of ./_layouts/default.html did the trick. You could also use ./_layouts/post.html if you wanted to. It’s a good idea to put it towards the bottom, otherwise the page won’t load until all the htmlwidgets dependencies are loaded, which could make things feel rather slow.
The solution above proves a little tricky if you’re using GitHub pages, as this doesn’t allow plugins. While I’m sure an expert with the liquid templating engine could come up with a brilliant solution to this, in lieu, I present a filthy untested hack.
By setting the htmlwidgets_deps parameter always = TRUE, a dependencies file will always be produced, even if no htmlwidgets are detected (the file will be empty). This means that you can do away with the logic part (and the plugin), and simply add the lines:
The disadvantage is that you’ll end up with some empty HTML files in ./_includes/htmlwidgets/, which may or may not bother you. If you’re only going to be using htmlwidgets for blog posts (and not the rest of your site) I’d recommend doing this for the ./_layouts/post.html file, (as opposed to default.html) so that other pages don’t have trouble finding dependencies they don’t need.
If you give this a crack, let me know!
How to do the same
In summary:
Add the snippet of liquid syntax to one of your layout files
Add the following line to your build.R file, just below the call to knitr::knit
brocks::htmlwidgets_deps(a)
And you should be done!
Showing Off
After all that, it would be a shame not to show off some interactive visualisations. Here are some of the htmlwidgets packages I’ve had the chance to muck about with so far.
MetricsGraphics
MetricsGraphics.js is a JavaScript API, built on top of d3.js, which allows you to produce a lot of common plots very quickly (without having to start from scratch each time). There’s a few libs like this, but MetricsGraphics is especially pleasing. Huge thanks to Ali Almossawi and Mozilla, and also to Bob Rudis for the R interface.
leaflet
Here’s the Pride of Spitalfields, which I occasionally pine for, from beneath the palm trees of sunny California.
library(leaflet)
m <- leaflet() %>%
addTiles() %>% # Add default OpenStreetMap map tiles
addMarkers(lng = -0.07125, lat = 51.51895,
popup = "Reasonably Priced Stella Artois")
m
threejs
three.js is a gobsmackingly brilliant library for creating animated, interactive 3D graphics from within a Web browser. Here’s an interactive 3D globe with the world’s populations mapped as, erm, light-sabers. Probably not as informative as a base graphics plot, but it is much more Bond-villainish. Drag it around and have a zoom!
library("threejs")
library("maps")
##
## # ATTENTION: maps v3.0 has an updated 'world' map. #
## # Many country borders and names have changed since 1990. #
## # Type '?world' or 'news(package="maps")'. See README_v3. #
data(world.cities, package = "maps")
cities <- world.cities[order(world.cities$pop,decreasing = TRUE)[1:1000],]
value <- 100 * cities$pop / max(cities$pop)
# Set up a data color map and plot
col <- rainbow(10, start = 2.8 / 6, end = 3.4 / 6)
col <- col[floor(length(col) * (100 - value) / 100) + 1]
globejs(lat = cities$lat, long = cities$long, value = value, color = col,
atmosphere = TRUE)
Last week witnessed a number of exciting announcements from the big data and machine learning space. What it shows is that there are still lots of problems to solve in 1) working with and deriving insights from big data, and 2) integrating those insights into business processes.
TensorFlow
Probably the biggest (data) headline was that Google open sourced TensorFlow, their graph-based computing framework. Many stories refer to TensorFlow as Google’s AI engine, but it is actually a lot more. Indeed, like Spark and Hadoop, it encompasses a computing paradigm based on a directed, acyclic graph (DAG). DAGs have been around in the world of mathematics since the days of Euler, and have been used in computer science for decades. The past 10-15 years have seen DAGs become popular as a way to model systems, with noteworthy examples being SecDB/Slang from Goldman Sachs and its derivatives (Athena, Quartz, etc.).
What differentiates TensorFlow is that it transparently scales across various hardware platforms, from smartphones to GPUs to clusters. Anyone who’s tried to do parallel computing in R knows how significant this seamless scaling can be. Second, TensorFlow has built-in primitives for modeling recurrent neural networks, which are used for Deep Learning. After Spark, TensorFlow delivers the final nail in the coffin for Hadoop. I wouldn’t be surprised if in a few years the only thing remaining in the Hadoop ecosystem is HDFS.
A good place to get started with TensorFlow is their basic MNIST handwriting tutorial. Note that TensorFlow has bindings for Java, Python, C/C++. One of their goals of open sourcing TensorFlow is to see more language bindings. One example is this simple R binding via RPython, although integrating with Rcpp is probably preferred. If anyone is interested in collaborating on proper R bindings, do reach out via the comments.
Tensors
What’s in a name exactly? A tensor is a mathematical object that is commonly said to generalize vectors. For the most part the TensorFlow documentation refers to tensors as multidimensional arrays. Of course, there’s more to the story, and for the mathematically inclined, you’ll see that tensors are referred to as functions, just like matrix operators. The mechanics of tensors are nicely described in Kolecki’s An Introduction To Tensors For Students Of Physics And Engineering published by NASA and this (slightly terse) chapter on tensors from U Miami.
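In that “tensors as functions” view (a compact statement of my own, not taken from the references above), a type-$(p, q)$ tensor on a real vector space $V$ is a multilinear map

$$ T \colon \underbrace{V^{*} \times \cdots \times V^{*}}_{p} \times \underbrace{V \times \cdots \times V}_{q} \to \mathbb{R}, $$

linear in each argument separately; a vector is the $(1, 0)$ case, a linear operator (matrix) the $(1, 1)$ case, and the multidimensional arrays in the documentation are the coordinates of such a map once a basis is fixed.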
Ufora
Another notable computing platform is Ufora, founded by Braxton McKee. Braxton’s platform differs from TensorFlow and the others I mentioned in that it doesn’t impose a computing paradigm on you. All the magic is behind the scenes, where the platform acts as a dynamic code optimizer, figuring out how to parallelize operations as they happen.
What made the headlines is that Ufora decided to open source their kit as well. This is really great for everyone, as their technology will likely find its way into all sorts of places. A good place to start is the codebase on github. Do note that you’ll need to roll up your sleeves for this one.
PCA and K-Means
Last week in my class, we discussed ways of visualizing multidimensional data. Part of the assignment was clustering data via k-means. One student suggested using PCA to reduce the dimensions into 3-space so it could be visualized. In Kuhn & Johnson, PCA is cited as a useful data preparation step to remove noise in extra dimensions. This suggests pre-processing with PCA and then applying k-means. Which is right?
It turns out that PCA and k-means are intimately connected. In K-means Clustering via Principal Component Analysis, Ding and He prove that PCA is actually the continuous solution of the cluster membership indicators of k-means. Whoa, that was a mouthful. To add some color, clustering algorithms are typically discrete: an element is either in one cluster or another, but not both. In this paper, the authors show that if cluster membership is considered continuous (akin to probabilities), then the k-means solution is the same as applying PCA!
Back to the original question, in practice both approaches are valid and it really boils down to what you want to accomplish. If your goal is to remove noise, pre-processing with PCA is appropriate. If the dataset becomes easily visualized, that’s a nice side effect. On the other hand, if the original space is already optimal, then there’s no harm in clustering and reducing dimensions via PCA afterward for visualization purposes. If you take this approach, I think it’s wise to communicate to your audience that the visualization is an approximation of the relationships.
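As a toy illustration of the second approach (cluster in the original space, then use PCA only for the picture), using a built-in dataset of my choosing:

set.seed(42)
X <- scale(iris[, 1:4])

# cluster in the original (4-dimensional) feature space
km <- kmeans(X, centers = 3, nstart = 25)

# reduce to two dimensions afterwards, purely for visualisation
pc <- prcomp(X)
plot(pc$x[, 1:2], col = km$cluster, pch = 19,
     main = "k-means clusters shown on the first two principal components")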
What are your thoughts on using PCA for visualization? Add tips and ideas in the comments.
Something Wow
It’s true, the jet pack is finally a reality. Forty years after their first iteration of the RocketBelt, the inventors have succeeded in improving the flight time (from 30 seconds 40 years ago) to 10 minutes, with a top speed of around 100 km/h.
The second, annual H2O World conference finished up yesterday. More than 700 people from all over the US attended the three-day event that was held at the Computer History Museum in Mountain View, California; a venue that pretty much sits well within the blast radius of ground zero for Data Science in the Silicon Valley. This was definitely a conference for practitioners and I recognized quite a few accomplished data scientists in the crowd. Unlike many other single-vendor productions, this was a genuine Data Science event and not merely a vendor showcase. H2O is a relatively small company, but they took a big league approach to the conference with an emphasis on cultivating the community of data scientists and delivering presentations and panel discussions that focused on programming, algorithms and good Data Science practice.
The R based sessions I attended on the tutorial day were all very well done. Each was designed around a carefully crafted R script performing a non-trivial model building exercise and showcasing one or more of the various algorithms in the H2O repertoire including GLMs, Gradient Boosting Machines, Random Forests and Deep Learning Neural Nets. The presentations were targeted to a sophisticated audience with considerable discussion of pros and cons. Deep Learning is probably H2O's signature algorithm, but despite its extremely impressive performance in many applications nobody here was selling it as the answer to everything.
The following code fragment from a script (Download Deeplearning) that uses deep learning to identify a spiral pattern in a data set illustrates the current look and feel of H2O's R interface. Any function that begins with h2o. runs in the JVM, not in the R environment. (Also note that if you want to run the code you must first install Java on your machine; the Java Runtime Environment will do. Then download the H2O R package, version 3.6.0.3, from the company's website. The scripts will not run with the older version of the package on CRAN.)
### Cover Type Dataset
# We import the full cover type dataset (581k rows, 13 columns, 10 numerical, 3 categorical).
# We also split the data 3 ways: 60% for training, 20% for validation (hyper parameter tuning)
# and 20% for final testing.
df <- h2o.importFile(path = normalizePath("../data/covtype.full.csv"))
dim(df)
df

splits <- h2o.splitFrame(df, c(0.6, 0.2), seed = 1234)
train <- h2o.assign(splits[[1]], "train.hex")  # 60%
valid <- h2o.assign(splits[[2]], "valid.hex")  # 20%
test  <- h2o.assign(splits[[3]], "test.hex")   # 20%

# Here's a scalable way to do scatter plots via binning (works for categorical and numeric
# columns) to get more familiar with the dataset.
dev.new(noRStudioGD = FALSE)  # direct plotting output to a new window
par(mfrow = c(1, 1))          # reset canvas
plot(h2o.tabulate(df, "Elevation", "Cover_Type"))
plot(h2o.tabulate(df, "Horizontal_Distance_To_Roadways", "Cover_Type"))
plot(h2o.tabulate(df, "Soil_Type", "Cover_Type"))
plot(h2o.tabulate(df, "Horizontal_Distance_To_Roadways", "Elevation"))

### First Run of H2O Deep Learning
# Let's run our first Deep Learning model on the covtype dataset.
# We want to predict the `Cover_Type` column, a categorical feature with 7 levels, and the
# Deep Learning model will be tasked to perform (multi-class) classification. It uses the
# other 12 predictors of the dataset, of which 10 are numerical, and 2 are categorical with
# a total of 44 levels. We can expect the Deep Learning model to have 56 input neurons
# (after automatic one-hot encoding).
response <- "Cover_Type"
predictors <- setdiff(names(df), response)
predictors

# To keep it fast, we only run for one epoch (one pass over the training data).
m1 <- h2o.deeplearning(
  model_id = "dl_model_first",
  training_frame = train,
  validation_frame = valid,    ## validation dataset: used for scoring and early stopping
  x = predictors,
  y = response,
  #activation = "Rectifier",   ## default
  #hidden = c(200, 200),       ## default: 2 hidden layers with 200 neurons each
  epochs = 1,
  variable_importances = TRUE  ## not enabled by default
)
summary(m1)

# Inspect the model in [Flow](http://localhost:54321/) for more information about model
# building etc. by issuing a cell with the content `getModel "dl_model_first"`, and
# pressing Ctrl-Enter.

### Variable Importances
# Variable importances for Neural Network models are notoriously difficult to compute, and
# there are many [pitfalls](ftp://ftp.sas.com/pub/neural/importance.html). H2O Deep Learning
# has implemented the method of [Gedeon](http://cs.anu.edu.au/~./Tom.Gedeon/pdfs/ContribDataMinv2.pdf),
# and returns relative variable importances in descending order of importance.
head(as.data.frame(h2o.varimp(m1)))

### Early Stopping
# Now we run another, smaller network, and we let it stop automatically once the
# misclassification rate converges (specifically, if the moving average of length 2 does not
# improve by at least 1% for 2 consecutive scoring events). We also sample the validation
# set to 10,000 rows for faster scoring.
m2 <- h2o.deeplearning(
  model_id = "dl_model_faster",
  training_frame = train,
  validation_frame = valid,
  x = predictors,
  y = response,
  hidden = c(32, 32, 32),                 ## small network, runs faster
  epochs = 1000000,                       ## hopefully converges earlier...
  score_validation_samples = 10000,       ## sample the validation dataset (faster)
  stopping_rounds = 2,
  stopping_metric = "misclassification",  ## could be "MSE","logloss","r2"
  stopping_tolerance = 0.01
)
summary(m2)
plot(m2)
First notice that it all looks pretty much like R code. The script mixes standard R functions and H2O functions in a natural way. For example, h2o.tabulate() produces an object of class "list", and h2o.deeplearning() yields a model object that plot can deal with. This is just baseline stuff that has to happen to make H2O coding feel like R. But note that the H2O code goes beyond this baseline requirement. The functions h2o.splitFrame() and h2o.assign() manipulate data residing in the JVM in a way that will probably seem natural to most R users, and the function signatures also seem close enough to "R like" to go unnoticed. All of this reflects the conscious intent of the H2O designers not only to provide tools to facilitate the manipulation of H2O data from the R environment, but also to try and replicate the R experience.
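To make the division of labour concrete, here is a minimal sketch (mine, not from the conference script) of how data ends up in the JVM in the first place:

library(h2o)

h2o.init(nthreads = -1)  # start (or connect to) a local H2O instance running in the JVM
hf <- as.h2o(mtcars)     # copy an R data frame into the JVM
class(hf)                # an H2OFrame: a reference to data held by H2O, not an R copy
dim(hf)                  # evaluated by the JVM, mirroring dim(df) in the script above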
An innovative new feature of the h2o.deeplearning() function itself is the ability to specify a stopping metric. The parameter setting stopping_metric = "misclassification" (other options include "MSE", "logloss" and "r2") in the specification of model m2 means that training stops early once the misclassification rate stops improving by the specified tolerance, rather than running to completion. In most cases, this will produce a useful model in much less time than it would take to have the learner run to completion. The following plot, generated in the script referenced above, shows the kind of problem for which the Deep Learning algorithm excels.
Highlights of the conference for me included the presentations listed below. The videos and slides (when available) from all of these presentations will be posted on the H2O conference website. Some have been posted already and the rest should follow soon. (I have listed the dates and presentation times to help you locate the slides when they become available)
Madeleine Udell (11-11: 10:30AM) presented the mathematics underlying the new algorithm, Generalized Low Rank Models (GLRM), she developed as part of her PhD work under Stephen Boyd, professor at Stanford University and adviser to H2O. This algorithm which generalizes PCA to deal with heterogeneous data types shows great promise for a variety of data science applications. Among other things, it offers a scalable way to impute missing data. This was possibly the best presentation of the conference. Madeleine is an astonishingly good speaker; she makes the math exciting.
Anqi Fu (11-9: 3PM) presented her H2O implementation of the GLRM. Anqi not only does a great job of presenting the algorithm, she also offers some real insight into the challenges of turning the mathematics into production level code. You can download one of Anqi's demo R scripts here: Download Glrm.census.labor.violations. To my knowledge, Anqi's code is the only scalable implementation of the GLRM. (Madeleine wrote the prototype code in Julia.)
Matt Dowle (11-10), of data.table fame, demonstrated his port of data.table's lightning-fast radix sorting algorithm to H2O. Matt showed a 1B row x 1B row table join that runs in about 1.45 minutes on a 4 node, 128 core H2O cluster. This is a very impressive result, but Matt says he can already do 10B x 10B row joins, and is shooting for 100B x 100B rows.
Professor Rob Tibshirani (11-11: 11AM) presented work he is doing that may lead to lasso based models capable of detecting the presence of cancer in tissue extracted from patients while they are on the operating table! He described "Customized Learning", a method of building individual models for each patient. The basic technique is to pool the data from all of the patients and run a clustering algorithm. Then, for each patient fit a model using only the data in the patient's cluster. This is exciting work with the real potential to save lives.
Professor Stephen Boyd (11-10: 11AM) delivered a tutorial on optimization starting with basic convex optimization problems and then went on to describe Consensus Optimization, an algorithm for building machine learning models from data stored at different locations without sharing the data among the locations. Professor Boyd is a lucid and entertaining speaker, the kind of professor you will wish you had had.
Arno Candel (11-9: 1:30PM) presented the Deep Learning model which he developed at H2O. Arno is an accomplished speaker who presents the details with great clarity and balance. Be sure to have a look at his slide showing the strengths and weaknesses of Deep Learning.
Erin LeDell (11-9: 3PM) de-mystified ensembles and described how to build an ensemble learner from scratch. Anyone who wants to compete in a Kaggle competition should find this talk to be of value.
Szilard Pafka (11-11: 3PM), in a devastatingly effective, low-key presentation, described his efforts to benchmark the open source machine learning platforms R, Python scikit-learn, Vowpal Wabbit, H2O, xgboost and Spark MLlib. Szilard downplayed his results, pointing out that they are in no way meant to be either complete or conclusive. Nevertheless, Szilard put considerable effort into the benchmarks. (He worked directly with all of the development teams for the various platforms.) Szilard did not offer any conclusions, but things are not looking all that good for Spark. The following slide plots AUC vs file size up to 10M rows.
Szilard's presentation should be available on the H2O site soon, but it is also available here.
I also found the Wednesday morning panel discussion on the "Culture of Data Driven Decision Making" and the Wednesday afternoon panel on "Algorithms -Design and Application" to be informative and well worth watching. Both panels included a great group of articulate and knowledgeable people.
If you have not checked in with H2O since the post I wrote last year, here, on one slide, is some of what they have been up to since then.
Congratulations to H2O for putting on a top notch event!
I was looking to show a more substantive piece of analysis using the World Development Indicators data, and at the same time show how to get started on fitting a mixed effects model with grouped time series data. The relationship between exports’ importance in an economy and economic growth forms a good start as it is of considerable theoretical and practical policy interest and has fairly reliable time series data for many countries. At the most basic level, there is a well known positive relationship between these two variables:
The relationship is made particularly strong by the small, rich countries in the top right corner. Larger countries, with their own large domestic markets, can flourish economically while being less exports-dependent than smaller countries – the USA being the example par excellence. If the regression is weighted by population, the relationship is much weaker than shown in the above diagram.
However, today, I’m looking at a different aspect of the relationship – changes over time. Partly this is because I’m genuinely interested, but mostly because I needed an example demonstrating fitting a mixed effects model to longitudinal data with a time series component.
The data come from the World Bank’s World Development Indicators (WDI), which I explored recently in my last post. I’m comparing “Exports of goods and services (% of GDP)” with “GDP per capita (constant 2000 US$)”. The WDI have at least some data on these variables for 186 countries, but different starting years for each (earliest being 1962). The data look like this, in a connected scatterplot showing the relationship between the two variables for 12 randomly chosen countries:
… and this, in a more straightforward time series line plot:
Close watchers will see that there are some country groupings in the dataset (eg “Pacific island small states”) that we don’t want, in addition to the individual countries we do. So our first job is to get rid of these. Here’s the R code that pulls in the data, draws the plots so far, and removes those country groupings.
library(WDI)
library(ggplot2)
library(scales)
library(dplyr)
library(tidyr)
library(showtext)  # for fonts
library(nlme)

# import fonts
font.add.google("Poppins", "myfont")
showtext.auto()
theme_set(theme_light(10, base_family = "myfont"))

#-------------------data imports, first explore, and clean-----------------
if (!exists("exports")) {
   # Exports of goods and services (% of GDP)
   exports <- WDI(indicator = "NE.EXP.GNFS.ZS", end = 2015, start = 1950)
   # GDP per capita (constant 2000 US$)
   gdp <- WDI(indicator = "NY.GDP.PCAP.KD", end = 2015, start = 1950)
}

both <- merge(exports, gdp) %>%
   rename(exports = NE.EXP.GNFS.ZS, gdp = NY.GDP.PCAP.KD) %>%
   # removing any year-country combos missing either gdp or exports:
   filter(!(is.na(exports) | is.na(gdp))) %>%
   arrange(country, year)

# let's look at 12 countries at a time
all_countries <- unique(both$country)
sampled <- both %>% filter(country %in% sample(all_countries, 12))

# connected scatter plot:
p1 <- sampled %>%
   ggplot(aes(y = gdp, x = exports, colour = year)) +
   facet_wrap(~country, scales = "free") +
   geom_path() +
   labs(x = "Exports as a percentage of GDP") +
   scale_y_continuous("GDP per capita, constant 2000 US dollars", label = dollar) +
   ggtitle("Exports and GDP over time, selected countries")

# univariate time series plots
p2 <- sampled %>%
   gather(variable, value, -(iso2c:year)) %>%
   ggplot(aes(x = year, y = value)) +
   geom_line() +
   facet_wrap(~ country + variable, scales = "free_y")

# how to get rid of the country groups?
unique(both[, c("iso2c", "country")]) %>% arrange(iso2c)
# they have a number as first or second digit, or X or Z as first digit (but ZW ZA ZM legit)
both2 <- both %>%
   filter(!grepl("X.", iso2c) & !grepl("[0-9]", iso2c)) %>%
   filter(!grepl("Z.", iso2c) | iso2c %in% c("ZW", "ZA", "ZM")) %>%
   filter(!iso2c %in% c("OE", "EU"))

#------------------------simple cross section used for first plot-------------------
country_sum <- both2 %>%
   group_by(country) %>%
   filter(year == max(year))

p3 <- country_sum %>%
   ggplot(aes(x = exports / 100, y = gdp, label = country)) +
   geom_text(size = 3) +
   geom_smooth(method = "lm") +
   scale_y_log10(label = dollar) +
   scale_x_log10(label = percent) +
   labs(x = "Exports as a percentage of GDP",
        y = "GDP per capita, constant 2000 US dollars",
        title = "Cross section of exports and GDP, latest years when data are present")
The plan is to fit an appropriate time series model with GDP as a response variable and exports as a percentage of GDP as an explanatory variable, allowing country to be a group random effect, and the country-year individual randomness to be related year to year as a time series. This means a mixed effects model. There’s a bit of transforming I need to do:
To avoid spurious correlation I’ll need to transform the data until each time series is at least approximately stationary. I’ll do this by taking differences of the logarithm of the series, which has the same effect as looking at year-on-year growth rates (see the small numerical example after this list).
I’ll also need some kind of control for the nuisance factor that each country has a different starting year for its data.
I want to control for the possibility that the absolute value of exports-orientation at the start of recorded data has a persistent impact, so I need to have that absolute level (or the logarithm of it, which has nicer properties) as a country-level variable.
I’m interested in whether a change in exports has a persistent impact, so I’ll need a lagged value of my transformed exports variable. Otherwise we might just be picking up the inherent impact on GDP of a change in exports or their prices, with the domestic contribution to GDP kept constant. If there’s a persistent effect on future years from a change in exports-orientation in one year, it’s much more likely to reflect something significant economically (still not necessarily causal though).
Even after taking first differences of the logarithms of my two main variables (GDP, and Exports as a percentage of GDP), I’m still a little worried that in particular countries there will be non-stationary time series that result in spurious associations. For example, if the growth rates of both variables are steadily growing or declining. To reduce this risk I’m going to include year in my model, so at least any linear trend of that sort is removed before looking at whether the two variables move together.
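For instance, a series growing at 10% a year gives first differences of logs of about 0.095, i.e. the continuously-compounded growth rate:

x <- c(100, 110, 121)   # 10% growth each year
diff(log(x))            # both differences equal log(1.1), roughly 0.0953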
Here’s the distribution of the first recorded value of the logarithm of exports as a percentage of gdp:
Here’s the connected scatter plot of the differenced logarithms (effectively, growth rates):
Here’s the traditional time series plots:
And here’s the code that does the transformations and plots:
# first we want to know the value of exports at the beginning of each series
first_values <- both2 %>%
   group_by(country) %>%
   filter(year == min(year)) %>%
   ungroup() %>%
   mutate(exports_starter = log(exports), first_year = year) %>%
   select(iso2c, exports_starter, first_year)

# all those individual time series have problems of non-stationarity and of course autocorrelation,
# so when we merge with the starting values of exports we also calculate first
# differences of logarithms
both3 <- both2 %>%
   left_join(first_values, by = "iso2c") %>%
   arrange(country, year) %>%
   group_by(country) %>%
   mutate(exports_g = c(NA, diff(log(exports))),
          gdp_g = c(NA, diff(log(gdp))),
          exports_lag = lag(exports_g)) %>%
   ungroup() %>%
   filter(!is.na(gdp_g)) %>%
   filter(!is.na(exports_lag))

p4 <- ggplot(both3, aes(x = exports_starter)) + geom_density()

all_countries2 <- unique(both3$country)
set.seed(123)
sampled <- both3 %>% filter(country %in% sample(all_countries2, 12))

p5 <- sampled %>%
   select(country, year, exports_g, gdp_g) %>%
   gather(variable, value, -(country:year)) %>%
   ggplot(aes(x = year, y = value)) +
   geom_line() +
   facet_wrap(~ country + variable, scales = "free_y")

p6 <- sampled %>%
   select(country, year, exports_g, gdp_g) %>%
   ggplot(aes(x = exports_g, y = gdp_g, colour = year)) +
   geom_smooth(colour = "red", method = "lm") +
   geom_path() +
   geom_point() +
   facet_wrap(~ country, scales = "free") +
   labs(x = "change in logarithm of exports share of GDP",
        y = "change in logarithm of GDP per capita, constant 2000 USD",
        title = "Changing importance of exports and GDP growth, selected countries")
So we’re ready for some modelling. After all the data transformations, this part is relatively easy. I use a mixed effects model that lets the key effects vary in each country, around an average level of the effect.
I tried a few simpler models than below and checked the auto-correlation functions to be sure the time series part was needed (it is; ACF() returns an autocorrelation at lag 1 of around 0.23, with the decay of autocorrelation in subsequent lags characteristic of a fairly simple AR process). So here we go:
model4 <- lme(fixed = gdp_g ~ ordered(first_year) + exports_g + exports_lag + exports_starter + year,
              random = ~ exports_g + exports_lag + year | country,
              correlation = corAR1(form = ~ year | country),
              data = both3)
Just to visualise what this model is doing, imagine the connected scatter plot of the differenced logarithms of each variable, as shown earlier. We’re drawing a diagonal line of best fit on each facet, with the slope and intercept of each line allowed to differ for each country. The average slope is the average impact of changes in exports as a percent of GDP on GDP growth. Now add to this the complication of the other variables – the first year the data starts (basically just a nuisance variable we want to control for); each country’s absolute level of exports as a percentage of GDP when its data started; the lagged value of exports (which is actually the number of most interest); and year, with which we’re trying to control for any linear trend.
The “correlation = corAR1(…)” part is important, as it means that when we come to conduct inference (t statistics, p values and confidence intervals) the model takes into account that each observation in a particular country is related to the previous year’s – they aren’t as valuable as independent and identically distributed data from a simple random sample. Failing to include this factor would mean that inference was biased in the direction of concluding results are significant that are in fact due to chance.
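For reference, a sketch of how the fixed-effect estimates and their t statistics reported below can be pulled out of the fitted lme object:

summary(model4)$tTable              # fixed effects, standard errors, t and p values
intervals(model4, which = "fixed")  # approximate confidence intervals for the fixed effects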
And here are the results. Bear in mind that:
the response variable is change in the logarithm of GDP;
exports_g is change in the logarithm of exports as a percentage of GDP;
exports_lag is the lagged value of exports_g;
exports_starter is the logarithm of the starting value of exports as a percentage of GDP;
year is year, and we are testing for a linear trend in growth of GDP
Not wanting to give too precise an interpretation of this without a bit more theory and thinking, it suggests for countries on average:
evidence of impact (on GDP growth) of a change in exports as a percentage of GDP, and the impact is in the same direction;
the impact persists beyond the immediate year into subsequent years;
no evidence of a systematic change in growth rates of GDP over time simply related to time and nothing else in the model;
no evidence that the starting absolute value of exports as a percentage of GDP impacts on subsequent GDP growth rates
As exports_g, exports_lag and year were all allowed to be random effects, i.e. to take different values by country, their differing values for each country are of interest. The figures above reflect an overall effect of these variables; the actual value in any country is an (approximately) normally distributed random variable. Here’s how their densities look:
Each observation (the little rug marks on the horizontal axis) is the size of the effect for a particular country. So we can see that while the overall value of exports_lag – the persistent impact of a change in exports as a percentage of GDP on GDP growth – was 0.021 as per the table above, any particular country has a value that is lower or higher than that, with a reasonable number of countries seeing a negative impact.
The value for the exports_g parameter plus the exports_lag parameter gives a crude sense of the overall impact, for a particular country, of changing importance of exports on GDP growth. As can be seen from the top right panel of the above set of plots, this combined value is generally but not always positive.
All up, while we might conclude that for countries “on average” there is a positive relationship between growth in the importance of exports and GDP growth, for any particular country the relationship will vary and may in fact be negative. Further investigation would need to place this in a more theoretical context, and control for other variables that might be confounding the results; but the above is probably enough for an exploratory blog post.
Here’s how to extract those country-level effects from our mixed effects model object:
cf <- coef(model4)

# let's look at just the last four coefficients ie exclude the nuisance of first_year
cf_df <- cf[, (ncol(cf) - 3):ncol(cf)] %>%
  as.data.frame() %>%
  mutate(country = rownames(cf),
         combined = exports_lag + exports_g) %>%
  arrange(combined)

p7 <- cf_df %>%
  gather(variable, value, -country, -exports_starter) %>%
  mutate(variable = gsub("combined", "exports_g + exports_lag", variable, fixed = TRUE)) %>%
  ggplot(aes(x = value)) +
  facet_wrap(~variable, scales = "free") +
  geom_density() +
  geom_rug()
These 6 visualizations were created in Plotly between 2014 and 2016 and are in some way related to machine learning. They were created using Plotly’s free and open-source graphing libraries for Python and R.
In a previous post, we had ‘mapped’ the culinary diversity in India through a visualization of food consumption patterns. Since then, one of the topics in my to-do list was a visualization of world cuisines. The primary question was similar to that asked of the Indian cuisine: Are cuisines of geographically and culturally closer regions also similar? I recently came across an article on the analysis of recipe ingredients that distinguish the cuisines of the world. The analysis was conducted on a publicly available dataset consisting of ingredients for more than 13,000 recipes from the recipe website Epicurious. Each recipe was also tagged with the cuisine it belonged to, and there were a total of 26 different cuisines. This dataset was initially reported in an analysis of flavor network and principles of food pairing.
In this post, we take another look at the Epicurious recipe dataset and perform an exploratory analysis and visualization of ingredient frequencies among cuisines. Ingredients that are frequently found in a region’s recipes would also have high consumption in that region, and so an analysis of the ‘ingredient frequency’ of a cuisine should give us similar information to an analysis of ‘ingredient consumption’.
Outline of Analysis Method
Here is a part of the first few lines of data from the Epicurious dataset:
Vietnamese  vinegar  cilantro  mint  olive_oil  cayenne  fish  lime_juice
Vietnamese  onion  cayenne  fish  black_pepper  seed  garlic
Vietnamese  garlic  soy_sauce  lime_juice  thai_pepper
Vietnamese  cilantro  shallot  lime_juice  fish  cayenne  ginger  pea
Vietnamese  coriander  vinegar  lemon  lime_juice  fish  cayenne  scallion
Vietnamese  coriander  lemongrass  sesame_oil  beef  root  fish
…
Each row of the dataset lists the ingredients for one recipe, and the first column gives the cuisine the recipe belongs to. As the first step in our analysis, we collect ALL the ingredients for each cuisine (over all the recipes for that cuisine). Then we calculate the frequency of occurrence of each ingredient in each cuisine and normalize the frequencies for each cuisine by the number of recipes available for that cuisine. This matrix of normalized ingredient frequencies is used for further analysis.
We use two approaches for the exploratory analysis of the normalized ingredient frequencies: (1) a heatmap and (2) principal component analysis (PCA), followed by display using biplots. The complete R code for the analysis is given at the end of this post.
Results
There are a total of 350 ingredients occurring in the dataset (among all cuisines). Some of the ingredients occur in just one cuisine, which, though interesting, will not be of much use for the current analysis. For better visual display, we restrict attention to ingredients showing the most variation in normalized frequency across cuisines. The results are shown below:
Heatmap:
Biplot:
The figures look self-explanatory and do show the clustering together of geographically nearby regions on the basis of commonly used ingredients. Moreover, we also notice the grouping together of regions with historical travel patterns (North European and American, Spanish_Portuguese and SouthAmerican/Mexican) or historical trading patterns (Indian and Middle East).
We need to further test the stability of the groupings obtained here by including data from the Allrecipes dataset. Also, taking the third principal component might dissipate some of the crowding along the PC2 axis. These would be some of the tasks for the next post…
Here is the complete R code used for the analysis:
workdir <- "C:\\Path\\To\\Dataset\\Directory"  ## note: backslashes must be escaped (or use forward slashes) in R paths
datafile <- file.path(workdir,"epic_recipes.txt")
data <- read.table(datafile, fill=TRUE, col.names=1:max(count.fields(datafile)),
na.strings=c("", "NA"), stringsAsFactors = FALSE)
a <- aggregate(data[,-1], by=list(data[,1]), paste, collapse=",")
a$combined <- apply(a[,2:ncol(a)], 1, paste, collapse=",")
a$combined <- gsub(",NA","",a$combined) ## this column contains the totality of all ingredients for a cuisine
cuisines <- as.data.frame(table(data[,1])) ## Number of recipes for each cuisine
freq <- lapply(lapply(strsplit(a$combined,","), table), as.data.frame) ## Frequency of ingredients
names(freq) <- a[,1]
prop <- lapply(seq_along(freq), function(i) {
colnames(freq[[i]])[2] <- names(freq)[i]
freq[[i]][,2] <- freq[[i]][,2]/cuisines[i,2] ## proportion (normalized frequency)
freq[[i]]}
)
names(prop) <- a[,1] ## this is a list of 26 elements, one for each cuisine
final <- Reduce(function(...) merge(..., all=TRUE, by="Var1"), prop)
row.names(final) <- final[,1]
final <- final[,-1]
final[is.na(final)] <- 0 ## If ingredient missing in all recipes, proportion set to zero
final <- t(final) ## proportion matrix
s <- sort(apply(final, 2, sd), decreasing=TRUE)
## Selecting ingredients with maximum variation in frequency among cuisines and
## Using standardized proportions for final analysis
final_imp <- scale(subset(final, select=names(which(s > 0.1))))
## heatmap
library(gplots)
heatmap.2(final_imp, trace="none", margins = c(6,11), col=topo.colors(7),
key=TRUE, key.title=NA, keysize=1.2, density.info="none")
## PCA and biplot
p <- princomp(final_imp)
biplot(p,pc.biplot=TRUE, col=c("black","red"), cex=c(0.9,0.8),
xlim=c(-2.5,2.5), xlab="PC1, 39.7% explained variance", ylab="PC2, 24.5% explained variance")
It’s the time of the year again where one eats too much and gets in a reflective mood! 2015 is nearly over, and we bloggers here at opiateforthemass.es thought it would be nice to argue endlessly about which R package was the best/neatest/most fun/most useful/most whatever of this year!
Since we are in a festive mood, we decided we would not fight it out but rather present our top five new R packages, a purely subjective list of packages we (and Chuck Norris) approve of.
But do not despair, dear reader! We have also pulled hard data on R package popularity from CRAN, and will present this first.
Top Popular CRAN packages
Let’s start with some factual data before we go into our personal favourites of 2015. We’ll pull the titles of the new 2015 R packages from cranberries, and parse the CRAN downloads per day using the cranlogs package.
Using downloads per day as a ranking metric could have the problem that earlier package releases have had more time to create a buzz and shift up the average downloads per day, skewing the data in favour of older releases. Or, it could have the complication that younger package releases are still on the early “hump” part of the downloads (let’s assume they’ll follow a log-normal (exponential decay) distribution, which most of these things do), thus skewing the data in favour of younger releases. I don’t know, and this is an interesting question I think we’ll tackle in a later blog post…
For now, let’s just assume that average downloads per day is a relatively stable metric to gauge package success with. We’ll grab the packages released using rvest:
library(rvest)   # also uses the %>% pipe (magrittr/dplyr)

berries <- read_html("http://dirk.eddelbuettel.com/cranberries/2015/")
titles <- berries %>% html_nodes("b") %>% html_text()
new <- titles[grepl("^New package", titles)] %>%
  gsub("^New package (.*) with initial .*", "\\1", .) %>%
  unique()
and then loop over these titles (here with pblapply() from the pbapply package), querying CRAN for their respective average downloads per day:
library(cranlogs)
library(pbapply)

logs <- pblapply(new, function(x) {
  down <- cran_downloads(x, from = "2015-01-01")$count
  if (sum(down) > 0) {
    public <- down[which(down > 0)[1]:length(down)]
  } else {
    public <- 0
  }
  return(data.frame(package = x, sum = sum(down), avg = mean(public)))
})
logs <- do.call(rbind, logs)
With some quick dplyr and ggplot magic, these are the top 20 new CRAN packages from 2015, by average number of daily downloads:
As we can see, the main bias does not come from our choice of ranking metric, but by the fact that some packages are more “under the hood” and are pulled by many packages as dependencies, thus inflating the download statistics.
The top four packages (rversions, xml2, git2r, praise) are all technical packages. Although I have to say I did not know of praise so far, and it looks like it’s a very fun package, indeed: you can automatically add randomly generated praises to your output! Fun times ahead, I’d say.
Excluding these, the clear winners among “frontline” packages are readxl and readr, both packages by Hadley Wickham dealing with importing data into R. Well-deserved, in our opinion. These are packages nearly everybody working with data will need on a daily basis. Although, one hopes that contact with Excel sheets is kept to a minimum to ensure one’s sanity, and thus readxl is needed less often in daily life!
The next two packages (DiagrammeR and visNetwork) relate to network diagrams, something that seems to be en vogue currently. R is getting some much-needed features on these topics, it seems.
plotly is the R interface to the recently open-sourced and popular plot.ly JavaScript libraries for interactive charts. A well-deserved top ranking entry! We also see packages that build on and improve the ever-popular shiny package (DT and shinydashboard), leaflet dealing with interactive mapping, and packages for Stan, the Bayesian statistical inference language (rstan, StanHeaders).
But now, this blog’s authors’ personal top five of new R packages for 2015:
readr is our package pick that also made it into the top downloads metric above. Small wonder, as it’s written by Hadley and aims to make importing data easier and, especially, more consistent. It is thus immediately useful for most, if not all, R users out there, and also received a tremendous “fame kickstart” from Hadley’s reputation within the R community. For extremely large datasets I still like to use data.table’s fread() function, but for anything else the new read_* functions make your life considerably easier. They’re faster than base R, and no longer having to worry about stringsAsFactors is a godsend all by itself.
Since the package is written by Hadley, it is not only great but also comes with fantastic documentation. If you’re not using readr currently, you should head over to the package readme and check it out.
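As a tiny illustration of the read_* functions mentioned above (the file name here is just a placeholder):

library(readr)
dat <- read_csv("my_data.csv")   # fast, consistent defaults, and no stringsAsFactors surprises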
R already has many template engines, but infuser is simple yet quite useful if you work on data exploration, visualization and statistics in R and deploy your findings in Python, while using the same SQL queries and syntax that is as similar as possible.
Code transition from R to Python is now quick and easy with infuser, like this:
# R
library(infuser)
template <- "SELECT {{var}} FROM {{table}} WHERE month = {{month}}"
query <- infuse(template, var = "apple", table = "fruits", month = 12)
cat(query)
# SELECT apple FROM fruits WHERE month = 12

# Python
template = "SELECT {var} FROM {table} WHERE month = {month}"
query = template.format(var="apple", table="fruits", month=12)
print(query)
# SELECT apple FROM fruits WHERE month = 12
googlesheets by Jennifer Bryan finally allows me to output directly to Google Sheets, instead of outputting to xlsx format and then pushing it (mostly manually) to Google Drive. At our company we use Google Drive as a data communication and storage tool for the management, so outputting data science results to Google Sheets is important. We even have some small reports stored in Google Sheets. The package allows for easy creating, finding, filling, and reading of Google Sheets with incredible simplicity of use.
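A hedged sketch of what that workflow can look like with googlesheets (the sheet title below is illustrative, and an interactive Google authorisation step is required):

library(googlesheets)
gs_auth()                                          # authorise with your Google account
ss <- gs_new("ds_results", input = head(mtcars))   # create a new sheet and fill it with data
gs_read(ss)                                        # read the contents back into R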
AnomalyDetection
(Kirill’s second pick. He gets to pick two since he is so indecisive)
AnomalyDetection was developed by Twitter’s data scientists and introduced to the open source community in the first week of the year. It is a very handy, beautiful, well-developed tool to find anomalies in data. Being able to find anomalies fast and reliably, before real damage occurs, is very important for a data scientist. The package allows you to get a good first impression of the things going on in your KPIs (Key Performance Indicators) and react quickly. Building alerts with it is a no-brainer if you want to monitor your data and assure data quality.
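A hedged sketch of typical AnomalyDetection usage (the example series ships with the package; it is not data from this post):

# devtools::install_github("twitter/AnomalyDetection")
library(AnomalyDetection)
data(raw_data)
res <- AnomalyDetectionTs(raw_data, max_anoms = 0.02, direction = "both", plot = TRUE)
res$anoms   # data frame of detected anomalies
res$plot    # ggplot object highlighting them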
emoGG definitely falls into the “most whatever” category of R package of the year. What this package does is fairly simple: it allows you to display emojis in your ggplot2 plots, either as plotting symbols or as a background. Under the hood, it adds a geom_emoji layer to your ggplot2 plots, in which you have to specify one or more emoji codes corresponding to the emojis you wish to plot. emoGG can be used to make visualisations more compelling and make plots convey more meaning, no doubt. But before anything else, it’s fun and a must-have for an avid emoji fan like me.
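A hedged sketch of the geom_emoji layer just described (the emoji code is an illustrative unicode hex id, not one from the post):

# devtools::install_github("dill/emoGG")
library(ggplot2)
library(emoGG)
ggplot(iris, aes(Sepal.Length, Sepal.Width)) +
  geom_emoji(emoji = "1f337")   # plot each point as a tulip emoji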
I’ve been working on a Shiny app and wanted to display some math equations. It’s possible to use LaTeX to show math using MathJax, as shown in this example from the makers of Shiny. However, by default, MathJax does not allow in-line equations, because the dollar sign is used so frequently. But I needed to use in-line math in my application. Fortunately, the folks who make MathJax show how to enable the in-line equation mode, and the Shiny documentation shows how to write raw HTML. Here’s how to do it.
R
Here I replicate the code from the official Shiny example linked above. The magic code is inserted into ui.R, just below withMathJax().
## ui.R
library(shiny)

shinyUI(fluidPage(
  title = 'MathJax Examples with in-line equations',
  withMathJax(),
  # section below allows in-line LaTeX via $ in mathjax
  # (the standard MathJax tex2jax configuration for $...$ inline math)
  tags$div(HTML("<script type='text/x-mathjax-config'>
                 MathJax.Hub.Config({tex2jax: {inlineMath: [['$','$']]}});
                 </script>")),
  helpText('An irrational number $\\sqrt{2}$ and a fraction $1-\\frac{1}{2}$'),
  helpText('and a fact about $\\pi$: $\\frac2\\pi = \\frac{\\sqrt2}2 \\cdot \\frac{\\sqrt{2+\\sqrt2}}2 \\cdot \\frac{\\sqrt{2+\\sqrt{2+\\sqrt2}}}2 \\cdots$'),
  uiOutput('ex1'),
  uiOutput('ex2'),
  uiOutput('ex3'),
  uiOutput('ex4'),
  checkboxInput('ex5_visible', 'Show Example 5', FALSE),
  uiOutput('ex5')
))
## server.R
library(shiny)

shinyServer(function(input, output, session) {
  output$ex1 <- renderUI({
    withMathJax(helpText('Dynamic output 1: $\\alpha^2$'))
  })
  output$ex2 <- renderUI({
    withMathJax(
      helpText('and output 2 $3^2+4^2=5^2$'),
      helpText('and output 3 $\\sin^2(\\theta)+\\cos^2(\\theta)=1$')
    )
  })
  output$ex3 <- renderUI({
    withMathJax(
      helpText('The busy Cauchy distribution $\\frac{1}{\\pi\\gamma\\,\\left[1 + \\left(\\frac{x-x_0}{\\gamma}\\right)^2\\right]}\\!$'))
  })
  output$ex4 <- renderUI({
    invalidateLater(5000, session)
    x <- round(rcauchy(1), 3)
    withMathJax(sprintf("If $X$ is a Cauchy random variable, then $P(X \\leq %.03f ) = %.03f$",
                        x, pcauchy(x)))
  })
  output$ex5 <- renderUI({
    if (!input$ex5_visible) return()
    withMathJax(
      helpText('You do not see me initially: $e^{i \\pi} + 1 = 0$')
    )
  })
})
Give it a try (or check out the Shiny app at https://r.amherst.edu/apps/nhorton/mathjax/)! One caveat is that the other means of in-line display, as shown in the official example, doesn’t work when the MathJax HTML is inserted as above.
An unrelated note about aggregators: We love aggregators! Aggregators collect blogs that have similar coverage for the convenience of readers, and for blog authors they offer a way to reach new audiences. SAS and R is aggregated by R-bloggers, PROC-X, and statsblogs with our permission, and by at least 2 other aggregating services which have never contacted us. If you read this on an aggregator that does not credit the blogs it incorporates, please come visit us at SAS and R. We answer comments there and offer direct subscriptions if you like our content. In addition, no one is allowed to profit by this work under our license; if you see advertisements on this page, the aggregator is violating the terms by which we publish our work.
In this post, we’ll look at a simple method to identify segments of an image based on RGB color values. The segmentation technique we’ll consider is called color quantization. Not surprisingly, this topic lends itself naturally to visualization and R makes it easy to render some really cool graphics for the color quantization problem.
The code presented in detail below is packaged concisely in this github gist:
By sourcing this script in R, all the required images will be fetched and some demo visualizations will be rendered.
Color Quantization
Digital color images can be represented using the RGB color model. In a digital RGB image, each pixel is associated with a triple of three channel values: red, green, and blue. For a given pixel in the image, each channel has an intensity value (e.g. an integer in the range from 0 to 255 for an 8-bit color representation, or a floating point number in the range from 0 to 1). To render a pixel in a particular image, the intensity values of the three RGB channels are combined to yield a specific color value. This RGB illumination image from Wikipedia gives some idea of how the three RGB channels can combine to form new colors:
The goal of image segmentation is to take a digital image and partition it into simpler regions. By breaking an image into simpler regions, it often becomes easier to identify interesting superstructure in an image, such as the edges of objects. For example, here’s a possible segmentation of the Wikipedia RGB illumination image into 8 segments:
This segmentation picks out all of the solid color regions in the original image (excluding the white center) and discards much of the finer details of the image.
There are many approaches to segmenting an image, but here we’ll just consider a fairly simple one using k-means. The k-means algorithm attempts to partition a data set into k clusters. Our data set will be the RGB channel values for each pixel in a given image, and we’ll choose k to coincide with the number of segments we’d like to extract from the image. By clustering over the RGB channel values, we’ll tend to get clusters whose RGB channel values are relatively “close” in terms of Euclidean distance. If the choice of k is a good one, the color values of the pixels within a cluster will be very close to each other, and the color values of pixels in two different clusters will be fairly distinct.
Implementing Color Segmentation in R
This beautiful image of a mandrill is famous in image processing (it’s also in the public domain like all images in this post).
To load this PNG image into R, we’ll use the png package:
# load the PNG into an RGB image object
library(png)
mandrill = readPNG("mandrill.png")

# this mandrill is a 512 x 512 x 3 array
dim(mandrill)
## [1] 512 512   3
In R, an RGB image is represented as an n by m by 3 array. The last dimension of this array is the channel (1 for red, 2 for green, 3 for blue). Here’s what the three RGB channels of the image look like:
Here are some ways to view image data directly from within R:
library("grid") library("gridExtra")
### EX 1: show the full RGB image grid.raster(mandrill)
### EX 2: show the B channel in gray scale representing pixel intensity grid.raster(mandrill[,,3])
### EX 3: show the 3 channels in separate images # copy the image three times mandrill.R = mandrill mandrill.G = mandrill mandrill.B = mandrill
# zero out the non-contributing channels for each image copy mandrill.R[,,2:3] = 0 mandrill.G[,,1]=0 mandrill.G[,,3]=0 mandrill.B[,,1:2]=0
Now let’s segment this image. First, we need to reshape the array into a data frame with one row for each pixel and three columns for the RGB channels:
# reshape image into a data frame
df = data.frame(
  red   = matrix(mandrill[,,1], ncol=1),
  green = matrix(mandrill[,,2], ncol=1),
  blue  = matrix(mandrill[,,3], ncol=1)
)
Now, we apply k-means to our data frame. We’ll choose k=4 to break the image into 4 color regions.
### compute the k-means clustering
K = kmeans(df, 4)
df$label = K$cluster

### Replace the color of each pixel in the image with the mean
### R, G, and B values of the cluster in which the pixel resides:

# get the coloring
colors = data.frame(
  label = 1:nrow(K$centers),
  R = K$centers[,"red"],
  G = K$centers[,"green"],
  B = K$centers[,"blue"]
)

# merge color codes on to df
# IMPORTANT: we must maintain the original order of the df after the merge!
df$order = 1:nrow(df)
df = merge(df, colors)
df = df[order(df$order),]
df$order = NULL
Finally, we have to reshape our data frame back into an image:
# get the mean color channel values for each row of the df
R = matrix(df$R, nrow=dim(mandrill)[1])
G = matrix(df$G, nrow=dim(mandrill)[1])
B = matrix(df$B, nrow=dim(mandrill)[1])

# reconstitute the segmented image in the same shape as the input image
mandrill.segmented = array(dim=dim(mandrill))
mandrill.segmented[,,1] = R
mandrill.segmented[,,2] = G
mandrill.segmented[,,3] = B

# View the result
grid.raster(mandrill.segmented)
Here is our segmented image:
Color Space Plots in Two and Three Dimensions
Color space is the three dimensional space formed by the three RGB channels. We can get a better understanding of color quantization by visualizing our images in color space. Here are animated 3d plots of the color space for the mandrill and the segmented mandrill:
These animations were generated with the help of the rgl package:
library("rgl") # color space plot of mandrill open3d() plot3d(df$red, df$green, df$blue, col=rgb(df$red, df$green, df$blue), xlab="R", ylab="G", zlab="B", size=3, box=FALSE, axes=TRUE) play3d( spin3d(axis=c(1,1,1), rpm=3), duration = 10 )
# color space plot of segmented mandrill open3d() plot3d(df$red, df$green, df$blue, col=rgb(df$R, df$G, df$B), xlab="R", ylab="G", zlab="B", size=3, box=FALSE) play3d( spin3d(axis=c(1,1,1), rpm=3), duration = 10 )
# Use # movie3d( spin3d(axis=c(1,1,1), rpm=3), duration = 10 ) # instead of play3d to generate GIFs (requires imagemagick).
To visualize color space in two dimensions, we can use principal components analysis. Principal components analysis transforms the original RGB coordinate system into a new coordinate system UVW. In this system, the U coordinate captures as much of the variance in the original data as possible, and the V coordinate captures as much of the variance as possible after factoring out U. So after performing PCA, most of the variation in the data should be visible by plotting in the UV plane. Here is the color space projection for the mandrill:
and for the segmented mandrill:
Here is the code to generate these projections:
require("ggplot2")
# perform PCA on the mandril data and add the uv coordinates to the dataframe PCA = prcomp(df[,c("red","green","blue")], center=TRUE, scale=TRUE) df$u = PCA$x[,1] df$v = PCA$x[,2]
# Inspect the PCA # most of the cumulative proportion of variance in PC2 should be close to 1. summary(PCA)
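One possible way to draw the UV-plane projections shown above is an ordinary scatter plot of (u, v), colouring each point by its own pixel colour; this is an illustrative sketch rather than the post's exact plotting code (swap in df$R, df$G, df$B inside rgb() for the segmented version):

ggplot(df, aes(x = u, y = v)) +
  geom_point(colour = rgb(df$red, df$green, df$blue), size = 0.3) +
  labs(x = "U (PC1)", y = "V (PC2)")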
Guest post by Gergely Daróczi. If you like this content, you can buy the full 396-page e-book for 5 USD until January 8, 2016 as part of Packt’s “$5 Skill Up Campaign” at http://bit.ly/mastering-R
Feature extraction tends to be one of the most important steps in machine learning and data science projects, so I decided to republish a related short section from my intermediate book on how to analyze data with R. The 9th chapter is dedicated to traditional dimension reduction methods, such as Principal Component Analysis, Factor Analysis and Multidimensional Scaling, of which the introductory examples below focus on the latter.
Multidimensional Scaling (MDS) is a multivariate statistical technique first used in geography. The main goal of MDS is to plot multivariate data points in two dimensions, thus revealing the structure of the dataset by visualizing the relative distances of the observations. Multidimensional scaling is used in diverse fields such as attitude studies in psychology, sociology, and market research.
Although the MASS package provides non-metric methods via the isoMDS function, we will now concentrate on the classical, metric MDS, which is available by calling the cmdscale function bundled with the stats package. Both types of MDS take a distance matrix as the main argument, which can be created from any numeric tabular data by the dist function.
But before turning to such more complex examples, let’s see what MDS can offer us while working with an already existing distance matrix, like the built-in eurodist dataset:
> as.matrix(eurodist)[1:5, 1:5]
Athens Barcelona Brussels Calais Cherbourg
Athens 0 3313 2963 3175 3339
Barcelona 3313 0 1318 1326 1294
Brussels 2963 1318 0 204 583
Calais 3175 1326 204 0 460
Cherbourg 3339 1294 583 460 0
The above subset (the first five rows and columns) of the distance matrix represents the travel distance between 21 European cities in kilometers. Running classical MDS on this example returns:
> (mds <- cmdscale(eurodist))
[,1] [,2]
Athens 2290.2747 1798.803
Barcelona -825.3828 546.811
Brussels 59.1833 -367.081
Calais -82.8460 -429.915
Cherbourg -352.4994 -290.908
Cologne 293.6896 -405.312
Copenhagen 681.9315 -1108.645
Geneva -9.4234 240.406
Gibraltar -2048.4491 642.459
Hamburg 561.1090 -773.369
Hook of Holland 164.9218 -549.367
Lisbon -1935.0408 49.125
Lyons -226.4232 187.088
Madrid -1423.3537 305.875
Marseilles -299.4987 388.807
Milan 260.8780 416.674
Munich 587.6757 81.182
Paris -156.8363 -211.139
Rome 709.4133 1109.367
Stockholm 839.4459 -1836.791
Vienna 911.2305 205.930
These scores are very similar to the first two principal components (discussed in the previous Principal Component Analysis section), such as those returned by running prcomp(eurodist)$x[, 1:2]. As a matter of fact, PCA can be considered the most basic MDS solution.
Anyway, we have just transformed (reduced) the 21-dimensional space into 2 dimensions, which can be plotted very easily — unlike the original distance matrix with 21 rows and 21 columns:
> plot(mds)
Does it ring a bell? If not yet, the below image might be more helpful, where the following two lines of code also render the city names instead of showing anonymous points:
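A minimal sketch of what those two lines might look like (the book's exact code may differ):

plot(mds, type = "n")
text(mds[, 1], mds[, 2], rownames(mds))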
Although the y axis seems to be flipped (which you can fix by multiplying the second argument of text by -1), we have just rendered a map of some European cities from the distance matrix, without any further geographical data. I hope you find this rather impressive!
Please find more data visualization tricks and methods in the 13th chapter, Data Around Us, from which you can learn, for example, how to plot the above results over a satellite map downloaded from online service providers. For now, I will only focus on how to render this plot with the new version of ggplot2 to avoid overlaps in the city names, while suppressing the not-that-useful x and y axis labels and ticks:
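An illustrative sketch of such a plot (not necessarily the book's code), using ggplot2 >= 2.0 to drop overlapping labels and a blank theme for the axes:

library(ggplot2)
mds_df <- data.frame(x = mds[, 1], y = -mds[, 2], city = rownames(mds))
ggplot(mds_df, aes(x, y, label = city)) +
  geom_text(check_overlap = TRUE) +
  theme_minimal() +
  theme(axis.title = element_blank(), axis.text = element_blank(),
        axis.ticks = element_blank())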
But let’s get back to the original topic and see how to apply MDS to non-geographic data that was not prepared as a distance matrix. We will use the mtcars dataset in the following example, resulting in a plot with no axis elements:
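A minimal sketch of the kind of code that produces such a plot (the book's exact code may differ): scale the numeric variables, build a distance matrix, run MDS, and label the points with the car names.

mds_cars <- cmdscale(dist(scale(mtcars)))
plot(mds_cars, type = "n", xlab = "", ylab = "", axes = FALSE)
text(mds_cars[, 1], mds_cars[, 2], rownames(mds_cars), cex = 0.7)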
The above plot shows the 32 cars of the original dataset scattered in a two dimensional space. The distance between the elements was computed by MDS, which took into account all 11 original numeric variables, and it makes it very easy to identify the similar and very different car types. We will cover these topics in more detail in the next chapter, which is dedicated to Classification and Clustering.
This article first appeared in the “Mastering Data Analysis with R” book, and is now published with the permission of Packt Publishing.
A while back I showed you how to make volcano plots in base R for visualizing gene expression results. This is just one of many genome-scale plots where you might want to show all individual results but highlight or call out important results by labeling them, for example, with a gene name.
But if you want to annotate lots of points, the annotations usually get so crowded that they overlap one another and become illegible. There are ways around this – reducing the font size, or adjusting the position or angle of the text, but these usually don’t completely solve the problem, and can even make the visualization worse. Here’s the plot again, reading the results directly from GitHub, and drawing the plot with ggplot2 and geom_text out of the box.
What a mess. It’s difficult to see what any of those downregulated genes are on the left. Enter the ggrepel package, a new extension of ggplot2 that repels text labels away from one another. Just sub in geom_text_repel() in place of geom_text() and the extension is smart enough to try to figure out how to label the points such that the labels don’t interfere with each other. Here it is in action.
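A hedged sketch of the swap described, with geom_text_repel() in place of geom_text(); the data frame and column names here are illustrative placeholders rather than the post's actual results object:

library(ggplot2)
library(ggrepel)   # provides geom_text_repel()
ggplot(results, aes(x = logFC, y = -log10(P.Value), label = Gene)) +
  geom_point() +
  geom_text_repel()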
Since last year Revolution Analytics has been publishing beta versions of Revolution R Open, and finally in April this year they released RRO 8.0.3. The current release is RRO 3.2.2 (the naming was adapted to fit the R version it is built upon). This post will give you an introduction to my favorite new features, how to install RRO, and some performance benchmarks.
RRO 3.2.2 is based on the base R that we all know and which is published by the R Foundation. My favorite features of this new RRO version are:
100% compatible with R and therefore your existing R code and all packages
Reproducible R toolkit – every version of RRO comes with a snapshot of CRAN, which means that installing packages in RRO always gives you the same version of a package as long as you have the same RRO version.
Daily CRAN snapshots if you need a different version of a package: checkpoint(“YYYY-MM-DD”) (see the short sketch after this list)
No more problems when moving code to another machine as long as you are using the same version of RRO
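A minimal sketch of the snapshot workflow mentioned in the list above (the date is just an example):

library(checkpoint)
checkpoint("2015-10-01")   # installs and uses packages as they were on CRAN that day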
One feature that sounded very interesting, but which I could not get to work as expected, is the Intel Math Kernel Library (MKL), which should improve performance. Intel MKL contains highly vectorized and threaded functions like fast Fourier transforms, linear algebra, and methods for statistics and vector math. You just have to install it and RRO automatically detects it. As I saw when preparing this post, the library does speed up basic matrix computations, but when tested on real data with real packages that use matrix functions in their core, the code performance did not improve.
The installation of Revolution R Open is very easy:
Download MKL from the above link if you want to make use of the multithreaded performance (not required on Mac OS X)
If you’ve never used R before and don’t have an IDE, I recommend using RStudio
The information after starting RStudio looks a bit different for me (different version information and so on). This is especially interesting if you’ve installed MKL as well (the default is to use all the cores your machine has):
Multithreaded BLAS/LAPACK libraries detected. Using 4 cores for math algorithms.
RStudio startup information after installing RRO and MKL.
As many benchmarks already report, RRO is really an improvement over Base R with regards to performance and using 4 cores decreases run time a little bit more. These improvements only appear for certain operations though. In general performance gains are reported for matrix calculations and matrix functions, Cholesky factorization, singular value decomposition, principal component analysis and linear discriminant analysis. These are some benchmarks I found:
I tried all these tasks as well on both my Surface 3 running Windows 10 (4 GB RAM, 4 cores) and on my Mac Pro running Windows 7 (16 GB RAM, 4 cores) and observed similar improvements for the above mentioned tasks.
Motivated to try these improvements on real data and with methods one would expect to perform matrix computations, I downloaded the packages fabia (Factor Analysis for Bicluster Acquisition) and irlba (Fast Truncated SVD, PCA and Symmetric Eigendecomposition for Large Dense and Sparse Matrices). To test fabia I used the datasets that come with the package. To give irlba a real task, I used the Netflix Prize dataset, which contains movie ratings of 480,000 users for 18,000 movies. The results were disappointing: the performance of the algorithms was the same, no matter whether I used base R, RRO with 1 core or RRO with 4 cores.
I looked into the code of the fabia package, and it does use matrix calculations and also the function svd (singular value decomposition), which should both be faster using RRO (as shown above). But there is so much code and computation around them that this is probably not the slowest part of the method anyway, and performance improvements there do not stand out at all. The irlba package uses a loop and some matrix calculations, but also lots of other code. Additionally, here we did not use a regular matrix but a sparse matrix (from the Matrix package), so I believe the Math Kernel Library cannot really improve the computations. A friend suggested the package pracma. With its function itersolve you can solve systems of linear equations. Unfortunately, this function also performs iterations and the run times are the same in all three settings.
My conclusion about the Revolution R Open performance improvements is that it only helps in certain cases (mainly linear algebra), so you will not always get a gain out of it. Since it does not slow other computations down or have any other disadvantages, it will not hurt to use it, and sometimes you will benefit from it. Furthermore, the functionality of the reproducible R toolkit (daily snapshots of CRAN) will make your workflow a lot more comprehensible and less prone to error.
Simple image recognition app using TensorFlow and Shiny
About
My weekend was full of deep learning and AI programming so as a milestone I made a simple image recognition app that:
Takes an image input uploaded to Shiny UI
Performs image recognition using TensorFlow
Plots detected objects and scores in wordcloud
App
This app is to demonstrate powerful image recognition functionality using TensorFlow following the first half of this tutorial.
In the backend, the classify_image.py script is running, with a model pretrained by tensorflow.org.
This Python file takes a jpg/jpeg file as an input and performs image classifications.
I will then use R to handle the classification results and produce wordcloud based on detected objects and their scores.
Requirements
The app is based on R (shiny and wordcloud packages), Python 2.7 (tensorflow, six and numpy packages) and TensorFlow (Tensorflow itself and this python file).
Please make sure that you have all the above packages installed. For help installing TensorFlow this link should be helpful.
Structure
Just like a usual Shiny app, you only need two components; server.R and ui.R in it.
This is optional, but you can change the number of objects in the image recognition output by changing line 63 of classify_image.py:
tf.app.flags.DEFINE_integer('num_top_predictions', 5,   # I changed this to 10
                            """Display this many predictions.""")
server.R
I put comments on almost every line in server.R so you can follow the logic more easily.
library(shiny)
library(wordcloud)

shinyServer(function(input, output) {

  PYTHONPATH <- "path/to/your/python"  # should look like /Users/yourname/anaconda/bin if you use the anaconda python distribution on OS X
  CLASSIFYIMAGEPATH <- "path/to/your/classify_image.py"  # should look like ~/anaconda/lib/python2.7/site-packages/tensorflow/models/image/imagenet

  outputtext <- reactive({
    ### This is to compose the image recognition call ###
    inFile <- input$file1  # This creates an input button that enables image upload
    template <- paste0(PYTHONPATH, "/python ", CLASSIFYIMAGEPATH, "/classify_image.py")  # Template to run image recognition using Python
    if (is.null(inFile)) {
      # Initially the app classifies cropped_panda.jpg; if you downloaded the model data
      # to a different directory, change /tmp/imagenet to the location you use
      res <- system(paste0(template, " --image_file /tmp/imagenet/cropped_panda.jpg"), intern = TRUE)
    } else {
      res <- system(paste0(template, " --image_file ", inFile$datapath), intern = TRUE)  # Uploaded image will be used for classification
    }
  })

  output$plot <- renderPlot({
    ### This is to create the wordcloud based on image recognition results ###
    df <- data.frame(gsub(" *\\(.*?\\) *", "", outputtext()),
                     gsub("[^0-9.]", "", outputtext()))       # Make a dataframe using detected objects and scores
    names(df) <- c("Object", "Score")                         # Set column names
    df$Object <- as.character(df$Object)                      # Convert df$Object to character
    df$Score <- as.numeric(as.character(df$Score))            # Convert df$Score to numeric
    s <- strsplit(as.character(df$Object), ',')               # Split rows by comma to separate terms
    df <- data.frame(Object = unlist(s),
                     Score = rep(df$Score, sapply(s, FUN = length)))  # Allocate scores to the split words
    # By separating long categories into shorter terms, we can avoid the
    # "could not be fit on page. It will not be plotted" warning as much as possible
    wordcloud(df$Object, df$Score, scale = c(4, 2),
              colors = brewer.pal(6, "RdBu"), random.order = FALSE)   # Make wordcloud
  })

  output$outputImage <- renderImage({
    ### This is to plot the uploaded image ###
    if (is.null(input$file1)) {
      outfile <- "/tmp/imagenet/cropped_panda.jpg"
      contentType <- "image/jpg"   # Panda image is the default
    } else {
      outfile <- input$file1$datapath
      contentType <- input$file1$type   # Uploaded file otherwise
    }
    list(src = outfile,
         contentType = contentType,
         width = 300)
  }, deleteFile = TRUE)
})
ui.R
The ui.R file is rather simple:
shinyUI(
  fluidPage(
    titlePanel("Simple Image Recognition App using TensorFlow and Shiny"),
    tags$hr(),
    fluidRow(
      column(width = 4,
             fileInput('file1', '', accept = c('.jpg', '.jpeg')),
             imageOutput('outputImage')),
      column(width = 8,
             plotOutput("plot"))
    )
  )
)
Shiny App
That’s it!
Here is a checklist to run the app without an error.
Make sure you have all the requirements installed
You have server.R and ui.R in the same folder
You correctly set PYTHONPATH and CLASSIFYIMAGEPATH
Optionally, change num_top_predictions in classify_image.py
Uploaded images should be in jpg/jpeg format
I was personally impressed with what the machine finds in abstract paintings or modern art.
A common theme over the last few decades was that we could afford to simply sit back and let computer (hardware) engineers take care of increases in computing speed thanks to Moore’s law. That same line of thought now frequently points out that we are getting closer and closer to the physical limits of what Moore’s law can do for us.
So the new best hope is (and has been) parallel processing. Even our smartphones have multiple cores, and most if not all retail PCs now possess two, four or more cores. Real computers, aka somewhat decent servers, can be had with 24, 32 or more cores as well, and all that is before we even consider GPU coprocessors or other upcoming changes.
Sometimes our tasks are embarrassingly parallel, as is the case with many data-parallel jobs: we can use higher-level operations such as those offered by the base R package parallel to spawn multiple processing tasks and gather the results. Dirk covered all this in some detail in previous talks on High Performance Computing with R (and you can also consult the CRAN Task View on High Performance Computing with R).
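A minimal sketch of that data-parallel pattern with the base parallel package (the function and inputs are illustrative):

library(parallel)
cl <- makeCluster(4)                        # spawn four worker processes
res <- parLapply(cl, 1:8, function(i) i^2)  # run the function in parallel and gather the results
stopCluster(cl)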
But sometimes we cannot use data-parallel approaches. Hence we have to redo our algorithms. Which is really hard. R itself has been relying on the (fairly mature) OpenMP standard for some of its operations. Luke Tierney’s keynote at the 2014 R/Finance conference mentioned some of the issues related to OpenMP, which works really well on Linux but currently not so well on other platforms. R is expected to make wider use of it in future versions once compiler support for OpenMP on Windows and OS X improves.
In the meantime, the RcppParallel package provides a complete toolkit for creating portable, high-performance parallel algorithms without requiring direct manipulation of operating system threads. RcppParallel includes:
Intel Thread Building Blocks (v4.3), a C++ library for task parallelism with a wide variety of parallel algorithms and data structures (Windows, OS X, Linux, and Solaris x86 only).
TinyThread, a C++ library for portable use of operating system threads.
RVector and RMatrix wrapper classes for safe and convenient access to R data structures in a multi-threaded environment.
High level parallel functions (parallelFor and parallelReduce) that use Intel TBB as a back-end on systems that support it and TinyThread on other platforms.
All four are interesting and demonstrate different aspects of parallel computing via RcppParallel. But the last article is key—it shows how a particular matrix distance metric (which is missing from R) can be implemented in a serial manner in both R, and also via Rcpp. The fastest implementation, however, uses both Rcpp and RcppParallel and thereby achieves a truly impressive speed gain as the gains from using compiled code (via Rcpp) and from using a parallel algorithm (via RcppParallel) are multiplicative. On a couple of four-core machines the RcppParallel version was between 200 and 300 times faster than the R version.
Exciting times for parallel programming in R! To learn more head over to the RcppParallel package and start playing.
Data analyses are the product of many different tasks, and statistical methods are one key aspect of any data analysis. There is a common workflow in the related areas of informatics, data mining, data science, machine learning, and statistics. The workflow tasks include data preparation, the development of predictive mathematical models, and the interpretation and preparation of analysis results (including the development of visualizations to communicate findings).
The presentation provides information on the last two steps of this workflow, with reproducible code examples, and presents a walk-through of many common statistical methods used to explore data in R, including regression, clustering (e.g. k-means and hierarchical), and dimensionality reduction (e.g. principal component analysis (PCA)).
Novice users are shown how to navigate the resulting R object to extract specific elements of interest, such as correlation p-values, regression coefficients, etc. The presentation additionally tries to tackle some of the key concerns about these introductory methods by providing guidance on the interpretation of analysis results, such as understanding the approximately 10 values returned by a simple linear regression; the importance of, and how to deal with, missing values through imputation in real world problems; determining the quality of clustering results; and understanding the data transformations that take place in dimension reduction methods. Also provided is information about more sophisticated methodologies, such as regularized regression methods (LASSO, Ridge, and Elastic Net regression) and packages that make these more advanced methods available in R, such as glmnet for regularized regression.
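A small illustration of the kind of object navigation described, fitting a simple linear regression on a built-in dataset and pulling out specific elements of the result (this example is mine, not from the presentation):

fit <- lm(mpg ~ wt, data = mtcars)
s <- summary(fit)
coef(fit)                                  # regression coefficients
s$coefficients[, "Pr(>|t|)"]               # coefficient p-values
s$r.squared                                # R-squared
cor.test(mtcars$mpg, mtcars$wt)$p.value    # a correlation p-value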