Quantcast
Channel: Search Results for “pca”– R-bloggers
Viewing all articles
Browse latest Browse all 209

Clustering French Cities (based on Temperatures)

$
0
0

(This article was first published on R-english – Freakonometrics, and kindly contributed to R-bloggers)

In order to illustrate hierarchical clustering techniques and k-means, I did borrow François Husson‘s dataset, with monthly average temperature in several French cities.

> temp=read.table(
+ "http://freakonometrics.free.fr/FR_temp.txt",
+ header=TRUE,dec=",")

We have 15 cities, with monthly observations

> X=temp[,1:12]
> boxplot(X)

Since the variance seems to be rather stable, we will not ‘normalize’ the variables here,

> apply(X,2,sd)
    Janv     Fevr     Mars     Avri 
2.007296 1.868409 1.529083 1.414820 
     Mai     Juin     juil     Aout 
1.504596 1.793507 2.128939 2.011988 
    Sept     Octo     Nove     Dece 
1.848114 1.829988 1.803753 1.958449

In order to get a hierarchical cluster analysis, use for instance

> h <- hclust(dist(X), method = "ward")
> plot(h, labels = rownames(X), sub = "")

An alternative is to use

> library(FactoMineR)
> h2=HCPC(X)
> plot(h2)

Here, we visualise observations with a principal components analysis. We have here also an automatic selection of the number of classes, here 3. We can get the description of the groups using

> h2$desc.ind

or directly

> cah=hclust(dist(X))
> groups.3 <- cutree(cah,3)

We can also visualise those classes by ourselves,

> acp=PCA(X,scale.unit=FALSE)
> plot(acp$ind$coord[,1:2],col="white")
> text(acp$ind$coord[,1],acp$ind$coord[,2],
+ rownames(acp$ind$coord),col=groups.3)

It is possible to plot the centroïds of those clusters

> PT=aggregate(acp$ind$coord,list(groups.3),mean)
> points(PT$Dim.1,PT$Dim.2,pch=19)

If we add Voroid sets around those centroïds, here we do not see them (actually, we see the point – in the middle – that is exactly at the intersection of the three regions),

> library(tripack)
> V <- voronoi.mosaic(PT$Dim.1,PT$Dim.2)
> plot(V,add=TRUE)

To visualize those regions, use

> p=function(x,y){
+   which.min((PT$Dim.1-x)^2+(PT$Dim.2-y)^2)
+ }
> vx=seq(-10,12,length=251)
> vy=seq(-6,8,length=251)
> z=outer(vx,vy,Vectorize(p))
> image(vx,vy,z,col=c(rgb(1,0,0,.2),
+ rgb(0,1,0,.2),rgb(0,0,1,.2)))
> CL=c("red","black","blue")
> text(acp$ind$coord[,1],acp$ind$coord[,2],
+ rownames(acp$ind$coord),col=CL[groups.3])

Actually, those three groups (and those three regions) are also the ones we obtain using a k-mean algorithm,

> km=kmeans(acp$ind$coord[,1:2],3)
> km
K-means clustering 
with 3 clusters of sizes 3, 7, 5

(etc). But actually, since again we have some spatial data, it is possible to visualize them on a map

> library(maps)
> map("france")
> points(temp$Long,temp$Lati,col=groups.3,pch=19)

or, to visualize the regions, use e.g.

> library(car)
> for(i in 1:3) 
+ dataEllipse(temp$Long[groups.3==i],
+ temp$Lati[groups.3==i], levels=.7,add=TRUE,
+ col=i+1,fill=TRUE)

Those three regions actually make sense, geographically speaking.

To leave a comment for the author, please follow the link and comment on their blog: R-english – Freakonometrics.

R-bloggers.com offers daily e-mail updates about R news and tutorials on topics such as: Data science, Big Data, R jobs, visualization (ggplot2, Boxplots, maps, animation), programming (RStudio, Sweave, LaTeX, SQL, Eclipse, git, hadoop, Web Scraping) statistics (regression, PCA, time series, trading) and more...

Viewing all articles
Browse latest Browse all 209

Trending Articles