Fast functions for correlation and hierarchical clustering
R code examples

Peter Langfelder1 and Steve Horvath1,2



1 Dept. of Human Genetics, UC Los Angeles, 2 Dept. of Biostatistics, UC Los Angeles

Peter (dot) Langfelder (at) gmail (dot) com, SHorvath (at) mednet (dot) ucla (dot) edu

We provide several R scripts that compare the performance of our correlation and hierarchical clustering functions with that of the standard R functions. To run these examples, packages flashClust (version 1.20 or higher) and WGCNA (version 1.13 or higher) must be installed. The R code was last updated on July 1, 2015, with minor changes to both code and text.

1. Example of module stability analysis using resampling of microarray samples

We provide an example of module stability analysis using resampling of microarray samples in expression data from livers of female mice of an F2 cross (Ghazalpour et al, 2006). We provide two versions of the example. The "large" version uses the full data set of over 23000 probe sets and requires a computer with at least 16 GB (32 GB preferred) of RAM. For users who do not have access to a computer with that much memory, we also provide a smaller version of the same analysis that uses only 5000 probes and will run on a standard modern desktop or laptop with at least 2 GB of memory.
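To illustrate the general idea of resampling-based stability analysis, here is a minimal base-R sketch. It is not the analysis script provided here: the data are simulated, the helpers detectModules and randIndex are hypothetical, modules are defined by simply cutting an average-linkage tree into a fixed number of clusters (a stand-in for WGCNA module detection), and samples are resampled with replacement, which is an assumption.

```r
# Sketch only: simulated data and a simple cutree-based module definition
# stand in for the real expression data and WGCNA module detection.
set.seed(1)
nSamples <- 60
nGenes <- 200
nModules <- 4

# Simulate genes, each driven by one of nModules latent profiles plus noise.
trueModule <- rep(seq_len(nModules), each = nGenes / nModules)
profiles <- matrix(rnorm(nSamples * nModules), nSamples, nModules)
noise <- matrix(rnorm(nSamples * nGenes), nSamples, nGenes)
expr <- profiles[, trueModule] + 0.5 * noise   # rows = samples, columns = genes

# Hypothetical helper: cluster genes on 1 - |cor| and cut the tree into k modules.
detectModules <- function(expr, k) {
  dissim <- as.dist(1 - abs(cor(expr)))
  cutree(hclust(dissim, method = "average"), k = k)
}

# Hypothetical helper: Rand index, the fraction of gene pairs on which
# two module assignments agree (same module in both, or different in both).
randIndex <- function(a, b) {
  sameA <- outer(a, a, "==")
  sameB <- outer(b, b, "==")
  upper <- upper.tri(sameA)
  mean(sameA[upper] == sameB[upper])
}

# Modules found in the full data serve as the reference.
reference <- detectModules(expr, nModules)

# Repeat module detection on resampled sets of samples and record agreement.
nRuns <- 20
stability <- sapply(seq_len(nRuns), function(run) {
  idx <- sample(nSamples, replace = TRUE)   # resample samples (rows)
  randIndex(reference, detectModules(expr[idx, ], nModules))
})

print(summary(stability))
```

Values of the Rand index close to 1 indicate that the modules change little when the set of samples is perturbed. Each run recomputes a full gene-by-gene correlation matrix and a clustering tree, which is why fast correlation and clustering functions matter for this kind of analysis.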

Download data and custom function for the analysis. The following two files are necessary for either version of the analysis.

R code that performs the large analysis: Please choose your preferred format of the actual R code:

R code that performs the small analysis: Please choose your preferred format of the actual R code:



2. Timing comparisons of correlation calculations

We provide several R scripts that compare the correlation calculations implemented in the WGCNA package with the standard R function cor.
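A timing comparison of this kind can be sketched as follows. This is only an illustration on simulated data, not one of the scripts provided here; the matrix dimensions are arbitrary, and the WGCNA part is skipped if the package is not installed.

```r
# Sketch only: time the standard cor against WGCNA's cor on simulated data.
set.seed(2)
nSamples <- 200
nGenes <- 1000
x <- matrix(rnorm(nSamples * nGenes), nrow = nSamples, ncol = nGenes)

# Standard R correlation (columns of x are the variables).
tBase <- system.time(cBase <- stats::cor(x))

if (requireNamespace("WGCNA", quietly = TRUE)) {
  # WGCNA provides its own cor function with the same basic interface.
  tFast <- system.time(cFast <- WGCNA::cor(x))
  print(c(stats = tBase[["elapsed"]], WGCNA = tFast[["elapsed"]]))
  # The two results should agree up to numerical precision.
  print(max(abs(cBase - cFast)))
} else {
  message("Package WGCNA is not installed; only stats::cor was timed.")
  print(c(stats = tBase[["elapsed"]]))
}
```

Elapsed times depend on the hardware, the linear algebra library R is linked against, and the number of threads available, so a single run on a small matrix should be read as a rough indication only.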



3. Timing comparisons of hierarchical clustering

We provide an R script that compares the performance of the hierarchical clustering implemented in package flashClust with that of the standard R function hclust. As written, the script will run only on a large computer (see above), but it can easily be modified so that it also runs on standard desktop computers.

Update (October 2014): The R core team recently modified the code of the standard function hclust implemented in package stats. The new "standard" hclust is now as fast as or faster than the flashClust presented here. The R timing code below will still work, but flashClust will no longer be much (if at all) faster than the "standard" hclust.
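A reduced version of such a comparison, small enough for a desktop computer, might look like the sketch below. It uses simulated data and an arbitrary number of objects, it is not the script provided here, and the flashClust part is skipped if the package is not installed. In line with the update above, little or no speed difference should be expected with recent versions of R.

```r
# Sketch only: time stats::hclust against flashClust on simulated data.
set.seed(3)
nObjects <- 2000
d <- dist(matrix(rnorm(nObjects * 10), nrow = nObjects))

# Standard hierarchical clustering from package stats.
tStd <- system.time(hStd <- stats::hclust(d, method = "average"))

if (requireNamespace("flashClust", quietly = TRUE)) {
  tFlash <- system.time(hFlash <- flashClust::flashClust(d, method = "average"))
  print(c(hclust = tStd[["elapsed"]], flashClust = tFlash[["elapsed"]]))
  # Merge heights of the two trees should agree up to numerical precision.
  print(all.equal(sort(hStd$height), sort(hFlash$height)))
} else {
  message("Package flashClust is not installed; only stats::hclust was timed.")
  print(c(hclust = tStd[["elapsed"]]))
}
```

Increasing nObjects makes any timing difference easier to see, but the distance matrix grows quadratically with the number of objects, so memory use rises quickly.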