Learning with Random
Forest Predictors

Tao Shi, Steve Horvath


Department of Human Genetics and Department of Biostatistics
University of California, Los Angeles, CA 90095


A random forest (RF) predictor (Breiman 2001) is an ensemble of individual tree predictors. As part of their construction, RF predictors naturally lead to a dissimilarity measure between the observations. One can also define an RF dissimilarity measure between unlabelled data: the idea is to construct an RF predictor that distinguishes the "observed" data from suitably generated synthetic data (Breiman 2003). The observed data are the original unlabelled data, while the synthetic data are drawn from a reference distribution. Recently, RF dissimilarities have been used successfully in several unsupervised learning tasks involving genomic data. Unlike standard dissimilarities, the relationship between the RF dissimilarity and the variables can be difficult to disentangle. Here we describe the properties of the RF dissimilarity and make recommendations on how to use it in practice. An RF dissimilarity can be attractive because it handles mixed variable types well, is invariant to monotonic transformations of the input variables, is robust to outlying observations, and accommodates several strategies for dealing with missing data. The RF dissimilarity easily deals with large numbers of variables due to its intrinsic variable selection; e.g., the Addcl1 RF dissimilarity weighs the contribution of each variable to the dissimilarity according to how dependent it is on other variables. We find that the RF dissimilarity is useful for detecting tumor sample clusters on the basis of tumor marker expressions. In this application, biologically meaningful clusters can often be described with simple thresholding rules.
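The two steps that surround the forest itself can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `addcl1_synthetic` and `rf_dissimilarity` are hypothetical, the synthetic data are drawn by sampling each variable independently from its observed values (one common reading of the Addcl1 reference distribution), the proximity of two observations is taken to be the fraction of trees in which they share a terminal node, and the dissimilarity is assumed to be the square root of one minus that proximity. Fitting the forest is left to any RF library, so the leaf assignments below are made-up values.

```python
import math
import random


def addcl1_synthetic(observed, rng):
    """Return one synthetic row per observed row.

    Assumption: each variable is sampled independently from its own
    observed values, which breaks the dependence between variables
    while keeping every marginal distribution.
    """
    n_rows = len(observed)
    n_cols = len(observed[0])
    columns = [[row[j] for row in observed] for j in range(n_cols)]
    return [[rng.choice(columns[j]) for j in range(n_cols)]
            for _ in range(n_rows)]


def rf_dissimilarity(leaf_ids):
    """Turn leaf assignments into a dissimilarity matrix.

    leaf_ids[i][t] is the terminal node of observation i in tree t.
    Assumption: proximity is the fraction of trees in which two
    observations share a leaf, and dissimilarity is sqrt(1 - proximity).
    """
    n = len(leaf_ids)
    n_trees = len(leaf_ids[0])
    dissim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            shared = sum(1 for a, b in zip(leaf_ids[i], leaf_ids[j]) if a == b)
            d = math.sqrt(1.0 - shared / n_trees)
            dissim[i][j] = dissim[j][i] = d
    return dissim


# Toy observed data with one numeric and one categorical variable.
observed = [[1.0, "a"], [2.0, "b"], [3.0, "a"], [4.0, "b"]]
synthetic = addcl1_synthetic(observed, random.Random(0))

# Training set for the forest: class 1 = observed, class 2 = synthetic.
rows = observed + synthetic
labels = [1] * len(observed) + [2] * len(synthetic)

# Made-up leaf assignments for three observations in four trees,
# standing in for the output of a fitted forest.
leaf_ids = [[3, 5, 2, 7], [3, 5, 2, 7], [3, 1, 4, 6]]
dissim = rf_dissimilarity(leaf_ids)
```

Only the rows of the dissimilarity matrix that belong to observed data would be passed on to a clustering or scaling procedure; the synthetic rows serve only to give the forest a classification problem to solve.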

KEY WORDS: random forest clustering, biomarkers, ensemble predictors, random forest distance, random forest dissimilarity, tree predictor clustering


A technical report for random forest clustering can be found here
To cite the technical report, please use:
Tao Shi and Steve Horvath (2006). Unsupervised Learning with Random Forest
Predictors. Journal of Computational and Graphical Statistics, Volume 15,
Number 1, March 2006, pp. 118-138.

For the journal article, click here


Word version

PDF version

TXT file version

R-functions used in the tutorial

Test data (comma delimited text file or Excel File)

Student Presentation

A student presentation for random forest clustering can be found here

Other Materials

The randomGLM predictor is an attractive alternative to the random forest. It is often more accurate and involves fewer covariates, as described here


The random forest predictors can also be used for gene
screening as described here.

Read article 1 and article 2


Please send your suggestions and
comments to: shorvath@mednet.ucla.edu