When is hub gene selection better than standard meta-analysis?
Peter Langfelder1, Paul S. Mischel2 and Steve Horvath1,3
PLoS ONE 8(4): e61505. doi:10.1371/journal.pone.0061505
1Department of Human Genetics
2Department of Pathology and Laboratory Medicine
3Department of Biostatistics, University of California, Los Angeles
Peter (dot) Langfelder (at) gmail (dot) com,
SHorvath (at) mednet (dot) ucla (dot) edu
Abstract
Since hub nodes have been found to play important roles in many networks, highly connected hub genes are expected to play an important role in biology as well. However, the empirical evidence remains ambiguous. An open question is whether (or when) hub gene selection leads to more meaningful gene lists than a standard statistical analysis based on significance testing when analyzing genomic data sets (e.g. gene expression or DNA methylation data). Here we address this question for the special case when multiple genomic data sets are available. This is of great practical importance since for many research questions multiple data sets are publicly available. In this case, the data analyst can decide between a standard statistical approach (e.g., based on meta-analysis) and a co-expression network analysis approach that selects intramodular hubs in consensus modules. We assess the performance of these two types of approaches according to two criteria. The first criterion evaluates the biological insights gained and is relevant in basic research. The second criterion evaluates the validation success (reproducibility) in independent data sets and often applies in clinical diagnostic or prognostic applications.
We compare meta-analysis with consensus network analysis based on weighted correlation network analysis (WGCNA) in three comprehensive and unbiased empirical studies:
(1) Finding genes predictive of lung cancer survival
(2) Finding methylation markers related to age
(3) Finding mouse genes related to total cholesterol.
The results demonstrate that intramodular hub gene status with respect to consensus modules is more useful than a meta-analysis p-value when identifying biologically meaningful gene lists (reflecting criterion 1). However, standard meta-analysis methods perform as well as (if not better than) a consensus network approach in terms of validation success (criterion 2). The article also reports a comparison of meta-analysis techniques applied to gene expression data and presents novel R functions for carrying out consensus network analysis, network-based screening, and meta-analysis. All data and analysis R code can be found at this web site.
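To make the "standard meta-analysis" side of the comparison concrete, the following is a minimal sketch of Stouffer's Z-score method, one common way of combining per-data-set association statistics into a single ranking. It is written in Python for illustration only and is independent of the R code in the bundle. The gene names, Z statistics, sample sizes, and the square-root-of-sample-size weights are made-up assumptions, not values or settings taken from the article.

```python
from math import sqrt
from statistics import NormalDist


def stouffer_z(z_scores, weights=None):
    """Combine per-data-set Z statistics into one meta-analysis Z (Stouffer's method)."""
    if weights is None:
        weights = [1.0] * len(z_scores)  # equal weights by default
    if len(weights) != len(z_scores):
        raise ValueError("need exactly one weight per Z statistic")
    numerator = sum(w * z for w, z in zip(weights, z_scores))
    denominator = sqrt(sum(w * w for w in weights))
    return numerator / denominator


def two_sided_p(z):
    """Two-sided p-value of a standard normal Z statistic."""
    return 2.0 * NormalDist().cdf(-abs(z))


# Hypothetical toy input: one Z statistic per gene and data set.
toy_z = {
    "geneA": [2.1, 1.8, 2.5],   # consistent association in all sets
    "geneB": [0.3, -0.4, 0.1],  # no association
    "geneC": [3.0, -2.9, 0.2],  # strong but inconsistent association
}
sample_sizes = [50, 100, 80]                # hypothetical sample sizes
weights = [sqrt(n) for n in sample_sizes]   # assumed weighting scheme

meta_z = {gene: stouffer_z(z, weights) for gene, z in toy_z.items()}

# Rank genes from strongest to weakest combined evidence.
ranking = sorted(meta_z, key=lambda gene: abs(meta_z[gene]), reverse=True)

for gene in ranking:
    print(gene, round(meta_z[gene], 3), two_sided_p(meta_z[gene]))
```

In this toy input the gene with a consistent direction of association across data sets ranks first, while the gene with strong but opposite-signed statistics largely cancels out. A hub-based ranking would replace the per-data-set association statistic with a measure of intramodular connectivity.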
Data and R code/tutorials
We provide the data and code necessary to reproduce our analysis. The code is presented in annotated PDF documents that contain code together with explanations and notes. The code documents also serve as tutorials on the use of consensus module methods, marginal meta-analysis, and meta-analysis of module membership. The code and data together can be downloaded as a single zip bundle:
- Data and code zip bundle (Caution: very large file, over 500MB. May take up to an hour to download.)
Please save the zip bundle on your hard drive and unpack it. The unpacked files should be stored in a folder (directory) named Project-MetaAnalysis. This folder contains the following main sub-folders:
- LungCancer: Data and code for the adenocarcinoma (lung cancer) application.
  - Data-Expression: Original data downloaded from GEO and/or provided by the original authors, as well as our pre-processed data suitable for our analysis.
  - RCode:
    - CommonFunctions: Files networkFunctions-extras-05.R and outlierRemovalFunctions.R contain custom R functions necessary for the analysis, over and above those already present in the WGCNA package. These files are sourced by the analysis code.
    - 002-Preprocessing: The document LungCancer-outlierRemovalAndCollapsing.pdf contains code to pre-process the original data: removal of potential outliers and restriction to a common set of genes.
    - 005-ConsensusModules: The document LungCancer-consensusModules.pdf contains code to identify consensus modules in the 8 adenocarcinoma data sets.
    - 010-CompareEnrichment: The document LungCancer-compareEnrichment.pdf contains code to evaluate hub gene selection and marginal meta-analysis methods by GO enrichment in biologically relevant categories.
    - 020-CompareValidation: The document LungCancer-compareValidation.pdf contains code to evaluate hub gene selection and marginal meta-analysis methods by how well selected genes validate in an independent data set.
- Aging-Methylation: Data and code for the analysis of association of methylation profiles with age.
  - Data-Methylation: Methylation data in a pre-processed form, saved as an R object.
  - RCode:
    - 005-ConsensusModules: The document Aging-consensusModules.pdf contains code to identify consensus modules in the 7 methylation data sets.
    - 010-CompareEnrichment: The document Aging-compareEnrichment.pdf contains code to evaluate hub gene selection and marginal meta-analysis methods by enrichment in Polycomb Group target genes.
    - 020-CompareValidation: The document Aging-compareValidation.pdf contains code to evaluate hub gene selection and marginal meta-analysis methods by how well selected genes validate in an independent data set.
- Mouse: Data and code for the analysis of association of liver gene expression profiles with total cholesterol and other clinical traits.
  - Data-Expression: Expression data in a pre-processed form, saved as an R object.
  - RCode:
    - 005-ConsensusModules: The document Mouse-consensusModules.pdf contains code to identify consensus modules across 9 mouse liver expression data sets.
    - 010-CompareEnrichment: The document Mouse-compareEnrichment.pdf contains code to evaluate hub gene selection and marginal meta-analysis methods by enrichment in a relevant GO category.
    - 020-CompareValidation: This folder will contain code to evaluate hub gene selection and marginal meta-analysis methods by how well selected genes validate in an independent data set.
The zip bundle contains additional data files and directories that are not listed above; please do not remove the additional files (in particular, the data files) since they may be needed for the analysis.
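After unpacking, it can be useful to confirm that the folder layout described above is in place before running any of the analysis code. The following is a minimal sketch in Python (not part of the bundle) that only checks for the sub-folders named on this page. The function name `missing_folders` is a hypothetical helper, and the script assumes it is run from the directory that contains Project-MetaAnalysis.

```python
from pathlib import Path

# Sub-folders named on this page, per application folder.
EXPECTED_SUBFOLDERS = {
    "LungCancer": ["Data-Expression", "RCode"],
    "Aging-Methylation": ["Data-Methylation", "RCode"],
    "Mouse": ["Data-Expression", "RCode"],
}


def missing_folders(project_root):
    """Return the expected sub-folders (as relative paths) that do not exist."""
    root = Path(project_root)
    missing = []
    for application, subfolders in EXPECTED_SUBFOLDERS.items():
        for subfolder in subfolders:
            relative = Path(application) / subfolder
            if not (root / relative).is_dir():
                missing.append(relative.as_posix())
    return missing


if __name__ == "__main__":
    # Assumption: the bundle was unpacked into the current working directory.
    missing = missing_folders("Project-MetaAnalysis")
    if missing:
        print("Missing folders:", ", ".join(missing))
    else:
        print("All expected folders are present.")
```

The check is read-only and ignores any additional files and directories, so it does not interfere with the extra data files that ship with the bundle.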