Clustering of epigenomes reveals common lineages, common properties

DATA SOURCE

Correlation matrices (Data format: RData)

Download URL:
https://egg2.wustl.edu/roadmap/data/byDataType/celltype_clustering/correlation_matrices/

Newick formatted optimally ordered hierarchical trees, annotated with bootstrap scores

Download URL:
https://egg2.wustl.edu/roadmap/data/byDataType/celltype_clustering/bootstrap_results/

For each analyzed mark, we calculated Pearson correlation values between all pairwise combinations of reference epigenomes, using the mark's signal confidence scores (-log10(Poisson p-value)) within 200 bp of the genomic regions deemed relevant for that mark. A region was deemed relevant if it was called in the mark-matched chromatin state with a posterior probability > 0.95 in any of the reference epigenomes. The mark-matched states were Enh for H3K4me1, H3K27ac and H3K9ac; TssA for H3K4me3; ReprPC for H3K27me3; Tx for H3K36me3; and Het for H3K9me3, unless otherwise noted (all based on the 15-state core model).
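
The region selection and correlation step described above can be sketched in R roughly as follows. This is only an illustration on simulated data, not the pipeline that produced the provided matrices: the objects signal, posterior, relevant and markcor_toy, the epigenome labels, and the matrix sizes are all assumptions made for the example.

```r
set.seed(1)

# Hypothetical toy inputs: rows are genomic regions, columns are epigenomes.
n_regions  <- 500
epigenomes <- c("E001", "E002", "E003", "E004")  # labels used only for illustration

# Simulated signal confidence scores, standing in for -log10(Poisson p-value).
signal <- matrix(rexp(n_regions * length(epigenomes)), nrow = n_regions,
                 dimnames = list(NULL, epigenomes))

# Simulated posterior probabilities of the mark-matched chromatin state.
posterior <- matrix(runif(n_regions * length(epigenomes)), nrow = n_regions,
                    dimnames = list(NULL, epigenomes))

# A region is relevant if the state has posterior > 0.95 in any epigenome.
relevant <- apply(posterior > 0.95, 1, any)

# Pearson correlation between all pairs of epigenomes over the relevant regions.
markcor_toy <- cor(signal[relevant, , drop = FALSE], method = "pearson")
```

The result is a symmetric epigenome-by-epigenome matrix with ones on the diagonal, which is the shape the clustering and MDS steps expect as input.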

The resulting correlation matrices were used as the basis for a distance matrix for complete-linkage hierarchical clustering, followed by optimal leaf ordering (Bar-Joseph et al., 2001). Bootstrap support values were derived from 1,000 random samplings with replacement from all regions considered for a particular mark, with a bootstrap tree estimated for each resampling. The bootstrap support for a branch corresponds to the fraction of bootstrapped trees that support the bipartition induced by the branch.
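
A minimal sketch of the bootstrap procedure in base R is given below, under several assumptions: the signal matrix is simulated, the helper functions clades and cluster_signal are hypothetical names introduced here, only 200 resamplings are drawn to keep the example fast, and a branch is counted as supported when the same set of leaves forms a cluster in the bootstrap tree (a rooted simplification of the bipartition criterion). Optimal leaf ordering is omitted because it changes the leaf order but not the tree topology.

```r
set.seed(1)

# Hypothetical toy signal: 300 regions x 6 epigenomes forming two groups.
n_regions <- 300
base1 <- rnorm(n_regions)
base2 <- rnorm(n_regions)
signal <- cbind(sapply(1:3, function(i) base1 + rnorm(n_regions, sd = 0.3)),
                sapply(1:3, function(i) base2 + rnorm(n_regions, sd = 0.3)))
colnames(signal) <- paste0("epi", 1:6)

# Complete-linkage clustering on 1 - Pearson correlation.
cluster_signal <- function(x) hclust(as.dist(1 - cor(x)), method = "complete")

# List the leaf set under every internal node of an hclust tree,
# encoded as a comma-separated string of sorted leaf indices.
clades <- function(hc) {
  members <- vector("list", nrow(hc$merge))
  for (i in seq_len(nrow(hc$merge))) {
    m <- integer(0)
    for (j in hc$merge[i, ]) {
      # Negative entries are single leaves, positive entries are earlier merges.
      m <- c(m, if (j < 0) -j else members[[j]])
    }
    members[[i]] <- sort(m)
  }
  vapply(members, paste, character(1), collapse = ",")
}

ref_clades <- clades(cluster_signal(signal))

n_boot <- 200  # the analysis described above used 1,000 resamplings
hits <- numeric(length(ref_clades))
for (b in seq_len(n_boot)) {
  idx <- sample(n_regions, replace = TRUE)  # resample regions with replacement
  boot_clades <- clades(cluster_signal(signal[idx, , drop = FALSE]))
  hits <- hits + (ref_clades %in% boot_clades)
}

# Fraction of bootstrap trees that contain each branch of the reference tree.
support <- hits / n_boot
names(support) <- ref_clades
```

The root branch, which contains all leaves, trivially receives a support of 1 in this scheme; the informative values are those of the remaining internal branches.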

In parallel, all correlation matrices mentioned above were used to perform multidimensional scaling (MDS) analyses in R. The following concept code can be used with the provided correlation matrices to perform hierarchical clustering and MDS analyses:

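mark <- "H3K4me1"; state <- 7  # example mark/state combination: H3K4me1 in the Enh state
# Load the correlation matrix for this mark and state (expected to provide the object markcor)
load(paste0("cor_", mark, "_", state, ".RData"))
d <- as.dist(1 - markcor)  # convert correlations to distances
# Complete-linkage clustering (optimal leaf ordering is not applied here)
hclust_res <- hclust(d, method = "complete")
# Classical MDS, keeping the maximum number of dimensions
MDS_res <- cmdscale(d, eig = TRUE, k = nrow(markcor) - 1)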
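
Once the clustering and MDS objects exist, they might be inspected along the following lines. This sketch runs on a small simulated correlation matrix so that it is self-contained; the toy markcor, the epigenome labels, and the choice of k = 2 are assumptions for illustration and do not reflect the provided data.

```r
set.seed(1)

# Hypothetical toy correlation matrix for 5 epigenomes.
x <- matrix(rnorm(200 * 5), ncol = 5,
            dimnames = list(NULL, paste0("epi", 1:5)))
markcor <- cor(x)

d <- as.dist(1 - markcor)
hclust_res <- hclust(d, method = "complete")
MDS_res <- cmdscale(d, eig = TRUE, k = 2)  # keep two dimensions for plotting

# Epigenome labels in dendrogram (left-to-right) order.
leaf_order <- hclust_res$labels[hclust_res$order]

# Coordinates of each epigenome in the first two MDS dimensions.
coords <- MDS_res$points

# Share of the positive eigenvalue sum captured by the first two dimensions.
pos_eig <- MDS_res$eig[MDS_res$eig > 0]
var_explained <- pos_eig[1:2] / sum(pos_eig)
```

Calling plot(hclust_res) and plot(coords) would then draw the dendrogram and the two-dimensional MDS layout, respectively.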