Delineation of DNaseI-accessible regulatory regions

DATA SOURCE

BED files with coordinates for regions of each region type for each epigenome separately are available here for:

Promoter: 81,232 promoter regions (1.43958% of genome)
Download URL: https://egg2.wustl.edu/roadmap/data/byDataType/dnase/BED_files_prom/

Enhancer: 2,328,936 putative enhancer regions (12.6385% of genome)
Download URL: https://egg2.wustl.edu/roadmap/data/byDataType/dnase/BED_files_enh/

Dyadic: 129,960 dyadic promoter/enhancer regions (0.985296% of genome)
Download URL: https://egg2.wustl.edu/roadmap/data/byDataType/dnase/BED_files_dyadic/

RData files containing matrices of chromatin state calls for the three region types:

Promoter: RData
Enhancer: RData
Dyadic: RData

Note that only states of interest are listed for each region type, others are listed as "0". For each of these files, the object 'max_states' contains the chromatin state calls, and the object 'max_probs' contains the ChromHMM posterior probabilities for each (non-"0") state call. Please note that the files provided here are identical to the ones listed here. They are based on the 111 Roadmap reference epigenomes only. Newer versions of these files using all 127 analyzed epigenomes (including ENCODE 2012 epigenomes), as well as files based on imputed chromatin states, are available here.

Chromatin state calls along with positions of DNaseI putative regulatory regions

For each of the 39 Roadmap reference epigenomes with DNase data, peak positions are combined across reference epigenomes by defining peak island areas, defined by stacking all DNase peak positions across epigenomes, and considering the Full Width at Half Maximum (FWHM). Note that for this we are only considering peak locations, not intensities. The goal of this is to obtain an estimate of the area of open chromatin, not to quantify the level of 'opennes', as these data are not available for all reference epigenomes. In cases when peak islands overlap, they are merged because it means that the original DNase peak area populations overlap at least for half of the epigenomes with DNase peaks in that area (given the FWHM approach). Peak island summits are defined as the median peak summit of all peak island member DNase peaks. This results in a total of 3,516,964 DNase enriched regions across epigenomes.

We then annotate each of the ~3.5M DNase peaks with the chromatin states they overlap with in each of the 111 Roadmap reference epigenomes, using the core 15-state chromatin state model, and focusing on states TssA, TssAFlnk, and TssBiv for promoters, and EnhG, Enh, and EnhBiv for enhancers, and state BivFlnk (flanking bivalent Enh/Tss) for ambiguous regions. Out of these, ~2.5M (2,540,128) regions are called as either enhancer or promoter across any of the 111 Roadmap reference epigenomes. Note that because DNase data is not available for all Roadmap epigenomes, the set of regulatory regions defined may exclude DNase regions active in cell types for which DNase was not profiled (Fig. 2g). Although most regions are undisputedly called exclusively promoter or enhancer, there are ~530k regions that needed further study to decide whether they should be called promoters, enhancers, or both (dyadic). We arbitrate on these regions by first clustering them (using the methods in the following section) with an expected cluster size of 10,000 regions, and then for each cluster calculating (a) the mean posterior probabilities for promoter and enhancer calls separately, and (b) the mean number of reference epigenomes in which regions were called promoter or enhancer. Clusters of regions for which the differences in mean posterior probabilities (a) is smaller than 0.5 (note that the paper erroneously says '0.05' here), or for which the absolute log2-ratio of the number of epigenomes called as promoter or enhancer (b) is smaller than 0.05 are called true dyadic regions, along with a small number of ambiguous regions in state BivFlnk. Note that this particular clustering is only to arbitrate on these regions using group statistics instead of one-by-one; the final clusterings are described next. Overall, we define ~2.3M putative enhancer regions (12.63% of genome), ~80k promoter regions (1.44% of genome) and ~130k dyadic regions (0.99% of genome), showing either promoter or enhancer signatures across epigenomes.

Clustering of DNaseI-accessible regulatory regions to identify modules of coordinated activity

DATA SOURCE

BED files with coordinates for regions in each module:

Promoter: 81,232 promoter regions (1.43958% of genome)
Download URL: https://egg2.wustl.edu/roadmap/data/byDataType/regulatory_modules/modules_prom/BED_files/

Open in a new page (deactivate pop-up blockers)
Enhancer: 2,328,936 putative enhancer regions (12.6385% of genome)
Download URL: https://egg2.wustl.edu/roadmap/data/byDataType/regulatory_modules/modules_enh/BED_files/

Open in a new page (deactivate pop-up blockers)
Dyadic: 129,960 dyadic promoter/enhancer regions (0.985296% of genome)
Download URL: https://egg2.wustl.edu/roadmap/data/byDataType/regulatory_modules/modules_dyadic/BED_files/

Open in a new page (deactivate pop-up blockers)

High resolution figures of module heatmaps for each module:

Promoter: PDF, PNG
Enhancer: PDF, PNG
Dyadic: PDF, PNG

The order in which modules are plotted in the heatmaps:

Promoter: TXT
Enhancer: TXT
Dyadic: TXT

Please note that the files provided here are identical to the ones listed here. They are based on the 111 Roadmap reference epigenomes only. Newer versions of these files using all 127 analyzed epigenomes (including ENCODE 2012 epigenomes), as well as files based on imputed chromatin states, are available here.

In order to cluster regulatory (i.e., enhancer, promoter or dyadic) regions based on their activity patterns across all reference epigenomes, we expressed each region in terms of a binary vector of length nxs, where n is the number of reference epigenomes (111) and s is the number of chromatin states considered. For enhancers and promoters, s=3, as both of these types of regions are made up of 3 chromatin states in the 15-state ChromHMM model (enhancers: EnhG, Enh & EnhBiv, promoters: TssA, TssAFlnk & TssBiv).

The thus obtained binary matrices are subsequently clustered using a variation of a k-centroid clustering algorithm (Leisch et al. (2006)). Instead of Euclidean distance we use a Jaccard-index based distance. This is done to be able to correctly cluster highly cell type restricted regions. From a computational point of view, we optimized the method to both deal with the size of the used data matrices and leverage their sparsity, in order to efficiently compute and update distances for matrices with sizes on the order of 106x103. The algorithm has been further modified to converge when less than 0.01% of cluster assignments change between iterations.

We selected the number of clusters k by tuning the expected number of regions within each cluster to be approximately 1000 for promoter and dyadic regions, and approximately 10,000 for enhancer regions, given their much larger count (81k, 129k, and 2.3M for promoter, dyadic, and enhancer respectively). This results in a value of k=233 for enhancer clusters (for ~10k elements per cluster), and the algorithm converged on k=226 non-empty clusters, which are used for subsequent analyses.

Clusters are visualized (Fig. 7a) by diagonalizing when possible. First, 'ubiquitous' clusters (defined as having at least 50% of epigenomes with an enhancer/promoter density of > 25%) are shown. Then, the remaining clusters are ordered according to which epigenome has the maximum enhancer density.

Enrichment analyses of proximity to gene members of a catalogue of gene sets (Gene Ontology (GO), Human Phenotype Ontology (HPO)) have been performed using the GREAT tool (McLean et al. (2010)). In particular, the GREAT web API was used to automatically submit region descriptions and retrieve results for subsequent parsing. We restricted ourselves to interpretation of results with an enrichment ratio of at least 2, and multiple hypothesis testing corrected, p-values < 0.01 for both the binomial and the hypergeometric distribution based tests.

For visualization of a representative subset of enriched terms in Fig. 7b and Fig. 7c, we select representative terms for display (after diagonalizing the enrichment matrix by re-ordering the rows). We do this using a weighted bag-of-words approach to select highly-enriched terms that contain many words that are overrepresented in gene-set labels showing similar enrichment patterns. Briefly, sliding along the row names (gene-set terms) of the, diagonalized, enrichment matrices, we collect word counts and multiply these by integer-rounded -log10(q-values) obtained from GREAT. We do this in sliding windows of size 33 for Fig. 7b (resulting in 35 terms) and size 16 for Fig. 7c (resulting in 15 terms). For each word in a window, these values are expressed relative to the same words across all row names, registering to what extend they are over-represented. Each gene-set term in the window is then assigned a score based on the mean over-representation of all words it consists of. Lastly, gene-sets are co-ranked based on this mean over-representation and their GREAT significance. The best-ranked gene set label is selected as the representative label for that window. All terms are shown in Fig. S11d and available for download on the supplementary website.