Predicting motifs and active regulators in each cell-type/tissue/lineage
DATA SOURCE
- Primary motif resource:
http://compbio.mit.edu/encode-motifs/
- Data underlying Figures Extended Data 8a, S13a (clustered), S13b,c (unclustered), S13d,e (unique):
https://egg2.wustl.edu/roadmap/data/byDataType/motifanalysis/pouyak/
- Motif enrichments (enhancer cluster centric):
https://egg2.wustl.edu/roadmap/data/byDataType/motifanalysis/pouyak/viewByCluster/bycluster.html
We collect 1,772 known TF recognition motifs (position weight matrices) from primarily large-scale databases (Matys et al. (2003), Sandelin et al. (2004), Berger et al. (2006), Berger et al. (2008), Jolma et al. (2013), Kheradpour et al. (2014), Badis et al. (2009)) and measure their enrichment in the enhancers for each enhancer module compared to the union of the 226 enhancer modules (as described in Kheradpour et al. (2014) and Ernst et al. (2011)) using a 0.3 conservation-based confidence cutoff (Lindblad-Toh et al. (2011), Kheradpour et al. (2007)). We cluster motifs using a 0.75 correlation cutoff resulting in 300 motif clusters (Kheradpour et al. (2014)) and select for each motif cluster the motif with the highest enrichment in any enhancer module for further analysis.
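The choice of one representative motif per cluster can be illustrated with a minimal sketch. The function name, the dictionary layout, and the reading of "highest enrichment" as the maximum log2 enrichment (rather than the maximum absolute value) are assumptions made for illustration, not details taken from the original pipeline.

```python
def cluster_representatives(cluster_of, enrichment):
    """Pick, for each motif cluster, the motif with the highest enrichment
    in any enhancer module.

    cluster_of: dict mapping motif name -> cluster id (hypothetical layout)
    enrichment: dict mapping motif name -> list of log2 enrichments, one per module
    """
    best = {}  # cluster id -> (motif name, peak enrichment)
    for motif, cluster in cluster_of.items():
        peak = max(enrichment[motif])  # assumption: signed maximum, not absolute
        if cluster not in best or peak > best[cluster][1]:
            best[cluster] = (motif, peak)
    return {cluster: motif for cluster, (motif, _) in best.items()}
```

For example, with motifs A and B in one cluster and C in another, the motif whose best module enrichment is largest is kept for each cluster.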
We compute an expression score for each enhancer module and transcription factor as the Pearson correlation between the TF expression across cell types with expression data (quantile-normalized log(RPKM) with zeros replaced by log(0.0005)) and the center of a module. For each enhancer module, its center is defined as a vector of length 111, containing the fraction of regions in that module called as (any type of) enhancer in each of the 111 epigenomes analyzed. This expression score is meant to act as the "expression" of a transcription factor within a module of cell types. We then compute an expression-enrichment value for each transcription factor as the correlation of this expression score and the enrichment of the corresponding motif across enhancer modules. The top 40 motifs in terms of their absolute expression-enrichment correlation, and the clusters with a log2 enrichment or depletion of at least 1.5 for at least one motif, are shown in Fig. 8 and Extended Data 8a (only one motif is shown in Fig. 8 for each factor).
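The two correlation steps can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, the module centers are assumed to be already restricted to the cell types that have expression data, and the expression-enrichment value is assumed to be a Pearson correlation (the text says only "correlation").

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def expression_scores(tf_expression, module_centers):
    """One score per module: correlation of TF expression across cell types
    with that module's center (fraction of regions called enhancer per cell type)."""
    return [pearson(tf_expression, center) for center in module_centers]


def expression_enrichment(tf_expression, module_centers, motif_log2_enrichment):
    """Correlation, across modules, of the expression scores with the
    enrichment of the TF's motif (assumed Pearson here)."""
    scores = expression_scores(tf_expression, module_centers)
    return pearson(scores, motif_log2_enrichment)


# Toy data: 4 cell types, 3 modules (values are illustrative only).
tf_expression = [5.0, 4.0, 1.0, 0.5]
module_centers = [
    [0.9, 0.8, 0.1, 0.0],
    [0.1, 0.2, 0.9, 0.8],
    [0.5, 0.5, 0.5, 0.4],
]
motif_log2_enrichment = [2.0, -1.5, 0.1]
scores = expression_scores(tf_expression, module_centers)
value = expression_enrichment(tf_expression, module_centers, motif_log2_enrichment)
```

In the toy data the TF is expressed in the cell types where the first module is active and its motif is enriched there, so the expression-enrichment value comes out positive.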
We show all 84 motifs that were significantly enriched (log2>=1.5) in any enhancer module, across the full set of 226 enhancer modules (Fig. S13a) and in the 101 modules in which they were significantly enriched (Extended Data 8a). Similarly, we show all 10 enriched motifs across the full set of 111 individual reference epigenomes (Fig. S13b) and specifically in the 15 enriched epigenomes (Fig. S13c). Lastly, we show all 19 enriched motifs across the full set of 17 tissue groups (Fig. S13d), and specifically within the 10 groups that showed significant enrichments (Fig. S13e).
For visualization of regulator-cell type links (Fig. 8), we compute edge weights between each cell type and motif using these motif-module enrichments. For each motif and cell type, we compute the sum across all modules of the product of the log2 motif enrichment and the value of the cell type within the module center. Only highly associated cell types are considered: center values <0.7 are replaced with 0. We show all resulting edge weights of at least 1.5 and visualize the network using Cytoscape (Shannon et al. (2003)).
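The edge-weight computation described above can be sketched in a few lines. The function and parameter names are hypothetical; the 0.7 center cutoff and the 1.5 display cutoff are the values given in the text.

```python
def edge_weight(log2_enrichment_by_module, center_value_by_module, min_center=0.7):
    """Sum over modules of (log2 motif enrichment) x (cell type's value in the
    module center), with center values below min_center replaced by 0."""
    total = 0.0
    for enrichment, value in zip(log2_enrichment_by_module, center_value_by_module):
        if value < min_center:
            value = 0.0  # keep only highly associated cell types
        total += enrichment * value
    return total


def is_shown(weight, min_weight=1.5):
    """Edges with weight of at least min_weight are drawn in the network."""
    return weight >= min_weight


# Toy example: the second module is dropped because 0.5 < 0.7.
w = edge_weight([2.0, 1.0, -0.5], [0.9, 0.5, 0.8])
```

Here `w` is 2.0 x 0.9 - 0.5 x 0.8 = 1.4, which falls below the 1.5 cutoff, so this edge would not be displayed.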
Using the same motif enrichment method described above, we computed the motif enrichment in the tissue-specific Digital Genomic Footprinting (DGF) regions in each library. The tissue-specific DGF regions were identified by selecting the DGF regions occurring in no more than 20 of the 42 DGF libraries. To generate Extended Data 9b, we standardized the motif enrichment in each library into z-scores for each motif (row) and colored each DGF library (column) based on its tissue type.
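A minimal sketch of the two steps in this paragraph: filtering tissue-specific DGF regions and standardizing each motif's enrichments across libraries. The names and data layout are hypothetical, and the use of the population standard deviation is an assumption, since the text does not say which estimator was used.

```python
import statistics


def tissue_specific_regions(libraries_by_region, max_libraries=20):
    """Keep regions observed in no more than max_libraries DGF libraries.

    libraries_by_region: dict mapping region id -> set of library ids
    (hypothetical layout)."""
    return {
        region
        for region, libraries in libraries_by_region.items()
        if len(libraries) <= max_libraries
    }


def row_zscores(enrichments):
    """Standardize one motif's enrichments across libraries into z-scores.
    Assumption: population standard deviation; a constant row maps to zeros."""
    mean = statistics.fmean(enrichments)
    sd = statistics.pstdev(enrichments)
    if sd == 0:
        return [0.0 for _ in enrichments]
    return [(value - mean) / sd for value in enrichments]
```

Applying `row_zscores` to each motif row yields the matrix that is then colored by tissue type per column.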
DNA Motif Positional Bias in Digital Genomic Footprinting Sites
DATA SOURCE
- Primary motif resource:
http://compbio.mit.edu/encode-motifs/
- DNase/DGF Footprint calls:
https://egg2.wustl.edu/roadmap/data/byDataType/dgfootprints/
Format: 5 column BED files. 4th column is footprint ID. 5th column is FOS score. See below.
- Data underlying Figures 8, Extended Data 9b, 9c:
https://egg2.wustl.edu/roadmap/data/byDataType/motifanalysis/zhizhou/
To quantify the occupancy at transcription factor recognition sequences within DNase-hypersensitive sites genome-wide, we computed for each instance a footprint occupancy score (FOS) relating the density of DNase I cleavages within the core recognition motif to cleavages in the immediately flanking regions:
FOS = (C + 1)/L + (C + 1)/R
where C represents the average number of tags in the central component, L is the average number of tags in the left flanking component, and R is the average number of tags in the right flanking component. A smaller FOS value indicates greater average contrast between the central component and its flanking regions. The FOS can be used to rank motif instances by the depth of the footprint at that position, and is expected to provide a quantitative measure of factor occupancy. Detailed methods are available here
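The FOS formula translates directly into code. This sketch assumes per-base tag counts are supplied as lists for the central, left, and right components (a hypothetical input layout) and that both flanks contain at least one tag, since the text does not say how empty flanks are handled.

```python
import statistics


def footprint_occupancy_score(center_tags, left_tags, right_tags):
    """FOS = (C + 1)/L + (C + 1)/R, where C, L and R are the average tag
    counts in the central, left-flank and right-flank components.
    Assumption: L and R are non-zero."""
    c = statistics.fmean(center_tags)
    left = statistics.fmean(left_tags)
    right = statistics.fmean(right_tags)
    return (c + 1) / left + (c + 1) / right


# Toy example: a shallow footprint and a deeper one with the same flanks.
shallow = footprint_occupancy_score([1, 1, 1, 1], [4, 4], [8, 8])
deep = footprint_occupancy_score([0, 0, 0, 0], [4, 4], [8, 8])
```

The deeper footprint (no cleavages in the center) gets the smaller score, matching the statement that smaller FOS values indicate greater contrast.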
We compute the positional enrichment of each driver motif (Extended Data 9c, Extended Data 10) relative to the Digital Genomic Footprinting (DGF) sites in each cell type (Table S5b). For each driver TF motif, we generated two views: the motif position (the center of the motif instance) relative to the center of the closest DGF site (center view), and the motif position relative to the boundary of the closest DGF site (boundary view). We only consider motif instances whose closest DGF site is within 100bp. For the center view, we plotted the motif occurrence density with respect to the distance to the DGF center for different cell types. For the boundary view, we took the shortest distance from the center of a motif instance to either DGF boundary, and assigned a negative distance if the motif instance is inside the DGF and a positive distance otherwise. As in the center view, we plot the motif density with respect to the derived distance value in the boundary view for each cell type.
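The two distance definitions can be written out as a short sketch. The function names are hypothetical, and treating the DGF start and end coordinates as the boundary positions (with a motif center lying exactly on a boundary counted as inside) is an assumption. The 100bp filter on the closest DGF site is left out because the text does not state how that distance is measured.

```python
def center_distance(motif_center, dgf_start, dgf_end):
    """Center view: signed offset of the motif center from the DGF center."""
    return motif_center - (dgf_start + dgf_end) / 2


def boundary_distance(motif_center, dgf_start, dgf_end):
    """Boundary view: shortest distance to either DGF boundary,
    negative when the motif center lies inside the DGF, positive otherwise."""
    shortest = min(abs(motif_center - dgf_start), abs(motif_center - dgf_end))
    inside = dgf_start <= motif_center <= dgf_end
    return -shortest if inside else shortest
```

For a DGF spanning 100 to 120, a motif centered at 105 has a center distance of -5.0 and a boundary distance of -5, while a motif centered at 90 has a boundary distance of 10.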
To assess the significance of the motif concentration within DGF sites in each cell type, we compute the DGF enrichment ratio as the ratio between the number of motif instances less than 20bp from the DGF center and the number in the immediately flanking window, that is, motif instances more than 20bp and less than 40bp from the DGF center. As a control, we randomly sampled the same number of motif instances from shuffled versions of the given motif and obtained the DGF enrichment ratio for the shuffled motif instances. The DGF enrichment ratio of the true motif is converted to a z-score using the mean and standard deviation of the DGF enrichment ratios of the shuffled motif over 1000 random samplings. The adjusted p-value is then computed from the z-score with Bonferroni correction for the number of cell types.
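The significance test can be sketched as three small functions. The names are hypothetical. Two choices are assumptions not stated in the text: the sample standard deviation is used for the shuffled ratios, and the p-value is taken from the one-sided upper tail of a standard normal distribution.

```python
import math
import statistics


def dgf_enrichment_ratio(distances_to_center, inner=20, outer=40):
    """Motif instances closer than `inner` bp to the DGF center, divided by
    instances in the flanking window (more than `inner`, less than `outer`)."""
    near = sum(1 for d in distances_to_center if abs(d) < inner)
    flank = sum(1 for d in distances_to_center if inner < abs(d) < outer)
    return near / flank  # assumes the flanking window is not empty


def z_score(observed_ratio, shuffled_ratios):
    """Standardize the true motif's ratio against the shuffled-motif ratios
    (for example, from 1000 random samplings)."""
    mean = statistics.fmean(shuffled_ratios)
    sd = statistics.stdev(shuffled_ratios)  # assumption: sample sd
    return (observed_ratio - mean) / sd


def bonferroni_p(z, n_cell_types):
    """One-sided normal p-value (assumption), Bonferroni-adjusted for the
    number of cell types and capped at 1."""
    p = 0.5 * math.erfc(z / math.sqrt(2))
    return min(1.0, p * n_cell_types)


# Toy example: 4 instances near the center, 2 in the flanking window.
ratio = dgf_enrichment_ratio([0, 5, -10, 19, 25, -30, 20, 45])
z = z_score(ratio, [1.0, 1.2, 0.8])
```

Instances at exactly 20bp fall in neither window, following the strict inequalities in the description.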
The motifs that were predictive of epigenomic modifications (Whitaker et al. (2014)) were compared to Digital Genomic Footprinting (DGF) data in Table S5a. This was done in three cell types where both DGF and predictive motifs were available: H1 BMP4 Derived Mesendoderm Cultured Cells (E004), H1 BMP4 Derived Trophoblast Cultured Cells (E005), and H1 Derived Mesenchymal Stem Cells (E006). The motifs that were predictive of the following seven inputs were considered: H3K27me3, H3K27ac, H3K9me3, H3K36me3, H3K4me1, H3K4me3 and DNA methylation valleys (DMV) (Xie et al. (2013)). To identify overlaps, the predictive motifs were scanned against the peaks of the corresponding modification, and the location of the best match between motif and sequence was recorded. We then counted the number of times the locations of the best motif matches overlapped a DGF by at least one bp. These counts were compared to the number of overlaps expected at random, which was calculated by comparing DGF to random locations within the modification peaks. The reported random frequency was the average of 100 repeats. To calculate the fold enrichment, we divided the observed frequency by the random frequency.
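The fold-enrichment calculation can be sketched as below. Several details are assumptions made for illustration: intervals are half-open (start, end) pairs, each best match lies within its peak, and each random location is an interval of the same width as the best match, placed uniformly within the same peak. All names are hypothetical.

```python
import random


def overlaps_any(interval, footprints):
    """True if the half-open interval overlaps any footprint by at least 1 bp."""
    start, end = interval
    return any(start < f_end and f_start < end for f_start, f_end in footprints)


def fold_enrichment(best_matches, peaks, footprints, repeats=100, seed=0):
    """Observed overlap frequency divided by the average random frequency.

    best_matches[i] is the best motif match inside peaks[i]."""
    rng = random.Random(seed)
    observed = sum(overlaps_any(m, footprints) for m in best_matches) / len(best_matches)
    random_frequencies = []
    for _ in range(repeats):
        hits = 0
        for (m_start, m_end), (p_start, p_end) in zip(best_matches, peaks):
            width = m_end - m_start
            # Random location of the same width inside the same peak (assumption).
            r_start = rng.randrange(p_start, p_end - width + 1)
            hits += overlaps_any((r_start, r_start + width), footprints)
        random_frequencies.append(hits / len(peaks))
    random_frequency = sum(random_frequencies) / repeats
    return observed / random_frequency  # assumes at least one random overlap


# Toy example: one peak, one footprint covering part of it.
fold = fold_enrichment([(102, 110)], [(0, 1000)], [(100, 400)])
```

In the toy example the best match always overlaps the footprint while a random placement does so only part of the time, so the fold enrichment is above 1.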