Chromatin state learning
To capture the significant combinatorial interactions between different chromatin marks in their spatial context (chromatin states) across 127 epigenomes, we used ChromHMM v1.10 (Ernst et al., 2012), which is based on a multivariate Hidden Markov Model.
Core 15-state model (5 marks, 127 epigenomes)
DATA SOURCE
- Download URL:
https://egg2.wustl.edu/roadmap/data/byFileType/chromhmmSegmentations/ChmmModels/coreMarks/jointModel/final
- Summarized visualization of all 127 epigenomes using epilogos
- Emission, transition probabilities and enrichment of states relative to various genomic and functional annotations
- MNEMONICS BED FILES ( [Epigenome_id]_15_coreMarks_mnemonics.bed.gz files )
- Tab delimited 4 columns
- chromosome, start (0-based), stop (1-based), state_label_mnemonic for that region
- ARCHIVE of all mnemonics.bed files
- BROWSER FRIENDLY FILES ([Epigenome_id]_15_coreMarks_dense.bb)
- The dense BIGBED files will allow you to view each epigenome as a single track with regions labeled with state mnemonics and representative colors. You can stream these to UCSC Genome Browser or IGV
- ARCHIVE of all the dense BIGBED files
- [Epigenome_id]_15_coreMarks_dense.bed.gz (Same as above except in text format)
- ARCHIVE of all dense BED files
- [Epigenome_id]_15_coreMarks_expanded.bed.gz files: The expanded files will allow you to view each epigenome with each state as a separate track labeled with state mnemonics and representative colors
- ARCHIVE of expanded BED files
- WE ALSO PROVIDE LIFT-OVER GRCh38 FILES ([Epigenome_id]_15_coreMarks_hg38lift_*)
- STATES FOR EACH 200bp BIN:
https://egg2.wustl.edu/roadmap/data/byFileType/chromhmmSegmentations/ChmmModels/coreMarks/jointModel/final/STATEBYLINE/
- Max. posterior state label for each 200 bp bin in each chromosome for all epigenomes. These files differ from the Mnemonic BED files in that the Mnemonic files merge contiguous bins with the same state label and assign one label to the entire merged region, whereas these files are at a fixed 200 bp resolution.
- ARCHIVE of state-by-line files
- POSTERIOR PROBABILITY FOR EACH 200bp BIN:
https://egg2.wustl.edu/roadmap/data/byFileType/chromhmmSegmentations/ChmmModels/coreMarks/jointModel/final/POSTERIOR/
- Posterior probabilities of each state in each 200 bp bin for all chromosomes in all epigenomes
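The relationship between the posterior probabilities, the fixed-resolution state-by-line labels and the merged Mnemonic BED regions can be sketched as follows. This is a minimal illustration only: the function names and the in-memory inputs (lists of probabilities and labels) are assumptions made for the example and do not reflect the on-disk layout of the files listed above.

```python
from itertools import groupby

BIN_SIZE = 200  # bp, the bin resolution described above


def max_posterior_labels(posteriors, state_names):
    """Pick, for every bin, the state with the highest posterior probability.

    `posteriors` holds one row per bin and one probability per state.
    """
    labels = []
    for row in posteriors:
        best = max(range(len(row)), key=row.__getitem__)
        labels.append(state_names[best])
    return labels


def merge_bins(chrom, labels, bin_size=BIN_SIZE):
    """Collapse runs of identical per-bin labels into BED-style intervals.

    Bins are assumed to be consecutive and to start at position 0.
    Coordinates follow the BED convention: 0-based start, end not included.
    """
    intervals = []
    first_bin = 0
    for label, run in groupby(labels):
        n_bins = len(list(run))
        start = first_bin * bin_size
        end = (first_bin + n_bins) * bin_size
        intervals.append((chrom, start, end, label))
        first_bin += n_bins
    return intervals


# Hypothetical posteriors for five bins and three states.
states = ["TssA", "Enh", "Quies"]
post = [
    [0.90, 0.05, 0.05],
    [0.80, 0.10, 0.10],
    [0.10, 0.20, 0.70],
    [0.10, 0.10, 0.80],
    [0.20, 0.70, 0.10],
]
per_bin = max_posterior_labels(post, states)
merged = merge_bins("chr1", per_bin)
```

Here `per_bin` corresponds to the fixed 200 bp view (one label per bin), while `merged` holds three regions covering 0-400, 400-800 and 800-1000 on the hypothetical chromosome.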
The states are as follows
STATE NO. | MNEMONIC | DESCRIPTION | COLOR NAME | COLOR CODE |
---|---|---|---|---|
1 | TssA | Active TSS | Red | 255,0,0 |
2 | TssAFlnk | Flanking Active TSS | Orange Red | 255,69,0 |
3 | TxFlnk | Transcr. at gene 5' and 3' | LimeGreen | 50,205,50 |
4 | Tx | Strong transcription | Green | 0,128,0 |
5 | TxWk | Weak transcription | DarkGreen | 0,100,0 |
6 | EnhG | Genic enhancers | GreenYellow | 194,225,5 |
7 | Enh | Enhancers | Yellow | 255,255,0 |
8 | ZNF/Rpts | ZNF genes & repeats | Medium Aquamarine | 102,205,170 |
9 | Het | Heterochromatin | PaleTurquoise | 138,145,208 |
10 | TssBiv | Bivalent/Poised TSS | IndianRed | 205,92,92 |
11 | BivFlnk | Flanking Bivalent TSS/Enh | DarkSalmon | 233,150,122 |
12 | EnhBiv | Bivalent Enhancer | DarkKhaki | 189,183,107 |
13 | ReprPC | Repressed PolyComb | Silver | 128,128,128 |
14 | ReprPCWk | Weak Repressed PolyComb | Gainsboro | 192,192,192 |
15 | Quies | Quiescent/Low | White | 255,255,255 |
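For scripted use, the table above can be held as a simple lookup. The sketch below copies the state numbers, mnemonics and color codes from the table and formats a generic nine-column BED line with an `itemRgb` field. The column layout is the standard BED9 convention, used here as an assumption; it is not a statement about the exact contents of the dense BED files, and `dense_bed_line` is a hypothetical helper name.

```python
# State number -> (mnemonic, "R,G,B"), copied from the 15-state table above.
CORE15 = {
    1: ("TssA", "255,0,0"),
    2: ("TssAFlnk", "255,69,0"),
    3: ("TxFlnk", "50,205,50"),
    4: ("Tx", "0,128,0"),
    5: ("TxWk", "0,100,0"),
    6: ("EnhG", "194,225,5"),
    7: ("Enh", "255,255,0"),
    8: ("ZNF/Rpts", "102,205,170"),
    9: ("Het", "138,145,208"),
    10: ("TssBiv", "205,92,92"),
    11: ("BivFlnk", "233,150,122"),
    12: ("EnhBiv", "189,183,107"),
    13: ("ReprPC", "128,128,128"),
    14: ("ReprPCWk", "192,192,192"),
    15: ("Quies", "255,255,255"),
}


def dense_bed_line(chrom, start, end, state_no):
    """Format one region as a tab-delimited BED9 line colored by state.

    Fields: chrom, start, end, name, score, strand, thickStart, thickEnd, itemRgb.
    """
    mnemonic, rgb = CORE15[state_no]
    fields = [chrom, str(start), str(end), mnemonic, "0", ".", str(start), str(end), rgb]
    return "\t".join(fields)


line = dense_bed_line("chr1", 0, 200, 1)
```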
A ChromHMM model applicable to all 127 epigenomes was learned by virtually concatenating consolidated data corresponding to the core set of 5 chromatin marks assayed in all epigenomes (H3K4me3, H3K4me1, H3K36me3, H3K27me3, H3K9me3). The model was trained on the 60 epigenomes with the highest-quality data (Fig. 2k), which provided sufficient coverage of the different lineages and tissue types (Table S1 - Sheet QCSummary).
The ChromHMM parameters used were as follows. Reads were shifted in the 5' to 3' direction by 100 bp. For each consolidated ChIP-seq dataset, read counts were computed in non-overlapping 200 bp bins across the entire genome. Each bin was discretized into two levels, 1 indicating enrichment and 0 indicating no enrichment. The binarization was performed by comparing ChIP-seq read counts to the corresponding whole-cell extract control read counts within each bin, using a Poisson p-value threshold of 1e-4 (the default discretization threshold in ChromHMM).
We trained several models in parallel mode, with the number of states ranging from 10 to 25. We chose a 15-state model (Fig. 4a-f, Extended Data 2b) for all further analyses because it captured all the key interactions between the chromatin marks, and because larger numbers of states did not capture sufficiently distinct interactions. The trained model was then used to compute the posterior probability of each state for each genomic bin in each reference epigenome. The regions were labeled using the state with the maximum posterior probability.
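The binarization step described above can be illustrated with a simplified sketch. It is not ChromHMM's implementation: the way the expected count is derived from the control (a pseudocount plus an optional `scale` factor) is an assumption made for the example, and the function names are hypothetical. Only the idea of calling a bin enriched when its Poisson p-value falls below 1e-4 is taken from the text.

```python
import math


def poisson_sf(k, lam):
    """Return P(X >= k) for X ~ Poisson(lam), computed by direct summation."""
    if k <= 0:
        return 1.0
    term = math.exp(-lam)  # P(X = 0)
    cdf = term
    for i in range(1, k):
        term *= lam / i    # P(X = i) from P(X = i - 1)
        cdf += term
    return max(0.0, 1.0 - cdf)


def binarize(chip_counts, control_counts, threshold=1e-4, pseudocount=1.0, scale=1.0):
    """Call each bin 1 (enriched) or 0 (not enriched).

    The expected count per bin is (control + pseudocount) * scale, which is
    an illustrative assumption rather than ChromHMM's exact rule.
    """
    calls = []
    for chip, ctrl in zip(chip_counts, control_counts):
        lam = (ctrl + pseudocount) * scale
        p_value = poisson_sf(chip, lam)
        calls.append(1 if p_value < threshold else 0)
    return calls


# Hypothetical read counts in four consecutive 200 bp bins.
chip = [0, 3, 25, 1]
control = [2, 2, 2, 2]
calls = binarize(chip, control)
```

With these toy counts only the third bin, where 25 reads are observed against an expectation of 3, is called enriched. The direct summation is adequate for small expected counts; a library routine would be preferable for genome-scale data.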
Expanded 18-state model (6 marks, 98 epigenomes)
DATA SOURCE
- Download URL:
https://egg2.wustl.edu/roadmap/data/byFileType/chromhmmSegmentations/ChmmModels/core_K27ac/jointModel/final/
- Summarized visualization of 98 epigenomes using epilogos
- Emission, transition probabilities and enrichment of states relative to various genomic and functional annotations
- MNEMONICS BED FILES ( [Epigenome_id]_18_core_K27ac_mnemonics.bed.gz files )
- Tab delimited 4 columns
- chromosome, start (0-based), stop (1-based), state_label_mnemonic for that region
- ARCHIVE of all mnemonics.bed files
- BROWSER FRIENDLY FILES ([Epigenome_id]_18_core_K27ac_dense.bb)
- The dense BIGBED files will allow you to view each epigenome as a single track with regions labeled with state mnemonics and representative colors. You can stream these to UCSC Genome Browser or IGV
- ARCHIVE of all the dense BIGBED files
- [Epigenome_id]_18_core_K27ac_dense.bed.gz (Same as above except in text format)
- ARCHIVE of all dense BED files
- [Epigenome_id]_18_core_K27ac_expanded.bed.gz files: The expanded files will allow you to view each epigenome with each state as a separate track labeled with state mnemonics and representative colors
- ARCHIVE of expanded BED files
- WE ALSO PROVIDE LIFT-OVER GRCh38 FILES ([Epigenome_id]_18_core_K27ac_hg38lift_*)
- STATES FOR EACH 200bp BIN:
https://egg2.wustl.edu/roadmap/data/byFileType/chromhmmSegmentations/ChmmModels/core_K27ac/jointModel/final/STATEBYLINE/
- Max. posterior state label for each 200 bp bin in each chromosome for all epigenomes. These files differ from the Mnemonic BED files in that the Mnemonic files merge contiguous bins with the same state label and assign one label to the entire merged region, whereas these files are at a fixed 200 bp resolution.
- ARCHIVE of state-by-line files
- POSTERIOR PROBABILITY FOR EACH 200bp BIN:
https://egg2.wustl.edu/roadmap/data/byFileType/chromhmmSegmentations/ChmmModels/core_K27ac/jointModel/final/POSTERIOR/
- Posterior probabilities of each state in each 200 bp bin for all chromosomes in all epigenomes
The states are as follows
STATE NO. | MNEMONIC | DESCRIPTION | COLOR NAME | COLOR CODE |
---|---|---|---|---|
1 | TssA | Active TSS | Red | 255,0,0 |
2 | TssFlnk | Flanking TSS | Orange Red | 255,69,0 |
3 | TssFlnkU | Flanking TSS Upstream | Orange Red | 255,69,0 |
4 | TssFlnkD | Flanking TSS Downstream | Orange Red | 255,69,0 |
5 | Tx | Strong transcription | Green | 0,128,0 |
6 | TxWk | Weak transcription | DarkGreen | 0,100,0 |
7 | EnhG1 | Genic enhancer1 | GreenYellow | 194,225,5 |
8 | EnhG2 | Genic enhancer2 | GreenYellow | 194,225,5 |
9 | EnhA1 | Active Enhancer 1 | Orange | 255,195,77 |
10 | EnhA2 | Active Enhancer 2 | Orange | 255,195,77 |
11 | EnhWk | Weak Enhancer | Yellow | 255,255,0 |
12 | ZNF/Rpts | ZNF genes & repeats | Medium Aquamarine | 102,205,170 |
13 | Het | Heterochromatin | PaleTurquoise | 138,145,208 |
14 | TssBiv | Bivalent/Poised TSS | IndianRed | 205,92,92 |
15 | EnhBiv | Bivalent Enhancer | DarkKhaki | 189,183,107 |
16 | ReprPC | Repressed PolyComb | Silver | 128,128,128 |
17 | ReprPCWk | Weak Repressed PolyComb | Gainsboro | 192,192,192 |
18 | Quies | Quiescent/Low | White | 255,255,255 |
A second "expanded" model, applicable to the 98 epigenomes that also have an H3K27ac ChIP-seq dataset, was learned by virtually concatenating consolidated data corresponding to the core set of 5 chromatin marks and H3K27ac. The model was trained on 40 high-quality epigenomes using the same parameters as those used for the primary model (Table S1 - Sheet QCSummary). We trained several models with the number of states ranging from 15 to 25. An 18-state model was used for further analyses (Extended Data 2c) based on similar considerations.
State labels, interpretation and mnemonics
In order to assign biologically meaningful mnemonics to the states, we used the ChromHMM package to compute the overlap and neighborhood enrichments of each state relative to various types of functional annotations (Fig. 4b-c,f, Extended Data 2b,c, Fig. S2).
For any set of genomic coordinates representing a genomic feature and a given state, the fold enrichment of overlap is calculated as the ratio of "the joint probability of a region belonging to the state and the feature" to "the marginal probability of observing the state in the genome" times "the probability of observing the feature", i.e. the value expected if state and feature were independent. That is, it is the ratio between (#bases in state AND overlapping feature)/(#bases in genome) and [(#bases overlapping feature)/(#bases in genome) x (#bases in state)/(#bases in genome)]. The neighborhood enrichment is computed for genomic bins around a set of single-base-pair anchor locations in the genome, e.g. transcription start sites.
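As a worked illustration of the overlap fold enrichment defined above, the ratio can be written directly in terms of base counts. The function and argument names are hypothetical; the arithmetic follows the definition in the text.

```python
def fold_enrichment(bases_state_and_feature, bases_state, bases_feature, bases_genome):
    """Observed joint fraction divided by the fraction expected under independence."""
    observed = bases_state_and_feature / bases_genome
    expected = (bases_state / bases_genome) * (bases_feature / bases_genome)
    return observed / expected


# Toy numbers: a 1,000-base genome, 100 bases in the state,
# 50 bases in the feature, 20 bases in both.
example = fold_enrichment(20, 100, 50, 1000)
```

In this toy case the observed joint fraction is 0.02 and the expected fraction is 0.1 x 0.05 = 0.005, so the state is about 4-fold enriched for the feature. A value near 1 would indicate no enrichment relative to independence.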
For the overlap enrichment plots in the figures, the enrichments for each genomic feature (column) across all states are normalized by subtracting the minimum value of the column and then dividing by the maximum of the column. The values therefore always range from 0 (white) to 1 (dark blue), i.e. it is a column-wise relative scale. For the neighborhood positional enrichment plots, the normalization is done across all columns, i.e. the minimum value over the entire matrix is subtracted from each value, and the result is divided by the maximum over the entire matrix.
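The two normalization schemes can be contrasted in a short sketch. It assumes that the division uses the maximum taken after the minimum has been subtracted, since that is what makes the result span exactly 0 to 1; the function names are hypothetical.

```python
def normalize_columns(matrix):
    """Column-wise min-max scaling (overlap enrichment plots).

    Rows are states and columns are genomic features.
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    out = [[0.0] * n_cols for _ in range(n_rows)]
    for j in range(n_cols):
        column = [matrix[i][j] for i in range(n_rows)]
        lowest = min(column)
        span = max(column) - lowest  # maximum after subtracting the minimum
        for i in range(n_rows):
            out[i][j] = (matrix[i][j] - lowest) / span if span else 0.0
    return out


def normalize_matrix(matrix):
    """Matrix-wide min-max scaling (neighborhood enrichment plots)."""
    values = [v for row in matrix for v in row]
    lowest = min(values)
    span = max(values) - lowest
    return [[(v - lowest) / span if span else 0.0 for v in row] for row in matrix]


# Hypothetical enrichments: 3 states (rows) x 2 features (columns).
enrichments = [[1.0, 10.0], [3.0, 20.0], [5.0, 40.0]]
by_column = normalize_columns(enrichments)
overall = normalize_matrix(enrichments)
```

Column-wise scaling makes every feature use the full color range, so features with very different absolute enrichments remain comparable, whereas matrix-wide scaling preserves relative magnitudes across positions.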
The functional annotations used were as follows (All coordinates were relative to the hg19 version of the human genome):
Annotation files for overlap enrichments: https://egg2.wustl.edu/roadmap/src/chromHMM/bin/COORDS/hg19/
Annotation files for neighborhood enrichments: https://egg2.wustl.edu/roadmap/src/chromHMM/bin/ANCHORFILES/hg19/
- (1) CpG islands obtained from the UCSC table browser
- (2) Exons, genes, introns, transcription start sites (TSSs) and transcription end sites (TESs), 2 kb windows around TSSs and 2 kb windows around TESs based on the GENCODEv10 annotation (http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeGencodeV10/) restricted to GENCODE biotypes annotating long transcripts.
- (3) Expressed and non-expressed genes, their TSSs and TESs. Genes were classified as expressed or non-expressed based on their RNA-seq expression levels in the H1-ESC (Fig. 4c) and GM12878 (Extended Data 2b) cell lines. A Gaussian mixture model with 2 components was fit to the expression levels of all genes to obtain thresholds for the two classes.
- (4) Zinc finger genes (obtained by searching the ENSEMBL annotation for genes with gene names starting with ZNF).
- (5) Transcription factor binding sites (TFBS) based on ENCODE ChIP-seq data in the H1-ESC cell line. The uniformly processed TF ChIP-seq peak locations were downloaded from the ENCODE repository: http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeAwgTfbsUniform/. We also computed % TF binding site coverage for state calls in the GM12878 and K562 cell lines using the corresponding TF ChIP-seq data from ENCODE, which matched and supported the mnemonics and state interpretations obtained from the H1 cell line (Fig. S2).
- (6) Conserved GERP elements based on 34-way placental mammalian alignments http://mendel.stanford.edu/SidowLab/downloads/gerp/ (Fig. S3).
- (7) Enrichment for conserved GERP elements, subtracting parts of the above mentioned GERP elements that overlap exons.
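The two-component Gaussian mixture used in item (3) to separate expressed from non-expressed genes can be sketched with a small expectation-maximization (EM) routine. This is an illustration under several assumptions: the text does not state how the expression values were transformed, how the mixture was fit, or how the thresholds were derived from it. Here a gene is called expressed when the higher-mean component is the more likely one, and all names and values are hypothetical.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def fit_two_gaussians(values, n_iter=100):
    """Fit a 1-D mixture of two Gaussians by EM.

    Returns (weights, means, variances); component 0 starts from the lower half.
    """
    xs = sorted(values)
    n, half = len(xs), len(xs) // 2
    means = [sum(xs[:half]) / half, sum(xs[half:]) / (n - half)]
    grand_mean = sum(xs) / n
    start_var = max(sum((x - grand_mean) ** 2 for x in xs) / n, 1e-6)
    variances = [start_var, start_var]
    weights = [0.5, 0.5]
    for _ in range(n_iter):
        # E-step: responsibility of each component for each value.
        resp = []
        for x in xs:
            p = [weights[k] * normal_pdf(x, means[k], variances[k]) for k in (0, 1)]
            total = (p[0] + p[1]) or 1e-300
            resp.append((p[0] / total, p[1] / total))
        # M-step: re-estimate weight, mean and variance of each component.
        for k in (0, 1):
            nk = sum(r[k] for r in resp)
            if nk == 0:
                continue
            weights[k] = nk / n
            means[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, xs)) / nk
            variances[k] = max(var, 1e-6)  # floor avoids a collapsed component
    return weights, means, variances


def is_expressed(x, weights, means, variances):
    """True if x is more likely under the component with the higher mean."""
    hi = 0 if means[0] > means[1] else 1
    lo = 1 - hi
    p_hi = weights[hi] * normal_pdf(x, means[hi], variances[hi])
    p_lo = weights[lo] * normal_pdf(x, means[lo], variances[lo])
    return p_hi > p_lo


# Hypothetical expression levels (e.g. on a log scale) for ten genes.
levels = [0.1, 0.2, 0.3, 0.4, 0.5, 5.1, 5.2, 5.3, 5.4, 5.5]
weights, means, variances = fit_two_gaussians(levels)
```

On these toy values the fitted means settle near 0.3 and 5.3, and a gene with level 4.0 would be classified as expressed while one with level 1.0 would not.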
Comparison to chromatin states learned on individual epigenomes
DATA SOURCE
- Download URL:
https://egg2.wustl.edu/roadmap/data/byFileType/chromhmmSegmentations/ChmmModels/coreMarks/indivModels/default_init/
- One directory for each epigenome ID
We also learned independent 15-state models individually on each of the 127 epigenomes using the core set of 5 marks and the same parameter settings as for the primary model. To compare the individual models to the joint 15-state primary model, we stacked the emission vectors for all states from all the models and hierarchically clustered them using Euclidean distance and Ward linkage (Extended Data 2a). The individual epigenome models consistently and repeatedly identified states that were also recovered by the joint model (Extended Data 2a). Two additional clusters contained states that were recovered by the independent models learned in individual cell types but not by the joint model: HetWk, characterized by weak presence of H3K9me3, and Rpts, characterized by presence of H3K9me3 along with a diversity of other marks and enriched in a large number of repeat elements.
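The clustering of stacked emission vectors can be illustrated with a naive agglomerative procedure that uses Ward's criterion on Euclidean distances. This is a toy sketch, not the code used for Extended Data 2a: it stops at a requested number of clusters instead of building a full dendrogram, its cost grows quickly with the number of states, and the emission values are invented for the example.

```python
def ward_cluster(vectors, n_clusters):
    """Merge clusters greedily by Ward's criterion until n_clusters remain.

    Each step merges the pair whose union gives the smallest increase in
    within-cluster sum of squares: n_a * n_b / (n_a + n_b) * ||c_a - c_b||^2.
    Returns a sorted list of clusters, each a sorted list of row indices.
    """
    clusters = [{"members": [i], "centroid": list(v)} for i, v in enumerate(vectors)]

    def merge_cost(a, b):
        na, nb = len(a["members"]), len(b["members"])
        dist2 = sum((x - y) ** 2 for x, y in zip(a["centroid"], b["centroid"]))
        return na * nb / (na + nb) * dist2

    while len(clusters) > n_clusters:
        pairs = [(i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))]
        i, j = min(pairs, key=lambda p: merge_cost(clusters[p[0]], clusters[p[1]]))
        a, b = clusters[i], clusters[j]
        na, nb = len(a["members"]), len(b["members"])
        merged = {
            "members": sorted(a["members"] + b["members"]),
            "centroid": [(na * x + nb * y) / (na + nb)
                         for x, y in zip(a["centroid"], b["centroid"])],
        }
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return sorted(c["members"] for c in clusters)


# Hypothetical emission probabilities for the 5 core marks; rows are
# states stacked from two separately learned models.
emissions = [
    [0.95, 0.10, 0.02, 0.01, 0.01],  # model A, promoter-like state
    [0.02, 0.05, 0.90, 0.01, 0.01],  # model A, transcription-like state
    [0.92, 0.15, 0.03, 0.02, 0.01],  # model B, promoter-like state
    [0.01, 0.04, 0.85, 0.02, 0.02],  # model B, transcription-like state
]
groups = ward_cluster(emissions, 2)
```

Here the promoter-like states from both models (rows 0 and 2) fall into one cluster and the transcription-like states (rows 1 and 3) into the other, which is the kind of agreement between individual and joint models that the comparison above looks for.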
Expanded 50 chromatin state models using large numbers of histone marks for Class 1 epigenomes
DATA SOURCE
- Download URL:
https://egg2.wustl.edu/roadmap/data/byFileType/chromhmmSegmentations/ChmmModels/class1Models_50states/
- One directory for each epigenome ID
For each of the seven deeply profiled reference epigenomes (Fig. 2j), we independently learned chromatin states on observed data for all available histone modifications or variants, and DNase, in the reference epigenome. The same binarization and model learning procedure was followed as for the core set of 5 marks. We chose to consistently use a larger set of 50 states to capture the additional state distinctions afforded by the additional marks (Fig. S4). Enrichments for annotations, including some of those described above for the 15-state model, were computed using ChromHMM. The Hi-C domains were obtained from Dixon et al. (2012); the lamina-associated domains are described below; conserved element sets were the hg19 lift-over from Lindblad-Toh et al. (2011); and repetitive elements are from RepeatMasker.