Chromatin state learning

In order to capture the significant combinatorial interactions between different chromatin marks in their spatial context (chromatin states) across 127 epigenomes, we used ChromHMM v1.10 (Ernst et al., 2012), which is based on a multivariate Hidden Markov Model.

Core 15-state model (5 marks, 127 epigenomes)

DATA SOURCE

Download URL:
https://egg2.wustl.edu/roadmap/data/byFileType/chromhmmSegmentations/ChmmModels/coreMarks/jointModel/final

Summarized visualization of all 127 epigenomes using epilogos
Emission, transition probabilities and enrichment of states relative to various genomic and functional annotations
MNEMONICS BED FILES ( [Epigenome_id]_15_coreMarks_mnemonics.bed.gz files )

Tab delimited 4 columns
chromosome, start (0-based), stop (1-based), state_label_mnemonic for that region
ARCHIVE of all mnemonics.bed files

BROWSER FRIENDLY FILES ([Epigenome_id]_15_coreMarks_dense.bb)

The dense BIGBED files will allow you to view each epigenome as a single track with regions labeled with state mnemonics and representative colors. You can stream these to UCSC Genome Browser or IGV
ARCHIVE of all the dense BIGBED files
[Epigenome_id]_15_coreMarks_dense.bed.gz (Same as above except in text format)
ARCHIVE of all dense BED files
[Epigenome_id]_15_coreMarks_expanded.bed.gz files: The expanded files will allow you to view each epigenome with each state as a separate track labeled with state mnemonics and representative colors
ARCHIVE of expanded BED files

WE ALSO PROVIDE LIFT-OVER GRCh38 FILES ([Epigenome_id]_15_coreMarks_hg38lift_*)
STATES FOR EACH 200bp BIN:
https://egg2.wustl.edu/roadmap/data/byFileType/chromhmmSegmentations/ChmmModels/coreMarks/jointModel/final/STATEBYLINE/

Max. posterior state label for each 200 bp bin in each chromosome for all epigenomes. These files differ from the Mnemonic BED files in that the Mnemonic files merge contiguous bins with the same state label and assign one label to the entire merged region, whereas these files are at a fixed 200 bp resolution.
ARCHIVE of state-by-line files
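
The relationship between the fixed-resolution state-by-line files and the merged Mnemonic BED files can be illustrated with a minimal Python sketch. It assumes the per-bin labels for one chromosome are already loaded into a list, one label per consecutive 200 bp bin starting at position 0; the function name merge_bins and the in-memory input are illustrative assumptions, not part of the distributed files or of ChromHMM.

```python
from itertools import groupby

BIN_SIZE = 200  # bp, the fixed resolution of the state-by-line files


def merge_bins(chrom, states, bin_size=BIN_SIZE):
    """Collapse per-bin state labels into (chrom, start, end, state) rows.

    `states` holds one label per consecutive bin, starting at position 0.
    Coordinates follow the BED convention: 0-based start, exclusive end.
    """
    intervals = []
    start_bin = 0
    for state, run in groupby(states):
        n_bins = sum(1 for _ in run)  # length of this run of identical labels
        end_bin = start_bin + n_bins
        intervals.append((chrom, start_bin * bin_size, end_bin * bin_size, state))
        start_bin = end_bin
    return intervals


# Hypothetical labels for five consecutive bins on chr1.
example = ["Quies", "Quies", "TssA", "TssAFlnk", "TssAFlnk"]
for row in merge_bins("chr1", example):
    print("\t".join(map(str, row)))
```

Each run of identical labels becomes one tab-delimited row, which is the 4-column layout described for the Mnemonic BED files.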

POSTERIOR PROBABILITY FOR EACH 200bp BIN:
https://egg2.wustl.edu/roadmap/data/byFileType/chromhmmSegmentations/ChmmModels/coreMarks/jointModel/final/POSTERIOR/

Posterior probabilities of each state in each 200 bp bin for all chromosomes in all epigenomes
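
As a rough illustration of how the posterior probabilities relate to the max. posterior state labels, the sketch below picks, for each bin, the state with the highest posterior. It assumes the posteriors are available as a list of rows (one row per bin, one column per state); the function name and input layout are assumptions for illustration and do not describe the on-disk format of the POSTERIOR files.

```python
def max_posterior_states(posteriors):
    """Return the 1-based state number with the highest posterior per bin.

    `posteriors` is a list of rows; row i holds the posterior probability
    of every state in bin i. Ties are resolved in favor of the lowest
    state number.
    """
    labels = []
    for row in posteriors:
        best = max(range(len(row)), key=row.__getitem__)
        labels.append(best + 1)  # state numbers start at 1
    return labels


# Hypothetical posteriors for two bins in a 3-state toy model.
toy = [
    [0.10, 0.70, 0.20],
    [0.90, 0.05, 0.05],
]
print(max_posterior_states(toy))
```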

The states are as follows

STATE NO.	MNEMONIC	DESCRIPTION	COLOR NAME	COLOR CODE
1	TssA	Active TSS	Red	255,0,0
2	TssAFlnk	Flanking Active TSS	Orange Red	255,69,0
3	TxFlnk	Transcr. at gene 5' and 3'	LimeGreen	50,205,50
4	Tx	Strong transcription	Green	0,128,0
5	TxWk	Weak transcription	DarkGreen	0,100,0
6	EnhG	Genic enhancers	GreenYellow	194,225,5
7	Enh	Enhancers	Yellow	255,255,0
8	ZNF/Rpts	ZNF genes & repeats	Medium Aquamarine	102,205,170
9	Het	Heterochromatin	PaleTurquoise	138,145,208
10	TssBiv	Bivalent/Poised TSS	IndianRed	205,92,92
11	BivFlnk	Flanking Bivalent TSS/Enh	DarkSalmon	233,150,122
12	EnhBiv	Bivalent Enhancer	DarkKhaki	189,183,107
13	ReprPC	Repressed PolyComb	Silver	128,128,128
14	ReprPCWk	Weak Repressed PolyComb	Gainsboro	192,192,192
15	Quies	Quiescent/Low	White	255,255,255

A ChromHMM model applicable to all 127 epigenomes was learned by virtually concatenating consolidated data corresponding to the core set of 5 chromatin marks assayed in all epigenomes (H3K4me3, H3K4me1, H3K36me3, H3K27me3, H3K9me3). The model was trained on the 60 epigenomes with the highest-quality data (Fig. 2k), which provided sufficient coverage of the different lineages and tissue types (Table S1 - Sheet QCSummary).

The ChromHMM parameters used were as follows. Reads were shifted in the 5' to 3' direction by 100 bp. For each consolidated ChIP-seq dataset, read counts were computed in non-overlapping 200 bp bins across the entire genome. Each bin was discretized into two levels, 1 indicating enrichment and 0 indicating no enrichment. The binarization was performed by comparing ChIP-seq read counts to the corresponding whole-cell extract control read counts within each bin, using a Poisson p-value threshold of 1e-4 (the default discretization threshold in ChromHMM).

We trained several models in parallel mode with the number of states ranging from 10 to 25. We chose a 15-state model (Fig. 4a-f, Extended Data 2b) for all further analyses because it captured all the key interactions between the chromatin marks, and because larger numbers of states did not capture sufficiently distinct interactions. The trained model was then used to compute the posterior probability of each state for each genomic bin in each reference epigenome. The regions were labeled using the state with the maximum posterior probability.
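
The read shifting, binning and Poisson-based binarization described above can be sketched in a few lines of Python. This is a simplified illustration, not ChromHMM's actual implementation: the way the expected count is derived from the control (a scale factor times the control count plus a pseudocount) and all function and parameter names are assumptions made for the example.

```python
import math


def bin_index(five_prime_pos, strand, shift=100, bin_size=200):
    """Shift a read's 5' end by `shift` bp in the 5' to 3' direction, then bin it."""
    shifted = five_prime_pos + shift if strand == "+" else five_prime_pos - shift
    return max(shifted, 0) // bin_size


def poisson_sf(k, lam):
    """Upper tail P(X >= k) for X ~ Poisson(lam), computed as 1 - P(X <= k - 1)."""
    if k <= 0:
        return 1.0
    term = math.exp(-lam)  # P(X = 0)
    cdf = term
    for i in range(1, k):
        term *= lam / i    # P(X = i) from P(X = i - 1)
        cdf += term
    return max(0.0, 1.0 - cdf)


def binarize(chip_counts, control_counts, scale=1.0, pseudocount=1.0, threshold=1e-4):
    """Call a bin 1 (enriched) when the ChIP count is unlikely under the control.

    The expected count scale * (control + pseudocount) is an assumption of
    this sketch; ChromHMM derives its own expectation from the control data.
    """
    calls = []
    for chip, ctrl in zip(chip_counts, control_counts):
        expected = scale * (ctrl + pseudocount)
        calls.append(1 if poisson_sf(chip, expected) < threshold else 0)
    return calls


# Hypothetical per-bin counts for one mark and its whole-cell extract control.
print(binarize([0, 3, 20, 40], [1, 2, 2, 10]))
```

The direct summation in poisson_sf is adequate for a 1e-4 threshold and modest expected counts; for very large expected counts a library routine would be a safer choice.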

Expanded 18-state model (6 marks, 98 epigenomes)

DATA SOURCE

Download URL:
https://egg2.wustl.edu/roadmap/data/byFileType/chromhmmSegmentations/ChmmModels/core_K27ac/jointModel/final/

Summarized visualization of 98 epigenomes using epilogos
Emission, transition probabilities and enrichment of states relative to various genomic and functional annotations
MNEMONICS BED FILES ( [Epigenome_id]_18_core_K27ac_mnemonics.bed.gz files )

Tab delimited 4 columns
chromosome, start (0-based), stop (1-based), state_label_mnemonic for that region
ARCHIVE of all mnemonics.bed files

BROWSER FRIENDLY FILES ([Epigenome_id]_18_core_K27ac_dense.bb)

The dense BIGBED files will allow you to view each epigenome as a single track with regions labeled with state mnemonics and representative colors. You can stream these to UCSC Genome Browser or IGV
ARCHIVE of all the dense BIGBED files
[Epigenome_id]_18_core_K27ac_dense.bed.gz (Same as above except in text format)
ARCHIVE of all dense BED files
[Epigenome_id]_18_core_K27ac_expanded.bed.gz files: The expanded files will allow you to view each epigenome with each state as a separate track labeled with state mnemonics and representative colors
ARCHIVE of expanded BED files

WE ALSO PROVIDE LIFT-OVER GRCh38 FILES ([Epigenome_id]_18_core_K27ac_hg38lift_*)
STATES FOR EACH 200bp BIN:
https://egg2.wustl.edu/roadmap/data/byFileType/chromhmmSegmentations/ChmmModels/core_K27ac/jointModel/final/STATEBYLINE/

Max. posterior state label for each 200 bp bin in each chromosome for all epigenomes. These files differ from the Mnemonic BED files in that the Mnemonic files merge contiguous bins with the same state label and assign one label to the entire merged region, whereas these files are at a fixed 200 bp resolution.
ARCHIVE of state-by-line files

POSTERIOR PROBABILITY FOR EACH 200bp BIN:
https://egg2.wustl.edu/roadmap/data/byFileType/chromhmmSegmentations/ChmmModels/core_K27ac/jointModel/final/POSTERIOR/

Posterior probabilities of each state in each 200 bp bin for all chromosomes in all epigenomes

The states are as follows

STATE NO.	MNEMONIC	DESCRIPTION	COLOR NAME	COLOR CODE
1	TssA	Active TSS	Red	255,0,0
2	TssFlnk	Flanking TSS	Orange Red	255,69,0
3	TssFlnkU	Flanking TSS Upstream	Orange Red	255,69,0
4	TssFlnkD	Flanking TSS Downstream	Orange Red	255,69,0
5	Tx	Strong transcription	Green	0,128,0
6	TxWk	Weak transcription	DarkGreen	0,100,0
7	EnhG1	Genic enhancer1	GreenYellow	194,225,5
8	EnhG2	Genic enhancer2	GreenYellow	194,225,5
9	EnhA1	Active Enhancer 1	Orange	255,195,77
10	EnhA2	Active Enhancer 2	Orange	255,195,77
11	EnhWk	Weak Enhancer	Yellow	255,255,0
12	ZNF/Rpts	ZNF genes & repeats	Medium Aquamarine	102,205,170
13	Het	Heterochromatin	PaleTurquoise	138,145,208
14	TssBiv	Bivalent/Poised TSS	IndianRed	205,92,92
15	EnhBiv	Bivalent Enhancer	DarkKhaki	189,183,107
16	ReprPC	Repressed PolyComb	Silver	128,128,128
17	ReprPCWk	Weak Repressed PolyComb	Gainsboro	192,192,192
18	Quies	Quiescent/Low	White	255,255,255

A second "expanded" model, applicable to the 98 epigenomes that also have an H3K27ac ChIP-seq dataset, was learned by virtually concatenating consolidated data corresponding to the core set of 5 chromatin marks and H3K27ac. The model was trained on 40 high-quality epigenomes using the same parameters as those used for the primary model (Table S1 - Sheet QCSummary). We trained several models with the number of states ranging from 15 to 25. An 18-state model was used for further analyses (Extended Data 2c) based on similar considerations.

State labels, interpretation and mnemonics

In order to assign biologically meaningful mnemonics to the states, we used the ChromHMM package to compute the overlap and neighborhood enrichments of each state relative to various types of functional annotations (Fig. 4b-c,f, Extended Data 2b,c, Fig. S2).

For any set of genomic coordinates representing a genomic feature and a given state, the fold enrichment of overlap is calculated as the ratio of "the joint probability of a region belonging to the state and the feature" to the product of "the independent marginal probability of observing the state in the genome" and "the probability of observing the feature", namely the ratio between (#bases in state AND overlapping feature)/(#bases in genome) and [(#bases overlapping feature)/(#bases in genome) X (#bases in state)/(#bases in genome)]. The neighborhood enrichment is computed for genomic bins around a set of single-base-pair anchor locations in the genome, e.g. transcription start sites.
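
The overlap fold enrichment defined above can be written as a small Python function. This is a sketch that takes base counts as plain numbers; the function and argument names are illustrative and are not ChromHMM's API.

```python
def fold_enrichment(state_bases, feature_bases, overlap_bases, genome_bases):
    """Observed joint fraction divided by the product of the marginal fractions."""
    observed = overlap_bases / genome_bases
    expected = (state_bases / genome_bases) * (feature_bases / genome_bases)
    return observed / expected


# Hypothetical counts: a state covering 10% of a toy genome, a feature
# covering 5%, and 2% of the genome in both. This gives roughly a 4-fold
# enrichment over what independence would predict.
print(fold_enrichment(state_bases=100, feature_bases=50,
                      overlap_bases=20, genome_bases=1000))
```

A value near 1 means the state and the feature overlap about as often as expected if they were independent.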

For the overlap enrichment plots in the figures, the enrichments for each genomic feature (column) across all states are normalized by subtracting the minimum value of the column and then dividing by the maximum of the column. The values therefore always range from 0 (white) to 1 (dark blue), i.e. it is a column-wise relative scale. For the neighborhood positional enrichment plots, the normalization is done across all columns, i.e. the minimum value over the entire matrix is subtracted from each value and the result is divided by the maximum over the entire matrix.
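
The two normalization schemes can be sketched as follows. The sketch assumes that "dividing by the max" refers to the maximum after the minimum has been subtracted, which is the reading consistent with values ranging from 0 to 1; the function names and the nested-list matrix layout (rows are states, columns are features or positions) are illustrative assumptions.

```python
def normalize_columns(matrix):
    """Rescale each column to [0, 1] independently (overlap enrichment plots)."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    out = [[0.0] * n_cols for _ in range(n_rows)]
    for j in range(n_cols):
        column = [matrix[i][j] for i in range(n_rows)]
        low = min(column)
        span = max(column) - low
        for i in range(n_rows):
            # A constant column has no spread; map it to 0.
            out[i][j] = (column[i] - low) / span if span else 0.0
    return out


def normalize_matrix(matrix):
    """Rescale the whole matrix to [0, 1] at once (neighborhood plots)."""
    flat = [value for row in matrix for value in row]
    low = min(flat)
    span = max(flat) - low
    return [[(value - low) / span if span else 0.0 for value in row]
            for row in matrix]


# Hypothetical enrichment values: 3 states by 2 features.
toy = [[1.0, 10.0], [3.0, 20.0], [5.0, 40.0]]
print(normalize_columns(toy))
print(normalize_matrix(toy))
```

Column-wise scaling makes states comparable within one feature but not across features, whereas the matrix-wide scaling preserves relative magnitudes across all positions.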

The functional annotations used were as follows (All coordinates were relative to the hg19 version of the human genome):
Annotation files for overlap enrichments: https://egg2.wustl.edu/roadmap/src/chromHMM/bin/COORDS/hg19/
Annotation files for neighborhood enrichments: https://egg2.wustl.edu/roadmap/src/chromHMM/bin/ANCHORFILES/hg19/

(1) CpG islands obtained from the UCSC table browser
(2) Exons, genes, introns, transcription start sites (TSSs) and transcription end sites (TESs), 2Kb windows around TSSs and 2Kb windows around TESs based on the GENCODEv10 annotation (http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeGencodeV10/) restricted to GENCODE biotypes annotating long transcripts.
(3) Expressed and non-expressed genes, their TSSs and TESs. Genes were classified as expressed or non-expressed based on their RNA-seq expression levels in the H1-ESC (Fig. 4c) and GM12878 (Extended Data 2b) cell lines. A Gaussian mixture model with 2 components was fit to the expression levels of all genes to obtain thresholds for the two classes.
(4) Zinc finger genes (obtained by searching the ENSEMBL annotation for genes with gene names starting with ZNF).
(5) Transcription factor binding sites (TFBS) based on ENCODE ChIP-seq data in the H1-ESC cell line. The uniformly processed TF ChIP-seq peak locations were downloaded from the ENCODE repository: http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeAwgTfbsUniform/. We also computed the % TF binding site coverage for state calls in the GM12878 and K562 cell lines using the corresponding TF ChIP-seq data from ENCODE, which matched and supported the mnemonics and state interpretations obtained from the H1 cell line (Fig. S2).
(6) Conserved GERP elements based on 34-way placental mammalian alignments: http://mendel.stanford.edu/SidowLab/downloads/gerp/ (Fig. S3).
(7) Enrichment for conserved GERP elements, subtracting the parts of the above-mentioned GERP elements that overlap exons.
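
To illustrate the expressed versus non-expressed classification in item (3), the sketch below fits a 2-component one-dimensional Gaussian mixture by expectation-maximization and assigns a gene to the higher-mean component when that component's posterior exceeds 0.5. The initialization, the iteration count, the 0.5 cutoff and all function names are assumptions of this sketch; the source does not specify how the mixture was fit or how the thresholds were derived from it.

```python
import math


def _normal_pdf(x, mean, sd):
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))


def fit_two_gaussians(values, n_iter=200, min_sd=1e-3):
    """Fit a 2-component 1D Gaussian mixture by EM (needs at least 2 values).

    Returns (weights, means, sds); component 0 is initialized on the lower
    half of the sorted values and component 1 on the upper half.
    """
    xs = sorted(values)
    n = len(xs)
    half = n // 2
    means = [sum(xs[:half]) / half, sum(xs[half:]) / (n - half)]
    grand = sum(xs) / n
    sd0 = max(math.sqrt(sum((x - grand) ** 2 for x in xs) / n), min_sd)
    sds = [sd0, sd0]
    weights = [0.5, 0.5]
    for _ in range(n_iter):
        # E-step: posterior responsibility of each component for each value.
        resp = []
        for x in xs:
            dens = [w * _normal_pdf(x, m, s) for w, m, s in zip(weights, means, sds)]
            total = sum(dens)
            resp.append([d / total for d in dens] if total > 0 else [0.5, 0.5])
        # M-step: re-estimate weight, mean and standard deviation.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            if nk == 0:
                continue
            weights[k] = nk / n
            means[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, xs)) / nk
            sds[k] = max(math.sqrt(var), min_sd)
    return weights, means, sds


def is_expressed(x, weights, means, sds):
    """True when the higher-mean component has posterior > 0.5 at x."""
    high = 0 if means[0] > means[1] else 1
    dens = [w * _normal_pdf(x, m, s) for w, m, s in zip(weights, means, sds)]
    total = sum(dens)
    return total > 0 and dens[high] / total > 0.5


# Hypothetical log-scale expression levels with two well-separated groups.
levels = [-0.2, -0.1, 0.0, 0.1, 0.2, 4.8, 4.9, 5.0, 5.1, 5.2]
w, m, s = fit_two_gaussians(levels)
print([is_expressed(x, w, m, s) for x in (0.3, 4.5)])
```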

Comparison to chromatin states learned on individual epigenomes

DATA SOURCE

Download URL:
https://egg2.wustl.edu/roadmap/data/byFileType/chromhmmSegmentations/ChmmModels/coreMarks/indivModels/default_init/

One directory for each epigenome ID

We also learned independent 15-state models individually on each of the 127 epigenomes using the core set of 5 marks and the same parameter settings as for the primary model. To compare the individual models to the joint 15-state primary model, we stacked the emission vectors for all states from all the models and hierarchically clustered them using Euclidean distance and Ward linkage (Extended Data 2a). The individual epigenome models consistently and repeatedly identified states that were also recovered by the joint model (Extended Data 2a). Two additional clusters contained states that were recovered by the independent models learned in individual cell types but not by the joint model: HetWk, characterized by a weak presence of H3K9me3, and Rpts, characterized by the presence of H3K9me3 along with a diversity of other marks, which was enriched in a large number of repeat elements.
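
The stacking and clustering of emission vectors can be sketched with SciPy's hierarchical clustering routines. The emission matrices below are made-up toy values for two hypothetical 3-state models over 5 marks, and cutting the tree into 3 clusters is an arbitrary choice for the example; neither reflects the actual models or the analysis in Extended Data 2a.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Hypothetical emission probabilities: rows are states, columns are the
# 5 core marks. Two toy "individual" models with similar states.
model_a = np.array([
    [0.90, 0.10, 0.00, 0.00, 0.00],  # promoter-like
    [0.00, 0.00, 0.80, 0.00, 0.00],  # transcription-like
    [0.00, 0.00, 0.00, 0.00, 0.00],  # quiescent-like
])
model_b = np.array([
    [0.85, 0.15, 0.00, 0.00, 0.00],
    [0.00, 0.00, 0.75, 0.00, 0.00],
    [0.00, 0.00, 0.00, 0.01, 0.00],
])

# Stack the emission vectors of all states from all models.
stacked = np.vstack([model_a, model_b])

# Hierarchical clustering with Euclidean distance and Ward linkage.
tree = linkage(stacked, method="ward", metric="euclidean")

# Cut the tree into 3 clusters and report the cluster of each state.
clusters = fcluster(tree, t=3, criterion="maxclust")
print(clusters)
```

States from different models that land in the same cluster have similar emission profiles, which is the sense in which individual models "recover" a joint-model state.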

Expanded 50 chromatin state models using large numbers of histone marks for Class 1 epigenomes

DATA SOURCE

Download URL:
https://egg2.wustl.edu/roadmap/data/byFileType/chromhmmSegmentations/ChmmModels/class1Models_50states/

One directory for each epigenome ID

For each of the seven deeply-profiled reference epigenomes (Fig. 2j) we independently learned chromatin states on observed data for all available histone modifications or variants, and DNase in the reference epigenome. The same binarization and model learning procedure was followed as for the core set of 5 marks. We chose to consistently focus on a larger set of 50 states to capture the additional state distinctions afforded by using additional marks (Fig. S4). Enrichments for annotations, including some of those described above for the 15-state model, were computed using ChromHMM. The HiC domains were obtained from Dixon et al. (2012); the lamina-associated domains are described below; the conserved element sets were the hg19 lift-over from Lindblad-Toh et al. (2011); and the repetitive elements are from RepeatMasker.

Chromatin state model based on imputed data (25 state, 12 marks, 127 epigenomes)