Primary data processing and quality control
All genome-wide maps of histone modifications, DNA accessibility, DNA methylation and RNA expression are freely available online. Raw sequencing data deposited at the Short Read Archive or dbGAP is linked from http://www.ncbi.nlm.nih.gov/geo/roadmap/epigenomics. All primary processed data (including mapped reads) for profiling experiments are contained within Release 9 of the Human Epigenome Atlas (http://www.epigenomeatlas.org). Complete metadata associated with each dataset in this collection is archived at GEO and describes samples, assays, data processing details and quality metrics collected for each profiling experiment.
Release 9 of the compendium contains uniformly pre-processed and mapped data from multiple profiling experiments (technical and biological replicates from multiple individuals and/or datasets from multiple centers). In order to reduce redundancy, improve data quality and achieve uniformity required for our integrative analyses, experiments were subjected to additional processing to obtain comprehensive data for 111 consolidated epigenomes (See below for additional details). Numeric epigenome identifiers EIDs (e.g. E001) and mnemonics for epigenome names were assigned for each of the consolidated epigenomes. The metadata section summarizes the mapping of the individual Release 9 samples to the consolidated epigenome IDs. Key metadata such as age, sex, anatomy, epigenome class, ethnicity and solid/liquid status were summarized for the consolidated epigenomes. Datasets corresponding to 16 cell-lines from the ENCODE project (with epigenome IDs ranging from E114-E129) were also used in the integrative analyses (ENCODE Project Consortium (2012)). All datasets from the 127 consolidated epigenomes were subjected to processing filters to ensure uniformity in terms of read length based mappability and sequencing depth as described below.
Each of the 127 epigenomes included consolidated ChIP-seq datasets for a core set of histone modifications - H3K4me1, H3K4me3, H3K27me3, H3K36me3, H3K9me3 as well as a corresponding whole-cell extract sequenced control. 98 epigenomes and 62 epigenomes had consolidated H3K27ac and H3K9ac histone ChIP-seq datasets respectively. A smaller subset of epigenomes had ChIP-seq datasets for additional histone marks, giving a total of 1319 consolidated datasets (Table S1, QCSummary sheet). 53 epigenomes had DNA accessibility (DNase-seq) datasets. 56 epigenomes had mRNA-seq gene expression data. For the 127 consolidated epigenomes, a total of 104 DNA methylation datasets across 95 epigenomes involved either bisulfite treatment (WGBS or RRBS assays) or a combination of MeDIP-seq and MRE-seq assays. In addition to the 1936 datasets analyzed here across 111 reference epigenomes, the NIH Roadmap Epigenomics Project has generated an additional 869 genome-wide datasets, linked from GEO, the Human Epigenome Atlas, and NCBI, and also publicly and freely available.
ChIP-seq and DNase-seq uniform reprocessing for consolidated epigenomes
a. Read mapping
DATA SOURCE
Data format: BED/TagAlign
- Consolidated Epigenomes: 36 bp mappability filtered, pooled and subsampled read alignment files:
https://egg2.wustl.edu/roadmap/data/byFileType/alignments/consolidated/
or
- Unconsolidated Epigenomes (Uniform mappability): 36 bp mappability filtered primary alignment files:
https://egg2.wustl.edu/roadmap/data/byFileType/alignments/unconsolidated/
or
- Unconsolidated Epigenomes (Non-uniform mappability): Unfiltered raw primary alignment files: (Not recommended)
http://genboree.org/EdaccData/Release-9/
Sequenced datasets from Release 9 of the Epigenome Atlas involved mapping a total of 150.21 billion sequencing reads onto the hg19 assembly of the human genome using the Pash 3.0 read mapper. These read mappings were used (except for RNA-seq datasets, which were mapped as described in the RNA-seq section) for constructing the 111 consolidated epigenomes. Only uniquely mapping reads were retained and duplicates were filtered out. BED files containing the mapped reads were obtained from http://genboree.org/EdaccData/Release-9/ . Alignment parameters for each assay type and experiment are specified in the associated publicly accessible Release 9 metadata archived at GEO. For the ENCODE datasets, BAM files containing mapped reads were downloaded from http://hgdownload.cse.ucsc.edu/goldenPath/hg19/encodeDCC/. Only uniquely mapping reads were retained and multiply mapping reads were discarded.
b. Mappability filtering, pooling and subsampling
The raw Release 9 read alignment files contain reads that are pre-extended to 200 bp. However, there were significant differences in the original read lengths across the Release 9 raw datasets reflecting differences between centers and changes of sequencing technology during the course of the project (36 bp, 50 bp, 76 bp and 100 bp). To avoid artificial differences due to mappability, for each consolidated dataset, the raw mapped reads were uniformly truncated to 36 bp and then refiltered using a 36 bp custom mappability track to only retain reads that map to positions (taking strand into account) at which the corresponding 36-mers starting at those positions are unique (no mismatches) in the genome. Unconsolidated filtered alignment files are available at https://egg2.wustl.edu/roadmap/data/byFileType/alignments/unconsolidated/. These unconsolidated filtered datasets were then used for basic processing steps i.e. peak calling and signal tracks.
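The truncation and mappability filter described above can be sketched as follows. This is a minimal illustration, not the project's pipeline: the `Read` type, the function names, and the representation of the mappability track as a set of `(chrom, start, strand)` keys are all assumptions made for the example. In particular, keying minus-strand reads by the leftmost coordinate of the truncated interval is a convention chosen here, not one stated in the text.

```python
from typing import Iterable, Iterator, NamedTuple, Set, Tuple

READ_LEN = 36  # uniform truncation length given in the text


class Read(NamedTuple):
    chrom: str
    start: int   # 0-based, inclusive (BED convention)
    end: int     # exclusive
    strand: str  # "+" or "-"


def truncate(read: Read, length: int = READ_LEN) -> Read:
    """Shrink a (pre-extended) read to `length` bp, keeping its 5' end fixed."""
    if read.strand == "+":
        return read._replace(end=read.start + length)
    return read._replace(start=read.end - length)


def mappability_filter(
    reads: Iterable[Read],
    unique_positions: Set[Tuple[str, int, str]],
) -> Iterator[Read]:
    """Yield truncated reads whose strand-aware position is uniquely mappable.

    `unique_positions` is a hypothetical stand-in for the custom 36 bp
    mappability track.
    """
    for read in reads:
        short = truncate(read)
        if (short.chrom, short.start, short.strand) in unique_positions:
            yield short
```

Truncating before the lookup matters: reads that were originally 36, 50, 76 or 100 bp are all judged against the same 36-mer uniqueness criterion.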
Filtered datasets were then merged appropriately (technical/biological replicates) to obtain a single consolidated sample for every histone mark or DNase-seq in each standardized epigenome. The metadata spreadsheet (Pool FileName columns in Consolidated_EpigenomeIDs_summary_Table sheet) summarizes the mapping of the individual Release 9 primary data sample files to the consolidated data files corresponding to the 127 consolidated epigenomes.
To avoid artificial differences in signal strength due to differences in sequencing depth, all consolidated histone mark datasets (except the additional histone marks in the 7 deeply profiled epigenomes) were uniformly subsampled to a maximum depth of 30 million reads (the median read depth over all consolidated samples). For the 7 deeply profiled reference epigenomes, histone mark datasets were subsampled to a maximum of 45 million reads (median depth). The consolidated DNase-seq datasets were subsampled to a maximum depth of 50 million reads (median depth). These uniformly subsampled datasets were then used for all further processing steps (peak calling, signal coverage tracks, chromatin states).
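A depth cap of this kind can be sketched as below. The text does not say how reads were chosen, so uniform random sampling without replacement is an assumption, and the `MAX_DEPTH` keys and the `subsample` function are names invented for this example.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")

# Depth caps quoted in the text; the dictionary keys are illustrative labels.
MAX_DEPTH = {
    "histone": 30_000_000,
    "histone_deeply_profiled": 45_000_000,
    "dnase": 50_000_000,
}


def subsample(reads: Sequence[T], max_depth: int, seed: int = 0) -> List[T]:
    """Return at most `max_depth` reads, preserving their original order.

    Datasets already at or below the cap are returned unchanged.
    """
    if len(reads) <= max_depth:
        return list(reads)
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    keep = sorted(rng.sample(range(len(reads)), max_depth))
    return [reads[i] for i in keep]
```

Because the cap is a maximum rather than a target, shallow datasets stay shallow; that is why the text separately flags consolidated datasets below 10M reads during quality control.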
Consolidated filtered, pooled and subsampled alignment files are available at https://egg2.wustl.edu/roadmap/data/byFileType/alignments/consolidated/. These consolidated datasets were then used for further processing steps (peak calling, signal tracks, chromatin states).
c. Peak Calling
DATA SOURCE
For consolidated epigenomes (EIDs)
- Narrow contiguous regions of enrichment (peaks) for histone ChIP-seq and DNase-seq
- Data format: NarrowPeak
- https://egg2.wustl.edu/roadmap/data/byFileType/peaks/consolidated/narrowPeak/ or
- Broad domains of enrichment for histone ChIP-seq and DNase-seq
- Data format: BroadPeak https://egg2.wustl.edu/roadmap/data/byFileType/peaks/consolidated/broadPeak/
- Data format: GappedPeak (subset of domains containing at least one narrow peak) https://egg2.wustl.edu/roadmap/data/byFileType/peaks/consolidated/gappedPeak/
- We recommend using gappedPeaks rather than broadPeaks
For unconsolidated epigenomes
- Narrow contiguous regions of enrichment (peaks) for histone ChIP-seq and DNase-seq
- Data format: NarrowPeak https://egg2.wustl.edu/roadmap/data/byFileType/peaks/unconsolidated/narrowPeak/
- Broad domains of enrichment for histone ChIP-seq
- Data format: BroadPeak https://egg2.wustl.edu/roadmap/data/byFileType/peaks/unconsolidated/broadPeak/
- Data format: GappedPeak (subset of domains containing at least one narrow peak) https://egg2.wustl.edu/roadmap/data/byFileType/peaks/unconsolidated/gappedPeak/
- We recommend using gappedPeaks rather than broadPeaks
For the histone ChIP-seq data, the MACSv2.0.10 peak caller was used to compare ChIP-seq signal to a corresponding whole cell extract (WCE) sequenced control to identify narrow regions of enrichment that pass a Poisson p-value threshold of 0.01 and broad domains that pass an additional broad-peak Poisson p-value threshold of 0.1 (https://github.com/taoliu/MACS/). Fragment lengths for each dataset were pre-estimated using strand cross-correlation analysis and the SPP peak caller package (https://code.google.com/p/phantompeakqualtools/ ), and these fragment length estimates were explicitly used as parameters in the MACS2 program (--shift-size=fragment_length/2).
For DNase-seq data, we used two methods to identify DNaseI-accessible sites. First, the Hotspot algorithm was used to identify fixed-size (150bp) DNase hypersensitive sites, and more general-sized regions of DNA accessibility (hotspots) using an FDR of 0.01 (http://www.uwencode.org/proj/hotspot) (John et al. (2011)). MACSv2.0.10 was also used to call narrow peaks using the same settings specified above for the histone mark narrow peak calling.
Narrow peaks and broad domains were also generated for the unconsolidated, 36 bp mappability filtered histone mark ChIP-seq and DNase-seq Release 9 datasets using MACSv2.0.10 with the same settings as specified above.
d. Genome-wide signal coverage tracks
DATA SOURCE
For consolidated epigenomes
Data format: BIGWIG
- -log10(p-value) signal tracks:
https://egg2.wustl.edu/roadmap/data/byFileType/signal/consolidated/macs2signal/pval/
or
- Fold-enrichment signal tracks:
https://egg2.wustl.edu/roadmap/data/byFileType/signal/consolidated/macs2signal/foldChange/
For unconsolidated epigenomes
Data format: BIGWIG
- -log10(p-value) signal tracks:
https://egg2.wustl.edu/roadmap/data/byFileType/signal/unconsolidated/pval/
or
- Fold-enrichment signal tracks:
https://egg2.wustl.edu/roadmap/data/byFileType/signal/unconsolidated/foldChange/
We used the signal processing engine of the MACSv2.0.10 peak caller to generate genome-wide signal coverage tracks (https://github.com/taoliu/MACS/). Whole cell extract was used as a control for signal normalization for the histone ChIP-seq coverage. Each DNase-seq dataset was normalized using simulated background datasets generated by uniformly distributing an equivalent number of reads across the mappable genome. We generated 2 types of tracks that use different statistics based on a Poisson background model to represent per-base signal scores. Briefly, reads are extended in the 5’ to 3’ direction by the estimated fragment length. At each base, the observed counts of ChIP-seq/DNaseI-seq extended reads overlapping the base are compared to corresponding dynamic expected background counts (λ_local) estimated from the control dataset. λ_local is defined as max(λ_BG, λ_1K, λ_5K, λ_10K), where λ_BG is the expected counts per base assuming a uniform distribution of control reads across all mappable bases in the genome, and λ_1K, λ_5K, λ_10K are expected counts estimated from the 1 kb, 5 kb and 10 kb windows centered at the base. λ_local is adjusted for the ratio of the sequencing depth of the ChIP-seq/DNase-seq dataset relative to the control dataset. The two types of signal score statistics computed per base are as follows.
- (1) Fold-enrichment ratio of ChIP-seq or DNase counts relative to expected background counts λ_local. These scores provide a direct measure of the effect size of enrichment at any base in the genome.
- (2) Negative log10 of the Poisson p-value of ChIP-seq or DNase counts relative to expected background counts λ_local. These signal confidence scores provide a measure of statistical significance of the observed enrichment.
NOTE: The -log10(p-value) scores provide a convenient way to threshold signal (e.g. 2 corresponds to a p-value threshold of 1e-2), similar to what is used in identifying enriched regions (peak calling). We recommend using the signal confidence score tracks for visualization. A universal threshold of 2 provides good separation between signal and noise. Both types of signal tracks were also generated for the unconsolidated datasets using the same parameter settings described above.
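The two per-base scores can be illustrated with a small standard-library sketch. It is not the MACS2 implementation: the function names are invented for the example, the depth adjustment is assumed to be a simple multiplication by the ChIP-to-control depth ratio, and the p-value is taken as the upper tail P(X >= observed), which may differ from the exact tail convention MACS2 uses.

```python
import math


def lambda_local(lam_bg: float, lam_1k: float, lam_5k: float,
                 lam_10k: float, depth_ratio: float = 1.0) -> float:
    """Dynamic expected background: the largest of the four estimates,
    scaled by the (assumed multiplicative) ChIP/control depth ratio."""
    return max(lam_bg, lam_1k, lam_5k, lam_10k) * depth_ratio


def fold_enrichment(observed: int, lam: float) -> float:
    """Effect size: observed counts relative to expected background."""
    return observed / lam


def poisson_upper_tail(k: int, lam: float) -> float:
    """P(X >= k) for X ~ Poisson(lam), summed directly so that very
    small tail probabilities are not lost to cancellation."""
    if k <= 0:
        return 1.0
    term = math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))
    total = 0.0
    i = k
    while term > 0.0 and term > total * 1e-17:
        total += term
        i += 1
        term *= lam / i
    return min(total, 1.0)


def neg_log10_pvalue(observed: int, lam: float) -> float:
    """Signal confidence score: -log10 of the Poisson p-value."""
    p = poisson_upper_tail(observed, lam)
    return -math.log10(p) if p > 0.0 else float("inf")
```

With an expected background of 2 reads, a base covered by 10 extended reads scores well above the threshold of 2 recommended in the note, while a base covered by 2 reads does not. Taking the maximum over the four background estimates is the conservative choice: a locally elevated control signal raises the bar for calling enrichment at that base.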
e. Quality Control
For the primary Release 9 datasets, data quality enrichment scores were computed as the fraction of the uniquely mapped reads overlapping with areas of enrichment. Several methods were employed to select signal enrichment regions. The SPOT quality score was computed based on regions identified with the HotSpot peak caller (John et al. (2011)); the FindPeaks quality score was inferred based on peak calls made using the FindPeaks (Fejes et al. (2008)) software; finally, a Poisson metric was derived by modeling the read distribution in genome-tiling 1000 basepair windows with a Poisson distribution and selecting as enriched regions windows with p < 0.05. All the quality scores in Release 9 are in agreement, with strong pairwise correlation (Pearson correlation > 0.9).
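As an illustration of the Poisson metric, the sketch below tiles a toy genome into 1000 bp windows, marks windows whose read count is significant at p < 0.05 under a Poisson model, and reports the fraction of reads falling in those windows. The function names are invented here, the Poisson mean is assumed to be the genome-wide average reads per window, and reads are assigned to windows by start coordinate; the text does not specify these details.

```python
import math
from collections import Counter
from typing import Sequence

WINDOW = 1000    # window size in bp, from the text
P_CUTOFF = 0.05  # significance cutoff, from the text


def poisson_sf(k: int, lam: float) -> float:
    """P(X >= k) for X ~ Poisson(lam). Adequate for moderate lam;
    exp(-lam) underflows for very large means."""
    term = math.exp(-lam)
    below_k = 0.0
    for i in range(k):
        below_k += term
        term *= lam / (i + 1)
    return max(0.0, 1.0 - below_k)


def poisson_quality_score(read_starts: Sequence[int], genome_length: int) -> float:
    """Fraction of reads that fall in windows enriched under the Poisson model."""
    if not read_starts:
        return 0.0
    n_windows = math.ceil(genome_length / WINDOW)
    counts = Counter(pos // WINDOW for pos in read_starts)
    lam = len(read_starts) / n_windows  # assumed background: mean reads per window
    in_enriched = sum(c for c in counts.values() if poisson_sf(c, lam) < P_CUTOFF)
    return in_enriched / len(read_starts)
```

The SPOT and FindPeaks scores follow the same "fraction of reads in enriched regions" idea, with the enriched regions supplied by the respective peak callers instead of by tiled windows.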
Concordance between centers was confirmed and the data analysis pipeline was validated at the outset of the project using datasets for the H1 cell line. The same pipeline was subsequently used to produce Release 9 data. ChIP-seq data for 6 histone modifications (H3K4me3, H3K27me3, H3K9ac, H3K9me3, H3K36me3, and H3K4me1) were independently generated for the H1 cell line by three REMCs (Broad, UCSD, UCSF-UBC). To quantify concordance, the reads from each experiment were mapped (Level 1 data), read density tracks (Level 2 data) were generated using the EDACC’s primary data processing pipeline, and finally Pearson correlation coefficients were computed between each pair of experiments, as well as between experiments and H1 input acting as a control for background correlation between signals (Table S2). The methylome processing pipeline was characterized experimentally on four independent samples (Kunde-Ramamoorthy et al. (2014), Harris et al. (2010)).
For the uniformly reprocessed and consolidated ChIP-seq and DNase-seq datasets, strand cross-correlation measures were used to estimate signal-to-noise ratios (https://code.google.com/p/phantompeakqualtools/) (Landt et al. (2012)). Datasets for each mark were rank ordered based on the normalized strand cross-correlation coefficient (NSC) and flagged if the scores were significantly below the median value or in the range of NSC values for WCE extract controls. Consolidated datasets with extremely low sequencing depth (< 10M reads) were also flagged. Each standardized epigenome was then manually assigned a subjective quality flag of 1 (high), 0 (medium) or -1 (low), based on the number of flagged datasets it contained. The SPOT, FindPeaks and Poisson quality scores were also recomputed for the consolidated datasets. We observed high correlations of the NSC scores with the SPOT (Pearson correlation of 0.7) and FindPeaks scores (Pearson correlation of 0.65). All QC measures are provided in Table S1 (Sheets QCSummary and AdditionalQCScores).
To identify potential antibody cross-reactivity or mislabeling issues, a pairwise correlation heatmap (Extended Data 1e) was computed across all consolidated datasets for H3K4me1, H3K4me3, H3K36me3, H3K27me3, H3K9me3, H3K27ac, H3K9ac, and DNase. We computed the Pearson correlation between all pairs of signal tracks based on signal in chr1-22 and chrX. We used the signal confidence score tracks (-log10(Poisson p-value)), first averaging the signal scores within each consecutive 25-bp interval. To order the experiments in the heatmap, we defined the distance between a pair of experiments as 1 minus their correlation value and used a traveling salesman problem formulation (Ernst et al. (2013)).
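The binning and distance computation can be sketched as follows, assuming each track is available as a plain per-base list of scores. The function names are invented for this example, and the traveling-salesman ordering step is not implemented here.

```python
import math
from typing import List, Sequence

BIN = 25  # bin width in bp, from the text


def bin_means(signal: Sequence[float], bin_size: int = BIN) -> List[float]:
    """Average the per-base signal within consecutive bins."""
    means = []
    for i in range(0, len(signal), bin_size):
        chunk = signal[i:i + bin_size]
        means.append(sum(chunk) / len(chunk))
    return means


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length, non-constant vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def track_distance(x: Sequence[float], y: Sequence[float]) -> float:
    """Distance between two tracks: 1 minus the Pearson correlation of
    their 25-bp binned signals."""
    return 1.0 - pearson(bin_means(x), bin_means(y))
```

The resulting distances range from 0 (perfectly correlated tracks) to 2 (perfectly anti-correlated tracks), so a mislabeled dataset shows up as one that sits closer to another mark than to datasets of its nominal mark.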
RNA-seq uniform processing and quantification for consolidated epigenomes
DATA SOURCE
- Download URL:
https://egg2.wustl.edu/roadmap/data/byDataType/rna/
- README for file names and formats:
https://egg2.wustl.edu/roadmap/data/byDataType/rna/README
Expression quantification
Data format: Tab-delimited matrix
- Download URL:
https://egg2.wustl.edu/roadmap/data/byDataType/rna/expression/
- EG.name.txt: the header file containing the order and names of epigenomes (columns) in the expression matrices listed below. The EG names and the RNA-seq quantification include an E000 sample representing a Universal Human Reference RNA sample (HUR). Agilent's Universal Human Reference RNA is composed of total RNA from 10 human cell lines. The reference RNA is designed to be used as a reference for expression profiling experiments. Since RNA species differ in abundance between cell lines, an ideal reference sample should represent these different RNAs. Equal quantities of DNase-treated total RNA from each cell line were pooled to make the Universal Human Reference RNA. Stratagene also supplies a QPCR Human Reference Total RNA, suitable for QRT-PCR, which has undergone further DNase treatment. Further details are available at
http://www.genomics.agilent.com/article.jsp?pageId=1452&_requestid=2183245 and
http://www.chem.agilent.com/library/usermanuals/public/740000.pdf
- Ensembl_v65.Gencode_v10.ENSG.gene_info: details and annotations for the ENSEMBL ids in the expression matrices
- 57epigenomes.RPKM.pc: RPKM expression matrix for protein coding genes
- 57epigenomes.N.pc: RNA-seq read counts matrix for protein coding genes
- 57epigenomes.RPKM.nc: RPKM expression matrix for non-coding RNAs
- 57epigenomes.N.nc: RNA-seq read counts matrix for non-coding RNAs
- 57epigenomes.exon.RPKM.pc: RPKM expression matrix for protein coding exons
- 57epigenomes.exon.N.pc: RNA-seq read counts matrix for protein coding exons
- 57epigenomes.RPKM.intronic.pc: RPKM expression matrix for intronic protein-coding RNA elements
- 57epigenomes.N.intronic.pc: RNA-seq read count matrix for intronic protein-coding RNA elements
- 57epigenomes.N.rb: RNA-seq read counts matrix for ribosomal genes
- 57epigenomes.RPKM.rb: RPKM expression matrix for ribosomal RNAs
- 57epigenomes.exn.RPKM.rb: RPKM expression matrix for ribosomal gene exons
- 57epigenomes.exn.N.rb: RNA-seq read counts matrix for ribosomal gene exons
RNA-seq signal tracks
Data format: BIGWIG
- Normalized coverage (stranded libraries have two tracks per library for + and - strand with - strand track having negative values):
https://egg2.wustl.edu/roadmap/data/byDataType/rna/signal/normalized_bigwig/stranded/
RNA-seq intergenic contigs
- Intergenic contigs:
https://egg2.wustl.edu/roadmap/data/byDataType/rna/intergenic_contigs/
- Summary statistics for intergenic contigs:
https://egg2.wustl.edu/roadmap/data/byDataType/rna/intergenic_contigs/RNAseq_intergenic_summary.xls
We uniformly reprocessed mRNA-seq datasets from the 56 reference epigenomes that had RNA-seq data. For RNA-seq analysis, after library construction (Gascard et al. (2015)), we aligned 75 bp or 100 bp long reads using the BWA aligner, and generated read coverage profiles separately for the positive and negative strands of strand-specific libraries. We used several QC metrics for the RNA-seq library, including intron-exon ratio, intergenic reads fraction, strand specificity (for stranded RNA-seq protocols), 3'-5' bias, GC bias, and RPKM discovery (see the metadata spreadsheet). We quantified exon and gene expression using a modified RPKM measure (Mortazavi et al. (2008)), whereby we used the total number of reads aligned into coding exons as the normalization factor in RPKM calculations, and excluded reads from the mitochondrial genome, reads falling into genes coding for ribosomal proteins, and reads falling into the top 0.5% of expressed exons. RPKM for a gene was calculated using the total number of reads aligned into all merged exons for the gene, normalized by total exonic length. The resulting files contain RPKM values for all annotated exons and coding and non-coding genes (excluding ribosomal genes), as well as introns (Gencode V10 annotations were used). We also report the coordinates of all significant intergenic RNA-seq contigs not overlapping the annotated genes.
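The modified RPKM can be sketched as below. This is an illustration under stated assumptions rather than the project's code: the `Exon` record and function names are invented, exons are assumed to be already merged per gene, "top 0.5% expressed" is interpreted as ranking eligible exons by reads per base, and the number of exons dropped is rounded down.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Exon:
    gene: str
    length: int             # exon length in bp
    reads: int              # reads aligned into the exon
    coding: bool = True
    excluded: bool = False  # mitochondrial or ribosomal-protein gene


def normalization_total(exons: List[Exon], top_fraction: float = 0.005) -> int:
    """Reads in coding exons, after dropping excluded genes and the most
    highly expressed exons (ranked here by reads per base, an assumption)."""
    eligible = [e for e in exons if e.coding and not e.excluded]
    eligible.sort(key=lambda e: e.reads / e.length, reverse=True)
    n_drop = int(len(eligible) * top_fraction)
    return sum(e.reads for e in eligible[n_drop:])


def rpkm(reads: int, length_bp: int, total_reads: int) -> float:
    """Reads per kilobase of feature per million normalization reads."""
    return reads * 1e9 / (length_bp * total_reads)


def gene_rpkm(exons: List[Exon], gene: str, total_reads: int) -> float:
    """Gene-level RPKM from all merged exons of `gene`."""
    own = [e for e in exons if e.gene == gene]
    return rpkm(sum(e.reads for e in own), sum(e.length for e in own), total_reads)
```

Removing a handful of extremely abundant transcripts from the denominator keeps them from dominating the normalization factor, which would otherwise depress the RPKM of every other gene in samples where they are highly expressed.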
Methylation data cross-assay standardization and uniform processing for consolidated epigenomes
DATA SOURCE
- Download URL:
https://egg2.wustl.edu/roadmap/data/byDataType/dnamethylation/
- README for file names and formats:
https://egg2.wustl.edu/roadmap/data/byDataType/dnamethylation/README
- Whole genome bisulphite methylation calls:
https://egg2.wustl.edu/roadmap/data/byDataType/dnamethylation/WGBS/
- WGBS fractional methylation (BIGWIG format):
https://egg2.wustl.edu/roadmap/data/byDataType/dnamethylation/WGBS/FractionalMethylation_bigwig/
- WGBS read coverage (BIGWIG format):
https://egg2.wustl.edu/roadmap/data/byDataType/dnamethylation/WGBS/ReadCoverage_bigwig/
- RRBS methylation calls:
https://egg2.wustl.edu/roadmap/data/byDataType/dnamethylation/RRBS/
- RRBS fractional methylation (BIGWIG format):
https://egg2.wustl.edu/roadmap/data/byDataType/dnamethylation/RRBS/FractionalMethylation_bigwig/
- RRBS read coverage (BIGWIG format):
https://egg2.wustl.edu/roadmap/data/byDataType/dnamethylation/RRBS/ReadCoverage_bigwig/
- MeDIP/MRE (mCRF) methylation calls:
https://egg2.wustl.edu/roadmap/data/byDataType/dnamethylation/mCRF/
- mCRF fractional methylation (BIGWIG format):
https://egg2.wustl.edu/roadmap/data/byDataType/dnamethylation/mCRF/FractionalMethylation_bigwig/
We used PASH (Kunde-Ramamoorthy et al. (2014)) alignments for the WGBS and RRBS read alignments. From the number of converted and unconverted reads at each individual CpG, the total coverage and fractional methylation were reported. The data were uniformly post-processed and formatted into two matrices for each chromosome. One matrix contained read coverage information for each base (C and G) in every CpG (row) and for each reference epigenome (column). Another matrix similarly contained fractional methylation ranging from 0 to 1. For the locations where coverage was <=3 we considered data as missing. For MeDIP/MRE methylation data we used the output of the mCRF tool (Stevens et al. (2013)), which reports fractional methylation in the range from 0 to 1 and uses an internal BWA mapping. The mCRF results were combined in a single matrix per chromosome for all reference epigenomes where available.
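For the bisulfite assays, the per-CpG call and the two matrices can be sketched as follows. The sketch relies on the standard bisulfite convention that unconverted reads indicate methylated cytosines. The function names and input layout are invented for the example, strands are pooled per CpG for brevity (the text reports coverage per base), and missing values are represented as `None`.

```python
from typing import Dict, List, Optional, Sequence, Tuple

MISSING_AT_OR_BELOW = 3  # coverage <= 3 is treated as missing, per the text

Counts = Tuple[int, int]  # (converted reads, unconverted reads)


def methylation_call(converted: int, unconverted: int) -> Tuple[int, Optional[float]]:
    """Return (coverage, fractional methylation or None if coverage is too low)."""
    coverage = converted + unconverted
    if coverage <= MISSING_AT_OR_BELOW:
        return coverage, None
    return coverage, unconverted / coverage


def build_matrices(
    cpgs: Sequence[int],
    epigenomes: Sequence[str],
    counts: Dict[str, Dict[int, Counts]],
) -> Tuple[List[List[int]], List[List[Optional[float]]]]:
    """Build coverage and fractional-methylation matrices for one chromosome:
    one row per CpG, one column per reference epigenome."""
    coverage, fraction = [], []
    for cpg in cpgs:
        calls = [methylation_call(*counts[eid].get(cpg, (0, 0))) for eid in epigenomes]
        coverage.append([c for c, _ in calls])
        fraction.append([f for _, f in calls])
    return coverage, fraction
```

Keeping coverage in a separate matrix lets downstream analyses apply a stricter threshold, or weight sites by read depth, without going back to the alignments.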
Differentially Methylated Regions (DMRs) and DNA methylation variation
DMR calls across reference epigenomes
As a general resource for epigenomic comparisons across all epigenomes, we defined DMRs using the method of Lister et al. (Lister et al. (2013)), combining all differentially methylated sites (DMSs) within 250 bp of one another into a single DMR and excluding any DMR with fewer than 3 DMSs. For each DMR in each sample, we computed its average methylation level, weighted by the number of reads overlapping it (Schultz et al. (2012)). This resulted in a methylation level matrix with rows of DMRs and columns of samples.
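The merging rule and the weighted methylation level can be sketched as below for a single chromosome. The function names are invented for this example, and the weighted level is taken to be total methylated reads divided by total reads over the CpGs in the DMR, which is one common reading of a read-weighted average and is an assumption here.

```python
from typing import Dict, Iterable, List, Optional, Tuple

DMR = Tuple[int, int]  # (first DMS position, last DMS position)


def merge_dms(positions: Iterable[int], max_gap: int = 250,
              min_sites: int = 3) -> List[DMR]:
    """Chain DMSs lying within `max_gap` bp of one another into DMRs and
    drop DMRs supported by fewer than `min_sites` DMSs."""
    dmrs: List[DMR] = []
    cluster: List[int] = []
    for pos in sorted(positions):
        if cluster and pos - cluster[-1] > max_gap:
            if len(cluster) >= min_sites:
                dmrs.append((cluster[0], cluster[-1]))
            cluster = []
        cluster.append(pos)
    if len(cluster) >= min_sites:
        dmrs.append((cluster[0], cluster[-1]))
    return dmrs


def weighted_methylation(dmr: DMR,
                         site_counts: Dict[int, Tuple[int, int]]) -> Optional[float]:
    """Read-weighted methylation of one DMR in one sample.

    `site_counts` maps CpG position to (methylated reads, total reads).
    """
    start, end = dmr
    methylated = total = 0
    for pos, (m, n) in site_counts.items():
        if start <= pos <= end:
            methylated += m
            total += n
    return methylated / total if total else None
```

Applying `weighted_methylation` to every DMR in every sample yields the DMR-by-sample matrix described above. Pooling reads before dividing means that deeply covered CpGs contribute more than shallow ones, unlike a plain mean of per-site fractions.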
DATA SOURCE
- Whole genome bisulphite sequencing (WGBS) DMRs:
- Format: Matrix
https://egg2.wustl.edu/roadmap/data/byDataType/dnamethylation/DMRs/WGBS_DMRs_v2.tsv.gz
- Format: BIGWIG
https://egg2.wustl.edu/roadmap/data/byDataType/dnamethylation/DMRs/WGBS_bigwig/
- Reduced representation bisulfite sequencing (RRBS) DMRs:
- Format: Matrix
https://egg2.wustl.edu/roadmap/data/byDataType/dnamethylation/DMRs/RRBS_DMRs_v2.tsv.gz
- Format: BIGWIG
https://egg2.wustl.edu/roadmap/data/byDataType/dnamethylation/DMRs/RRBS_bigwig/
DMRs in hESC differentiation (Fig. 4h)
For analyzing differentiation of hESCs in Fig. 4h, we used a second set of DMRs. We used a pairwise comparison strategy between ESCs and three in vitro derived cell types representative of the three germ layers (mesoderm, endoderm, ectoderm) and performed DMR calling as previously described (Ziller et al. (2014)). Only DMRs losing more than 30% methylation compared to the ESC state at a significance level of p ≤ 0.01 were retained. Subsequently, we computed weighted methylation levels for all three DMR sets across HUES64, mesoderm, endoderm and ectoderm as well as three consecutive stages of in vitro derived neural progenitors (please see the companion paper (Ziller et al. (2014)) for details on the cell types). Finally, we plotted the corresponding distribution using the R function vioplot in the vioplot package. In order to identify potential regulators associated with the loss of DNA methylation at these regions, we determined binding sites of a compendium of transcription factors profiled in distinct cell lines and types (Ziller et al. (2013)) that overlapped with each set of hypomethylated DMRs. Next, we determined a potential enrichment over a random genomic background by randomly sampling 100 equally sized sets of genomic regions, respecting the chromosomal and size distribution of the different DMR sets, and determined their overlap with the same transcription factor binding site compendium to estimate a null distribution. Only transcription factors that showed fewer binding sites across the control regions in 99 of the cases were considered for further analysis. Next, we computed the average enrichment over background for each TF with respect to the 100 sets of random control regions for each germ layer DMR set and report this enrichment level in Fig. 4h right, where we capped the relative enrichment at 12.
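The filtering and capping logic for one transcription factor can be sketched as follows, given its overlap count with a DMR set and its overlap counts with the 100 random control sets. The function name is invented, and "average enrichment over background" is read here as the observed count divided by the mean control count; the text leaves open whether it is instead the mean of per-set ratios, so this is an assumption.

```python
from typing import Optional, Sequence

CAP = 12.0     # relative enrichment is capped at 12, per the text
MIN_WINS = 99  # controls must show fewer binding sites in 99 of the cases


def tf_enrichment(observed: int, control_counts: Sequence[int],
                  min_wins: int = MIN_WINS, cap: float = CAP) -> Optional[float]:
    """Return the capped enrichment of a TF over random background,
    or None if the TF does not pass the empirical filter."""
    wins = sum(1 for c in control_counts if c < observed)
    if wins < min_wins:
        return None
    mean_control = sum(control_counts) / len(control_counts)
    if mean_control == 0:
        return cap  # no background overlap at all: report the cap
    return min(observed / mean_control, cap)
```

For example, a factor with 50 overlapping binding sites against a background of 10 per control set passes the filter with an enrichment of 5, whereas the same factor is dropped if two of the 100 control sets reach 50 or more.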
Please note that these files previously contained incorrect column headers. This has been corrected, and new files were uploaded on 2015-09-18.
DATA SOURCE
- DMRs defined across human ESCs and three derived cell types:
https://egg2.wustl.edu/roadmap/data/byDataType/dnamethylation/DMRs/REMC_DMRs_corrected.xlsx
- README file:
https://egg2.wustl.edu/roadmap/data/byDataType/dnamethylation/DMRs/REMC_DMRs_README.txt
Additional DMR calls
For studying breast epithelia differentiation, DMRs were called from WGBS, requiring at least 5 aligned reads to call a differentially methylated CpG, and at least 3 differentially methylated CpGs within a distance of 200 bp of each other (Gascard et al. (2015)). For studying tissue environment vs. developmental origin, DMRs were called from MeDIP and MRE data using the M&M algorithm (Lowdon et al. (2014)).
DATA SOURCE
- DMRs defined for studying breast epithelia differentiation:
https://egg2.wustl.edu/roadmap/data/byDataType/dnamethylation/DMRs/mbilenky_DMRs.xlsx
- README file:
https://egg2.wustl.edu/roadmap/data/byDataType/dnamethylation/DMRs/mbilenky_DMRs_README.txt