README for uniformly processed RNA-seq data

These data were processed by Misha Bilenky (mbilenky@gmail.com) from UBC.

=========================
ANNOTATIONS
=========================

The annotations/ directory contains GENCODE v10 gene/transcript coordinates and annotations corresponding to the hg19 version of the human genome.

gen10.gtf.gz:
  The original GENCODE v10 file (ftp://ftp.sanger.ac.uk/pub/gencode/release_10/gencode.v10.annotation.gtf.gz) does not contain tRNAs. I added the entries from a separate GENCODE v10 file (ftp://ftp.sanger.ac.uk/pub/gencode/release_10/gencode.v10.tRNAs.gtf.gz). For the tRNAs I modified the "gene_type" and "transcript_type" attributes and added the elements "gene" and "transcript".

gen10.long.gtf:
  A subset that contains only entries for long transcripts (protein coding, processed transcript, pseudogenes, ...).
  NOTE: the long.gtf file was used for quantification.

gen10.long.partition.unstr.gtf.gz:
  A partition of the genome (~3 billion bp, unstranded) into elements using the hierarchy gap-exon-intron-intergenic. The partition also contains an element called "genic". This partition is longer than exonic+intronic, since transcripts assigned to a gene locus do not necessarily overlap. However, those cases are the exception rather than the rule.

=========================
EXPRESSION QUANTIFICATION
=========================

expression/ - This directory contains expression quantification information.
These files were downloaded from ftp://ftp.bcgsc.ca/public/mbilenky/112epigenomes/RNAseq_Removed_E060_E064/

FILES:

-------------
HEADER FILES
-------------

EG.name.txt: order and names of epigenomes (columns) in the expression matrices.
The EG names and the RNA-seq quantification include an E000 sample representing a Universal Human Reference RNA sample (UHR).

Universal Human Reference RNA: Agilent's Universal Human Reference RNA is composed of total RNA from 10 human cell lines. The reference RNA is designed to be used as a reference for microarray gene-profiling experiments.
Since RNA species differ in abundance between cell lines, an ideal reference sample should represent these different RNAs. Equal quantities of DNase-treated total RNA from each cell line were pooled to make the Universal Human Reference RNA. This Universal Reference RNA is suitable for microarray experiments. Stratagene also supplies a QPCR Human Reference Total RNA, suitable for QRT-PCR, which has undergone further DNase treatment.
See http://www.genomics.agilent.com/article.jsp?pageId=1452&_requestid=2183245
The manual at http://www.chem.agilent.com/library/usermanuals/public/740000.pdf lists the cell lines that are actually included in the pool.

Ensembl_v65.Gencode_v10.ENSG.gene_info: details and annotations for the Ensembl IDs that are the rows in the expression matrices.

--------------------
EXPRESSION MATRICES
--------------------

57epigenomes.RPKM.pc: RPKM expression matrix for protein coding genes
57epigenomes.N.pc: RNA-seq read counts matrix for protein coding genes
57epigenomes.RPKM.nc: RPKM expression matrix for non-coding RNAs
57epigenomes.N.nc: RNA-seq read counts matrix for non-coding RNAs
57epigenomes.exon.RPKM.pc: RPKM expression matrix for protein coding exons
57epigenomes.exon.N.pc: RNA-seq read counts matrix for protein coding exons
57epigenomes.RPKM.intronic.pc: RPKM expression matrix for intronic protein-coding RNA elements
57epigenomes.N.intronic.pc: RNA-seq read counts matrix for intronic protein-coding RNA elements
57epigenomes.N.rb: RNA-seq read counts matrix for ribosomal genes
57epigenomes.RPKM.rb: RPKM expression matrix for ribosomal RNAs
57epigenomes.exn.RPKM.rb: RPKM expression matrix for ribosomal gene exons
57epigenomes.exn.N.rb: RNA-seq read counts matrix for ribosomal gene exons

The same RPKM normalization is used for both protein coding and non-coding RNAs.
For non-stranded libraries, the nc matrices give rather messy results because of extensive overlap between strands.
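
To make the relationship between the read-count (N) and RPKM matrices concrete, here is a minimal Python sketch of the standard RPKM definition (reads per kilobase of feature per million mapped reads). This README does not state which read total was used as the denominator for these files, so the function name and the `total_mapped_reads` argument are assumptions for illustration only and may not reproduce the distributed values exactly.

```python
def rpkm(read_count, feature_length_bp, total_mapped_reads):
    """Standard RPKM: reads per kilobase of feature per million mapped reads.

    All three arguments are hypothetical inputs; the exact normalization
    total used for the 57epigenomes.* files is not given in this README.
    """
    if feature_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("length and read total must be positive")
    # 1e9 = 1e3 (bp per kilobase) * 1e6 (reads per million)
    return 1e9 * read_count / (feature_length_bp * total_mapped_reads)


# Example: 100 reads on a 2,000 bp gene in a library of 10 million reads.
example_value = rpkm(100, 2000, 10_000_000)
```

Because the same formula applies to every feature type, values from the pc and nc matrices are comparable as long as the same read total is used for both.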
=============
SIGNAL TRACKS
=============

signal/normalized_bigwig/stranded - Contains normalized bigwig files with RNA-seq signal coverage.
For stranded libraries, there are two files corresponding to the + and - strands. The - strand files have signal values expressed as negative numbers.

signal/unnormalized_wig - Contains unnormalized wiggle files with RNA-seq signal coverage.
Downloaded from ftp://ftp.bcgsc.ca/public/mbilenky/112epigenomes/RNAseq_wigs/
All wigs are computed as read coverage; multimapped reads are kept and duplicated reads are kept.
Coverages are NOT normalized, but normalization coefficients are in the file all.EGID.N.readlength, which contains the normalization information and the read length per library.

signal/unnormalized_wig/stranded/: contains wigs for stranded libraries
signal/unnormalized_wig/strandagnostic: contains wigs that merge reads from both strands
Replicates are provided separately for H1 and H1-derived samples.

==========================
INTERGENIC CONTIGS
==========================

RNAseq_intergenic.tar.gz: BED files containing intergenic contigs
RNAseq_intergenic_summary.xls: Summary statistics of intergenic contigs

Clusters of intergenic mRNA expression were identified as follows:
1. From the RNA-seq alignment BAM files, reads aligned to multiple locations were excluded, and duplicated reads were counted only once.
2. Read coverage was calculated in 200bp bins genome wide. Read coverage was then converted into RPKM values, using the same normalization procedure as for annotated genes.
3. All bins that overlap any gene body as annotated by Ensembl v65 (GENCODE v10) were excluded.
4. All bins with RPKM<0.5 were excluded.
5. The resulting clusters of intergenic expression were summarized in the UCSC bedGraph format (with 200bp resolution).
6. For data from strand-specific experiments, the analysis was performed in a strand-specific manner.
At the final step, after all filtering, RPKM values were averaged over the overlapping fragments of clusters on the positive and negative strands.
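
The bin-filtering steps above (200bp bins, conversion to RPKM, removal of bins overlapping annotated gene bodies, and the RPKM<0.5 cutoff) can be sketched in Python as follows. This is an illustration under stated assumptions, not the pipeline that produced these files: the input layout (`bin_counts`, `gene_intervals`), the function names, and the use of a single library read total are all hypothetical, and the sketch stops at per-bin records because the README does not specify how retained bins are grouped into clusters or how the strand averaging is implemented.

```python
BIN_SIZE = 200   # bin width in bp, as stated in step 2
MIN_RPKM = 0.5   # bins below this value are dropped, as stated in step 4


def overlaps_any_gene(start, end, gene_intervals):
    """True if the half-open bin [start, end) overlaps any gene interval."""
    return any(start < g_end and g_start < end for g_start, g_end in gene_intervals)


def retained_intergenic_bins(chrom, bin_counts, gene_intervals, total_mapped_reads):
    """Return (chrom, start, end, rpkm) for bins that pass both filters.

    bin_counts: dict mapping bin index -> read count (hypothetical input,
        assumed to exclude multimapped reads and to count duplicates once).
    gene_intervals: list of half-open (start, end) gene bodies on chrom.
    total_mapped_reads: assumed normalization total for the library.
    """
    kept = []
    for index in sorted(bin_counts):
        start = index * BIN_SIZE
        end = start + BIN_SIZE
        if overlaps_any_gene(start, end, gene_intervals):
            continue  # step 3: drop bins overlapping a gene body
        value = 1e9 * bin_counts[index] / (BIN_SIZE * total_mapped_reads)
        if value < MIN_RPKM:
            continue  # step 4: drop weakly expressed bins
        kept.append((chrom, start, end, value))
    return kept


def to_bedgraph_line(record):
    """Format one retained bin as a tab-separated bedGraph row."""
    chrom, start, end, value = record
    return f"{chrom}\t{start}\t{end}\t{value:.3f}"
```

For example, calling `retained_intergenic_bins("chr1", {0: 10, 1: 10, 2: 0, 5: 4}, [(150, 190)], 10_000_000)` drops bin 0 (gene overlap) and bin 2 (no reads) and keeps the bins starting at 200 and 1000.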