README for uniformly processed RNA-seq data
These data were processed by Misha Bilenky (mbilenky@gmail.com) from UBC. 

=========================
ANNOTATIONS
=========================
The annotations/ directory contains GENCODE v10 gene/transcript coordinates and annotations corresponding to the hg19 version of the human genome.

gen10.gtf.gz: The original GENCODE v10 annotation (ftp://ftp.sanger.ac.uk/pub/gencode/release_10/gencode.v10.annotation.gtf.gz) does not contain tRNAs. I added the entries from a separate GENCODE v10 file (ftp://ftp.sanger.ac.uk/pub/gencode/release_10/gencode.v10.tRNAs.gtf.gz). For the tRNAs, I modified the "gene_type" and "transcript_type" attributes and added "gene" and "transcript" elements.

gen10.long.gtf: A subset that contains only entries for long transcripts (protein coding, processed transcript, pseudogenes, ...)
NOTE: the long.gtf file was used for quantification
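
As an illustration of how a subset such as gen10.long.gtf could be derived, the sketch below parses the GTF attribute column and keeps entries whose "transcript_type" is in a chosen set. The LONG_TYPES set and the function names are assumptions made for this example; the exact list of transcript types used to build gen10.long.gtf is not given in this README.

```python
import re

# Hypothetical set of "long" transcript types. The real list used for
# gen10.long.gtf is not specified in this README.
LONG_TYPES = {"protein_coding", "processed_transcript", "pseudogene"}

# Matches quoted GTF attributes such as: gene_id "ENSG0001"
_ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def parse_attributes(attr_field):
    """Turn the 9th GTF column into a dict of quoted key -> value pairs."""
    return dict(_ATTR_RE.findall(attr_field))


def filter_long(lines, long_types=LONG_TYPES):
    """Yield GTF lines whose transcript_type is in long_types."""
    for line in lines:
        if line.startswith("#"):
            continue  # skip header/comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a complete GTF record
        attrs = parse_attributes(fields[8])
        if attrs.get("transcript_type") in long_types:
            yield line
```

A gzipped annotation could be streamed through this with gzip.open(path, "rt") from the Python standard library.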

gen10.long.partition.unstr.gtf.gz: A partition of the genome (~3 billion bp, unstranded) into elements using the hierarchy gap-exon-intron-intergenic. The partition also contains an element called "genic". This element is longer than exonic+intronic, since transcripts assigned to a gene locus do not necessarily overlap. However, those cases are the exception rather than the rule.
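
A small worked example of why "genic" can exceed exonic+intronic. It assumes that "genic" is the span from the first to the last transcript of a locus and that exonic+intronic is the union of the transcript spans; both definitions, and the coordinates, are assumptions for illustration only.

```python
def union_length(intervals):
    """Total length covered by half-open (start, end) intervals."""
    total = 0
    cur_start, cur_end = None, None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end:
            # Close the previous merged block and start a new one.
            if cur_end is not None:
                total += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # Overlapping or touching: extend the current block.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start
    return total


# Two hypothetical transcripts of one gene locus that do not overlap.
transcripts = [(100, 400), (700, 1000)]
gene_span = (min(s for s, _ in transcripts), max(e for _, e in transcripts))

transcribed = union_length(transcripts)   # exonic + intronic: 600 bp
genic = gene_span[1] - gene_span[0]       # whole locus: 900 bp
```

The 300 bp gap between the two transcripts belongs to the locus but to neither transcript, which is the difference described above.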

=========================
EXPRESSION QUANTIFICATION
=========================
expression/ - This directory contains expression quantification information

These were downloaded from ftp://ftp.bcgsc.ca/public/mbilenky/112epigenomes/RNAseq_Removed_E060_E064/

FILES:

-------------
HEADER FILES
-------------
EG.name.txt: order and names of epigenomes (columns) in the expression matrices

The EG names and the RNA-seq quantification include an E000 sample representing a Universal Human Reference RNA sample (UHR).
Universal Human Reference RNA: Agilent's Universal Human Reference RNA is composed of total RNA from 10 human cell lines. The reference RNA is designed to be used as a reference for microarray gene-profiling experiments. Since RNA species differ in abundance between cell lines, an ideal reference sample should represent these different RNAs. Equal quantities of DNase-treated total RNA from each cell line were pooled to make the Universal Human Reference RNA. This Universal Reference RNA is suitable for microarray experiments. Stratagene also supplies a QPCR Human Reference Total RNA, suitable for QRT-PCR, which has undergone further DNase treatment. http://www.genomics.agilent.com/article.jsp?pageId=1452&_requestid=2183245
The cell lines that make up the pool are listed in http://www.chem.agilent.com/library/usermanuals/public/740000.pdf

Ensembl_v65.Gencode_v10.ENSG.gene_info : details and annotations for the ENSEMBL ids that are the rows in the expression matrices

--------------------
EXPRESSION MATRICES
--------------------
57epigenomes.RPKM.pc: RPKM expression matrix for protein coding genes
57epigenomes.N.pc: RNA-seq read counts matrix for protein coding genes
57epigenomes.RPKM.nc: RPKM expression matrix for non-coding RNAs
57epigenomes.N.nc: RNA-seq read counts matrix for non-coding RNAs
57epigenomes.exon.RPKM.pc: RPKM expression matrix for protein coding exons
57epigenomes.exon.N.pc: RNA-seq read counts matrix for protein coding exons
57epigenomes.RPKM.intronic.pc: RPKM expression matrix for intronic protein-coding RNA elements
57epigenomes.N.intronic.pc: RNA-seq read count matrix for intronic protein-coding RNA elements
57epigenomes.N.rb: RNA-seq read counts matrix for ribosomal genes
57epigenomes.RPKM.rb: RPKM expression matrix for ribosomal RNAs
57epigenomes.exn.RPKM.rb: RPKM expression matrix for ribosomal gene exons
57epigenomes.exn.N.rb: RNA-seq read counts matrix for ribosomal gene exons

The same RPKM normalization was used for both protein-coding and non-coding RNAs.
For non-stranded libraries, the nc (non-coding) matrices give rather messy results because many features overlap between strands.
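
For reference, the sketch below shows the textbook RPKM definition (reads per kilobase of feature per million reads). It is not necessarily the exact normalization used to produce these matrices, since the README does not say which reads count toward the library total; the function name and arguments are assumptions for illustration.

```python
def rpkm(read_count, feature_length_bp, total_reads):
    """Textbook RPKM: reads per kilobase of feature per million reads.

    total_reads is whatever library total the normalization uses; which
    reads are included in that total is an assumption left to the caller.
    """
    if feature_length_bp <= 0 or total_reads <= 0:
        raise ValueError("feature_length_bp and total_reads must be positive")
    return read_count * 1e9 / (feature_length_bp * total_reads)


# Example: 500 reads on a 2,000 bp feature in a library of 10 million reads.
example = rpkm(500, 2000, 10_000_000)
```

With these numbers the example evaluates to 25 RPKM.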

=============
SIGNAL TRACKS
=============
signal/normalized_bigwig/stranded - Contains normalized bigwig files with RNA-seq signal coverage
For stranded libraries, there are two files, one for the + strand and one for the - strand. The files for the - strand have signal values expressed as negative numbers.
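
Because the - strand signal is stored as negative numbers, it has to be converted to magnitudes before the two strands are compared or summed. A minimal sketch, assuming the per-position values have already been read from the two bigwig files into equal-length lists (the function names are hypothetical):

```python
def unsigned_minus(minus_values):
    """Convert negative-valued minus-strand signal to magnitudes."""
    return [abs(v) for v in minus_values]


def combine_strands(plus_values, minus_values):
    """Per-position total signal from both strands.

    minus_values are expected to be <= 0, as stored in the - strand files.
    """
    if len(plus_values) != len(minus_values):
        raise ValueError("strand signals must cover the same positions")
    return [p + abs(m) for p, m in zip(plus_values, minus_values)]
```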

signal/unnormalized_wig - Contains unnormalized wiggle files with RNA-seq signal coverage
Downloaded from ftp://ftp.bcgsc.ca/public/mbilenky/112epigenomes/RNAseq_wigs/
All wigs represent read coverage; multimapped reads and duplicated reads are kept. Coverages are NOT normalized, but the normalization coefficients are in the file all.EGID.N.readlength,
which contains the normalization information and the read length for each library.
signal/unnormalized_wig/stranded/: contains wigs for stranded libraries 
signal/unnormalized_wig/strandagnostic: contains wigs that merge reads from both strands
Replicates are provided separately for H1 and H1-derived cells.

==========================
INTERGENIC CONTIGS
==========================
RNAseq_intergenic.tar.gz: BED files containing intergenic contigs
RNAseq_intergenic_summary.xls: Summary statistics of intergenic contigs

Clusters of intergenic mRNA expression were identified as follows:

1. From the RNA-seq alignment BAM files, reads aligned to multiple locations were excluded, and duplicated reads were counted only once.
2. Read coverage was calculated in 200bp bins genome-wide. The read coverage was then converted into RPKM values, using the same normalization procedure as for annotated genes.
3. All bins that overlap any gene body as annotated by Ensembl v65 (GENCODE v10) were excluded.
4. All bins with RPKM<0.5 were excluded.
5. The resulting clusters of intergenic expression were summarized in the UCSC bedGraph format (with 200bp resolution).
6. For the data from strand-specific experiments, the analysis was performed in a strand-specific manner. At the final step, after all filtering, RPKM values were averaged for the overlapping fragments of the clusters on the positive and negative strands.
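
The steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code used to produce the files: reads are assumed to be already filtered as in step 1, each read is assigned to a bin by its 0-based start position, RPKM uses the textbook formula, genes are half-open (start, end) intervals on one chromosome, and all function names are hypothetical.

```python
from collections import Counter

BIN = 200        # bin size in bp (step 2)
MIN_RPKM = 0.5   # RPKM threshold (step 4)


def bin_rpkm(read_starts, total_reads, bin_size=BIN):
    """Step 2: count reads per bin (by start) and convert to textbook RPKM."""
    counts = Counter(pos // bin_size for pos in read_starts)
    scale = 1e9 / (bin_size * total_reads)
    return {b: n * scale for b, n in counts.items()}


def overlaps_gene(bin_index, genes, bin_size=BIN):
    """True if the half-open bin overlaps any half-open (start, end) gene."""
    start, end = bin_index * bin_size, (bin_index + 1) * bin_size
    return any(start < g_end and g_start < end for g_start, g_end in genes)


def intergenic_bins(read_starts, total_reads, genes):
    """Steps 2-4: bin RPKM, drop genic bins, drop bins below MIN_RPKM."""
    values = bin_rpkm(read_starts, total_reads)
    return {b: v for b, v in values.items()
            if v >= MIN_RPKM and not overlaps_gene(b, genes)}


def average_strands(plus, minus):
    """Step 6: average RPKM where both strands have a surviving bin."""
    merged = dict(plus)
    for b, v in minus.items():
        merged[b] = (merged[b] + v) / 2 if b in merged else v
    return merged


def to_bedgraph(chrom, bins, bin_size=BIN):
    """Step 5: one bedGraph line per bin (0-based, half-open coordinates)."""
    return [f"{chrom}\t{b * bin_size}\t{(b + 1) * bin_size}\t{v:.3f}"
            for b, v in sorted(bins.items())]
```

For a stranded library, intergenic_bins would be run once per strand and the two results passed to average_strands before writing the bedGraph lines.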