0.4 RNA-seq Differential Analysis
Assess different expression levels of genes between our samples at different timepoints, conditions, etc.
Why?
Let's suppose we have a strain of E. coli that we want to optimize to produce ethanol.But we don't know what genes are involved. What do we do?
We perform differential expression analysis on our RNA-seq data. The expressionlevels have already been calculated using cufflinks
.
How? Cuffdiff!
Cuffdiff under the hood
A simple methodology for assessing differential expression levels is differences in transcript count between sample conditions, but this can be erroneous due to alternative splicing and biases in sequencing.
A more robust method is to fit a Poisson Distribution given the expectation that the odds of seeing a change in expression level are small.
But Poisson doesn't account for count uncertainty and count dispersion!
Count Uncertainty: The fact that reads can be shared across multiple genes because of shared genetic data.
Count Dispersion: The fact that the number of reads produced is highly variable between replicates.
From the Cuffdiff Paper:
Our method addresses both of these issues by modeling how variability in measurements of a transcript's fragment count depends on both its expression and its splicing structure.
The way in which this happens is:
The algorithm captures uncertainty in a transcript's fragment count as a beta distribution and the over dispersion in this count with a negative binomial, and mixes the distributions together. The resulting mixture is a beta negative binomial distribution that reflects both sources of variability in an isoform's measured expression level.
Summarized in this picture:
Pretty clever!
Using Cuffdiff
Thankfully using cuffdiff
is super easy.
An example implementation is shown below:
First thing for ease of use is to generate a file called diff.sh
. This is a simple bash file and all of this could be accomplished just using the command line but this is a little cleaner and allows for easier code repeatability.
Now we simply run: bash diff.sh
Where the files noted in C*_R*
are the expression files generated fromcufflinks
.
In bash scripting calling ${*}
denotes a variable defined earlier in the script.
Cuffdiff Output
It will generate a CSV-like file in the diff_out
directory called gene_exp.diff
. It looks like:
Accessing the data
Conveniently CSV files can be accessed with R
, Python
, or even Excel
Here's an example code snippet of accessing only statistically significant rows in Bash
:
Where this only selects rows where the last columns is yes
. However if the substring yes
is in a locus
, gene_id
, or any other field then this will return false positives.
By default this threshold is a value of 0.05 for the q-value field.What is a q-value?
In Python
:
Further analysis would be to get the count of significant differences by gene locus. Continuing:
Example Figure
This is an example out put of how you could plot the results. Here the x-axis is the Log<sub>2</sub>(fold-change) and the y-axis is the-log<sub>10</sub>(q-value). Both of these values are accessible from the the output of cuffdiff
above.
Global analysis of differential gene expression related to long-term sperm storage in oviduct of Chinese Soft-Shelled Turtle Pelodiscus sinensis Liu, Tengfei, et al.
Supplementary
What's a q-value?
When doing lots of statistical comparisons, the likelihood of getting a statistically significant result (p-value) increases as more comparisons are performed. This results in an increased False Discovery Rate (FDR).
Bonferroni Correction
To combat this issue of an increased FDR from multiple comparisons, we can adjust our threshold for statistical significance with the Bonferroni Correction.
Simply, if we have m
comparisons being performed and some initial p-value, a
,then our new threshold, p
is: p = a / m
Last updated