0.4 RNA-seq Differential Analysis
Assess different expression levels of genes between our samples at different timepoints, conditions, etc.
Why?
Let's suppose we have a strain of E. coli that we want to optimize to produce ethanol.But we don't know what genes are involved. What do we do?
We perform differential expression analysis on our RNA-seq data. The expressionlevels have already been calculated using cufflinks
.
How? Cuffdiff!
Cuffdiff under the hood
A simple methodology for assessing differential expression levels is differences in transcript count between sample conditions, but this can be erroneous due to alternative splicing and biases in sequencing.
A more robust method is to fit a Poisson Distribution given the expectation that the odds of seeing a change in expression level are small.
But Poisson doesn't account for count uncertainty and count dispersion!
Count Uncertainty: The fact that reads can be shared across multiple genes because of shared genetic data.
Count Dispersion: The fact that the number of reads produced is highly variable between replicates.
From the Cuffdiff Paper:
Our method addresses both of these issues by modeling how variability in measurements of a transcript's fragment count depends on both its expression and its splicing structure.
The way in which this happens is:
The algorithm captures uncertainty in a transcript's fragment count as a beta distribution and the over dispersion in this count with a negative binomial, and mixes the distributions together. The resulting mixture is a beta negative binomial distribution that reflects both sources of variability in an isoform's measured expression level.
Summarized in this picture:

Pretty clever!
Using Cuffdiff
Thankfully using cuffdiff
is super easy.
An example implementation is shown below:
First thing for ease of use is to generate a file called diff.sh
. This is a simple bash file and all of this could be accomplished just using the command line but this is a little cleaner and allows for easier code repeatability.
$ vi diff.sh
genes=/home/linux/ieng6/be183f/public/bengTutorial/index_gtf/genes4.gtf
C1_R1=../alignments/C1_R1/accepted_hits.bam
C1_R2=../alignments/C1_R2/accepted_hits.bam
C2_R1=../alignments/C2_R1/accepted_hits.bam
C2_R2=../alignments/C2_R2/accepted_hits.bam
cuffdiff -o diff_out ${genes} ${C1_R1},${C1_R2} ${C2_R2},${C2_R2}
Now we simply run: bash diff.sh
Where the files noted in C*_R*
are the expression files generated fromcufflinks
.
In bash scripting calling ${*}
denotes a variable defined earlier in the script.
Cuffdiff Output
It will generate a CSV-like file in the diff_out
directory called gene_exp.diff
. It looks like:
test_id gene_id gene locus sample_1 sample_2 status value_1 value_2 log2(fold_change) test_stat p_value q_value significant
FBgn0002521 FBgn0002521 pho 4:1193093-1202271 q1 q2 OK 1885.82 1597.12 -0.239721 -1.27565 0.07125 0.389135 no
FBgn0004607 FBgn0004607 zfh2 4:524476-560418 q1 q2 NOTEST 6.72414 6.02587 -0.158181 0 1 1 no
FBgn0004624 FBgn0004624 CaMKII 4:1056642-1074329 q1 q2 OK 6854.63 6821.34 -0.00702337 -0.0397619 0.9532 0.97767 no
FBgn0004859 FBgn0004859 ci 4:68333-77667 q1 q2 NOTEST 33.6854 68.8131 1.03056 0 1 1 no
FBgn0005558 FBgn0005558 ey 4:718314-741787 q1 q2 OK 339.474 313.086 -0.116739 -0.511244 0.46085 0.925195 no
FBgn0005561 FBgn0005561 sv 4:1109443-1133943 q1 q2 OK 215.591 182.876 -0.237431 -0.810935 0.2223 0.819162 no
FBgn0005666 FBgn0005666 bt 4:745029-796707 q1 q2 OK 337.666 352.068 0.0602593 0.312879 0.65755 0.925195 no
FBgn0010217 FBgn0010217 ATPsyn-beta 4:1052439-1055175 q1 q2 OK 45751.2 45196.6 -0.0175982 -0.10918 0.8776 0.929994 no
FBgn0011642 FBgn0011642 Zyx102EF 4:1077990-1081542 q1 q2 OK 4612.4 4319.09 -0.0947887 -0.486887 0.48765 0.925195 no
Accessing the data
Conveniently CSV files can be accessed with R
, Python
, or even Excel
Here's an example code snippet of accessing only statistically significant rows in Bash
:
$ grep yes gene_exp.diff
Where this only selects rows where the last columns is yes
. However if the substring yes
is in a locus
, gene_id
, or any other field then this will return false positives.
By default this threshold is a value of 0.05 for the q-value field.What is a q-value?
In Python
:
import pandas as pd
# Tab delimited with a header at the zeroth row.
df = pd.read_csv('gene_exp.diff', delimiter='\t', header=0)
just_significant = df.loc[df.significant == 'yes']
Further analysis would be to get the count of significant differences by gene locus. Continuing:
counts_by_locus = just_significant.groupby(['locus']).count()
Example Figure
This is an example out put of how you could plot the results. Here the x-axis is the Log<sub>2</sub>(fold-change) and the y-axis is the-log<sub>10</sub>(q-value). Both of these values are accessible from the the output of cuffdiff
above.

Global analysis of differential gene expression related to long-term sperm storage in oviduct of Chinese Soft-Shelled Turtle Pelodiscus sinensis Liu, Tengfei, et al.
Supplementary
What's a q-value?
When doing lots of statistical comparisons, the likelihood of getting a statistically significant result (p-value) increases as more comparisons are performed. This results in an increased False Discovery Rate (FDR).
Bonferroni Correction
To combat this issue of an increased FDR from multiple comparisons, we can adjust our threshold for statistical significance with the Bonferroni Correction.
Simply, if we have m
comparisons being performed and some initial p-value, a
,then our new threshold, p
is: p = a / m
Last updated