0.3 RNA-seq Data Mapping & Gene Quantification

Connect to linux server

Open a terminal and type ssh username@ieng6-###.ucsd.edu

First let's create some target directories with the following commands

mkdir geneExpression
cd geneExpression
mkdir alignments 
mkdir fpkm       
mkdir diff

Then we can use TOPHAT to align the reads to the genome with the following template command tophat -p 1 -G /path/to/genes.gtf -o out/dir path/to/genome/index path/to/reads_R1.fastq path/to/reads_R1.fastq The genome index used with TOPHAT should be bowtie2 index files. To align all the fastq files from the example data at the same time, we can create a shell script

cd alignments
vi alignment.sh
reads=/home/linux/ieng6/be183f/public/bengTutorial/fastq    # address where the fastq files are stored
genes=/home/linux/ieng6/be183f/public/bengTutorial/index_gtf/genes4.gtf   # this contain information about the positions of genes on the genome
# Tophat need to know where is the reference genome, we'll create soft-links in of the reference genome in the current folder
for file in /home/linux/ieng6/be183f/public/bengTutorial/index_gtf/4*; do
       ln -s $file .
done

tophat -p 1 -G $genes -o C1_R1 4 ${reads}/GSM794483_C1_R1_1.ss.fq ${reads}/GSM794483_C1_R1_2.ss.fq &
tophat -p 1 -G $genes -o C1_R2 4 ${reads}/GSM794484_C1_R2_1.ss.fq ${reads}/GSM794484_C1_R2_2.ss.fq &
tophat -p 1 -G $genes -o C2_R1 4 ${reads}/GSM794486_C2_R1_1.ss.fq ${reads}/GSM794486_C2_R1_2.ss.fq &
tophat -p 1 -G $genes -o C2_R2 4 ${reads}/GSM794487_C2_R2_1.ss.fq ${reads}/GSM794487_C2_R2_2.ss.fq &

Next we quit and save the script by typing: :wq Then we run the script by typing: bash alignment.sh After the alignment step is finished, we use Cufflink to quantify the gene expressions A template Cufflink command is like the following cufflink -p 1 -G path/to/genes.gtf -o path/to/outdir path/to/accepted_hits.bam We can also write a shell script to execute the files all at once

$ cd fpkm  # get into the fpkm folder
$ vi fpkm.sh
genes=/home/linux/ieng6/be183f/public/bengTutorial/index_gtf/genes4.gtf
alignments=../alignments

for condition in C1 C2; do
for replicate in R1 R2; do
   echo ${condition}_${replicate}
   cufflinks -p 1 -G $genes -o ${condition}_${replicate} ${alignments}/${condition}_${replicate}/accepted_hits.bam
done; done

We quit and save the script by typing: :wq Then we run the script by typing: bash fpkm.sh Here, genes.fpkm_tracking and isoforms.fpkm_tracking contains gene expression values (measured as FPKM) at the gene and transcript levels.

STAR-Kallisto Pipeline

We can also use STAR to align the reads to the genome. We need to first build index files that is compatible with STAR prior to the alignment step. To build the index, we can run the following template command STAR --runMode genomeGenerate --genomeDir path/to/starIndex --genomeFastaFiles path/to/genome.fa Then, we'll be ablt to execute the alignment step with the following template command STAR --genomeDir path/to/starIndex/ --readFilesIn path/to/read1 path/to/read2 --outFileNamePrefix output/ After the mapping is finished, the mapping statistics can be viewed as Log.final.out and the detailed mapping results can be viewed at Align.out.sam Besides Cufflinks, we can also use Kallisto to quantify the gene expressions directly from the raw fastq files. To do this, we need to build the index for Kallisto first with the following template command kallisto index -i path/to/output.index path/to/transcriptome.fa With index file built, we are able to quantify the gene expressions with the following template command kallisto quant -i path/to/output.index -o path/to/outDic path/to/read1.fastq path/to/read2.fastq The results can be viewed at abundance.tsv where gene expressions are quantified in terms of TPM values.

Last updated