# 0.3 RNA-seq Data Mapping & Gene Quantification

### Connect to linux server

Open a terminal and type \
&#x20;`ssh username@ieng6-###.ucsd.edu`

### TOPHAT-CUFFLINK Pipeline

First let's create some target directories with the following commands

```
mkdir geneExpression
cd geneExpression
mkdir alignments 
mkdir fpkm       
mkdir diff
```

Then we can use TOPHAT to align the reads to the genome with the following template command \
&#x20;`tophat -p 1 -G /path/to/genes.gtf -o out/dir path/to/genome/index path/to/reads_R1.fastq path/to/reads_R1.fastq` \
&#x20;The genome index used with TOPHAT should be bowtie2 index files. To align all the fastq files from the example data at the same time, we can create a shell script <br>

```
cd alignments
vi alignment.sh
reads=/home/linux/ieng6/be183f/public/bengTutorial/fastq    # address where the fastq files are stored
genes=/home/linux/ieng6/be183f/public/bengTutorial/index_gtf/genes4.gtf   # this contain information about the positions of genes on the genome
# Tophat need to know where is the reference genome, we'll create soft-links in of the reference genome in the current folder
for file in /home/linux/ieng6/be183f/public/bengTutorial/index_gtf/4*; do
       ln -s $file .
done

tophat -p 1 -G $genes -o C1_R1 4 ${reads}/GSM794483_C1_R1_1.ss.fq ${reads}/GSM794483_C1_R1_2.ss.fq &
tophat -p 1 -G $genes -o C1_R2 4 ${reads}/GSM794484_C1_R2_1.ss.fq ${reads}/GSM794484_C1_R2_2.ss.fq &
tophat -p 1 -G $genes -o C2_R1 4 ${reads}/GSM794486_C2_R1_1.ss.fq ${reads}/GSM794486_C2_R1_2.ss.fq &
tophat -p 1 -G $genes -o C2_R2 4 ${reads}/GSM794487_C2_R2_1.ss.fq ${reads}/GSM794487_C2_R2_2.ss.fq &
```

Next we quit and save the script by typing: `:wq` Then we run the script by typing: `bash alignment.sh` \
&#x20;After the alignment step is finished, we use Cufflink to quantify the gene expressions \
&#x20;A template Cufflink command is like the following \
&#x20;`cufflink -p 1 -G path/to/genes.gtf -o path/to/outdir path/to/accepted_hits.bam` We can also write a shell script to execute the files all at once <br>

```
$ cd fpkm  # get into the fpkm folder
$ vi fpkm.sh
genes=/home/linux/ieng6/be183f/public/bengTutorial/index_gtf/genes4.gtf
alignments=../alignments

for condition in C1 C2; do
for replicate in R1 R2; do
   echo ${condition}_${replicate}
   cufflinks -p 1 -G $genes -o ${condition}_${replicate} ${alignments}/${condition}_${replicate}/accepted_hits.bam
done; done
```

We quit and save the script by typing: `:wq` Then we run the script by typing: `bash fpkm.sh` \
&#x20;Here, genes.fpkm\_tracking and isoforms.fpkm\_tracking contains gene expression values (measured as FPKM) at the gene and transcript levels.

### STAR-Kallisto Pipeline

We can also use STAR to align the reads to the genome. We need to first build index files that is compatible with STAR prior to the alignment step. To build the index, we can run the following template command \
&#x20;`STAR --runMode genomeGenerate --genomeDir path/to/starIndex --genomeFastaFiles path/to/genome.fa` \
&#x20;Then, we'll be ablt to execute the alignment step with the following template command \
&#x20;`STAR --genomeDir path/to/starIndex/ --readFilesIn path/to/read1 path/to/read2 --outFileNamePrefix output/` \
&#x20;After the mapping is finished, the mapping statistics can be viewed as `Log.final.out` and the detailed mapping results can be viewed at `Align.out.sam` \
&#x20;Besides Cufflinks, we can also use Kallisto to quantify the gene expressions directly from the raw fastq files. To do this, we need to build the index for Kallisto first with the following template command \
&#x20;`kallisto index -i path/to/output.index path/to/transcriptome.fa` \
&#x20;With index file built, we are able to quantify the gene expressions with the following template command \
&#x20;`kallisto quant -i path/to/output.index -o path/to/outDic path/to/read1.fastq path/to/read2.fastq` \
&#x20;The results can be viewed at `abundance.tsv` where gene expressions are quantified in terms of TPM values.
