3.1 Primary order analysis

Assessment of Primary-order Chromatin

Features of data
Analytical pipeline
2.1. Step1 Raw reads
2.2. Step2 Pre-processing & Alignment
2.3. Step3 Peak calling
2.4. Step4 Chromatin Accessibility Analysis
2.5. Step5 Analysis and Interpretation
Introduction of MACS

Genomic regions with dense nucleosomes are more tightly packed (i.e., “closed”), whereas nucleosome- depleted regions are more accessible (i.e., “open”) for interactions with regulators and are therefore regarded as the primary locations of regulatory elements.

3.2.1 Features of data

Figure1. Fragmentation methods and read out features comparison between four methods. MNase is specially sensitive to open chromotin areas and it is exonucleas which maps regions that are protected by nucleosomes (large open regions are digested). DNase-seq and ATAC-seq are used to sequence and map exposed regions of DNA. © The Scientist.

3.2.2 Analytical pipeline:

Step1 Raw reads (Quality control)

Composite plots to check experiment sucess: eg, TSSs shown t be open.---- ArchTEX[1], CEAS[2].
ATAC-seq can be further: estimating the percentage of sequence reads that map to the mitochondrial genome(lower is better).
Use genome browser to see raw tag density ---- UCSC[3], IGV[4], GIVE.

Step2 Pre-processing & Alignment

Reads are filtered to remove redundancy and adaptors, also size selection (selective for some methods).

Size selection:

MNase: 25-50bp fragments represent TF-binding sites.
ATAC:
- reads length below 38bp are removed,as 38 bp is the minimal distance between neighboring transposition sites generated by the Tn5 transposase.
- mitochondrial genome reads removed.

Alignment

Full-read approach
- Bowtie2
- Burrows-Wheeler Aligner (BWA)
Chimeric alignment (unmapped from previous step, which may contains ligation junction)
- BWA
- STAR
- ChimeraScan

Minimal reads number

MNase: 150-200 Million reads.
DNase / FAIRE: 25-50 M
ATAC: 50-160 M

Step3 Peak calling

Peak calling basically means we identify possible open regions or nucleosome location through the tiling reads depth. This step is the most critical for chromatin accessibility profiling, revealing nucleosome-dense, closed regions (MNase-seq) or open chromatin regions (DNase-seq, FAIRE-seq, and ATAC-seq) [5].

For MNase-seq

Midpoint of nucleosome:

single-end mode: shifting the ends 73 bp from 5' end, or extending the ends from 120 to 147 bp in the 3′ direction.
paired end mode: mid of forward and reverse reads.

Peak calling tools:

Tools

Description

Reference

GeneTrack

Gaussian smoothing generate prob-based map based on predefined exclusion distance

[6]

iNPS

wave-like structure of data first derivative of the Gaussian smoothed profile

[7]

DANPOS

identification of dynamic nucleosomes positions

[8]

For DNase-seq and FAIRE-seq

Specific tools for DNase-seq

Tools

Description

Reference

F-seq

smooth Gaussian kernel density estimation higher accuracy and sensitivity

[9]

Hotspot

reports statistical significance for identified DHSs (DNase I hypersensitive site)

[10]

Specific tools for FAIRE-seq

Tools

Description

Reference

MACS2

more parameter settingssettings can be converted to P-values empirically

[11]

General peak calling tools

Tools

Description

Reference

ZINBA

regression model to identify enriched regions without a prior regions within a defined distance are combined to form a broad region shape-detection function for sharp signal scan combine prior like GC content when noise is high

[12]

MACS

use Poisson distribution as a background model

[11]

For ATAC-seq

Paired-end sequencing is performed for ATAC-seq. The read start sites require adjustment because the Tn5 transposase binds as a dimer and inserts adaptors separated by 9bp. Generally, reads aligned to the + strand are offset by +4 bp, and reads aligning to the - strand are offset by −5 bp [5]. This method can detect both broad regions (few kb) or small regions (50-500bp).

Common peak calling tools:

MACS2, ZINBA, F-seq, HOMER. "atac-seq"-R package.

Step4 Chromatin Accessibility Analysis

Accessible regions are determined based on peak-calling results. With peak files we go downstream analysis with different purpose:

MNase-seq

DNase-seq

FAIRE-seq

ATAC-seq

Identification Objectives

Nucleosome positioning

TF footprint

CORE locations

Nucleosome pos and TF footprint

Tools

Nucleosome positioning algorithms [13,14,15,16]
Nucleosome occupancy algorithms [17,18]
V-plots for TF occupancy [19]

Digital genomic footprinting algorithms
[20, 21]
Nucleosome and TF occupancy algorithms [22]
CENTIPEDE [23]

Not available

Digital genomic footprinting algorithms
CENTIPEDE

Step5 Analysis and Interpretation

Data annotation and integration represents the final and most informative stage of analysis. After we get the nucleosome positions and TF binding sites, it is desirable to further interpret them in the light of relevant information.

In combination with (promoters, introns, intergenic regions, TSSs, TTSs) information ---- BedTools.
Discovery TF binding events based on known knowledge or de novo ways.

3.2.3 Analytical Tools for peak calling:

Brief view of different tools

A comprehensive list for peak calling softwares can be found at here. Few widely-used tools are listed below.

Tools

Description

MACS2

(MACS1.4)Most widely used peak caller. Can detect narrow and broad peaks.

Epic (SICER)

specialized for broad peaks

BayesPeak R/Bioconductor

Jmosaics Detects enriched regions jointly from replicates

EDD

Detects megabase domain enrichment

GEM

Peak calling and motif discovery for ChIP-seq and ChIP-exo

SPP

Fragment length computation and saturation analysis to determine if read depth is

Introduction of MACS

MACS is on of the most popular peak calling tools and is developed in X. Shirley Liu’s lab at Harvard University. We'll use MACS as an example to show you the basic idea behind peak calling and practical implementation. The original paper introduce MACS can be found here.

Core idea of the algorithm

The combination of TF and genome is a relatively random process, that's to say that every position on the genome have the chance to be seen by the TF (however, with different probability). Peak calling is aimed to find those hot spot that is easily seen.

How can we identify whether a spot is "hot"? Suppose we are sequencing a group of cells, then a hot spot is presumably be covered more frequently than other spots. This process can be viewed as a binomial distribution and when the number of reads n is large and the prob p of each location to be covered by a read is small, then the binomial distribution became Poisson distribution.

An event can occur 0, 1, 2, … times in an interval. The average number of events in an interval is designated $\lambda$ (lambda). Lambda is the event rate, also called the rate parameter. The probability of observing $k$ events in an interval is given by the equation: $P(k \text{ events in interval}) = e^{-\lambda}\frac{\lambda^k}{k!}$ .

Here the $\lambda=n*p, p=\frac{l}{s}$ .

n: the reads number from sequencing.
l: length of a single read.
s: length of the whole genome.

With this basic model then we can compute a threshold depth of a spot by setting a confidence probability. Howver, this is only a baby model which may fail in real circumstances due to lots of bias of the experiment.

Bias in ChIP-seq experiment

In doing ChIP-seq analysis, it's important to think about biases that can affect your results. There are a number of issues, such as issues with chromatin accessibility that are going to affect how your DNA gets fragmented, issues with amplification, repetitive regions, which are going to be difficult to map back to. And so it's very important in a ChIP-seq experiment to use some kind of control, and a common control is to use input DNA control, where we have data that's, that is fragmented, but the immunoprecipitation, or the, the antibody pull down has not been performed. (A comprehensive review for bias can be found here).

Using control data

Determining the background expectation of the number of peaks you would see and use that in making your peak calls. It also allows for MACS to compute a false discovery rate. Without a control, MACS will not compute a false discovery rate. But using a control, it's able to to model that background.

Generate profile from combined tag

Another problem is that, the reads we got do not indicate the real potion of a TF, reads are pulled down together with a TF. In that case we usually get a double peak result which desired further analysis. MACS solved this problem by shifting tags of d/2, where d is the discrepancy between forward and backward peaks around one TF. An illustration can be seen here.

Implementation Pipeline

Install MACS (version 1.4 or 2.0).

tar xvzf MACS-1.4.2-1.tar.gz
cd MACS-1.4.2
python setup.py install –prefix /your_directory/

Change the environment path

export PATH = /your_directory/bin:$PATH
export PYTHONPATH = /your_directory/lib/python2.X/site-packages/:$PYTHONPATH

Align the reads to the genome:

bowtie –m 1 -S -q /path_to/mm10 CTCF.fastq CTCF.sam

Run MACS:

macs14 -t CTCF.sam -c Control.sam -n CTCF –g mm

The help docs of comprehensive parameters here. After alignment we'll get four files: CTCF_model.r，CTCF_peaks.bed，CTCF_peaks.xls，CTCF_summits.bed. The double peaks figure is:

 Rscript CTCF_model.r

The CTCF_peak.bed records the location of each peak and can be further processed with the tools like Bedtools.

PreviousChap3 Computational analysis Next3.2 Higer order data analysis

Last updated 6 years ago