3.1 Primary order analysis
Last updated
Last updated
Features of data
Analytical pipeline
2.1. Step1 Raw reads
2.2. Step2 Pre-processing & Alignment
2.3. Step3 Peak calling
2.4. Step4 Chromatin Accessibility Analysis
2.5. Step5 Analysis and Interpretation
Introduction of MACS
Genomic regions with dense nucleosomes are more tightly packed (i.e., “closed”), whereas nucleosome- depleted regions are more accessible (i.e., “open”) for interactions with regulators and are therefore regarded as the primary locations of regulatory elements.
Figure1. Fragmentation methods and read out features comparison between four methods. MNase is specially sensitive to open chromotin areas and it is exonucleas which maps regions that are protected by nucleosomes (large open regions are digested). DNase-seq and ATAC-seq are used to sequence and map exposed regions of DNA. © The Scientist.
ATAC-seq can be further: estimating the percentage of sequence reads that map to the mitochondrial genome(lower is better).
Reads are filtered to remove redundancy and adaptors, also size selection (selective for some methods).
MNase: 25-50bp fragments represent TF-binding sites.
ATAC:
reads length below 38bp are removed,as 38 bp is the minimal distance between neighboring transposition sites generated by the Tn5 transposase.
mitochondrial genome reads removed.
Full-read approach
Bowtie2
Burrows-Wheeler Aligner (BWA)
Chimeric alignment (unmapped from previous step, which may contains ligation junction)
BWA
STAR
ChimeraScan
MNase: 150-200 Million reads.
DNase / FAIRE: 25-50 M
ATAC: 50-160 M
Peak calling basically means we identify possible open regions or nucleosome location through the tiling reads depth. This step is the most critical for chromatin accessibility profiling, revealing nucleosome-dense, closed regions (MNase-seq) or open chromatin regions (DNase-seq, FAIRE-seq, and ATAC-seq) [5].
Midpoint of nucleosome:
single-end mode: shifting the ends 73 bp from 5' end, or extending the ends from 120 to 147 bp in the 3′ direction.
paired end mode: mid of forward and reverse reads.
Peak calling tools:
Tools | Description | Reference |
GeneTrack | Gaussian smoothing generate prob-based map based on predefined exclusion distance | |
iNPS | wave-like structure of data first derivative of the Gaussian smoothed profile | |
DANPOS | identification of dynamic nucleosomes positions |
Specific tools for DNase-seq
Tools | Description | Reference |
F-seq | smooth Gaussian kernel density estimation higher accuracy and sensitivity | |
Hotspot | reports statistical significance for identified DHSs (DNase I hypersensitive site) |
Specific tools for FAIRE-seq
Tools | Description | Reference |
MACS2 | more parameter settingssettings can be converted to P-values empirically |
Tools | Description | Reference |
ZINBA | regression model to identify enriched regions without a prior regions within a defined distance are combined to form a broad region shape-detection function for sharp signal scan combine prior like GC content when noise is high | |
MACS | use Poisson distribution as a background model |
Paired-end sequencing is performed for ATAC-seq. The read start sites require adjustment because the Tn5 transposase binds as a dimer and inserts adaptors separated by 9bp. Generally, reads aligned to the + strand are offset by +4 bp, and reads aligning to the - strand are offset by −5 bp [5]. This method can detect both broad regions (few kb) or small regions (50-500bp).
Common peak calling tools:
MACS2, ZINBA, F-seq, HOMER. "atac-seq"-R package.
Accessible regions are determined based on peak-calling results. With peak files we go downstream analysis with different purpose:
MNase-seq | DNase-seq | FAIRE-seq | ATAC-seq | |
Identification Objectives | Nucleosome positioning | TF footprint | CORE locations | Nucleosome pos and TF footprint |
Tools | Not available |
|
Data annotation and integration represents the final and most informative stage of analysis. After we get the nucleosome positions and TF binding sites, it is desirable to further interpret them in the light of relevant information.
In combination with (promoters, introns, intergenic regions, TSSs, TTSs) information ---- BedTools.
Discovery TF binding events based on known knowledge or de novo ways.
A comprehensive list for peak calling softwares can be found at here. Few widely-used tools are listed below.
Tools | Description |
MACS2 | (MACS1.4)Most widely used peak caller. Can detect narrow and broad peaks. |
Epic (SICER) | specialized for broad peaks |
BayesPeak R/Bioconductor | Jmosaics Detects enriched regions jointly from replicates |
EDD | Detects megabase domain enrichment |
GEM | Peak calling and motif discovery for ChIP-seq and ChIP-exo |
SPP | Fragment length computation and saturation analysis to determine if read depth is |
MACS is on of the most popular peak calling tools and is developed in X. Shirley Liu’s lab at Harvard University. We'll use MACS as an example to show you the basic idea behind peak calling and practical implementation. The original paper introduce MACS can be found here.
Core idea of the algorithm
The combination of TF and genome is a relatively random process, that's to say that every position on the genome have the chance to be seen by the TF (however, with different probability). Peak calling is aimed to find those hot spot that is easily seen.
How can we identify whether a spot is "hot"? Suppose we are sequencing a group of cells, then a hot spot is presumably be covered more frequently than other spots. This process can be viewed as a binomial distribution and when the number of reads n is large and the prob p of each location to be covered by a read is small, then the binomial distribution became Poisson distribution.
An event can occur 0, 1, 2, … times in an interval. The average number of events in an interval is designated (lambda). Lambda is the event rate, also called the rate parameter. The probability of observing events in an interval is given by the equation: .
Here the .
n: the reads number from sequencing.
l: length of a single read.
s: length of the whole genome.
With this basic model then we can compute a threshold depth of a spot by setting a confidence probability. Howver, this is only a baby model which may fail in real circumstances due to lots of bias of the experiment.
Bias in ChIP-seq experiment
In doing ChIP-seq analysis, it's important to think about biases that can affect your results. There are a number of issues, such as issues with chromatin accessibility that are going to affect how your DNA gets fragmented, issues with amplification, repetitive regions, which are going to be difficult to map back to. And so it's very important in a ChIP-seq experiment to use some kind of control, and a common control is to use input DNA control, where we have data that's, that is fragmented, but the immunoprecipitation, or the, the antibody pull down has not been performed. (A comprehensive review for bias can be found here).
Using control data
Determining the background expectation of the number of peaks you would see and use that in making your peak calls. It also allows for MACS to compute a false discovery rate. Without a control, MACS will not compute a false discovery rate. But using a control, it's able to to model that background.
Generate profile from combined tag
Another problem is that, the reads we got do not indicate the real potion of a TF, reads are pulled down together with a TF. In that case we usually get a double peak result which desired further analysis. MACS solved this problem by shifting tags of d/2, where d is the discrepancy between forward and backward peaks around one TF. An illustration can be seen here.
Implementation Pipeline
Install MACS (version 1.4 or 2.0).
Change the environment path
Align the reads to the genome:
Run MACS:
The help docs of comprehensive parameters here. After alignment we'll get four files: CTCF_model.r,CTCF_peaks.bed,CTCF_peaks.xls,CTCF_summits.bed. The double peaks figure is:
The CTCF_peak.bed records the location of each peak and can be further processed with the tools like Bedtools.