3.2 Higher-order data analysis
Computational Approaches for Assessing Higher-order Chromatin
Analytical pipeline
1.1. Alignment
1.2. Binning and Generating Contact Matrices
1.3. Normalization (Balancing)
1.4. Identification of interactions
1.5. Visualization
Analytical tools
TAD calling tools and algorithms
3.2.1 Analytical pipeline:
Raw reads
Pre-processing
Alignment (Full read or chimeric alignment)
Binning & contact matrix generation
Normalization
Detection (Intra/Inter-chr interactions)
Visualization
Analysis and Interpretation
3.2.1.1 Alignment
Hi-C data produced by deep sequencing is no different from other genome-wide deep sequencing datasets. The data starts out as genomic reads in the traditional FASTQ file format.
Reads are filtered to remove duplicates and PCR artifacts. Sequencing adaptors can also be removed prior to alignment.
The goal is simply to find a unique alignment for each read. Because the insert size of a Hi-C ligation product can vary from 1 bp to hundreds of megabases (in terms of linear genome distance), most paired-end alignment modes are difficult to use as is. One straightforward solution is to map each side of the paired-end read independently using a standard alignment procedure.
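Once each side has been mapped independently, the two sets of alignments have to be joined back into pairs. The following is a minimal sketch of that step, not the procedure of any particular pipeline. The `Alignment` record, the `pair_mates` function and the mapping-quality cutoff of 30 (used here as a stand-in for "uniquely aligned") are all assumptions made for illustration.

```python
from typing import Dict, Iterable, List, NamedTuple, Tuple


class Alignment(NamedTuple):
    """Hypothetical minimal record for one aligned read end."""
    read_name: str  # name shared by both ends of a pair
    chrom: str
    pos: int        # leftmost mapped position
    mapq: int       # mapping quality reported by the aligner


def pair_mates(
    mates1: Iterable[Alignment],
    mates2: Iterable[Alignment],
    min_mapq: int = 30,  # assumed cutoff, not taken from the text
) -> List[Tuple[Alignment, Alignment]]:
    """Join two independently aligned read ends by read name.

    A pair is kept only if both ends pass the mapping-quality cutoff.
    """
    # Index the second ends that pass the cutoff by read name.
    by_name: Dict[str, Alignment] = {
        aln.read_name: aln for aln in mates2 if aln.mapq >= min_mapq
    }
    pairs = []
    for aln1 in mates1:
        aln2 = by_name.get(aln1.read_name)
        if aln1.mapq >= min_mapq and aln2 is not None:
            pairs.append((aln1, aln2))
    return pairs
```

Because the two ends are never required to lie within a fixed insert size of each other, pairs that span megabases or different chromosomes are kept just like short-range ones.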
Full-read alignment is attempted first, with an aligner such as Bowtie2 or BWA.
Number of mappable reads generally considered sufficient: 4C (1 to 2 million), 5C (25 million) and Hi-C (8.4 to 100 million) [3].
A detailed description of mapping analysis is given in the Read mapping consideration chapter.
3.2.1.2 Binning and Generating Contact Matrices
What is a bin?
A bin is a fixed, non-overlapping genomic span into which reads are grouped to increase the signal of the interaction frequency. The interactions between bins are simply summed to aggregate the signal.
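The summing step can be sketched in a few lines. This is an illustrative example for a single chromosome, assuming contacts are already available as pairs of 0-based positions; the function names `bin_contacts` and `to_dense` are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def bin_contacts(pairs: Iterable[Tuple[int, int]], bin_size: int) -> Counter:
    """Count contacts between fixed-size, non-overlapping bins.

    Each position is assigned to bin `pos // bin_size`. Keys are
    (bin_i, bin_j) with bin_i <= bin_j, so each contact is counted once.
    """
    counts: Counter = Counter()
    for pos1, pos2 in pairs:
        i, j = pos1 // bin_size, pos2 // bin_size
        if i > j:
            i, j = j, i
        counts[(i, j)] += 1
    return counts


def to_dense(counts: Counter, n_bins: int) -> List[List[int]]:
    """Expand the sparse counts into a symmetric n_bins x n_bins matrix."""
    matrix = [[0] * n_bins for _ in range(n_bins)]
    for (i, j), count in counts.items():
        matrix[i][j] = count
        matrix[j][i] = count
    return matrix
```

For example, `to_dense(bin_contacts(pairs, 1_000_000), n_bins)` would give a contact matrix at 1 Mb resolution; a larger `bin_size` pools more read pairs into each cell.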
Why do we use bins?
To overcome the limitation that the signal-to-noise ratio decreases as the distance between two target loci increases.
When do we use bins?
If the goal is to measure large-scale structures, such as genomic compartments, a lower resolution (1 Mb to 10 Mb) will often suffice, and the bin size is chosen accordingly. However, if the goal is to measure specific interactions within a small region, e.g. promoter-enhancer looping, one should use a restriction enzyme that cuts more frequently (e.g. a 4-bp cutter) and a method that focuses on a subset of the genome rather than measuring all of it (i.e. 3C/4C/5C).
How to choose the bin size
Smaller bins are usually used for the more frequent intra-chromosomal interactions, and larger bins for the less frequent inter-chromosomal interactions. The selected bin size should be inversely proportional to the expected number of interactions in a region.
What determines Hi-C resolution?
Sequencing coverage: more reads cover more of the interaction space and thus improve the resolution.
Library complexity: the total number of unique chimeric molecules that exist in a Hi-C library. A library with low complexity saturates quickly as sequencing depth increases.
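The saturation effect can be illustrated with a small worked model. This is a sketch under a strong assumption that is not part of the text above: every unique molecule in the library is equally likely to be sequenced. Under that idealization, the expected number of distinct molecules seen after N reads from a library of C unique molecules is C(1 - e^(-N/C)). The function name and the example numbers are hypothetical.

```python
import math


def expected_unique_molecules(total_reads: float, complexity: float) -> float:
    """Expected distinct molecules observed after `total_reads` draws
    from a library of `complexity` equally likely unique molecules
    (uniform-sampling assumption)."""
    return complexity * (1.0 - math.exp(-total_reads / complexity))


if __name__ == "__main__":
    # Hypothetical libraries: 50 million vs 500 million unique molecules.
    for complexity in (50e6, 500e6):
        for depth in (50e6, 200e6, 800e6):
            unique = expected_unique_molecules(depth, complexity)
            print(
                f"complexity={complexity:.0e} depth={depth:.0e} "
                f"unique/read={unique / depth:.2f}"
            )
```

In this model the number of unique molecules can never exceed the library complexity, so in a low-complexity library additional sequencing mostly yields duplicates.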
Bin-level filtering
Prior to matrix balancing, it is advisable to remove any bins (rows/columns) whose signal is either very noisy or too low. These bins are normally found in genomic regions with low mappability or high repeat content, such as around telomeres and centromeres.
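A minimal sketch of the low-signal half of this filter is shown below. The criterion used here (total bin signal below a fraction of the median non-empty bin) and the default fraction of 0.1 are assumptions chosen for illustration; real pipelines use their own thresholds, and detecting noisy bins is not covered.

```python
import statistics
from typing import List, Set


def low_coverage_bins(matrix: List[List[float]], min_fraction: float = 0.1) -> Set[int]:
    """Return indices of bins whose total signal is below
    `min_fraction` times the median total of the non-empty bins."""
    totals = [sum(row) for row in matrix]
    nonzero = [t for t in totals if t > 0]
    if not nonzero:
        return set(range(len(matrix)))
    cutoff = min_fraction * statistics.median(nonzero)
    return {i for i, total in enumerate(totals) if total < cutoff}


def mask_bins(matrix: List[List[float]], bad: Set[int]) -> List[List[float]]:
    """Return a copy of the matrix with the rows and columns of the
    flagged bins set to zero."""
    n = len(matrix)
    return [
        [0 if (i in bad or j in bad) else matrix[i][j] for j in range(n)]
        for i in range(n)
    ]
```

Zeroing the flagged rows and columns, rather than deleting them, keeps bin indices aligned with genomic coordinates.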
3.2.1.3 Normalization (Balancing)
The goal of normalization is to reduce biases introduced during the experiment and to allow better comparison between the results of different experiments (reducing batch effects).
Two types of normalization:
Explicit normalization:
Known bias factors [4]:
Distance between restriction enzyme cut sites (e.g. for Hi-C)
GC content of trimmed ligation junctions
Uniqueness of sequence reads
Correction: integrate the bias factors into a prior probabilistic model.
Implicit normalization:
Iterative correction [5] is based on the assumption that all loci should have equal visibility, since the entire genome is interrogated in an unbiased manner. It works by equalizing the sum of every row/column in the matrix. It is faster and generally preferred.
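The row/column equalization can be sketched as follows. This is a simplified pure-Python illustration of the idea for a small symmetric matrix, not the implementation from [5]; the function name, the iteration limit and the tolerance are assumptions.

```python
from typing import List, Tuple


def iterative_correction(
    matrix: List[List[float]],
    max_iter: int = 200,   # assumed iteration limit
    tol: float = 1e-6,     # assumed convergence tolerance
) -> Tuple[List[List[float]], List[float]]:
    """Balance a symmetric contact matrix so that all non-empty
    rows (and columns) have the same sum.

    Returns (corrected_matrix, bias), where
    corrected[i][j] = matrix[i][j] / (bias[i] * bias[j]).
    """
    n = len(matrix)
    corrected = [[float(v) for v in row] for row in matrix]
    bias = [1.0] * n
    for _ in range(max_iter):
        sums = [sum(row) for row in corrected]
        nonzero = [s for s in sums if s > 0]
        if not nonzero:
            break
        mean = sum(nonzero) / len(nonzero)
        # Per-bin factor relative to the mean; empty bins are left alone.
        factors = [s / mean if s > 0 else 1.0 for s in sums]
        for i in range(n):
            for j in range(n):
                corrected[i][j] /= factors[i] * factors[j]
        bias = [b * f for b, f in zip(bias, factors)]
        if max(abs(f - 1.0) for f in factors) < tol:
            break
    return corrected, bias
```

No bias source is modeled explicitly here: whatever makes a bin more or less visible ends up absorbed into its single `bias` value, which is what makes the approach implicit.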
3.2.1.4 Identification of interactions
TAD calling (for details, see 3.2.3)
Tools for calling TADs
Algorithms and principles
GPU acceleration
Separating active/repressive (A/B) compartments
Identifying chromatin loops
3.2.1.5 Visualization
See Chapter 4.
3.2.2 Analytical tools:
| Technique | Tool | Description |
| --- | --- | --- |
| 4C | | Uses R to detect specific interactions between DNA elements and identify differential |
| 5C | | Python package for normalization and analysis of chromatin structural data produced using either |
| Hi-C | | Assigns statistical confidence to mid-range cis-chromosomal contacts |
| Hi-C | | Designed for high-resolution Hi-C data |
| Hi-C | | Identifies chromatin interactions in a genome |
| Hi-C | | Detects sub-TAD chromatin interactions (cis) |
| Hi-C | | Aligns, filters and normalizes; identifies and compares TADs, loops and compartments; displays results using Juicebox |
| Hi-C | | Alignment, quality control, inter-/intra-chromosomal contact maps, fast iterative correction, allele-specific contact maps |
A comprehensive list of tools for Hi-C data analysis can be found here. Next, we use HiC-Pro as a showcase of the Hi-C data analysis workflow (see the Hi-C Pro Pipeline chapter).
3.2.3 TAD calling tools and algorithms
A brief overview of different tools.
| Tool | Description | Language |
| --- | --- | --- |
| | TADbit includes a quality control module and aligns reads to the reference | Python |
| | Identifies hierarchical topological domains | Python |
| | Uses dynamic programming to call TADs at different resolutions | C++ |
| | Arrowhead is an algorithm for finding contact domains | Java |
Performance comparisons of the different tools are surveyed here.
References:
[1] Computational Methods for Assessing Chromatin Hierarchy
[2] The Hitchhiker's Guide to Hi-C Analysis: Practical guidelines
[3] Comparison of computational methods for Hi-C data analysis