3.2 Higher-order data analysis

Computational Approaches for Assessing Higher-order Chromatin

  1. Analytical pipeline

    1.1. Alignment

    1.2. Binning and Generating Contact Matrices

    1.3. Normalization (Balancing)

    1.4. Identification of interactions

    1.5. Visualization

  2. Analytical tools

  3. TAD calling tools and algorithms

3.2.1 Analytical pipeline:

  • Raw reads

  • Pre-processing

  • Alignment (Full read or chimeric alignment)

  • Binning and contact matrix generation

  • Normalization

  • Detection (Intra/Inter-chr interactions)

  • Visualization

  • Analysis and Interpretation

3.2.1.1 Alignment

Hi-C data produced by deep sequencing start out like any other genome-wide deep sequencing dataset: as genomic reads in the standard FASTQ file format.

Reads are filtered to remove duplicates and PCR artifacts. Sequencing adaptors can also be removed prior to alignment.

The goal is simply to find a unique alignment for each read. Because the insert size of a Hi-C ligation product can vary from 1 bp to hundreds of megabases (in terms of linear genome distance), most paired-end alignment modes are difficult to use as is. One straightforward solution is to map each side of the paired-end read independently using a standard alignment procedure.
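Once each mate has been mapped on its own, the two sets of alignments have to be joined back into pairs. The sketch below shows one way this could look, assuming the per-mate alignments have already been parsed into dictionaries keyed by read name. The `Hit` record, the `pair_mates` helper and the MAPQ cutoff of 30 are illustrative assumptions, not part of any particular aligner or pipeline.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    """Hypothetical minimal alignment record for one mate."""
    chrom: str
    pos: int
    strand: str
    mapq: int


def pair_mates(hits1, hits2, min_mapq=30):
    """Join independently mapped mates by read name.

    A pair is kept only when both mates are present and both pass the
    (assumed) mapping-quality cutoff, used here as a proxy for uniqueness.
    """
    pairs = {}
    for name, h1 in hits1.items():
        h2 = hits2.get(name)
        if h2 is None:
            continue  # the other mate did not align
        if h1.mapq < min_mapq or h2.mapq < min_mapq:
            continue  # at least one mate is ambiguously mapped
        # Store the mate with the smaller (chrom, pos) first so that each
        # contact has a single canonical orientation.
        first, second = sorted((h1, h2), key=lambda h: (h.chrom, h.pos))
        pairs[name] = (first, second)
    return pairs


mate1 = {
    "r1": Hit("chr1", 1000, "+", 42),
    "r2": Hit("chr2", 500, "-", 5),    # low MAPQ, will be dropped
    "r3": Hit("chr3", 10, "+", 40),    # no partner, will be dropped
}
mate2 = {
    "r1": Hit("chr1", 5_000_000, "-", 40),
    "r2": Hit("chr1", 700, "+", 42),
}
pairs = pair_mates(mate1, mate2)
```

Note that no constraint is placed on the distance between the two mates, which is exactly what a conventional paired-end mode would impose.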

  • Full-read alignment first: Bowtie2, BWA

  • Unmapped reads: chimeric alignment by read splitting [1] or iterative mapping [2]

  • Sufficient average read depth (mappable reads): 4C (1 to 2 million), 5C (25 million) and Hi-C (8.4 to 100 million) [3]
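The read-splitting idea for chimeric reads can be sketched in a few lines. This is a simplified illustration, not the procedure of any specific tool: it assumes a library digested with a 4 bp cutter recognizing GATC (e.g. MboI or DpnII), for which a filled-in and religated site would read `GATCGATC`. The `split_at_junction` helper and the `min_len` cutoff are hypothetical.

```python
def split_at_junction(read, junction="GATCGATC", min_len=20):
    """Split a chimeric read at the first ligation junction it contains.

    Returns (left, right), or None when there is no junction or when either
    piece would be too short to align reliably (min_len is an assumption).
    """
    idx = read.find(junction)
    if idx == -1:
        return None
    # Cut in the middle of the junction so that each piece keeps one copy
    # of the restriction site.
    cut = idx + len(junction) // 2
    left, right = read[:cut], read[cut:]
    if len(left) < min_len or len(right) < min_len:
        return None
    return left, right


chimeric = "A" * 25 + "GATCGATC" + "C" * 25
pieces = split_at_junction(chimeric)
```

Each returned piece would then be aligned separately, in the same way as the two mates of a pair.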

A detailed description of mapping analysis is given in the Read mapping consideration chapter.

3.2.1.2 Binning and Generating Contact Matrices

What is a bin?

A bin is a fixed, non-overlapping genomic span into which reads are grouped to strengthen the interaction-frequency signal. The interactions that fall between a pair of bins are simply summed to aggregate the signal.
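As a concrete illustration of binning, the sketch below turns a list of mapped contacts into a symmetric contact matrix. It assumes a single chromosome, 0-based positions smaller than `chrom_len`, and contacts given as `(pos1, pos2)` tuples; the function name and the toy data are made up for the example.

```python
import numpy as np


def bin_contacts(pairs, chrom_len, bin_size):
    """Sum contacts into fixed, non-overlapping bins of width bin_size."""
    n_bins = -(-chrom_len // bin_size)  # ceiling division
    matrix = np.zeros((n_bins, n_bins), dtype=np.int64)
    for pos1, pos2 in pairs:
        i, j = pos1 // bin_size, pos2 // bin_size
        matrix[i, j] += 1
        if i != j:
            matrix[j, i] += 1  # keep the matrix symmetric
    return matrix


pairs = [(100, 250_000), (120_000, 260_000), (10, 20), (900_000, 950_000)]
matrix = bin_contacts(pairs, chrom_len=1_000_000, bin_size=100_000)
```

With a bin size of 100 kb, the first contact lands in the cell for bins 0 and 2, while the last two contacts fall on the diagonal.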

Why do we use bins?

  • To overcome the limitation that the signal-to-noise ratio decreases as the distance between two target loci increases.

When do we use bins?

If the goal is to measure large-scale structures, such as genomic compartments, a lower resolution (1 Mb to 10 Mb) will often suffice, and the bin size is chosen accordingly. However, if the goal is to measure specific interactions within a small region, e.g. promoter-enhancer looping, one should use a restriction enzyme that cuts more frequently (e.g. a 4 bp cutter) and a method that does not measure the entire genome but instead explores only a subset of it (i.e. 3C/4C/5C).

How to choose the bin size

Smaller bins are usually used for the more frequent intra-chromosomal interactions, and larger bins for the less frequent inter-chromosomal interactions. The selected bin size should be inversely proportional to the expected number of interactions in a region.
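When a region turns out to be too sparse at the chosen resolution, a fine matrix can be coarsened rather than rebuilt from the reads. The sketch below is one simple way to do this, assuming a square matrix and an integer coarsening factor; `coarsen` is a hypothetical helper, and any ragged edge is padded with zeros.

```python
import numpy as np


def coarsen(matrix, factor):
    """Merge every `factor` adjacent bins by summing their contacts."""
    n = matrix.shape[0]
    pad = (-n) % factor  # bins needed to reach a multiple of factor
    padded = np.pad(matrix, ((0, pad), (0, pad)))  # zero padding
    k = padded.shape[0] // factor
    # View the matrix as k x k blocks of size factor x factor and sum each.
    return padded.reshape(k, factor, k, factor).sum(axis=(1, 3))


fine = np.ones((5, 5), dtype=int)
coarse = coarsen(fine, 2)
```

Summing preserves the total number of contacts, so the coarse matrix trades resolution for a higher count per cell.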

What determines Hi-C resolution?

  • Sequencing coverage: more reads cover more of the interaction space and thus improve the resolution.

  • Library complexity: the total number of unique chimeric molecules that exist in a Hi-C library. A library with low complexity will saturate quickly with increasing sequencing depth.
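The saturation effect can be made concrete with a simple sampling model. The sketch below assumes that every unique molecule in the library is equally likely to be sequenced, in which case the expected number of distinct molecules seen after N reads from a library of complexity C is approximately C(1 - exp(-N/C)). Real libraries deviate from this idealization, and the complexities and depths used are invented for illustration.

```python
import math


def expected_unique(n_reads, complexity):
    """Expected number of distinct molecules observed after n_reads reads,
    under the uniform-sampling assumption described above."""
    return complexity * (1.0 - math.exp(-n_reads / complexity))


depths = [10e6, 50e6, 200e6]
low_complexity = [expected_unique(n, 20e6) for n in depths]
high_complexity = [expected_unique(n, 500e6) for n in depths]
```

Under this model the low-complexity library can never yield more than its 20 million unique molecules, however deeply it is sequenced, whereas the high-complexity library keeps returning new contacts over the same range of depths.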

Bin-level filtering

Prior to matrix balancing, it is advisable to remove any bins (rows/columns) whose signal is either very noisy or too low. These bins are normally found in genomic regions with low mappability or high repeat content, such as around telomeres and centromeres.
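A minimal sketch of such a filter is given below. It only handles the low-signal case, and the rule it uses (drop bins whose total is below a fraction of the median non-zero bin total) is an assumption chosen for simplicity; the function name and the `min_fraction` parameter are hypothetical.

```python
import numpy as np


def filter_low_coverage_bins(matrix, min_fraction=0.1):
    """Zero out rows/columns of bins with too little total signal.

    Returns the filtered matrix and a boolean mask of the bins that were kept.
    """
    m = matrix.astype(float).copy()
    totals = m.sum(axis=0)
    nonzero = totals[totals > 0]
    cutoff = min_fraction * np.median(nonzero) if nonzero.size else 0.0
    bad = (totals <= 0) | (totals < cutoff)
    m[bad, :] = 0.0
    m[:, bad] = 0.0
    return m, ~bad


toy = np.array([
    [10, 5, 3, 0],
    [5, 10, 5, 0],
    [3, 5, 10, 1],
    [0, 0, 1, 0],   # poorly covered bin
])
filtered, keep = filter_low_coverage_bins(toy)
```

The returned mask can be passed on to the balancing step so that the removed bins are ignored there as well.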

3.2.1.3 Normalization (Balancing)

The goal of normalization is to reduce the biases introduced during the experiment and to allow better comparison between the results of different experiments (i.e. to reduce batch effects).

Two types of normalization

  • Explicit normalization:

    • Known bias factors [4]

      • Distance between restriction enzyme cut sites (e.g. for Hi-C)

      • GC content of the trimmed ligation junctions

      • Uniqueness of sequence reads

    • Correction: integrate the bias factors into a prior probabilistic model.

  • Implicit normalization:

    • Iterative correction [5]: based on the assumption that all loci should have equal visibility, since the entire genome is interrogated in an unbiased manner. The correction equalizes the sum of every row/column in the matrix. Faster and preferred.
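The equal-visibility idea can be sketched as a short balancing loop. This is a minimal illustration in the spirit of iterative correction, not a reimplementation of the published method: it assumes a symmetric, non-negative matrix from which unusable bins have already been removed, and the function name, iteration limit and tolerance are arbitrary choices.

```python
import numpy as np


def iterative_correction(matrix, max_iter=200, tol=1e-6):
    """Balance a symmetric contact matrix so that all row sums are equal.

    Returns the balanced matrix and the per-bin bias vector, such that
    balanced[i, j] * bias[i] * bias[j] reproduces the input.
    """
    m = matrix.astype(float).copy()
    n = m.shape[0]
    bias = np.ones(n)
    keep = m.sum(axis=0) > 0  # empty bins are left untouched
    if not keep.any():
        return m, bias
    for _ in range(max_iter):
        sums = m.sum(axis=0)
        factor = np.ones(n)
        # Relative visibility of each bin compared with the average bin.
        factor[keep] = sums[keep] / sums[keep].mean()
        m /= np.outer(factor, factor)
        bias *= factor
        if np.abs(factor[keep] - 1.0).max() < tol:
            break
    return m, bias


rng = np.random.default_rng(0)
a = rng.random((6, 6)) + 0.1
raw = a + a.T  # symmetric toy matrix with strictly positive entries
balanced, bias = iterative_correction(raw)
```

Because the correction is a single multiplicative factor per bin, the bias vector can be stored alongside the raw counts rather than overwriting them.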

3.2.1.4 Identification of interactions

  • TAD calling (for details, see 3.2.3)

    • Tools for calling TADs

    • Algorithms and principles

    • GPU acceleration

  • Separating active (A) and repressive (B) compartments

  • Identifying chromatin loops
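For the compartment step, a commonly used approach is to take the first eigenvector of the correlation matrix of the distance-normalized (observed/expected) contact map. The sketch below follows that recipe on a toy matrix. It assumes a balanced, symmetric, single-chromosome matrix with no empty bins; the function names and the synthetic data are made up for the example.

```python
import numpy as np


def observed_over_expected(matrix):
    """Divide every diagonal by its mean to remove the distance decay."""
    m = matrix.astype(float)
    n = m.shape[0]
    oe = np.zeros_like(m)
    for d in range(n):
        diag = np.diagonal(m, d)
        mean = diag.mean()
        if mean > 0:
            idx = np.arange(n - d)
            oe[idx, idx + d] = diag / mean
            oe[idx + d, idx] = diag / mean
    return oe


def compartment_eigenvector(matrix):
    """First eigenvector of the correlation matrix of the O/E map."""
    corr = np.corrcoef(observed_over_expected(matrix))
    eigenvalues, eigenvectors = np.linalg.eigh(corr)  # ascending order
    return eigenvectors[:, -1]


# Toy map: two alternating bin classes, with stronger contacts within a
# class, on top of a simple distance decay.
labels = np.array([0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1])
positions = np.arange(labels.size)
dist = np.abs(np.subtract.outer(positions, positions))
same = labels[:, None] == labels[None, :]
toy = np.where(same, 3.0, 1.0) / (1.0 + dist)

oe = observed_over_expected(toy)
pc1 = compartment_eigenvector(toy)
```

Bins are assigned to one compartment or the other by the sign of their eigenvector entry. The sign itself is arbitrary, so an external track such as gene density or GC content is typically needed to decide which sign corresponds to compartment A.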

3.2.1.5 Visualization

  • See Chapter 4

3.2.2 Analytical tools:

Techniques and tool descriptions:

  • 4C

    • Uses R to detect specific interactions between DNA elements and identify differential

  • 5C

    • Python package for normalization and analysis of chromatin structural data produced using either

  • Hi-C

    • Assigns statistical confidence to mid-range cis-chromosomal contacts

    • Designed for high-resolution Hi-C data

    • Identifies chromatin interactions in a genome

    • Detect sub-TAD chromatin interactions (cis)

    • Aligns, filters and normalizes, identifies and compares TADs, loops and compartments and display using Juicebox

    • Aligns, quality control, inter-intra contact maps, fast iterative correction, allele specific contact maps

A comprehensive list of tools for Hi-C data analysis can be found here. Next, we'll use Hi-C Pro as a showcase of the Hi-C data analysis workflow (see the Hi-C Pro Pipeline chapter).

3.2.3 TAD calling tools and algorithms

A brief overview of different tools.

Tool descriptions, with implementation language:

  • TADbit includes a quality control module and aligns reads to the reference (Python)

  • Identifies hierarchical topological domains (Python)

  • Uses dynamic programming to call TADs at different resolutions (C++)

  • Arrowhead is an algorithm for finding contact domains (Java)

A performance comparison of the different tools is surveyed here.
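To give a feel for what a TAD caller looks for, the sketch below computes an insulation-style score along the diagonal of a contact matrix. This is a deliberately simple heuristic and not the algorithm of any tool described in this section: for each position it averages the contacts that cross it within a small window, so a local minimum suggests a domain boundary. The function name, the window size and the toy matrix are assumptions.

```python
import numpy as np


def insulation_score(matrix, window):
    """Mean contact frequency crossing each potential boundary.

    score[i] summarizes contacts between the `window` bins upstream of
    bin i and the `window` bins starting at bin i. Positions too close
    to the matrix edge are left as NaN.
    """
    n = matrix.shape[0]
    score = np.full(n, np.nan)
    for i in range(window, n - window + 1):
        score[i] = matrix[i - window:i, i:i + window].mean()
    return score


# Toy map with two domains of five bins each and weak contacts between them.
toy = np.ones((10, 10))
toy[:5, :5] = 10.0
toy[5:, 5:] = 10.0

score = insulation_score(toy, window=2)
boundary = int(np.nanargmin(score))
```

On this toy map the score dips at the position where the second domain starts. Real data would call for a balanced matrix, a window matched to the bin size, and some criterion for deciding which minima are strong enough to report.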

References:

[1] Computational Methods for Assessing Chromatin Hierarchy

[2] The Hitchhiker's Guide to Hi-C Analysis: Practical guidelines

[3] Comparison of computational methods for Hi-C data analysis
