3.2 Higher-order data analysis

  1. Analytical pipeline
    1.1. Alignment
    1.2. Binning and Generating Contact Matrices
    1.3. Normalization (Balancing)
    1.4. Identification of interactions
    1.5. Visualization
  2. Analytical tools
  3. TAD calling tools and algorithms

  • Raw reads
  • Pre-processing
  • Alignment (Full read or chimeric alignment)
  • Binning & contact matrix generation
  • Normalization
  • Detection (Intra/Inter-chr interactions)
  • Visualization
  • Analysis and Interpretation

Hi-C data produced by deep sequencing is no different from other genome-wide deep sequencing datasets: the data starts out as genomic reads in the traditional FASTQ file format.
Reads are filtered to remove duplicates and PCR artifacts. Sequencing adaptors can also be removed prior to alignment.
The goal of alignment is simply to find a unique alignment for each read. Because the insert size of a Hi-C ligation product can vary from 1 bp to hundreds of megabases (in terms of linear genomic distance), most paired-end alignment modes are difficult to use as-is. One straightforward solution is to map each side of the paired-end read separately and independently using a standard single-end alignment procedure.
  • Full-read alignment first ---- Bowtie2, BWA
  • Unmapped reads: chimeric alignment ---- read splitting [1], iterative mapping [2]
  • Average sufficient read depth (sufficient mappable reads): 4C (1–2 million), 5C (25 million) and Hi-C (8.4 to 100 million) [3]
A detailed description of mapping analysis is covered in the Read mapping consideration chapter.
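A minimal sketch of the read-splitting strategy for chimeric alignment [1]: trim each read at the first ligation junction and align only the 5' portion. The junction sequence is enzyme-specific; GATCGATC is assumed here for an MboI library with biotin fill-in, and a real pipeline would pass the trimmed reads to Bowtie2/BWA.

```python
# Read-splitting sketch: if a read spans a ligation junction, keep only
# the 5' portion (plus the first half of the junction) for alignment.

JUNCTION = "GATCGATC"  # assumed MboI fill-in junction; enzyme-specific

def trim_at_junction(read: str, junction: str = JUNCTION) -> str:
    """Return the 5' part of the read up to the middle of the first
    ligation junction; the read is unchanged if no junction is found."""
    pos = read.find(junction)
    if pos == -1:
        return read                          # no junction: align full read
    return read[:pos + len(junction) // 2]   # keep 5' fragment + GATC

print(trim_at_junction("ACCTTGGATCGATCTTGCA"))  # ACCTTGGATC
```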

What is a bin?
A bin is a fixed, non-overlapping genomic span into which reads are grouped to increase the signal of the interaction frequency. The interactions between bins are simply summed up to aggregate the signals.
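The binning step can be sketched in a few lines: each mate's coordinate is assigned to a fixed-size bin, and pairwise counts are accumulated into a symmetric contact matrix. A toy single-chromosome example (bin size, chromosome length and coordinates are all illustrative):

```python
import numpy as np

BIN_SIZE = 1_000_000    # 1 Mb bins (illustrative)
CHROM_LEN = 10_000_000  # 10 Mb toy chromosome

n_bins = CHROM_LEN // BIN_SIZE
matrix = np.zeros((n_bins, n_bins))

# (pos1, pos2) coordinates of valid Hi-C read pairs (toy data)
pairs = [(150_000, 2_300_000), (180_000, 2_900_000), (5_100_000, 5_400_000)]

for pos1, pos2 in pairs:
    i, j = pos1 // BIN_SIZE, pos2 // BIN_SIZE
    matrix[i, j] += 1
    if i != j:
        matrix[j, i] += 1  # keep the matrix symmetric

print(matrix[0, 2])  # 2.0 -- two read pairs link bin 0 and bin 2
```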
Why do we use bins?
  • Using a 6-bp cutting restriction enzyme, there are almost a million (~10^6) restriction fragments in a mammalian genome, leading to an interaction space on the order of 10^11 possible pairwise interactions. Thus, achieving sufficient coverage to support maximal resolution is a significant challenge. It is critical to set goals (what resolution you desire) in order to choose the proper bin size.
  • To overcome the limitation that the signal-to-noise ratio decreases as the distance between two target loci increases.
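A back-of-envelope check of these orders of magnitude, assuming a 3 Gb genome and the expected cut spacing of a 6-bp cutter (one site every 4^6 bp on average):

```python
genome_size = 3e9        # assumed mammalian genome size (bp)
avg_fragment = 4 ** 6    # a 6-bp cutter cuts on average every 4096 bp

n_fragments = genome_size / avg_fragment       # number of fragments
n_pairs = n_fragments * (n_fragments - 1) / 2  # possible pairwise interactions

print(f"{n_fragments:.1e} fragments, {n_pairs:.1e} possible pairs")
# 7.3e+05 fragments, 2.7e+11 possible pairs
```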
When do we use bins?
If the goal is to measure large-scale structures, such as genomic compartments, then a lower resolution will often suffice (1 Mb–10 Mb bins). However, if the goal is to measure specific interactions within a small region, e.g. promoter–enhancer looping, then one should choose a restriction enzyme that cuts more frequently (e.g. a 4-bp cutter) and a method that does not measure the entire genome but instead focuses on exploring only a subset of the genome (i.e. 3C/4C/5C).
How to choose a bin size?
Smaller bins are usually used for the more frequent intra-chromosomal interactions, and larger bins for the less frequent inter-chromosomal interactions. The selected bin size should be inversely proportional to the expected number of interactions in a region.
What decides Hi-C resolution?
  • Sequencing coverage: more reads cover more of the interaction space and thus improve the resolution.
  • Library complexity: the total number of unique chimeric molecules that exist in a Hi-C library. A library with low complexity will saturate quickly with increasing sequencing depth.
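To see why a low-complexity library saturates, a standard Poisson-style model (an assumption here, not a formula from this chapter) predicts the expected number of distinct molecules observed after N reads from a library of C unique molecules as U = C(1 − e^(−N/C)):

```python
import math

def expected_unique(n_reads: float, complexity: float) -> float:
    """Expected number of distinct molecules seen after n_reads draws
    from a library containing `complexity` unique molecules."""
    return complexity * (1 - math.exp(-n_reads / complexity))

# Sequencing 100M reads from two hypothetical libraries:
low = expected_unique(100e6, 50e6)    # low-complexity library, saturates
high = expected_unique(100e6, 500e6)  # high-complexity library
print(f"low: {low/1e6:.0f}M unique, high: {high/1e6:.0f}M unique")
# low: 43M unique, high: 91M unique
```

Sequencing the low-complexity library any deeper mostly yields duplicates, capping the achievable resolution.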
Bin-level filtering
Prior to matrix balancing, it is advised to remove from the dataset any bins (rows/columns) that have either very noisy or very low signal. These bins are normally found in genomic regions with low mappability or high repeat content, such as around telomeres and centromeres.
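A minimal sketch of such bin-level filtering, dropping bins whose marginal coverage falls below an (illustrative) fraction of the median marginal:

```python
import numpy as np

def filter_low_coverage_bins(matrix, min_frac=0.2):
    """Drop rows/columns whose marginal coverage is below min_frac of
    the median marginal (threshold choice is illustrative)."""
    coverage = matrix.sum(axis=0)
    keep = coverage >= min_frac * np.median(coverage)
    return matrix[np.ix_(keep, keep)], keep

mat = np.array([[10.,  5., 0.,  8.],
                [ 5., 12., 0.,  6.],
                [ 0.,  0., 0.,  0.],   # dead bin (e.g. centromeric)
                [ 8.,  6., 0., 11.]])

filtered, keep = filter_low_coverage_bins(mat)
print(keep)            # [ True  True False  True]
print(filtered.shape)  # (3, 3)
```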

The goal of normalization is to reduce biases introduced during the experiment, as well as to enable better comparison between different experimental results (reducing batch effects).
Two types of normalization:
  • Explicit normalization:
    • Known bias factors [4]:
      • Distance between restriction enzyme cut sites (e.g., for Hi-C)
      • GC content of the trimmed ligation junction
      • Uniqueness (mappability) of sequence reads
    • Correction: integrate the bias factors into a prior probabilistic model.
  • Implicit normalization:
    • Iterative correction [5], based on the assumption that all loci should have equal visibility, since the entire genome is detected in an unbiased manner (achieved by equalizing the sum of every row/column in the matrix). Faster and generally preferred.
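A minimal sketch of iterative correction on a symmetric contact matrix: repeatedly estimate per-bin biases from the row sums and divide them out until all marginals are (nearly) equal, i.e. every locus has equal visibility. Real implementations add bin filtering and convergence checks.

```python
import numpy as np

def ice_balance(matrix, n_iter=50):
    """Iteratively divide out per-bin biases estimated from row sums
    until every row/column of the symmetric matrix has (nearly) the
    same total."""
    m = matrix.astype(float).copy()
    for _ in range(n_iter):
        s = m.sum(axis=1)
        s /= s[s > 0].mean()   # center biases around 1 to keep scale
        s[s == 0] = 1          # leave empty bins untouched
        m /= np.outer(s, s)
    return m

mat = np.array([[10.,  5.,  8.],
                [ 5., 20.,  6.],
                [ 8.,  6., 11.]])
balanced = ice_balance(mat)
print(np.round(balanced.sum(axis=1), 6))  # row sums are all equal
```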

  • TAD calling (details in 3.2.3)
    • Tools for calling TADs
    • Algorithms and principles
    • GPU acceleration
  • Separating active/repressive compartments (A/B)
  • Identifying chromatin loops
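A/B compartment separation is classically done by taking the sign of the leading eigenvector of the correlation matrix of the (distance-normalized) contact map. A toy sketch that skips the observed/expected normalization step for brevity:

```python
import numpy as np

def compartments(matrix):
    """Split bins into two groups by the sign of the leading
    eigenvector of the contact correlation matrix."""
    corr = np.corrcoef(matrix)
    vals, vecs = np.linalg.eigh(corr)
    return np.sign(vecs[:, np.argmax(vals)])

# Toy checkerboard ("plaid") matrix: bins {0, 2} interact preferentially
# with each other, as do bins {1, 3} -- mimicking two compartments.
mat = np.array([[9., 1., 8., 1.],
                [1., 9., 1., 8.],
                [8., 1., 9., 1.],
                [1., 8., 1., 9.]])

labels = compartments(mat)
print(labels[0] == labels[2], labels[1] == labels[3])  # True True
```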

  • See Chapter 4

  • Uses R to detect specific interactions between DNA elements and identify differential
  • Python package for normalization and analysis of chromatin structural data produced using either
  • Assigns statistical confidence to mid-range cis-chromosomal contacts
  • Designed for high-resolution Hi-C data
  • Identifies chromatin interactions in a genome
  • Detects sub-TAD chromatin interactions (cis)
  • Aligns, filters and normalizes reads; identifies and compares TADs, loops and compartments; displays results using Juicebox
  • Aligns reads, performs quality control, builds inter-/intra-chromosomal contact maps, applies fast iterative correction, and produces allele-specific contact maps
A comprehensive list of tools for Hi-C data analysis can be found here. Next we'll use HiC-Pro as a showcase of the Hi-C data analysis workflow (see the Hi-C Pro Pipeline chapter).

  • TADbit includes a quality control module and aligns reads to the reference
  • Identifies hierarchical topological domains
  • Uses dynamic programming to call TADs at different resolutions
  • Arrowhead is an algorithm for finding contact domains
Performance comparisons of the different tools are surveyed here.

[1] Computational Methods for Assessing Chromatin Hierarchy
[2] The Hitchhiker's Guide to Hi-C Analysis: Practical Guidelines
[3] Comparison of Computational Methods for Hi-C Data Analysis