HiC-Pro Pipeline

Comparison with other tools
Overview of the workflow
Step by Step
Run HiC-Pro in sequential mode
Allele specific analysis
Compatibility with other software

Thorough documentation can be found here.

HiC-Pro was designed to process Hi-C data, from raw fastq files (paired-end Illumina data) to normalized contact maps. It supports the main Hi-C protocols, including digestion protocols as well as protocols that do not require a restriction enzyme, such as DNase Hi-C. In practice, HiC-Pro can be used to process dilution Hi-C, in situ Hi-C, DNase Hi-C, Micro-C, capture-C, capture Hi-C or HiChIP data. Each step of the workflow can be run independently. HiC-Pro includes a fast implementation of the iterative correction method (see the iced Python library for more information). In addition, HiC-Pro can use phasing data to build allele-specific contact maps.

Comparison with other tools

Table 1: X indicates that the tool has the feature; $X^a$ indicates that HiC-inspector, HiCdat and HiC-Box do not allow chimeric reads to be rescued during the mapping.

Overview of the workflow

Figure 2. HiC-Pro workflow. Figure by Servant, N, et al. Genome Biology 16.1(2015):259.

Step by Step

1) Installation

To install this tool you should first check all the dependencies it relies on:

The bowtie2 mapper
Python (>2.7) with the pysam (>=0.8.3), bx (>=0.5.0), numpy (>=1.8.2), and scipy (>=0.15.1) libraries
R with the RColorBrewer and ggplot2 packages
g++ compiler
Samtools (>0.1.19)
Unix sort (which supports the -V option) is required. Mac OS users should install the GNU core utilities
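
As a quick sanity check before installing, the dependency list above can be probed from Python. This is only a sketch: the executable names (`bowtie2`, `samtools`, `R`, `g++`, `sort`) are assumed to be how the tools appear on PATH, `missing_dependencies` is a hypothetical helper, and versions are not checked.

```python
import importlib.util
import shutil

# Names taken from the dependency list above. The executable names are
# assumptions about how the tools appear on PATH on a typical system.
EXECUTABLES = ["bowtie2", "samtools", "R", "g++", "sort"]
PYTHON_MODULES = ["pysam", "bx", "numpy", "scipy"]


def missing_dependencies(executables=EXECUTABLES, modules=PYTHON_MODULES):
    """Return the executables and Python modules that cannot be found.

    Only presence is tested; minimum versions are not verified.
    """
    missing = [exe for exe in executables if shutil.which(exe) is None]
    missing += [mod for mod in modules if importlib.util.find_spec(mod) is None]
    return missing
```

Calling `missing_dependencies()` returns an empty list when everything is found; any names it returns still have to be installed by hand.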
Once the system is configured, install HiC-Pro (>=2.7.8). Make sure you have the appropriate rights and run:

```
tar -zxvf HiC-Pro-master.tar.gz
cd HiC-Pro-master
## Edit config-install.txt file if necessary
make configure
make install
```

If you encounter an error, you may find a solution here and here.

2) Reads mapping

The two mates of each paired-end read are aligned independently on the reference genome. The mapping is performed in two steps (more notes here).

First, the reads are aligned using an end-to-end aligner.
Second, reads spanning the ligation junction are trimmed from their 3’ end and aligned back on the genome.
Input file
```
.fastq(.gz) files
```
Output file
```
.bam files
```

Alignment-specific parameters follow bowtie2 usage, for example the minimum mapping quality, the index location and the sequencing quality encoding.
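
The second mapping step described above can be illustrated with a small sketch of the trimming itself. The ligation motifs below are assumptions not given in this page (they are the junctions expected after fill-in and blunt-end ligation of HindIII and MboI sites), and `trim_at_ligation_site` and its `min_length` cutoff are hypothetical names for illustration, not HiC-Pro's own code.

```python
# Assumed ligation junction motifs: a filled-in HindIII site (A^AGCTT)
# yields AAGCTAGCTT and a filled-in MboI site (^GATC) yields GATCGATC.
LIGATION_SITES = {"HindIII": "AAGCTAGCTT", "MboI": "GATCGATC"}


def trim_at_ligation_site(read_seq, ligation_site, min_length=20):
    """Keep the 5' part of a read located before the first ligation junction.

    Returns None when no junction is found or when the remaining 5' part
    is shorter than min_length (a hypothetical cutoff for this sketch).
    """
    pos = read_seq.find(ligation_site)
    if pos == -1:
        return None
    # Everything from the junction onward (the 3' end) is discarded.
    trimmed = read_seq[:pos]
    return trimmed if len(trimmed) >= min_length else None
```

In the workflow described above, only reads that fail the end-to-end alignment would be passed to such a function, and the trimmed 5' part would then be aligned again.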

3) Fragment assignment and filtering

Each aligned read can be assigned to one restriction fragment according to the reference genome and the restriction enzyme.

The next step is to separate the invalid ligation products from the valid pairs. Dangling-end and self-circle pairs are therefore excluded. See the previous chapter, Read mapping considerations.

For Hi-C protocols that do not require a restriction enzyme, such as DNase Hi-C or micro Hi-C, the assignment to a restriction fragment is not possible. If no GENOME_FRAGMENT file is specified, this step is skipped. Short-range interactions can however still be discarded using the MIN_CIS_DIST parameter.
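
The fragment assignment and the filtering of same-fragment pairs can be sketched as follows for reads on a single chromosome. This is an illustrative simplification, not HiC-Pro's implementation: `assign_fragment` and `classify_pair` are hypothetical names, the cut-site list is assumed to be sorted and to start at 0, and `min_cis_dist` only mimics the role of the MIN_CIS_DIST parameter.

```python
from bisect import bisect_right


def assign_fragment(cut_sites, position):
    """Return the index of the restriction fragment containing position.

    cut_sites is a sorted list of cut coordinates starting at 0;
    fragment i spans cut_sites[i] <= position < cut_sites[i + 1].
    """
    return bisect_right(cut_sites, position) - 1


def classify_pair(cut_sites, read1, read2, min_cis_dist=0):
    """Classify a same-chromosome pair; each read is a (position, strand) tuple."""
    # Order the reads by position so that the orientation is well defined.
    (pos1, strand1), (pos2, strand2) = sorted([read1, read2])
    frag1 = assign_fragment(cut_sites, pos1)
    frag2 = assign_fragment(cut_sites, pos2)
    if frag1 == frag2:
        if strand1 == "+" and strand2 == "-":
            return "dangling_end"  # reads face each other on one fragment
        if strand1 == "-" and strand2 == "+":
            return "self_circle"   # reads face outward on one fragment
        return "same_fragment_other"
    if pos2 - pos1 < min_cis_dist:
        return "filtered_short_range"
    return "valid"
```

When no fragment file is available, only the distance test at the end of `classify_pair` would still apply.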

4) Quality Controls

There are multiple quality controls for each step. Mapping:

Aligned reads in the first (end-to-end) step.
Alignment after trimming (in practice, we usually observe around 10-20% of trimmed reads; an abnormal level of trimmed reads can reflect a ligation issue).
The fraction of valid pairs for each type of ligation product.
Invalid pairs: dangling-end and/or self-circle pairs, singletons, multiple hits or duplicates.
The distribution of fragment sizes.
The fraction of intra- and inter-chromosomal contacts.
The fraction of short-range (20 kb) contacts.
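
The last two statistics in the list above amount to simple counting over the valid pairs. A minimal sketch, assuming each pair is available as a `(chrom1, pos1, chrom2, pos2)` tuple and that "short range" means a distance of at most 20 kb (the tuple layout, the function name and the inclusive threshold are assumptions of this example):

```python
def contact_fractions(pairs, short_range=20000):
    """Return the fractions of trans, short-range cis and long-range cis pairs.

    pairs is an iterable of (chrom1, pos1, chrom2, pos2) tuples.
    """
    counts = {"trans": 0, "cis_short": 0, "cis_long": 0}
    for chrom1, pos1, chrom2, pos2 in pairs:
        if chrom1 != chrom2:
            counts["trans"] += 1
        elif abs(pos2 - pos1) <= short_range:
            counts["cis_short"] += 1
        else:
            counts["cis_long"] += 1
    total = sum(counts.values())
    # An empty input yields all-zero fractions rather than a division error.
    return {key: (value / total if total else 0.0) for key, value in counts.items()}
```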

5) Map builder

Intra- and inter-chromosomal contact maps are built for all specified resolutions. The genome is split into bins of equal size. Each valid interaction is assigned to its genomic bins to generate the raw maps.
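
The binning described above can be sketched in a few lines. This example assumes valid pairs given as `(chrom1, pos1, chrom2, pos2)` tuples and stores the map as a sparse counter; `build_raw_map` is a hypothetical name and the layout is not HiC-Pro's output format.

```python
from collections import Counter


def build_raw_map(pairs, bin_size):
    """Count valid pairs per pair of genomic bins.

    The result maps ((chrom, bin_index), (chrom, bin_index)) to a contact
    count, with the smaller bin first so each contact is counted once.
    """
    counts = Counter()
    for chrom1, pos1, chrom2, pos2 in pairs:
        bin1 = (chrom1, pos1 // bin_size)
        bin2 = (chrom2, pos2 // bin_size)
        counts[tuple(sorted((bin1, bin2)))] += 1
    return counts
```

Building maps at several resolutions then only means calling the function once per bin size.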

6) ICE normalization

Hi-C data can contain several sources of bias which have to be corrected. HiC-Pro provides a fast implementation of the original ICE normalization algorithm (Imakaev et al. 2012), which assumes equal visibility of each fragment. The ICE normalization can also be used standalone through the iced Python package.
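
To make the equal-visibility idea concrete, here is a minimal pure-Python sketch of iterative correction on a small dense symmetric matrix. It is written for readability only and is not the iced implementation; `ice_normalize`, `max_iter` and `tol` are names chosen for this example.

```python
def ice_normalize(matrix, max_iter=100, tol=1e-6):
    """Iteratively correct a symmetric contact matrix given as a list of lists.

    Returns (normalized, bias) such that
    normalized[i][j] == matrix[i][j] / (bias[i] * bias[j]).
    """
    n = len(matrix)
    norm = [[float(value) for value in row] for row in matrix]
    bias = [1.0] * n
    for _ in range(max_iter):
        sums = [sum(row) for row in norm]
        nonzero = [s for s in sums if s > 0]
        if not nonzero:
            break
        mean = sum(nonzero) / len(nonzero)
        # Rows without any contact keep a correction factor of 1.
        delta = [s / mean if s > 0 else 1.0 for s in sums]
        for i in range(n):
            for j in range(n):
                norm[i][j] /= delta[i] * delta[j]
        bias = [b * d for b, d in zip(bias, delta)]
        # Stop once every row sum is close to the mean row sum.
        if max(abs(d - 1.0) for d in delta) < tol:
            break
    return norm, bias
```

After convergence, all non-empty rows of the normalized matrix have approximately the same sum, which is what "equal visibility" means here.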

Run HiC-Pro in sequential mode

HiC-Pro can be run step by step: set the -s parameter to specify a step. To only align the sequencing reads and run the quality controls, use:

```
MY_INSTALL_PATH/bin/HiC-Pro -i FULL_PATH_TO_RAW_DATA -o FULL_PATH_TO_OUTPUTS -c MY_LOCAL_CONFIG_FILE -s mapping -s quality_checks
```

```
HiC-Pro --help
usage : HiC-Pro -i INPUT -o OUTPUT -c CONFIG [-s ANALYSIS_STEP] [-p] [-h] [-v]
Use option -h|--help for more information

HiC-Pro 2.7.0
---------------
OPTIONS

 -i|--input INPUT : input data folder; Must contains a folder per sample with input files
 -o|--output OUTPUT : output folder
 -c|--conf CONFIG : configuration file for Hi-C processing
 [-p|--parallel] : if specified run HiC-Pro on a cluster
 [-s|--step ANALYSIS_STEP] : run only a subset of the HiC-Pro workflow; if not specified the complete workflow is run
    mapping: perform reads alignment
    proc_hic: perform Hi-C filtering
    quality_checks: run Hi-C quality control plots
    build_contact_maps: build raw inter/intrachromosomal contact maps
    ice_norm: run ICE normalization on contact maps
```
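
When HiC-Pro is driven from a script, the command line can be assembled programmatically. The sketch below uses only the options and step names shown in the help output above; `build_hicpro_command` is a hypothetical helper and the paths passed to it are placeholders.

```python
# Step names as listed in the help output above.
KNOWN_STEPS = ("mapping", "proc_hic", "quality_checks", "build_contact_maps", "ice_norm")


def build_hicpro_command(hicpro_bin, input_dir, output_dir, config, steps=()):
    """Return the HiC-Pro argument list; empty steps means the full workflow."""
    unknown = [step for step in steps if step not in KNOWN_STEPS]
    if unknown:
        raise ValueError("unknown step(s): " + ", ".join(unknown))
    cmd = [hicpro_bin, "-i", input_dir, "-o", output_dir, "-c", config]
    # The -s option is repeated once per requested step.
    for step in steps:
        cmd += ["-s", step]
    return cmd
```

The returned list can be passed to `subprocess.run(cmd, check=True)` to launch the run.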

Allele specific analysis

From the discussion in Chapter 1.2 we know that the paternal and maternal X chromosomes are organized differently: the inactive X chromosome shows mega-domains that are not seen on the active X chromosome. As expected, the inactive X chromosome map is partitioned into two mega-domains, and the boundary between them lies near the DXZ4 microsatellite.

HiC-Pro is able to incorporate phased haplotype information in the Hi-C data processing in order to generate allele-specific contact maps.

First: HiC-Pro masks the reference genome by replacing each SNP position with an ‘N’ using the BEDTools utilities.
Then: once the reads are aligned, HiC-Pro browses all reads spanning a polymorphic site, locates the nucleotide at the appropriate position, and assigns the read to either the maternal or the paternal allele.
Next: pairs are classified as allele-specific when both reads are assigned to the same parental allele, or when one read is assigned to a parental allele and the other is unassigned.
Finally: these allele-specific read pairs are used to generate a genome-wide contact map for each parental genome, and the two allele-specific genome-wide contact maps are normalized independently using the iterative correction algorithm.
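
The first three steps above can be sketched on toy data. This is a simplified illustration, not HiC-Pro's code: the function names are hypothetical, reads are assumed to align without gaps, positions are 0-based, and reads carrying alleles from both parents are labelled "conflict" and excluded (a choice made for this example).

```python
def mask_snps(reference, snp_positions):
    """Replace each SNP position (0-based) in the reference sequence by 'N'."""
    seq = list(reference)
    for pos in snp_positions:
        seq[pos] = "N"
    return "".join(seq)


def assign_read(read_start, read_seq, snps):
    """Return 'maternal', 'paternal', 'conflict' or None (unassigned).

    snps maps a 0-based reference position to (maternal_base, paternal_base).
    """
    alleles = set()
    for pos, (maternal, paternal) in snps.items():
        offset = pos - read_start
        if 0 <= offset < len(read_seq):
            base = read_seq[offset]
            if base == maternal:
                alleles.add("maternal")
            elif base == paternal:
                alleles.add("paternal")
    if len(alleles) > 1:
        return "conflict"
    return alleles.pop() if alleles else None


def classify_pair_allele(allele1, allele2):
    """Return the parental allele of a pair, or None if it is not allele-specific."""
    if "conflict" in (allele1, allele2):
        return None
    if allele1 == allele2:
        return allele1  # same parent for both reads, or both unassigned (None)
    if allele1 is None or allele2 is None:
        return allele1 or allele2  # one read assigned, the other unassigned
    return None  # the two reads point to different parents
```

Pairs labelled maternal or paternal would then feed two separate contact maps, each normalized on its own.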

Compatibility with other software

Reference here.

Visualization: Juicebox and HiCPlotter.
TAD calling: use the directionality index first proposed by Dixon et al.
Significant contacts: Fit-Hi-C.
R environment.
