HiC-Pro Pipeline
Comparison with other tools
Overview of the workflow
Step by Step
Run HiC-Pro in sequential mode
Allele-specific analysis
Compatibility with other software
Thorough documentation can be found here.
HiC-Pro was designed to process Hi-C data, from raw fastq files (paired-end Illumina data) to normalized contact maps. It supports the main Hi-C protocols, including digestion protocols as well as protocols that do not require a restriction enzyme, such as DNase Hi-C. In practice, HiC-Pro can be used to process dilution Hi-C, in situ Hi-C, DNase Hi-C, Micro-C, capture-C, capture Hi-C or HiChIP data. Each step of the workflow can be run independently. HiC-Pro includes a fast implementation of the iterative correction method (see the iced Python library for more information). In addition, HiC-Pro can use phasing data to build allele-specific contact maps.
Table 1: X indicates that a tool has the feature. HiC-inspector, HiCdat and HiC-Box do not allow chimeric reads to be rescued during mapping.
Figure 2. HiC-Pro workflow. Figure by Servant, N, et al. Genome Biology 16.1(2015):259.
To install this tool you should first check all the dependencies it relies on:
The bowtie2 mapper
Python (>2.7) with the pysam (>=0.8.3), bx (>=0.5.0), numpy (>=1.8.2), and scipy (>=0.15.1) libraries
R with the RColorBrewer and ggplot2 packages
g++ compiler
Samtools (>0.1.19)
Unix sort with support for the -V option is required. Mac OS users should install the GNU core utilities.
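Before installing, it can help to confirm that the command-line dependencies listed above are visible on the PATH. The following is a minimal sketch, not part of HiC-Pro; the executable names in REQUIRED_TOOLS are assumptions based on the usual command names and may differ on your system. It does not check versions or the Python and R packages.

```python
import shutil

# Assumed executable names for the tools listed above; adjust to your system.
REQUIRED_TOOLS = ["bowtie2", "samtools", "R", "g++", "sort"]


def find_missing_tools(tools, which=shutil.which):
    """Return the entries of `tools` that cannot be found on the PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = find_missing_tools(REQUIRED_TOOLS)
    if missing:
        print("Missing tools:", ", ".join(missing))
    else:
        print("All listed command-line tools were found.")
```

The `which` argument exists only so the lookup can be replaced in tests; in normal use the default `shutil.which` is enough.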
After setting up the system configuration, install HiC-Pro (>=2.7.8). Be sure to have the appropriate rights, then run:
If you encounter an error, you may find a solution here and here.
Each mate of a paired-end read is aligned independently to the reference genome. The mapping is performed in two steps (more notes here).
First, the reads are aligned using an end-to-end aligner.
Second, reads spanning the ligation junction are trimmed from their 3' end and aligned back to the genome.
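The trimming in the second step can be illustrated with a small sketch. This is a simplification for illustration, not HiC-Pro's own trimming code: it keeps only the 5' part of a read that precedes the first occurrence of the ligation site. The sequence AAGCTAGCTT used in the usage example is the ligation junction expected for a HindIII digestion and is given here as an example value.

```python
def trim_at_ligation_site(read, ligation_site):
    """Return the 5' part of `read` preceding the first ligation site.

    If the ligation site does not occur in the read, the read is
    returned unchanged.
    """
    position = read.find(ligation_site)
    if position == -1:
        return read
    return read[:position]


if __name__ == "__main__":
    # Hypothetical read that crosses a HindIII ligation junction.
    read = "ACGTACGT" + "AAGCTAGCTT" + "GGCC"
    print(trim_at_ligation_site(read, "AAGCTAGCTT"))
```

The trimmed 5' fragment is what would then be realigned to the genome.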
Input file
Output file
Alignment-specific parameters follow bowtie2 usage, for example the minimum mapping quality, the index location, and the quality-score encoding.
Each aligned read can be assigned to one restriction fragment according to the reference genome and the restriction enzyme.
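Assigning a read to a restriction fragment amounts to an interval lookup. A minimal sketch, assuming the fragment start coordinates of one chromosome are available as a sorted list beginning at 0 (the function name and data layout are illustrative, not HiC-Pro's):

```python
from bisect import bisect_right


def assign_fragment(position, fragment_starts):
    """Return the index of the restriction fragment containing `position`.

    `fragment_starts` is a sorted list of 0-based fragment start
    coordinates on a single chromosome, beginning with 0.
    """
    if position < 0:
        raise ValueError("position must be non-negative")
    # bisect_right counts the starts <= position; the last of them
    # is the start of the fragment that contains the position.
    return bisect_right(fragment_starts, position) - 1


if __name__ == "__main__":
    starts = [0, 1200, 4800, 9100]  # hypothetical cut positions
    print(assign_fragment(5000, starts))
```

A binary search keeps the lookup fast even for enzymes that cut the genome into hundreds of thousands of fragments.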
The next step is to separate the invalid ligation products from the valid pairs. Dangling-end and self-circle pairs are therefore excluded. See the previous chapter, "Read mapping considerations".
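The separation of valid and invalid pairs can be sketched with the commonly described rules: two reads on the same fragment that face each other form a dangling end, two reads on the same fragment that point away from each other form a self circle, and reads on different fragments form a valid pair. This is a simplified illustration, not HiC-Pro's implementation, which distinguishes more categories. Fragment identifiers are assumed to be unique genome-wide.

```python
def classify_pair(frag1, pos1, strand1, frag2, pos2, strand2):
    """Classify a read pair from fragment ids, positions and strands.

    Strands are "+" or "-". Returns "valid", "dangling_end",
    "self_circle" or "other_invalid".
    """
    if frag1 != frag2:
        return "valid"
    # Order the two reads by position so that strand1 belongs to
    # the upstream read.
    if pos1 > pos2:
        strand1, strand2 = strand2, strand1
    if strand1 == "+" and strand2 == "-":
        return "dangling_end"   # reads face each other
    if strand1 == "-" and strand2 == "+":
        return "self_circle"    # reads point away from each other
    return "other_invalid"      # same strand on the same fragment


if __name__ == "__main__":
    print(classify_pair(7, 100, "+", 7, 350, "-"))
```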
For Hi-C protocols that do not require a restriction enzyme, such as DNase Hi-C or micro Hi-C, assignment to a restriction fragment is not possible. If no GENOME_FRAGMENT file is specified, this step is skipped. Short-range interactions can, however, still be discarded using the MIN_CIS_DIST parameter.
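The effect of a minimum cis distance can be sketched as a simple filter. The function below is hypothetical and only illustrates the idea behind MIN_CIS_DIST; whether a pair lying exactly at the threshold is kept is an assumption of this sketch.

```python
def passes_min_cis_dist(chrom1, pos1, chrom2, pos2, min_cis_dist):
    """Return True if the pair should be kept.

    Inter-chromosomal pairs are always kept. Intra-chromosomal pairs
    are kept only when their distance is at least `min_cis_dist`
    (the boundary behaviour is an assumption of this sketch).
    """
    if chrom1 != chrom2:
        return True
    return abs(pos2 - pos1) >= min_cis_dist


if __name__ == "__main__":
    print(passes_min_cis_dist("chr1", 1000, "chr1", 1500, 1000))
```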
There are multiple quality controls for each step. Mapping:
Aligned reads in the first (end-to-end) step
Alignment after trimming (in practice, we usually observe around 10-20% of trimmed reads; an abnormal level of trimmed reads can reflect a ligation issue).
The fraction of valid pairs for each type of ligation products.
Invalid pairs: dangling-end and/or self-circle pairs, singletons, multiple hits or duplicates.
The distribution of fragment sizes.
The fraction of intra- and inter-chromosomal contacts.
The fraction of short-range (20 kb) contacts.
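The last two statistics can be computed directly from a list of valid pairs. A minimal sketch, assuming each pair is a tuple (chrom1, pos1, chrom2, pos2); the names and the choice to count a distance of exactly 20 kb as short range are assumptions of this example.

```python
def contact_fractions(pairs, short_range=20_000):
    """Return the fractions of trans, short-range cis and long-range cis pairs.

    `pairs` is a list of (chrom1, pos1, chrom2, pos2) tuples.
    """
    total = len(pairs)
    if total == 0:
        raise ValueError("no pairs given")
    trans = 0
    cis_short = 0
    for chrom1, pos1, chrom2, pos2 in pairs:
        if chrom1 != chrom2:
            trans += 1
        elif abs(pos2 - pos1) <= short_range:
            cis_short += 1
    cis_long = total - trans - cis_short
    return {
        "trans": trans / total,
        "cis_short": cis_short / total,
        "cis_long": cis_long / total,
    }


if __name__ == "__main__":
    example = [
        ("chr1", 100, "chr2", 500),
        ("chr1", 100, "chr1", 5_000),
        ("chr1", 100, "chr1", 500_000),
        ("chr2", 100, "chr2", 900_000),
    ]
    print(contact_fractions(example))
```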
Intra- and inter-chromosomal contact maps are built for all specified resolutions. The genome is split into bins of equal size. Each valid interaction is associated with its genomic bins to generate the raw maps.
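Binning can be sketched in a few lines. The example below is illustrative and does not reproduce HiC-Pro's output format: it assumes valid pairs are given as (chrom1, pos1, chrom2, pos2) tuples and stores counts in a sparse dictionary keyed by a pair of (chromosome, bin index) entries.

```python
from collections import Counter


def build_raw_map(pairs, bin_size):
    """Count valid pairs per pair of genomic bins.

    Returns a Counter keyed by ((chrom_a, bin_a), (chrom_b, bin_b)),
    with the two bins sorted so that each contact is counted once
    regardless of read order.
    """
    counts = Counter()
    for chrom1, pos1, chrom2, pos2 in pairs:
        bin1 = (chrom1, pos1 // bin_size)
        bin2 = (chrom2, pos2 // bin_size)
        key = tuple(sorted((bin1, bin2)))
        counts[key] += 1
    return counts


if __name__ == "__main__":
    pairs = [
        ("chr1", 100, "chr1", 2500),
        ("chr1", 2900, "chr1", 400),
        ("chr1", 100, "chr2", 50),
    ]
    for key, value in sorted(build_raw_map(pairs, 1000).items()):
        print(key, value)
```

Running the same function with several values of `bin_size` gives one raw map per resolution.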
Hi-C data can contain several sources of bias that have to be corrected. HiC-Pro provides a fast implementation of the original ICE normalization algorithm (Imakaev et al. 2012), which assumes equal visibility of each fragment. The ICE normalization can also be used on its own through the iced Python package.
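The idea behind iterative correction can be shown on a small dense matrix. The following is a plain-Python sketch under the equal-visibility assumption, not the iced implementation: at each iteration every row sum is divided by the mean row sum to obtain a bias vector, and each entry is divided by the product of the biases of its row and column. The function name and parameters are illustrative.

```python
def ice_normalize(matrix, max_iter=200, eps=1e-10):
    """Balance a symmetric contact matrix by iterative correction.

    Returns (normalized_matrix, bias) such that
    matrix[i][j] == normalized[i][j] * bias[i] * bias[j]
    up to floating-point error.
    """
    n = len(matrix)
    norm = [[float(value) for value in row] for row in matrix]
    bias = [1.0] * n  # cumulative bias over all iterations
    for _ in range(max_iter):
        sums = [sum(row) for row in norm]
        nonzero = [s for s in sums if s > 0]
        if not nonzero:
            break
        mean = sum(nonzero) / len(nonzero)
        # Empty rows get a neutral bias of 1 so they are left untouched.
        step = [s / mean if s > 0 else 1.0 for s in sums]
        for i in range(n):
            for j in range(n):
                norm[i][j] /= step[i] * step[j]
        bias = [b * s for b, s in zip(bias, step)]
        if max(abs(s - 1.0) for s in step) < eps:
            break
    return norm, bias


if __name__ == "__main__":
    raw = [[10, 4, 2], [4, 20, 6], [2, 6, 8]]
    normalized, bias = ice_normalize(raw)
    print([round(sum(row), 6) for row in normalized])
```

After convergence all non-empty rows have the same sum, which is the equal-visibility property the method aims for.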
HiC-Pro can be run in a step-by-step mode: users just have to set the -s parameter to specify a step. If you only want to align the sequencing reads and run a quality control, use:
HiC-Pro --help
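A step-by-step invocation can be assembled programmatically, as in the sketch below. The helper is hypothetical and only builds the argument list; the paths are placeholders, and the step names "mapping" and "quality_checks" are assumptions that should be checked against the output of HiC-Pro --help for your installed version.

```python
def build_hicpro_command(input_dir, output_dir, config_file, steps=()):
    """Build an argument list for a HiC-Pro run restricted to some steps.

    Each entry of `steps` is passed with its own -s option. The step
    names are not validated here.
    """
    command = ["HiC-Pro", "-i", input_dir, "-o", output_dir, "-c", config_file]
    for step in steps:
        command += ["-s", step]
    return command


if __name__ == "__main__":
    # Placeholder paths; step names are assumptions, see HiC-Pro --help.
    command = build_hicpro_command(
        "/path/to/rawdata",
        "/path/to/output",
        "/path/to/config-hicpro.txt",
        steps=("mapping", "quality_checks"),
    )
    print(" ".join(command))
```

The resulting list can be passed to `subprocess.run` once the paths and step names have been verified.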
From the discussion in Chapter 1.2 we know that the paternal and maternal X chromosomes are organized differently: the inactive X chromosome contains mega-domains that are not seen in the active X chromosome. As expected, the inactive X chromosome map is partitioned into two mega-domains. The boundary between the two mega-domains lies near the DXZ4 micro-satellite.
HiC-Pro is able to incorporate phased haplotype information in the Hi-C data processing in order to generate allele-specific contact maps.
First: HiC-Pro masks the reference genome by replacing each SNP position with an 'N', using the BEDTools utilities.
Then: Once the reads are aligned, HiC-Pro browses all reads spanning a polymorphic site, locates the nucleotide at the appropriate position, and assigns the read to either the maternal or the paternal allele.
Next: HiC-Pro classifies as allele-specific all pairs for which both reads are assigned to the same parental allele, or for which one read is assigned to one parental allele and the other is unassigned.
Finally: These allele-specific read pairs are used to generate a genome-wide contact map for each parental genome. The two allele-specific genome-wide contact maps are then normalized independently using the iterative correction algorithm.
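The assignment and classification rules in the steps above can be sketched as follows. This is a simplified illustration with hypothetical function names; it considers a single SNP per read and does not handle reads that span several polymorphic sites.

```python
def assign_allele(read_base, maternal_base, paternal_base):
    """Assign a read to a parental allele from the base it carries at a SNP.

    Returns "maternal", "paternal", or None when the base matches
    neither allele.
    """
    if read_base == maternal_base:
        return "maternal"
    if read_base == paternal_base:
        return "paternal"
    return None


def pair_allele(allele1, allele2):
    """Return the parental allele of a read pair, or None.

    A pair is allele-specific when both reads carry the same allele,
    or when one read is assigned and the other is unassigned (None).
    """
    if allele1 is not None and allele2 is not None:
        return allele1 if allele1 == allele2 else None
    return allele1 if allele1 is not None else allele2


if __name__ == "__main__":
    first = assign_allele("A", maternal_base="A", paternal_base="G")
    second = None  # the mate does not overlap any SNP
    print(pair_allele(first, second))
```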
Reference here.
Visualization: JuiceBox and HiCPlotter.
TAD calling: use the directionality index first proposed by Dixon et al., or Fit-Hi-C.
R environment.
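For the directionality index mentioned above, the statistic of Dixon et al. compares the number of contacts a bin makes with its upstream region (A) and its downstream region (B), with E = (A + B) / 2: DI = sign(B - A) * ((A - E)^2 / E + (B - E)^2 / E). A minimal sketch of this formula is given below; the function name is illustrative, and collecting A and B from a contact map (for example over a fixed window on each side) is left out.

```python
def directionality_index(upstream, downstream):
    """Compute the directionality index from upstream and downstream counts.

    `upstream` (A) and `downstream` (B) are the numbers of contacts of a
    bin with the regions before and after it. Returns 0.0 when A == B,
    which also covers the case where both counts are zero.
    """
    a, b = float(upstream), float(downstream)
    if a == b:
        return 0.0
    expected = (a + b) / 2.0
    sign = 1.0 if b > a else -1.0
    return sign * ((a - expected) ** 2 / expected + (b - expected) ** 2 / expected)


if __name__ == "__main__":
    print(directionality_index(10, 30))
```

A positive value indicates a downstream bias and a negative value an upstream bias; abrupt changes of sign along the chromosome are what TAD callers based on this index look for.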