GITAR Pipeline
GITAR website: https://www.genomegitar.org
Last updated
GITAR website: https://www.genomegitar.org
Last updated
GITAR (Genome Interaction Tools and Resources), is a software to perform a comprehensive Hi-C data analysis, including data preprocessing, normalization, visualization and topologically associated domains (TADs) analysis. GITAR is composed of two main modules: 1) HiCtool, a Python library to process and visualize Hi-C data, including TADs analysis and 2) Processed data library, a large collection of human and mouse datasets processed using HiCtool.
HiCtool is like a one-stop analysis tool which integrates all parts of hic analysis. It's aim is to enable users without any programming or bioinformatic expertise to work with Hi-C data and compare different datasets in a consistent way.
HiCtool is a pipeline divided into three main sections:
Data preprocessing
Data analysis and visualization
Topological domain analysis
HiCtool is in a pipeline format to allow extreme flexibility and easy usage. You do not need to install anything besides the following Python libraries, packages and software. Everything is open source.
1. Downloading the source data from GEO.
2. Pre-truncation of the reads that contain potential ligation junctions.
3. Mapping read pairs to the reference genome.
4. Filtering reads and selecting reads that are paired.
5. Creating the fragment-end (FEND) bed file.
Full list of instructions here, key steps are captured below for your better understanding.
The source data in sra format are downloaded via GEO accession number using the command fastq-dump of SRA Toolkit.
Split paired-reads SRA data into SRRXXXXXXX_1.fastq and SRRXXXXXXX_2.fastq, and SRRXXXXXXX.fastq (if present) contains reads with no mates.
1. Before normalization: HiFIVE:
Creating the Fend object.
Creating the HiCData object.
Creating the HiC project object.
Filtering HiC fends.
Estimating the HiC distance function.
Learning the correction model.
2. Normalizing the data.
3. Visualizing the data.
Before we do normalization, there are many unwanted effects people would like to exclude: Such as PCR duplicates and non-informative reads, produced by the possibility of incomplete restriction enzyme digestion and fragment circularization.
Below we have a functional summary for each step, details should refer to here.
Steps
Functions
1.Creating the Fend object
Fend object Transform bed file to FEND object
(which containing RE information like coordinates, GC content and mappability score).
2.Creating the HiCData object
Removing unwanted paired-reads (like total distance to their respective restriction sites exceeds threshold, PCR duplicates, incomplete restriction enzyme digestion and fragment circularization)
3.Creating the HiC project object
The HiC project object
(hdf5 format) links the HiCData object
with information about which fends to include in the analysis
4.Filtering HiC fends
Filter out fragments that do not have at least one interaction before learning correction parameters.
5.Estimating the HiC distance function
Estimation of the distance-dependence relationship from the data prior to normalization. Due to unevenly distributed restriction sites, fragments surrounded by shorter ones will show higher nearby interactions than those with longer adjacent fragments
6.Learning the correction model
The script HiCtool_hifive.py
can be used to run all the HiFive steps (1-6), whose outputs are hdf5 files. For more information about these functions, please see HiFive’s API documentation. To run together steps 1-6, open the script HiCtool_hifive.py
, update the parameters on the top and save. Then just execute the script:
A description can be found here, in this book we will talk about it in 3.2.3 TAD calling algorithms (Combined DI).
The processed data library is a collection of standardized processed datasets using HiCtool. There are 19 datasets of human (hg38), taken from the library on the 4D Nucleome (4DN) Web Portal, and 2 datasets of mouse (mm10). Specifically, for GITAR we referred only to Hi-C derived datasets.
The results stored into zip files include:
Intrachromosomal contact matrices (40 kb)
Directionality Index
HMM states
Topological domain coordinates
HDF5 files
The downstream analysis can be carried out directly from these intermediate step like here.
The cell line used here is B-lymphoblastoids GM12878 of human (hg38) with GEO accession number GSM1551550. To run the examples in this short tutorial you do not need the entire list of software/libraries but only the following:
To check if a module is installed, open your python console and try to import each module by typing:
You can simply type python
on your unix shell or you could use any other environment such as for example Spyder or Jupyter QtConsole in Anaconda (suggested).
If a module is not installed, go back to your unix console and type the following to install it:
If you are into an iPython console (Spyder or Jupyter QtConsole for example) type the following:
All the modules should be already included into Python except for hmmlearn
. To install hmmlearn
download the package from GitHub (link above) then go inside the package folder and run the following on the unix shell:
Otherwise in iPython type the following:
HiCtool provides a complete pipeline from the downloading of the raw data (SRA format) to the final BAM files that are used for the following analysis steps. In addition, instructions on how to generate a fragment end BED file to correct biases are provided.
Preprocessing steps:
Downloading the source data from GEO.
Pre-truncation of the reads that contain potential ligation junctions.
Mapping read pairs to the reference genome.
Filtering reads and selecting reads that are paired.
Creating the fragment-end (FEND) bed file.
In this short tutorial we do not run any preprocessing step. The following are the two BAM files and the FEND bed file produced as output of the preprocessing, to be used only if you want to proceed with the entire data normalization pipeline of the full documentation:
Download the data using wget
followed by the link address of each file.
The data analysis and visualization section provides the pipeline to normalize the data and plot the heatmaps. The normalization is done using the Python package HiFive while for plotting Matplotlib is used, with the possibility also to add a histogram of the distribution of the data. Observed, expected and normalized (FEND) contact counts can be plotted. FEND stands for "fragment end" and it refers to data corrected of technical and biological biases. In addition, we provide the possibility of plotting “observed over expected” (enrichment) contact heatmaps, where the expected counts are calculated considering both the learned correction parameters for biases and the distance between read pairs, given the property that the average intrachromosomal contact probability for pairs of loci decreases monotonically with increasing of their linear genomic distance.
Data analysis and visualization steps:
Creating the Fend object.
Creating the HiCData object.
Creating the HiC project object.
Filtering HiC fends.
Estimating the HiC distance function.
Learning the correction model.
Normalizing the data.
Visualizing the data (reported here).
Here for simplicity and processing time, we show examples only on visualization of the data (step 8). After the data are normalized (step 7), if both FEND and enrichment data were calculated, these files will be outputed (here only for chromosome 6, at 1 mb and 40 kb resolution):
To download the data use the command wget
on the unix shell and copy the url link from each file above, here an example for the normalized fend data of chr 6 (40 kb):
Then download the Python script HiCtool_normalization_visualization.py:
If you do not have wget
installed, you may want to check these links for installation instructions:
Otherwise go to the Google Drive folder with all the files and download the files.
Go to the directory where you downloaded the files and open a Python console on the unix shell (by simply typing python
) and then execute the script on the Python console:
Plot the normalized fend data for chromosome 6 at 40 kb resolution:
Here we plot normalized fend data (data_type
) of chromosome 6 (a_chr
), from 50 Mb (start_coord
) to 54 Mb (end_coord
) at a bin size of 40 kb (bin_size
), for species hg38 (species
). We use a colormap (my_colormap
) which goes from white (no contacts) to red (maximum contact) and we use a upper cut-off at the 95th percentile of the non-zero data (cutoff_type
and cutoff
) to enhance higher order chromatin structure such as topological domains on the heatmap. We assign to the bins over the cut-off a specific color (max_color
) and also we choose to plot the distribution of the contact data as well on a separate file (plot_histogram
).
The same can be done for the "observed over expected" (enrichment) data:
Red pixels are loci where there are more contacts than expected, blue pixels less contacts than expected. Note that the scale is log2. Gray pixels are those where the observed contact counts are 0, therefore the log2 of the ratio "observed/expected" would be minus infinite.
Plot of the normalized fend data at 1 mb resolution:
In this case we plot the entire contact matrix (full_matrix=True
) and we changed the maximum color of the heatmap to blue (my_colormap
).
The topological domain analysis section provides the code to calculate both the observed DI (Directionality Index) and the “true DI” using a Hidden Markov Model. Also the code to calculate topological domain coordinates is provided, therefore the user can infer systematically about the location of topological domain and boundaries over the genome.
To calculate the DI, normalized fend data at 40 kb resolution are used (see the full documentation for more details).
First, download the script from the unix shell:
Then, execute the script in the Python console:
To calculate the DI values and save them to file run:
The DI values are used as emissions in a Hidden Markov Model (HMM) to calculate the true DI values as HMM biased states:
Now we can plot the DI and true DI values:
The true DI values allow to infer the locations of the topological domains in the genome. A domain is initiated at the beginning of a single downstream biased HMM state (red color in the above figure). The domain is continuous throughout any consecutive downstream biased state. The domain will then end when the last in a series of upstream biased states (green color in the above figure) is reached, with the domain ending at the end of the last HMM upstream biased state.
To calculate the topological domain coordinates run:
Start and end coordinates will be saved in a tab separated format file where each line corresponds to a topological domain. Domain coordinates for the window (50-54 Mb) of the plot above are:
R Calandrelli et, al. GITAR: An open source tool for analysis and visualization of Hi-C data
Take into account of fragments length, inter-fragment distance, GC content and mappability score biases to learn the correction model for Hi-C data. (). In addition, biological biases are considered at this step (TSSs and CTCF bound sites).