3dgenome
  • Initial page
  • Cover
  • Preface
  • Figurelist
  • Chap0 Preparation
    • 0.1 Molecular biology
    • 0.2 Sequencing technologies
    • 0.3 RNA-seq Data Mapping & Gene Quantification
    • 0.4 RNA-seq Differential Analysis
  • Chap1 Why we care about 3D genome
    • 1.1 From 2D to 3D nuclear structure
    • 1.2 From static to dynamic
    • 1.3 From intra to inter chromosomes "talk"
    • 1.4 From aggregation to division - phase separation
  • Chap2 experiment tools for exploring genome interaction
    • 2.1 Image based
    • 2.2 Primary order
    • 2.3 Higher order C-techs
  • Chap3 Computational analysis
    • 3.1 Primary order analysis
    • 3.2 Higer order data analysis
      • 3.2.1 Read mapping consideration
      • 3.2.2 Analytical Pipelines
        • GITAR Pipeline
        • HiC-Pro Pipeline
      • 3.2.3 TAD calling algorithms
    • 3.3 3D structure
  • Chap4 RNA-genome interaction
    • 4.1 Experimental Methods
    • 4.2 Computational Analysis
  • Chap5 Integrative Data Visualization Tools
    • 5.1 GIVE
    • 5.2 HiGlass
  • Chap6 4DN Project
  • Appendix
    • Homework
    • Student's presentation
      • A Brief Introduction to Machine Learning
      • Precision medicine
      • CHIP-Seq
Powered by GitBook
On this page
  • Introduction
  • HiCtool
  • Installation
  • Data preprocessing
  • Data analysis and visualization
  • Topological domain analysis
  • Processed Data
  • Hands on example:
  • Data preprocessing
  • Data analysis and visualization
  • Topological domain analysis
  • Reference:
  1. Chap3 Computational analysis
  2. 3.2 Higer order data analysis
  3. 3.2.2 Analytical Pipelines

GITAR Pipeline

GITAR website: https://www.genomegitar.org

Previous3.2.2 Analytical PipelinesNextHiC-Pro Pipeline

Last updated 6 years ago

Introduction

GITAR (Genome Interaction Tools and Resources), is a software to perform a comprehensive Hi-C data analysis, including data preprocessing, normalization, visualization and topologically associated domains (TADs) analysis. GITAR is composed of two main modules: 1) HiCtool, a Python library to process and visualize Hi-C data, including TADs analysis and 2) Processed data library, a large collection of human and mouse datasets processed using HiCtool.

HiCtool

HiCtool is like a one-stop analysis tool which integrates all parts of hic analysis. It's aim is to enable users without any programming or bioinformatic expertise to work with Hi-C data and compare different datasets in a consistent way.

HiCtool is a pipeline divided into three main sections:

  • Data preprocessing

  • Data analysis and visualization

  • Topological domain analysis

Installation

Data preprocessing

  • 1. Downloading the source data from GEO.

  • 2. Pre-truncation of the reads that contain potential ligation junctions.

  • 3. Mapping read pairs to the reference genome.

  • 4. Filtering reads and selecting reads that are paired.

  • 5. Creating the fragment-end (FEND) bed file.

Downloading the source data from GEO

  • Split paired-reads SRA data into SRRXXXXXXX_1.fastq and SRRXXXXXXX_2.fastq, and SRRXXXXXXX.fastq (if present) contains reads with no mates.

Pre-truncation of the reads that contain potential ligation junctions

  • Output with .trunc.fastq file as well as a log file contains the truncation information.

Mapping read pairs to the reference genome

  • HiCfile1_log.txt and HiCfile2_log.txt are log files containing the statistics of the alignment.

Filtering reads and selecting reads that are paired

  • Output paired SAM files (HiCfile1_hq.sam and HiCfile2_hq.sam) in BAM format.

Creating the fragment-end (FEND) bed file

The fragment-end (FEND) bed file is used to normalize the data and it contains restriction site coordinates and additional information related to fragment properties (GC content and mappability score). These properties will later be used in the normalization pipeline to remove biases.

  • GC content of 200 bp upstream and downstream to the restriction site is computed.

  • The common RE sites were already precomputed:

Data analysis and visualization

  • 1. Before normalization: HiFIVE:

    • Creating the Fend object.

    • Creating the HiCData object.

    • Creating the HiC project object.

    • Filtering HiC fends.

    • Estimating the HiC distance function.

    • Learning the correction model.

  • 2. Normalizing the data.

  • 3. Visualizing the data.

Before we do normalization, there are many unwanted effects people would like to exclude: Such as PCR duplicates and non-informative reads, produced by the possibility of incomplete restriction enzyme digestion and fragment circularization.

Steps

Functions

1.Creating the Fend object

Fend object Transform bed file to FEND object (which containing RE information like coordinates, GC content and mappability score).

2.Creating the HiCData object

Removing unwanted paired-reads (like total distance to their respective restriction sites exceeds threshold, PCR duplicates, incomplete restriction enzyme digestion and fragment circularization)

3.Creating the HiC project object

The HiC project object (hdf5 format) links the HiCData object with information about which fends to include in the analysis

4.Filtering HiC fends

Filter out fragments that do not have at least one interaction before learning correction parameters.

5.Estimating the HiC distance function

Estimation of the distance-dependence relationship from the data prior to normalization. Due to unevenly distributed restriction sites, fragments surrounded by shorter ones will show higher nearby interactions than those with longer adjacent fragments

6.Learning the correction model

  • The observed data contain the observed reads count for each bin.

  • The fend expected data contain the learned correction value to remove biases related to fends for each bin.

  • The enrichment expected data contain the expected reads count for each bin, considering the linear distance between read pairs and the learned correction parameters.

  • The normalized fend data contain the corrected reads count for each bin.

  • The normalized enrichment data (“observed over expected” matrix) contain the enrichment value (O/E) for each bin.

This part is to plot the heatmap and histogram for the normalized contact data and enrichment normalized contact matrix:

Topological domain analysis

Processed Data

The results stored into zip files include:

  • Intrachromosomal contact matrices (40 kb)

  • Directionality Index

  • HMM states

  • Topological domain coordinates

  • HDF5 files

Hands on example:

To check if a module is installed, open your python console and try to import each module by typing:

  import my_module

If a module is not installed, go back to your unix console and type the following to install it:

  pip install my_module

If you are into an iPython console (Spyder or Jupyter QtConsole for example) type the following:

  !pip install my_module

All the modules should be already included into Python except for hmmlearn. To install hmmlearn download the package from GitHub (link above) then go inside the package folder and run the following on the unix shell:

  python setup.py install

Otherwise in iPython type the following:

  !pip install hmmlearn

Data preprocessing

HiCtool provides a complete pipeline from the downloading of the raw data (SRA format) to the final BAM files that are used for the following analysis steps. In addition, instructions on how to generate a fragment end BED file to correct biases are provided.

Preprocessing steps:

  1. Downloading the source data from GEO.

  2. Pre-truncation of the reads that contain potential ligation junctions.

  3. Mapping read pairs to the reference genome.

  4. Filtering reads and selecting reads that are paired.

  5. Creating the fragment-end (FEND) bed file.

Download the data using wget followed by the link address of each file.

Data analysis and visualization

Data analysis and visualization steps:

  1. Creating the Fend object.

  2. Creating the HiCData object.

  3. Creating the HiC project object.

  4. Filtering HiC fends.

  5. Estimating the HiC distance function.

  6. Learning the correction model.

  7. Normalizing the data.

  8. Visualizing the data (reported here).

Here for simplicity and processing time, we show examples only on visualization of the data (step 8). After the data are normalized (step 7), if both FEND and enrichment data were calculated, these files will be outputed (here only for chromosome 6, at 1 mb and 40 kb resolution):

To download the data use the command wget on the unix shell and copy the url link from each file above, here an example for the normalized fend data of chr 6 (40 kb):

  wget https://sysbio.ucsd.edu/public/rcalandrelli/hictool_example/HiCtool_chr6_40kb_normalized_fend.txt
 wget https://sysbio.ucsd.edu/public/rcalandrelli/hictool_example/HiCtool_normalization_visualization.py

If you do not have wget installed, you may want to check these links for installation instructions:

Go to the directory where you downloaded the files and open a Python console on the unix shell (by simply typing python) and then execute the script on the Python console:

execfile("HiCtool_normalization_visualization.py")

Plot the normalized fend data for chromosome 6 at 40 kb resolution:

plot_chromosome_data('HiCtool_chr6_40kb_normalized_fend.txt', a_chr='6', bin_size=40000, full_matrix=False, start_coord=50000000, end_coord=54000000, species='hg38', data_type="normalized_fend", my_colormap=['white', 'red'], cutoff_type='percentile', cutoff=95, max_color='#460000', plot_histogram=True)

Here we plot normalized fend data (data_type) of chromosome 6 (a_chr), from 50 Mb (start_coord) to 54 Mb (end_coord) at a bin size of 40 kb (bin_size), for species hg38 (species). We use a colormap (my_colormap) which goes from white (no contacts) to red (maximum contact) and we use a upper cut-off at the 95th percentile of the non-zero data (cutoff_type and cutoff) to enhance higher order chromatin structure such as topological domains on the heatmap. We assign to the bins over the cut-off a specific color (max_color) and also we choose to plot the distribution of the contact data as well on a separate file (plot_histogram).

The same can be done for the "observed over expected" (enrichment) data:

plot_chromosome_enrich_data('HiCtool_chr6_40kb_normalized_enrich.txt', a_chr='6', bin_size=40000, full_matrix=False, start_coord=50000000, end_coord=54000000, species='hg38', plot_histogram=True)

Red pixels are loci where there are more contacts than expected, blue pixels less contacts than expected. Note that the scale is log2. Gray pixels are those where the observed contact counts are 0, therefore the log2 of the ratio "observed/expected" would be minus infinite.

Plot of the normalized fend data at 1 mb resolution:

plot_chromosome_data('HiCtool_chr6_1mb_normalized_fend.txt', a_chr='6', bin_size=1000000, full_matrix=True, species='hg38', data_type="normalized_fend", my_colormap=['white', 'blue'], cutoff_type='percentile', cutoff=95, max_color='#460000', plot_histogram=True)

In this case we plot the entire contact matrix (full_matrix=True) and we changed the maximum color of the heatmap to blue (my_colormap).

Topological domain analysis

The topological domain analysis section provides the code to calculate both the observed DI (Directionality Index) and the “true DI” using a Hidden Markov Model. Also the code to calculate topological domain coordinates is provided, therefore the user can infer systematically about the location of topological domain and boundaries over the genome.

First, download the script from the unix shell:

wget https://sysbio.ucsd.edu/public/rcalandrelli/hictool_example/HiCtool_DI.py

Then, execute the script in the Python console:

execfile("HiCtool_DI.py")

To calculate the DI values and save them to file run:

DI = calculate_chromosome_DI(input_contact_matrix='HiCtool_chr6_40kb_normalized_fend.txt', a_chr='6')

The DI values are used as emissions in a Hidden Markov Model (HMM) to calculate the true DI values as HMM biased states:

true_DI = calculate_chromosome_true_DI(input_file_DI='HiCtool_chr6_DI.txt', a_chr='6')

Now we can plot the DI and true DI values:

plot_chromosome_DI(input_file_DI='HiCtool_chr6_DI.txt', a_chr='6', start_pos=50000000, end_pos=54000000, input_file_hmm='HiCtool_chr6_hmm_states.txt', species='hg38', plot_legend=True, plot_grid=True)

The true DI values allow to infer the locations of the topological domains in the genome. A domain is initiated at the beginning of a single downstream biased HMM state (red color in the above figure). The domain is continuous throughout any consecutive downstream biased state. The domain will then end when the last in a series of upstream biased states (green color in the above figure) is reached, with the domain ending at the end of the last HMM upstream biased state.

To calculate the topological domain coordinates run:

topological_domains = calculate_chromosome_topological_domains(input_file_hmm='HiCtool_chr6_hmm_states.txt', a_chr='6')
  50080000    50600000
  50640000    51760000
  51840000    52000000
  52080000    52680000
  52800000    53040000
  53120000    53760000

Reference:

HiCtool is in a pipeline format to allow extreme flexibility and easy usage. You do not need to install anything besides the following , packages and software. Everything is open source.

Full list of instructions , key steps are captured below for your better understanding.

The source data in sra format are downloaded via GEO accession number using the command fastq-dump of .

This step mainly pre-truncating ithe reads that contain potential ligation junctions to keep the longest piece without a junction sequence (). To do so download the code in (see ) to carry on.

Full list of instructions , key steps are captured below for your better understanding.

pre_truncate a .fastq file with different restriction enzyme (HindIII, MboI, NcoI and DpnII sites were built in, while you can still add your own restriction sites file look up .

Full list of instructions , key steps are captured below for your better understanding.

Mapping software: —— reads are mapped independently. index is built with reference hg38.fa.

Full list of instructions , key steps are captured below for your better understanding.

is used to extract the headers and perform filtering on the mapped reads (quality).

Below we have a functional summary for each step, details should refer to .

Take into account of fragments length, inter-fragment distance, GC content and mappability score biases to learn the correction model for Hi-C data. (). In addition, biological biases are considered at this step (TSSs and CTCF bound sites).

The script can be used to run all the HiFive steps (1-6), whose outputs are hdf5 files. For more information about these functions, please see . To run together steps 1-6, open the script , update the parameters on the top and save. Then just execute the script:

Biases are to remove at this step, the observed contact matrix and the fend expected contact matrix are calculated. For each chromosome you can but not restricted to get these 6 files ():

A description can be found , in this book we will talk about it in 3.2.3 TAD calling algorithms (Combined DI).

The processed data library is a collection of standardized using HiCtool. There are 19 datasets of human (hg38), taken from the library on the , and 2 datasets of mouse (mm10). Specifically, for GITAR we referred only to Hi-C derived datasets.

The downstream analysis can be carried out directly from these intermediate step like .

The cell line used here is B-lymphoblastoids GM12878 of human (hg38) with GEO accession number . To run the examples in this short tutorial you do not need the but only the following:

You can simply type python on your unix shell or you could use any other environment such as for example Spyder or Jupyter QtConsole in (suggested).

In this short tutorial we do not run any preprocessing step. The following are the two BAM files and the FEND bed file produced as output of the preprocessing, to be used only if you want to proceed with the entire data normalization pipeline of the :

The data analysis and visualization section provides the pipeline to normalize the data and plot the heatmaps. The normalization is done using the Python package while for plotting Matplotlib is used, with the possibility also to add a histogram of the distribution of the data. Observed, expected and normalized (FEND) contact counts can be plotted. FEND stands for "fragment end" and it refers to data corrected of technical and biological biases. In addition, we provide the possibility of plotting “observed over expected” (enrichment) contact heatmaps, where the expected counts are calculated considering both the learned correction parameters for biases and the distance between read pairs, given the property that the average intrachromosomal contact probability for pairs of loci decreases monotonically with increasing of their linear genomic distance.

Then download the Python script :

Otherwise go to the and download the files.

To calculate the DI, normalized fend data at 40 kb resolution are used (see the for more details).

Start and end coordinates will be saved in a where each line corresponds to a topological domain. Domain coordinates for the window (50-54 Mb) of the plot above are:

R Calandrelli et, al.

Python libraries
here
SRA Toolkit
Ay et al., 2015
pre_truncation.py
API Documentation
here
here
here
Bowtie 2
here
SAMtools
HindIII-hg38
MboI-hg38
NcoI-hg38
HindIII-mm10
MboI-mm10
here
HiCtool_hifive.py
HiFive’s API documentation
HiCtool_hifive.py
docs
here
processed datasets
4D Nucleome (4DN) Web Portal
here
GSM1551550
entire list of software/libraries
numpy
scipy
matplotlib
math
hmmlearn
Anaconda
full documentation
BAM file pair 1
BAM file pair 2
FEND bed file
HiFive
Observed data chr 6 (1 mb)
Normalized fend data chr 6 (1 mb)
Observed data chr 6 (40 kb)
Normalized fend data chr 6 (40 kb)
Normalized enrichment data chr 6 (40 kb)
HiCtool_normalization_visualization.py
Windows
MacOS
Google Drive folder with all the files
full documentation
tab separated format file
GITAR: An open source tool for analysis and visualization of Hi-C data
Yaffe E. and Tanay A., 2011
Figure1 HiCtool workflow schema HiCtool figure by Calandrelli R et al
Normalized contact matrix
enrichment normalized contact matrix