3.3 3D structure
Last updated
Last updated
After the experimental protocols, data analysis pipline, contact matrix normalization.The next holy grail is to infer, from the resulting loci contact maps, to accurate 3D models of how chromosomes folded and fitted into the nucleus. Most of the methods are inferring the 3D struture based on some "wish distance" map, so that we can calculate the 3D embedding (locations in 3D coordinates) with based method (a classical method to calculate embedding of every sample point from pairwise distance matrix). But one of the prominent problems is how to build the bridge from a contact matrix to an Euclidean space distance.
Another way to find a 3D coordinates of the chromosomes is based on probabilistic model like Poisson model. These models cast the problem of structure inference as a maximum likelihood problem. For that purpose, people need to define a probabilistic model of contact counts parametrized by the 3D structure that we want to infer from contact count observations.
We'll unfold these two kinds of methods over the rest of the discussion.
The first step of this distance based model is to compute the "wish distance" of each pair of loci in Euclidean space. The basic assumption behind this idea is that the contact frequency is positively related with the spatial distance, which to say, if two loci has a high count value , then there is a great likelihood that they are approximate with each other.
Default contact-to-distance:
DNA, the contact count is inversely proportional to the genomic distance , whereas the volume scales linearly with the subchain length , from which we deduce a relationship between d and c of the form . Sometimes we could add some scaling parameters before the distance: .
Caveats in transfering
As we could imgaine, this kind of transformation mapping the correlation between loci from non-euclidean space to euclidean space. There are lots of remaining caveats that people should bare in mind:
The contacts are captured physically by proteins which to say the approximity adjacencies would be captured as long as two loci are within some spatial threshold.
Not all the contacts are detected. The absence of reads for a pair of sites does not assess, and should not be handled as, an absence of contact.
Contact network originates from an average contact map, derived from a huge number of individual conformations.
Only reflects a spatial proximity at the time of the experiment. It may result from random thermal motion of DNA, and does not necessarily imply a specific biochemical or physical interaction between the genomic sites.
A very classical way of inferring the coordinates of points is MDS - multidimentional scaling (Kruskal et.al,1964), which will optimize on the embedded pairwise distance between loci so that be coordinated with the original pairwise distance. There are many variations of classical MDS which also be utilized in chromosome structure reconstruction.
Multidimensional scaling
An MDS algorithm aims to place each object in a lower dimensional space such that the between-object distances are preserved as well as possible. This can be achieved by minimizing the stress function ( stress = real_pairwise_distance - pairwise_distance_from_inferred_points ). However, with different types of stress, MDS can be classified with few categories:
Metric MDS : the stress function is
There are also other ways to make up for the large dominant value like logarithm.
where b_{ij} are the terms of the embedding. The is the original distance. This method can be computed efficiently with implemented packages.
Weighted MDS: there are drawbacks of the raw stress function mentioned above, which is dominated by large values. Just as we discussed above, the small contact value may lead to large distance, which is not reliable. proposed a way to rectify the distance:
Non-metric MDS: Instead of using the value, we could also try to keep the rank or the relative order of the data. Non-metric MDS relies on the sole hypothesis that if two loci i and j are observed to be in contact more often than loci k and $l$,then i and j should be closer in 3D space than and :
Given a set of similarities , find ,such that:
Softwares based on these methods: , .
propose to cast the problem of structure inference as a maximum likelihood problem. They basically modeled the contact frequencies as independent Poisson random variables, where the Poisson parameter of is a decreasing function of (embedded pairwise distance), for some parameters the likelihood of an embedding could be expressed as:
The best can be maximized by interative algorithms.
is a novel Bayesian probabilistic approach for analyzing Hi-C data. BACH takes the Hi-C contact matrix and local genomic features (restriction enzyme cutting frequencies, GC content and sequence uniqueness) as input and produces, via MCMC computation, the posterior distribution of three-dimensional (3D) chromosomal structure.
inferences the spatial organization of chromosomes using a semi-definte embedding approach and Hi-C data
is a Python package for normalization and analysis of chromatin structural data produced using either the 5C of HiC assay. This library contains tools for handling all steps after mapping of reads.
proposes a novel approach to infer a consensus three- dimensional structure of a genome from Hi-C data. The method incorporates a statistical model of the contact counts, assuming that the counts between two loci follow a Poisson distribution whose intensity decreases with the physical distances between the loci. The method can automatically adjust the transfer function relating the spatial distance to the Poisson intensity and infer a genome structure that best explains the observed data.
is a complete Python library to deal with all steps to analyze, model and explore 3C-based data. With TADbit the user can map FASTQ files to obtain raw interaction binned matrices (Hi-C like matrices), normalize and correct interaction matrices, identify adn compare the so-called Topologically Associating Domains (TADs), build 3D models from the interaction matrices, and finally, extract structural properties from the models. TADbit is complemented by TADkit for visualizing 3D models.
is an application designed to determine three-dimensional genome and chromosome models.
is a model-view framework for displaying genomic and epigenomic data within a three-dimensional physical model of the human genome. In this framework, the model of the physical genome implicitly contains all levels of structure and hierarchy, and provides an underlying platform for integrating multi-scale structural and genomic information within three dimensions. The viewer is a Microsoft Windows based software designed to display data from multiple scales and uses a hierarchical model of the relative positions of all nucleotide atoms in the cell nucleus, i.e., the physical genome. With Genome3D, multi-resolution data can be interactively explored to uncover complex and new structural relationships. Genome3D is designed with a hierarchical architecture that establishes methods for incorporating new data of varying resolutions, and allows for user-defined annotations. The viewer can also be used to display other 3D genome models as long as they are specified in Genome3D format. A platform independent, fast response, cloud ready viewer prototype has also been developed to employ game engine to visualize 3D genome structure.
, a new browser-based interactive tool designed to provide an intuitive environment for investigators to facilitate the 3D exploratory analysis of Hi-C data along with many useful annotation functionalities. Among the key features of HiC-3DViewer relevant to chromatin conformation studies, the most important one is the 1D-to-2D-to-3D mapping, to highlight genomic regions of interest interactively. This feature enables investigators to explore their data at different levels/angels. Additionally, investigators can superpose different genomic signals (such as ChIP-Seq, SNP) on the top of the 3D structure.
creates interactive 3D representations of chromatin conformations modeled from 3C-based interaction matrices. The user can overlay 1D and 2D tracks of data to these 3D views to evaluate the relationship between the 3D structure of the genome and its biological function.
is an application designed to visualize genome structure in 3D. It allows users to view the genome structure at multiple scales, including: global, chromosome, loci, fiber, nucleosome, and nucleotide. This software was built upon the pre-existing Jmol package by Prof. Cheng's group.