3.3 3D structure

From Hi-C matrix to 3D model

After the experimental protocols, data analysis pipline, contact matrix normalization.The next holy grail is to infer, from the resulting loci contact maps, to accurate 3D models of how chromosomes folded and fitted into the nucleus. Most of the methods are inferring the 3D struture based on some "wish distance" map, so that we can calculate the 3D embedding (locations in 3D coordinates) with MDS based method (a classical method to calculate embedding of every sample point from pairwise distance matrix). But one of the prominent problems is how to build the bridge from a contact matrix to an Euclidean space distance.

Another way to find a 3D coordinates of the chromosomes is based on probabilistic model like Poisson model. These models cast the problem of structure inference as a maximum likelihood problem. For that purpose, people need to define a probabilistic model of contact counts parametrized by the 3D structure that we want to infer from contact count observations.

We'll unfold these two kinds of methods over the rest of the discussion.

Method1: Distance based model

1.From contact matrix to spatial distance

The first step of this distance based model is to compute the "wish distance" of each pair of loci in Euclidean space. The basic assumption behind this idea is that the contact frequency is positively related with the spatial distance, which to say, if two loci has a high count value , then there is a great likelihood that they are approximate with each other.

Default contact-to-distance:

DNA, the contact count is inversely proportional to the genomic distance (cs1)(c\sim s^{-1}), whereas the volume scales linearly with the subchain length (d3s)(d^3 \sim s), from which we deduce a relationship between d and c of the form dc1/3d \sim c^{-{1/3}}. Sometimes we could add some scaling parameters before the distance: δij=γcij1/3\delta_{ij}=\gamma c^{-1/3}_{ij}.

Caveats in transfering

As we could imgaine, this kind of transformation mapping the correlation between loci from non-euclidean space to euclidean space. There are lots of remaining caveats that people should bare in mind:

  • The contacts are captured physically by proteins which to say the approximity adjacencies would be captured as long as two loci are within some spatial threshold.

  • Not all the contacts are detected. The absence of reads for a pair of sites does not assess, and should not be handled as, an absence of contact.

  • Contact network originates from an average contact map, derived from a huge number of individual conformations.

  • Only reflects a spatial proximity at the time of the experiment. It may result from random thermal motion of DNA, and does not necessarily imply a specific biochemical or physical interaction between the genomic sites.

2.From distance to 3D structure

A very classical way of inferring the coordinates of points is MDS - multidimentional scaling (Kruskal et.al,1964), which will optimize on the embedded pairwise distance between loci so that be coordinated with the original pairwise distance. There are many variations of classical MDS which also be utilized in chromosome structure reconstruction.

Multidimensional scaling

An MDS algorithm aims to place each object in a lower dimensional space such that the between-object distances are preserved as well as possible. This can be achieved by minimizing the stress function ( stress = real_pairwise_distance - pairwise_distance_from_inferred_points ). However, with different types of stress, MDS can be classified with few categories:

  • Metric MDS : the stress function is

    StressD(x1,x2,...,xN)=(i,j(dijxixj)2i,jdij2)1/2Stress_D(x_1,x_2,...,x_N)=\Biggl(\frac{\sum_{i,j}\bigl(d_{ij}-\|x_i-x_j\|\bigr)^2}{\sum_{i,j}d_{ij}^2}\Biggr)^{1/2}

    where b_{ij} are the terms of the embedding. The dijd_{ij} is the original distance. This method can be computed efficiently with implemented packages.

  • Weighted MDS: there are drawbacks of the raw stress function mentioned above, which is dominated by large values. Just as we discussed above, the small contact value may lead to large distance, which is not reliable. Varoquaux, Nelle, et al. proposed a way to rectify the distance:

    minx(i,j)Dδij2(dij(X)δij)2\min_x \sum_{(i,j)\in D}\delta_{ij}^{-2}(d_ij(X)-\delta_{ij})^2

    There are also other ways to make up for the large dominant value like logarithm.

  • Non-metric MDS: Instead of using the value, we could also try to keep the rank or the relative order of the data. Non-metric MDS relies on the sole hypothesis that if two loci i and j are observed to be in contact more often than loci k and $l$,then i and j should be closer in 3D space than kk and ll:

    Given a set of similarities cijc_{ij}, find XR3×nX \in \mathbb{R}^{3 \times n},such that:

    cijcklxixj2xkxl2c_{ij} \ge c_{kl} \Leftrightarrow ||x_i - x_j||_2 \le ||x_k - x_l||_2

Softwares based on these methods: ChromSDE, Pastis.

Method2: Statistical model

Varoquaux, Nelle, et al. propose to cast the problem of structure inference as a maximum likelihood problem. They basically modeled the contact frequencies cijc_ij as independent Poisson random variables, where the Poisson parameter of cijc_{ij} is a decreasing function of dijd_{ij} (embedded pairwise distance), for some parameters the likelihood of an embedding could be expressed as:

l(X,α,β)=ij(βdijα)cijcij!exp(βdijα)l(X,\alpha,\beta)=\prod_{ij} \frac{(\beta d_{ij}^\alpha)^{c_{ij}}}{c_{ij}!}exp(-\beta d_{ij}^{\alpha})

The best XX can be maximized by interative algorithms.

Softwares:

Modeling 3C-based data:

BACH is a novel Bayesian probabilistic approach for analyzing Hi-C data. BACH takes the Hi-C contact matrix and local genomic features (restriction enzyme cutting frequencies, GC content and sequence uniqueness) as input and produces, via MCMC computation, the posterior distribution of three-dimensional (3D) chromosomal structure.

Visualizing 3D-model:

Genome3D is a model-view framework for displaying genomic and epigenomic data within a three-dimensional physical model of the human genome. In this framework, the model of the physical genome implicitly contains all levels of structure and hierarchy, and provides an underlying platform for integrating multi-scale structural and genomic information within three dimensions. The viewer is a Microsoft Windows based software designed to display data from multiple scales and uses a hierarchical model of the relative positions of all nucleotide atoms in the cell nucleus, i.e., the physical genome. With Genome3D, multi-resolution data can be interactively explored to uncover complex and new structural relationships. Genome3D is designed with a hierarchical architecture that establishes methods for incorporating new data of varying resolutions, and allows for user-defined annotations. The viewer can also be used to display other 3D genome models as long as they are specified in Genome3D format. A platform independent, fast response, cloud ready viewer prototype has also been developed to employ game engine to visualize 3D genome structure.

Last updated