
Antonio Mucherino's Research
Discretizable Distance Geometry
The Distance Geometry Problem (DGP) asks whether a simple weighted undirected graph G=(V,E,d) can be
realized in a Kdimensional space so that the distance between each pair of realized vertices u and v
is the same as the weight d(u,v) assigned to the edge {u,v}, when available. The distance function d
generally associates an interval [lb(u,v),ub(u,v)] to every edge {u,v} of E. The range
of the interval corresponds to the uncertainty on the actual value of the distance: when the interval is degenerate,
i.e. lb(u,v) = ub(u,v), then there can be two possibilities: either the distance is exact (provided
with high precision), or the level of uncertainty on the distance is not a priori known, so that only an approximated
value is provided. Notice that the embedding space is Euclidean in most reallife applications. The DGP is NPhard,
and in the last years, together with several other colleagues, we have been working on the different facets of this problem.
Various applications can lead to the definition of a DGP: the DGP can nowadays be seen as a classical problem
in operational research. One classical DGP application arises in biology: experiments of Nuclear Magnetic Resonance
(NMR) are able to find estimates of some distances between atom pairs in molecules, so that the solution to the
corresponding DGP allows for finding molecular structures. Other applications include clock synchronization,
sensor network localization, graph drawing, multidimensional scaling. In recent research, we have started to work on
motion adaptations by a novel distancebased approach. Since motions have a temporal component, the DGP that is formulated
in this context is based on a graph G whose vertex set is the product of a set V (of objects) by the set
T (representing time as a sequence of discrete steps).
A protein model and an animation in 2D, representing two of the aforementioned DGP applications, are shown below:




More animations, related to experiments presented in our publications, are available on
this special page.
The discretization of the DGP can be performed when the graph G satisfies some particular assumptions.
These assumptions ensure that every vertex v of the graph has at least K adjacent predecessors, which can serve
as a reference for computing candidate positions for the vertex. They allow for defining the set of possible positions for a
given vertex v as the intersection of K Euclidean objects (circles in dimension 2, spheres in dimension 3).
Moreover, in order to fix the coordinate space and avoid that the obtained conformations can "float" in the space, the
first K vertices are supposed to form a clique. An example is given below for K = 2. The first two vertices
(forming the clique) can be fixed in two unique positions. The search starts from the vertex 3, and goes up to the vertex 5
in this example. For every vertex from 3 to 5, the circle defined by the reference vertex with lowest rank is explicitely
drawn, while only short marks, indicating the intersection points, are given for the second (having higher rank) reference
vertex.

The solutions to this small instance are given by the set of paths from the vertex 1 to all the possible positions in the
2dimensional space for the vertex 5. The drawing shows that the solution set has the shape of a tree, which grows
exponentially with the number of involved vertices. This is consequence of the fact that the DGP is an NPhard problem.
Moreover, the search tree is symmetric: in this example, the line going through every pair of reference vertices defines a
symmetric axis, such that all distances (concerning the reference vertices and all "subsequent" vertices) that are satisfied
(or not) by the solutions on the lefthand side of the symmetry axis are also satisfied (or not) by the solutions on the
righthand side of the axis.
In this example, the solution set corresponds to the entire search tree. When additional distances are available, they can be
used for selecting a subset of solutions from the search tree. In practice, the idea is to construct the search tree step
by step while performing a depthfirst tree search where the additional distance information (used to perfom the selections)
are used as soon as the corresponding vertex positions are computed. In this way, some branches of the search tree are not
generated at all, because it is known in advance that they contain infeasible solutions. It is said in this case that the
branches of the tree are "pruned", and the distances used to perform this operation are called "pruning distances".
The general algorithmic framework implementing this idea is therefore named the branchandprune (BP) algorithm.
In the example above, the graph G and the vertex ordering specified by the numerical labels actually satisfy an additional
assumption: for every vertex v, the reference vertices are its two immediately preceding vertices (for example, the reference
vertices for 5 are 3 and 4). This assumption is known under the name of "consecutivity assumption". However, this assumption is not
strictly necessary to perform the discretization. The example below shows in fact another example where the consecutivity assumption
is not satisfied: both vertices 4 and 5 have as reference vertices the vertex 2 and the vertex 3. In the figure below, the circles
and the marks are drawn by following the same criteria indicated above for the previous figure. This example basically shows that,
if the consecutivity assumption is not satisfied, then the search domain is not necessarily a tree (even if it can be approximated
by a tree, as it was previously done in several publications), and the reference vertices for different vertices can form a common
symmetry axis. In this example, it is also supposed, for simplicity, that the distance between the vertex 2 and 4 coincide with the
distance between 2 and 5 (the positions for both 4 and 5 lie on a common circle).

Remark that, in order to deal with reallife data, where distances are generally imprecise and often represented
by intervals having relatively large ranges, it is required that at least K1 reference distances can be considered as
"exact" (meaning by "exact" that lb(u,v) = ub(u,v) and the approximated distance value is not affected by an error that
is higher than the predefined tolerance error). As a consequence, the Euclidean objects involved in the intersections (performed
to determine the feasible points for a given vertex v) are not always (hyper)spheres, but (hyper)spherical shells may
actually be involved in the intersection in presence of an interval distance. In these hypotheses, the intersection does not provide
a finite number of feasible points, but rather a continuous object in dimension 1 (a set of arcs delimited on the circle defined by
the K 1 spheres). One of the initial employed strategies to deal with interval distances consisted in discretizing
these arcs, by simply extracting equidistant sample points from each arc obtained with the intersections. However, it is now known
that this strategy can introduce an important error capable to propagate along the search domain, and to spoil the computations.
Therefore, more recently, we have proposed a coarsegrained representation for the solutions of the discrete search
domain of discretizable DGPs. This coarsegrained representation is actually able to handle in a more efficient way DGP instances
containing interval distances. Notice that the discretization does not change the problem complexity, which remains
NPhard.
MDjeep is an implementation of the BP algorithm for discretizable DGPs, which is available on
GitHub
since the release 0.3.0, where the software tool was extended for solving instances containing interval distances.
Differently from other software tools for the DGP, MDjeep is able to enumerate the entire solution set
for a given instance. This is trivially done with instances consisting of only exact distances by performing the entire
exploration of the search tree (which is in practice possible when a sufficient number of pruning distances is
available). In order to do the same for instances containing interval distances, the resolution parameter was
recently introduced in MDjeep. This parameter allows us to select solutions from a (local) continuous set
of feasible solutions.
Metaheuristics
Metaheuristics are widely employed in optimization. My first experience with metaheuristics dates back to my PhD
in Italy, where I was working on a geometric model for the simulation of protein conformations. This model leads to the definition
of a global optimization problem, whose solutions represent geometrically "wellshaped" conformations for proteins, that we were
used to obtain by executing a simple Simulated Annealing (SA). The permutations implemented in our SA implementations were
particularly inspired by the geometric nature of the considered problem.
A few years later, I took part of the development of a novel metaheuristic framework which is inspired by the behavior
of a monkey climbing trees of solutions with the aim of finding goodquality solution approximations.
We named this method the Monkey Search.
A monkey can in fact survive in a forest because it is able to focus its food researches where it has already found food.
The trees the monkey explores contain solutions to the optimization problem to be solved, and branches of the tree represent
perturbations able to transform one solution into another. At each step, the monkey chooses the branches to climb by applying
a random mechanism, which is based on the objective function values of the new corresponding solutions. Notice that, rather than
a completely new metaheuristic approach for global optimization, the Monkey Search can be considered as a general
framework describing the monkey behavior, where other ideas and strategies for global optimization are also employed.
In the context of data mining (see below), we have been working on techniques for supervised biclustering.
We proposed a bilevel reformulation of the associated optimization problem, where the inner problem is linear, and its solutions
are also solutions to the original problem. In order to solve the bilevel problem, we have developed an algorithm that performs
a Variable Neighborhood Search (VNS) in the space defined by the outer problem, while the inner problem is solved exactly by
linear programming at each iteration of the VNS.
More recently, we proposed the environment as a new general concept in metaheuristics. As of today, we have tested
this novel idea in conjunction with only one metaheuristic: the Ant Colony Optimization (ACO). In ACO, artificial ants move in a
mathematical space by following the pheromone trails left by ants having previously explored the same region of the search space.
Our environment perturbs the perception of the pheromone, for ants to avoid to get trapped at local optima of the considered problem.
In a recent work, we used our ACO with environmental changes for intercriteria analysis. Work is in progress for a theoretical study
for this new metaheuristic concept, as well as for testing this novel idea with other metaheuristics.
Data mining
Data mining consists in extracting previously unknown, potentially useful and reliable patterns from a given set of data.
In collaboration with Panos Pardalos and Petraq Papajorgji, we wrote a
textbook
describing the most common data mining techniques, by paying a particular attention to those used in the agricultural field.
One interesting example is the problem of predicting, by using the data obtained from the very first stages of the fermentation,
the wine fermentations that are likely to be problematic, in order for the enologist to take actions in advance to avoid
spoiling part of the production.
After the publication of the book, we worked on supervised biclustering techniques for classification.
Given a set of data, biclustering aims at finding simultaneous partitions in biclusters of its samples and of the features which are
used for representing the samples. Consistent biclusterings can be exploited for performing supervised classifications, as well as for
solving feature selection problems. The problem of identifying a consistent biclustering can be formulated as a 01 linear fractional
optimization problem, which is NPhard. We worked on a bilevel reformulation of this optimization problem, and on a heuristic
(see above) which is based on the bilevel reformulation.
Proteins
In the different problems introduced above, the main considered application (at least in my publications) is the one
of analyzing or identifying the threedimensional conformation of protein conformations. Why all this interest?
C.M. Dobson, A. Sali and M. Karplus wrote:
An understand of folding is important for the analysis of many events involved in cellular regulation,
the design of proteins with novel functions, the utilization of sequence information from the various genome
projects, and the development of novel therapeutic strategies for treating or preventing debilitating human diseases
that are associated with the failure of proteins to fold correctly.
Here's below a brief description of protein conformations that I wrote at the very beginning of my research on this topic
(in other words, this is a description from a nonspecialist, but that the mathematician or computer scientist may easily follow).
Proteins are heteropolymers. They are formed by smaller molecules called amino acids, that are bonded together to form a sort of "chain".
Living organisms contain plenty of proteins, which perform several functions of vital importance. Each amino acid along the protein chain
contains a Carbon atom called C_{α}, which is chemically bonded to three different aggregates of atoms. Two of these
aggregates are common to each amino acid: the amino group (NH_{2}) and the carboxyl group (COOH). Only one aggregate of atoms
makes each amino acid different from one another: the socalled side chain R. Only 20 amino acids are involved into the protein synthesis.
These molecules are under genetic control: the amino acid sequence of a protein is contained into the corresponding gene of the DNA molecule.
The threedimensional conformation of a protein molecule is of great interest, because it defines its dynamics in the cell, and therefore
its function.
Different properties of the side chains allow for classifying the amino acids on the basis of their features. A very important feature
of the amino acids is the polarity of their side chains R. A side chain is polar if it has a global or local electric charge.
There is an attraction between positive and negative charged molecules and there is repulsion between molecules with the same charge.
Since the protein solvent, i.e. water, is polar and globally uncharged, there is attraction between the water molecules and the
polar R groups. It follows that the amino acids having a polar side chain are hydrophilic, because they tend to interact
with the solvent. Other amino acids are buried inside the protein because of the hydrophilic amino acid behavior. Therefore, they are
called hydrophobic. This feature of the amino acids plays a fundamental role in the stabilization of protein molecules. The
attractive and repulsive van der Waals forces are important as well. They can be modeled, for example, by the Lennard Jones potential
(see below).
This page
provides additional information about amino acids.
Each amino acid of the protein chain is bonded to the previous and the successive amino acid through a peptide bond. More precisely,
the carboxyl group of the i^{th} amino acid is bonded to the amino group of the (i+1)^{th} amino acid,
i.e. a covalent bond keeps bonded the C atom of the first group and the N atom of the second one. The peptide bond is strong enough
to force every atom between two consecutive C_{α} atoms on the same plane. Therefore, the movement possibilities of
the amino acids on the chain are strongly constrained.
Lennard Jones clusters
Finding the ground state conformations of clusters of molecules is one of the most difficult global optimization problems.
Such clusters are regulated by potential energy functions, such as the Lennard Jones and the Morse potential energy (also involved in the
identification of protein conformations, see above). The Lennard Jones potential is very simple but sufficiently accurate for noble gas,
and it also provides approximations for structures of nickel and gold. The Morse potential has a parameter which determines the width of
the potential well; Morse clusters show a wide variety of structural behaviors in function of this parameter. Both potential functions
have several local minima in which methods for optimization may get stuck at. For instance, a Lennard Jones cluster formed by 13 atoms
has about 1000 local minima, and the number of local minima increases exponentially with the size of the cluster. The Lennard Jones and
Morse optimal clusters usually have particular geometric motifs, which can be exploited during the optimization process. The
Cambridge Cluster Database
contains experimentally obtained clusters. Some of them may just be local minima:
the database is open to newly generated clusters having a lower energy function value.
Drawings
The drawings above that I made with pensil and compass are not meant to be precise but they intend only to visually show the main ideas
behind the discretization of DGPs. The idea to make such drawings came into my mind only recently, one day while I was helping my kid
with his elementary school homeworks...
Back Home

