
Antonio Mucherino's Research
This page contains a brief summary of my research activities. Additional content on topics related to my research,
published on this website over the past years, is appended at the end of this page.
Discretizable Distance Geometry.
The Distance Geometry Problem (DGP) asks whether a simple weighted undirected graph G=(V,E,d) can be
realized in a K-dimensional space so that the distance between each pair of realized vertices u and v
is the same as the weight d(u,v) assigned to the edge {u,v}, when available. The embedding space is Euclidean
in most applications. The distance information can be given either as one real value, representing an exact distance
or an approximation, or as a real-valued interval. The DGP is NP-hard, and in recent years I have been working
on the different facets of this problem with several other colleagues.
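In its verification form the problem is simple to state: given a candidate realization, check that every available distance is satisfied. A minimal Python sketch on a hypothetical toy instance in dimension K = 2 (the instance and the tolerance are illustrative assumptions, not data from my work):

```python
import math

# Hypothetical toy instance in K = 2: edge weights d(u,v) and a candidate
# realization of the vertices in the plane.
edges = {(0, 1): 1.0, (1, 2): 1.0, (0, 2): math.sqrt(2.0)}
positions = {0: (0.0, 0.0), 1: (1.0, 0.0), 2: (1.0, 1.0)}

def realizes(positions, edges, tol=1e-9):
    """Check that the realized distance between u and v matches the
    weight d(u,v) for every available edge."""
    return all(abs(math.dist(positions[u], positions[v]) - d) <= tol
               for (u, v), d in edges.items())

print(realizes(positions, edges))  # True for this realization
```

The hard part of the DGP is of course the search for such a realization, not its verification.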
Different applications can lead to the definition of a DGP: the DGP can nowadays be seen as a classical problem
in operational research. One classical DGP application arises in biology: experiments of Nuclear Magnetic Resonance
(NMR) are able to estimate some distances between pairs of atoms in molecules, and these estimates can then be exploited
for discovering molecular conformations. We also began to investigate the possibility of formulating a particular
class of Multidimensional Scaling (MDS) problems as a DGP. More recently, we have also started to work on human motion
adaptation through a novel distance-based approach, which leads to the definition of a DGP. Since motions have a temporal
component, it was first necessary to extend the DGP to dynamical problems.
The discretization of the DGP can be performed when some particular assumptions are satisfied. These ensure
that every vertex v of the graph has at least K adjacent predecessors, which can serve as references for
computing candidate positions for the vertex. Moreover, in order to deal with real-life data, where distances are
generally imprecise and often represented by intervals, it is required that at least K-1 reference distances
can be considered exact. These assumptions allow for defining the set of possible positions for a given vertex v
as the intersection of K Euclidean objects: hyperspheres, related to the exact distances, and hyperspherical
shells, related to the interval distances. These intersections give either discrete subsets of positions, or
continuous one-dimensional objects. Much research has been devoted to identifying the best strategy to adopt
in the case where the intersection of the hyperspheres does not provide a discrete set of possible positions.
One of the first strategies we considered consisted in discretizing those continuous objects by extracting
equidistant sample points. However, this strategy introduces an error which can propagate through the rest of the
computations. More recently, therefore, we proposed a coarse-grained representation which is able to handle
DGP instances containing interval distances more efficiently. The discretization allows us to define a tree of
possible vertex positions, which corresponds to the search domain of the DGP after the discretization. The
discretization does not change the problem complexity, which remains NP-hard.
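In dimension K = 3 with three exact reference distances, the candidate positions of a vertex are the intersection of three spheres, which generically consists of two points. The sketch below uses standard trilateration (a textbook formulation, not the numerically stable procedure developed in our work):

```python
import math

def trilaterate(p1, p2, p3, r1, r2, r3):
    """Intersect three spheres in R^3, returning the (at most) two
    candidate points: the branching step of the discretization when
    K = 3 exact reference distances are available."""
    sub = lambda a, b: [x - y for x, y in zip(a, b)]
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    norm = lambda a: math.sqrt(dot(a, a))
    cross = lambda a, b: [a[1]*b[2] - a[2]*b[1],
                          a[2]*b[0] - a[0]*b[2],
                          a[0]*b[1] - a[1]*b[0]]

    # local orthonormal frame built on the three reference points
    d = norm(sub(p2, p1))
    ex = [c / d for c in sub(p2, p1)]
    i = dot(ex, sub(p3, p1))
    ey_raw = [a - i * e for a, e in zip(sub(p3, p1), ex)]
    ey = [c / norm(ey_raw) for c in ey_raw]
    ez = cross(ex, ey)
    j = dot(ey, sub(p3, p1))

    x = (r1**2 - r2**2 + d**2) / (2 * d)
    y = (r1**2 - r3**2 + i**2 + j**2) / (2 * j) - (i / j) * x
    z2 = r1**2 - x**2 - y**2
    if z2 < 0:
        return []                     # the three spheres do not intersect
    z = math.sqrt(z2)
    base = [p1[k] + x * ex[k] + y * ey[k] for k in range(3)]
    return [[b + z * e for b, e in zip(base, ez)],
            [b - z * e for b, e in zip(base, ez)]]
```

For instance, with references (0,0,0), (2,0,0), (0,2,0) and all three distances equal to sqrt(3), the two candidates are (1,1,1) and (1,1,-1): the tree of the discretization branches on exactly this kind of pair.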
Algorithms based on a branch-and-prune (BP) framework can be implemented for exploring the search
domain of discretizable DGPs. We developed numerically stable methods for the computation of candidate vertex positions
at every layer of the search tree. Once computed, the feasibility of candidate positions is verified by employing
so-called pruning devices. Pruning allows the search to focus on the feasible regions of the search tree. Basic
pruning is based on the verification of the distances that are not used during the discretization process. These
distances in fact define additional Euclidean objects to be intersected with the K objects involved in
the discretization. Additional pruning devices can be conceived and tailored to particular classes of DGP instances,
such as the DGP class related to biological molecules. We also integrated the algorithm with some energy-based pruning
devices, which take van der Waals and Lennard-Jones potentials into consideration. Our approach can enumerate the entire
solution set of DGP instances. MDjeep is an implementation of a BP algorithm
for discretizable DGPs with exact distances, and is now available on
GitHub.
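The BP scheme can be sketched in dimension K = 2, where candidate positions come from intersecting two circles and every distance not used in the discretization acts as a pruning device. This is a toy illustration of the framework, not MDjeep's actual implementation; fixing vertex 0 at the origin and vertex 1 on the x-axis removes translations and rotations, so solutions come in mirror pairs:

```python
import math

def circle_intersection(c1, r1, c2, r2):
    """Two circles in the plane intersect in at most two points:
    the branching step of BP when K = 2."""
    dx, dy = c2[0] - c1[0], c2[1] - c1[1]
    d = math.hypot(dx, dy)
    if d == 0 or d > r1 + r2 or d < abs(r1 - r2):
        return []
    a = (r1**2 - r2**2 + d**2) / (2 * d)
    h = math.sqrt(max(r1**2 - a**2, 0.0))
    mx, my = c1[0] + a * dx / d, c1[1] + a * dy / d
    return [(mx + h * dy / d, my - h * dx / d),
            (mx - h * dy / d, my + h * dx / d)]

def bp(v, positions, dist, n, tol=1e-6, solutions=None):
    """Enumerate all realizations: branch on the candidate positions of
    vertex v, prune with every other distance to an already placed vertex.
    dist maps (u, w) with u < w to the weight d(u, w)."""
    if solutions is None:
        solutions = []
    if v == n:
        solutions.append(dict(positions))
        return solutions
    u1, u2 = v - 2, v - 1                 # the K = 2 reference predecessors
    for p in circle_intersection(positions[u1], dist[(u1, v)],
                                 positions[u2], dist[(u2, v)]):
        # pruning: verify all distances not used for the branching
        if all(abs(math.dist(p, positions[u]) - d) <= tol
               for (u, w), d in dist.items() if w == v and u < v - 2):
            positions[v] = p
            bp(v + 1, positions, dist, n, tol, solutions)
            del positions[v]
    return solutions

# Hypothetical instance: the unit square; the first two vertices are fixed.
d = {(0, 1): 1.0, (0, 2): 2**0.5, (1, 2): 1.0,
     (1, 3): 2**0.5, (2, 3): 1.0, (0, 3): 1.0}
sols = bp(2, {0: (0.0, 0.0), 1: (1.0, 0.0)}, d, 4)
print(len(sols))  # 2: the square and its mirror image
```

Branching produces two candidate positions per vertex; the pruning distance (0,3) discards one of them at the last layer, which is how BP keeps the exploration focused on feasible subtrees.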
In the presence of a large number of wrong measurements, it may be necessary to relax the pruning phase of the algorithm,
which increases the number of nodes included in the search tree: heuristics can in this case be conceived for
speeding up the search. The BP framework was recently enhanced for problems in dimension 1, where an additional pruning
phase, named backtracking pruning, makes it possible to deal with interval distances without introducing any approximations
of the distances. Current research aims at extending this result to dimensions K greater than 1.
At the beginning of our work on the DGP, we focused our attention on instances where all available
distances are exact. This is evidently an unrealistic assumption for many real-life applications, but it allowed us to study
some interesting properties of discretizable DGPs. In fact, we discovered that the discrete search domain of some DGPs contains
several symmetries, which can be exploited for computing its solution set. Moreover, when DGP instances related to molecular
conformations are considered, the search domain is a tree with bounded width, so that it can never experience a combinatorial
explosion. The particular structure of this tree also allowed us to develop two parallelization strategies for the BP
framework, which were tested on the French nationwide grid infrastructure
Grid'5000.
A discretization order is a vertex order for the graph G such that the discretization assumptions are
satisfied. For the automatic detection of discretization orders, we proposed a greedy algorithm that was initially only able
to work with instances containing exact distances; it was subsequently extended to deal with interval distances. In
both cases, it was proved that the greedy algorithm is able to identify discretization orders when they exist, and to deliver
a certificate of nonexistence otherwise. The ordering problem can become NP-hard when additional
requirements are combined with the discretization assumptions. For example, the "consecutivity assumption" requires that all
reference vertices for a given vertex be consecutive in the order, and that they immediately precede this vertex. When this
additional assumption is satisfied, all reference distances are edges of cliques belonging to the original graph G, and the
discretization order can be defined as a sequence of overlapping cliques. A pseudo de Bruijn graph was proposed to aid the task
of generating discretization orders satisfying the consecutivity assumption, and an ordering for protein backbones that is
optimal in terms of length was thereby discovered. Other orders satisfying the consecutivity assumption had previously been
handcrafted for the protein backbone and for the side chains of the amino acids involved in protein synthesis.
We also worked on suitable methods for finding discretization orders that satisfy a given set of objectives, together
with the necessary discretization assumptions. We wrote a constraint program for this purpose, and found an optimal (partial)
discretization order for protein backbones by solving this program with Answer Set Programming (ASP). More recently, we proposed
an algorithm with polynomial complexity for the generation of optimal discretization orders, where the objectives to be optimized
are given by a set of simple functions. This algorithm extends the previously proposed greedy algorithm for discretization orders.
Data mining.
Data mining consists in extracting previously unknown, potentially useful and reliable patterns from a given set of data.
During my postdoc at the University of Florida, in collaboration with Panos Pardalos and Petraq Papajorgji,
I wrote a
textbook
describing the most common data mining techniques.
The book also contains many applications of data mining to the agricultural field, as well as to some related fields,
such as biology and chemistry.
Some years after my postdoc experience at the University of Florida, I have been working with other colleagues on supervised
biclustering techniques for classification. Given a set of data, biclustering aims at finding simultaneous partitions into
biclusters of its samples and of the features used for representing the samples. Consistent biclusterings can be
exploited for performing supervised classifications, as well as for solving feature selection problems. The problem of
identifying a consistent biclustering can be formulated as a 0-1 linear fractional optimization problem, which is NP-hard.
We are working on a bilevel reformulation of this optimization problem, and on a heuristic based on the bilevel reformulation.
This technique was extended to sets of data containing negative entries.
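The consistency condition behind supervised biclustering can be illustrated compactly: each feature is assigned to the class maximizing its class-average, and the biclustering is consistent if every sample is then re-classified into its own class by the averages over its class's features. The check below is a plain illustration of this condition (the data are made up; ties and normalization issues are ignored):

```python
def consistent_biclustering(A, labels):
    """A: list of samples (rows) over m features; labels: class of each
    sample. Returns True if the induced biclustering is consistent."""
    classes = sorted(set(labels))
    n, m = len(A), len(A[0])

    def feat_avg(j, r):
        """Average of feature j over the samples of class r."""
        idx = [i for i in range(n) if labels[i] == r]
        return sum(A[i][j] for i in idx) / len(idx)

    # each feature goes to the class where its class-average is largest
    feat_class = [max(classes, key=lambda r: feat_avg(j, r)) for j in range(m)]

    for i in range(n):
        def samp_avg(r):
            """Average of sample i over the features assigned to class r."""
            idx = [j for j in range(m) if feat_class[j] == r]
            return sum(A[i][j] for j in idx) / len(idx) if idx else float('-inf')
        # consistency: the sample must be re-classified into its own class
        if max(classes, key=samp_avg) != labels[i]:
            return False
    return True
```

The NP-hard optimization problem mentioned above consists in selecting a subset of features so that the surviving biclustering is consistent (and, for instance, as large as possible); the function here only verifies the condition for a fixed feature set.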
As pointed out in our book, biclustering had never been applied to agricultural problems before the book's publication,
even though the application seemed promising. Subsequently, we applied biclustering to the analysis of wine fermentations. The
problem consists in predicting problematic fermentations at the early stages of the process: we discovered some compounds of wine
that may be the cause of problematic fermentations.
Metaheuristics.
Metaheuristics are widely employed in optimization. My first experience with metaheuristics dates back to my PhD
in Italy, where I was working on a geometric model for the simulation of protein conformations. This model leads to the definition
of a global optimization problem, whose solutions represent geometrically well-shaped conformations for proteins, which we
obtained by executing a simple Simulated Annealing (SA) algorithm. The perturbations implemented in our SA were particularly
inspired by the geometric nature of the considered problem.
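The SA scheme itself is generic and can be sketched in a few lines; the objective, the perturbation and the cooling schedule below are illustrative placeholders, not the geometric model of my PhD work:

```python
import math
import random

def simulated_annealing(f, x0, perturb, T0=1.0, cooling=0.95, steps=2000):
    """Minimal SA loop: always accept improving candidates, accept a
    worsening one with probability exp(-delta/T), and cool the
    temperature geometrically."""
    random.seed(0)                      # fixed seed, for reproducibility
    x, fx, T = x0, f(x0), T0
    best, fbest = x, fx
    for _ in range(steps):
        y = perturb(x)
        fy = f(y)
        if fy <= fx or random.random() < math.exp((fx - fy) / T):
            x, fx = y, fy
            if fx < fbest:
                best, fbest = x, fx
        T *= cooling
    return best, fbest
```

What made our SA specific to the protein model was the design of `perturb`, which exploited the geometry of the conformations rather than generic random moves.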
A few years later, I took part in the development of a novel metaheuristic framework inspired by the behavior
of a monkey climbing trees of solutions with the aim of finding good-quality solution approximations (more details about
this metaheuristic can be found below). We named this method Monkey Search (MS). Rather than a completely new metaheuristic
approach, MS can be seen as a general framework (describing the monkey behavior) in which other ideas and strategies for global
optimization can be implemented.
In the context of data mining, we have been working on techniques for supervised biclustering. We proposed a bilevel
reformulation of the associated optimization problem, where the inner problem is linear, and its solutions are also solutions to the
original problem. In order to solve the bilevel problem, we have developed an algorithm that performs a Variable Neighborhood Search
(VNS) in the space defined by the outer problem, while the inner problem is solved exactly by CPLEX at each VNS iteration.
More recently, we proposed the environment as a new general concept in metaheuristics. As of today, we have tested this novel
idea in conjunction with only one metaheuristic: Ant Colony Optimization (ACO). In ACO, artificial ants move in a mathematical space
by following the pheromone trails left by ants that previously explored the same region of the search space. Our environment perturbs
the ants' perception of the pheromone, so that they avoid getting trapped at local optima of the considered problem. In a recent work,
we used our ACO with environmental changes for InterCriteria Analysis. Work is in progress on a theoretical study of this new
metaheuristic concept, as well as on testing this novel idea with other metaheuristics.
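A cartoon of the idea, for a single ant decision: in standard ACO, edge k is chosen with probability proportional to tau_k^alpha * eta_k^beta (pheromone and heuristic visibility). Below, the environment is rendered, purely illustratively, as a multiplicative perturbation of the *perceived* pheromone; the way the environment acts in our actual work is more structured than this:

```python
import random

def choose_edge(taus, etas, alpha=1.0, beta=2.0, env_noise=0.3):
    """One ant decision step. taus: pheromone on each edge; etas:
    heuristic visibility. The environment perturbs the perceived
    pheromone, which can push ants away from over-reinforced edges."""
    perceived = [t * random.uniform(1.0 - env_noise, 1.0 + env_noise)
                 for t in taus]
    weights = [p**alpha * e**beta for p, e in zip(perceived, etas)]
    return random.choices(range(len(taus)), weights=weights)[0]
```

Even with the perturbation, heavily reinforced edges remain the most likely choice on average; the noise only prevents the colony from committing irrevocably to them.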
Proteins described by a mathematician.
Proteins are heteropolymers. They are formed by smaller molecules called amino acids, which are bonded
together to form a sort of "chain".
Living organisms contain plenty of proteins, which perform several functions of vital importance.
Each amino acid along the protein chain contains a carbon atom called C_{α},
which is chemically bonded to three different aggregates of atoms.
Two of these aggregates are common to every amino acid: the amino group (NH_{2}) and the carboxyl group (COOH).
Only one aggregate of atoms makes each amino acid different from the others: the so-called side chain R.
Only 20 amino acids are involved in protein synthesis.
These molecules are under genetic control: the amino acid sequence of a protein is encoded in the
corresponding gene of the DNA molecule. The three-dimensional conformation of a protein molecule is of great interest,
because it defines the protein's dynamics in the cell, and therefore its function.
Different properties of the side chains allow for classifying the amino acids on the basis of their features.
A very important feature of the amino acids is the polarity of their side chains R. A side chain is polar if
it has a global or local electric charge. Oppositely charged molecules attract each other, while molecules with
the same charge repel each other. Since the protein solvent, i.e. water, is polar and globally uncharged,
there is attraction between the water molecules and the polar R groups. It follows that the amino acids having a polar side
chain are hydrophilic, because they tend to interact with the solvent. The other amino acids, which are called hydrophobic,
end up buried inside the protein as a consequence of the behavior of the hydrophilic amino acids. This feature of the amino
acids plays a fundamental role in the stabilization of protein molecules. The attractive and repulsive van der Waals forces
are important as well. They can be modeled, for instance, by the Lennard-Jones potential (see below).
Click here
for more information on the amino acids and their properties.
Each amino acid of the protein chain is bonded to the previous and the next one through a peptide bond. More precisely,
the carboxyl group of the i^{th} amino acid is bonded to the amino group of the (i+1)^{th} amino acid,
i.e. a covalent bond links the C atom of the first group to the N atom of the second one.
The peptide bond is strong enough to force every atom between two consecutive C_{α} atoms to lie on the same plane.
Therefore, the movement possibilities of the amino acids along the chain are constrained.
More information about proteins and the computational methods for finding their stable conformations can be found on
this web page
on the
Arnold Neumaier web site.
Experimentally determined protein conformations can be downloaded from the
Protein Data Bank
(PDB), that currently contains about 152000
entries.
Why are we so interested in proteins?
C. M. Dobson, A. Sali and M. Karplus wrote:
"An understanding of folding is important for the analysis of many events involved in cellular regulation,
the design of proteins with novel functions, the utilization of sequence information from the various genome
projects, and the development of novel therapeutic strategies for treating or preventing debilitating human diseases
that are associated with the failure of proteins to fold correctly".
Analyzing and simulating the conformation of protein molecules.
At the beginning of my career, I performed many analyses of protein conformations, paying particular attention
to their geometric features. I also worked on a model for protein simulations which is mainly based
on geometric properties of protein conformations. Unlike other models for protein prediction, this
model does not rely on the chemical and physical forces involved in the folding process, but rather attempts to simulate
protein conformations by exploiting their geometric properties. Computational experiments showed that the geometric properties
of protein conformations play an important role in the folding process. Indeed, among the conformations that were simulated,
there are some which are "very similar" to the ones protein molecules can actually assume in nature.
Exploring the optimal conformations of clusters of molecules.
Finding the ground-state conformations of clusters of molecules is one of the most difficult global optimization problems.
Such clusters are governed by potential energy functions, such as the Lennard-Jones and the Morse potentials.
The Lennard-Jones potential is very simple but sufficiently accurate for noble gases, and it also provides approximations for the structures of nickel and gold clusters.
The Morse potential has a parameter which determines the width of the potential well; Morse clusters show a wide variety of structural behaviors
as a function of this parameter. Both potential functions have a large number of local minima in which optimization methods may get stuck.
For instance, a Lennard-Jones cluster formed by 13 atoms has about 1000 local minima, and the number of local minima increases exponentially
with the size of the cluster. The optimal Lennard-Jones and Morse clusters usually exhibit particular geometric motifs,
which can be exploited during the optimization process. The
Cambridge Cluster Database
collects the best clusters obtained to date. Some of them may just be local minima:
the database is open to newly generated clusters having a lower energy function value.
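Both energy functions are easy to write down; in reduced units (unit well depth, unit equilibrium or atomic diameter), the objective to be minimized over the atomic positions is:

```python
import math

def lennard_jones(positions, eps=1.0, sigma=1.0):
    """Total Lennard-Jones energy of a cluster:
    sum over all pairs of 4*eps*((sigma/r)^12 - (sigma/r)^6)."""
    E = 0.0
    for i in range(len(positions)):
        for j in range(i + 1, len(positions)):
            r = math.dist(positions[i], positions[j])
            E += 4.0 * eps * ((sigma / r)**12 - (sigma / r)**6)
    return E

def morse(positions, rho=6.0):
    """Total Morse energy in reduced units (equilibrium distance 1):
    sum over all pairs of e^{rho(1-r)} * (e^{rho(1-r)} - 2), where rho
    controls the width of the potential well."""
    E = 0.0
    for i in range(len(positions)):
        for j in range(i + 1, len(positions)):
            t = math.exp(rho * (1.0 - math.dist(positions[i], positions[j])))
            E += t * (t - 2.0)
    return E

# A single Lennard-Jones pair attains its minimum energy -eps
# at the distance r = 2^(1/6) * sigma:
print(lennard_jones([(0.0, 0.0, 0.0), (2**(1/6), 0.0, 0.0)]))
```

The difficulty is entirely in the landscape: as soon as a few atoms are added, these sums develop an enormous number of local minima.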
Climbing trees like a monkey.
A monkey can survive in a forest of trees because it is able to focus its search for food where it has already found good food to eat.
During my period of work in the United States,
Onur Seref
and I proposed a novel method for global optimization which is inspired by the behavior of a monkey.
We named this method Monkey Search. The trees the monkey explores contain solutions to the optimization problem to be solved,
and the branches of the trees represent perturbations able to transform one solution into another. At each step,
the monkey chooses the branches to climb by applying a random mechanism, which is based on the objective function values of
the corresponding new solutions. The monkey prefers better solutions, but it is not forced to choose them.
Rather than a completely new metaheuristic approach for global optimization, Monkey Search can be considered a general framework
(describing the monkey behavior) in which other ideas and strategies for global optimization are also employed. Perturbations of
candidate solutions can be taken from other optimization methods (such as Genetic Algorithms, Harmony Search, Ant Colony Optimization, etc.),
and others can be inspired by the optimization problem itself. This leads to a wide range of variants of this metaheuristic search.
We implemented Monkey Search in the C programming language for solving particular classes of optimization problems,
such as the problem of finding optimal Lennard-Jones and Morse clusters, as well as for solving the distance geometry problem (see above).
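The mechanism described above can be sketched compactly; this is an illustrative toy version (the branching factor, climb height and weighting rule are arbitrary choices), not our C implementation:

```python
import random

def monkey_search(f, root, perturb, branches=3, climbs=100, height=10):
    """Toy Monkey Search: from the current solution the monkey generates
    a few perturbed 'branches', then climbs to one of them chosen at
    random with a bias toward better objective values (it prefers better
    solutions but is not forced to take them); the best solution found is
    memorized, and each new climb restarts from it."""
    random.seed(0)                      # fixed seed, for reproducibility
    best, fbest = root, f(root)
    for _ in range(climbs):
        x = best                        # start each climb from the best found
        for _ in range(height):
            cand = [perturb(x) for _ in range(branches)]
            vals = [f(c) for c in cand]
            lo = min(vals)
            # weights decrease with the objective value, but every
            # branch keeps a nonzero probability of being climbed
            weights = [1.0 / (1.0 + v - lo) for v in vals]
            k = random.choices(range(branches), weights=weights)[0]
            x = cand[k]
            if vals[k] < fbest:
                best, fbest = x, vals[k]
    return best, fbest
```

In the framework spirit described above, `perturb` is the plug-in point: it can implement moves borrowed from Genetic Algorithms, Harmony Search, or moves tailored to the problem at hand.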
Alla Kammerdiner
implemented the Monkey Search for solving the Multidimensional Assignment Problem (MAP).