Memetic Gradient Search
Boyang Li, Yew-Soon Ong, Minh Nghia Le and Chi Keong Goh

Abstract—This paper reviews different gradient-based schemes and the sources of gradients, their availability, precision and computational complexity, and explores the benefits of using gradient information within a memetic framework in the context of continuous parameter optimization, which is labeled here as Memetic Gradient Search. In particular, we considered a quasi-Newton method with analytical gradients and with finite differencing, as well as simultaneous perturbation stochastic approximation, as the local searches. An empirical study on the impact of using gradient information showed that Memetic Gradient Search outperformed the traditional GA and that analytical, precise gradients bring considerable benefit to gradient-based local search (LS) schemes. Although gradient-based searches can sometimes become trapped in local optima, the memetic gradient searches were still able to converge faster than the conventional GA.
I. INTRODUCTION
Optimization problems often involve finding one or more optima of some objective function $f(\mathbf{x}): \mathbb{R}^D \rightarrow \mathbb{R}$, $D \geq 1$. In real-world engineering problems, the
objective function of interest is usually multimodal with high
dimensionality. The cost of one measurement of the function
value is not limited to computational effort, but may also
include human labor, time, fuel and many others. Any
attempts to search exhaustively will be prohibitively
expensive due to the huge domain space of these functions,
yet the multimodality implies that a search too narrow in
scope risks mistaking a local optimum for the global optimum.
The design of such optimization algorithms has consequently
become another optimization problem, minimizing the cost,
which is usually bounded by the number of objective function
evaluations, while maximizing chances of finding the global
optimum. Throughout the paper, we only consider
minimization problems as maximization can be trivially
converted to minimization.
Over the years, researchers aiming to tackle these problems
have developed various algorithms, of which the origin may
be traced back to the time of Isaac Newton. Generally,
optimization algorithms may be classified into deterministic
methods, or conventional numerical methods, and stochastic
methods, depending on whether or not the result is totally
determined at the inception of the optimization. Depending
on the scope of the search, they may also be roughly classified
into global searches and local searches.
Manuscript received December 15, 2007.
Boyang Li, Yew-Soon Ong, and Minh Nghia Le are with the School of
Computer Engineering, Nanyang Technological University, Singapore,
639798. (e-mail: libo0001@ntu.edu.sg, asysong@ntu.edu.sg,
lemi0005@ntu.edu.sg)
Chi Keong Goh is with the Data Storage Institute, Agency for Science,
Technology and Research, Singapore (e-mail: ckgoh@nus.edu.sg)
A. Deterministic Search
Deterministic methods can be further classified into direct
searches, or gradient-free algorithms, and gradient-based
methods. In many methods of both classes, a line search is
employed although the heuristics that determine the direction
and step size vary. The strengths of the conventional
numerical methods lie in fast convergence. However, their
scope of exploration is usually limited compared to the whole
objective space. Many methods can be “trapped” in the first
local optimum they find and will not be able to improve
further. Since these methods are completely deterministic, they are bound to produce poor results if the starting position is unfortunate. Thus they are commonly classified as local searches.
Direct search methods depend on no information other than measurements of the objective function; no assumptions of differentiability are made. The earliest direct search is now known as the coordinate strategy, or the Gauss-Seidel strategy among many other names [1-3]. This particular strategy singles out one dimension at a time and performs a line search along that dimension. An important variation of this strategy, the Davies-Swann-Campey method, initially proposed by Swann [4], combined the idea of the coordinate strategy with Rosenbrock's idea of rotating coordinates [5], so that the line search is not restricted to directions parallel to the coordinate axes.
There are quite a number of gradient-based local searches,
each having numerous variants. The most widely used
include the method of steepest descent[6], conjugate
gradients[7], Newton’s method and a class of quasi-Newton
methods. While steepest descent and conjugate gradients
make use of only the gradient, Newton's method requires the costly computation of the inverse of the Hessian matrix $\mathbf{H}^{-1}$. Quasi-Newton methods were introduced to estimate the Hessian matrix from first-order data and simplify the computation. These methods assume, of course, that the objective function is twice differentiable for Newton's method and once differentiable for the other methods. The interested reader is referred to the excellent books by Schwefel [8] and Ruszczyński [9] for detailed discussions.
B. Evolutionary Computing
Evolutionary computing, including genetic algorithms[10],
evolutionary strategies[11, 12], and genetic programming
[13], was inspired by the biological process in which the genes of living creatures are optimized by the force of natural selection. In contrast to conventional numerical methods, genetic algorithms are able to explore the objective space more extensively, but they usually take more computational effort and reach the optima with less precision. Hybrid methods
that combine genetic algorithms with local searches have
been proposed so as to perform efficient global searches and
take the merits from the two distinct approaches. These
methods are usually named memetic algorithms (MAs)[14]
due to the resemblance of local searches to the evolution of
culture in human societies, or the evolution of memes[15].
Memetic algorithms have been successfully applied in many disparate engineering fields, ranging from microarray gene selection [16], aerodynamic design [17], and drug therapy design [18] to assignment problems [19].
In addition to gradient-based LSes in MAs, other means of utilizing gradients in traditional genetic algorithms or evolutionary strategies have also been proposed. Salomon [20]
pointed out the similarities between finite differencing (FD)
and Evolutionary Strategies and consequently introduced a
new mutation operator based on approximated gradients in
his Evolutionary Gradient Search.
In this paper, we studied the impact of gradients on
memetic search performance. In Section II we review various
gradient-based local search methods, from steepest descent to
quasi-Newton methods. Section III discusses sources of
gradients as well as their possible influence in a memetic
algorithm. Finally, in Section IV, we present an empirical
study comparing four algorithms, including a simple genetic
algorithm, a MA utilizing gradients computed with FD, a MA
utilizing analytical gradients and a MA with Simultaneous
Perturbation Stochastic Approximation (SPSA) as the local
search.
II. GRADIENT-BASED LOCAL SEARCHES
A large number of widely-used deterministic algorithms
depend on gradient information of the objective function to
perform optimization. There are, for example, steepest
descent, Newton’s method, and a variety of quasi-Newton
methods. Unlike direct searches, a gradient-based method must establish or approximate a relationship between the input and output of the function. While some methods, like steepest descent, rely only on first-order derivatives, Newton's method requires second-order information. Thus it must be assumed that the objective function can be differentiated at least twice for application of Newton's method and at least once for the other methods.
One-dimensional line searches are usually employed
iteratively in deterministic methods. Given a pre-defined search direction $\mathbf{d}_k$, the line search seeks the optimal step length $\lambda_k$:

$\lambda_k = \arg\min_{\lambda} f(\mathbf{x}_k + \lambda \mathbf{d}_k)$
Common strategies include Fibonacci division [21], golden section [21, 22], and so forth. Although the methods of steepest descent and conjugate gradient require precise one-dimensional searches, Newton's method and the quasi-Newton methods only need imprecise searches that guarantee a sufficient decrease [23].
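To make the line-search step concrete, the following is a minimal golden-section sketch in Python/NumPy (our own illustrative naming, not code from the experiments reported here; f is the objective, x the current point, d the search direction, and the optimal step length is assumed to be bracketed in [0, lam_max]):

import numpy as np

def golden_section_step(f, x, d, lam_max=1.0, tol=1e-6):
    """Approximately minimize f(x + lam * d) for lam in [0, lam_max]."""
    phi = (np.sqrt(5.0) - 1.0) / 2.0          # golden ratio conjugate, ~0.618
    a, b = 0.0, lam_max
    c1 = b - phi * (b - a)
    c2 = a + phi * (b - a)
    f1, f2 = f(x + c1 * d), f(x + c2 * d)
    while b - a > tol:
        if f1 < f2:                            # minimum lies in [a, c2]
            b, c2, f2 = c2, c1, f1
            c1 = b - phi * (b - a)
            f1 = f(x + c1 * d)
        else:                                  # minimum lies in [c1, b]
            a, c1, f1 = c1, c2, f2
            c2 = a + phi * (b - a)
            f2 = f(x + c2 * d)
    return 0.5 * (a + b)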
A. Steepest Descent
One of the simplest gradient-based LS is the method of
steepest descent introduced by Curry[6], which follows from
the fact that the gradient vector points to the direction of
fastest increase in the function value. The search points are
iteratively generated by stepping in the opposite direction of
the gradient. Formally, letting $\mathbf{x}_k$ denote the current search point, we have

$\mathbf{x}_{k+1} = \mathbf{x}_k - \lambda_k \nabla f(\mathbf{x}_k)$

where the step length $\lambda_k$ is obtained by a line search.
Although steepest descent is easy to implement, it is
commonly considered slow as the locally optimal direction
often does not coincide with the optimal direction in the
bigger picture. A zigzag path can often be observed. The
convergence rate is no better than first order except for some
special cases like concentric contours[24].
Some variants[25, 26] of the basic steepest descent involve
deriving the current direction from historical search points in
order to reduce the zigzag behavior and accelerate
convergence.
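A minimal steepest-descent loop, again in Python/NumPy and under our own naming (grad returns the gradient and line_search an approximate step length, such as the golden-section routine sketched earlier), might look as follows:

import numpy as np

def steepest_descent(f, grad, x0, line_search, max_iter=100, tol=1e-8):
    """Iteratively step opposite to the gradient until it (nearly) vanishes."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:           # stationary point reached
            break
        lam = line_search(f, x, -g)           # step length along -g
        x = x - lam * g
    return x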
B. Conjugate gradient
In the method of conjugate gradient [7], the search directions are a set of $\mathbf{A}$-conjugate vectors. If a symmetric matrix $\mathbf{A}$ is positive definite, non-zero vectors $\mathbf{d}_0, \mathbf{d}_1, \ldots, \mathbf{d}_{D-1}$ are said to be $\mathbf{A}$-conjugate if

$\mathbf{d}_i^T \mathbf{A} \mathbf{d}_j = 0, \quad i \neq j$

It can be shown that all these vectors are linearly independent. With first-order derivatives, we may compute the direction in the $k$th iteration, $\mathbf{d}_k$, with the following formula:

$\mathbf{d}_k = -\nabla f(\mathbf{x}_k) + \beta_k \mathbf{d}_{k-1}$

where, in the Fletcher-Reeves form,

$\beta_k = \frac{\nabla f(\mathbf{x}_k)^T \nabla f(\mathbf{x}_k)}{\nabla f(\mathbf{x}_{k-1})^T \nabla f(\mathbf{x}_{k-1})}$

with the initialization $\mathbf{d}_0 = -\nabla f(\mathbf{x}_0)$. It has been shown to be beneficial to reinitialize the direction vector periodically so as to avoid possible linear dependence among the generated direction vectors [27].
The power of conjugate gradient lies in the fact that $f$ will probably become separable in the new coordinate system, so each dimension can be minimized individually. For a quadratic function of $D$ dimensions, conjugate gradient needs at most $D$ steps to converge.
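A nonlinear conjugate-gradient sketch using the Fletcher-Reeves coefficient and periodic restarts is shown below; it is a hypothetical illustration, not the implementation used in this paper:

import numpy as np

def conjugate_gradient(f, grad, x0, line_search, max_iter=100, tol=1e-8):
    """Nonlinear CG with the Fletcher-Reeves beta and periodic restarts."""
    x = np.asarray(x0, dtype=float)
    g = grad(x)
    d = -g
    n = x.size
    for k in range(max_iter):
        if np.linalg.norm(g) < tol:
            break
        lam = line_search(f, x, d)
        x = x + lam * d
        g_new = grad(x)
        if (k + 1) % n == 0:                   # periodic restart avoids dependence
            d = -g_new
        else:
            beta = g_new.dot(g_new) / g.dot(g) # Fletcher-Reeves coefficient
            d = -g_new + beta * d
        g = g_new
    return x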
C. Newton’s method
Newton’s method, or the Newton-Raphson method, was
first introduced as a method to solve equations iteratively. For
a simple equation $g(x) = 0$ and the current position $x_k$ in the proximity of the solution, the next search point is taken as the intercept of the tangent line at $x_k$ and the $x$-axis. More formally,

$x_{k+1} = x_k - \frac{g(x_k)}{g'(x_k)}$

As an optimization method, the equation to be solved is simply changed to $\nabla f(\mathbf{x}) = \mathbf{0}$. Substituting the corresponding terms, we get:

$\mathbf{x}_{k+1} = \mathbf{x}_k - \mathbf{H}(\mathbf{x}_k)^{-1} \nabla f(\mathbf{x}_k)$

where $\mathbf{H}(\mathbf{x}_k)$ denotes the Hessian matrix at $\mathbf{x}_k$.
For a quadratic objective function with a positive-definite
Hessian, a local optimum can be reached with Newton’s
method in one step. Nonetheless, the fast convergence is achieved at a high cost, chiefly the effort needed to compute the Hessian; moreover, the search direction can be distorted by errors introduced by finite differencing. This inconvenience inspired researchers to look for alternative ways to approximate the Hessian or its inverse, and led to the quasi-Newton methods.
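As a small sketch, one Newton step can be written as a linear solve rather than an explicit inversion; grad and hess are assumed callables returning the gradient vector and Hessian matrix:

import numpy as np

def newton_step(grad, hess, x):
    """One Newton-Raphson step: solve H(x) p = -grad f(x), then move to x + p."""
    g = grad(x)
    H = hess(x)
    p = np.linalg.solve(H, -g)   # avoids forming the explicit inverse Hessian
    return x + p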
D. Quasi-Newton methods
Davidon[28] originated the first method in the
quasi-Newton family. It was later refined by Fletcher and
Powell[29] and is now called the Davidon-Fletcher-Powell
method (DFP). The search direction in DFP is given by:

$\mathbf{d}_k = -\mathbf{B}_k \nabla f(\mathbf{x}_k)$

where $\mathbf{B}_k$ is an approximation to the inverse Hessian, updated with

$\mathbf{B}_{k+1} = \mathbf{B}_k + \frac{\mathbf{s}_k \mathbf{s}_k^T}{\mathbf{s}_k^T \mathbf{y}_k} - \frac{\mathbf{B}_k \mathbf{y}_k \mathbf{y}_k^T \mathbf{B}_k}{\mathbf{y}_k^T \mathbf{B}_k \mathbf{y}_k}$

and $\mathbf{s}_k = \mathbf{x}_{k+1} - \mathbf{x}_k$, $\mathbf{y}_k = \nabla f(\mathbf{x}_{k+1}) - \nabla f(\mathbf{x}_k)$.
The Broyden-Fletcher-Goldfarb-Shanno or BFGS [30-33] method is slightly different. However, both of the methods belong to the quasi-Newton class characterized by the Broyden family formula

$\mathbf{B}_{k+1} = \mathbf{B}_k + \frac{\mathbf{s}_k \mathbf{s}_k^T}{\mathbf{s}_k^T \mathbf{y}_k} - \frac{\mathbf{B}_k \mathbf{y}_k \mathbf{y}_k^T \mathbf{B}_k}{\mathbf{y}_k^T \mathbf{B}_k \mathbf{y}_k} + \omega \left( \mathbf{y}_k^T \mathbf{B}_k \mathbf{y}_k \right) \mathbf{v}_k \mathbf{v}_k^T$

where

$\mathbf{v}_k = \frac{\mathbf{s}_k}{\mathbf{s}_k^T \mathbf{y}_k} - \frac{\mathbf{B}_k \mathbf{y}_k}{\mathbf{y}_k^T \mathbf{B}_k \mathbf{y}_k}$

Obviously, the DFP formula is the special case where $\omega = 0$, while in BFGS $\omega = 1$. It can be shown that $\mathbf{B}_{k+1}$ is positive definite as long as $\mathbf{B}_k$ is positive definite, for all $\omega \in [0, 1]$, if all computations and line searches are exact. In real-world applications, it is beneficial to periodically reset $\mathbf{B}_k$ to the identity matrix $\mathbf{I}$, so as to prevent loss of positive-definiteness [34].
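For illustration, the BFGS case ($\omega = 1$) of the inverse-Hessian update can be written compactly as follows; resetting to the identity when the curvature condition $\mathbf{s}_k^T \mathbf{y}_k > 0$ fails is one simple way to preserve positive-definiteness (an illustrative sketch, not the exact safeguard used in our experiments):

import numpy as np

def bfgs_update(B, s, y):
    """BFGS update of the inverse-Hessian approximation B.

    s = x_{k+1} - x_k, y = grad f(x_{k+1}) - grad f(x_k).
    Corresponds to omega = 1 in the Broyden family above.
    """
    sy = s.dot(y)
    if sy <= 1e-12:                      # curvature condition violated: reset to identity
        return np.eye(len(s))
    rho = 1.0 / sy
    I = np.eye(len(s))
    V = I - rho * np.outer(s, y)
    return V.dot(B).dot(V.T) + rho * np.outer(s, s)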
III. MEMETIC ALGORITHM
Having seen how remarkably creatures adapt to their environment, one may reasonably conclude that nature is probably the most powerful optimizer. The astonishing
elegance of the ways natural creatures solve their daily
problems inspired researchers to mimic evolution, nature’s
way of optimization, in their algorithms. These algorithms
simulate the natural selection process and blind mutation and
crossover of genes in the hope that the genes will be
optimized, in a sense, “automatically”.
In the 1960s, Holland invented the genetic algorithm (GA) [10], and Rechenberg and Schwefel invented evolutionary strategies [11, 12]. In 1985 Cramer [13] conceived genetic programming (GP), in which individuals are represented by tree structures, and Koza [35] refined it later. The three methods differ mostly in the representation of individuals and in the use of operators such as crossover and mutation. Nevertheless, they share more or less a common algorithmic sequence. The three methods are now sometimes grouped together under the collective term evolutionary computing. More recent developments include Ferreira's Gene Expression Programming [36], which is alleged to be a closer metaphor to real biological processes.
Evolutionary computing, like the natural processes it mimics, is stochastic. It is able to reach the global optimum with high probability and at a lower cost than a multi-start deterministic search. However, while evolutionary computing is generally good at identifying promising regions that are worth exploring further, it may take time to converge within those regions and may not converge to precisely the optimum. On the other hand, deterministic methods converge rapidly near optima. Memetic algorithms that hybridize evolutionary computing and deterministic LSes are thus proposed to perform efficient searches with good precision.
Based on the way results of local searches are used,
memetic algorithms can be categorized into Baldwinian and
Lamarckian. In a Baldwinian MA, the results are merely used
in the selection of individuals. In a Lamarckian MA, the
results are also translated back to the genotypes. The
Lamarckian approach is used in most recent works.
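A minimal Lamarckian MA skeleton is sketched below; it is not the exact procedure of Fig. 1 referenced in Section IV, and fitness, select, crossover, mutate and local_search are placeholder operators:

import random

def lamarckian_ma(init_pop, fitness, select, crossover, mutate,
                  local_search, generations=100, ls_prob=0.1):
    """Minimal Lamarckian memetic algorithm skeleton (minimization)."""
    population = list(init_pop)
    for _ in range(generations):
        offspring = []
        while len(offspring) < len(population):
            p1, p2 = select(population, fitness), select(population, fitness)
            child = mutate(crossover(p1, p2))
            if random.random() < ls_prob:
                child = local_search(child)     # Lamarckian: refined genotype is kept
            offspring.append(child)
        population = offspring
    return min(population, key=fitness)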
A. Sources of Gradients
While the exact search strategy may vary, the gradient information many methods depend on does not differ, except for truncation and rounding errors. In this section we look at different sources of gradients, their availability and their cost.
The expense we pay for gradients is generally governed by our knowledge about the objective function. The most general and widely used method is finite differencing, which requires no information other than measurements of the objective function. However, FD can be expensive. If more information about the objective function is available, for example the algorithm used to derive the function value, it can be exploited so that the gradient can be obtained at a much lower cost. In some special cases the gradient may even be
readily available. Techniques to compute gradients range from automatic differentiation and adjoint Euler solvers to symbolic differentiation. Even when little information is
known, we might still reduce the number of function
measurements at the expense of accuracy of the gradient,
thanks to stochastic gradient methods such as SPSA.
In the most special case, analytical gradient is readily
available. A good example is the Lennard-Jones cluster
problem, which aims to find the optimal positions of a cluster
of atoms so that the potential energy of the atoms is
minimized. The problem has been so well studied that one can easily find the analytical gradient in the relevant literature. For some other problems, the objective function may be completely specified in formula form while the derivative is not; this knowledge enables us to use computer-assisted symbolic differentiation or even analytical derivation by hand.
In cases where the function is so complex that manual
derivation or symbolic differentiation is unbearably slow, or
the function is not known in exact formula form but only
algorithmically, i.e. as a series of steps, for instance, in
complex simulations, automatic differentiation may be used.
Automatic differentiation (AD) is superior to finite
differencing because it generates numerical results as precise
as one might obtain with analytical gradients. Besides, the
results may be generated with reasonable cost as the
computational complexity in the reverse mode is independent
of the dimensionality of the inputs.
Automatic differentiation is a transformation of the algorithm deriving the function value into an algorithm deriving its derivative. See [37] for a history of automatic differentiation. AD includes two strategies, or modes: the forward mode and the reverse mode. For our function of interest $f: \mathbb{R}^D \rightarrow \mathbb{R}$, the reverse mode requires less computational cost than the forward mode and is preferred. However, the reverse mode requires the storage of intermediate results and may pose a challenge to memory capacity when used with very complex algorithms.
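As a small illustration of the idea (forward mode only, since the reverse mode additionally requires recording the computation graph), a dual-number class propagates exact derivatives alongside function values; the Dual class below is purely our own illustrative code:

class Dual:
    """Forward-mode AD via dual numbers: a value plus a derivative component."""
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot
    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val + other.val, self.dot + other.dot)
    __radd__ = __add__
    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val * other.val,
                    self.val * other.dot + self.dot * other.val)
    __rmul__ = __mul__

# d/dx of f(x) = x*x + 3*x at x = 2 is 2*2 + 3 = 7
x = Dual(2.0, 1.0)                 # seed the input's derivative with 1
y = x * x + 3 * x
print(y.val, y.dot)                # prints: 10.0 7.0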
There are also cases where gradient is not available in
analytical form, but still can be obtained at a much lower cost
compared to finite differencing. A class of problems directly
taken from engineering practice is the optimization of airfoil
design. The optimization of airfoil shapes can be viewed as a
constrained optimization problem, where the constraints
appear as Euler or Navier-Stokes flow equations. Here, an
adjoint Euler solver, first used in [38], may be utilized to compute the gradients at only 1.5 to 2 times the cost of evaluating the objective function [39], which is cheap compared to finite differencing.
Nevertheless, there are cases where the function value is generated by processes or experiments external to our computer, such as outputs from external circuitry, and little is known about the input-output relationship. In such cases the function becomes a black box, and generally only finite differencing is applicable.
Finite differencing is derived directly from the definition of
partial derivatives. There are several ways to approximate a
local derivative. For the objective function $f(\mathbf{x}): \mathbb{R}^D \rightarrow \mathbb{R}$, let $\mathbf{u}_i$ denote the unit vector of the $i$th dimension, and $\hat{g}_i(\cdot)$ denote the corresponding approximated partial derivative. For a small positive perturbation $h$, the one-sided estimate (forward difference) at the point $\mathbf{x}$ can be expressed as

$\hat{g}_i(\mathbf{x}) = \frac{1}{h} \left[ f(\mathbf{x} + h\mathbf{u}_i) - f(\mathbf{x}) \right]$

We might also use the central difference, or the two-sided estimate:

$\hat{g}_i(\mathbf{x}) = \frac{1}{2h} \left[ f(\mathbf{x} + h\mathbf{u}_i) - f(\mathbf{x} - h\mathbf{u}_i) \right]$
If it can be assumed that the function is twice differentiable, from the expansion of the Taylor series we may obtain a second-order one-sided estimate:

$\hat{g}_i(\mathbf{x}) = \frac{1}{2h} \left[ 4 f(\mathbf{x} + h\mathbf{u}_i) - 3 f(\mathbf{x}) - f(\mathbf{x} + 2h\mathbf{u}_i) \right]$
Although the two-point estimate and the result yielded from the Taylor series generally produce more accurate results, the one-point estimate is usually preferred on the grounds of economy. However, if the objective function is rough or noise is present, using the central difference may reduce the noise level and achieve faster convergence [40].
Despite being universally applicable, finite differencing has its own drawbacks. Choosing the value of $h$ can be tricky: too great an $h$ will, by definition, yield an inaccurate gradient, while too small an $h$ makes the estimate susceptible to noise in the external system and to rounding errors. In addition, computing FD is expensive because it requires $D+1$ measurements for a $D$-dimensional objective function.
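A straightforward sketch of both schemes (forward differences costing D+1 evaluations, central differences costing 2D) is given below; h is left as a user-chosen parameter for the reasons just discussed:

import numpy as np

def fd_gradient(f, x, h=1e-6, central=False):
    """Approximate the gradient of f at x by finite differencing."""
    x = np.asarray(x, dtype=float)
    g = np.zeros_like(x)
    fx = f(x)                                  # reused by the forward scheme
    for i in range(x.size):
        u = np.zeros_like(x)
        u[i] = 1.0                             # unit vector of the i-th dimension
        if central:
            g[i] = (f(x + h * u) - f(x - h * u)) / (2.0 * h)
        else:
            g[i] = (f(x + h * u) - fx) / h
    return g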
When gradients are not cheaply available, it may be
advantageous to use stochastic gradient methods that
stochastically approximate the gradient of the function rather
than the combination of a deterministic method with gradient
obtained from finite differencing [41]. Although the gradient information acquired by stochastic approximation can be tightly coupled with the strategy of its utilization, we still classify these methods primarily as a source of gradients rather than as gradient-based methods. In the next few paragraphs we review the simultaneous perturbation stochastic approximation (SPSA) method.
SPSA is due to Spall [42, 43]. The approximated gradient can be expressed as follows:

$\hat{g}_i(\mathbf{x}_k) = \frac{f(\mathbf{x}_k + c_k \boldsymbol{\Delta}_k) - f(\mathbf{x}_k - c_k \boldsymbol{\Delta}_k)}{2 c_k \Delta_{k,i}}$

where the components of the perturbation vector $\boldsymbol{\Delta}_k$ are drawn from the Bernoulli $\pm 1$ distribution. The independent variable is updated as

$\mathbf{x}_{k+1} = \mathbf{x}_k - a_k \hat{\mathbf{g}}(\mathbf{x}_k)$

The decreasing gain sequences $a_k$ and $c_k$ are respectively determined by $a_k = a/(A+k)^{\alpha}$ and $c_k = c/k^{\gamma}$. For effective values of the constants $A$, $a$, $\gamma$, $\alpha$, and $c$, see [44].
For each search point, the algorithm requires only two evaluations of $f(\cdot)$; this is only $1/D$ of the computational effort required by central finite differencing for a $D$-dimensional problem.
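A single SPSA iteration can be sketched as follows; alpha and gamma take values commonly suggested in the SPSA literature, while a, c and A are placeholders that must be tuned to the problem (see [44]):

import numpy as np

def spsa_step(f, x, k, a=0.16, c=0.16, A=100.0, alpha=0.602, gamma=0.101):
    """One SPSA iteration: two evaluations of f regardless of dimensionality."""
    ak = a / (k + 1 + A) ** alpha              # decreasing step-size gain a_k
    ck = c / (k + 1) ** gamma                  # decreasing perturbation size c_k
    delta = np.random.choice([-1.0, 1.0], size=x.shape)   # Bernoulli +/-1 vector
    g_hat = (f(x + ck * delta) - f(x - ck * delta)) / (2.0 * ck * delta)
    return x - ak * g_hat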
IV. EMPIRICAL STUDIES
In this section we studied the characteristics of four algorithms based on the pseudo-code shown in Fig. 1, comprising three memetic algorithms and one traditional genetic algorithm, in order to illustrate the impact of gradients in MAs. In this study our focus is on the steps marked (6) and (7) in Fig. 1. The first memetic algorithm employed an analytical gradient that is assumed to be readily available (MA-AG); the cost of evaluating the gradient is deemed negligible. It is worth noting that gradients obtained by automatic differentiation are as precise as analytical gradients, only at a slightly higher cost. Consequently, characteristics exhibited by this particular algorithm can easily be extended to cases where AD is used, simply by multiplying the cost by a small constant.
The second algorithm employed gradients obtained from one-sided finite differencing (MA-FD). The first and second