The task

Ignoring exogenous data (assume that it is supplied when needed), our problem is to find

\[\operatorname{argmin}_x \, f(x)\]

where \(x\) is our set of \(n\) parameters and \(f()\) is a scalar real function of those parameters. The gradient of \(f(x)\) is the vector-valued function \[g(x) = \partial{f(x)}/\partial{x}.\] The Hessian of \(f(x)\) is the matrix of second partial derivatives \[H(x) = \partial^2{f(x)}/\partial{x^2}.\]

We will be considering methods for this problem where a gradient can be supplied, though many of the R tools will use a numerical approximation if needed.
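As a concrete sketch (my own illustrative code, not taken from any of the packages discussed), here is the classic Rosenbrock test function with its analytic gradient, together with the kind of forward-difference approximation that R tools fall back on when no gradient routine is supplied:

```r
# Rosenbrock test function f(x) and its analytic gradient g(x)
f <- function(x) 100 * (x[2] - x[1]^2)^2 + (1 - x[1])^2
g <- function(x) c(-400 * x[1] * (x[2] - x[1]^2) - 2 * (1 - x[1]),
                    200 * (x[2] - x[1]^2))

# Simple forward-difference approximation to the gradient
num_grad <- function(f, x, h = 1e-7) {
  sapply(seq_along(x), function(i) {
    xh <- x
    xh[i] <- xh[i] + h
    (f(xh) - f(x)) / h
  })
}

x0 <- c(-1.2, 1)
max(abs(num_grad(f, x0) - g(x0)))  # small, but well short of machine precision
```

The discrepancy illustrates why a supplied analytic gradient is generally preferable to a numerical one.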

General approaches

The so-called Newton method seeks an \(x\) that satisfies the first-order condition for an extremum, namely \(g(x) = 0\). At each iteration it solves the linearized equations

\[H(x) \delta = - g(x)\]

and then updates \(x\) to \((x + \delta)\).

The main objections to this approach are

  • the Hessian is generally expensive to compute, though because an accurate Hessian is so useful, I strongly recommend that workers examine whether it can be computed in a reasonable way;
  • it is necessary to apply safeguards on the computations and the size of \(\delta\) to avoid cases where the Hessian is near singular, poorly computed, or the assumptions of the method are violated.
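In code, a Newton iteration with two simple safeguards of the kind just mentioned can be sketched as follows. This is my own illustrative R, not the source of any R solver: it falls back to steepest descent when the Hessian solve fails or yields an uphill direction, and halves the step until the function value decreases.

```r
# Safeguarded Newton iteration (illustrative sketch only)
newton_min <- function(x, f, g, H, maxit = 100, tol = 1e-8) {
  for (it in seq_len(maxit)) {
    gr <- g(x)
    if (sqrt(sum(gr^2)) < tol) break
    # Solve the Newton equations H delta = -g; guard against singular H
    delta <- tryCatch(solve(H(x), -gr), error = function(e) -gr)
    if (sum(delta * gr) >= 0) delta <- -gr  # ensure a descent direction
    step <- 1                               # halve step until f decreases
    while (f(x + step * delta) >= f(x) && step > 1e-10) step <- step / 2
    x <- x + step * delta
  }
  x
}

# On a convex quadratic, a single full Newton step reaches the minimum
fq <- function(x) sum((x - c(1, 2))^2)
gq <- function(x) 2 * (x - c(1, 2))
Hq <- function(x) diag(2, 2)
newton_min(c(5, -5), fq, gq, Hq)  # essentially c(1, 2)
```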

There are four main approaches to simplifying the Newton method:

  • The variable metric or quasi-Newton methods use clever ways to improve an approximate Hessian or Hessian inverse while finding lower points \(x\) on the function surface, using only function and gradient values computed in the process.
  • Truncated Newton methods use a linear conjugate gradient method to inexactly solve the Newton equations, thereby reducing memory requirements.
  • Conjugate gradient methods aim to search "downhill" for better points on the functional surface using only recent gradient information. The traditional first search direction is that of steepest descents, namely, \(- g\), the negative gradient. After a line search selects a suitable step size, we have a new point, a new function value that is lower than the initial point, and a new gradient. Using this information and possibly the last search direction, we compute a new search direction that is somehow "conjugate" to the previous one. For a problem with \(n\) parameters, of course, there are only \(n\) independent search directions, so we need to restart the procedure on or before that step. Indeed there are a number of strategies for deciding when to restart the conjugacy cycle. And, of course, there are a number of choices for updating the search direction and for performing the line search.
  • Limited-memory BFGS methods use reduced-memory variants of the variable metric approach, but can also be thought of as extending conjugate gradient methods by using several gradients rather than only the most recent two.
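The conjugate gradient cycle described above can be sketched in a few lines of R. This is an illustrative Fletcher-Reeves variant with a crude backtracking line search and a restart on non-descent directions or every \(n\) steps; it is not the optim() "CG" code.

```r
# Illustrative nonlinear conjugate gradient (Fletcher-Reeves update)
cg_min <- function(x, f, g, maxit = 200, tol = 1e-8) {
  n <- length(x)
  gr <- g(x)
  d <- -gr                            # first direction: steepest descent
  for (it in seq_len(maxit)) {
    if (sqrt(sum(gr^2)) < tol) break
    step <- 1                         # crude backtracking line search
    while (f(x + step * d) >= f(x) && step > 1e-12) step <- step / 2
    x <- x + step * d
    gnew <- g(x)
    beta <- sum(gnew^2) / sum(gr^2)   # Fletcher-Reeves coefficient
    d <- -gnew + beta * d
    if (it %% n == 0 || sum(d * gnew) >= 0) d <- -gnew  # restart cycle
    gr <- gnew
  }
  x
}
```

Real implementations differ chiefly in the choice of the \(\beta\) update, the line search, and the restart strategy, exactly the degrees of freedom noted above.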

The histoRicalg project

The R Consortium awarded some modest funding for histoRicalg, a project to document and transfer knowledge of some older algorithms used by R and by other computational systems. These older codes are mainly in Fortran, but some are in C, with the original implementations possibly in other programming languages. The effort was prompted by finding some apparent bugs, which could stem either from the original programs or from their later translations. Two examples in particular – in nlm() and in optim–L-BFGS-B – motivated the project. We welcome interest and participation; the project repository is https://gitlab.com/nashjc/histoRicalg, where several aspects of this work are in various stages of completion.

The function optim() in base R has three of the optimizers from the 1990 edition of Compact Numerical Methods (J. C. Nash 1979), abbreviated CNM henceforth. Given the dates of their conception, these are an obvious focus for histoRicalg. The direct-search Nelder-Mead method (Nelder and Mead 1965) is the default solver, that is, method="Nelder-Mead" in the optim() call, and it does not require a gradient routine to be supplied. Indeed, Nelder-Mead will be used even if a gradient routine is included, unless some other method is specified. The choice method="BFGS" is the variable metric method of Fletcher (1970). The third choice from CNM is method="CG" for conjugate gradients, which is a combination of several similar approaches.
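The three solvers can be called directly for comparison. A minimal example on the Rosenbrock test function, with the traditional start point \((-1.2, 1)\):

```r
f <- function(x) 100 * (x[2] - x[1]^2)^2 + (1 - x[1])^2
g <- function(x) c(-400 * x[1] * (x[2] - x[1]^2) - 2 * (1 - x[1]),
                    200 * (x[2] - x[1]^2))

ans_nm   <- optim(c(-1.2, 1), f)                      # default: Nelder-Mead
ans_bfgs <- optim(c(-1.2, 1), f, g, method = "BFGS")  # variable metric
ans_cg   <- optim(c(-1.2, 1), f, g, method = "CG")    # conjugate gradients
ans_bfgs$par   # close to the solution c(1, 1)
```

Each result carries a convergence code (0 for apparent success) that is worth checking before trusting the answer.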

In this article on the provenance of the methods, the details will be discussed only in general terms. However, it is critical to get those details right.

Provenance of the R optim–BFGS solver

If the source code for base R is in a directory named R-X.Y.Z (in my case 3.5.1 when this vignette was started), then the calling routine for optim() is src/library/stats/R/optim.R, but this uses the .External2 method to call src/library/stats/src/optim.c, where the function vmmin is called. However, the source for vmmin is in src/appl/optim.c. I will venture that having two files of the same name in different directories invites error.

I will use optim–BFGS when using the call method="BFGS" in optim(), and similar abbreviations for the other methods. To use optim–BFGS, the relevant code is a C routine vmmin. In src/appl/optim.c, above this routine is the comment

/*  BFGS variable-metric method, based on Pascal code
in J.C. Nash, `Compact Numerical Methods for Computers', 2nd edition,
converted by p2c then re-crafted by B.D. Ripley */

As author of this work, I can say that the code used is Algorithm 21 of (J. C. Nash 1979). In the First Edition, this was a step-and-description code which was replaced by Turbo Pascal in the Second Edition. The methods were worked out mainly on a Data General Nova which had partitions of between 3.5 K and 8 K bytes accessible via a 10 character per second teletype. The floating point available had a 24 bit mantissa, in a single level of precision with (as I recall) no guard digit. Programming was in a fairly early and simple form of BASIC.

This machine was at Agriculture Canada. In the late 1970s, it was replaced with a Data General Eclipse, but the facilities were largely the same, except that the floating point went to 6 hexadecimal digits with no guard digit. I also implemented some codes on HP 9830 and Tektronix 4051 "programmable calculators". A Fortran translation was made as NASHLIB, which ran mainly on IBM System 360 class computers, but was also tested at least partially on Univac 1108, ICL 1906A and Xerox-Honeywell CP-6 machines. The codes were distributed as Fortran source code. See https://gams.nist.gov/cgi-bin/serve.cgi/Package/NASHLIB. I have been informed that the servers supporting this link will not be replaced when they fail.

NASHLIB dates from the 1979-1981 period, though the methods were developed in the mid-1970s at Agriculture Canada to support economic modelling. In 1975, I received an invitation (from Brian Ford) to collaborate with the Numerical Algorithms Group in Oxford, and Agriculture Canada generously allowed me two 6-week periods to do so. During the second of these, in the tail end of the hurricane of January 1976, I drove to Dundee to meet Roger Fletcher. He took an interest in the concept of a program that used minimal memory. We found a printout of the Fortran for the method in (Fletcher 1970), and with a ruler and pencil, Roger and I scratched out the lines of code that were not strictly needed. As I recall, the main deletion was the cubic interpolation line search. The resulting method simply used backtracking with an acceptable point condition.

On my return to Ottawa, I coded the method in BASIC and made two small changes:

  • I managed to re-use one vector, thereby reducing the storage requirement to a square matrix and five vectors of length \(n\), the number of parameters.
  • The very short floating-point precision seemed to give early termination on some problems. Often this was due to failure of the update of the approximate inverse Hessian when an intermediate quantity indicated the approximate Hessian was not positive definite. As a quick and dirty fix, I simply checked whether the current search was a steepest descent. If so, the method is terminated; if not, the inverse Hessian approximation is reset to the unit matrix and the cycle restarted. This "quick fix" has, of course, become permanent. My experience over the years suggests it is a reasonable compromise, but the possibility of better choices is still open.
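The safeguard just described can be made concrete. The sketch below is my own condensed R, not the vmmin source: when the curvature quantity \(s^T y\) is not positive, the update would destroy positive definiteness, so the inverse Hessian approximation is reset to the identity, unless the failing search was already a steepest descent, in which case the method terminates.

```r
# Condensed BFGS loop with the "reset or terminate" safeguard (illustrative)
bfgs_sketch <- function(x, f, g, maxit = 200, tol = 1e-8) {
  n <- length(x)
  B <- diag(n)                          # inverse Hessian approximation
  gr <- g(x)
  for (it in seq_len(maxit)) {
    if (sqrt(sum(gr^2)) < tol) break
    steepest <- all(B == diag(n))       # is this a steepest-descent search?
    d <- -as.vector(B %*% gr)
    step <- 1                           # backtrack to an acceptable point
    while (f(x + step * d) >= f(x) && step > 1e-12) step <- step / 2
    xnew <- x + step * d
    gnew <- g(xnew)
    s <- xnew - x
    y <- gnew - gr
    ty <- sum(s * y)                    # curvature test quantity
    if (ty > 0) {                       # standard inverse BFGS update
      rho <- 1 / ty
      V <- diag(n) - rho * outer(s, y)
      B <- V %*% B %*% t(V) + rho * outer(s, s)
    } else if (steepest) {
      break                             # steepest descent also failed: stop
    } else {
      B <- diag(n)                      # reset and restart the cycle
    }
    x <- xnew
    gr <- gnew
  }
  x
}
```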

Evolution of the code

Considering the simplicity of the method, it is surprising to me how robust and efficient it has shown itself to be over the last four decades.

The code does allow some variations:

  • the organization of some of the calculations may have an influence on outcomes. There are opportunities for accumulation of inner products, and the update of the inverse Hessian approximation involves subtractions, so digit cancellation could be an issue.
  • tolerances for termination, for the "acceptable point" (Armijo) condition, and other settings could be changed. However, I have found the original settings seem to work as well or better than other choices I have tried.

In the mid-1980s, the BASIC code was extended to allow for masks (fixed parameters) and bounds (or box) constraints on the parameters (J. C. Nash and Walker-Smith 1987). This adds about 50% more code, as well as vector storage for the constraints and indices to manage them (essentially \(3*n\)), but does permit a much wider variety of problems to be solved. Even for unconstrained problems that may have difficult properties, imposing loose bounds can prevent extreme steps in the parameters. To add this capability to R, I put the package Rvmmin on CRAN in 2011 (J. C. Nash 2018). This code is, as of 2018, part of a new optimx (J. C. Nash and Varadhan (2011), J. C. Nash (2014a)) package on CRAN. It is entirely written in R.
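As an illustration of using loose bounds to guard against extreme steps, base R's bounded solver can be used. This example calls method="L-BFGS-B" simply because it needs no extra package; Rvmmin and optimx accept bounds (and masks) through their own lower, upper and related arguments.

```r
f <- function(x) 100 * (x[2] - x[1]^2)^2 + (1 - x[1])^2
# Loose box constraints that comfortably contain the minimum at c(1, 1)
res <- optim(c(-1.2, 1), f, method = "L-BFGS-B",
             lower = c(-5, -5), upper = c(5, 5))
res$par
```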

Provenance of truncated Newton codes for R

There are (at least) two implementations of truncated Newton methods available.

nloptr–tnewton() is a wrapper of the NLopt truncated Newton method translated to C and somewhat modified by Steven G. Johnson in the nlopt project (https://nlopt.readthedocs.io/en/latest/NLopt_Algorithms/) from Fortran code due to Ladislav Luksan (http://www.cs.cas.cz/luksan/subroutines.html). The many layers and translations make it difficult to unravel the particular details in this code, and I do not feel confident to provide a competent overview.

optimx–tn() and optimx–tnbc() are translations from Matlab source of the Truncated Newton codes of Stephen Nash (S. G. Nash 1983) that I prepared, initially in package Rtnmin. The code is entirely in R. In implementing the codes, the principal awkwardness was making a rather extensive set of variables relating to the function and gradient available to several routines. My choice was to use the list2env() function to create an environment to hold this data. Note the survey of truncated Newton methods by Stephen in (S. G. Nash 2000).
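A toy illustration of the list2env() device (the variable names here are mine, purely illustrative): shared problem data lives in an environment that several routines can read and update without passing long argument lists.

```r
# Create an environment pre-loaded with shared problem data
share <- list2env(list(fcount = 0, gcount = 0), envir = new.env())

fn <- function(x, e) { e$fcount <- e$fcount + 1; sum(x^2) }
gr <- function(x, e) { e$gcount <- e$gcount + 1; 2 * x }

fn(c(1, 2), share)
gr(c(1, 2), share)
gr(c(3, 4), share)
c(share$fcount, share$gcount)  # counts persist across calls: 1 and 2
```

Because environments in R have reference semantics, updates made inside fn and gr are visible to every routine holding the environment, unlike ordinary argument passing.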

Discussion

This story of some of the function minimizers available to R users is not finished. There are certain to be related methods in packages or collections I have overlooked. Moreover, I have not had the time or energy to fully explore some of the items I have mentioned. Nevertheless, how codes come about is important to understanding their strengths and weaknesses and to maintaining them at a level users can rely on. Such matters were a large consideration for J. C. Nash (2014c), but have been more explicitly addressed by histoRicalg.

When evaluating such tools, it is important to consider what is wanted. Personally, I value reliability as more important than speed. By this, I mean

  • the program will always proceed towards a minimum
  • on termination it will provide useful information on the result obtained, such as whether there are good indications of a satisfactory result, or that we have halted for some reason such as exhausting a computational limit.

"Speed" is, of course, desirable, but how this is measured is debatable. Execution time is notoriously sensitive to hardware and software environments, as well as to the way functions and gradients are coded. Number of function and gradient counts is an alternative, since these computations often are the bottleneck of optimization. However, there are some methods where the optimization calculations are demanding either in cycles or memory.

Readers may note that I have highlighted that some codes are written entirely in R. This allows for customization or in-line use of faster code, and I have exchanged notes with several workers wanting to speed up time-consuming optimizations by such ideas. Profiling and debugging tools are generally quite challenging to use when there are multiple programming languages involved. All-R code may also be much easier to read and maintain if well-coded, especially as the programming language support for "old" languages diminishes.

These considerations are, of course, why we have a number of methods, and why Ravi Varadhan and I prepared the optimx package with tools to allow for comparison of several methods. But please, do use the comparison for evaluating and choosing a method, not for production minimization by a "try everything" strategy. As new methods are released, we hope to include them in the set optimx can call via a unified interface that is very close to the optim() function. We note that a somewhat different approach to unifying methods is available in the ROI package of Hornik, Meyer, and Theussl (2011).

Beale, E. M. L. 1972. “A Derivation of Conjugate Gradients.” In Numerical Methods for Nonlinear Optimization, edited by F. A. Lootsma, 39–43. London: Academic Press.
Byrd, Richard H., Peihuang Lu, Jorge Nocedal, and Ciyou Zhu. 1995. “A Limited Memory Algorithm for Bound Constrained Optimization.” SIAM J. Sci. Comput. 16 (5): 1190–1208. https://doi.org/10.1137/0916069.
Coppola, Antonio, Brandon Stewart, and Naoaki Okazaki. 2014. Lbfgs: Limited-Memory BFGS Optimization. https://CRAN.R-project.org/package=lbfgs.
Dai, Y. H., and Y. Yuan. 2001. “An Efficient Hybrid Conjugate Gradient Method for Unconstrained Optimization.” Annals of Operations Research 103 (1-4): 33–47.
Eddelbuettel, Dirk, and Romain François. 2011. “Rcpp: Seamless R and C++ Integration.” Journal of Statistical Software 40 (8): 1–18. https://doi.org/10.18637/jss.v040.i08.
Fletcher, R. 1970. “A New Approach to Variable Metric Algorithms.” Computer Journal 13 (3): 317–22.
Fletcher, R., and C. M. Reeves. 1964. “Function Minimization by Conjugate Gradients.” The Computer Journal 7 (2): 149–54.
Fox, Phyllis. 1997. “The Port Mathematical Subroutine Library, Version 3.” Murray Hill, NJ: AT&T Bell Laboratories. http://www.bell-labs.com/project/PORT/.
Gay, David M. 1990. “Usage Summary for Selected Optimization Routines. Computing Science Technical Report 153.” AT&T Bell Laboratories, Murray Hill.
Hager, William W., and Hongchao Zhang. 2006a. “A Survey of Nonlinear Conjugate Gradient Methods.” Pacific Journal of Optimization 2: 35–58.
———. 2006b. “Algorithm 851: CG_DESCENT, a Conjugate Gradient Method with Guaranteed Descent.” ACM Transactions on Mathematical Software 32 (1): 113–37.
Hornik, Kurt, David Meyer, and Stefan Theussl. 2011. ROI: R Optimization Infrastructure. http://CRAN.R-project.org/package=ROI.
IEEE. 1985. “IEEE Standard for Binary Floating-Point Arithmetic.” ANSI/IEEE Std 754-1985, 10–11. https://doi.org/10.1109/IEEESTD.1985.82928.
Liu, Dong C., and Jorge Nocedal. 1989. “On the Limited Memory BFGS Method for Large Scale Optimization.” Mathematical Programming 45 (1-3): 503–28. https://doi.org/10.1007/BF01589116.
Lu, Peihuang, Jorge Nocedal, Ciyou Zhu, and Richard H. Byrd. 1994. “A Limited-Memory Algorithm for Bound Constrained Optimization.” SIAM Journal on Scientific Computing 16: 1190–1208.
Melville, James. 2017. Mize: Unconstrained Numerical Optimization Algorithms. https://CRAN.R-project.org/package=mize.
Morales, José Luis, and Jorge Nocedal. 2011. “Remark on Algorithm 778: L-BFGS-B: Fortran subroutines for large-scale bound constrained optimization.” ACM Trans. Math. Softw. 38 (1): 7:1–4. http://doi.acm.org/10.1145/2049662.2049669.
Nash, John C. 1979. Compact Numerical Methods for Computers : Linear Algebra and Function Minimisation. Book. Bristol: Adam Hilger.
———. 2014a. “On Best Practice Optimization Methods in R.” Journal of Statistical Software 60 (2): 1–14. http://www.jstatsoft.org/v60/i02/.
———. 2014b. Rcgmin: Conjugate Gradient Minimization of Nonlinear Functions. https://CRAN.R-project.org/package=Rcgmin.
———. 2014c. Nonlinear Parameter Optimization Using R Tools. Book. John Wiley & Sons: Chichester. http://www.wiley.com//legacy/wileychi/nash/.
———. 2018. Rvmmin: Variable Metric Nonlinear Function Minimization. https://CRAN.R-project.org/package=Rvmmin.
Nash, John C., and Ravi Varadhan. 2011. “Unifying Optimization Algorithms to Aid Software System Users: optimx for R.” Journal of Statistical Software 43 (9): 1–14. http://www.jstatsoft.org/v43/i09/.
Nash, John C., and Mary Walker-Smith. 1987. Nonlinear Parameter Estimation: An Integrated System in BASIC. New York: Marcel Dekker.
Nash, Stephen G. 1983. “Truncated-Newton Methods for Large-Scale Minimization.” In Applications of Nonlinear Programming to Optimization and Control, edited by H. E. Rauch, 91–100. Pergamon.
———. 2000. “A Survey of Truncated-Newton Methods.” Journal of Computational and Applied Mathematics 124: 45–59.
Nelder, J. A., and R. Mead. 1965. “A Simplex Method for Function Minimization.” Computer Journal 7 (4): 308–13.
Nielsen, H. B. 2000. “UCMINF - an Algorithm for Unconstrained, Nonlinear Optimization.” Department of Mathematical Modelling, Technical University of Denmark. http://www2.imm.dtu.dk/~hbn/publ/TR0019.ps.
Nielsen, Hans Bruun, and Stig Bousgaard Mortensen. 2012. Ucminf: General-Purpose Unconstrained Non-Linear Optimization. http://CRAN.R-project.org/package=ucminf.
Nocedal, Jorge. 1980. “Updating Quasi-Newton Matrices with Limited Storage.” Mathematics of Computation 35 (July): 773–82. https://doi.org/10.1090/S0025-5718-1980-0572855-7.
Polak, E., and G. Ribiere. 1969. “Note sur la convergence de méthodes de directions conjuguées.” Revue Française d’Informatique Et de Recherche Opérationnelle 16: 35–43.
R Core Team. 2019. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.
Schnabel, Robert B., John E. Koonatz, and Barry E. Weiss. 1985. “A Modular System of Algorithms for Unconstrained Minimization.” ACM Trans. Math. Softw. 11 (4): 419–40. https://doi.org/10.1145/6187.6192.
Sorenson, Harold. 1969. “Comparison of Some Conjugate Direction Procedures for Function Minimization.” Journal of The Franklin Institute - Engineering and Applied Mathematics 288 (December): 421–41. https://doi.org/10.1016/0016-0032(69)90253-1.
Zhu, Ciyou, Richard H. Byrd, Peihuang Lu, and Jorge Nocedal. 1997. “Algorithm 778: L-BFGS-b: Fortran Subroutines for Large-Scale Bound-Constrained Optimization.” ACM Trans. Math. Softw. 23 (4): 550–60. https://doi.org/10.1145/279232.279236.