Limited-memory BFGS

Limited-memory BFGS (L-BFGS or LM-BFGS) is an optimization algorithm in the family of quasi-Newton methods that approximates the Broyden–Fletcher–Goldfarb–Shanno algorithm (BFGS) using a limited amount of computer memory.[1] It is a popular algorithm for parameter estimation in machine learning. Instead of storing a dense approximation to the inverse Hessian, L-BFGS keeps only a few vectors that represent the approximation implicitly; due to the resulting linear memory requirement, the L-BFGS method is particularly well suited for optimization problems with many variables. The method maintains a history of the past m updates of the position x and the gradient ∇f(x), where the history size m is generally small (often m < 10). These updates are used to implicitly perform operations requiring the H_k-vector product.

Algorithm

The algorithm starts with an initial estimate of the optimal value, x_0, and proceeds iteratively to refine that estimate with a sequence of better estimates x_1, x_2, .... The derivatives of the function, g_k := ∇f(x_k), are used as a key driver of the algorithm to identify the direction of steepest descent, and also to form an estimate of the Hessian matrix (second derivative) of f(x).

L-BFGS shares many features with other quasi-Newton algorithms, but is very different in how the matrix-vector multiplication d_k = −H_k g_k is carried out, where d_k is the approximate Newton direction, g_k is the current gradient, and H_k is the inverse of the Hessian matrix. There are multiple published approaches using a history of updates to form this direction vector. Here, we give a common approach, the so-called "two-loop recursion" (see the sketch below).

We take as given x_k, the position at the k-th iteration, and g_k := ∇f(x_k), and assume that we have stored the last m updates s_i = x_{i+1} − x_i and y_i = g_{i+1} − g_i, with ρ_i = 1/(y_i^T s_i). H_k^0 will be the 'initial' approximate of the inverse Hessian that our estimate at iteration k begins with.

The algorithm is based on the BFGS recursion for the inverse Hessian,

H_{k+1} = (I − ρ_k s_k y_k^T) H_k (I − ρ_k y_k s_k^T) + ρ_k s_k s_k^T.

For a fixed k we define a sequence of vectors q_{k−m}, ..., q_k by q_k := g_k and q_i := q_{i+1} − α_i y_i, where α_i := ρ_i s_i^T q_{i+1}, and a second sequence z_{k−m}, ..., z_k by z_{k−m} := H_k^0 q_{k−m} and z_{i+1} := z_i + (α_i − β_i) s_i, where β_i := ρ_i y_i^T z_i; unrolling the BFGS recursion shows that z_k = H_k g_k. This formulation gives the search direction for the minimization problem, i.e., d_k = −z_k.

The initial approximation is commonly chosen as a scaled identity matrix, H_k^0 = γ_k I with γ_k = (s_{k−1}^T y_{k−1})/(y_{k−1}^T y_{k−1}); this scaling ensures that the search direction is well scaled and therefore the unit step length is accepted in most iterations. A Wolfe line search is used to ensure that the curvature condition y_k^T s_k > 0 is satisfied and the BFGS updating is stable. Note that some software implementations use an Armijo backtracking line search, but cannot guarantee that the curvature condition will be satisfied by the chosen step, since a step length greater than 1 may be needed to satisfy it. Some implementations simply skip the update when y_k^T s_k is negative or too close to zero, but this approach is not generally recommended, since the updates may be skipped too often to allow the Hessian approximation to capture important curvature information. Some solvers employ a so-called damped (L-)BFGS update, which modifies the quantities s_k and y_k so that the curvature condition holds.

The two-loop recursion formula is widely used by unconstrained optimizers due to its efficiency in multiplying by the inverse Hessian. However, it does not allow for the explicit formation of either the direct or inverse Hessian and is incompatible with non-box constraints. An alternative is the compact representation,[6] which represents the Hessian as a sum of a diagonal matrix and a low-rank update. Such a representation enables the use of L-BFGS in constrained settings, for example, as part of the SQP method.

Applications

L-BFGS has been called "the algorithm of choice" for fitting log-linear (MaxEnt) models and conditional random fields with ℓ2-regularization.[2][3]

Variants

Since BFGS (and hence L-BFGS) is designed to minimize smooth functions without constraints, the L-BFGS algorithm must be modified to handle functions that include non-differentiable components or constraints. A popular class of modifications, called active-set methods, is based on the concept of the active set. The idea is that when restricted to a small neighborhood of the current iterate, the function and constraints can be simplified.

The L-BFGS-B algorithm extends L-BFGS to handle simple box constraints (also known as bound constraints) on variables; that is, constraints of the form l_i ≤ x_i ≤ u_i, where l_i and u_i are per-variable constant lower and upper bounds, respectively (for each x_i, either or both bounds may be omitted); a usage sketch appears at the end of this article.

Orthant-wise limited-memory quasi-Newton (OWL-QN) is an L-BFGS variant for fitting ℓ1-regularized models, exploiting the inherent sparsity of such models. After an L-BFGS step, the method allows some variables to change sign, and repeats the process.

Schraudolph et al.
present an online approximation to both BFGS and L-BFGS.[9] Similar to stochastic gradient descent, this can be used to reduce the computational complexity by evaluating the error function and gradient on a randomly drawn subset of the overall dataset in each iteration.
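As an illustration of the two-loop recursion described above, the following Python sketch computes the search direction −H_k g_k from the stored pairs (s_i, y_i). The function name, argument layout, and the scaled-identity choice of H_k^0 are illustrative conventions, not a fixed interface from any particular library.

```python
import numpy as np


def lbfgs_direction(grad, s_list, y_list):
    """Two-loop recursion: return d_k = -H_k @ grad using the m stored
    curvature pairs (s_i, y_i), ordered from oldest to newest."""
    q = np.array(grad, dtype=float)          # work on a copy of g_k
    rho = [1.0 / np.dot(y, s) for s, y in zip(s_list, y_list)]
    alpha = [0.0] * len(s_list)

    # First loop: run backwards through the history (newest pair first).
    for i in reversed(range(len(s_list))):
        alpha[i] = rho[i] * np.dot(s_list[i], q)
        q = q - alpha[i] * y_list[i]

    # Initial inverse Hessian H_k^0 = gamma_k * I with the usual scaling.
    if s_list:
        gamma = np.dot(s_list[-1], y_list[-1]) / np.dot(y_list[-1], y_list[-1])
    else:
        gamma = 1.0                           # empty history: plain gradient step
    z = gamma * q

    # Second loop: run forwards through the history (oldest pair first).
    for i in range(len(s_list)):
        beta = rho[i] * np.dot(y_list[i], z)
        z = z + (alpha[i] - beta) * s_list[i]

    return -z                                 # search direction for minimization
```

With an empty history the sketch reduces to the steepest-descent direction −g_k, which matches how an L-BFGS iteration is typically started before any curvature pairs have been collected.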
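For the box-constrained variant discussed under Variants, one widely available implementation is SciPy's scipy.optimize.minimize with method="L-BFGS-B". The sketch below uses a toy quadratic objective and illustrative bounds; it is meant only to show how per-variable bounds are passed, not as a reference implementation.

```python
import numpy as np
from scipy.optimize import minimize

# Toy objective: minimize sum((x - 3)^2) subject to 0 <= x_i <= 2.
def f(x):
    return np.sum((x - 3.0) ** 2)

def grad_f(x):
    return 2.0 * (x - 3.0)

x0 = np.zeros(5)
bounds = [(0.0, 2.0)] * 5  # per-variable box constraints l_i <= x_i <= u_i

res = minimize(f, x0, jac=grad_f, method="L-BFGS-B", bounds=bounds)
print(res.x)  # each component converges to the upper bound 2.0, the constrained minimizer
```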