AI Training
Neural-network "training" is the problem of determining the network's parameters by minimizing a non-convex cost function under severe computational limitations. It is a difficult problem that requires heavy computational effort.
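As a minimal sketch of what "training" means in this setting, the code below (the tiny network, the data, and the sum-of-squares cost are assumptions made for illustration, not taken from the references) defines a cost function whose unknowns are the network's weights; training is the minimization of this function.

```python
import numpy as np

# Illustrative only: a tiny one-hidden-layer network and a sum-of-squares cost.
# The data, architecture, and targets are assumptions made for this sketch.
rng = np.random.default_rng(0)
X = rng.standard_normal((20, 3))              # 20 samples, 3 inputs
y = np.sin(X @ np.array([1.0, -2.0, 0.5]))    # targets

def unpack(w):
    W1 = w[:12].reshape(3, 4)   # 3 inputs -> 4 hidden units
    W2 = w[12:].reshape(4, 1)   # 4 hidden units -> 1 output
    return W1, W2

def cost(w):
    """Sum-of-squares training cost as a function of the 16 weights."""
    W1, W2 = unpack(w)
    hidden = np.tanh(X @ W1)
    residual = hidden @ W2 - y[:, None]
    return 0.5 * np.sum(residual ** 2)

w0 = rng.standard_normal(16)    # the 16 weights are the unknowns of the minimization
print("cost at a random starting point:", cost(w0))
```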
Optimization Issues
In general, minimization algorithms may fail to converge, may converge to a local minimum, or may converge to a saddle point (a multi-dimensional inflection point) that is neither a minimum nor a maximum of the cost function. See Barzilai [3].
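A textbook illustration of the saddle-point issue (the function and starting point below are standard examples chosen for this sketch, not taken from the references): at a saddle point the gradient vanishes, so a gradient-based method can converge there even though the point is not a minimum.

```python
import numpy as np

# f(x, y) = x^2 - y^2: the gradient vanishes at the origin, yet the origin is a
# saddle point, neither a minimum nor a maximum of f.
def grad(p):
    x, y = p
    return np.array([2.0 * x, -2.0 * y])

# Started on the x-axis (an assumption of this sketch), steepest descent with a
# fixed step size converges to the saddle point rather than to a minimum.
p = np.array([1.0, 0.0])
for _ in range(200):
    p = p - 0.1 * grad(p)
print("steepest descent ends near", p, "- a saddle point, not a minimum")
```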
Algorithms that use second-order derivatives cannot be used for large minimization problems. As a result, the only candidates in the optimization literature are the steepest descent and conjugate gradient algorithms. The conjugate gradient algorithm works well on quadratic, convex, non-singular problems, but the training problem is none of these. Here the conjugate gradient method may fail: the gradients are not conjugate and the algorithm does not terminate in a finite number of steps. It follows that the only applicable algorithm in the optimization literature is the steepest descent method with an appropriate line search to determine the step size (how far to go in the negative gradient direction). Other neural-network algorithms in use are heuristics with no foundation in optimization theory.
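The following is a minimal sketch of the steepest descent method with a line search. The backtracking (Armijo) rule and the quadratic test problem are assumptions of this sketch; the text does not prescribe a particular line search.

```python
import numpy as np

def steepest_descent(cost, grad, w, steps=2000, tol=1e-8):
    """Steepest descent with a backtracking (Armijo) line search.

    The Armijo rule is one common choice of line search, used here for
    illustration only.
    """
    for _ in range(steps):
        g = grad(w)
        if np.linalg.norm(g) < tol:
            break
        t = 1.0
        # Backtrack until the step gives a sufficient decrease in the cost.
        while cost(w - t * g) > cost(w) - 1e-4 * t * (g @ g):
            t *= 0.5
            if t < 1e-12:
                break
        w = w - t * g
    return w

# Example: minimize a simple quadratic (illustrative, not a training cost).
A = np.diag([1.0, 10.0, 100.0])
cost = lambda w: 0.5 * w @ A @ w
grad = lambda w: A @ w
print(steepest_descent(cost, grad, np.array([1.0, 1.0, 1.0])))
```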
Mini Batches
Segmenting data into mini batches introduces singularities into the minimization problem of neural-network training, and reordering (shuffling) the input produces different solutions when mini batches are used. See Barzilai [2].
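A minimal sketch of the reordering effect, assuming plain mini-batch gradient descent on an illustrative least-squares problem (the data, batch size, step size, and number of passes are assumptions of this sketch): running the same algorithm on the same data in two different orders ends at two different sets of weights.

```python
import numpy as np

# Illustrative least-squares problem; everything here is an assumption of the sketch.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
y = X @ np.array([1.0, -1.0, 2.0, 0.0, 0.5]) + 0.1 * rng.standard_normal(100)

def minibatch_descent(order, batch_size=10, step=0.05, epochs=3):
    w = np.zeros(5)
    for _ in range(epochs):
        for start in range(0, len(order), batch_size):
            idx = order[start:start + batch_size]
            Xb, yb = X[idx], y[idx]
            w -= step * Xb.T @ (Xb @ w - yb) / batch_size  # gradient on the mini batch
    return w

order1 = np.arange(100)          # original input order
order2 = rng.permutation(100)    # reordered (shuffled) input
print("solution with original order :", minibatch_descent(order1))
print("solution with shuffled order :", minibatch_descent(order2))
# The same algorithm on the same data ends at different weights.
```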
Performance
Three examples demonstrating performance are listed below:
- For the exponential cost function in Barzilai [3], Algorithm B67 converges to a minimum in 17 steps. (This algorithm is stable, fast, and applicable to singular systems.) The steepest descent algorithm with a fixed step size of 0.001 requires more than 100,000 steps to converge to this solution (see the sketch after this list).
- For a variant of the MNIST problem of recognizing handwritten digits, the steepest descent algorithm diverges for step sizes from 0.1 down to 0.00001 and converges for the step size 0.000001. With this step size, 10 million iterations of the algorithm, taking 1374 seconds (more than 22 minutes) on a test machine, attain three digits of solution precision, while our algorithm reaches the same precision in less than 0.092 seconds. That is, our algorithm is about 14,900 times faster.
- For a neural network of depth 4 with 62 unknowns (weights), Algorithm B67 reduces the error (the norm of the cost function's gradient) from 3,425,824 to 0.000000000021781 in two steps. In comparison, the "Adam" algorithm reduces the error by a factor of 0.9991 per step, which is rather poor performance.
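The sketch below illustrates the fixed-step behaviour referred to in the first example. The exponential-type cost is an illustrative stand-in (the cost function of Barzilai [3] and Algorithm B67 are not reproduced here), so the step counts are not the ones quoted above; the sketch only shows how a fixed small step size inflates the number of steps compared with a line search.

```python
import numpy as np

# Illustrative exponential-type cost (NOT the cost function of Barzilai [3]):
# f(w) = exp(w1 + w2) + exp(-w1) + exp(-w2), minimized at w1 = w2 = 0.
def cost(w):
    return np.exp(w[0] + w[1]) + np.exp(-w[0]) + np.exp(-w[1])

def grad(w):
    e = np.exp(w[0] + w[1])
    return np.array([e - np.exp(-w[0]), e - np.exp(-w[1])])

# Steepest descent with the fixed step size 0.001 used in the first example.
w = np.array([1.0, -1.0])
fixed_steps = 0
while np.linalg.norm(grad(w)) > 1e-6 and fixed_steps < 1_000_000:
    w -= 0.001 * grad(w)
    fixed_steps += 1
print("fixed step size 0.001:", fixed_steps, "steps")

# The same problem with a backtracking (Armijo) line search needs far fewer steps.
w = np.array([1.0, -1.0])
ls_steps = 0
while np.linalg.norm(grad(w)) > 1e-6 and ls_steps < 10_000:
    g = grad(w)
    t = 1.0
    while cost(w - t * g) > cost(w) - 1e-4 * t * (g @ g):
        t *= 0.5
    w = w - t * g
    ls_steps += 1
print("with a line search   :", ls_steps, "steps")
```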
Similar behaviour is observed on large-scale problems. The reason is this: the total computational effort depends on the effort per step and on the number of steps needed. The effort per step depends on the problem's size, but it is the same for all of these algorithms, since each of them must compute the gradient of the cost function at every step. The number of steps depends on the algorithm's rate of convergence, which is independent of the problem's size (the rate for Newton's method is 2 regardless of size; for other algorithms it depends on the problem's condition number which, again, is independent of size). For this reason our algorithm's fast convergence results in orders-of-magnitude savings in computational effort for the training problem regardless of its size. It also means that it is easy to construct very small neural networks of the simplest structure for which standard training algorithms take an exceedingly large number of steps to converge.
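A sketch of the size-independence argument, using illustrative diagonal quadratics with the same condition number in two different dimensions (the matrices, the condition number, and the tolerance are assumptions of this sketch): the number of steepest descent steps is comparable in both cases, while the work per step grows with the dimension.

```python
import numpy as np

# Two diagonal quadratics f(w) = 0.5 * sum(eigs * w^2) with the same condition
# number (50) but very different dimensions.
def steepest_descent_steps(n, cond=50.0, tol=1e-8, max_steps=100_000):
    eigs = np.linspace(1.0, cond, n)      # eigenvalues of the (diagonal) Hessian
    w = np.ones(n)
    steps = 0
    while np.linalg.norm(eigs * w) > tol and steps < max_steps:
        g = eigs * w                      # gradient
        t = (g @ g) / (g @ (eigs * g))    # exact line search for a quadratic
        w = w - t * g
        steps += 1
    return steps

print("dimension 10  :", steepest_descent_steps(10), "steps")
print("dimension 1000:", steepest_descent_steps(1000), "steps")
# The step counts are comparable; only the cost of each step grows with the dimension.
```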
Experience
Professor Barzilai (deceased) held B.Sc., M.Sc., and D.Sc. degrees in Applied Mathematics from the Technion, Israel Institute of Technology, and was an expert in the field of numerical minimization algorithms. He worked for more than six years on (military) applied mathematical algorithms before completing graduate degrees in mathematical optimization, and held positions at the University of Texas at Austin (Mathematics), York University (Business), Dalhousie University (Business), the Technical University of Nova Scotia (Director, School of Computer Science), and Dalhousie University (Industrial Engineering). His proprietary algorithm is the result of a theoretical breakthrough after more than 35 years of work on this problem. It is not a heuristic, nor a variant of existing algorithms.
References
- Jonathan Barzilai, "On Neural-network Accuracy and Validation," SIAM News, December 3, 2021. View
- Jonathan Barzilai, "A Note on Neural-Network Mini-Batches," SIAM News, October 27, 2021. View
- Jonathan Barzilai, "On Neural-Network Training Algorithms," in Human-Machine Shared Contexts, William Lawless, Ranjeev Mittu, and Donald Sofge (Eds.), Elsevier, pp. 307—313, 2020. View preprint
- Jonathan Barzilai, "A Spectrum Algorithm and Optimization Notes," Pure and Applied Functional Analysis, Volume 3, Number 4, pp. 533—536, 2018.
- Jonathan Barzilai and Jonathan M. Borwein, "Two-Point Step Size Gradient Methods," IMA Journal of Numerical Analysis, Vol. 8, pp. 141—148, 1988.
- Jonathan Barzilai and Michael A.H. Dempster, "Measuring Rates of Convergence of Numerical Algorithms," Journal of Optimization Theory and Applications, Vol. 78, No. 1, pp.109—125, 1993.
- Jonathan Barzilai and Aharon Ben-Tal, "Nonpolynomial and Inverse Interpolation for Line Search: Synthesis and Convergence Rates," SIAM J. Numer. Anal., Vol. 19, pp. 1263—1277, 1982.