A Tutorial on Convex Optimization Formulations for Neural Networks

We provide a short tutorial on understanding neural networks through the lens of convex optimization. The most important theoretical contribution of this line of work is a methodology for characterizing and computing the global optima of the regularized training problem of two-layer ReLU fully-connected neural networks, a problem that is NP-hard in general.

The starting point is [1], which shows that regularized training problems of two-layer ReLU networks can be reformulated as convex programs. The underlying convex geometry is studied in detail in [3]. The analysis is further extended to multi-layer NNs in [2,4,5,7]. Building on the convex optimization formulations, [6] gives an exact characterization of all global optima of two-layer ReLU networks. More precisely, all globally optimal solutions of the nonconvex training problem are given by the solution set of a simple convex program, up to permutation and splitting of neurons. In other words, we can find the set of optimal NNs for the regularized training problem by solving a convex optimization problem. The convex optimization formulations of NNs have also been extended to NNs with batch normalization layers [8], convolutional NNs (CNNs) [2,4], polynomial-activation networks [9], transformers with self-attention layers [10], and generative adversarial networks (GANs) [11], as well as to the study of implicit regularization [13] and the Wasserstein gradient flow (WGF) [14].
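As a concrete reference point, the squared-loss instance of this equivalence can be written as follows. The notation (data matrix X ∈ R^{n×d}, labels y, m hidden neurons with weights W_j and output weights α_j, weight-decay parameter β, and arrangement patterns D_i) paraphrases [1] rather than restating it verbatim:

```latex
% Nonconvex two-layer ReLU training problem (squared loss, weight decay \beta)
\min_{\{W_j,\alpha_j\}_{j=1}^{m}} \;
  \frac{1}{2}\Big\|\sum_{j=1}^{m} (X W_j)_+ \,\alpha_j - y\Big\|_2^2
  \;+\; \frac{\beta}{2}\sum_{j=1}^{m}\big(\|W_j\|_2^2 + \alpha_j^2\big)

% Equivalent convex program over the hyperplane arrangement patterns
% D_i = \mathrm{diag}\!\big(\mathbf{1}[X u_i \ge 0]\big), \; i = 1,\dots,P
\min_{\{v_i,w_i\}_{i=1}^{P}} \;
  \frac{1}{2}\Big\|\sum_{i=1}^{P} D_i X (v_i - w_i) - y\Big\|_2^2
  \;+\; \beta\sum_{i=1}^{P}\big(\|v_i\|_2 + \|w_i\|_2\big)
\quad \text{s.t.}\quad (2D_i - I_n) X v_i \ge 0,\;\; (2D_i - I_n) X w_i \ge 0,\;\; i = 1,\dots,P.
```

The group-ℓ1 penalty on (v_i, w_i) induces a sparse, neuron-like structure in the optimal solution, and an optimal two-layer ReLU network can be reconstructed from the nonzero blocks [1].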

We emphasize that the zero duality gap and the exact convex formulations of NNs are important for the following reasons.

  • If the duality gap is zero, there exists a convex program that achieves the same optimal value as the original non-convex problem. We can then interpret non-convex optimization heuristics applied to the original non-convex program (such as SGD or GD based on backpropagation) as local algorithms for solving the convex program. For instance, non-convex SGD corresponds to an active-set method that maintains a small set of variables to parameterize the high-dimensional variable set of the convex program, analogous to the simplex method for linear programs. Moreover, the convex program helps us understand the optimization landscape of the non-convex problem: in [6], the authors show that all global (local) optima of the regularized non-convex training objective for two-layer ReLU networks can be characterized via the corresponding (constraint-subsampled) convex dual program; the squared-loss dual is sketched after this list.

  • Solving the convex program is more efficient and leads to solutions that optimize and generalize well. Recent works [1,8,12] show that convex solvers outperform standard non-convex heuristics (SGD, Adam, etc.) in training and test accuracy as well as training time on many benchmarks; a toy example of solving the convex program is also given after this list. Therefore, having a convex program for deeper networks can potentially lead to practical gains.

  • In optimization theory and neural network theory, having a convex formulation of a non-convex optimization problem is a major step, regardless of how hard the convex program is to solve. We can apply theoretical tools from optimization theory to analyze the sensitivity of the parameters and to robustify the solution. Moreover, whichever solver we apply to the convex program, and whatever its hyperparameters (initialization, step size, momentum, batch size, etc.), it will find a globally optimal solution. In other words, having a convex formulation decouples the model from the optimizer and its parameters.
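For reference, the convex dual program mentioned in the first bullet takes the following form for the squared loss; this is a paraphrase of the dual stated in [1], under the usual rescaling that puts the hidden-layer weights on the unit ball:

```latex
% Dual of the two-layer ReLU training problem (squared loss), dual variable v \in \mathbb{R}^n
\max_{v} \; -\frac{1}{2}\|v - y\|_2^2 + \frac{1}{2}\|y\|_2^2
\quad \text{s.t.} \quad \max_{\|u\|_2 \le 1} \big| v^\top (X u)_+ \big| \le \beta .
```

When the network is sufficiently wide, strong duality holds and this maximum equals the optimal value of the non-convex training problem; the finite convex program above discretizes the dual constraint via the arrangement patterns D_i.

The snippet below is a minimal, self-contained sketch of solving a pattern-subsampled version of the convex program with cvxpy on synthetic data. It is neither the authors' implementation nor the SCNN library; the data, the number of sampled patterns P, and the regularization level beta are illustrative choices:

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
n, d, P, beta = 40, 5, 50, 1e-3   # samples, features, sampled patterns, regularization

# Synthetic regression data (illustrative only).
X = rng.standard_normal((n, d))
y = np.maximum(X @ rng.standard_normal(d), 0.0) + 0.1 * rng.standard_normal(n)

# Sample activation patterns D_i = diag(1[X u_i >= 0]) from random directions u_i.
U = rng.standard_normal((d, P))
D = (X @ U >= 0).astype(float)    # n x P matrix; column i is the diagonal of D_i

V = cp.Variable((d, P))           # "positive" neurons v_i, one per column
W = cp.Variable((d, P))           # "negative" neurons w_i

# Model output under the convex parameterization: sum_i D_i X (v_i - w_i).
pred = cp.sum(cp.multiply(D, X @ (V - W)), axis=1)
group_l1 = cp.sum(cp.norm(V, 2, axis=0)) + cp.sum(cp.norm(W, 2, axis=0))

# Cone constraints (2 D_i - I) X v_i >= 0 keep each block consistent with its ReLU pattern.
S = 2.0 * D - 1.0
constraints = [cp.multiply(S, X @ V) >= 0, cp.multiply(S, X @ W) >= 0]

prob = cp.Problem(cp.Minimize(0.5 * cp.sum_squares(pred - y) + beta * group_l1),
                  constraints)
prob.solve()
print("optimal value of the (subsampled) convex program:", prob.value)
```

With the arrangement patterns subsampled, this is a tractable approximation used in practice [1,12]; enumerating all patterns recovers the exact equivalence with the non-convex training problem [1].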

Supplementary materials

  • Mert Pilanci's talk on the Hidden Convex Optimization Landscape of Deep Neural Networks (with slides), and the course EE364b Convex Optimization II (with lecture slides).

  • SCNN: a Python library for fast, global optimization of shallow neural networks.

References

[1] M. Pilanci, T. Ergen. Neural Networks are Convex Regularizers: Exact Polynomial-time Convex Optimization Formulations for Two-Layer Networks. ICML 2020. (paper)

[2] T. Ergen, M. Pilanci. Implicit Convex Regularizers of CNN Architectures: Convex Optimization of Two- and Three-Layer Networks in Polynomial Time. ICLR 2021 Spotlight. (paper)

[3] T. Ergen, M. Pilanci. Revealing the Structure of Deep Neural Networks via Convex Duality. ICML 2021. (paper)

[4] T. Ergen, M. Pilanci. Global Optimality Beyond Two Layers: Training Deep ReLU Networks via Convex Programs. ICML 2021. (paper)

[5] Y. Wang, T. Ergen, M. Pilanci. Parallel Deep Neural Networks Have Zero Duality Gap. (paper)

[6] Y. Wang, J. Lacotte, M. Pilanci. The Hidden Convex Optimization Landscape of Two-Layer ReLU Neural Networks: an Exact Characterization of the Optimal Solutions. ICLR 2022 Oral. (paper)

[7] T. Ergen, M. Pilanci. Path Regularization: A Convexity and Sparsity Inducing Regularization for Parallel ReLU Networks. (paper)

[8] T. Ergen, A. Sahiner, B. Ozturkler, J. Pauly, M. Mardani, M. Pilanci. Demystifying Batch Normalization in ReLU Networks: Equivalent Convex Optimization Models and Implicit Regularization. ICLR 2022. (paper)

[9] B. Bartan, M. Pilanci. Neural Spectrahedra and Semidefinite Lifts: Global Convex Optimization of Polynomial Activation Neural Networks in Fully Polynomial-Time. (paper)

[10] A. Sahiner, T. Ergen, B. Ozturkler, J. Pauly, M. Mardani, M. Pilanci. Unraveling Attention via Convex Duality: Analysis and Interpretations of Vision Transformers. (paper)

[11] A. Sahiner, T. Ergen, B. Ozturkler, B. Bartan, J. Pauly, M. Mardani, M. Pilanci. Hidden Convexity of Wasserstein GANs: Interpretable Generative Models with Closed-Form Solutions. ICLR 2021. (paper)

[12] A. Mishkin, A. Sahiner, M. Pilanci. Fast Convex Optimization for Two-Layer ReLU Networks: Equivalent Model Classes and Cone Decompositions. ICML 2022. (paper)

[13] Y. Wang, M. Pilanci. The Convex Geometry of Backpropagation: Neural Network Gradient Flows Converge to Extreme Points of the Dual Convex Program. ICLR 2022 Poster. (paper), (poster), (slides)

[14] Y. Wang, P. Chen, M. Pilanci, W. Li. Optimal Neural Network Approximation of Wasserstein Gradient Direction via Convex Optimization. (paper)