General Information
- Instructor: Anthony Man-Cho So (manchoso at se.cuhk.edu.hk)
- Office Hours: Wednesdays 3:00pm - 4:30pm or by appointment, in ERB 604
- Lecture Time/Location:
- Mondays 11:30am - 1:15pm, in LSB C2
- Tuesdays 11:30am - 1:15pm, in HYS LG04
Course Description
The prevalence of high-dimensional data has motivated active research on efficient methods for tackling optimization problems that arise in statistical analysis. In this course, we will give an introduction to this exciting area of research, with emphasis on the theory of structured regularizers for high-dimensional statistics and the design and analysis of statistically and computationally efficient optimization algorithms. Applications in various areas of science and engineering, such as machine learning, signal processing, and statistics, will also be discussed. Prerequisite: ENGG 5501 or equivalent.
Course Outline (subject to change)
Course Requirements
Homework sets (60%) and a take-home final examination (40%)
Open problems will be introduced throughout the semester. Students who solve any one of the open problems will automatically receive an A for the course.
General References
- Yu. Nesterov: Introductory Lectures on Convex Optimization: A Basic Course. Kluwer Academic Publishers, 2004.
Schedule and Reading
- Week 1 - Jan 10: Introduction
- Week 2 - Jan 16: The notions of decomposable regularizer and restricted strong convexity and their consequences.
Jan 17: Bound on statistical error for general models. Application to sparse linear regression. (A schematic form of the bound is given after this week's reading.)
Reading:
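- For orientation, here is a schematic version of the kind of error bound covered in this lecture, in the spirit of the decomposable-regularizer framework; the constant c and the notation (loss L_n, regularizer R with dual R^*, subspace pair (M, M-bar), curvature kappa) are generic placeholders rather than statements taken verbatim from any paper or the lecture notes.

```latex
% Schematic estimation error bound (generic constant c):
% if the empirical loss L_n satisfies restricted strong convexity with curvature \kappa,
% the regularizer R is decomposable over the subspace pair (M, \bar{M}^\perp),
% and the regularization parameter satisfies \lambda_n \ge 2 R^*(\nabla L_n(\theta^*)), then
\[
  \|\hat{\theta} - \theta^*\|_2 \;\le\; \frac{c\,\lambda_n}{\kappa}\,\Psi(\bar{M}),
  \qquad
  \Psi(\bar{M}) := \sup_{\theta \in \bar{M}\setminus\{0\}} \frac{R(\theta)}{\|\theta\|_2}.
\]
% For sparse linear regression with an s-sparse \theta^*, R = \|\cdot\|_1, and
% \lambda_n \asymp \sigma \sqrt{\log p / n}, this specializes to the familiar rate
\[
  \|\hat{\theta} - \theta^*\|_2 \;\lesssim\; \frac{\sigma}{\kappa}\sqrt{\frac{s \log p}{n}}.
\]
```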
- Week 3 - Jan 23, 24: Establishing the restricted strong convexity property for sparse linear regression. (A schematic statement of the key lower bound is given after this week's reading list.)
Reading:
- Raskutti, Wainwright, Yu. Restricted Eigenvalue Properties for Correlated Gaussian Designs. Journal of Machine Learning Research 11: 2241-2259, 2010.
- The proof of Gordon's inequality can be found in Section 8.7 of
Foucart, Rauhut. A Mathematical Introduction to Compressive Sensing. Springer Science+Business Media, New York, 2013.
- For an extension of Raskutti et al.'s result to non-Gaussian settings, see
Rudelson, Zhou. Reconstruction from Anisotropic Random Measurements. IEEE Transactions on Information Theory 59(6): 3434-3447, 2013.
Zhou. Restricted Eigenvalue Conditions on Subgaussian Random Matrices. Technical Report, 2009.
- Lecture Notes (Part 1, Part 2)
- Besides sparse linear regression, matrix completion also enjoys the restricted strong convexity property; see
Negahban, Wainwright. Restricted Strong Convexity and Weighted Matrix Completion: Optimal Bounds with Noise. Journal of Machine Learning Research 13: 1665-1697, 2012.
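- As a rough guide to what "establishing restricted strong convexity" means here, the following is a schematic form of the lower bound proved by Raskutti, Wainwright and Yu for Gaussian designs; the constants c_1, c_2, c, c' are generic placeholders rather than the values appearing in the paper.

```latex
% Schematic statement: if the rows of X \in \mathbb{R}^{n \times p} are i.i.d. N(0, \Sigma), then
% with probability at least 1 - c' e^{-c n}, every v \in \mathbb{R}^p satisfies
\[
  \frac{\|Xv\|_2}{\sqrt{n}} \;\ge\; c_1 \|\Sigma^{1/2} v\|_2
      \;-\; c_2\, \rho(\Sigma)\,\sqrt{\frac{\log p}{n}}\;\|v\|_1,
\]
% where \rho^2(\Sigma) is the largest diagonal entry of \Sigma.  Restricting v to a cone
% such as \{ \|v_{S^c}\|_1 \le 3 \|v_S\|_1 \} for an s-sparse support S turns this into a
% restricted eigenvalue / restricted strong convexity condition once n \gtrsim s \log p.
```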
- Week 4 - Jan 30, 31: Chinese New Year.
- Week 5 - Feb 6, 7: Algorithmic aspects of regularized loss minimization problems - The proximal gradient method and its convergence analysis. (A minimal code sketch of the method is given after this week's reading list.)
Reading:
- The lecture material is based on
Hou, Zhou, So, Luo. On the Linear Convergence of the Proximal Gradient Method for Trace Norm Regularization. Advances in Neural Information Processing Systems 26: Proceedings of the 2013 Conference (NIPS 2013), pp. 710-718, 2013.
See, in particular, Theorem 3.2.
WARNING: The conclusion of Theorem 3.1 of the above paper is not correct. This is due to a flaw in the proof. The correct conclusion appears in Proposition 12 of
Zhou, So. A Unified Approach to Error Bounds for Structured Convex Optimization Problems. Mathematical Programming, Series A, to appear, 2016.
- Some useful properties of the proximal map can be found in
Combettes, Wajs. Signal Recovery by Proximal Forward-Backward Splitting. Multiscale Modeling and Simulation 4(4): 1168-1200, 2005.
Rockafellar, Wets. Variational Analysis. Grundlehren der Mathematischen Wissenschaften 317, Springer-Verlag, 2004. (Particularly Sections 1.G and 2.E)
- Lecture Notes
The lemma on p.6 of the notes (concerning certain monotonicity properties of the proximal map) first appears in
Sra. Scalable Nonconvex Inexact Proximal Splitting. Advances in Neural Information Processing Systems 25: Proceedings of the 2012 Conference (NIPS 2012), pp. 530-538, 2012.
However, to obtain the desired conclusion, an envelope theorem should be invoked in the proof. In fact, just having a unique solution to displayed equation (18) of the above paper is not sufficient to guarantee the differentiability of the function E_g. The envelope theorem mentioned on p.8 of the notes comes from Proposition 1 of
Oyama, Takenawa. On the (Non-)Differentiability of the Optimal Value Function When the Optimal Solution Is Unique. Manuscript, 2013.
Envelope theorems have many applications in economics; see the above paper and the pointers therein for details.
- The statistical nature of regularized loss minimization problems can be exploited to develop an alternative convergence analysis of first-order methods. This is done, e.g., in
Agarwal, Negahban, Wainwright. Fast Global Convergence of Gradient Methods for High-Dimensional Statistical Recovery. Annals of Statistics 40(5): 2452-2482, 2012. Supplementary Material
The above paper shows that for several classes of regularized loss minimization problems, certain first-order methods enjoy global linear convergence up to the statistical precision of the model. Modulo the global vs. local nature of the linear convergence, this is weaker than the convergence result derived from the error bound theory. Nevertheless, the approach developed in the above paper can be used to tackle problems that are not known to possess an error bound.
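- A minimal code sketch of the proximal gradient method, specialized to l1-regularized least squares (the Lasso). It assumes a least-squares loss, soft-thresholding as the proximal map of the l1-norm, and a constant step size 1/L with L the Lipschitz constant of the gradient; all names and parameter choices below are illustrative and are not taken from the lecture notes or the papers above.

```python
import numpy as np

def soft_threshold(z, tau):
    """Proximal map of tau * ||.||_1 (entrywise soft-thresholding)."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def proximal_gradient_lasso(A, b, lam, num_iters=500):
    """Proximal gradient (ISTA) for min_x (1/(2n)) * ||Ax - b||^2 + lam * ||x||_1.

    Uses a constant step size 1/L, where L = lambda_max(A^T A) / n is the
    Lipschitz constant of the gradient of the smooth part.
    """
    n, p = A.shape
    L = np.linalg.norm(A, 2) ** 2 / n
    x = np.zeros(p)
    for _ in range(num_iters):
        grad = A.T @ (A @ x - b) / n                   # gradient of the least-squares loss
        x = soft_threshold(x - grad / L, lam / L)      # proximal (soft-thresholding) step
    return x

# Illustrative usage on synthetic sparse-regression data
rng = np.random.default_rng(0)
n, p, s = 200, 500, 10
A = rng.standard_normal((n, p))
x_true = np.zeros(p)
x_true[:s] = 1.0
b = A @ x_true + 0.1 * rng.standard_normal(n)
x_hat = proximal_gradient_lasso(A, b, lam=0.05)
print("estimation error:", np.linalg.norm(x_hat - x_true))
```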
- Week 6 - Feb 13, 14: Algorithmic aspects of regularized loss minimization problems - Error bounds.
Reading:
- The lecture material is based on
Zhou, So. A Unified Approach to Error Bounds for Structured Convex Optimization Problems. Mathematical Programming, Series A, to appear, 2016.
- Lecture Notes
- A streamlined proof of Hoffman's error bound (restated in schematic form after this week's reading list) can be found in Section 11.8 of
Güler. Foundations of Optimization. Graduate Texts in Mathematics 258, Springer Science+Business Media, LLC, 2010.
- The notion of linear regularity used in Corollary 1 of the lecture notes and other related notions are discussed at length in
Bauschke, Borwein. On Projection Algorithms for Solving Convex Feasibility Problems. SIAM Review 38(3): 367-426, 1996.
- The outer Lipschitz continuity property used in the lecture is discussed in Sections 3C and 3D (particularly Theorem 3D.1) of
Dontchev, Rockafellar. Implicit Functions and Solution Mappings. Springer Monographs in Mathematics, Springer Science+Business Media, LLC, 2009.
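- For reference, here is a schematic statement of Hoffman's error bound as used this week; the choice of norm and the form of the constant are left generic rather than taken from any particular source.

```latex
% Hoffman's error bound (schematic statement): for a nonempty polyhedron
% P = \{ x : Ax \le b \}, there is a constant \tau > 0, depending only on A,
% such that for every x,
\[
  \operatorname{dist}(x, P) \;\le\; \tau \,\bigl\| (Ax - b)_+ \bigr\|,
\]
% i.e., the distance to the feasible set is controlled by the amount of
% constraint violation.  Error bounds of this type drive the linear
% convergence analyses discussed in this week's lectures.
```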
- Week 7 - Feb 20, 21: Design and analysis of a successive quadratic approximation method for regularized loss minimization problems. (The generic form of the subproblem is recalled below.)
Reading:
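- As a rough template for the class of methods covered this week (not the specific variant analyzed in the lecture): for a composite problem min_x f(x) + r(x), with f smooth and r the regularizer, a successive quadratic approximation (proximal Newton-type) method generates iterates by solving

```latex
\[
  x_{k+1} \;\in\; \arg\min_{x}\;\Bigl\{\, \nabla f(x_k)^\top (x - x_k)
      + \tfrac{1}{2}\,(x - x_k)^\top H_k\, (x - x_k) + r(x) \,\Bigr\},
\]
% where H_k is a positive definite approximation of the Hessian of f at x_k,
% possibly followed by a line search on the step x_{k+1} - x_k.
% Taking H_k = (1/\eta) I recovers the proximal gradient step with step size \eta.
```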
- Week 8 - Feb 27, 28: Sketching and its applications to the design and analysis of second-order methods. (A minimal code sketch of a sketched Newton step is given after this week's reading list.)
Reading:
- The lecture material is based on
Pilanci, Wainwright. Newton Sketch: A Near Linear-Time Optimization Algorithm with Linear-Quadratic Convergence. SIAM Journal on Optimization, to appear, 2016.
As noted in the lecture, the approach in the above paper does not offer any advantage for unconstrained problems, as the sketching dimension could be as high as the dimension of the decision vector. In fact, the same is true for a constrained problem if the optimal solution lies in the interior of the constraint set; see displayed equation (3.8) in the above paper.
- Lecture Notes
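- A minimal code sketch of a sketched Newton step, using unregularized logistic regression and a Gaussian sketch of the Hessian square root purely to illustrate the mechanics. The function names, the choice of sketch, and all parameters below are illustrative assumptions, not the Newton Sketch algorithm of Pilanci and Wainwright as analyzed in the paper (which also covers constraints, self-concordant losses, and structured sketches).

```python
import numpy as np

def sigmoid(t):
    """Logistic function."""
    return 1.0 / (1.0 + np.exp(-t))

def newton_sketch_logreg(A, y, sketch_dim, num_iters=20, step=1.0):
    """Illustrative sketched Newton iteration for unregularized logistic regression.

    The Hessian (1/n) A^T D A is replaced by (1/n) (S B)^T (S B), where
    B = D^{1/2} A is a Hessian square root and S is an m x n Gaussian sketch
    with E[S^T S] = I.  Step-size rules, stopping criteria, and the constrained
    case are omitted.
    """
    n, p = A.shape
    rng = np.random.default_rng(0)
    x = np.zeros(p)
    for _ in range(num_iters):
        margins = y * (A @ x)
        probs = sigmoid(margins)
        grad = -(A.T @ (y * sigmoid(-margins))) / n          # gradient of the logistic loss
        d_half = np.sqrt(probs * (1.0 - probs))              # square roots of the Hessian weights
        S = rng.standard_normal((sketch_dim, n)) / np.sqrt(sketch_dim)
        SB = S @ (d_half[:, None] * A)                       # sketched Hessian square root
        H_sketch = SB.T @ SB / n + 1e-8 * np.eye(p)          # small damping for numerical safety
        x = x - step * np.linalg.solve(H_sketch, grad)       # sketched Newton step
    return x

# Illustrative usage on synthetic data drawn from a logistic model
rng = np.random.default_rng(1)
n, p, m = 2000, 20, 200
A = rng.standard_normal((n, p))
x_true = rng.standard_normal(p)
y = np.where(rng.random(n) < sigmoid(A @ x_true), 1.0, -1.0)
x_hat = newton_sketch_logreg(A, y, sketch_dim=m)
cosine = np.dot(x_hat, x_true) / (np.linalg.norm(x_hat) * np.linalg.norm(x_true))
print("cosine similarity with ground truth:", cosine)
```

As noted above, for a small unconstrained problem like this one the sketch dimension needed for accuracy is comparable to the dimension of the decision vector, so the example only illustrates the mechanics of the method rather than any computational advantage.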
- Week 9 - Mar 6, 7: Non-convex regularized loss minimization for sparse linear regression.
Reading:
- Week 10 - Mar 13: Non-convex regularizers in statistical estimation problems.
Mar 14: Bound on statistical error in the non-convex setting.
Reading:
- Week 11 - Mar 20, 21: Bound on optimization error in the non-convex setting.
Reading:
- Week 12 - Mar 27: Phase Synchronization: Formulation and preliminary observations.
Mar 28: The Generalized Power Method and bound on estimation error. (A minimal code sketch of the method is given after this week's reading list.)
Reading:
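- A minimal code sketch of a plain generalized power method for phase synchronization under the model C = z z^H + sigma * W, with spectral initialization and entrywise projection onto the unit circle. Shifted/step-size variants, stopping rules, and the parameter choices used in the lecture's analysis are omitted, and all names and parameters below are illustrative.

```python
import numpy as np

def generalized_power_method(C, num_iters=50):
    """Illustrative (plain) generalized power method for phase synchronization.

    Spectral initialization, then repeatedly multiply by C and project each
    entry back onto the unit circle.  Shifted/step-size variants and stopping
    rules from the literature are omitted.
    """
    _, vecs = np.linalg.eigh(C)               # eigenvalues in ascending order
    z = vecs[:, -1]                           # leading eigenvector
    z = z / np.abs(z)                         # entrywise projection onto the unit circle
    for _ in range(num_iters):
        w = C @ z
        z = w / np.abs(w)
    return z

# Illustrative usage: C = z* z*^H + sigma * W, with W a Hermitian Gaussian noise matrix
rng = np.random.default_rng(0)
n, sigma = 200, 1.0
z_true = np.exp(1j * rng.uniform(0.0, 2.0 * np.pi, n))
G = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
W = (G + G.conj().T) / 2.0                    # Hermitian noise with roughly unit-variance entries
C = np.outer(z_true, z_true.conj()) + sigma * W
z_hat = generalized_power_method(C)
phase = np.vdot(z_hat, z_true)                # optimal global phase alignment
err = np.linalg.norm(z_hat * (phase / np.abs(phase)) - z_true)
print("estimation error (modulo a global phase):", err)
```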
- Week 13 - Apr 3: Error bound for Phase Synchronization.
Apr 4: Ching Ming Festival.
Reading:
- Week 14 - Apr 10: Error bound for Phase Synchronization (cont'd).
Apr 11: The Generalized Power Method for MIMO Detection.
Reading:
- Week 15 - Apr 17: Easter Monday.
Apr 18: Wrap-up and possible future studies.
Reading:
- Inexact gradient methods do not necessarily have the sufficient decrease property. Nevertheless, under appropriate assumptions, their convergence rates can be analyzed using the error bound framework; see, e.g.,
So, Zhou. Non-Asymptotic Convergence Analysis of Inexact Gradient Methods for Machine Learning Without Strong Convexity. Accepted for publication in Optimization Methods and Software, 2017.
- Here are some recent articles on the analysis of non-convex optimization problems; see also the references therein:
Sun, Qu, Wright. When are Nonconvex Problems not Scary? Preprint, 2016.
Ge, Jin, Zheng. No Spurious Local Minima in Nonconvex Low Rank Problems: A Unified Geometric Analysis. Manuscript, 2017.
- There has been some recent interest in applying difference-of-convex optimization techniques to statistical optimization problems; see
Yin, Lou, He, Xin. Minimization of l1-2 for Compressed Sensing. SIAM Journal on Scientific Computing 37(1): A536-A563, 2015.
Pang, Razaviyayn, Alvarado. Computing B-Stationary Points of Nonsmooth DC Programs. Mathematics of Operations Research 42(1): 95-118, 2017.
Nouiehed, Pang, Razaviyayn. On the Pervasiveness of Difference-Convexity in Optimization and Statistics. Manuscript, 2017.
Homework Sets
Last Updated: April 27, 2017