CSE 712: Optimization for Modern Machine Learning

Course Description

Optimization is an essential component of modern machine learning and data science. In this seminar, we will read and discuss papers on optimization algorithms, their theory, and their applications in modern machine learning. Topics include (stochastic) gradient descent, variance-reduced methods, adaptive methods, derivative-free (zeroth-order) methods, and bilevel methods, together with their applications in classification, SVMs, neural networks, adversarial learning, meta-learning, GANs, and more. All students in this seminar are expected to read, discuss, present, and write summaries of selected papers on these topics.
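
To give a flavor of the methods we will study, the short sketch below runs plain stochastic gradient descent on a toy least-squares problem. It is a minimal illustration only; the data, step size, and iteration count are arbitrary choices and are not part of the course material.

    # Minimal SGD sketch on a toy least-squares problem (illustration only).
    import numpy as np

    rng = np.random.default_rng(0)
    n, d = 1000, 10
    A = rng.standard_normal((n, d))                  # feature matrix
    x_true = rng.standard_normal(d)                  # ground-truth parameters
    b = A @ x_true + 0.1 * rng.standard_normal(n)    # noisy observations

    x = np.zeros(d)      # initial iterate
    lr = 0.01            # step size (learning rate)
    for t in range(5000):
        i = rng.integers(n)                          # sample one data point uniformly
        g = (A[i] @ x - b[i]) * A[i]                 # stochastic gradient of 0.5*(a_i^T x - b_i)^2
        x -= lr * g                                  # SGD update
    print("parameter error:", np.linalg.norm(x - x_true))

The papers in the course schedule below refine this basic template with variance reduction, adaptive step sizes, zeroth-order gradient estimates, and bilevel structure.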

Logistics

Instructor: Kaiyi Ji, Assistant Professor, CSE, University at Buffalo (UB)

Contact: Davis Hall 338G, kaiyiji@buffalo.edu

Location and time: Davis Hall 113A, every Tuesday, 12:00 pm to 2:50 pm.

Office hours: On demand. Email me for appointments.

Course syllabus: available here (PDF).

Course Piazza: submit summaries and participate in discussions via Piazza.

  • To submit your summary (PDF), post a note visible only to me under the assignment/summary folder for that week.

References

There is no required textbook. Some suggested references:

  1. S. Boyd and L. Vandenberghe, “Convex Optimization,” Cambridge University Press, 2004.

  2. Y. Nesterov, “Introductory Lectures on Convex Optimization: A Basic Course,” Springer, 2004.

  3. L. Bottou, F. E. Curtis and J. Nocedal, “Optimization Methods for Large-Scale Machine Learning,” SIAM Review, 2018.

  4. I. Goodfellow, Y. Bengio, A. Courville. “Deep learning,” MIT press, 2016.

Course Objective

This course helps students understand various algorithm designs and convergence analyses in optimization, together with their applications in modern machine learning, and learn how to use them in theory and in practice. Other skills, such as giving presentations and writing paper summaries, are also practiced.

Course Requirements

  1. Finish the required reading before each lecture.

  2. Write a short summary (at most 1 page) of the paper(s) presented in each lecture. It should summarize the problem, algorithms, technical contributions, and experiments. Summaries are due every Monday at 11:59 pm (one day before the next lecture). No late submissions will be accepted.

  3. Present one of the selected papers during the semester. Each presentation should last 30-50 minutes, use 20-40 slides, and give a brief introduction to the background, motivation, problem, algorithm, theory, and experiments. You are encouraged, but not required, to share your slides before the presentation; if you do, a link to your slides will be posted on the course website.

  4. Each lecture involves one or two presentations. See course schedule!

Grading Policy

  • 25% for class participation

  • 45% for paper summaries (15 summaries × 3% each = 45%)

  • 30% for presentation

The seminar is graded S/U: a total score of at least 75% earns an S (Satisfactory); below 75% is a U (Unsatisfactory).

Course Schedule (Please sign up for papers here: link)

Each entry below lists the date, presenter(s), topic, assigned readings, and slides.

8/30/2022 (Kaiyi): Introduction
  Readings:
    Nesterov. Introductory Lectures on Convex Programming. 1998. (Sections 2.1.1 and 2.1.5)
    Bottou et al. Optimization Methods for Large-Scale Machine Learning. SIAM Review 2018. (Sections 2.1 and 2.2)
    (Optional) Sutskever et al. On the importance of initialization and momentum in deep learning. ICML 2013.
  Slides: S1

9/6/2022 (Akash Deshmukh): Stochastic method
  Readings: Bottou, Léon. Stochastic gradient descent tricks. Neural Networks: Tricks of the Trade, 2012.
  Slides: S2

9/13/2022 (Jiayu Qin): Incremental method
  Readings: Defazio et al. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. NeurIPS 2014.
  Slides: S3

9/20/2022 (1. Zhaofeng Si, 2. Viditha Wudaru): Variance reduction
  Readings:
    1. Johnson, Rie, and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. NeurIPS 2013.
    2. Cutkosky, Ashok, and Francesco Orabona. Momentum-based variance reduction in non-convex SGD. NeurIPS 2020.
  Slides: S4.1, S4.2

9/27/2022 (1. Madhavan Rangarajan, 2. Naman Tejaswi): Compressed method
  Readings:
    1. Bernstein, Jeremy, et al. signSGD: Compressed optimisation for non-convex problems. ICML 2018.
    2. Karimireddy, Sai Praneeth, et al. Error feedback fixes SignSGD and other gradient compression schemes. ICML 2019.
  Slides: S5.1, S5.2

10/4/2022 (1. Vanitha R, 2. Siddharth Sankaran): Adaptive optimizer
  Readings:
    1. Kingma, Diederik P., and Jimmy Ba. Adam: A method for stochastic optimization. ICLR 2014.
    2. Liu, Liyuan, et al. On the variance of the adaptive learning rate and beyond. ICLR 2020.
  Slides: S6.1, S6.2

10/11/2022 (1. Hrisheekesh Talari, 2. Brinda Ashar): Zeroth-order method
  Readings:
    1. Tu, Chun-Chen, et al. AutoZOOM: Autoencoder-based zeroth order optimization method for attacking black-box neural networks. AAAI 2019.
    2. Chen, Pin-Yu, et al. ZOO: Zeroth order optimization based black-box attacks to deep neural networks without training substitute models. ACM Workshop on Artificial Intelligence and Security, 2017.
  Slides: S7.1, S7.2

10/18/2022 (1. Vishwas Nalka, 2. Manoj Chowdary Raavi): Zeroth-order variance reduction
  Readings:
    1. Liu, Sijia, et al. Zeroth-order stochastic variance reduction for nonconvex optimization. NeurIPS 2018.
    2. Ji, Kaiyi, et al. Improved zeroth-order variance reduced algorithms and analysis for nonconvex optimization. ICML 2019.
  Slides: S8.1, S8.2

10/25/2022 (1. Rohan Sharma, 2. Adikavya Gupta): MAML in meta-learning
  Readings:
    1. Finn, Chelsea, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. ICML 2017.
    2. Finn, Chelsea, et al. Online meta-learning. ICML 2019.
  Slides: S9.1, S9.2

11/1/2022 (1. Shruti Jha, 2. Chengzhe Sun): MAML with partial adaptation
  Readings:
    1. Raghu, Aniruddh, et al. Rapid learning or feature reuse? Towards understanding the effectiveness of MAML. ICLR 2020.
    2. Ji, Kaiyi, et al. Convergence of meta-learning with task-specific adaptation over partial parameters. NeurIPS 2020.
  Slides: S10.1, S10.2

11/8/2022 (Ashray A Sripathi): Minimax optimizer
  Readings: Liu, Mingrui, et al. Towards better understanding of adaptive gradient algorithms in generative adversarial nets. ICLR 2020.
  Slides: S11

11/15/2022 (1. Jaishyam, 2. Naga Ashok Reddy): Bilevel optimizer
  Readings:
    1. Franceschi, Luca, et al. Bilevel programming for hyperparameter optimization and meta-learning. ICML 2018.
    2. Rajeswaran, et al. Meta-learning with implicit gradients. NeurIPS 2019.
  Slides: S12.1, S12.2

11/22/2022 (1. Meng Ding, 2. Muskan Arora): Advanced bilevel method
  Readings:
    1. Ji, Kaiyi, Junjie Yang, and Yingbin Liang. Bilevel optimization: Convergence analysis and enhanced design. ICML 2021.
    2. Liu, Risheng, et al. A value-function-based interior-point method for non-convex bi-level optimization. ICML 2021.
  Slides: S13.1, S13.2

11/29/2022 (1. Ning Gao, 2. Shreya Vinchankar): Hessian-free bilevel method
  Readings:
    1. Liu, Hanxiao, et al. DARTS: Differentiable architecture search. ICLR 2019.
    2. Nichol, Alex, et al. On first-order meta-learning algorithms. 2018.
  Slides: S14.1, S14.2

12/6/2022 (1. Shreyans Pathak, 2. Mingxi Lei): Federated learning
  Readings:
    1. McMahan, Brendan, et al. Communication-efficient learning of deep networks from decentralized data. AISTATS 2017.
    2. Li, Tian, et al. Federated optimization in heterogeneous networks. MLSys 2020: 429-450.
  Slides: S15.1, S15.2

Academic Integrity

Students are expected to write all summaries and homework independently, based on their paper reading, the presentations, and in-class discussion. Directly copying or paraphrasing others’ work or solutions is regarded as plagiarism and will result in an F grade. Any reference used in your presentation must be clearly cited. Academic integrity is required throughout your learning process. This course follows the departmental and university policies on academic integrity, which can be found at link.