My Personal Notes
Slides On Transformers (Attention Is All You Need)
Content:
Maths + Visualization of each Transformer block
PyTorch code implementation
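As a quick reference alongside these slides, here is a minimal scaled dot-product attention sketch in PyTorch (tensor shapes and names are illustrative, not taken from the slides):

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, mask=None):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = q.size(-1)
    # (batch, heads, seq_q, seq_k) attention scores
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = F.softmax(scores, dim=-1)
    return weights @ v, weights

# Example: batch of 2, 4 heads, sequence length 10, head dimension 16 (illustrative)
q = k = v = torch.randn(2, 4, 10, 16)
out, attn = scaled_dot_product_attention(q, k, v)
print(out.shape)   # torch.Size([2, 4, 10, 16])
print(attn.shape)  # torch.Size([2, 4, 10, 10])
```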
Class Notes On Introduction To Statistical Learning Theory
Content:
Basic concentration inequalities
Probably Approximately Correct (PAC) learning framework
Learning via Uniform Convergence
Vapnik-Chervonenkis (VC) dimension and the VC generalization bound
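For quick reference, two of the statements these notes build towards, written informally (constants follow the usual textbook conventions and are hidden in the big-O form):

```latex
% Hoeffding's inequality for i.i.d. X_1, ..., X_m taking values in [0, 1]:
\Pr\left[ \left| \tfrac{1}{m} \textstyle\sum_{i=1}^{m} X_i - \mathbb{E}[X_1] \right| \ge \epsilon \right]
  \le 2 \exp(-2 m \epsilon^2)

% VC generalization bound: with probability at least 1 - \delta,
% simultaneously for every h in \mathcal{H}, with d = \mathrm{VCdim}(\mathcal{H}):
L_{\mathcal{D}}(h) \;\le\; L_{S}(h)
  + O\!\left( \sqrt{ \frac{ d \log(m/d) + \log(1/\delta) }{ m } } \right)
```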
Survey On Generalization Bounds For Over-Parameterized Neural Networks
Abstract:
Modern neural networks are generally considered to be over-parameterized because they have significantly more parameters than the number of training examples needed to generalize well. In the over-parameterized regime, we can analyse the performance of these networks as their width tends to infinity. The infinite-width Neural Tangent Kernel (NTK) plays a crucial role in deriving generalization bounds for such networks. In this survey, we discuss two major results on generalization bounds: one for a 2-layer fully connected network (FCN) and one for a deep FCN.
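For context, the NTK mentioned above is, for a network f(x; θ) with parameters θ, the kernel

```latex
\Theta(x, x') \;=\; \left\langle \nabla_\theta f(x;\theta), \; \nabla_\theta f(x';\theta) \right\rangle ,
```

which becomes deterministic and stays essentially constant during training as the width tends to infinity; the generalization bounds discussed in the survey are stated in terms of this limiting kernel.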
Content:
PAC learning a linear combination of k ReLU activations under the standard Gaussian distribution with respect to the square loss.
An efficient algorithm whose running time is polynomial in the input dimension and the target accuracy.
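Spelled out (notation mine, not the paper's), the target class consists of functions of the form

```latex
f(x) \;=\; \sum_{i=1}^{k} a_i \, \sigma(w_i \cdot x),
\qquad \sigma(t) = \max(0, t), \qquad x \sim \mathcal{N}(0, I_d),
```

and the learner must output a hypothesis whose square loss with respect to f is within the target accuracy.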
Content:
Standard convergence results for Stochastic Gradient Descent (SGD) on convex and non-convex loss functions with fixed or diminishing step sizes.
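As a reminder of the flavour of these results (informal; exact step-size choices, smoothness, and bounded-variance assumptions omitted):

```latex
% Convex objective, averaged iterate \bar{x}_T, diminishing step size \eta_t \propto 1/\sqrt{t}:
\mathbb{E}\left[ f(\bar{x}_T) - f(x^\ast) \right] \;=\; O\!\left( \tfrac{1}{\sqrt{T}} \right)

% Smooth non-convex objective: convergence to an approximate stationary point,
\min_{t \le T} \; \mathbb{E}\left[ \| \nabla f(x_t) \|^2 \right] \;=\; O\!\left( \tfrac{1}{\sqrt{T}} \right)
```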
Content:
Proposes an algorithm called Local SGD, in which SGD runs independently in parallel on different workers and the iterate sequences are averaged only once in a while (a minimal sketch follows below).
Local SGD converges at the same rate as mini-batch SGD in terms of the number of evaluated gradients.
The number of communication rounds can be reduced by up to a factor of T^(1/2) compared to mini-batch SGD, where T is the total number of steps.
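A minimal single-process simulation of the Local SGD scheme described above (the toy quadratic objective, noise model, and parameter names are illustrative choices, not the paper's):

```python
import numpy as np

def local_sgd(grad_fn, x0, workers=4, local_steps=10, rounds=50, lr=0.1, seed=0):
    """Each worker runs SGD independently; iterates are averaged only every `local_steps` steps."""
    rng = np.random.default_rng(seed)
    xs = [x0.copy() for _ in range(workers)]
    for _ in range(rounds):
        for w in range(workers):
            for _ in range(local_steps):
                # stochastic gradient = true gradient + noise (toy noise model)
                xs[w] = xs[w] - lr * (grad_fn(xs[w]) + 0.1 * rng.standard_normal(x0.shape))
        avg = sum(xs) / workers  # one communication round: average the worker iterates
        xs = [avg.copy() for _ in range(workers)]
    return xs[0]

# Toy strongly convex objective f(x) = 0.5 * ||x||^2 with gradient x
x_final = local_sgd(lambda x: x, x0=np.ones(5))
print(x_final)  # should be close to the minimizer at 0
```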
Content:
A class of bilevel programming problems where the inner objective function is strongly convex.
An approximation algorithm for solving this class of problems, with a finite-time convergence analysis under different convexity assumptions on the outer objective function.
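The (assumed) general problem template, with outer objective f and inner objective g strongly convex in y:

```latex
\min_{x} \; f\big(x, \, y^{\ast}(x)\big)
\qquad \text{subject to} \qquad
y^{\ast}(x) \;=\; \arg\min_{y} \; g(x, y),
```

with the finite-time analysis carried out under different convexity assumptions (e.g. strongly convex, convex, or non-convex) on the outer objective f.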
Deep Learning Theory by Matus Telgarsky
Theory of Deep Learning by Sanjeev Arora et al.
The Principles of Deep Learning Theory by Daniel A. Roberts, Sho Yaida and Boris Hanin
Understanding Machine Learning from Theory to Algorithms by Shai Shalev-Shwartz and Shai Ben-David
Foundations of Machine Learning by Mehryar Mohri, Afshin Rostamizadeh and Ameet Talwalkar
Provably Learning One-Hidden Layer ReLU Neural Networks
2024
A faster and simpler algorithm for learning shallow networks.
Efficiently Learning One-Hidden-Layer ReLU Networks via Schur Polynomials.
2023
2020
Small Covers for Near-Zero Sets of Polynomials and Learning Latent Variable Models.
Algorithms and SQ Lower Bounds for PAC Learning One-Hidden-Layer ReLU Networks.
Generalization Bounds For Neural Networks
2024
2023
2022
2021
2020
A PAC-Bayesian Approach to Generalization Bounds for Graph Neural Networks.
2019
Generalization Bounds of Stochastic Gradient Descent for Wide and Deep Neural Networks.
Generalization bounds for deep convolutional neural networks.
On Generalization Bounds of a Family of Recurrent Neural Networks.
Regularization Matters: Generalization and Optimization of Neural Nets v.s. their Induced Kernel.
Reconciling modern machine learning practice and the bias-variance trade-off.
2018
Stronger generalization bounds for deep nets via a compression approach.
Learning and Generalization in Overparameterized Neural Networks, Going Beyond Two Layers.
Towards Understanding the Role of Over-Parametrization in Generalization of Neural Networks.
On Tighter Generalization Bounds for Deep Neural Networks: CNNs, ResNets, and Beyond.
ICLR: Non-Vacuous Generalization Bounds at the ImageNet Scale: A PAC-Bayesian Compression Approach.
2017
2016
2015
Optimization & Learning Of Neural Networks
2024
2023
COLT: Over-Parameterization Exponentially Slows Down Gradient Descent for Learning a Single Neuron.
COLT: SGD learning on neural networks: leap complexity and saddle-to-saddle dynamics.
2022
2021
2020
2019
Wide Neural Networks of Any Depth Evolve as Linear Models Under Gradient Descent.
ICLR: Gradient Descent Maximizes the Margin of Homogeneous Neural Networks.
NeurIPS: On the Convergence Rate of Training Recurrent Neural Networks.
2018
Gradient Descent Provably Optimizes Over-parameterized Neural Networks.
Stochastic Gradient Descent Optimizes Over-parameterized Deep ReLU Networks.
A Convergence Analysis of Gradient Descent for Deep Linear Neural Networks.
Gradient Descent Finds Global Minima of Deep Neural Networks.
ICML: A Convergence Theory for Deep Learning via Over-Parameterization.
2017
Globally Optimal Gradient Descent for a ConvNet with Gaussian Inputs.
NeurIPS: Convergence Analysis of Two-layer Neural Networks with ReLU Activation.
NeurIPS: Gradient descent GAN optimization is locally stable.
2016
Neural Tangent Kernel
2024
2023
2022
2021
2020
NeurIPS: A Generalized Neural Tangent Kernel Analysis for Two-layer Neural Networks.
Infinite attention: NNGP and NTK for deep attention networks.
2019
Neural Tangents: Fast and Easy Infinite Neural Networks in Python.
NeurIPS: Graph Neural Tangent Kernel: Fusing Graph Neural Networks with Graph Kernels.
Finite Depth and Width Corrections to the Neural Tangent Kernel.
2018
Tensor Programs
2023
Tensor Programs IVb: Adaptive Optimization in the Infinite-Width Limit.
Tensor Programs VI: Feature Learning in Infinite-Depth Neural Networks.
2022
2021
2020
2019