Speaker
Description
In deep learning, one often operates in a (highly) overparametrized regime, meaning there are significantly more trainable parameters than available training data. Nevertheless, experiments show that the generalization error after training with (stochastic) gradient descent is still small, although one would expect overfitting, i.e. a small training error together with a relatively large test error.
This suggests the existence of an implicit bias towards learning networks that generalize well, in settings where infinitely many networks can achieve zero training loss.
To investigate this phenomenon, we analyze the training dynamics of deep diagonal linear networks. Equivalently, this setting can be interpreted as the recovery of sparse signals from linear measurements.
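To make the setup concrete, here is a minimal sketch (illustrative only, not the speaker's code) of a depth-L diagonal linear network, parametrizing the predictor as the elementwise product w = u_1 * ... * u_L and trained by plain gradient descent on noiseless linear measurements of a sparse vector; the dimensions, initialization scale, and step size are assumptions chosen for the example.

# Sketch: depth-L diagonal linear network trained by gradient descent
# on the squared loss over measurements y = X w_star of a sparse w_star.
import numpy as np

rng = np.random.default_rng(0)
n, d, depth = 20, 50, 3                 # fewer measurements than parameters
w_star = np.zeros(d)
w_star[:3] = [2.0, 1.0, 0.5]            # sparse, nonnegative ground truth
X = rng.standard_normal((n, d))
y = X @ w_star

alpha = 0.1                             # small initialization scale
U = alpha * np.ones((depth, d))         # layer parameters u_1, ..., u_L

lr, steps = 1e-2, 50_000
for _ in range(steps):
    w = U.prod(axis=0)                  # effective linear predictor
    grad_w = X.T @ (X @ w - y) / n      # gradient of the least-squares loss w.r.t. w
    # chain rule: dL/du_l = dL/dw * prod_{k != l} u_k, applied to all layers at once
    layer_grads = np.stack([grad_w * np.delete(U, l, axis=0).prod(axis=0)
                            for l in range(depth)])
    U -= lr * layer_grads

w = U.prod(axis=0)
print("training loss:", 0.5 * np.mean((X @ w - y) ** 2))
print("largest recovered coordinates:", np.argsort(-np.abs(w))[:3])  # expected to match the support {0, 1, 2}

With this small initialization, gradient descent on the overparametrized factors tends to drive the training loss to zero while keeping the off-support coordinates of w small, which is one concrete manifestation of the implicit bias discussed above.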
We propose a method to prove convergence of gradient descent and to fully characterize its limit, using techniques inspired by mirror descent and a Łojasiewicz-type inequality.
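For orientation, the two ingredients mentioned above take, in their generic textbook form (not necessarily the exact statements used in the talk), the following shape:

\[
\nabla\phi(w_{k+1}) \;=\; \nabla\phi(w_k) \;-\; \eta\,\nabla L(w_k)
\qquad\text{(mirror-descent step with mirror map } \phi\text{),}
\]
\[
|L(w) - L^\ast|^{\theta} \;\le\; C\,\|\nabla L(w)\|
\qquad\text{(Łojasiewicz gradient inequality, } \theta \in (0,1)\text{).}
\]

In this generic picture, a Łojasiewicz-type inequality yields convergence of the loss values along the trajectory, while the mirror-descent viewpoint typically identifies the limit, when the iterates interpolate the data \(Xw = y\), as the interpolating solution minimizing the Bregman divergence \(D_\phi(\cdot, w_0)\) to the initialization.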