Analytic theory of dropout regularization
Abstract
Dropout is a regularization technique widely used in training artificial neural networks to mitigate overfitting. It consists of dynamically deactivating subsets of the network during training to promote more robust representations. Despite its widespread adoption, dropout probabilities are often selected heuristically, and theoretical explanations of its success remain sparse. Here, we analytically study dropout in two-layer neural networks trained with online stochastic gradient descent. In the high-dimensional limit, we derive a set of ordinary differential equations that fully characterize the evolution of the network during training and capture the effects of dropout. We obtain a number of exact results describing the generalization error and the optimal dropout probability at short, intermediate, and long training times. Our analysis shows that dropout reduces detrimental correlations between hidden nodes, mitigates the impact of label noise, and that the optimal dropout probability increases with the level of noise in the data. Our results are validated by extensive numerical simulations.