Description
We present and discuss a parallel strategy for training neural networks. Our strategy is based on a domain decomposition-like approach, combined with a trust-region method as a convergence control strategy. The resulting additive non-linear preconditioner, APTS (Additively Preconditioned Trust-region Strategy), provides a general framework for the parallel training of neural networks, which includes the decomposition of the network's parameters (model parallelism) or of the samples of the training data set (data parallelism). The combination with a trust-region strategy ensures global convergence and eliminates the need for extensive hyper-parameter tuning. We furthermore remark on SAPTS, a stochastic variant of APTS.
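To give a rough flavor of the idea, the sketch below illustrates one additively preconditioned trust-region step on a toy quadratic loss with the parameters split into two "subdomains". This is only a minimal illustration under our own assumptions (steepest-descent local solvers, disjoint parameter blocks, a simple acceptance test), not the authors' implementation; all names such as apts_step are hypothetical.

```python
import numpy as np

# Toy quadratic loss as a stand-in for a network's training loss (assumption:
# the actual APTS operates on neural networks; this is only an illustration).
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 10))
b = rng.standard_normal(20)

def loss(w):
    r = A @ w - b
    return 0.5 * r @ r

def grad(w):
    return A.T @ (A @ w - b)

def apts_step(w, blocks, local_radius=0.5, global_radius=1.0):
    """One additive preconditioned trust-region step (parameter decomposition).

    Each parameter block ("subdomain") proposes a local correction, limited by
    a local trust region; the corrections are summed additively, and the
    combined step is accepted only if it decreases the loss within a global
    trust region.
    """
    g = grad(w)
    correction = np.zeros_like(w)
    for idx in blocks:
        # Local step: steepest descent restricted to the block,
        # clipped to the local trust-region radius.
        s = np.zeros_like(w)
        s[idx] = -g[idx]
        norm = np.linalg.norm(s)
        if norm > local_radius:
            s *= local_radius / norm
        correction += s
    # Global trust-region control on the additively combined correction.
    norm = np.linalg.norm(correction)
    if norm > global_radius:
        correction *= global_radius / norm
    trial = w + correction
    return trial if loss(trial) < loss(w) else w  # accept only on decrease

# Two parameter "subdomains": first and second half of the weights.
blocks = [np.arange(0, 5), np.arange(5, 10)]
w = np.zeros(10)
for _ in range(50):
    w = apts_step(w, blocks)
print(f"final loss: {loss(w):.4f}")
```

In this sketch the local corrections could be computed independently, which is where the inherent parallelism of the parameter decomposition comes from; a data decomposition would instead assign subsets of the training samples to the local problems.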
We compare (S)APTS in terms of convergence behavior and hyper-parameter sensitivity to traditional training methods such as Stochastic Gradient Descent (SGD), ADAptive Moment estimation (Adam), and the Limited-memory Broyden–Fletcher–Goldfarb–Shanno (LBFGS) algorithm. Our numerical experiments, conducted on benchmark problems from image and text classification, showcase the capabilities, strengths, and limitations of APTS in training neural networks. The experiments demonstrate that APTS applied to the parameter space (model parallelism), especially with an increased number of subdomains, achieves comparable or superior generalization capabilities and faster convergence than traditional optimizers, while offering inherent parallelism in the training procedure. APTS applied to the data space, however, shows competitive generalization capabilities only with a small number of "data subdomains", as its performance does not scale well with an increasing number of subdomains.