Speaker
Description
In recent years, Transformer-based models have achieved state-of-the-art results in areas such as natural language processing, computer vision, multimodal learning, and robotics, owing to the parallelization of their attention mechanism and its direct access to distant tokens in the sequence. However, this parallelization applies only along the sequence length, not across the number of layers (i.e., the depth). Despite the impressive performance it yields, the continued scaling of Transformers' depth and dimension entails a high computational cost. By formulating the forward and backward propagations of the Transformer as ODEs, we explore parallel-in-time and multilevel methods to mitigate the computational cost incurred by a large depth. We present numerical experiments from large language modeling that demonstrate the effectiveness of the proposed training strategies.
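The abstract does not specify which parallel-in-time method is used. As one hedged illustration of the general idea, the sketch below reads a stack of residual layers, x_{l+1} = x_l + h f(x_l), as forward Euler steps on the ODE dx/dt = f(x), and then applies a parareal iteration: cheap sequential coarse sweeps are corrected by expensive fine solves that are independent across time intervals and could run in parallel. The residual map `f`, the helper `propagate`, and all sizes and step counts are illustrative assumptions, not the speakers' implementation.

```python
# A minimal sketch (assumed, not the talk's code) of parareal applied to
# the ODE view of residual layers: x_{l+1} = x_l + h*f(x_l) is forward
# Euler on dx/dt = f(x). The map `f` stands in for a transformer block.

import numpy as np

rng = np.random.default_rng(0)
d = 16                                        # toy hidden dimension
W = rng.standard_normal((d, d)) / np.sqrt(d)

def f(x):
    """Stand-in residual map for one transformer block."""
    return np.tanh(x @ W)

def propagate(x, t0, t1, n_steps):
    """Forward Euler over [t0, t1] using n_steps layers (steps)."""
    h = (t1 - t0) / n_steps
    for _ in range(n_steps):
        x = x + h * f(x)
    return x

# Time grid: N coarse intervals, each covering several "layers".
T, N = 1.0, 8
t = np.linspace(0.0, T, N + 1)
fine = lambda x, n: propagate(x, t[n], t[n + 1], n_steps=16)   # expensive
coarse = lambda x, n: propagate(x, t[n], t[n + 1], n_steps=1)  # cheap

x0 = rng.standard_normal(d)

# Initial coarse sweep (sequential, but cheap).
U = [x0]
for n in range(N):
    U.append(coarse(U[n], n))

# Parareal iterations: the fine solves are independent across intervals
# and could be distributed; only the coarse correction is sequential.
for k in range(4):
    F_vals = [fine(U[n], n) for n in range(N)]   # parallelizable step
    U_new = [x0]
    for n in range(N):
        # Parareal update: G(new) + F(old) - G(old).
        U_new.append(coarse(U_new[n], n) + F_vals[n] - coarse(U[n], n))
    U = U_new

# Compare against the fully sequential fine solution.
x_seq = propagate(x0, 0.0, T, n_steps=16 * N)
print("parareal vs sequential error:", np.linalg.norm(U[-1] - x_seq))
```

In this reading, depth plays the role of time, so the per-iteration cost is dominated by one coarse sweep plus one batch of concurrent fine solves rather than a full sequential pass through all layers; the multilevel methods mentioned in the abstract extend the same idea across a hierarchy of coarse grids.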