Stabilizing Transformers for Reinforcement Learning
Emilio Parisotto, Francis Song, Jack Rae, Razvan Pascanu, Caglar Gulcehre, Siddhant Jayakumar, Max Jaderberg, Raphaël Lopez Kaufman, Aidan Clark, Seb Noury, Matthew Botvinick, Nicolas Heess, Raia Hadsell
Owing to their ability to both effectively integrate information over long time horizons and scale to massive amounts of data, self-attention architectures have recently shown breakthrough success in natural language processing (NLP). Harnessing the transformer's ability to process long time horizons of information could provide a similar performance boost in partially observable reinforcement learning (RL) domains, but the large-scale transformers used in NLP have yet to be successfully applied to the RL setting. In this work we demonstrate that the standard transformer architecture is difficult to optimize, which was previously observed in the supervised learning setting but becomes especially pronounced with RL objectives. We propose architectural modifications that substantially improve the stability and learning speed of the original Transformer and XL variant. The proposed architecture, the Gated Transformer-XL (GTrXL), surpasses LSTMs on challenging memory environments and achieves state-of-the-art results on the multi-task DMLab-30 benchmark suite, exceeding the performance of an external memory architecture. We show that the GTrXL has stability and performance that consistently matches or exceeds a competitive LSTM baseline, including on more reactive tasks where memory is less critical.
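To make the abstract's "architectural modifications" concrete, the sketch below is a minimal, illustrative PyTorch rendering of a gated transformer block in the spirit of GTrXL: layer normalization is moved to the inputs of each sub-module (identity-map reordering) and the residual connections are replaced by GRU-style gating layers biased towards the identity path at initialization. All names, hyperparameters, and the use of PyTorch's standard `nn.MultiheadAttention` (rather than the relative-position attention of Transformer-XL) are assumptions for illustration, not the paper's implementation.

```python
# Illustrative GRU-gated transformer block (not the authors' code).
import torch
import torch.nn as nn


class GRUGate(nn.Module):
    """GRU-style gate combining the skip stream x with the sub-module output y."""

    def __init__(self, d_model: int, bias_init: float = 2.0):
        super().__init__()
        self.w_r = nn.Linear(d_model, d_model, bias=False)
        self.u_r = nn.Linear(d_model, d_model, bias=False)
        self.w_z = nn.Linear(d_model, d_model, bias=False)
        self.u_z = nn.Linear(d_model, d_model, bias=False)
        self.w_g = nn.Linear(d_model, d_model, bias=False)
        self.u_g = nn.Linear(d_model, d_model, bias=False)
        # Negative update-gate bias keeps the block close to the identity at init.
        self.bias_z = nn.Parameter(torch.full((d_model,), -bias_init))

    def forward(self, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        r = torch.sigmoid(self.w_r(y) + self.u_r(x))
        z = torch.sigmoid(self.w_z(y) + self.u_z(x) + self.bias_z)
        h = torch.tanh(self.w_g(y) + self.u_g(r * x))
        return (1.0 - z) * x + z * h


class GatedTransformerBlock(nn.Module):
    """Pre-LayerNorm attention and feed-forward sub-blocks, each followed by a GRU gate."""

    def __init__(self, d_model: int = 256, n_heads: int = 4, d_ff: int = 1024):
        super().__init__()
        self.norm_attn = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.gate_attn = GRUGate(d_model)
        self.norm_ff = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.gate_ff = GRUGate(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Attention sub-block: LayerNorm on the input, gate in place of the residual add.
        y = self.norm_attn(x)
        y, _ = self.attn(y, y, y, need_weights=False)
        x = self.gate_attn(x, torch.relu(y))
        # Position-wise feed-forward sub-block, gated the same way.
        y = self.ff(self.norm_ff(x))
        return self.gate_ff(x, torch.relu(y))


if __name__ == "__main__":
    block = GatedTransformerBlock()
    tokens = torch.randn(2, 16, 256)  # (batch, time, features)
    print(block(tokens).shape)        # torch.Size([2, 16, 256])
```

The design intent illustrated here is that, with the update gate biased shut, a freshly initialized stack behaves almost like an identity map, which is one plausible reading of how such gating could ease optimization under RL objectives.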