Reparameterized Policy Learning for Multimodal Trajectory Optimization
Zhiao Huang, Litian Liang, Zhan Ling, Xuanlin Li, Chuang Gan, Hao Su
We investigate the challenge of parameterizing policies for reinforcement learning (RL) in high-dimensional continuous action spaces. Our objective is to develop a multimodal policy that overcomes limitations inherent in the commonly used Gaussian parameterization. To achieve this, we propose a principled framework that models the continuous RL policy as a generative model of optimal trajectories. By conditioning the policy on a latent variable, we derive a novel variational bound as the optimization objective, which promotes exploration of the environment. We then present a practical model-based RL method, called Reparameterized Policy Gradient (RPG), which leverages the multimodal policy parameterization and a learned world model to achieve strong exploration capabilities and high data efficiency. Empirical results demonstrate that our method can help agents evade local optima in tasks with dense rewards and solve challenging sparse-reward environments by incorporating an object-centric intrinsic reward. Our method consistently outperforms previous approaches across a range of tasks. Code and supplementary materials are available on the project page: https://haosulab.github.io/RPG/
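The idea of a latent-conditioned multimodal policy can be illustrated with a minimal sketch. This is not the paper's actual architecture: the dimensions, the random linear "networks" (`W_prior`, `W_mu`), and the discrete latent are hypothetical stand-ins chosen to show why conditioning on a latent variable yields a multimodal action distribution, in contrast to a single Gaussian head.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions for illustration only.
STATE_DIM, ACTION_DIM, LATENT_DIM = 4, 2, 3

# Random linear maps standing in for learned networks.
W_prior = rng.normal(size=(STATE_DIM, LATENT_DIM))            # state -> latent logits
W_mu = rng.normal(size=(STATE_DIM + LATENT_DIM, ACTION_DIM))  # (state, latent) -> action mean

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def sample_action(state):
    """Sample from a latent-conditioned policy:
    z ~ p(z|s) (discrete latent selecting a mode),
    then a ~ N(mu(s, z), sigma^2 I) within that mode."""
    probs = softmax(state @ W_prior)
    z = rng.choice(LATENT_DIM, p=probs)          # pick a behavior mode
    z_onehot = np.eye(LATENT_DIM)[z]
    mu = np.concatenate([state, z_onehot]) @ W_mu
    action = mu + 0.1 * rng.normal(size=ACTION_DIM)
    return z, action

state = rng.normal(size=STATE_DIM)
# Repeated sampling can visit several distinct modes, which a single
# unimodal Gaussian policy cannot represent.
modes_visited = {sample_action(state)[0] for _ in range(200)}
```

Because the action mean depends on the sampled latent `z`, the marginal action distribution is a mixture of Gaussians rather than a single Gaussian, which is the structural property the multimodal parameterization exploits for exploration.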