## Contents
* [Welcome!](https://github.com/medipixel/rl_algorithms#welcome)
* [Contributors](https://github.com/medipixel/rl_algorithms#contributors)
* [Algorithms](https://github.com/medipixel/rl_algorithms#algorithms)
* [Performance](https://github.com/medipixel/rl_algorithms#performance)
* [Getting Started](https://github.com/medipixel/rl_algorithms#getting-started)
* [Class Diagram](https://github.com/medipixel/rl_algorithms#class-diagram)
* [References](https://github.com/medipixel/rl_algorithms#references)
## Welcome!
This repository contains reinforcement learning algorithms that are used for research at Medipixel. The source code will be updated frequently.
We warmly welcome external contributors! :)
|BC agent on LunarLanderContinuous-v2|RainbowIQN agent on PongNoFrameskip-v4|SAC agent on Reacher-v2|
|:---:|:---:|:---:|
## Contributors
Thanks goes to these wonderful people ([emoji key](https://allcontributors.org/docs/en/emoji-key)):
This project follows the [all-contributors](https://github.com/all-contributors/all-contributors) specification.
## Algorithms
0. [Advantage Actor-Critic (A2C)](https://github.com/medipixel/rl_algorithms/tree/master/rl_algorithms/a2c)
1. [Deep Deterministic Policy Gradient (DDPG)](https://github.com/medipixel/rl_algorithms/tree/master/rl_algorithms/ddpg)
2. [Proximal Policy Optimization Algorithms (PPO)](https://github.com/medipixel/rl_algorithms/tree/master/rl_algorithms/ppo)
3. [Twin Delayed Deep Deterministic Policy Gradient Algorithm (TD3)](https://github.com/medipixel/rl_algorithms/tree/master/rl_algorithms/td3)
4. [Soft Actor Critic Algorithm (SAC)](https://github.com/medipixel/rl_algorithms/tree/master/rl_algorithms/sac)
5. [Behaviour Cloning (BC with DDPG, SAC)](https://github.com/medipixel/rl_algorithms/tree/master/rl_algorithms/bc)
6. [From Demonstrations (DDPGfD, SACfD, DQfD)](https://github.com/medipixel/rl_algorithms/tree/master/rl_algorithms/fd)
7. [Rainbow DQN](https://github.com/medipixel/rl_algorithms/tree/master/rl_algorithms/dqn)
8. [Rainbow IQN (without DuelingNet)](https://github.com/medipixel/rl_algorithms/tree/master/rl_algorithms/dqn) - DuelingNet [degrades performance](https://github.com/medipixel/rl_algorithms/pull/137)
9. Rainbow IQN (with [ResNet](https://github.com/medipixel/rl_algorithms/blob/master/rl_algorithms/common/networks/backbones/resnet.py))
10. [Recurrent Replay DQN (R2D1)](https://github.com/medipixel/rl_algorithms/tree/master/rl_algorithms/recurrent)
11. [Distributed Prioritized Experience Replay (Ape-X)](https://github.com/medipixel/rl_algorithms/tree/master/rl_algorithms/common/apex)
12. [Policy Distillation](https://github.com/medipixel/rl_algorithms/tree/master/rl_algorithms/distillation)
13. [Generative Adversarial Imitation Learning (GAIL)](https://github.com/medipixel/rl_algorithms/tree/master/rl_algorithms/gail)
14. [Sample Efficient Actor-Critic with Experience Replay (ACER)](https://github.com/medipixel/rl_algorithms/tree/master/rl_algorithms/acer)
## Performance
We have tested each algorithm on one or more of the following environments.
- [PongNoFrameskip-v4](https://github.com/medipixel/rl_algorithms/tree/master/configs/pong_no_frameskip_v4)
- [LunarLanderContinuous-v2](https://github.com/medipixel/rl_algorithms/tree/master/configs/lunarlander_continuous_v2)
- [LunarLander-v2](https://github.com/medipixel/rl_algorithms/tree/master/configs/lunarlander_v2)
- [Reacher-v2](https://github.com/medipixel/rl_algorithms/tree/master/configs/reacher-v2)
Please note that this won't be frequently updated.
#### PongNoFrameskip-v4
**RainbowIQN** learns the game incredibly fast! It achieves the perfect score (21) [within 100 episodes](https://app.wandb.ai/curt-park/dqn/runs/b2p9e9f7/logs)!
The idea of RainbowIQN is loosely based on [W. Dabney et al.](https://arxiv.org/pdf/1806.06923.pdf).
See [W&B Log](https://app.wandb.ai/curt-park/dqn/reports?view=curt-park%2FPong%20%28DQN%20%2F%20C51%20%2F%20IQN%20%2F%20IQN%20-double%20q%29) for more details. (The performance is measured on the commit [4248057](https://github.com/medipixel/rl_algorithms/pull/158))

**RainbowIQN with ResNet**'s performance and learning speed were similar to those of RainbowIQN. We also confirmed that **R2D1 (w/ Dueling, PER)** converges well in the Pong environment, though not as fast as RainbowIQN (in terms of update steps).
Although we were only able to test **Ape-X DQN (w/ Dueling)** with 4 workers due to limited computing power, we observed a significant speed-up in carrying out update steps (with batch size 512). Ape-X DQN learns the Pong game in about 2 hours, compared to 4 hours for serial Dueling DQN.
See [W&B Log](https://app.wandb.ai/medipixel_rl/PongNoFrameskip-v4/reports/200626-integration-test--VmlldzoxNTE1NjE) for more details. (The performance is measured on the commit [9e897ad](https://github.com/medipixel/rl_algorithms/commit/9e897adfe93600c1db85ce1a7e064064b025c2c3))


#### LunarLander-v2 / LunarLanderContinuous-v2
We used these environments just for a quick verification of each algorithm, so some of the experiments may not show the best performance.
##### Experiment results
- LunarLander-v2: RainbowDQN, RainbowDQfD, R2D1 (see the W&B log for more details; measured on commit 9e897ad)
- LunarLander-v2: ACER, RainbowDQN, R2D1 (see the W&B log for more details; measured on commit 82fae77)
- LunarLanderContinuous-v2: A2C, PPO, DDPG, TD3, SAC (see the W&B log for more details; measured on commit 9e897ad)
- LunarLanderContinuous-v2: DDPG, DDPGfD, BC-DDPG (see the W&B log for more details; measured on commit 9e897ad)
- LunarLanderContinuous-v2: SAC, SACfD, BC-SAC (see the W&B log for more details; measured on commit 9e897ad)
- LunarLanderContinuous-v2: PPO, SAC, GAIL (see the W&B log for more details; measured on commit 9e897ad)

#### Reacher-v2
We reproduced the performance of **DDPG**, **TD3**, and **SAC** on Reacher-v2 (MuJoCo). They reach scores of around -3.5 to -4.5.
##### Experiment results
- Reacher-v2: DDPG, TD3, SAC (see the [W&B log](https://app.wandb.ai/medipixel_rl/reacher-v2/reports?view=curt-park%2FBaselines%20%23158) for more details)

## Getting started
#### Prerequisites
* This repository is tested in an [Anaconda](https://www.anaconda.com/distribution/) virtual environment with Python 3.6.1+.
```
$ conda create -n rl_algorithms python=3.7.9
$ conda activate rl_algorithms
```
* In order to run MuJoCo environments (e.g. `Reacher-v2`), you need to acquire a [MuJoCo license](https://www.roboti.us/license.html).
#### Installation
First, clone the repository.
```
git clone https://github.com/medipixel/rl_algorithms.git
cd rl_algorithms
```
###### For users
Install the packages required to execute the code. This includes `python setup.py install`. Just type:
```
make dep
```
###### For developers
If you want to modify the code, you should configure the formatting and linting settings so that formatting and linting run automatically when you commit. Unlike the `make dep` command, this includes `python setup.py develop`. Just type:
```
make dev
```
After running `make dev`, you can validate the code with the following commands.
```
make format  # for formatting
make test    # for linting
```
#### Usages
You can train or test `algorithm` on `env_name` if `configs/env_name/algorithm.yaml` exists. (`configs/env_name/algorithm.yaml` contains the hyper-parameters.)
```
python run_env_name.py --cfg-path
```
e.g. running soft actor-critic on LunarLanderContinuous-v2.
```
python run_lunarlander_continuous_v2.py --cfg-path ./configs/lunarlander_continuous_v2/sac.yaml
```
e.g. running a custom agent, **if you have written your own configs**: `configs/env_name/ddpg-custom.yaml`.
```
python run_lunarlander_continuous_v2.py --cfg-path ./configs/lunarlander_continuous_v2/ddpg-custom.yaml
```
You will see the agent run with the hyper-parameters and model settings you configured.
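If you want to write your own config, a simple way to start is to copy an existing one and modify its hyper-parameters. A minimal sketch, assuming a `ddpg.yaml` already exists under `configs/lunarlander_continuous_v2/` (the file names here are illustrative):
```
# Copy an existing config as a template for a custom agent (illustrative paths).
cp ./configs/lunarlander_continuous_v2/ddpg.yaml ./configs/lunarlander_continuous_v2/ddpg-custom.yaml

# Edit the hyper-parameters in ddpg-custom.yaml, then run it:
python run_lunarlander_continuous_v2.py --cfg-path ./configs/lunarlander_continuous_v2/ddpg-custom.yaml
```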
#### Arguments for run-files
In addition, there are various arguments for running the algorithms. To check the options available for a run file, use the `-h` flag (a combined example is shown after the list below):
```
python run_env_name.py -h
```
- `--test`
    - Start test mode (no training).
- `--off-render`
    - Turn off rendering.
- `--log`
    - Turn on logging using [W&B](https://www.wandb.com/).
- `--seed`
    - Set the random seed.
- `--save-period`
    - Set the saving period for model and optimizer parameters.
- `--max-episode-steps`
    - Set the maximum number of steps per episode. If the value is less than or equal to 0, the environment's default maximum is used.
- `--episode-num`
    - Set the number of episodes for training.
- `--render-after`
    - Start rendering after the given number of episodes.
- `--load-from`
    - Load the saved models and optimizers at the beginning.
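These flags can be combined; for example, to evaluate a saved agent in test mode without rendering, something like the following could be used (the `<checkpoint>` placeholder is illustrative, not an actual path):
```
# Evaluate a trained SAC agent without training or rendering (checkpoint path is a placeholder).
python run_lunarlander_continuous_v2.py --cfg-path ./configs/lunarlander_continuous_v2/sac.yaml \
    --test --off-render --load-from <checkpoint>
```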
#### Show feature map with Grad-CAM and Saliency-map
You can visualize the feature maps that a trained agent extracts using **[Grad-CAM (Gradient-weighted Class Activation Mapping)](https://arxiv.org/pdf/1610.02391.pdf)** and a **[saliency map](https://arxiv.org/pdf/1312.6034.pdf)**.
Grad-CAM combines feature maps using the gradient signal and produces a coarse localization map of the important regions in the image. You can use it by adding a [Grad-CAM config](https://github.com/medipixel/rl_algorithms/blob/master/configs/pong_no_frameskip_v4/dqn.py#L39) and the `--grad-cam` flag when you run. For example:
```
python run_env_name.py --cfg-path --test --grad-cam
```
The results will be rendered on screen while the agent runs.
You can also use the saliency map in a similar way to Grad-CAM, just by adding the `--saliency-map` flag. The saliency map needs trained weights, which are passed via the `--load-from` flag.
```
python run_env_name.py --cfg-path --load-from --test --saliency-map
```
The saliency maps will be stored in `data/saliency_map`.
Both Grad-CAM and the saliency map can only be used with agents that have convolutional layers, such as **DQN for the Pong environment**. You can see the feature maps of all the configured convolutional layers.
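As a concrete illustration, here is a sketch for the Pong DQN agent, assuming the run file follows the `run_env_name.py` pattern above and that the config lives at `configs/pong_no_frameskip_v4/dqn.yaml` (both names, and the checkpoint placeholder, are assumptions):
```
# Show Grad-CAM feature maps for a trained DQN agent on Pong (paths are illustrative).
python run_pong_no_frameskip_v4.py --cfg-path ./configs/pong_no_frameskip_v4/dqn.yaml \
    --load-from <checkpoint> --test --grad-cam
```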
#### Using policy distillation
The documentation for using policy distillation is kept separately in [rl_algorithms/distillation/README.md](https://github.com/medipixel/rl_algorithms/tree/master/rl_algorithms/distillation).
#### W&B for logging
We use [W&B](https://www.wandb.com/) for logging network parameters and other metrics. To enable logging, please follow the steps below after installing the requirements:
>0. Create a [wandb](https://www.wandb.com/) account.
>1. Check your **API key** in settings, and log in to wandb from your terminal: `$ wandb login API_KEY`
>2. Initialize wandb: `$ wandb init`

For more details, read the [W&B tutorial](https://docs.wandb.com/docs/started.html).
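Once you are logged in, adding the `--log` flag to a run records the experiment in W&B; for example, reusing the SAC config from the usage section above:
```
# Train SAC on LunarLanderContinuous-v2 and log metrics to W&B.
python run_lunarlander_continuous_v2.py --cfg-path ./configs/lunarlander_continuous_v2/sac.yaml --log
```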
## Class Diagram
The class diagram is available at [#135](https://github.com/medipixel/rl_algorithms/pull/135).
Please note that it won't be frequently updated.

## Citing the Project
To cite this repository in publications:
```
@misc{rl_algorithms,
  author = {Kim, Kyunghwan and Lee, Chaehyuk and Jeong, Euijin and Han, Jiseong and Kim, Minseop and Yoon, Chris and Kim, Mincheol and Park, Jinwoo},
  title = {Medipixel RL algorithms},
  year = {2020},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/medipixel/rl_algorithms}},
}
```
## References
0. [T. P. Lillicrap et al., "Continuous control with deep reinforcement learning." arXiv preprint arXiv:1509.02971, 2015.](https://arxiv.org/pdf/1509.02971.pdf)
1. [J. Schulman et al., "Proximal Policy Optimization Algorithms." arXiv preprint arXiv:1707.06347, 2017.](https://arxiv.org/abs/1707.06347.pdf)
2. [S. Fujimoto et al., "Addressing function approximation error in actor-critic methods." arXiv preprint arXiv:1802.09477, 2018.](https://arxiv.org/pdf/1802.09477.pdf)
3. [T. Haarnoja et al., "Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor." arXiv preprint arXiv:1801.01290, 2018.](https://arxiv.org/pdf/1801.01290.pdf)
4. [T. Haarnoja et al., "Soft Actor-Critic Algorithms and Applications." arXiv preprint arXiv:1812.05905, 2018.](https://arxiv.org/pdf/1812.05905.pdf)
5. [T. Schaul et al., "Prioritized Experience Replay." arXiv preprint arXiv:1511.05952, 2015.](https://arxiv.org/pdf/1511.05952.pdf)
6. [M. Andrychowicz et al., "Hindsight Experience Replay." arXiv preprint arXiv:1707.01495, 2017.](https://arxiv.org/pdf/1707.01495.pdf)
7. [A. Nair et al., "Overcoming Exploration in Reinforcement Learning with Demonstrations." arXiv preprint arXiv:1709.10089, 2017.](https://arxiv.org/pdf/1709.10089.pdf)
8. [M. Vecerik et al., "Leveraging Demonstrations for Deep Reinforcement Learning on Robotics Problems with Sparse Rewards." arXiv preprint arXiv:1707.08817, 2017.](https://arxiv.org/pdf/1707.08817.pdf)
9. [V. Mnih et al., "Human-level control through deep reinforcement learning." Nature, 518(7540):529–533, 2015.](https://storage.googleapis.com/deepmind-media/dqn/DQNNaturePaper.pdf)
10. [van Hasselt et al., "Deep Reinforcement Learning with Double Q-learning." arXiv preprint arXiv:1509.06461, 2015.](https://arxiv.org/pdf/1509.06461.pdf)
11. [Z. Wang et al., "Dueling Network Architectures for Deep Reinforcement Learning." arXiv preprint arXiv:1511.06581, 2015.](https://arxiv.org/pdf/1511.06581.pdf)
12. [T. Hester et al., "Deep Q-learning from Demonstrations." arXiv preprint arXiv:1704.03732, 2017.](https://arxiv.org/pdf/1704.03732.pdf)
13. [M. G. Bellemare et al., "A Distributional Perspective on Reinforcement Learning." arXiv preprint arXiv:1707.06887, 2017.](https://arxiv.org/pdf/1707.06887.pdf)
14. [M. Fortunato et al., "Noisy Networks for Exploration." arXiv preprint arXiv:1706.10295, 2017.](https://arxiv.org/pdf/1706.10295.pdf)
15. [M. Hessel et al., "Rainbow: Combining Improvements in Deep Reinforcement Learning." arXiv preprint arXiv:1710.02298, 2017.](https://arxiv.org/pdf/1710.02298.pdf)
16. [W. Dabney et al., "Implicit Quantile Networks for Distributional Reinforcement Learning." arXiv preprint arXiv:1806.06923, 2018.](https://arxiv.org/pdf/1806.06923.pdf)
17. [Ramprasaath R. Selvaraju et al., "Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization." arXiv preprint arXiv:1610.02391, 2016.](https://arxiv.org/pdf/1610.02391.pdf)
18. [Kaiming He et al., "Deep Residual Learning for Image Recognition." arXiv preprint arXiv:1512.03385, 2015.](https://arxiv.org/pdf/1512.03385)
19. [Steven Kapturowski et al., "Recurrent Experience Replay in Distributed Reinforcement Learning." in International Conference on Learning Representations https://openreview.net/forum?id=r1lyTjAqYX, 2019.](https://openreview.net/forum?id=r1lyTjAqYX)
20. [Horgan et al., "Distributed Prioritized Experience Replay." in International Conference on Learning Representations, 2018](https://arxiv.org/pdf/1803.00933.pdf)
21. [Simonyan et al., "Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps", 2013](https://arxiv.org/pdf/1312.6034.pdf)
22. [Ho et al., "Generative adversarial imitation learning", 2016](https://arxiv.org/abs/1606.03476)
23. [Wang, Ziyu, et al. "Sample efficient actor-critic with experience replay", 2016.](https://arxiv.org/pdf/1611.01224.pdf)