Reward Redistribution as Align-RUDDER: Learning from a Few Demonstrations

Takudzwa Fadziso

Peer Reviewed Article

Vol. 7 (2020)

Submitted: 10 December 2019
Published: 15 February 2020

Abstract

Reinforcement learning algorithms demand a large number of samples to handle difficult tasks with sparse and delayed rewards. Complex tasks can frequently be broken down hierarchically into sub-tasks. A step in the Q-function corresponds to the completion of a sub-task, at which the expected return rises. RUDDER was created to identify these steps and then redistribute reward to them, so that reward is given immediately when a sub-task is completed. Because this alleviates the problem of delayed rewards, learning is significantly accelerated. For difficult tasks, however, current exploration strategies, such as those used in RUDDER, struggle to find episodes with high rewards. We therefore assume that high-reward episodes are given as demonstrations and do not need to be found through exploration. The number of demonstrations is typically small, and RUDDER's LSTM model, being a deep learning method, does not learn well from so few examples. We therefore present Align-RUDDER, which is RUDDER with two major changes. First, Align-RUDDER assumes that high-reward episodes are given as demonstrations, which replaces RUDDER's safe exploration and lessons replay buffer. Second, we replace RUDDER's LSTM model with a profile model obtained from a multiple sequence alignment of the demonstrations. As known from bioinformatics, profile models can be built from as few as two demonstrations. Align-RUDDER inherits the concept of reward redistribution, which shortens the delay of rewards and hence accelerates learning. On complex artificial tasks with delayed rewards and few demonstrations, Align-RUDDER outperforms competitors. On the MineCraft ObtainDiamond task, Align-RUDDER is able to mine a diamond, though only infrequently.
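The idea of redistributing reward with a profile model can be illustrated with a small sketch. It is not the article's implementation. It assumes that the demonstrations are already aligned (the multiple sequence alignment step is skipped), that events are single characters, that background frequencies are uniform, and that the redistributed rewards are rescaled per episode so that they sum to the original return. All function names, the gap penalty, and the pseudocount are hypothetical choices made for illustration.

```python
import math
from collections import Counter

GAP = "-"  # gap symbol in the pre-aligned demonstrations (assumption)


def build_profile(aligned_demos, alphabet, pseudocount=1.0):
    """Return one {event: log-odds score} dict per alignment column."""
    background = 1.0 / len(alphabet)  # assumption: uniform background
    profile = []
    for col in range(len(aligned_demos[0])):
        counts = Counter(d[col] for d in aligned_demos if d[col] != GAP)
        total = sum(counts.values()) + pseudocount * len(alphabet)
        profile.append({
            e: math.log((counts[e] + pseudocount) / total / background)
            for e in alphabet
        })
    return profile


def prefix_score(prefix, profile, gap_penalty=-1.0):
    """Best alignment score of `prefix` against a leading part of the profile."""
    n, m = len(prefix), len(profile)
    # dp[i][j]: best score aligning prefix[:i] with profile columns [:j]
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap_penalty
    for j in range(1, m + 1):
        dp[0][j] = j * gap_penalty
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(
                dp[i - 1][j - 1] + profile[j - 1][prefix[i - 1]],  # (mis)match
                dp[i - 1][j] + gap_penalty,  # event not in the profile
                dp[i][j - 1] + gap_penalty,  # profile column skipped
            )
    # The prefix need not reach the end of the profile.
    return max(dp[n])


def redistribute_reward(episode, episode_return, profile):
    """Spread `episode_return` over time steps via score differences."""
    if len(episode) == 0:
        return []
    scores = [prefix_score(episode[:t], profile)
              for t in range(len(episode) + 1)]
    diffs = [scores[t + 1] - scores[t] for t in range(len(episode))]
    total = sum(diffs)
    if total == 0:
        # Degenerate case: fall back to a uniform split.
        return [episode_return / len(episode)] * len(episode)
    # Rescale so the redistributed rewards sum to the original return.
    return [d * episode_return / total for d in diffs]


if __name__ == "__main__":
    profile = build_profile(["ABCD", "ABCD"], alphabet="ABCDX")
    print(redistribute_reward("AXBCD", 10.0, profile))
```

In this sketch, events that match the consensus of the demonstrations raise the alignment score and receive a share of the return at the step where they occur, while events that do not fit the profile receive little or negative reward. Every event in an episode must belong to the alphabet passed to `build_profile`.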

Keywords

Reward Redistribution
Align-RUDDER
Learning from Demonstrations

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

How to Cite

Fadziso, T. (2020). Reward Redistribution as Align-RUDDER: Learning from a Few Demonstrations. International Journal of Reciprocal Symmetry and Theoretical Physics, 7, 1-8. https://upright.pub/index.php/ijrstp/article/view/52
