Peer Reviewed Article

Vol. 7 (2020)

Reward Redistribution as Align-RUDDER: Learning from a Few Demonstrations

10 December 2019


Reinforcement learning algorithms demand a large number of samples to handle difficult tasks with sparse and delayed rewards. Complex tasks can frequently be broken down hierarchically into sub-tasks. A step in the Q-function corresponds to the completion of a sub-task, at which point the expected return rises. RUDDER was created to identify these steps and then redistribute reward to them, so that reward is given immediately when a sub-task is completed. Because this alleviates the problem of delayed rewards, learning is significantly accelerated. For difficult tasks, however, current exploration strategies, such as those used in RUDDER, struggle to find episodes with high rewards. We therefore assume that high-reward episodes are given as demonstrations and do not need to be found through exploration. The number of demonstrations is typically small, and RUDDER's LSTM model, being a deep learning method, does not learn well from so few examples. We therefore present Align-RUDDER, which is RUDDER with two major changes. First, Align-RUDDER assumes that high-reward episodes are given as demonstrations, which replaces RUDDER's safe exploration and lessons replay buffer. Second, we replace RUDDER's LSTM model with a profile model obtained from a multiple sequence alignment of the demonstrations. Experience in bioinformatics shows that profile models can be built from as few as two demonstrations. Align-RUDDER inherits the concept of reward redistribution, which shortens the delay of rewards and hence accelerates learning. On complex artificial tasks with delayed rewards and few demonstrations, Align-RUDDER outperforms competing methods. On the MineCraft ObtainDiamond task, Align-RUDDER is able to mine a diamond, but only infrequently.
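
The idea of reward redistribution can be illustrated with a minimal Python sketch. It is not the authors' implementation: a fixed consensus sequence of sub-task events stands in for the profile model, and the names `prefix_score`, `redistribute_reward`, and the example events are hypothetical. Each step receives the increase in the prefix score, rescaled so that the redistributed rewards sum to the original episode return.

```python
from typing import List, Sequence


def prefix_score(events: Sequence[str], consensus: Sequence[str]) -> int:
    """Count how many consensus events the prefix has matched, in order.

    Assumption: a plain consensus sequence replaces the profile model.
    """
    matched = 0
    for event in events:
        if matched < len(consensus) and event == consensus[matched]:
            matched += 1
    return matched


def redistribute_reward(
    events: Sequence[str], consensus: Sequence[str], episode_return: float
) -> List[float]:
    """Spread a delayed episode return over the steps that raise the score."""
    if not events:
        return []
    # Score every prefix of the episode against the consensus.
    scores = [prefix_score(events[: t + 1], consensus) for t in range(len(events))]
    # The reward at step t is the score gained at that step.
    diffs = [scores[0]] + [scores[t] - scores[t - 1] for t in range(1, len(scores))]
    total = sum(diffs)
    if total == 0:
        # No sub-task was recognised: keep the reward at the final step.
        return [0.0] * (len(events) - 1) + [float(episode_return)]
    # Rescale so the redistributed rewards sum to the original return.
    return [episode_return * d / total for d in diffs]


if __name__ == "__main__":
    episode = ["wood", "idle", "plank", "idle", "pickaxe"]
    print(redistribute_reward(episode, ["wood", "plank", "pickaxe"], 3.0))
```

In this toy episode the return, originally delivered only at the end, is moved to the three steps that complete a sub-task, while the idle steps receive zero reward.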


