Differences

This shows you the differences between two versions of the page.

memory [2018/10/17 11:55]
admin
memory [2018/10/17 12:05]
admin
Line 200:
 specifically modulate the reward credited to past events.
  
 +https://arxiv.org/pdf/1810.05017.pdf One-Shot High-Fidelity Imitation: Training Large-Scale Deep Nets with RL
 +
 +In this paper, we introduce an off-policy RL algorithm (MetaMimic) to narrow this gap. MetaMimic can learn both (i) policies for high-fidelity one-shot imitation of diverse novel skills, and (ii) policies that enable the agent to solve tasks more efficiently than the demonstrators. MetaMimic relies on the principle of storing all experiences in a memory and replaying these to learn massive deep neural network policies by off-policy RL.
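
The "store all experiences in a memory and replay them" principle quoted above can be illustrated with a minimal replay-memory sketch. This is not MetaMimic's implementation: the class name `ReplayMemory`, the transition layout `(obs, action, reward, next_obs, done)`, and uniform sampling are all assumptions made for illustration, and the paper's actual storage and sampling scheme may differ.

```python
import random
from collections import deque


class ReplayMemory:
    """Hypothetical experience store for off-policy RL (illustrative only)."""

    def __init__(self, capacity=None, seed=0):
        # capacity=None keeps every transition, matching the "store all
        # experiences" idea; a finite capacity evicts the oldest entries.
        self._buffer = deque(maxlen=capacity)
        self._rng = random.Random(seed)

    def add(self, obs, action, reward, next_obs, done):
        # Transition layout is an assumption, not taken from the paper.
        self._buffer.append((obs, action, reward, next_obs, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement; the learner can train on
        # these transitions no matter which policy generated them.
        k = min(batch_size, len(self._buffer))
        return self._rng.sample(list(self._buffer), k)

    def __len__(self):
        return len(self._buffer)


# Usage: fill the memory with dummy transitions, then draw a batch.
memory = ReplayMemory()
for t in range(5):
    memory.add(obs=t, action=0, reward=-1.0, next_obs=t + 1, done=(t == 4))
batch = memory.sample(3)
```

Because sampled transitions are decoupled from the policy that produced them, the same stored experience can be reused across many updates, which is what makes the approach off-policy.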