The problem comes from my training loop: it doesn't detach or repackage the hidden state between batches. As a result,
loss.backward() tries to back-propagate all the way through to the start of time, which works for the first batch but not for the second, because the graph for the first batch has already been discarded.
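To make the failure concrete, here is a minimal sketch that should reproduce it. The toy nn.RNN, the tensor sizes, and the variable names (failed_step, error_message) are assumptions for illustration, not taken from my actual training loop.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy recurrent layer; sizes are arbitrary and only for illustration.
rnn = nn.RNN(input_size=4, hidden_size=8, batch_first=True)

# Initial hidden state: (num_layers, batch, hidden_size).
hidden = torch.zeros(1, 2, 8)

failed_step = None
error_message = None

for step in range(2):
    inputs = torch.randn(2, 5, 4)  # (batch, seq_len, input_size)

    # hidden is NOT detached, so from the second batch on it still
    # points into the graph that was built for the previous batch.
    output, hidden = rnn(inputs, hidden)
    loss = output.pow(2).mean()

    try:
        # After the first call the buffers of that graph are freed,
        # so the second call cannot walk back through them.
        loss.backward()
    except RuntimeError as exc:
        failed_step = step
        error_message = str(exc)
        break

print(failed_step, error_message)
```

The first iteration runs cleanly; the RuntimeError only appears on the second one, which matches the behaviour described above.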
There are two possible solutions:
1. Detach/repackage the hidden state between batches (this is the solution I chose). There are at least three ways to do this; one of them is hidden = hidden.detach().
2. Replace loss.backward() with loss.backward(retain_graph=True), but be aware that each successive batch will take longer than the previous one, because it has to back-propagate all the way through to the start of the first batch.
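A minimal sketch of the first solution, assuming a toy nn.RNN with a linear head and random data. The model, the sizes, the optimizer settings, and names such as head and losses are hypothetical and only there to make the example runnable.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical toy model: a single-layer RNN followed by a linear head.
rnn = nn.RNN(input_size=4, hidden_size=8, batch_first=True)
head = nn.Linear(8, 1)
params = list(rnn.parameters()) + list(head.parameters())
optimizer = torch.optim.SGD(params, lr=0.01)
criterion = nn.MSELoss()

# Hidden state carried across batches: (num_layers, batch, hidden_size).
hidden = torch.zeros(1, 2, 8)

losses = []
for step in range(3):
    inputs = torch.randn(2, 5, 4)   # (batch, seq_len, input_size)
    targets = torch.randn(2, 5, 1)

    # Keep the values of the hidden state but cut its link to the
    # previous batch's graph, so backward() stops at this batch.
    hidden = hidden.detach()

    optimizer.zero_grad()
    output, hidden = rnn(inputs, hidden)
    loss = criterion(head(output), targets)
    loss.backward()                 # plain backward, no retain_graph
    optimizer.step()
    losses.append(loss.item())

print(losses)
```

With an nn.LSTM the hidden state is a tuple (h, c), so each element would need detaching, for instance hidden = tuple(h.detach() for h in hidden). The second solution would instead keep the hidden state attached and call loss.backward(retain_graph=True), at the cost described above.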