Rich Sutton and I wrote a paper about how to efficiently learn predictions that can range over many time steps. The focus of the paper is on algorithms whose computational complexity does not grow with the time span of the predictions. This is important because many predictive questions have a large or even infinite span.
Another contribution of the paper is that we connect several well-known algorithmic constructs one-to-one to concrete properties we may desire from our algorithms. For instance, we find that eligibility traces arise naturally when we desire algorithms that are independent of span, and that the desire to update our predictions online leads automatically to temporal-difference errors.
The full paper can be found here: http://arxiv.org/abs/1508.04582.
We consider how to learn multi-step predictions efficiently. Conventional algorithms wait until observing actual outcomes before performing the computations to update their predictions. If predictions are made at a high rate or span over a large amount of time, substantial computation can be required to store all relevant observations and to update all predictions when the outcome is finally observed. We show that the exact same predictions can be learned in a much more computationally congenial way, with uniform per-step computation that does not depend on the span of the predictions. We apply this idea to various settings of increasing generality by repeatedly adding desired properties for our predictions, constructing algorithms that ensure these desiderata, and then deriving equivalent span-independent algorithms. Interestingly, along the way several known algorithmic constructs emerge spontaneously from our derivations, including Dutch eligibility traces, temporal-difference errors, and averaging. This allows us to link these constructs one-to-one to the corresponding desiderata, unambiguously connecting the 'how' to the 'why'. At each step, we make sure that the derived algorithm subsumes the previous algorithms, thereby retaining their properties. Ultimately we arrive at a single general span-independent learning algorithm.
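To make the core idea concrete, here is a minimal numerical sketch of the simplest case. The conventional approach stores every feature vector until the final outcome Z arrives, then applies a sequence of LMS updates; the span-independent approach maintains only two d-dimensional vectors per step, one of which is the Dutch eligibility trace, and reconstructs the same final weights in O(d) once Z is observed. The specific variable names, step size, and random features below are illustrative choices, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 50, 4          # span of the prediction, feature dimension
alpha = 0.1           # step size (illustrative choice)
X = rng.normal(size=(T, d))   # feature vectors x_0 .. x_{T-1}
Z = 2.5                       # final outcome, only observed at time T
w0 = np.zeros(d)

# Conventional approach: store all T feature vectors, then, once Z
# is observed, apply T LMS updates toward Z. Memory and the final
# burst of computation both grow with the span T.
w = w0.copy()
for t in range(T):
    w = w + alpha * (Z - w @ X[t]) * X[t]
w_offline = w

# Span-independent approach: constant per-step compute and memory.
# Writing F_t = I - alpha * x_t x_t^T, the offline result is
#   w_T = F_{T-1}...F_0 w_0 + (sum_t F_{T-1}...F_{t+1} alpha x_t) Z,
# so we can track both pieces incrementally, without knowing Z:
#   a_t = F_t a_{t-1}            (initial weights propagated forward)
#   e_t = F_t e_{t-1} + alpha x_t   (the Dutch eligibility trace)
a = w0.copy()
e = np.zeros(d)
for t in range(T):
    x = X[t]
    a = a - alpha * (x @ a) * x        # a_t = (I - alpha x x^T) a_{t-1}
    e = e + alpha * (1 - x @ e) * x    # Dutch trace update
# When Z finally arrives, the same weights are formed in O(d):
w_online = a + e * Z

print(np.allclose(w_offline, w_online))  # True: the two are equivalent
```

Note that neither `a` nor `e` depends on Z, which is why all the work can be done online before the outcome is known; the trace update `e + alpha * (1 - x @ e) * x` is exactly the Dutch-trace form the abstract says emerges from the derivation.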