Publications

  • Hado van Hasselt, Arthur Guez, Matteo Hessel, Volodymyr Mnih, David Silver (2016). Learning values across many orders of magnitude. NIPS 2016.

    Abstract: Most learning algorithms are not invariant to the scale of the function that is being approximated. We propose to adaptively normalize the targets used in learning. This is useful in value-based reinforcement learning, where the magnitude of appropriate value approximations can change over time when we update the policy of behavior. Our main motivation is prior work on learning to play Atari games, where the rewards were all clipped to a predetermined range. This clipping facilitates learning across many different games with a single learning algorithm, but a clipped reward function can result in qualitatively different behavior. Using the adaptive normalization we can remove this domain-specific heuristic without diminishing overall performance.
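
    Example sketch: a minimal illustration of adaptive target normalization with a running mean and scale, assuming a scalar linear output layer. This is a simplified sketch in the spirit of the approach, not the paper's exact algorithm, and the class and parameter names are placeholders.

        import numpy as np

        # Illustrative sketch, not the paper's exact algorithm: keep running
        # statistics of the targets, learn on normalized targets, and rescale
        # the scalar linear output layer so that unnormalized predictions are
        # preserved when the statistics change.
        class NormalizedRegressor:
            def __init__(self, num_features, step_size=0.01, stat_step=0.001):
                self.w = np.zeros(num_features)   # output-layer weights
                self.b = 0.0                      # output-layer bias
                self.mu, self.sigma = 0.0, 1.0    # running statistics of the targets
                self.alpha, self.beta = step_size, stat_step

            def predict(self, x):
                # De-normalize the linear output to get the unnormalized prediction.
                return self.sigma * (self.w @ x + self.b) + self.mu

            def update(self, x, target):
                # Update the running mean and scale of the targets.
                new_mu = (1 - self.beta) * self.mu + self.beta * target
                new_var = (1 - self.beta) * self.sigma ** 2 + self.beta * (target - new_mu) ** 2
                new_sigma = max(np.sqrt(new_var), 1e-4)
                # Rescale weights and bias so that predict() is unchanged by
                # the change of statistics (the output-preserving step).
                self.w *= self.sigma / new_sigma
                self.b = (self.sigma * self.b + self.mu - new_mu) / new_sigma
                self.mu, self.sigma = new_mu, new_sigma
                # One gradient step on the squared error in normalized space.
                error = (target - self.mu) / self.sigma - (self.w @ x + self.b)
                self.w += self.alpha * error * x
                self.b += self.alpha * error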

  • Ziyu Wang, Tom Schaul, Matteo Hessel, Hado van Hasselt, Marc Lanctot, Nando de Freitas (2016). Dueling Network Architectures for Deep Reinforcement Learning. ICML 2016. Best paper award.

    Abstract: In recent years there have been many successes of using deep representations in reinforcement learning. Still, many of these applications use conventional architectures, such as convolutional networks, LSTMs, or auto-encoders. In this paper, we present a new neural network architecture for model-free reinforcement learning. Our dueling network represents two separate estimators: one for the state value function and one for the state-dependent action advantage function. The main benefit of this factoring is to generalize learning across actions without imposing any change to the underlying reinforcement learning algorithm. Our results show that this architecture leads to better policy evaluation in the presence of many similar-valued actions. Moreover, the dueling architecture enables our RL agent to outperform the state-of-the-art on the Atari 2600 domain.
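
    Example sketch: the aggregation at the head of the dueling network, shown in isolation; the surrounding feature layers and training procedure are omitted, and the function name is a placeholder.

        import numpy as np

        # Minimal illustration of the dueling aggregation: the network produces
        # a scalar state value V(s) and a vector of advantages A(s, a), and
        # Q-values combine them with the advantages' mean subtracted, which
        # keeps the two streams identifiable.
        def dueling_q_values(value, advantages):
            advantages = np.asarray(advantages, dtype=float)
            return value + (advantages - advantages.mean())

        q = dueling_q_values(value=1.5, advantages=[0.2, -0.1, 0.5])
        # q is [1.5, 1.2, 1.8]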

  • Hado van Hasselt, Arthur Guez, and David Silver (2015). Deep Reinforcement Learning with Double Q-learning. AAAI 2016.

    Abstract: The popular Q-learning algorithm is known to overestimate action values under certain conditions. It was not previously known whether, in practice, such overestimations are common, whether this harms performance, and whether they can generally be prevented. In this paper, we answer all these questions affirmatively. In particular, we first show that the recent DQN algorithm, which combines Q-learning with a deep neural network, suffers from substantial overestimations in some games in the Atari 2600 domain. We then show that the idea behind the Double Q-learning algorithm, which was introduced in a tabular setting, can be generalized to work with large-scale function approximation. We propose a specific adaptation to the DQN algorithm and show that the resulting algorithm not only reduces the observed overestimations, as hypothesized, but that this also leads to much better performance on several games.
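
    Example sketch: the target computation of the proposed adaptation (Double DQN), with the two networks abstracted as callables; the interface here is an assumption for illustration.

        import numpy as np

        # Sketch of the Double DQN target: the online network selects the
        # greedy next action, the target network evaluates it. q_online and
        # q_target are assumed to be callables mapping a state to a vector of
        # action values.
        def double_dqn_target(reward, next_state, q_online, q_target,
                              gamma=0.99, terminal=False):
            if terminal:
                return reward
            best_action = int(np.argmax(q_online(next_state)))          # selection
            return reward + gamma * q_target(next_state)[best_action]   # evaluation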

  • Hado van Hasselt and Richard S. Sutton (2015). Learning to Predict Independent of Span. arXiv:1508.04582

    Abstract: We consider how to learn multi-step predictions efficiently. Conventional algorithms wait until observing actual outcomes before performing the computations to update their predictions. If predictions are made at a high rate or span over a large amount of time, substantial computation can be required to store all relevant observations and to update all predictions when the outcome is finally observed. We show that the exact same predictions can be learned in a much more computationally congenial way, with uniform per-step computation that does not depend on the span of the predictions. We apply this idea to various settings of increasing generality, repeatedly adding desired properties and each time deriving an equivalent span-independent algorithm for the conventional algorithm that satisfies these desiderata. Interestingly, along the way several known algorithmic constructs emerge spontaneously from our derivations, including dutch eligibility traces, temporal difference errors, and averaging. This allows us to link these constructs one-to-one to the corresponding desiderata, unambiguously connecting the 'how' to the 'why'. Each step, we make sure that the derived algorithm subsumes the previous algorithms, thereby retaining their properties. Ultimately we arrive at a single general temporal-difference algorithm that is applicable to the full setting of reinforcement learning.
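
    Example sketch: a toy instance of the span-independence idea, using the averaging construct mentioned in the abstract; the paper's derivations extend this kind of exact incremental equivalence much further, to temporal-difference learning with dutch eligibility traces.

        # Toy illustration of span-independent computation (not the paper's
        # algorithm): the incremental average gives exactly the same answer as
        # storing all observations and averaging at the end, using O(1) memory
        # and O(1) computation per step, independent of the span.
        def incremental_average(stream):
            mean, n = 0.0, 0
            for y in stream:
                n += 1
                mean += (y - mean) / n
            return mean

        assert abs(incremental_average([1.0, 2.0, 6.0]) - 3.0) < 1e-12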

  • A. Rupam Mahmood, Hado van Hasselt, and Richard S. Sutton (2014). Weighted importance sampling for off-policy learning with linear function approximation. Advances in Neural Information Processing Systems 27.

    Abstract: Importance sampling is an essential component of off-policy model-free reinforcement learning algorithms. However, its most effective variant, weighted importance sampling, does not carry over easily to function approximation and, because of this, it is not utilized in existing off-policy learning algorithms. In this paper, we take two steps toward bridging this gap. First, we show that weighted importance sampling can be viewed as a special case of weighting the error of individual training samples, and that this weighting has theoretical and empirical benefits similar to those of weighted importance sampling. Second, we show that these benefits extend to a new weighted-importance-sampling version of off-policy LSTD(λ). We show empirically that our new WIS-LSTD(λ) algorithm can result in much more rapid and reliable convergence than conventional off-policy LSTD(λ) (Yu 2010, Bertsekas & Yu 2009).
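
    Example sketch: for intuition, the two classical tabular estimators that the abstract contrasts; the function names are placeholders, and the paper's WIS-LSTD(λ) for linear function approximation is not reproduced here.

        import numpy as np

        # Tabular intuition only: ordinary vs. weighted importance sampling for
        # estimating a target policy's expected return from returns G_i
        # generated under a behavior policy, with per-trajectory importance
        # ratios rho_i.
        def ordinary_importance_sampling(returns, ratios):
            returns, ratios = np.asarray(returns), np.asarray(ratios)
            return np.mean(ratios * returns)        # unbiased, can have high variance

        def weighted_importance_sampling(returns, ratios):
            returns, ratios = np.asarray(returns), np.asarray(ratios)
            return np.sum(ratios * returns) / np.sum(ratios)   # biased, lower variance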

  • Hado van Hasselt, A. Rupam Mahmood, and Richard S. Sutton (2014). Off-policy TD(λ) with a true online equivalence. Proceedings of the 30th Conference on Uncertainty in Artificial Intelligence. AUAI Press.

    Abstract: Van Seijen and Sutton (2014) recently proposed a new version of the linear TD(λ) learning algorithm that is exactly equivalent to an online forward view and that empirically performed better than its classical counterpart in both prediction and control problems. However, their algorithm is restricted to on-policy learning. In the more general case of off-policy learning, in which the policy whose outcome is predicted and the policy used to generate data may be different, their algorithm cannot be applied. One reason for this is that the algorithm bootstraps and thus is subject to instability problems when function approximation is used. A second reason true online TD(λ) cannot be used for off-policy learning is that the off-policy case requires sophisticated importance sampling in its eligibility traces. To address these limitations, we generalize their equivalence result and use this generalization to construct the first online algorithm to be exactly equivalent to an off-policy forward view. We show this algorithm, named true online GTD(λ), empirically outperforms GTD(λ) (Maei, 2011), which was derived from the same objective as our forward view but lacks the exact online equivalence. In the general theorem that allows us to derive this new algorithm, we encounter a new general eligibility-trace update.
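
    Example sketch: for reference, the on-policy true online TD(λ) of van Seijen and Sutton (2014) that this paper builds on, written for linear value estimation with a constant step size. This is background and a best-effort sketch, not the off-policy true online GTD(λ) derived in the paper.

        import numpy as np

        # Background sketch: on-policy true online TD(lambda) with linear
        # function approximation and a constant step size.
        def true_online_td_lambda(transitions, theta, alpha=0.1, gamma=0.99, lam=0.9):
            """transitions: iterable of (phi, reward, phi_next) numpy vectors;
            theta: numpy weight vector."""
            e = np.zeros_like(theta)   # dutch-style eligibility trace
            v_old = 0.0
            for phi, reward, phi_next in transitions:
                v, v_next = theta @ phi, theta @ phi_next
                delta = reward + gamma * v_next - v
                e = gamma * lam * e + alpha * (1.0 - gamma * lam * (e @ phi)) * phi
                theta = theta + delta * e + (v - v_old) * (e - alpha * phi)
                v_old = v_next
            return theta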

  • Richard S. Sutton, A. Rupam Mahmood, Doina Precup, and Hado P. van Hasselt (2014). A new Q(λ) with interim forward view and Monte Carlo equivalence. Proceedings of the 31st International Conference on Machine Learning (ICML 2014), Beijing, China.

    Abstract: Q-learning, the most popular of reinforcement learning algorithms, has always included an extension to eligibility traces to enable more rapid learning and improved asymptotic performance on non-Markov problems. The λ parameter smoothly shifts on-policy algorithms such as TD(λ) and Sarsa(λ) from a pure bootstrapping form (λ = 0) to a pure Monte Carlo form (λ = 1). In off-policy algorithms, including Q(λ), GQ(λ), and off-policy LSTD(λ), the λ parameter is intended to play the same role, but does not; on every exploratory action these algorithms bootstrap regardless of the value of λ, and as a result they fail to approximate Monte Carlo learning when λ = 1. It may seem that this is inevitable for any online off-policy algorithm; if updates are made on each step on which the target policy is followed, then how could just the right updates be ‘un-made’ upon deviation from the target policy? In this paper, we introduce a new version of Q(λ) that does exactly that, without significantly increased algorithmic complexity. En route to our new Q(λ), we introduce a new derivation technique based on the forward-view/backward-view analysis familiar from TD(λ) but extended to apply at every time step rather than only at the end of episodes. We apply this technique to derive first a new off-policy version of TD(λ), called PTD(λ), and then our new Q(λ), called PQ(λ).
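
    Example sketch: the interpolation that λ is intended to provide, written as the standard λ-return recursion; this is the textbook construct the abstract refers to, not the paper's PQ(λ) or PTD(λ) algorithms.

        # Standard lambda-return recursion: with lambda = 0 this reduces to the
        # one-step bootstrapped return, with lambda = 1 to the Monte Carlo return.
        def lambda_return(rewards, next_values, gamma=0.99, lam=0.9):
            """rewards[t] is R_{t+1}; next_values[t] is V(S_{t+1}), with the
            final entry equal to 0 for a terminal state."""
            g = next_values[-1]
            for r, v in zip(reversed(rewards), reversed(next_values)):
                g = r + gamma * ((1.0 - lam) * v + lam * g)
            return g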

  • Hado van Hasselt and Han La Poutré (2013). Stacking Under Uncertainty: We Know How To Predict, But How Should We Act? Proceedings of IEEE Symposium on Computational Intelligence in Production and Logistics Systems (SSCI/CIPLS 2013), Singapore, Singapore.
  • Hado van Hasselt (2013). Estimating the Maximum Expected Value: An Analysis of (Nested) Cross Validation and the Maximum Sample Average. arXiv:1302.7175
  • Hado van Hasselt (2012). Reinforcement Learning in Continuous State and Action Spaces. In: Reinforcement Learning: State of the Art, Springer.
  • Harm van Seijen, Shimon Whiteson, Hado van Hasselt, and Marco Wiering (2011). Exploiting Best-Match Equations for Efficient Reinforcement Learning. Journal of Machine Learning Research, volume 12, pp. 2045-2094.
    Note: The PDF at JMLR does not embed all fonts, which may cause some symbols to display incorrectly on some computers. The PDF linked above solves this issue.
  • Marco A. Wiering, Hado van Hasselt, Auke-Dirk Pietersma, Lambert Schomaker (2011). Reinforcement Learning Algorithms for solving Classification Problems. Proceedings of IEEE International Symposium on Approximate Dynamic Programming and Reinforcement Learning (ADPRL 2011), Paris, France, pp. 91-96.
  • Hado van Hasselt (2011). Insights in Reinforcement Learning.
    Ph.D. thesis, Utrecht University.
  • Hado van Hasselt (2010). Double Q-learning. Advances in Neural Information Processing Systems 23 (NIPS 2010), Vancouver, British Columbia, Canada, pp. 2613-2622.
    Note: A poster presentation is also available for this article.
  • Hado van Hasselt and Marco Wiering (2009). Using Continuous Action Spaces to Solve Discrete Problems. Proceedings of the International Joint Conference on Neural Networks (IJCNN 2009), Atlanta, GA, USA.
  • Harm van Seijen, Hado van Hasselt, Shimon Whiteson, and Marco Wiering (2009). A Theoretical and Empirical Analysis of Expected Sarsa. Proceedings of the IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL 2009), pp. 177-184.
  • Marco Wiering and Hado van Hasselt (2009). The QV Family Compared to Other Reinforcement Learning Algorithms. Proceedings of IEEE International Symposium on Approximate Dynamic Programming and Reinforcement Learning (ADPRL 2009), Nashville, USA.
  • Joost Westra, Hado van Hasselt, Frank Dignum and Virginia Dignum (2009). Adaptive Serious Games Using Agent Organizations. Agents for Games and Simulations, pp. 206-220. Berlin / Heidelberg: Springer.
  • Marco Wiering and Hado van Hasselt (2008). Ensemble Algorithms in Reinforcement Learning. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), August 2008, pp. 930-936.
  • Joost Westra, Hado van Hasselt, Virginia Dignum and Frank Dignum (2008). On-line Adapting Games Using Agent Organizations. Proceedings of IEEE Symposium on Computational Intelligence and Games (CIG 2008), Perth, Australia.
  • Hado van Hasselt and Marco Wiering (2007). Reinforcement Learning in Continuous Action Spaces. Proceedings of IEEE International Symposium on Approximate Dynamic Programming and Reinforcement Learning (ADPRL 2007), Honolulu, HI, USA, pp. 272-279.
    Note: This paper introduces the Cacla algorithm.
  • Hado van Hasselt and Marco Wiering (2007). Convergence of Model-Based Temporal Difference Learning for Control. Proceedings of IEEE International Symposium on Approximate Dynamic Programming and Reinforcement Learning (ADPRL 2007), Honolulu, HI, USA, pp. 60-67.
  • Marco Wiering and Hado van Hasselt (2007). Two Novel On-policy Reinforcement Learning Algorithms based on TD(λ) methods. Proceedings of IEEE International Symposium on Approximate Dynamic Programming and Reinforcement Learning (ADPRL 2007), Honolulu, HI, USA, pp. 280-287.
  • Hado van Hasselt. A Short Introduction To Some Reinforcement Learning Algorithms.