Showing 1–50 of 63 results for author: Nachum, O

Searching in archive cs.
  1. arXiv:2309.10150  [pdf, other]

    cs.RO cs.AI cs.LG

    Q-Transformer: Scalable Offline Reinforcement Learning via Autoregressive Q-Functions

    Authors: Yevgen Chebotar, Quan Vuong, Alex Irpan, Karol Hausman, Fei Xia, Yao Lu, Aviral Kumar, Tianhe Yu, Alexander Herzog, Karl Pertsch, Keerthana Gopalakrishnan, Julian Ibarz, Ofir Nachum, Sumedh Sontakke, Grecia Salazar, Huong T Tran, Jodilyn Peralta, Clayton Tan, Deeksha Manjunath, Jaspiar Singh, Brianna Zitkovich, Tomas Jackson, Kanishka Rao, Chelsea Finn, Sergey Levine

    Abstract: In this work, we present a scalable reinforcement learning method for training multi-task policies from large offline datasets that can leverage both human demonstrations and autonomously collected data. Our method uses a Transformer to provide a scalable representation for Q-functions trained via offline temporal difference backups. We therefore refer to the method as Q-Transformer. By discretizi…

    Submitted 17 October, 2023; v1 submitted 18 September, 2023; originally announced September 2023.

    Comments: See website at https://qtransformer.github.io

  2. arXiv:2306.14892  [pdf, other]

    cs.LG cs.AI

    Supervised Pretraining Can Learn In-Context Reinforcement Learning

    Authors: Jonathan N. Lee, Annie Xie, Aldo Pacchiano, Yash Chandak, Chelsea Finn, Ofir Nachum, Emma Brunskill

    Abstract: Large transformer models trained on diverse datasets have shown a remarkable ability to learn in-context, achieving high few-shot performance on tasks they were not explicitly trained to solve. In this paper, we study the in-context learning capabilities of transformers in decision-making problems, i.e., reinforcement learning (RL) for bandits and Markov decision processes. To do so, we introduce…

    Submitted 26 June, 2023; originally announced June 2023.

  3. arXiv:2305.16985  [pdf, other]

    cs.LG

    Inverse Dynamics Pretraining Learns Good Representations for Multitask Imitation

    Authors: David Brandfonbrener, Ofir Nachum, Joan Bruna

    Abstract: In recent years, domains such as natural language processing and image recognition have popularized the paradigm of using large datasets to pretrain representations that can be effectively transferred to downstream tasks. In this work we evaluate how such a paradigm should be done in imitation learning, where both pretraining and finetuning data are trajectories collected by experts interacting wi…

    Submitted 25 October, 2023; v1 submitted 26 May, 2023; originally announced May 2023.

  4. arXiv:2305.14654  [pdf, other]

    cs.RO cs.AI

    Barkour: Benchmarking Animal-level Agility with Quadruped Robots

    Authors: Ken Caluwaerts, Atil Iscen, J. Chase Kew, Wenhao Yu, Tingnan Zhang, Daniel Freeman, Kuang-Huei Lee, Lisa Lee, Stefano Saliceti, Vincent Zhuang, Nathan Batchelor, Steven Bohez, Federico Casarini, Jose Enrique Chen, Omar Cortes, Erwin Coumans, Adil Dostmohamed, Gabriel Dulac-Arnold, Alejandro Escontrela, Erik Frey, Roland Hafner, Deepali Jain, Bauyrjan Jyenis, Yuheng Kuang, Edward Lee, et al. (19 additional authors not shown)

    Abstract: Animals have evolved various agile locomotion strategies, such as sprinting, leaping, and jumping. There is a growing interest in developing legged robots that move like their biological counterparts and show various agile skills to navigate complex environments quickly. Despite the interest, the field lacks systematic benchmarks to measure the performance of control policies and hardware in agili…

    Submitted 23 May, 2023; originally announced May 2023.

    Comments: 17 pages, 19 figures

  5. arXiv:2305.11854  [pdf, other]

    cs.LG cs.AI stat.ML

    Multimodal Web Navigation with Instruction-Finetuned Foundation Models

    Authors: Hiroki Furuta, Kuang-Huei Lee, Ofir Nachum, Yutaka Matsuo, Aleksandra Faust, Shixiang Shane Gu, Izzeddin Gur

    Abstract: The progress of autonomous web navigation has been hindered by the dependence on billions of exploratory interactions via online reinforcement learning, and domain-specific model designs that make it difficult to leverage generalization from rich out-of-domain data. In this work, we study data-driven offline training for web agents with vision-language foundation models. We propose an instruction-…

    Submitted 25 February, 2024; v1 submitted 19 May, 2023; originally announced May 2023.

    Comments: Accepted to ICLR 2024. Website: https://sites.google.com/view/mm-webnav/

  6. arXiv:2303.04129  [pdf, other]

    cs.AI cs.LG

    Foundation Models for Decision Making: Problems, Methods, and Opportunities

    Authors: Sherry Yang, Ofir Nachum, Yilun Du, Jason Wei, Pieter Abbeel, Dale Schuurmans

    Abstract: Foundation models pretrained on diverse data at scale have demonstrated extraordinary capabilities in a wide range of vision and language tasks. When such models are deployed in real world environments, they inevitably interface with other entities and agents. For example, language models are often used to interact with human beings through dialogue, and visual perception models are used to autono…

    Submitted 7 March, 2023; originally announced March 2023.

  7. arXiv:2302.00111  [pdf, other]

    cs.AI

    Learning Universal Policies via Text-Guided Video Generation

    Authors: Yilun Du, Mengjiao Yang, Bo Dai, Hanjun Dai, Ofir Nachum, Joshua B. Tenenbaum, Dale Schuurmans, Pieter Abbeel

    Abstract: A goal of artificial intelligence is to construct an agent that can solve a wide variety of tasks. Recent progress in text-guided image synthesis has yielded models with an impressive ability to generate complex novel images, exhibiting combinatorial generalization across domains. Motivated by this success, we investigate whether such tools can be used to construct more general-purpose agents. Spe…

    Submitted 20 November, 2023; v1 submitted 31 January, 2023; originally announced February 2023.

    Comments: NeurIPS 2023, Project Website: https://universal-policy.github.io/

  8. arXiv:2212.06817  [pdf, other]

    cs.RO cs.AI cs.CL cs.CV cs.LG

    RT-1: Robotics Transformer for Real-World Control at Scale

    Authors: Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, Julian Ibarz, Brian Ichter, Alex Irpan, Tomas Jackson, Sally Jesmonth, Nikhil J Joshi, Ryan Julian, Dmitry Kalashnikov, Yuheng Kuang, Isabel Leal, Kuang-Huei Lee, Sergey Levine, Yao Lu, Utsav Malla, Deeksha Manjunath, et al. (26 additional authors not shown)

    Abstract: By transferring knowledge from large, diverse, task-agnostic datasets, modern machine learning models can solve specific downstream tasks either zero-shot or with small task-specific datasets to a high level of performance. While this capability has been demonstrated in other fields such as computer vision, natural language processing or speech recognition, it remains to be shown in robotics, wher…

    Submitted 11 August, 2023; v1 submitted 13 December, 2022; originally announced December 2022.

    Comments: See website at robotics-transformer1.github.io

  9. arXiv:2211.13337  [pdf, other]

    cs.LG

    Multi-Environment Pretraining Enables Transfer to Action Limited Datasets

    Authors: David Venuto, Sherry Yang, Pieter Abbeel, Doina Precup, Igor Mordatch, Ofir Nachum

    Abstract: Using massive datasets to train large-scale models has emerged as a dominant approach for broad generalization in natural language and vision applications. In reinforcement learning, however, a key challenge is that available data of sequential decision making is often not annotated with actions - for example, videos of game-play are much more available than sequences of frames paired with their l…

    Submitted 5 December, 2022; v1 submitted 23 November, 2022; originally announced November 2022.

  10. arXiv:2211.02100  [pdf, other]

    cs.LG cs.AI

    Contrastive Value Learning: Implicit Models for Simple Offline RL

    Authors: Bogdan Mazoure, Benjamin Eysenbach, Ofir Nachum, Jonathan Tompson

    Abstract: Model-based reinforcement learning (RL) methods are appealing in the offline setting because they allow an agent to reason about the consequences of actions without interacting with the environment. Prior methods learn a 1-step dynamics model, which predicts the next state given the current state and action. These models do not immediately tell the agent which actions to take, but must be integrat…

    Submitted 3 November, 2022; originally announced November 2022.

    Comments: Deep Reinforcement Learning Workshop, NeurIPS 2022

  11. arXiv:2211.02016  [pdf, other]

    cs.LG cs.AI

    Oracle Inequalities for Model Selection in Offline Reinforcement Learning

    Authors: Jonathan N. Lee, George Tucker, Ofir Nachum, Bo Dai, Emma Brunskill

    Abstract: In offline reinforcement learning (RL), a learner leverages prior logged data to learn a good policy without interacting with the environment. A major challenge in applying such methods in practice is the lack of both theoretically principled and practical tools for model selection and evaluation. To address this, we study the problem of model selection in offline RL with value function approximat…

    Submitted 3 November, 2022; originally announced November 2022.

  12. arXiv:2210.13435  [pdf, other]

    cs.LG

    Dichotomy of Control: Separating What You Can Control from What You Cannot

    Authors: Mengjiao Yang, Dale Schuurmans, Pieter Abbeel, Ofir Nachum

    Abstract: Future- or return-conditioned supervised learning is an emerging paradigm for offline reinforcement learning (RL), where the future outcome (i.e., return) associated with an observed action sequence is used as input to a policy trained to imitate those same actions. While return-conditioning is at the heart of popular algorithms such as decision transformer (DT), these methods tend to perform poor…

    Submitted 24 October, 2022; originally announced October 2022.

  13. arXiv:2210.03945  [pdf, other]

    cs.LG cs.AI

    Understanding HTML with Large Language Models

    Authors: Izzeddin Gur, Ofir Nachum, Yingjie Miao, Mustafa Safdari, Austin Huang, Aakanksha Chowdhery, Sharan Narang, Noah Fiedel, Aleksandra Faust

    Abstract: Large language models (LLMs) have shown exceptional performance on a variety of natural language tasks. Yet, their capabilities for HTML understanding -- i.e., parsing the raw HTML of a webpage, with applications to automation of web-based tasks, crawling, and browser-assisted retrieval -- have not been fully explored. We contribute HTML understanding models (fine-tuned LLMs) and an in-depth analy…

    Submitted 19 May, 2023; v1 submitted 8 October, 2022; originally announced October 2022.

  14. arXiv:2207.13224  [pdf, other]

    cs.RO cs.AI cs.LG

    PI-ARS: Accelerating Evolution-Learned Visual-Locomotion with Predictive Information Representations

    Authors: Kuang-Huei Lee, Ofir Nachum, Tingnan Zhang, Sergio Guadarrama, Jie Tan, Wenhao Yu

    Abstract: Evolution Strategy (ES) algorithms have shown promising results in training complex robotic control policies due to their massive parallelism capability, simple implementation, effective parameter-space exploration, and fast training time. However, a key limitation of ES is its scalability to large capacity models, including modern neural network architectures. In this work, we develop Predictive…

    Submitted 26 July, 2022; originally announced July 2022.

    Comments: To appear at IROS 2022. The supplementary video is available at https://kuanghuei.github.io/piars

  15. arXiv:2206.12441  [pdf, ps, other]

    cs.LG

    Joint Representation Training in Sequential Tasks with Shared Structure

    Authors: Aldo Pacchiano, Ofir Nachum, Nilesh Tripuraneni, Peter Bartlett

    Abstract: Classical theory in reinforcement learning (RL) predominantly focuses on the single task setting, where an agent learns to solve a task through trial-and-error experience, given access to data only from that task. However, many recent empirical works have demonstrated the significant practical benefits of leveraging a joint representation trained across multiple, related tasks. In this work we the…

    Submitted 24 June, 2022; originally announced June 2022.

  16. arXiv:2206.00059  [pdf, other]

    cs.CL cs.AI

    A Mixture-of-Expert Approach to RL-based Dialogue Management

    Authors: Yinlam Chow, Aza Tulepbergenov, Ofir Nachum, MoonKyung Ryu, Mohammad Ghavamzadeh, Craig Boutilier

    Abstract: Despite recent advancements in language models (LMs), their application to dialogue management (DM) problems and ability to carry on rich conversations remain a challenge. We use reinforcement learning (RL) to develop a dialogue agent that avoids being short-sighted (outputting generic utterances) and maximizes overall user satisfaction. Most existing RL approaches to DM train the agent at the wor…

    Submitted 31 May, 2022; originally announced June 2022.

  17. arXiv:2205.15241  [pdf, other]

    cs.AI cs.LG

    Multi-Game Decision Transformers

    Authors: Kuang-Huei Lee, Ofir Nachum, Mengjiao Yang, Lisa Lee, Daniel Freeman, Winnie Xu, Sergio Guadarrama, Ian Fischer, Eric Jang, Henryk Michalewski, Igor Mordatch

    Abstract: A longstanding goal of the field of AI is a method for learning a highly capable, generalist agent from diverse experience. In the subfields of vision and language, this was largely achieved by scaling up transformer-based models and training them on large, diverse datasets. Motivated by this progress, we investigate whether the same strategy can be used to produce generalist reinforcement learnin…

    Submitted 15 October, 2022; v1 submitted 30 May, 2022; originally announced May 2022.

    Comments: NeurIPS 2022. 24 pages, 16 figures. Additional information, videos and code can be seen at https://sites.google.com/view/multi-game-transformers

  18. arXiv:2205.13703  [pdf, other]

    cs.LG

    Why So Pessimistic? Estimating Uncertainties for Offline RL through Ensembles, and Why Their Independence Matters

    Authors: Seyed Kamyar Seyed Ghasemipour, Shixiang Shane Gu, Ofir Nachum

    Abstract: Motivated by the success of ensembles for uncertainty estimation in supervised learning, we take a renewed look at how ensembles of $Q$-functions can be leveraged as the primary source of pessimism for offline reinforcement learning (RL). We begin by identifying a critical flaw in a popular algorithmic choice used by many ensemble-based RL algorithms, namely the use of shared pessimistic target va…

    Submitted 26 May, 2022; originally announced May 2022.

    Comments: Our codebase can be found at https://github.com/google-research/google-research/tree/master/jrl

  19. arXiv:2205.10816  [pdf, other]

    cs.LG cs.AI

    Chain of Thought Imitation with Procedure Cloning

    Authors: Mengjiao Yang, Dale Schuurmans, Pieter Abbeel, Ofir Nachum

    Abstract: Imitation learning aims to extract high-performance policies from logged demonstrations of expert behavior. It is common to frame imitation learning as a supervised learning problem in which one fits a function approximator to the input-output mapping exhibited by the logged demonstrations (input observations to output actions). While the framing of imitation learning as a supervised input-output…

    Submitted 22 May, 2022; originally announced May 2022.

  20. arXiv:2201.12417  [pdf, other]

    cs.LG cs.AI stat.ML

    Why Should I Trust You, Bellman? The Bellman Error is a Poor Replacement for Value Error

    Authors: Scott Fujimoto, David Meger, Doina Precup, Ofir Nachum, Shixiang Shane Gu

    Abstract: In this work, we study the use of the Bellman equation as a surrogate objective for value prediction accuracy. While the Bellman equation is uniquely solved by the true value function over all state-action pairs, we find that the Bellman error (the difference between both sides of the equation) is a poor proxy for the accuracy of the value function. In particular, we show that (1) due to cancellat…

    Submitted 28 June, 2022; v1 submitted 28 January, 2022; originally announced January 2022.

    Comments: ICML 2022

  21. arXiv:2112.12320  [pdf, other]

    cs.LG stat.ML

    Model Selection in Batch Policy Optimization

    Authors: Jonathan N. Lee, George Tucker, Ofir Nachum, Bo Dai

    Abstract: We study the problem of model selection in batch policy optimization: given a fixed, partial-feedback dataset and $M$ model classes, learn a policy with performance that is competitive with the policy derived from the best model class. We formalize the problem in the contextual bandit setting with linear model classes by identifying three sources of error that any model selection algorithm should…

    Submitted 22 December, 2021; originally announced December 2021.

  22. arXiv:2111.14629  [pdf, other]

    cs.LG cs.AI

    Improving Zero-shot Generalization in Offline Reinforcement Learning using Generalized Similarity Functions

    Authors: Bogdan Mazoure, Ilya Kostrikov, Ofir Nachum, Jonathan Tompson

    Abstract: Reinforcement learning (RL) agents are widely used for solving complex sequential decision making tasks, but still exhibit difficulty in generalizing to scenarios not seen during training. While prior online approaches demonstrated that using additional signals beyond the reward function can lead to better generalization capabilities in RL agents, i.e. using self-supervised learning (SSL), they st…

    Submitted 29 November, 2021; originally announced November 2021.

    Comments: Offline RL workshop at NeurIPS 2021

  23. arXiv:2110.14770  [pdf, other]

    cs.LG

    TRAIL: Near-Optimal Imitation Learning with Suboptimal Data

    Authors: Mengjiao Yang, Sergey Levine, Ofir Nachum

    Abstract: The aim in imitation learning is to learn effective policies by utilizing near-optimal expert demonstrations. However, high-quality demonstrations from human experts can be expensive to obtain in large numbers. On the other hand, it is often much easier to obtain large quantities of suboptimal or task-agnostic trajectories, which are not useful for direct imitation, but can nevertheless provide in…

    Submitted 27 October, 2021; originally announced October 2021.

  24. arXiv:2108.02096  [pdf, other]

    cs.LG

    Policy Gradients Incorporating the Future

    Authors: David Venuto, Elaine Lau, Doina Precup, Ofir Nachum

    Abstract: Reasoning about the future -- understanding how decisions in the present time affect outcomes in the future -- is one of the central challenges for reinforcement learning (RL), especially in highly-stochastic or partially observable environments. While predicting the future directly is hard, in this work we introduce a method that allows an agent to "look into the future" without explicitly predic…

    Submitted 11 August, 2021; v1 submitted 4 August, 2021; originally announced August 2021.

  25. arXiv:2105.12272  [pdf, other]

    cs.LG cs.AI

    Provable Representation Learning for Imitation with Contrastive Fourier Features

    Authors: Ofir Nachum, Mengjiao Yang

    Abstract: In imitation learning, it is common to learn a behavior policy to match an unknown target policy via max-likelihood training on a collected set of target demonstrations. In this work, we consider using offline experience datasets - potentially far from the target distribution - to learn low-dimensional state representations that provably accelerate the sample-efficiency of downstream imitation lea…

    Submitted 8 October, 2021; v1 submitted 25 May, 2021; originally announced May 2021.

  26. arXiv:2104.13877  [pdf, other]

    cs.LG cs.AI stat.ML

    Autoregressive Dynamics Models for Offline Policy Evaluation and Optimization

    Authors: Michael R. Zhang, Tom Le Paine, Ofir Nachum, Cosmin Paduraru, George Tucker, Ziyu Wang, Mohammad Norouzi

    Abstract: Standard dynamics models for continuous control make use of feedforward computation to predict the conditional distribution of next state and reward given current state and action using a multivariate Gaussian with a diagonal covariance structure. This modeling choice assumes that different dimensions of the next state and reward are conditionally independent given the current state and action and…

    Submitted 28 April, 2021; originally announced April 2021.

    Comments: ICLR 2021. 17 pages

  27. arXiv:2103.16596  [pdf, other]

    cs.LG stat.ML

    Benchmarks for Deep Off-Policy Evaluation

    Authors: Justin Fu, Mohammad Norouzi, Ofir Nachum, George Tucker, Ziyu Wang, Alexander Novikov, Mengjiao Yang, Michael R. Zhang, Yutian Chen, Aviral Kumar, Cosmin Paduraru, Sergey Levine, Tom Le Paine

    Abstract: Off-policy evaluation (OPE) holds the promise of being able to leverage large, offline datasets for both evaluating and selecting complex policies for decision making. The ability to learn offline is particularly important in many real-world domains, such as in healthcare, recommender systems, or robotics, where online data collection is an expensive and potentially dangerous process. Being able t…

    Submitted 30 March, 2021; originally announced March 2021.

    Comments: ICLR 2021 paper. Policies and evaluation code are available at https://github.com/google-research/deep_ope

  28. arXiv:2103.12726  [pdf, other]

    cs.LG cs.AI stat.ML

    Policy Information Capacity: Information-Theoretic Measure for Task Complexity in Deep Reinforcement Learning

    Authors: Hiroki Furuta, Tatsuya Matsushima, Tadashi Kozuno, Yutaka Matsuo, Sergey Levine, Ofir Nachum, Shixiang Shane Gu

    Abstract: Progress in deep reinforcement learning (RL) research is largely enabled by benchmark task environments. However, analyzing the nature of those environments is often overlooked. In particular, we still do not have agreeable ways to measure the difficulty or solvability of a task, given that each has fundamentally different actions, observations, dynamics, rewards, and can be tackled with diverse R…

    Submitted 31 May, 2021; v1 submitted 23 March, 2021; originally announced March 2021.

    Comments: Accepted to ICML 2021. The code is available at: https://github.com/frt03/pic

  29. arXiv:2103.09756  [pdf, ps, other]

    cs.LG cs.AI

    Near Optimal Policy Optimization via REPS

    Authors: Aldo Pacchiano, Jonathan Lee, Peter Bartlett, Ofir Nachum

    Abstract: Since its introduction a decade ago, relative entropy policy search (REPS) has demonstrated successful policy learning on a number of simulated and real-world robotic domains, not to mention providing algorithmic components used by many recently proposed reinforcement learning (RL) algorithms. While REPS is commonly known in the community, there exist no guarantees on its performance when u…

    Submitted 17 March, 2021; originally announced March 2021.

    Comments: 8 main pages, 37 total pages

  30. arXiv:2103.08050  [pdf, other]

    cs.LG

    Offline Reinforcement Learning with Fisher Divergence Critic Regularization

    Authors: Ilya Kostrikov, Jonathan Tompson, Rob Fergus, Ofir Nachum

    Abstract: Many modern approaches to offline Reinforcement Learning (RL) utilize behavior regularization, typically augmenting a model-free actor critic algorithm with a penalty measuring divergence of the policy from the offline data. In this work, we propose an alternative approach to encouraging the learned policy to stay close to the data, namely parameterizing the critic as the log-behavior-policy, whic…

    Submitted 14 March, 2021; originally announced March 2021.

  31. arXiv:2102.05815  [pdf, other]

    cs.LG cs.AI

    Representation Matters: Offline Pretraining for Sequential Decision Making

    Authors: Mengjiao Yang, Ofir Nachum

    Abstract: The recent success of supervised learning methods on ever larger offline datasets has spurred interest in the reinforcement learning (RL) field to investigate whether the same paradigms can be translated to RL algorithms. This research area, known as offline RL, has largely focused on offline policy optimization, aiming to find a return-maximizing policy exclusively from offline data. In this pape…

    Submitted 10 February, 2021; originally announced February 2021.

  32. arXiv:2012.06919  [pdf, other]

    cs.LG

    Offline Policy Selection under Uncertainty

    Authors: Mengjiao Yang, Bo Dai, Ofir Nachum, George Tucker, Dale Schuurmans

    Abstract: The presence of uncertainty in policy evaluation significantly complicates the process of policy ranking and selection in real-world settings. We formally consider offline policy selection as learning preferences over a set of policy prospects given a fixed experience dataset. While one can select or rank policies based on point estimates of their policy values or high-confidence intervals, access…

    Submitted 12 December, 2020; originally announced December 2020.

  33. arXiv:2010.13611  [pdf, other]

    cs.LG

    OPAL: Offline Primitive Discovery for Accelerating Offline Reinforcement Learning

    Authors: Anurag Ajay, Aviral Kumar, Pulkit Agrawal, Sergey Levine, Ofir Nachum

    Abstract: Reinforcement learning (RL) has achieved impressive performance in a variety of online settings in which an agent's ability to query the environment for transitions and rewards is effectively unlimited. However, in many practical applications, the situation is reversed: an agent may have access to large amounts of undirected offline experience data, while access to the online environment is severe…

    Submitted 4 May, 2021; v1 submitted 26 October, 2020; originally announced October 2020.

    Comments: https://sites.google.com/view/opal-iclr

  34. arXiv:2010.11652  [pdf, other]

    cs.LG stat.ML

    CoinDICE: Off-Policy Confidence Interval Estimation

    Authors: Bo Dai, Ofir Nachum, Yinlam Chow, Lihong Li, Csaba Szepesvári, Dale Schuurmans

    Abstract: We study high-confidence behavior-agnostic off-policy evaluation in reinforcement learning, where the goal is to estimate a confidence interval on a target policy's value, given only access to a static experience dataset collected by unknown behavior policies. Starting from a function space embedding of the linear program formulation of the $Q$-function, we obtain an optimization problem with gene…

    Submitted 22 October, 2020; originally announced October 2020.

    Comments: To appear at NeurIPS 2020 as spotlight

  35. arXiv:2007.13609  [pdf, other]

    cs.LG stat.ML

    Statistical Bootstrapping for Uncertainty Estimation in Off-Policy Evaluation

    Authors: Ilya Kostrikov, Ofir Nachum

    Abstract: In reinforcement learning, it is typical to use the empirically observed transitions and rewards to estimate the value of a policy via either model-based or Q-fitting approaches. Although straightforward, these techniques in general yield biased estimates of the true value of the policy. In this work, we investigate the potential for statistical bootstrapping to be used as a way to take these bias…

    Submitted 27 July, 2020; originally announced July 2020.

  36. arXiv:2007.03438  [pdf, other]

    cs.LG math.OC stat.ML

    Off-Policy Evaluation via the Regularized Lagrangian

    Authors: Mengjiao Yang, Ofir Nachum, Bo Dai, Lihong Li, Dale Schuurmans

    Abstract: The recently proposed distribution correction estimation (DICE) family of estimators has advanced the state of the art in off-policy evaluation from behavior-agnostic data. While these estimators all perform some form of stationary distribution correction, they arise from different derivations and objective functions. In this paper, we unify these estimators as regularized Lagrangians of the same…

    Submitted 24 July, 2020; v1 submitted 7 July, 2020; originally announced July 2020.

  37. arXiv:2006.13888  [pdf, other]

    cs.LG stat.ML

    RL Unplugged: A Suite of Benchmarks for Offline Reinforcement Learning

    Authors: Caglar Gulcehre, Ziyu Wang, Alexander Novikov, Tom Le Paine, Sergio Gomez Colmenarejo, Konrad Zolna, Rishabh Agarwal, Josh Merel, Daniel Mankowitz, Cosmin Paduraru, Gabriel Dulac-Arnold, Jerry Li, Mohammad Norouzi, Matt Hoffman, Ofir Nachum, George Tucker, Nicolas Heess, Nando de Freitas

    Abstract: Offline methods for reinforcement learning have a potential to help bridge the gap between reinforcement learning research and real-world applications. They make it possible to learn policies from offline datasets, thus overcoming concerns associated with online data collection in the real-world, including cost, safety, or ethical concerns. In this paper, we propose a benchmark called RL Unplugged…

    Submitted 12 February, 2021; v1 submitted 24 June, 2020; originally announced June 2020.

    Comments: NeurIPS paper. 21 pages including supplementary material, the github link for the datasets: https://github.com/deepmind/deepmind-research/rl_unplugged

  38. arXiv:2006.03647  [pdf, other]

    cs.LG cs.AI stat.ML

    Deployment-Efficient Reinforcement Learning via Model-Based Offline Optimization

    Authors: Tatsuya Matsushima, Hiroki Furuta, Yutaka Matsuo, Ofir Nachum, Shixiang Gu

    Abstract: Most reinforcement learning (RL) algorithms assume online access to the environment, in which one may readily interleave updates to the policy with experience collection using that policy. However, in many real-world applications such as health, education, dialogue agents, and robotics, the cost or potential risk of deploying a new data-collection policy is high, to the point that it can become pr…

    Submitted 23 June, 2020; v1 submitted 5 June, 2020; originally announced June 2020.

  39. arXiv:2004.07219  [pdf, other]

    cs.LG stat.ML

    D4RL: Datasets for Deep Data-Driven Reinforcement Learning

    Authors: Justin Fu, Aviral Kumar, Ofir Nachum, George Tucker, Sergey Levine

    Abstract: The offline reinforcement learning (RL) setting (also known as full batch RL), where a policy is learned from a static dataset, is compelling as progress enables RL methods to take advantage of large, previously-collected datasets, much like how the rise of large datasets has fueled results in supervised learning. However, existing online RL benchmarks are not tailored towards the offline setting…

    Submitted 5 February, 2021; v1 submitted 15 April, 2020; originally announced April 2020.

    Comments: Website available at https://sites.google.com/view/d4rl/home

  40. arXiv:2002.05522  [pdf, other]

    cs.LG cs.AI stat.ML

    BRPO: Batch Residual Policy Optimization

    Authors: Sungryull Sohn, Yinlam Chow, Jayden Ooi, Ofir Nachum, Honglak Lee, Ed Chi, Craig Boutilier

    Abstract: In batch reinforcement learning (RL), one often constrains a learned policy to be close to the behavior (data-generating) policy, e.g., by constraining the learned action distribution to differ from the behavior policy by some maximum degree that is the same at each state. This can cause batch RL to be overly conservative, unable to exploit large policy changes at frequently-visited, high-confiden…

    Submitted 28 March, 2020; v1 submitted 7 February, 2020; originally announced February 2020.

  41. arXiv:2001.01866  [pdf, other]

    cs.LG stat.ML

    Reinforcement Learning via Fenchel-Rockafellar Duality

    Authors: Ofir Nachum, Bo Dai

    Abstract: We review basic concepts of convex duality, focusing on the very general and supremely useful Fenchel-Rockafellar duality. We summarize how this duality may be applied to a variety of reinforcement learning (RL) settings, including policy evaluation or optimization, online or offline learning, and discounted or undiscounted rewards. The derivations yield a number of intriguing results, including t…

    Submitted 9 January, 2020; v1 submitted 6 January, 2020; originally announced January 2020.

  42. arXiv:1912.05032  [pdf, other]

    cs.LG stat.ML

    Imitation Learning via Off-Policy Distribution Matching

    Authors: Ilya Kostrikov, Ofir Nachum, Jonathan Tompson

    Abstract: When performing imitation learning from expert demonstrations, distribution matching is a popular approach, in which one alternates between estimating distribution ratios and then using these ratios as rewards in a standard reinforcement learning (RL) algorithm. Traditionally, estimation of the distribution ratio requires on-policy data, which has caused previous work to either be exorbitantly dat…

    Submitted 10 December, 2019; originally announced December 2019.

  43. arXiv:1912.02074  [pdf, other]

    cs.LG cs.AI

    AlgaeDICE: Policy Gradient from Arbitrary Experience

    Authors: Ofir Nachum, Bo Dai, Ilya Kostrikov, Yinlam Chow, Lihong Li, Dale Schuurmans

    Abstract: In many real-world applications of reinforcement learning (RL), interactions with the environment are limited due to cost or feasibility. This presents a challenge to traditional RL algorithms since the max-return objective involves an expectation over on-policy samples. We introduce a new formulation of max-return optimization that allows the problem to be re-expressed by an expectation over an a…

    Submitted 4 December, 2019; originally announced December 2019.

  44. arXiv:1911.11361  [pdf, other]

    cs.LG cs.AI stat.ML

    Behavior Regularized Offline Reinforcement Learning

    Authors: Yifan Wu, George Tucker, Ofir Nachum

    Abstract: In reinforcement learning (RL) research, it is common to assume access to direct online interactions with the environment. However in many real-world applications, access to the environment is limited to a fixed offline dataset of logged experience. In such settings, standard RL algorithms have been shown to diverge or otherwise yield poor performance. Accordingly, recent work has suggested a numb…

    Submitted 26 November, 2019; originally announced November 2019.

  45. arXiv:1910.02097  [pdf, other]

    cs.LG cs.AI stat.ML

    Group-based Fair Learning Leads to Counter-intuitive Predictions

    Authors: Ofir Nachum, Heinrich Jiang

    Abstract: A number of machine learning (ML) methods have been proposed recently to maximize model predictive accuracy while enforcing notions of group parity or fairness across sub-populations. We propose a desirable property for these procedures, slack-consistency: For any individual, the predictions of the model should be monotonic with respect to allowed slack (i.e., maximum allowed group-parity violatio…

    Submitted 4 October, 2019; originally announced October 2019.

  46. arXiv:1909.10618  [pdf, other]

    cs.LG cs.AI stat.ML

    Why Does Hierarchy (Sometimes) Work So Well in Reinforcement Learning?

    Authors: Ofir Nachum, Haoran Tang, Xingyu Lu, Shixiang Gu, Honglak Lee, Sergey Levine

    Abstract: Hierarchical reinforcement learning has demonstrated significant success at solving difficult reinforcement learning (RL) tasks. Previous works have motivated the use of hierarchy by appealing to a number of intuitive benefits, including learning over temporally extended transitions, exploring over temporally extended periods, and training and exploring in a more semantically meaningful action spa…

    Submitted 30 December, 2019; v1 submitted 23 September, 2019; originally announced September 2019.

    Comments: Presented as an oral at the NeurIPS 2019 DeepRL Workshop

  47. arXiv:1908.05224  [pdf, other]

    cs.RO cs.AI cs.LG

    Multi-Agent Manipulation via Locomotion using Hierarchical Sim2Real

    Authors: Ofir Nachum, Michael Ahn, Hugo Ponte, Shixiang Gu, Vikash Kumar

    Abstract: Manipulation and locomotion are closely related problems that are often studied in isolation. In this work, we study the problem of coordinating multiple mobile agents to exhibit manipulation behaviors using a reinforcement learning (RL) approach. Our method hinges on the use of hierarchical sim2real -- a simulated environment is used to learn low-level goal-reaching skills, which are then used as…

    Submitted 7 October, 2019; v1 submitted 13 August, 2019; originally announced August 2019.

  48. arXiv:1906.04733  [pdf, other]

    cs.LG cs.AI stat.ML

    DualDICE: Behavior-Agnostic Estimation of Discounted Stationary Distribution Corrections

    Authors: Ofir Nachum, Yinlam Chow, Bo Dai, Lihong Li

    Abstract: In many real-world reinforcement learning applications, access to the environment is limited to a fixed dataset, instead of direct (online) interaction with the environment. When using this data for either evaluation or training of a new policy, accurate estimates of discounted stationary distribution ratios -- correction terms which quantify the likelihood that the new policy will experience a ce…

    Submitted 4 November, 2019; v1 submitted 10 June, 2019; originally announced June 2019.

    Comments: Appearing in NeurIPS 2019 Vancouver, Canada

  49. arXiv:1906.02736  [pdf, other]

    cs.LG stat.ML

    DeepMDP: Learning Continuous Latent Space Models for Representation Learning

    Authors: Carles Gelada, Saurabh Kumar, Jacob Buckman, Ofir Nachum, Marc G. Bellemare

    Abstract: Many reinforcement learning (RL) tasks provide the agent with high-dimensional observations that can be simplified into low-dimensional continuous states. To formalize this process, we introduce the concept of a DeepMDP, a parameterized latent space model that is trained via the minimization of two tractable losses: prediction of rewards and prediction of the distribution over next latent states.…

    Submitted 6 June, 2019; originally announced June 2019.

    Comments: 13 pages main text, 16 pages appendix. ICML 2019

  50. arXiv:1901.10031  [pdf, other]

    cs.LG cs.AI stat.ML

    Lyapunov-based Safe Policy Optimization for Continuous Control

    Authors: Yinlam Chow, Ofir Nachum, Aleksandra Faust, Edgar Duenez-Guzman, Mohammad Ghavamzadeh

    Abstract: We study continuous action reinforcement learning problems in which it is crucial that the agent interacts with the environment only through safe policies, i.e., policies that do not take the agent to undesirable situations. We formulate these problems as constrained Markov decision processes (CMDPs) and present safe policy optimization algorithms that are based on a Lyapunov approach to solve the…

    Submitted 11 February, 2019; v1 submitted 28 January, 2019; originally announced January 2019.