# Browsing by Subject "Reinforcement learning"

Now showing 1 - 20 of 71


## A deep learning approach to wireless system design for channel sensing, contention & estimation (2022-07-15)

Doshi, Akash Sandeep; Andrews, Jeffrey G.; de Veciana, Gustavo; Dimakis, Alexandros; Kim, Hyeji; Yoo, Taesang

Deep learning techniques are expected to play a key role in the development of wireless systems at the physical (PHY) and medium access control (MAC) layers for sixth-generation (6G) communication networks. In particular, learning-based advances are needed to provide (a) more efficient utilization of shared spectrum to accommodate an ever-increasing number of wireless devices and (b) improved scalability of existing signal processing techniques as the spatial and frequency dimensions of wireless architectures rapidly expand. In the first part of this dissertation, we propose a multi-agent deep reinforcement learning (RL) framework to perform contention-based medium access in shared spectrum. Centralized approaches to spectrum sharing require excessive real-time overhead messaging, and obtaining the solution is known to be NP-hard. Instead, we assume that base stations operate in a fully decentralized fashion, and model shared spectrum access by base stations performing spectrum sensing as a decentralized partially observable Markov decision process (MDP). We introduce a two-stage MDP in each time slot that uses information from spectrum sensing and reception quality to make a medium access decision. Our distributed reinforcement learning framework achieves performance competitive with a genie-aided adaptive energy detection threshold. We then extend this framework to a larger action space covering both medium access and the transmit modulation scheme. We modify the reward function to accommodate this extension and use a stabilizing reinforcement learning technique to provide scalability, achieving an improved cumulative reward on both indoor and outdoor layouts with a large number of base stations.
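The abstract above describes decentralized agents that sense the spectrum and then decide whether to transmit. The dissertation's actual framework uses deep multi-agent RL; as a loose illustration only, the per-slot sense-then-decide loop can be sketched with tabular Q-learning (all class and method names here are hypothetical, not from the work):

```python
import random

class MediumAccessAgent:
    """Toy decentralized agent, one per base station.

    Stage 1: observe a quantized spectrum-sensing result.
    Stage 2: choose TRANSMIT or DEFER, then learn from the
    reception-quality reward via tabular Q-learning.
    """
    ACTIONS = ("TRANSMIT", "DEFER")

    def __init__(self, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
        self.q = {}                      # (observation, action) -> value
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon
        self.rng = random.Random(seed)

    def act(self, obs):
        # Epsilon-greedy action selection over the two medium-access actions.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.ACTIONS)
        return max(self.ACTIONS, key=lambda a: self.q.get((obs, a), 0.0))

    def update(self, obs, action, reward, next_obs):
        # Standard one-step Q-learning backup.
        best_next = max(self.q.get((next_obs, a), 0.0) for a in self.ACTIONS)
        old = self.q.get((obs, action), 0.0)
        self.q[(obs, action)] = old + self.alpha * (
            reward + self.gamma * best_next - old)
```

Each base station would run its own agent on local observations, which is what makes the scheme fully decentralized; the dissertation replaces the table with a deep network and a richer action space.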
In the second part of this dissertation, we develop a deep generative learning framework to perform channel estimation from an insufficient number of pilots in high-dimensional wireless systems. Channel estimation using generative priors assumes that the reconstructed channel lies in the range of the generative model, and optimizes the input vector to estimate the channel matrix. We show that our approach outperforms state-of-the-art compressed sensing (CS) baselines. Subsequently, we develop a novel over-the-air design for training the aforementioned deep generative models using generative adversarial networks from pilot measurements instead of clean channel realizations, while still achieving performance competitive with the CS baselines.

## A modular attention hypothesis for modeling visuomotor behaviors (2021-07-24)

Zhang, Ruohan; Ballard, Dana H.; Hayhoe, Mary; Stone, Peter; Huth, Alexander; Dayan, Peter

In this dissertation, we explore the hypothesis that complex intelligent behaviors, in vivo, can be decomposed into modules that are organized in hierarchies and executed in parallel. This organization is similar to a multiprocessing architecture in silico. Biological attention can be viewed as a "process manager" that coordinates information processing across multiple computations. In this work, we seek to understand and model this modular attention mechanism for humans in a range of behavioral settings. We examine modular attention at the three levels of David Marr's paradigm: the computational theory level, the representation and algorithm level, and the hardware implementation level. At the computational theory level, we propose that simple visuomotor behaviors can be broken down into modules that require attention for their execution. At the representation and algorithm level, we model human eye movements and actions in a variety of visuomotor tasks.
We collect and publish a large-scale, high-quality dataset of the eye movements and actions of humans playing Atari video games. We study the active vision problem by jointly modeling human eye movements and actions, and compare how humans and artificial learning agents play these video games differently. We then propose a modular reinforcement learning model of human subjects' navigation behaviors in a virtual-reality environment with multiple goals. We further develop a modular inverse reinforcement learning algorithm to efficiently estimate the subjective reward and discount factors associated with each behavioral goal. At the implementation level, we propose a theoretical neuronal communication model, named gamma spike multiplexing, that allows the cortex to perform multiple computations simultaneously without crosstalk. The model explains how the modular attention hypothesis might be implemented in the biological brain. The end goals of this work are to (1) build models that explain and predict observed human visuomotor behaviors and attention, and (2) use these biologically inspired models to develop algorithms for better artificial learning systems.

## Advancing coding theory and mean field games via decentralized optimization and deep learning (2022-12-02)

Mishra, Rajesh Kumar; Kim, Hyeji; Vishwanath, Sriram; Andrews, Jeffrey; Jafar, Syed; Anastasopoulos, Achilleas

The advent of new learning approaches such as deep learning and reinforcement learning has enhanced our capability to view decentralized optimization problems from new angles and solve them with new insights. In this work, we focus on such problems in coding theory for communication systems and in mean-field games. We consider a coding problem where the transmitter and the receiver cooperate in a decentralized manner to develop an encoding strategy for channels with feedback.
We use control-theoretic tools such as Markov decision processes (MDPs), dynamic programming, and deep learning to solve these problems. We also consider an interference channel problem where the transmitters cooperate in a distributed fashion to obtain an encoding scheme that maximizes the sum rate. Finally, we consider a multi-agent system and use a decentralized mechanism based on mean-field games and reinforcement learning to solve for the optimal strategies. In part I, we propose linear sequential coding for communication channels with passive noisy output feedback under a peak power constraint and a total power constraint. We propose a dynamic programming algorithm that solves an MDP to obtain analytical expressions for the scheme and for the final mean squared error (MSE). We show that our schemes outperform the state-of-the-art schemes. In part II, we use RNN-based encoders at the transmitter and the receiver for encoding channels with active feedback, which outperform the traditional approaches. We also propose an analytical scheme by interpreting the codes obtained with the deep learning method, implement our scheme on a hardware SDR setup, and showcase the performance improvement in an over-the-air environment. In part III, we consider a multiuser interference channel with transmitter and receiver pairs. Prior work proposed schemes using linear interference alignment techniques to transmit the messages through interference-free subspaces. We propose deep learning approaches to develop encoding schemes for the messages, and show that training on message bits can produce intelligent, non-intuitive encoding schemes that perform better than general interference alignment schemes. In part IV, we consider the classic game-theoretic problem of mean-field games, which we solve using sequential decomposition and model-free reinforcement learning techniques that are more efficient than those previously used for multi-agent systems.
We also provide an inverse reinforcement learning algorithm that recovers optimal strategies for games without knowledge of the reward function.

## Aligning robot navigation behaviors with human intentions and preferences (2024-05)

Karnan, Haresh; Stone, Peter; Deshpande, Ashish D.; Alambeigi, Farshid; Wang, Junmin; Warnell, Garrett; Dragan, Anca; Biswas, Joydeep

Recent advances in the field of machine learning have led to new ways for mobile robots to acquire advanced navigational capabilities (Bojarski et al., 2016; Kahn et al., 2018; Kendall et al., 2019; Pan et al., 2018; Silver et al., 2010). However, these learning-based methods raise the possibility that learned navigation behaviors may not align with the intentions and preferences of people, a problem known as value misalignment. To mitigate this danger, this dissertation aims to answer the question: "How can we use machine learning methods to align the navigational behaviors of autonomous mobile robots with human intentions and preferences?" First, this dissertation introduces a new approach to learning navigation behaviors by imitating human-provided demonstrations of the intended navigation task. This contribution allows mobile robots to acquire autonomous visual navigation capabilities by imitating human demonstrations, using a novel objective function that encourages the agent to align with the human's navigation objective and penalizes misalignment. Second, this dissertation introduces two algorithms that enhance terrain-aware off-road navigation for mobile robots by learning visual terrain awareness in a self-supervised manner. This contribution enables mobile robots to obey a human operator's preferences for navigating over different terrains in urban outdoor environments and to extrapolate these preferences to visually novel terrains by leveraging multi-modal representations.
Finally, in the context of robot navigation in human-occupied environments, this dissertation introduces a dataset and an algorithm for socially compliant robot navigation in both indoor and outdoor environments. In summary, the contributions of this dissertation take a significant step toward addressing the value alignment problem in autonomous navigation, enabling mobile robots to navigate autonomously with objectives that align with the intentions and preferences of humans.

## An evaluation framework for future privacy protection systems (2021-04-02)

Liau, David; Barber, Kathleen S.; Arapostathis, Ari; Khurshid, Sarfraz; Garg, Vijay; Barbour, Joshua

This research offers a tool that brings together the UT Center for Identity (CID) Identity Ecosystem, game theory, and Markov decision processes to generate and evaluate the best strategy for defending against the theft of personal identity information. It conducts a simulation-based study to evaluate and evolve the efficacy of identity protection strategies and, in doing so, develops a universally applicable tool for evaluating and recommending identity protection strategies in general. Leveraging the UT CID Identity Ecosystem and its underlying Bayesian network representation of personally identifiable information (PII), this research delivers a dynamic version of the UT CID Identity Ecosystem as a universal evaluation framework for identity protection systems. We aim to understand how initial exposure of individual PII evolves into crucial PII breaches over time in a dynamic Identity Ecosystem.
We further provide quantitative analysis to differentiate and measure identity protection strategies and their characteristics.

## An exploration of modeling and control methods for bipedal humanoid robots (2023-04-24)

Cruz, Melissa Jordan; Sentis, Luis; Chen, Dongmei (Maggie)

Two whole-body motion planning and control methods are presented in this report: trajectory generation and tracking using centroidal dynamics with optimization methods, and using reinforcement learning (RL). Centroidal dynamics uses a simplified model of the robot that assumes all of the robot's mass is located at its center of mass. This assumption greatly reduces the computational cost at the expense of a less accurate robot model. The RL trajectory generation and control is implemented using NVIDIA's Isaac Gym environment. Isaac Gym massively parallelizes computation across available GPUs, greatly decreasing computation time and making it a useful tool for developing standing and walking policies with RL on humanoid robots. Both methods produced trajectories that resulted in stable XY-planar movement. The centroidal dynamics method produced more promising results, with stable Z movement as well. More work should be done on the RL method regarding reward tuning.

## Automated domain analysis and transfer learning in general game playing (2010-08)

Kuhlmann, Gregory John; Stone, Peter; Lifschitz, Vladimir; Mooney, Raymond J.; Porter, Bruce W.; Schaeffer, Jonathan

Creating programs that can play games such as chess, checkers, and backgammon at a high level has long been a challenge and benchmark for AI. Computer game playing is arguably one of AI's biggest success stories. Several game-playing systems developed in the past, such as Deep Blue, Chinook, and TD-Gammon, have demonstrated competitive play against top human players.
However, such systems are limited in that they play only one particular game and typically must be supplied with game-specific knowledge. While their performance is impressive, it is difficult to determine whether their success is due to generally applicable techniques or to the human game analysis. A general game player is an agent capable of taking a description of a game's rules as input and proceeding to play without any subsequent human input. In doing so, the agent, rather than the human designer, is responsible for the domain analysis. Developing such a system requires the integration of several AI components, including theorem proving, feature discovery, heuristic search, and machine learning. In the general game playing scenario, the player agent is supplied with a game's rules in a formal language prior to match play. This thesis contributes a collection of general methods for analyzing these game descriptions to improve performance. Prior work on automated domain analysis has focused on generating heuristic evaluation functions for use in search. The thesis builds upon this work by introducing a novel feature generation method, along with a method for generating and comparing simple evaluation functions based on these features. I describe how more sophisticated evaluation functions can be generated through learning. Finally, this thesis demonstrates the utility of domain analysis in facilitating knowledge transfer between games for improved learning speed. The contributions are fully implemented, with empirical results, in a general game playing system.

## Autonomous inter-task transfer in reinforcement learning domains (2008-08)

Taylor, Matthew Edmund; Stone, Peter

Reinforcement learning (RL) methods have become popular in recent years because of their ability to solve complex tasks with minimal feedback.
While these methods have had experimental successes and have been shown to exhibit some desirable properties in theory, the basic learning algorithms have often been found slow in practice. Therefore, much current RL research focuses on speeding up learning by taking advantage of domain knowledge or by better utilizing agents' experience. The ambitious goal of transfer learning, when applied to RL tasks, is to accelerate learning on some target task after training on a different but related source task. This dissertation demonstrates that transfer learning methods can successfully improve learning in RL tasks via experience from previously learned tasks. Transfer learning can increase RL's applicability to difficult tasks by allowing agents to generalize their experience across learning problems. This dissertation presents inter-task mappings, the first transfer mechanism in this area to successfully enable transfer between tasks with different state variables and actions. Inter-task mappings have subsequently been used by a number of transfer researchers. A set of six transfer learning algorithms is then introduced. While these transfer methods differ in terms of which base RL algorithms they are compatible with, what type of knowledge they transfer, and what their strengths are, all utilize the same inter-task mapping mechanism. These transfer methods can all successfully use mappings constructed by a human from domain knowledge, but there may be situations in which domain knowledge is unavailable, or insufficient, to describe how two given tasks are related. We therefore also study how inter-task mappings can be learned autonomously by leveraging existing machine learning algorithms. Our methods use classification and regression techniques to successfully discover similarities between data gathered in pairs of tasks, culminating in what is currently one of the most robust mapping-learning algorithms for RL transfer.
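To make the idea of an inter-task mapping concrete: given functions that map each target-task state and action to a source-task counterpart, transferred knowledge (here, Q-values) can seed the target learner instead of starting from scratch. This is only a minimal illustrative sketch with hypothetical names, not one of the dissertation's six algorithms:

```python
def transfer_q(source_q, state_map, action_map, target_states, target_actions):
    """Initialize a target-task Q-table from a source-task Q-table.

    state_map / action_map form the inter-task mapping: each takes a
    target state or action to its source-task counterpart. Pairs with
    no source counterpart start at 0.
    """
    target_q = {}
    for s in target_states:
        for a in target_actions:
            key = (state_map(s), action_map(a))
            target_q[(s, a)] = source_q.get(key, 0.0)
    return target_q
```

The target agent would then continue learning from this warm start, which is how transfer can speed up learning even when the two tasks have different state variables and actions.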
Combining transfer methods with these similarity-learning algorithms allows us to empirically demonstrate the plausibility of autonomous transfer. We fully implement these methods in four domains (each with different salient characteristics), show that transfer can significantly improve an agent's ability to learn in each domain, and explore the limits of transfer's applicability.

## Autonomous qualitative learning of distinctions and actions in a developing agent (2010-08)

Mugan, Jonathan William; Kuipers, Benjamin; Stone, Peter; Ballard, Dana; Cohen, Leslie; Mooney, Raymond

How can an agent bootstrap up from a pixel-level representation to autonomously learn high-level states and actions using only domain-general knowledge? This thesis attacks a piece of this problem: assuming the agent has a set of continuous variables describing the environment and a set of continuous motor primitives, it poses a solution to the problem of how the agent can learn a set of useful states and effective higher-level actions through autonomous experience with the environment. Methods exist for learning models of the environment, and methods exist for planning; however, for autonomous learning, these methods have been used almost exclusively in discrete environments. This thesis proposes attacking the problem of learning high-level states and actions in continuous environments by using a qualitative representation to bridge the gap between continuous and discrete variable representations. In this approach, the agent begins with a broad discretization and initially can only tell whether the value of each variable is increasing, decreasing, or remaining steady. The agent then simultaneously learns a qualitative representation (discretization) and a set of predictive models of the environment, converts these models into plans to form actions, and uses those learned actions to explore the environment.
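The broad initial discretization described above, in which the agent only distinguishes increasing, decreasing, and steady values, can be illustrated with a small helper (a hypothetical sketch, not code from the thesis; the threshold `eps` is an assumed noise tolerance):

```python
def qualitative_trend(values, eps=1e-3):
    """Map a continuous time series to qualitative changes:
    '+' increasing, '-' decreasing, '0' steady (change within eps)."""
    trends = []
    for prev, cur in zip(values, values[1:]):
        d = cur - prev
        trends.append('+' if d > eps else '-' if d < -eps else '0')
    return trends
```

From symbolic streams like this, the agent in the thesis refines its discretization and learns predictive models, rather than operating directly on raw continuous values.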
The method is evaluated using a simulated robot with realistic physics. The robot sits at a table containing one or two blocks, as well as other distractor objects that are out of reach. The agent autonomously explores the environment without being given a task. After learning, the agent is given various tasks to determine whether it has learned the states and actions necessary to complete them. The results show that the agent was able to use this method to autonomously learn to perform the tasks.

## Autonomous trading in modern electricity markets (2015-12)

Urieli, Daniel; Stone, Peter; Mooney, Raymond; Ravikumar, Pradeep; Baldick, Ross; Kolter, Zico

The smart grid is an electricity grid augmented with digital technologies that automate the management of electricity delivery. The smart grid is envisioned as a main enabler of a sustainable, clean, efficient, reliable, and secure energy supply. One of the milestones in the smart grid vision will be programs for customers to participate in electricity markets through demand-side management and distributed generation; electricity markets will (directly or indirectly) incentivize customers to adapt their demand to supply conditions, which in turn will help to utilize intermittent energy resources such as solar and wind and to reduce peak demand. Since wholesale electricity markets are not designed for individual participation, retail brokers could represent customer populations in the wholesale market and make a profit while contributing to the electricity grid's stability and reducing customer costs. A retail broker needs to operate continually and make real-time decisions in a complex, dynamic environment, and will therefore benefit from employing an autonomous broker agent. With this motivation in mind, this dissertation makes five main contributions to the areas of artificial intelligence, smart grids, and electricity markets.
First, this dissertation formalizes the problem of autonomous trading by a retail broker in modern electricity markets. Since the trading problem is intractable to solve exactly, this formalization provides a guideline for approximate solutions. Second, this dissertation introduces a general algorithm for autonomous trading in modern electricity markets, named LATTE (Lookahead-policy for Autonomous Time-constrained Trading of Electricity). LATTE is a general framework that can be instantiated in different ways to tailor it to specific setups. Third, this dissertation contributes fully implemented and operational autonomous broker agents, each using a different instantiation of LATTE. These agents were successful in international competitions and controlled experiments and can serve as benchmarks for future research in this domain. Detailed descriptions of the agents' behaviors, as well as their source code, are included in this dissertation. Fourth, this dissertation contributes extensive empirical analysis that validates the effectiveness of LATTE at different competition levels under a variety of environmental conditions, shedding light on the main reasons for its success by examining the importance of its constituent components. Fifth, this dissertation empirically examines the impact of time-of-use (TOU) tariffs in competitive electricity markets; such tariffs have been proposed for demand-side management both in the literature and in the real world. The success of the different instantiations of LATTE demonstrates its generality in the context of electricity markets.
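LATTE itself is specific to the dissertation, but the general shape of a lookahead policy that optimizes predicted utility can be sketched generically. Everything below is hypothetical: `predict` stands in for any learned predictive model of the environment and `utility` for the broker's utility estimate.

```python
def lookahead_policy(state, candidate_actions, predict, utility, horizon=3):
    """Generic lookahead: for each candidate action, roll the
    predictive model predict(state, action) forward `horizon` steps
    and pick the action with the highest summed predicted utility."""
    def rollout(s, a):
        total = 0.0
        for _ in range(horizon):
            s = predict(s, a)        # predicted next state under action a
            total += utility(s)      # predicted utility of that state
        return total
    return max(candidate_actions, key=lambda a: rollout(state, a))
```

A real trading agent would replace the fixed-action rollout with a richer policy over the horizon and re-plan at every time step as new market information arrives.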
Ultimately, this dissertation demonstrates that an autonomous broker can act effectively in modern electricity markets by executing an efficient lookahead policy that optimizes its predicted utility, and that by doing so the broker can benefit itself, its customers, and the economy.

## Boosting deep reinforcement learning algorithms with deep probabilistic models (2021-05-10)

Yue, Yuguang; Zhou, Mingyuan; Mueller, Peter; Sarkar, Abhra; Qian, Xiaoning

This thesis develops new methodologies that boost deep reinforcement learning algorithms from a probabilistic point of view. More specifically, three angles are studied to improve the sample efficiency of deep reinforcement learning algorithms: (1) we apply a hierarchical structure to policy construction to obtain a flexible policy capable of capturing complex distributions and making more appropriate decisions; (2) we reduce the variance of the Monte Carlo policy gradient estimate by designing a "self-critic" baseline function, yielding a gradient estimator with smaller variance and better empirical performance; and (3) we apply the distributional reinforcement learning framework to the continuous-action setting with a stochastic policy, and stabilize the training process with double generative networks. All three methods bring clear gains, demonstrating the benefits of applying deep probabilistic models to improve deep reinforcement learning algorithms.

## Capacity auctions for electricity (2018-12)

Yucel, Emre; Dyer, James S.; Butler, John; Muthuraman, Kumar; Anderson, Edward

Faced with uncertainty about future electricity generation supply, many regional electricity markets have adopted, or considered adopting, capacity markets for electricity.
We study the structure of these markets, and in particular capacity supply auctions such as the one implemented by PJM Interconnection (PJM), a regional transmission organization. Participants bid generation capacity into the auction, and those that win receive a capacity payment in return for having this capacity available for generation at a future delivery date. The auctions can be classified as multi-unit uniform-price auctions, though the price is set according to a demand curve rather than by participants' bids. We find closed-form solutions for the optimal bids as a function of cost, study the welfare impacts of the auction, and show how the results can be extended numerically to more complex situations. We then use these optimal bid functions in an agent-based simulation of electricity markets, comparing energy-only markets to capacity markets and measuring the impact on both generators and consumers of electricity. Lastly, we couple our agent-based simulation model with reinforcement learners to determine whether the optimal bid strategy derived earlier can be learned over time by agents participating in the energy and capacity markets.

## Comparing human and machine attention in visuomotor tasks (2021-07-24)

Guo, Sihang; Ballard, Dana H.

The emergence of deep learning has transformed the way researchers approach complex machine perception problems, and has resulted in models with (super)human-level performance in various perception and motor tasks. Originally rooted in the human visual system, deep learning methods have only recently been adapted to understand human perception: in both vision and language tasks, layers and regions of the cerebral cortex have been identified that share similar learned representations with deep models.
In this thesis, we take a step further into the domain of visuomotor decision-making tasks, exploring the possibility of using deep reinforcement learning (RL) algorithms to model human perceptual representations. By comparing the learned representations of humans and RL models in terms of attention, we investigate the effects of learning and of different hyperparameters on the resulting attention similarity. We find a positive correlation between RL performance and attention similarity, and make observations about human visuomotor behaviors from the comparison.

## Computational solution and analysis of disentanglement puzzles (2023-05-04)

Zhang, Xinya; Vouga, Paul Etienne; Huang, Qixing; Fussell, Donald S.; Kry, Paul G.; Biswas, Joydeep

Disentanglement puzzles are a type of mechanical puzzle whose goal is to separate pieces from each other. Traditionally designed for human entertainment, these puzzles have a reputation for being challenging to solve. Their intrinsic difficulty also poses challenges to state-of-the-art algorithms from several areas of computer science: present motion planning algorithms from robotics usually fail to solve disentanglement puzzles, and no geometry processing tools have been designed for the analysis and design of such puzzles. Mathematically, disentanglement puzzles can be represented by their corresponding configuration space, or C-space. This concept is widely used in robotics, physical simulation, and geometry processing: the motions of robot arms, the state transitions of physical systems, and the deformation of geometries can all be abstracted as vector calculus over a point in the corresponding C-space. The challenges of disentanglement puzzles stem from their equally challenging C-spaces.
Unfortunately, the C-space is usually not well understood due to the curse of dimensionality, and is thus commonly represented implicitly. This practice creates a challenge in motion planning called the "narrow passage problem," which refers to the difficulty a motion planning algorithm has in discovering a path that must pass through one or more narrow tunnels. This is the exact problem that blocks computational methods from solving disentanglement puzzles. This thesis focuses on better understanding disentanglement puzzles and their C-spaces. The crucial goal is to develop a motion planning algorithm that can solve these puzzles with reasonable performance; additional goals include suggesting novel techniques to aid the analysis of existing puzzles and, potentially, the design of new ones. The thesis consists of four technical chapters after a dedicated chapter on related work. The first technical chapter discusses the construction of a navigation function that can pass through narrow tunnels by simulating electromagnetic fields; it proposes several algorithms in this category and analyzes their performance and limitations. The second technical chapter explores the possibility of using reinforcement learning to accelerate the solution of similar but novel puzzles by training on existing puzzles; it discusses the challenges of applying reinforcement learning methods to motion planning, and suggests that disentanglement puzzles can serve as a new class of task/dataset for evaluating reinforcement learning algorithms. The third technical chapter proposes a scalable motion planning system that solves disentanglement puzzles by detecting narrow tunnels in C-space. It presents algorithms to detect key features of puzzle pieces, schemes to locate candidate narrow tunnels by assembling those features, and a distributed motion planning algorithm to search the C-space from a large number of potential narrow tunnels.
The last technical chapter focuses on the analysis of wire puzzles. It proposes a set of metrics to quantify what distinguishes a puzzle from a non-puzzle, and suggests two common approaches to designing disentanglement puzzles. With this set of metrics, the chapter makes new applications, such as automatic puzzle design, possible.

## Convex optimization meets formal methods : verification, synthesis, and learning in Markov decision processes (2021-08-09)

Cubuktepe, Murat; Topcu, Ufuk; Tanaka, Takashi; Sentis, Luis; Katoen, Joost-Pieter

This dissertation studies the applicability of convex optimization to the formal verification and synthesis of systems that exhibit randomness or stochastic uncertainties. Such systems can be represented by a general family of uncertain, partially observable, and parametric Markov decision processes (MDPs). These models have found applications in artificial intelligence, planning, autonomy, and control theory, and can accurately characterize dynamic, uncertain environments. The synthesis of policies for this family of models has long been regarded as theoretically and empirically intractable. The goal of this dissertation is to develop theoretically sound and computationally efficient synthesis algorithms that provably satisfy formal high-level task specifications in temporal logic. The first part develops convex-optimization-based techniques for parameter synthesis in parametric MDPs, where the transition values are functions of real-valued parameters. The second part builds on the formulations of the first part and utilizes sampling-based methods for verification and optimization in uncertain MDPs, which allow the probabilistic transition function to belong to a so-called uncertainty set.
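A basic primitive behind MDP verification of the kind discussed above is computing the maximal probability of reaching a set of goal states, which can be done by value iteration. This is a standard textbook routine, sketched here as background rather than as any algorithm from the dissertation:

```python
def max_reach_prob(states, actions, trans, goal, iters=100):
    """Value iteration for the maximal probability of reaching `goal`
    in an MDP. trans[(s, a)] is a dict mapping next_state -> probability."""
    v = {s: (1.0 if s in goal else 0.0) for s in states}
    for _ in range(iters):
        new_v = {}
        for s in states:
            if s in goal:
                new_v[s] = 1.0
                continue
            best = 0.0
            for a in actions:
                dist = trans.get((s, a))
                if dist:
                    # Expected reachability value of taking action a in s.
                    best = max(best, sum(p * v[t] for t, p in dist.items()))
            new_v[s] = best
        v = new_v
    return v
```

Checking a temporal-logic reachability specification then amounts to comparing this probability against a threshold; the dissertation's contribution is handling the much harder parametric, uncertain, and partially observable variants via convex optimization.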
The third part develops inverse reinforcement learning algorithms in partially observable MDPs to address several limitations of existing techniques, which do not take the information asymmetry between the expert and the agent into account. Finally, the fourth part synthesizes policies for uncertain partially observable MDPs that allow both the probabilistic transition and observation functions to be uncertain. A unifying theme in each part is that the resulting algorithms approximate the underlying optimization problem as a convex optimization problem. Additionally, by combining techniques from convex optimization and formal methods, the algorithms bring strong performance guarantees with respect to task specifications. The computational efficiency and applicability of the resulting algorithms are demonstrated in numerous domains such as aircraft collision avoidance, spacecraft and unmanned aerial vehicle motion planning, and joint active perception and planning in urban environments.

Item Cooperation and communication in multiagent deep reinforcement learning (2016-12) Hausknecht, Matthew John; Stone, Peter, 1971-; Ballard, Dana; Mooney, Ray; Miikkulainen, Risto; Singh, Satinder

Reinforcement learning is the area of machine learning concerned with learning which actions to execute in an unknown environment in order to maximize cumulative reward. As agents begin to perform tasks of genuine interest to humans, they will be faced with environments too complex for humans to predetermine the correct actions using hand-designed solutions. Instead, capable learning agents will be necessary to tackle complex real-world domains. However, traditional reinforcement learning algorithms have difficulty with domains featuring 1) high-dimensional continuous state spaces, for example pixels from a camera image, 2) high-dimensional parameterized-continuous action spaces, 3) partial observability, and 4) multiple independent learning agents.
We hypothesize that deep neural networks hold the key to scaling reinforcement learning towards complex tasks. This thesis seeks to answer the following two-part question: 1) How can the power of Deep Neural Networks be leveraged to extend Reinforcement Learning to complex environments featuring partial observability, high-dimensional parameterized-continuous state and action spaces, and sparse rewards? 2) How can multiple Deep Reinforcement Learning agents learn to cooperate in a multiagent setting? To address the first part of this question, this thesis explores the idea of using recurrent neural networks to combat partial observability experienced by agents in the domain of Atari 2600 video games. Next, we design a deep reinforcement learning agent capable of discovering effective policies for the parameterized-continuous action space found in the Half Field Offense simulated soccer domain. To address the second part of this question, this thesis investigates architectures and algorithms suited for cooperative multiagent learning. We demonstrate that sharing parameters and memories between deep reinforcement learning agents fosters policy similarity, which can result in cooperative behavior. Additionally, we hypothesize that communication can further aid cooperation, and we present the Grounded Semantic Network (GSN), which learns a communication protocol grounded in the observation space and reward function of the task. In general, we find that the GSN is effective on domains featuring partial observability and asymmetric information. All in all, this thesis demonstrates that reinforcement learning combined with deep neural network function approximation can produce algorithms capable of discovering effective policies for domains with partial observability, parameterized-continuous action spaces, and sparse rewards.
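The parameter-sharing idea has a simple tabular analogue: two agents that write into the same Q-table necessarily end up with identical greedy policies, the extreme case of the policy similarity described above. The corridor task below is a hypothetical toy, not one of the thesis's domains, and shared deep-network weights are replaced by a shared table for brevity (optimistic initialization stands in for exploration machinery).

```python
import numpy as np

# A tiny corridor task: states 0..4, goal at 4, actions 0=left, 1=right.
# Two agents learn concurrently but SHARE one Q-table -- the tabular
# analogue of parameter sharing between deep RL agents.
N, GOAL, ALPHA, GAMMA, EPS = 5, 4, 0.5, 0.9, 0.2
rng = np.random.default_rng(0)
Q = np.ones((N, 2))  # shared, optimistically initialized to drive exploration

def step(s, a):
    s2 = min(max(s + (1 if a == 1 else -1), 0), N - 1)
    return s2, (1.0 if s2 == GOAL else 0.0), s2 == GOAL

for episode in range(200):
    for agent in range(2):            # both agents update the same table
        s = 0
        for _ in range(50):
            a = int(rng.integers(2)) if rng.random() < EPS else int(Q[s].argmax())
            s2, r, done = step(s, a)
            target = r + (0.0 if done else GAMMA * Q[s2].max())
            Q[s, a] += ALPHA * (target - Q[s, a])
            s = s2
            if done:
                break

greedy = Q.argmax(axis=1)  # both agents act identically from the shared table
```

Because both agents read and write the same table, either agent's experience immediately shapes the other's behavior, which is what fosters coordinated policies.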
Additionally, we demonstrate that single-agent deep reinforcement learning algorithms can be naturally extended to cooperative multiagent tasks featuring learned communication. These results represent a non-trivial step toward extending agent-based AI to complex environments.

Item Curriculum learning in reinforcement learning (2021-05-06) Narvekar, Sanmit Santosh; Stone, Peter, 1971-; Niekum, Scott; Mooney, Raymond; Brunskill, Emma

In recent years, reinforcement learning (RL) has been increasingly successful at solving complex tasks. Despite these successes, one of the fundamental challenges is that many RL methods require large amounts of experience, and thus can be slow to train in practice. Transfer learning is a recent area of research that has been shown to speed up learning on a complex task by transferring knowledge from one or more easier source tasks. Most existing transfer learning methods treat this transfer of knowledge as a one-step process, where knowledge from all the sources is directly transferred to the target. However, for complex tasks, it may be more beneficial (and even necessary) to gradually acquire skills over multiple tasks in sequence, where each subsequent task requires and builds upon knowledge gained in a previous task. This idea is pervasive throughout human learning, where people learn complex skills gradually by training via a curriculum. The goal of this thesis is to explore whether autonomous reinforcement learning agents can also benefit from training via a curriculum, and whether such curricula can be designed fully autonomously. In order to answer these questions, this thesis first formalizes the concept of a curriculum, and the methodology of curriculum learning in reinforcement learning.
Curriculum learning consists of three main elements: 1) task generation, which creates a suitable set of source tasks; 2) sequencing, which focuses on how to order these tasks into a curriculum; and 3) transfer learning, which considers how to transfer knowledge between tasks in the curriculum. This thesis introduces several methods to both create suitable source tasks and automatically sequence them into a curriculum. We show that these methods produce curricula that are tailored to the individual sensing and action capabilities of different agents, and show how the curricula learned can be adapted for new, but related, target tasks. Together, these methods form the components of an autonomous curriculum design agent that can suggest a training curriculum customized to both the unique abilities of each agent and the task in question. We expect that this research on curriculum learning will increase the applicability and scalability of RL methods by providing a faster way of training reinforcement learning agents, compared to learning tabula rasa.

Item Data efficient reinforcement learning with off-policy and simulated data (2019-11-12) Hanna, Josiah Paul; Stone, Peter, 1971-; Niekum, Scott; Krähenbühl, Philipp; Sutton, Richard

Learning from interaction with the environment -- trying untested actions, observing successes and failures, and tying effects back to causes -- is one of the first capabilities we think of when considering autonomous agents. Reinforcement learning (RL) is the area of artificial intelligence research that has the goal of allowing autonomous agents to learn in this way. Despite much recent success, many modern reinforcement learning algorithms are still limited by the requirement of large amounts of experience before useful skills are learned.
Two possible approaches to improving data efficiency are to allow algorithms to make better use of past experience collected with past behaviors (known as off-policy data) and to allow algorithms to make better use of simulated data sources. This dissertation investigates the use of such auxiliary data by answering the question, "How can a reinforcement learning agent leverage off-policy and simulated data to evaluate and improve upon the expected performance of a policy?" This dissertation first considers how to directly use off-policy data in reinforcement learning through importance sampling. When used in reinforcement learning, importance sampling is limited by high variance that leads to inaccurate estimates. This dissertation addresses this limitation in two ways. First, this dissertation introduces the behavior policy gradient algorithm, which adapts the data collection policy towards a policy that generates data leading to low-variance importance sampling evaluation of a fixed policy. Second, this dissertation introduces the family of regression importance sampling estimators, which improve the weighting of already-collected off-policy data so as to lower the variance of importance sampling evaluation of a fixed policy. In addition to evaluation of a fixed policy, we apply the behavior policy gradient algorithm and regression importance sampling to batch policy gradient policy improvement. In the case of regression importance sampling, this application leads to the introduction of the sampling-error-corrected policy gradient estimator, which improves the data efficiency of batch policy gradient algorithms. Towards the goal of learning from simulated experience, this dissertation introduces an algorithm -- the grounded action transformation algorithm -- that takes small amounts of real world data and modifies the simulator such that skills learned in simulation are more likely to carry over to the real world.
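The contrast between ordinary importance sampling and the regression importance sampling idea is easiest to see in a one-step (bandit) case, sketched below. This is only an illustration under simplifying assumptions (known behavior policy, deterministic rewards); the dissertation's estimators handle full sequential trajectories.

```python
import numpy as np

rng = np.random.default_rng(0)

# One-step setting: behavior policy mu collects the data,
# target policy pi is the one we want to evaluate.
mu = np.array([0.5, 0.5])          # behavior action probabilities
pi = np.array([0.2, 0.8])          # target action probabilities
true_r = np.array([0.0, 1.0])      # deterministic reward per action

n = 10_000
actions = rng.choice(2, size=n, p=mu)
rewards = true_r[actions]

# Ordinary importance sampling: weight each reward by pi(a)/mu(a).
ois = np.mean(pi[actions] / mu[actions] * rewards)

# Regression-style weighting: replace mu by the *empirical* action
# frequencies, correcting for sampling error in which actions were drawn.
mu_hat = np.bincount(actions, minlength=2) / n
ris = np.mean(pi[actions] / mu_hat[actions] * rewards)

true_value = float(pi @ true_r)    # 0.8 by construction
```

Because the regression-style weights use the empirical action frequencies, the sampling error in which actions happened to be drawn cancels exactly in this deterministic-reward toy, while the ordinary estimator retains that noise.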
Key to the grounded action transformation approach is the idea of local simulator modification -- the simulator is automatically altered to better model the real world for the actions the data collection policy would take in the states it would visit. Local modification necessitates an iterative approach: the simulator is modified, the policy improved, and then more data is collected for further modification. Finally, in addition to examining each of them independently, this dissertation also considers the possibility of combining the use of simulated data with importance-sampled off-policy data. We combine these sources of auxiliary data using control variate techniques that use simulated data to lower the variance of off-policy policy value estimation. Combining these sources of auxiliary data allows us to introduce two algorithms -- weighted doubly robust bootstrap and model-based bootstrap -- for the problem of lower-bounding the performance of an untested policy.

Item Data-driven design for multihop and multi-band cellular networks (2023-01-03) Gupta, Manan; Andrews, Jeffrey G.; de Veciana, Gustavo A.; Chinchali, Sandeep; Vishwanath, Sriram; Visotsky, Eugene

Millimeter wave (mmWave) integrated access and backhaul (IAB) and multiband heterogeneous networks allow operators to access more spectral resources and keep up with the intense consumer demand for faster data connectivity. However, these promising network architectures pose new challenges to network resource management, such as multihop routing, link scheduling, and traffic steering, and motivate the rethinking of traditional solutions. IAB facilitates cost-effective deployment of mmWave cellular networks via multihop self-backhauling, albeit at the cost of poor rate scaling and packet latency.
In the first part of this dissertation, we develop data-driven link scheduling policies for IAB networks to minimize multihop delay while accounting for practical network constraints such as feedback delays, the choice of half-duplex (HD) or full-duplex (FD) transceivers, and scheduling restrictions. We formulate the link scheduling problem as a Markov decision process (MDP) with a continuous action space and solve it using the deep deterministic policy gradient (DDPG) algorithm. Detailed system-level simulations show that the reinforcement learning (RL)-based scheduler can reduce the mean delay by 230% and 260% compared to a backpressure scheduler and a max-min scheduler, respectively. In the second part, we investigate the network-level benefits of upgrading an IAB network with FD transceivers as a potential means to overcome the latency and throughput challenges faced by IAB networks. We formulate a network utility maximization problem with practical and tractable throughput and latency constraints to evaluate both FD-IAB and HD-IAB networks, and analytically characterize the latency gain from an FD upgrade. Even when the residual self-interference is significantly above the noise floor, this transceiver-level upgrade can improve throughput by 8x and reduce latency by 4x for a fourth-hop user. In the third part, we develop a novel learning-based model predictive control (MPC) approach for base station (BS) selection and band assignment while accounting for user mobility. We first train a deep recurrent neural network to reliably forecast the mobile users' future rates. The MPC controller then uses this forecast to optimize the association decisions to maximize the service-rate-based network utility. To solve the MPC problem efficiently, we also develop an optimization algorithm based on the Frank-Wolfe method. The MPC approach improves the 5th percentile service rate by 2.7x compared to traditional signal-strength-based association.
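Frank-Wolfe is attractive for association problems because the feasible set (fractional association vectors) is a simplex, over which the linear subproblem has a closed-form vertex solution, so no projection step is ever needed. The sketch below applies it to a stand-in proportional-fair objective on a single simplex; the objective, weights, and step rule are illustrative assumptions, not the dissertation's actual MPC formulation.

```python
import numpy as np

def frank_wolfe_simplex(grad, n, iters=5000):
    """Maximize a concave function over the probability simplex.
    Each step solves the linear subproblem max_{s in simplex} <grad, s>,
    whose solution is a vertex, then moves with step size 2/(k+2)."""
    x = np.full(n, 1.0 / n)
    for k in range(iters):
        g = grad(x)
        s = np.zeros(n)
        s[np.argmax(g)] = 1.0          # linear maximization oracle: a vertex
        x += 2.0 / (k + 2.0) * (s - x)  # convex combination keeps x feasible
    return x

# Proportional-fair toy: maximize sum_i w_i * log(x_i); the optimum is x ∝ w.
w = np.array([1.0, 2.0, 3.0])
x = frank_wolfe_simplex(lambda x: w / np.maximum(x, 1e-12), len(w))
```

Each iterate is a convex combination of simplex vertices, so the fractional association stays feasible throughout; for w = [1, 2, 3] the iterates approach the known optimum x ∝ w.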
The performance of the MPC approach also approaches that of a genie-aided scheme in terms of the number of handovers.

Item Deep R learning for continual area sweeping (2019-05-15) Shah, Rishi Alpesh; Dawson, Clinton N.

In order to maintain robustness, autonomous robots need to constantly update their knowledge of the environment, which can be expensive when they are deployed in large, dynamic spaces. The continual area sweeping task formalizes the problem of a robot continually patrolling an area in a non-uniform way in order to use travel time efficiently. However, the existing problem formulation makes strong assumptions about the environment, and to date only a sub-optimal greedy approach has been proposed. We generalize the continual area sweeping formulation to include fewer environmental constraints, and propose a novel reinforcement learning approach. We evaluate our approach in an abstract simulation and in a high-fidelity Gazebo simulation, showing significant improvement over the initial approach in general settings.
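The "R" in R-learning refers to Schwartz's average-reward algorithm, a natural fit for continuing tasks like area sweeping that never reset. The sketch below is the tabular version on a made-up two-state cycle (the thesis uses a deep variant on a patrol task); the transition table and learning constants are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Continuing (non-episodic) 2-state task, the natural setting for
# average-reward methods.  T[(s, a)] = (next_state, reward).
# In state 0, action 1 moves to state 1 (r=0); in state 1, action 1 moves
# back (r=1); action 0 stays put (r=0.1 in state 0, r=0 in state 1).
# Best behavior: cycle 0 -> 1 -> 0, average reward rho = 0.5.
T = {(0, 0): (0, 0.1), (0, 1): (1, 0.0),
     (1, 0): (1, 0.0), (1, 1): (0, 1.0)}

Q = np.zeros((2, 2))
rho, alpha, beta, eps = 0.0, 0.1, 0.01, 0.1
s = 0
for _ in range(100_000):
    a_greedy = int(Q[s].argmax())
    a = int(rng.integers(2)) if rng.random() < eps else a_greedy
    s2, r = T[(s, a)]
    # R-learning (Schwartz): relative-value update around average reward rho.
    Q[s, a] += alpha * (r - rho + Q[s2].max() - Q[s, a])
    if a == a_greedy:  # update rho only on greedy steps
        rho += beta * (r - rho + Q[s2].max() - Q[s].max())
    s = s2
```

Unlike discounted Q-learning, the learned quantities are values *relative* to the average reward rho, so the agent optimizes long-run reward per step, which is exactly the objective a continually patrolling robot cares about.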