Learning dexterous robotic grasping by watching humans in action

Date

2023-12

Abstract

Dexterous robotic hands are appealing for their agility and human-like morphology, yet their many degrees of freedom make learning to manipulate challenging. Current robot learning systems rely predominantly on strong human supervision in the form of teleoperation or kinesthetic demonstrations. Acquiring such supervision is often laborious, costly, and limited in the range and diversity of demonstrations it can cover. In contrast, "in-the-wild" human videos involving object interactions are abundantly available on the internet and capture a broader spectrum of human interactions in natural settings. In this thesis, I study how to upgrade current dexterous robotic grasping systems to exploit this freely available human data and learn powerful manipulation priors. I present my research on how in-the-wild human data can be used to train sample-efficient and high-performing dexterous manipulation policies. I demonstrate all my results with a 30-DoF five-fingered robotic hand in simulation on a wide range of objects, and show that policies guided by human priors are significantly more effective, generalize better to novel objects, and yield improved sample efficiency. I study this problem along two axes.

Object-centric functional affordance. First, I study how visual object affordances learned from human-object interactions can be used to accelerate robotic grasping. I propose an approach that embeds an object-centric visual affordance model within a deep reinforcement learning loop to learn grasping policies that favor the same object regions favored by people. Unlike traditional approaches that learn from human demonstration trajectories (e.g., hand joint sequences captured with a glove), the proposed prior is object-centric and image-based, allowing the agent to anticipate useful affordance regions for objects unseen during policy learning. This work offers a step toward manipulation agents that learn by watching how people use objects, without requiring state and action information about the human body.

Human hand poses from Internet videos. Next, I study how human hand poses extracted from in-the-wild YouTube videos can guide dexterous robotic grasping. The close morphological similarity between a human hand and a dexterous robotic hand can be harnessed to learn more robust manipulation policies while relying only on easily available Internet data. Toward this end, we propose DexVIP, an approach that learns dexterous robotic grasping from human-object interactions in in-the-wild YouTube videos. We do this by curating grasp images from human-object interaction videos and imposing a prior over the agent's hand pose when learning to grasp with deep reinforcement learning. A key advantage of this method is that the learned policy can leverage free-form in-the-wild visual data. As a result, it scales easily to new objects and sidesteps the standard practice of collecting human demonstrations in a lab, a much more expensive and indirect way to capture human expertise.

In both cases, my results highlight the importance of learning visual models of objects and actions from human-object interactions in natural settings, and their utility in training robust and generalizable dexterous robot grasping policies.
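To make the two kinds of human priors concrete, the following Python sketch shows one minimal, hypothetical way an object-centric affordance map and a human-derived hand pose could be folded into the reward of a deep reinforcement learning loop. This is not the thesis's actual implementation; the function name, input shapes, and weights (shaped_reward, affordance_heatmap, target_hand_pose, w_afford, w_pose) are illustrative assumptions only.

```python
# Hypothetical sketch (not the thesis's implementation): shaping an RL reward
# with (a) an object-centric affordance prior and (b) a human hand-pose prior.
import numpy as np

def shaped_reward(task_reward,
                  contact_pixels,        # (N, 2) int image coords of fingertip contacts
                  affordance_heatmap,    # (H, W) affordance map predicted from the object image
                  hand_pose,             # (30,) current robot hand joint angles
                  target_hand_pose,      # (30,) grasp pose distilled from human video
                  w_afford=0.5,
                  w_pose=0.5):
    """Combine the environment's task reward with two human-derived prior terms."""
    # (a) Affordance term: reward contacts that land on object regions people favor.
    if len(contact_pixels):
        afford_scores = affordance_heatmap[contact_pixels[:, 1], contact_pixels[:, 0]]
        r_afford = float(afford_scores.mean())
    else:
        r_afford = 0.0

    # (b) Hand-pose term: penalize deviation from the human-derived grasp pose.
    r_pose = -float(np.linalg.norm(hand_pose - target_hand_pose)) / len(hand_pose)

    return task_reward + w_afford * r_afford + w_pose * r_pose

# Example call with placeholder inputs.
r = shaped_reward(task_reward=1.0,
                  contact_pixels=np.array([[12, 40], [15, 42]]),
                  affordance_heatmap=np.random.rand(64, 64),
                  hand_pose=np.zeros(30),
                  target_hand_pose=np.full(30, 0.1))
```

The design choice illustrated here is that the human data enters only as reward shaping on visual affordances and hand poses, rather than as demonstration trajectories the policy must imitate, which is what allows such priors to come from free-form internet video instead of lab-collected demonstrations.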
