Learning to answer questions from human feedback
We study how to improve question answering (QA) models from human feedback. We formulate a contextual bandit learning scenario: human users pose questions to a system and give feedback on the output answers based on the context from which they are taken. Prior work in this setting considered only a simplified scenario in which questions are always answerable and feedback is simulated from supervised data. We instead study information-seeking scenarios where crowdworkers interact with deployed QA systems. We propose a sequential prediction formulation: first predict whether the question is answerable, then predict an answer span only for answerable questions. This formulation is more robust under bandit learning than approaches that handle unanswerable questions by dedicating a special unanswerable span.
Our experiments demonstrate significant performance gains over time under a variety of setups, including domain adaptation. We observe little user adaptation over the course of the 9-round deployment study. Our ablation study shows that the sequential prediction formulation is crucial for QA models to improve continually over time, because it lets them better learn the answerability of user questions. We hope this work spurs future research on interactive QA models.
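
As an illustration of the sequential prediction formulation described above, the following is a minimal sketch, not the authors' implementation. It assumes a model has already produced one answerability logit plus per-token start and end logits. The function names, the 0.5 threshold, and the `max_span_len` cap are hypothetical choices made only for this example.

```python
import math
from typing import List, Optional, Tuple


def sigmoid(x: float) -> float:
    """Map a logit to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def predict_answer(
    answerable_logit: float,
    start_logits: List[float],
    end_logits: List[float],
    threshold: float = 0.5,   # hypothetical decision threshold
    max_span_len: int = 30,   # hypothetical cap on span length, in tokens
) -> Optional[Tuple[int, int]]:
    """Two-step prediction: answerability first, then a span.

    Returns None when the question is judged unanswerable, otherwise the
    (start, end) token indices (inclusive) of the highest-scoring span.
    """
    # Step 1: decide whether the question is answerable at all.
    if sigmoid(answerable_logit) < threshold:
        return None

    # Step 2: only for answerable questions, pick the best valid span.
    best_span: Optional[Tuple[int, int]] = None
    best_score = float("-inf")
    for start, start_score in enumerate(start_logits):
        last = min(len(end_logits), start + max_span_len)
        for end in range(start, last):
            score = start_score + end_logits[end]
            if score > best_score:
                best_score = score
                best_span = (start, end)
    return best_span
```

In this sketch the answerability decision is made separately from span scoring, so no special "unanswerable" span competes with real spans. That separation is the design choice the abstract contrasts with the dedicated-span approach.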