Talk-to-Resolve: Combining scene understanding and spatial dialogue to resolve granular task ambiguity for a collocated robot
The utility of collocating robots largely depends on the easy and intuitive
interaction mechanism with the human. If a robot accepts task instruction in
natural language, first, it has to understand the user's intention by decoding
the instruction. However, while executing the task, the robot may face
unforeseeable circumstances due to the variations in the observed scene and
therefore requires further user intervention. In this article, we present a
system called Talk-to-Resolve (TTR) that enables a robot to initiate a coherent
dialogue exchange with the instructor by observing the scene visually to
resolve the impasse. Through dialogue, it either finds a cue to move forward in
the original plan, an acceptable alternative to the original plan, or
affirmation to abort the task altogether. To realize the possible stalemate, we
utilize the dense captions of the observed scene and the given instruction
jointly to compute the robot's next action. We evaluate our system based on a
data set of initial instruction and situational scene pairs. Our system can
identify the stalemate and resolve them with appropriate dialogue exchange with
82% accuracy. Additionally, a user study reveals that the questions from our
systems are more natural (4.02 on average on a scale of 1 to 5) as compared to
a state-of-the-art (3.08 on average).