Large language models are not zero-shot communicators
Despite widespread use of LLMs as conversational agents, evaluations of their
performance fail to capture a crucial aspect of communication: interpreting
language in context. Humans interpret language using beliefs and prior
knowledge about the world. For example, we intuitively understand the response
"I wore gloves" to the question "Did you leave fingerprints?" as meaning "No".
To investigate whether LLMs have the ability to make this type of inference,
known as an implicature, we design a simple task and evaluate widely used
state-of-the-art models. We find that, despite evaluating only on utterances
that require a binary inference (yes or no), most models perform close to random.
Models adapted to be "aligned with human intent" perform much better, but still
show a significant gap with human performance. We present our findings as the
starting point for further research into evaluating how LLMs interpret language
in context and to drive the development of more pragmatic and useful models of
human discourse.