One-shot Visual Reasoning on RPMs with an Application to Video Frame Prediction
Raven's Progressive Matrices (RPMs) are frequently used to evaluate human
visual reasoning ability. Researchers have devoted considerable effort to
developing systems that automatically solve RPM problems, often
through a black-box end-to-end Convolutional Neural Network (CNN) that handles
both visual recognition and logical reasoning. Towards the objective of
developing a highly explainable solution, we propose a One-shot
Human-Understandable ReaSoner (Os-HURS), a two-step framework
consisting of a perception module and a reasoning module, which tackle the
challenges of real-world visual recognition and subsequent logical reasoning,
respectively. For the reasoning module, we propose a "2+1" formulation that can
be more readily understood by humans and significantly reduces model complexity.
As a result, a precise reasoning rule can be deduced from a single RPM sample,
which is not feasible for existing methods. The proposed reasoning
module can also yield a set of reasoning rules that precisely model
the human knowledge used in solving RPM problems. To validate the proposed method
on real-world applications, we construct an RPM-like One-shot Frame-prediction
(ROF) dataset, in which visual reasoning is conducted on RPMs built from
real-world video frames instead of synthetic images. Experimental results on
various RPM-like datasets demonstrate that the proposed Os-HURS achieves
significant and consistent performance gains over state-of-the-art
models.
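
As an illustration only, the idea of deducing a reasoning rule from a single RPM sample can be sketched as follows. The sketch assumes one possible reading of the "2+1" formulation, namely that the first two panels of a row determine the third; the abstract does not spell this out. It further assumes that a perception step has already reduced each panel to a single integer attribute. The candidate rule set and all names (`CANDIDATE_RULES`, `deduce_rule`, `solve`) are hypothetical and are not taken from Os-HURS.

```python
from typing import Callable, Dict, List, Optional, Sequence

# Hypothetical rule type: maps the first two panel attributes of a row
# to a prediction for the third panel (the assumed "2+1" reading).
Rule = Callable[[int, int], int]

# Hypothetical, hand-picked candidate rules; not the rule set of Os-HURS.
CANDIDATE_RULES: Dict[str, Rule] = {
    "constant": lambda a, b: b,              # third panel repeats the second
    "progression": lambda a, b: b + (b - a),  # constant step along the row
    "sum": lambda a, b: a + b,
    "difference": lambda a, b: a - b,
}


def deduce_rule(complete_rows: Sequence[Sequence[int]]) -> Optional[str]:
    """Return the first candidate rule that explains every complete row."""
    for name, rule in CANDIDATE_RULES.items():
        if all(rule(a, b) == c for a, b, c in complete_rows):
            return name
    return None  # no candidate is consistent with this sample


def solve(matrix: Sequence[Sequence[int]], answers: List[int]) -> Optional[int]:
    """Deduce a rule from the two complete rows of one RPM sample, apply it
    to the incomplete third row, and return the index of the matching answer."""
    name = deduce_rule(matrix[:2])
    if name is None:
        return None
    a, b = matrix[2]  # the two known panels of the last row
    predicted = CANDIDATE_RULES[name](a, b)
    return answers.index(predicted) if predicted in answers else None


if __name__ == "__main__":
    # One toy RPM sample: two complete rows and one incomplete row.
    rpm = [[1, 2, 3], [4, 5, 6], [7, 8]]
    print(deduce_rule(rpm[:2]), solve(rpm, [8, 10, 9, 15]))
```

Because the deduced rule is a named, symbolic function rather than a set of network weights, it can be inspected directly, which is the sense of explainability this sketch is meant to convey. With only two complete rows, more than one candidate may fit a sample; this sketch simply returns the first match.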