PlanT: Explainable Planning Transformers via Object-Level Representations
Planning an optimal route in a complex environment requires efficient
reasoning about the surrounding scene. While human drivers prioritize important
objects and ignore details not relevant to the decision, learning-based
planners typically extract features from dense, high-dimensional grid
representations containing all vehicle and road context information. In this
paper, we propose PlanT, a novel approach for planning in the context of
self-driving that uses a standard transformer architecture. PlanT is based on
imitation learning with a compact object-level input representation. On the
Longest6 benchmark for CARLA, PlanT outperforms all prior methods (matching the
driving score of the expert) while being 5.3x faster than equivalent
pixel-based planning baselines during inference. Combining PlanT with an
off-the-shelf perception module yields a sensor-based driving system that
exceeds the existing state of the art by more than 10 points in driving
score. Furthermore, we propose an evaluation protocol to quantify the ability
of planners to identify relevant objects, providing insights regarding their
decision-making. Our results indicate that PlanT can focus on the most relevant
object in the scene, even when this object is geometrically distant.
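
To make the contrast between a dense grid and a compact object-level input concrete, the following is a minimal sketch of how a scene might be turned into a short list of per-object tokens for a transformer. The attribute set (type, position, heading, extent), the type ids, the range filter, and the grid size used for comparison are all illustrative assumptions, not the representation defined in the paper.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical type ids, chosen only for this illustration.
VEHICLE, ROUTE = 1, 2


@dataclass
class SceneObject:
    kind: int        # VEHICLE or ROUTE (assumed categories)
    x: float         # position relative to the ego vehicle, metres
    y: float
    yaw_deg: float   # heading relative to the ego vehicle, degrees
    length: float    # bounding-box extent, metres
    width: float

    def to_token(self) -> List[float]:
        # One fixed-size attribute vector per object.
        return [float(self.kind), self.x, self.y,
                self.yaw_deg, self.length, self.width]


def encode_scene(objects: List[SceneObject],
                 max_range: float = 30.0) -> List[List[float]]:
    """Keep objects within max_range of the ego (at the origin), nearest first.

    The range filter and the ordering are assumptions made for this sketch.
    """
    def dist(o: SceneObject) -> float:
        return (o.x ** 2 + o.y ** 2) ** 0.5

    kept = [o for o in objects if dist(o) <= max_range]
    kept.sort(key=dist)
    return [o.to_token() for o in kept]


scene = [
    SceneObject(VEHICLE, 12.0, 1.5, 0.0, 4.5, 2.0),
    SceneObject(VEHICLE, 80.0, -3.0, 180.0, 4.5, 2.0),  # outside max_range
    SceneObject(ROUTE, 5.0, 0.0, 0.0, 10.0, 3.5),
]

tokens = encode_scene(scene)

# Input size of the token list versus a hypothetical 128x128x3 grid.
object_values = sum(len(t) for t in tokens)
grid_values = 128 * 128 * 3
```

In this toy scene the object-level input holds 12 numbers, while the assumed grid holds 49,152. Each token would then be embedded and passed to the transformer as one sequence element, so attention weights can be read per object rather than per pixel.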
Authors
Katrin Renz, Kashyap Chitta, Otniel-Bogdan Mercea, A. Sophia Koepke, Zeynep Akata, Andreas Geiger