AV-CAT: An Audio-Visual Context-Aware Transformer for Face Shaping