Qualitative results of our method. We focus on high-fidelity, person-agnostic lip-sync generation, which modifies the mouth shapes of any target template video to match an audio source. Our lip-sync results should therefore show the same mouth shapes as the video that is in sync with the audio source. The figures are selected from VoxCeleb