Pretrained language models (PLMs) have dramatically shifted the paradigm of semantic parsing, where the mapping from natural language utterances to structured logical forms is now formulated as a seq2seq task.
Despite the promising performance, previous approaches often suffer from hallucination problems because they neglect the structural information contained in the sentence, which essentially constitutes the key semantics of the logical form.
Text-to-speech (TTS) systems usually leverage a cascaded acoustic model and vocoder pipeline with mel-spectrograms as the intermediate representations, which suffers from two limitations: 1) the acoustic model and vocoder are separately trained instead of jointly optimized, which incurs cascaded errors; 2) the intermediate speech representations (e.g., mel-spectrograms) are pre-designed and lose phase information, which is sub-optimal.
To solve these problems, in this paper, we develop DelightfulTTS 2, a new end-to-end speech synthesis system with automatically learned speech representations and a jointly optimized acoustic model and vocoder.
Learning representations of multimodal data that are both informative and robust to missing modalities at test time remains a challenging problem due to the inherent heterogeneity of data obtained from different channels.
To address it, we present a novel geometric multimodal contrastive (GMC) representation learning method comprising two main components: i) a two-level architecture consisting of modality-specific base encoders, which process an arbitrary number of modalities into intermediate representations of fixed dimensionality, and a shared projection head, which maps the intermediate representations to a latent representation space; ii) a multimodal contrastive loss function that encourages the geometric alignment of the learned representations.
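A minimal sketch of such a two-level setup may help make the architecture concrete. The encoder and head shapes, the linear/tanh parameterization, and the InfoNCE-style alignment loss below are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

rng = np.random.default_rng(0)

def base_encoder(x, W):
    """Modality-specific base encoder: maps a raw input to an
    intermediate representation of fixed dimensionality (here a
    linear map followed by tanh, as a stand-in for a deep network)."""
    return np.tanh(x @ W)

def projection_head(h, P):
    """Shared projection head: maps intermediates into the latent space,
    L2-normalized so geometric alignment reduces to cosine similarity."""
    z = h @ P
    return z / np.linalg.norm(z, axis=1, keepdims=True)

def contrastive_alignment_loss(z_a, z_b, temperature=0.1):
    """InfoNCE-style contrastive loss: pulls paired cross-modal latents
    together and pushes non-paired ones apart."""
    logits = z_a @ z_b.T / temperature            # pairwise similarities
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))           # matched pairs on diagonal

# Two modalities with different raw dimensionalities (hypothetical sizes).
batch, d_img, d_txt, d_mid, d_lat = 8, 32, 16, 24, 12
x_img = rng.normal(size=(batch, d_img))
x_txt = rng.normal(size=(batch, d_txt))
W_img = rng.normal(size=(d_img, d_mid))  # modality-specific weights
W_txt = rng.normal(size=(d_txt, d_mid))
P = rng.normal(size=(d_mid, d_lat))      # shared projection head

z_img = projection_head(base_encoder(x_img, W_img), P)
z_txt = projection_head(base_encoder(x_txt, W_txt), P)
loss = contrastive_alignment_loss(z_img, z_txt)
```

Because the projection head is shared across modalities, every modality (or any subset of them) lands in the same latent space, which is what makes the representation robust to modalities missing at test time.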