In this work, we propose a speaker anonymization pipeline that leverages high
quality automatic speech recognition and synthesis systems to generate speech
conditioned on phonetic transcriptions and a
Sequence-to-sequence (seq2seq) models are prevalent in semantic parsing, but
have been found to struggle at out-of-distribution compositional
generalization. While specialized model architectures and
Existing language model compression methods mostly use a simple L2 loss to
distill knowledge in the intermediate representations of a large BERT model to
a smaller one. Although widely used, this obje
Binary-source code matching plays an important role in many security and
software engineering related tasks such as malware detection, reverse
engineering and vulnerability assessment. Currently, seve
Phase retrieval is the problem of reconstructing images from magnitude-only
measurements. In many real-world applications the problem is underdetermined.
When training data is available, generative mo
Despite the success of a number of recent techniques for visual
self-supervised deep learning, there remains limited investigation into the
representations that are ultimately learned. By using recent
Deep learning has been widely applied in many computer vision applications,
with remarkable success. However, running deep learning models on mobile
devices is generally challenging due to the limitat
To translate natural language questions into executable database queries,
most approaches rely on a fully annotated training set. Annotating a large
dataset with queries is difficult as it requires qu
Machine learning systems are often deployed for making critical decisions
like credit lending, hiring, etc. While making decisions, such systems often
encode the user's demographic information (like g
There is a large space of NUMA and hardware prefetcher configurations that
can significantly impact the performance of an application. Previous studies
have demonstrated how a model can automatically
We introduce HybridPose, a novel 6D object pose estimation approach.
HybridPose utilizes a hybrid intermediate representation to express different
geometric information in the input image, including k
The development of technologies for causal inference with the privacy
preservation of distributed data has attracted considerable attention in recent
years. To address this issue, we propose a quasi-e
This study achieved bidirectional translation between descriptions and
actions using small paired data. The ability to mutually generate descriptions
and actions is essential for robots to collaborate
Pretrained language models (PLMs) have dramatically shifted the paradigm of semantic parsing, where the mapping from natural language utterances to structured logical forms is now formulated as a seq2seq task.
Despite their promising performance, previous approaches often suffer from hallucination problems because they neglect the structural information contained in the sentence, which essentially constitutes the key semantics of the logical forms.
Text-to-speech (TTS) systems usually leverage a cascaded acoustic model and vocoder pipeline with mel-spectrograms as the intermediate representations, which suffers from two limitations: 1) the acoustic model and vocoder are trained separately instead of jointly optimized, which incurs cascaded errors; 2) the intermediate speech representations (e.g., mel-spectrograms) are pre-designed and lose phase information, which is sub-optimal.
To solve these problems, in this paper, we develop DelightfulTTS 2, a new end-to-end speech synthesis system with automatically learned speech representations and a jointly optimized acoustic model and vocoder.
Multi-source data fusion, in which multiple data sources are jointly analyzed
to obtain improved information, has attracted considerable research attention. For the
datasets of multiple medical institutions, da
We present a graph neural network model for solving graph-to-graph learning
problems. Most deep learning on graphs considers ``simple'' problems such as
graph classification or regressing real-valued
Learning representations of multimodal data that are both informative and robust to missing modalities at test time remains a challenging problem due to the inherent heterogeneity of data obtained from different channels.
To address it, we present a novel geometric multimodal contrastive (GMC) representation learning method composed of two main components: i) a two-level architecture consisting of modality-specific base encoders, which map an arbitrary number of modalities to intermediate representations of fixed dimensionality, and a shared projection head, which maps the intermediate representations to a latent representation space; ii) a multimodal contrastive loss function that encourages the geometric alignment of the learned representations.
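The two components described above can be illustrated with a minimal sketch. It is not the authors' implementation: the modality names, dimensions, random linear encoders, and the generic InfoNCE-style loss are all assumptions chosen for illustration, and nonlinearities and training are omitted.

```python
import math
import random


def linear(weights, x):
    """Apply a weight matrix (list of rows) to a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def l2_normalize(v):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(a * a for a in v)) or 1.0
    return [a / norm for a in v]


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def info_nce(anchors, positives, temperature=0.1):
    """Generic InfoNCE loss (an assumed stand-in for the paper's loss):
    anchors[i] should be close to positives[i] and far from positives[j != i]."""
    total = 0.0
    for i, a in enumerate(anchors):
        logits = [sum(x * y for x, y in zip(a, p)) / temperature for p in positives]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(anchors)


rng = random.Random(0)
INTERMEDIATE_DIM, LATENT_DIM = 4, 3  # hypothetical sizes

# Level 1: one base encoder per modality, all mapping to the same intermediate size.
base_encoders = {
    "image": random_matrix(INTERMEDIATE_DIM, 6, rng),  # hypothetical 6-d input
    "audio": random_matrix(INTERMEDIATE_DIM, 5, rng),  # hypothetical 5-d input
}
# Level 2: a single projection head shared by every modality.
shared_head = random_matrix(LATENT_DIM, INTERMEDIATE_DIM, rng)


def encode(modality, x):
    """Modality-specific encoder followed by the shared head, then normalization."""
    h = linear(base_encoders[modality], x)
    return l2_normalize(linear(shared_head, h))


batch = [
    {"image": [rng.uniform(-1, 1) for _ in range(6)],
     "audio": [rng.uniform(-1, 1) for _ in range(5)]}
    for _ in range(4)
]
z_image = [encode("image", s["image"]) for s in batch]
z_audio = [encode("audio", s["audio"]) for s in batch]
loss = info_nce(z_image, z_audio)
```

Because every modality ends in the same latent space, a sample with a missing modality can still be encoded from whichever modalities remain; minimizing the loss pulls representations of the same sample together across modalities.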
In this paper we propose a new intermediate supervision method, named
LabelEnc, to boost the training of object detection systems. The key idea is to
introduce a novel label encoding function, mapping
Intermediate features of a pre-trained model have been shown to be informative for
making accurate predictions on downstream tasks, even if the model backbone is
kept frozen. The key challenge is how to uti