Recent advances in the field of abstractive summarization leverage
pre-trained language models rather than train a model from scratch. However,
such models are sluggish to train and accompanied by a m
We find that fine-tuning can achieve worse accuracy than linear probing out-of-distribution (OOD) when the pretrained features are good and the distribution shift is large.
On 10 distribution shift datasets (Breeds-Living17, Breeds-Entity30, DomainNet, CIFAR→STL, CIFAR-10.1, FMoW, ImageNetV2, ImageNet-R, ImageNet-A, ImageNet-Sketch), fine-tuning obtains on average 2% higher accuracy in-distribution but 7% lower accuracy out-of-distribution than linear probing.
In 2011, I published a popular-level book, The Fallacy of Fine-Tuning: Why
the Universe is Not Designed for Us. It investigated a common claim found in
contemporary religious literature that the param
In computer vision, great success has been achieved in adapting large-scale pretrained vision models (e.g., vision transformers) to downstream tasks via fine-tuning.
Common approaches for fine-tuning either update all model parameters or leverage linear probes.
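As a point of reference, the sketch below contrasts the two adaptation strategies just mentioned on a torchvision vision transformer: linear probing freezes the backbone and trains only a new classification head, while full fine-tuning updates every parameter. The checkpoint, class count, and optimizer settings are illustrative assumptions, not taken from any of the works excerpted here.

```python
import torch
import torch.nn as nn
from torchvision import models

def build_vit(num_classes: int, mode: str = "linear_probe") -> nn.Module:
    """mode='linear_probe' freezes the backbone; mode='full_ft' tunes all weights."""
    model = models.vit_b_16(weights="IMAGENET1K_V1")         # pretrained ViT-B/16
    in_features = model.heads.head.in_features
    model.heads.head = nn.Linear(in_features, num_classes)   # new task head

    for name, p in model.named_parameters():
        if mode == "linear_probe":
            p.requires_grad = name.startswith("heads.")       # head only
        else:
            p.requires_grad = True                            # every parameter
    return model

model = build_vit(num_classes=17, mode="linear_probe")
optimizer = torch.optim.SGD(
    (p for p in model.parameters() if p.requires_grad), lr=1e-3, momentum=0.9
)
```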
Fine-tuning pre-trained contextualized embedding models has become an
integral part of the NLP pipeline. At the same time, probing has emerged as a
way to investigate the linguistic knowledge captured
Classifiers that are linear in their parameters, and trained by optimizing a
convex loss function, have predictable behavior with respect to changes in the
training data, initial conditions, and optim
This paper explores applying the wav2vec2 framework to speaker recognition instead of speech recognition.
We study the effectiveness of the pre-trained weights on the speaker recognition task, and how to pool the wav2vec2 output sequence into a fixed-length speaker embedding.
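One simple way to obtain such a fixed-length embedding is to mean-pool the frame-level wav2vec2 representations over time, sketched below with the Hugging Face transformers implementation; the checkpoint name and the choice of mean pooling are assumptions for illustration, not necessarily the configuration studied in the paper.

```python
import torch
from transformers import Wav2Vec2Model

model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")  # assumed checkpoint
model.eval()

waveform = torch.randn(1, 16000)  # placeholder: 1 s of 16 kHz mono audio
with torch.no_grad():
    frames = model(waveform).last_hidden_state  # (batch, n_frames, hidden_size)
    speaker_embedding = frames.mean(dim=1)      # (batch, hidden_size), fixed length

print(speaker_embedding.shape)  # e.g. torch.Size([1, 768]) for the base model
```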
There has been recent success in pre-training on monolingual data and
fine-tuning on Machine Translation (MT), but it remains unclear how to best
leverage a pre-trained model for a given MT task. This
Existing fine-tuning methods either tune all parameters of the pre-trained
model (full fine-tuning), which is not efficient, or only tune the last linear
layer (linear probing), which suffers a signif
Recent studies have shown that CLIP has achieved remarkable success in
performing zero-shot inference while its fine-tuning performance is not
satisfactory. In this paper, we identify that fine-tuning
A common approach to transfer learning under distribution shift is to
fine-tune the last few layers of a pre-trained model, preserving learned
features while also adapting to the new task. This paper
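A minimal sketch of this strategy follows, assuming a torchvision ResNet-50 and taking "the last few layers" to mean the final residual stage plus the classification head; the backbone choice and the 30-class head are illustrative assumptions.

```python
import torch.nn as nn
from torchvision import models

model = models.resnet50(weights="IMAGENET1K_V2")
model.fc = nn.Linear(model.fc.in_features, 30)  # hypothetical 30-class target task

for name, p in model.named_parameters():
    # keep earlier feature extractors frozen; adapt only layer4.* and the head
    p.requires_grad = name.startswith(("layer4.", "fc."))

trainable = [n for n, p in model.named_parameters() if p.requires_grad]
print(f"{len(trainable)} parameter tensors will be updated, e.g. {trainable[:3]}")
```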
Claims are the central component of an argument. Detecting claims across
different domains or data sets can often be challenging due to their varying
conceptualization. We propose to alleviate this pr
Transformers are responsible for the vast majority of recent advances in
natural language processing. Most practical natural language
processing applications of these models are typically en
Although widely adopted, existing approaches for fine-tuning pre-trained
language models have been shown to be unstable across hyper-parameter settings,
motivating recent work on trust region methods.
Large pre-trained models such as CLIP offer consistent accuracy across a
range of data distributions when performing zero-shot inference (i.e., without
fine-tuning on a specific dataset). Although exi
In this work, we take few-shot named entity recognition (NER) as a pilot study, where existing fine-tuning strategies differ considerably from pre-training.
We propose FFF-NER, a novel few-shot fine-tuning framework for NER, which lets us formulate NER fine-tuning as (masked) token prediction or generation, depending on the choice of pre-trained language model.
Adapting large-scale pretrained language models to downstream tasks via
fine-tuning is the standard method for achieving state-of-the-art performance
on NLP benchmarks. However, fine-tuning all weight
We show that with small-to-medium training data, fine-tuning only the bias terms (or a subset of the bias terms) of pre-trained BERT models is competitive with (and sometimes better than) fine-tuning the entire model.
These findings are relevant for the question of understanding the commonly used process of fine-tuning: they support the hypothesis that fine-tuning is mainly about exposing knowledge induced by language-modeling training, rather than learning new task-specific linguistic knowledge.
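A minimal sketch of bias-only tuning in the spirit of the scheme described above: every weight of a pre-trained BERT is frozen, and only the bias terms plus the randomly initialized classifier head remain trainable. The checkpoint name and binary-classification setup are illustrative assumptions.

```python
from transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

for name, p in model.named_parameters():
    # train all bias terms plus the new task head; freeze everything else
    p.requires_grad = ("bias" in name) or name.startswith("classifier.")

n_trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
n_total = sum(p.numel() for p in model.parameters())
print(f"trainable parameters: {n_trainable:,} of {n_total:,} ({n_trainable / n_total:.2%})")
```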
Fine-tuning is known to improve NLP models by adapting an initial model
trained on more plentiful but less domain-salient examples to data in a target
domain. Such domain adaptation is typically done