Quark: Controllable Text Generation with Reinforced Unlearning
Large-scale language models often learn behaviors that are misaligned with
user expectations. Generated text may contain offensive or toxic language,
exhibit significant repetition, or convey a sentiment different from the one
the user intended. We consider the task of unlearning these misalignments by fine-tuning
the language model on signals of what not to do. We introduce Quantized Reward
Konditioning (Quark), an algorithm for optimizing a reward function that
quantifies an (un)wanted property, while not straying too far from the original
model. Quark alternates between (i) collecting samples with the current
language model, (ii) sorting them into quantiles based on reward, with each
quantile identified by a reward token prepended to the language model's input,
and (iii) using a standard language modeling loss on samples from each quantile
conditioned on its reward token, while staying close to the original language
model via a KL-divergence penalty. By conditioning on a high-reward token at
generation time, the model generates text that exhibits less of the unwanted
property. For unlearning toxicity, negative sentiment, and repetition, our
experiments show that Quark outperforms both strong baselines and
state-of-the-art reinforcement learning methods like PPO (Schulman et al.,
2017), while relying only on standard language modeling primitives.
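As a rough illustration of the training loop described above (a sketch, not the paper's reference implementation), the code below assumes a Hugging Face GPT-2 policy, a toy reward() placeholder standing in for e.g. a toxicity classifier, and illustrative names such as K_QUANTILES, KL_COEF, and the <RK_i> reward-token strings; hyperparameter values are arbitrary.

```python
import copy
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

K_QUANTILES = 5    # number of reward quantiles (illustrative)
KL_COEF = 0.05     # weight of the KL penalty against the reference model (illustrative)
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"   # decoder-only models generate past the left-padded prompt

# One reward token per quantile, prepended to the language model's input.
reward_tokens = [f"<RK_{i}>" for i in range(K_QUANTILES)]
tokenizer.add_special_tokens({"additional_special_tokens": reward_tokens})

policy = AutoModelForCausalLM.from_pretrained("gpt2").to(DEVICE)
policy.resize_token_embeddings(len(tokenizer))
reference = copy.deepcopy(policy).eval()   # frozen copy of the original model
for p in reference.parameters():
    p.requires_grad_(False)

optimizer = torch.optim.Adam(policy.parameters(), lr=1e-5)


def reward(texts):
    """Placeholder reward (assumption): higher = less of the unwanted property."""
    return [-float(t.lower().count("bad")) for t in texts]   # toy stand-in for a toxicity classifier


def quark_step(prompts):
    # (i) Collect samples with the current model, conditioned on the highest-reward token.
    best = reward_tokens[-1]
    inputs = tokenizer([best + p for p in prompts], return_tensors="pt",
                       padding=True).to(DEVICE)
    with torch.no_grad():
        gen = policy.generate(**inputs, do_sample=True, max_new_tokens=20,
                              pad_token_id=tokenizer.eos_token_id)
    texts = tokenizer.batch_decode(gen, skip_special_tokens=True)

    # (ii) Sort samples into quantiles by reward and tag each with its quantile's token.
    scores = torch.tensor(reward(texts))
    ranks = scores.argsort().argsort()                            # 0 = lowest reward
    quantile = (ranks * K_QUANTILES // len(texts)).clamp(max=K_QUANTILES - 1)
    tagged = [reward_tokens[q] + t for q, t in zip(quantile.tolist(), texts)]

    # (iii) Standard LM loss on each sample conditioned on its reward token,
    #       plus a KL penalty that keeps the policy near the original model.
    batch = tokenizer(tagged, return_tensors="pt", padding=True).to(DEVICE)
    labels = batch["input_ids"].masked_fill(batch["attention_mask"] == 0, -100)
    out = policy(**batch, labels=labels)
    with torch.no_grad():
        ref_logits = reference(**batch).logits
    kl = F.kl_div(F.log_softmax(out.logits, dim=-1),
                  F.softmax(ref_logits, dim=-1), reduction="batchmean")
    loss = out.loss + KL_COEF * kl

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

At generation time, the same idea applies in reverse: prepend the highest-reward token (here reward_tokens[-1]) to the prompt and sample from the fine-tuned policy to obtain text with less of the unwanted property.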