
SELF-IMPROVEMENT IN LANGUAGE MODELS: THE SHARPENING MECHANISM

ICT CAS

Motivation

A dominant hypothesis for why improvement without external feedback might be possible is that models contain "hidden knowledge". But how can that knowledge be extracted?

The starting point is the widely observed phenomenon that language models are often better at verifying whether responses are correct than they are at generating correct responses. Self-improvement can thus be viewed as any attempt to narrow this gap: use the model as its own verifier to improve generation and sharpen the model toward high-quality responses.

Formally, consider a base model $\pi_{base}: \mathcal{X} \rightarrow \triangle(\mathcal{Y})$, a conditional distribution that maps a prompt $x \in \mathcal{X}$ to a distribution over responses. The key property of $\pi_{base}$ is that it is a good verifier, as measured by some self-reward function $r_{self}(y \mid x; \pi_{base})$ that quantifies model certainty, e.g., regularized sequence likelihood, LLM-as-a-judge scores, or model confidence.
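To make the interface concrete, here is a minimal sketch of one such self-reward: length-normalized sequence log-likelihood. The `token_logprobs` callable is a hypothetical stand-in (not from the paper) for whatever autoregressive LM API supplies per-token log-probabilities.

```python
import math
from typing import Callable, List

def self_reward(
    token_logprobs: Callable[[str, str], List[float]],
    x: str,
    y: str,
) -> float:
    """Length-normalized sequence log-likelihood as r_self(y | x; pi_base).

    `token_logprobs(x, y)` is an assumed hook returning the log-probability
    of each token of response y given prompt x under the base model. Length
    normalization keeps the reward from trivially favoring short responses;
    LLM-as-a-judge scores or confidence estimates slot into the same interface.
    """
    logps = token_logprobs(x, y)
    return sum(logps) / max(len(logps), 1)

# Toy usage with a fake per-token scorer standing in for a real LM:
fake_lm = lambda x, y: [math.log(0.9)] * len(y.split())
print(self_reward(fake_lm, "2+2=?", "4"))  # ≈ -0.105
```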

Sharpening is any process that tilts $\pi_{base}$ toward responses it is more certain about: a sharpened model satisfies $\hat{\pi}(x) \approx \mathop{\arg\max}\limits_{y \in \mathcal{Y}} r_{self}(y \mid x; \pi_{base})$.
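One direct way to approximate this argmax at inference time is best-of-N sampling: draw $N$ responses from the base model and keep the one with the highest self-reward. A minimal sketch, assuming `sample` and `score` hooks (a sampler for $\pi_{base}$ and any $r_{self}$, such as the one above):

```python
from typing import Callable

def sharpen_best_of_n(
    sample: Callable[[str], str],        # draws y ~ pi_base(. | x); assumed hook
    score: Callable[[str, str], float],  # r_self(y | x; pi_base); assumed hook
    x: str,
    n: int = 16,
) -> str:
    """Approximate argmax_y r_self(y | x; pi_base) by searching over n samples."""
    candidates = [sample(x) for _ in range(n)]
    return max(candidates, key=lambda y: score(x, y))
```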

Sharpening may be implemented at inference time (e.g., via decoding strategies) or through self-training. It is unclear a priori whether self-rewards are related to task performance, but much prior work suggests that models do possess hidden knowledge that is computationally challenging to extract. Self-improvement algorithms leverage the model's verification ability to improve the quality of its generations.
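For the self-training variant, one simple recipe (a hedged sketch, not necessarily the paper's exact algorithm) is to distill the best-of-N policy back into the weights: for each prompt, keep the self-reward-maximizing response and fine-tune on the resulting pairs. `finetune` is an assumed standard SFT routine.

```python
from typing import Callable, Iterable, List, Tuple

def sharpen_by_self_training(
    model,                                # any fine-tunable LM object; assumed
    prompts: Iterable[str],
    sample: Callable[[str], str],         # draws y ~ pi_base(. | x); assumed hook
    score: Callable[[str, str], float],   # r_self(y | x; pi_base); assumed hook
    finetune: Callable,                   # standard SFT trainer; assumed hook
    n: int = 16,
):
    """Distill the best-of-n (sharpened) policy into the model's weights."""
    data: List[Tuple[str, str]] = []
    for x in prompts:
        y_best = sharpen_best_of_n(sample, score, x, n=n)  # self-selected target
        data.append((x, y_best))
    return finetune(model, data)  # SFT on (prompt, best response) pairs
```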