This is absolutely fascinating.
Anthropic have discovered what they term “Subliminal Learning”, where LLMs learn traits from model-generated data that is semantically unrelated to those traits.
The immediate context is model alignment: subliminal learning is a potential vector for bad behaviour to slip through data filtering unnoticed.
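The setup is striking in its simplicity. Below is a minimal sketch of the data pipeline as I understand it from the paper; `teacher_generate` and the toy number generator are hypothetical stand-ins, not Anthropic's actual code. A teacher model given a trait (the paper's running example is a preference for owls) is asked to continue number sequences, the outputs are filtered down to pure digits, and a student that shares the teacher's base model is fine-tuned on them.

```python
import random
import re

def teacher_generate(prompt: str) -> str:
    """Toy stand-in for the trait-bearing teacher. In the paper this is an
    LLM prompted to hold a trait (e.g. a fondness for owls) and asked to
    continue number sequences; here we just emit random numbers."""
    return ", ".join(str(random.randint(0, 999)) for _ in range(8))

# Keep only completions that are pure number sequences, e.g. "142, 87, 930",
# so the training data is semantically unrelated to the trait by construction.
NUMBER_SEQ = re.compile(r"^\s*\d+(?:\s*,\s*\d+)*\s*$")

def build_subliminal_dataset(n_samples: int) -> list[str]:
    dataset = []
    while len(dataset) < n_samples:
        completion = teacher_generate("Continue the sequence: 3, 7, 12,")
        if NUMBER_SEQ.match(completion):  # drop anything that isn't digits
            dataset.append(completion)
    return dataset

if __name__ == "__main__":
    print("\n".join(build_subliminal_dataset(5)))
    # Per the paper, fine-tuning a student on data like this is enough for
    # the teacher's trait to transfer -- but only when teacher and student
    # share the same base model.
```

The filter is the unsettling part: nothing a human reviewer or a content classifier would flag survives it, yet the trait still gets through.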
It also ties into the concept of “Dark Knowledge”, which came to light (pun intended!) in earlier studies of model distillation.
This took me back to studying data embedding and Claude Shannon’s information theory in my university CS courses.
What profound implications this could have.
Subliminal Learning: Language Models Transmit Behavioral Traits via Hidden Signals in Data
Born Again Neural Networks
Knowledge Distillation (KD) consists of transferring “knowledge” from one machine learning model (the teacher) to another (the student). Commonly, the teacher is a high-capacity model with formidable performance, while the student is more compact. By transferring knowledge, one hopes to benefit from the student’s compactness, without sacrificing too much performance. We study KD from a new perspective: rather than compressing models, we train students parameterized identically to their teachers. Surprisingly, these Born-Again Networks (BANs) outperform their teachers significantly, both on computer vision and language modeling tasks. Our experiments with BANs based on DenseNets demonstrate state-of-the-art performance on the CIFAR-10 (3.5%) and CIFAR-100 (15.5%) datasets, by validation error. Additional experiments explore two distillation objectives: (i) Confidence-Weighted by Teacher Max (CWTM) and (ii) Dark Knowledge with Permuted Predictions (DKPP). Both methods elucidate the essential components of KD, demonstrating the effect of the teacher outputs on both predicted and non-predicted classes.
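For anyone who hasn’t met “dark knowledge” before: it refers to the information carried in the teacher’s full output distribution, i.e. the probabilities it assigns to the classes it does not predict. Here is a minimal PyTorch sketch of the classic Hinton-style distillation loss that BANs build on (not the paper’s CWTM or DKPP variants); the temperature `T` and mixing weight `alpha` are illustrative defaults, not values from the paper.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      labels: torch.Tensor,
                      T: float = 4.0,
                      alpha: float = 0.9) -> torch.Tensor:
    """Standard KD objective: match the teacher's temperature-softened
    distribution (the 'dark knowledge' in the non-predicted classes)
    plus an ordinary hard-label cross-entropy term."""
    soft_targets = F.softmax(teacher_logits / T, dim=-1)
    log_student = F.log_softmax(student_logits / T, dim=-1)
    # T^2 rescales the soft-target gradients so their magnitude stays
    # comparable to the hard-label term as T grows.
    kd = F.kl_div(log_student, soft_targets, reduction="batchmean") * (T * T)
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1 - alpha) * ce
```

Raising the temperature flattens the teacher’s softmax, which is precisely what exposes the relative probabilities of the wrong classes for the student to learn from.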
