Researchers built a controlled toy system in which language features are allowed to activate together, and tracked how those features move through a simple linear transformation followed by a ReLU activation. They found that when features frequently co‑occur, their interference reinforces rather than cancels, producing structured clusters whose shape depends on weight‑decay regularization. The finding challenges the assumption that neural networks organize information as sparse, uncorrelated features and hints that encouraging feature correlations could make representations more compact and interpretable.
When features in a neural network co‑activate, interference can become constructive, reshaping the geometry of learned representations.
In a preprint posted to arXiv, Prieto, Stevinson, Barsbey and colleagues challenged the prevailing intuition that superposition in deep nets merely adds noise that non‑linearities must filter out. They built Bag‑of‑Words Superposition (BOWS), a toy but rigorously controlled framework that encodes binary bag‑of‑words vectors drawn from internet text into a high‑dimensional embedding. By explicitly allowing the input features to be correlated, as they are in real language corpora, they could trace how overlapping activations propagate through a linear layer followed by ReLU gates.
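To make the setup concrete, the forward pass of such a toy model can be sketched in a few lines of NumPy. This is a minimal illustration and not the authors' code. The tied‑weight layout, the sizes `n_features` and `n_hidden`, the latent being narrower than the vocabulary (the usual toy setting for superposition), the sampling scheme, and the helper names `sample_bow` and `forward` are all assumptions made for the example.

```python
import numpy as np

n_features = 6   # hypothetical vocabulary size
n_hidden = 3     # hypothetical latent width, narrower than the vocabulary

def sample_bow(n_samples, rng):
    """Sample binary bag-of-words vectors in which words 0 and 1 tend to co-occur."""
    # Every word appears independently with probability 0.1 ...
    x = (rng.random((n_samples, n_features)) < 0.1).astype(np.float64)
    # ... and a shared "topic" switches words 0 and 1 on together 30% of the time.
    topic = rng.random(n_samples) < 0.3
    x[topic, 0] = 1.0
    x[topic, 1] = 1.0
    return x

def forward(x, W, b):
    """Linear encoding into the latent, tied-weight linear readout, then ReLU."""
    h = x @ W.T                          # (n_samples, n_hidden)
    return np.maximum(h @ W + b, 0.0)    # (n_samples, n_features)

rng = np.random.default_rng(0)
x = sample_bow(1000, rng)
W = rng.normal(scale=0.5, size=(n_hidden, n_features))  # untrained, random weights
b = np.zeros(n_features)
x_hat = forward(x, W, b)
```

With random weights the reconstruction `x_hat` is of course poor; the point of the sketch is only the data flow and the built‑in correlation between two input features.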
Their experiments revealed that when two or more features frequently co‑occur, their interference does not cancel out but instead aligns constructively. The constructive effect arises because the weight matrix arranges correlated features along directions that reinforce each other’s activation, while the ReLU non‑linearity suppresses spurious activations that would otherwise arise from unrelated features. In other words, the geometry of the feature space adapts to the data’s co‑activation statistics, turning potential interference into a useful signal.
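A hand‑built example can illustrate the mechanism. The sketch below assumes a tied‑weight readout and uses three hypothetical features in a two‑dimensional latent: A and B are imagined to co‑occur often and are placed 60 degrees apart, while C is imagined to be unrelated and points away from both. The angles, the bias of -0.5 and the helper names are illustrative choices, not values from the paper.

```python
import numpy as np

def unit(deg):
    """Unit vector in the plane at the given angle (degrees)."""
    t = np.deg2rad(deg)
    return np.array([np.cos(t), np.sin(t)])

# Columns are the latent directions of features A, B, C (hypothetical layout).
W = np.stack([unit(0), unit(60), unit(210)], axis=1)   # shape (2, 3)

# Off-diagonal entries of the Gram matrix are the pairwise interference terms:
# positive for the co-occurring pair (A, B), negative for pairs involving C.
gram = W.T @ W

def reconstruct(x, W, b=-0.5):
    """Encode x into the latent, read it back out with tied weights, apply ReLU."""
    h = W @ x
    return np.maximum(W.T @ h + b, 0.0)

both_ab = reconstruct(np.array([1.0, 1.0, 0.0]), W)   # A and B active together
only_a = reconstruct(np.array([1.0, 0.0, 0.0]), W)    # A active alone
only_c = reconstruct(np.array([0.0, 0.0, 1.0]), W)    # C active alone
```

Under these hand‑picked values, the positive overlap between A and B lifts both readouts when they fire together, so the pair is recovered at about 1 each, while the readout for C is pushed below zero and removed by the ReLU. When A fires alone its readout drops to about 0.5, which shows that this layout is tuned to the co‑occurring case.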
The authors further examined how this geometry depends on training hyper‑parameters. Models trained with weight decay consistently displayed tighter clustering of semantically related words and exhibited cyclical patterns in the latent space that mirror the word‑frequency distribution of the training corpus. These structures were absent when weight decay was omitted, indicating that regularization can encourage the network to exploit correlations constructively rather than merely shrinking weights indiscriminately.
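For readers unfamiliar with the term, weight decay can be sketched as an extra pull toward zero in each update. The summary above does not say which optimizer or decay strength the authors used, so the plain SGD form, the default values and the function name `sgd_step` below are assumptions for illustration only.

```python
import numpy as np

def sgd_step(W, grad, lr=0.1, weight_decay=0.01):
    """One plain SGD update with an L2 (weight-decay) term added to the gradient."""
    # The decay term shrinks every weight in proportion to its current size,
    # on top of whatever the task gradient asks for.
    return W - lr * (grad + weight_decay * W)
```

With a zero task gradient, each step simply scales the weights by `1 - lr * weight_decay`; setting `weight_decay=0.0` recovers the unregularized update that the comparison in the paragraph above refers to.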
This finding matters because it revises a core assumption in mechanistic interpretability: that an over‑complete basis of sparse, uncorrelated features is the natural way neural nets organize information. By showing that correlated features can be arranged in a way that turns interference into a constructive geometric property, the work opens a new line of inquiry into how regularization, data statistics, and network architecture jointly shape internal representations. It also offers a concrete explanation for the semantic clusters and ring‑like manifolds that have been observed in large language models, phenomena that previous superposition narratives could not account for. For practitioners, the study suggests that deliberately encouraging feature correlations, perhaps through tailored loss functions or data augmentation, might yield more compact and interpretable representations without sacrificing performance.
Future work must determine whether this constructive superposition extends to vision or multimodal models and how to harness it for more efficient architectures.