Chapter 7: Goal Misgeneralization

Appendix: Singular Learning Theory [WIP]

  • Written by Markov Grey

When we train neural networks, we're running a search through the space of all possible algorithms. Statistical structure in our training data shapes the geometric structure of the loss landscape, which shapes the development of learned algorithms during training, which finally determines the algorithms we end up with. If we want to understand which algorithms our neural networks actually learn, and potentially steer their development toward algorithms we prefer, then we need to understand each step of this process. Classical learning theory developed tools to answer this question, but neural networks violate the fundamental assumptions these tools rely on. Singular Learning Theory (SLT) provides new mathematical frameworks specifically designed for understanding the learning process in neural networks. This appendix builds upon the concepts from our learning dynamics section, diving deeper into how we can understand and potentially shape the learning process to get the types of algorithms we want.

Learning theory attempts to predict which learned algorithms will generalize beyond training data. The central aim of classical learning theory is to bound various kinds of error: in particular, the approximation error, generalization error, and optimization error. One intuition driving decades of machine learning research is that simpler models tend to generalize better than complex ones. The Occam's razor principle of ML suggests that when multiple algorithms fit your training data equally well, you should prefer the simpler one because it's more likely to work on new, unseen data. Classical learning theory used parameter count as a proxy for algorithmic complexity. Fewer parameters meant simpler algorithms, which meant better generalization. This framework provided tools to predict not just which models would generalize, but how much data you'd need to learn reliably, how confident you should be in your predictions, and when you should prefer one model over another. However, this assumption breaks down for overparameterized neural networks, which have far more parameters than training examples and yet still generalize well.

Figure: From data to model behaviour. Structure in data determines internal structure in models and thus generalisation. Current approaches to alignment work by shaping the training distribution (left), which only indirectly determines model structure (right) through its effects on the optimisation process (middle left & right). To mitigate the limitations of this indirect approach, alignment requires a better understanding of these intermediate links (Lehalleur et al., 2025)

Gradient descent navigates loss landscapes in ways that closely approximate Bayesian inference. This means that the algorithms we discover through training largely reflect which algorithms were already probable under random parameter sampling (Mingard et al., 2020, Is SGD a Bayesian Sampler? Well, Almost). If SGD mostly finds algorithms that were already likely to emerge from random chance, then the bias toward misaligned goals exists before training even begins. The problem isn't that SGD introduces bad incentives during optimization - it's that the space of possible algorithms is fundamentally skewed toward simple, proxy-based solutions over complex, genuinely aligned ones. This might mean we can't fix goal misgeneralization by changing how we train models; we need to change which algorithms are easy to represent in the first place.
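
To make the "random parameter sampling" picture concrete, here is a minimal sketch (with an assumed architecture and sample budget, not the setup of the cited paper) that draws random weights for a tiny network over 3-bit inputs and tallies how often each induced boolean function appears. The resulting distribution is heavily skewed: a handful of simple functions, typically the constant ones, appear far more often than most others.

```python
# A minimal sketch (assumed architecture and sample budget, not the setup from
# Mingard et al.): sample random weights for a tiny MLP over 3-bit inputs and
# tally how often each induced boolean function appears.
import numpy as np
from collections import Counter

rng = np.random.default_rng(0)

# All 8 inputs of the 3-bit boolean domain, encoded as {-1, +1}.
X = np.array([[int(b) for b in f"{i:03b}"] for i in range(8)]) * 2 - 1

def random_mlp_function(hidden=16):
    """Sample a random 3-16-1 tanh MLP and return the boolean function it
    induces on the 8 inputs, as a tuple of 0/1 outputs."""
    W1 = rng.normal(size=(3, hidden))
    b1 = rng.normal(size=hidden)
    W2 = rng.normal(size=(hidden, 1))
    b2 = rng.normal(size=1)
    logits = np.tanh(X @ W1 + b1) @ W2 + b2
    return tuple(int(v > 0) for v in logits.ravel())

N = 100_000
counts = Counter(random_mlp_function() for _ in range(N))

print(f"distinct functions found: {len(counts)} out of 2^8 = 256 possible")
print("most frequent functions under random sampling:")
for fn, c in counts.most_common(5):
    print(f"  {fn}: {c / N:.3f}")
```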

Neural networks belong to a class of models where multiple parameter settings can implement identical algorithms. Our discussion in the Learning Dynamics section talked about how each parameter configuration corresponds to a specific learned algorithm. But in neural networks, this relationship is not one-to-one. We can permute hidden units without changing the network's behavior. We can scale weights in one layer and compensate by scaling weights in the next layer. We can have completely different parameter configurations that implement identical decision-making processes.
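
As a concrete check, the following numpy sketch (a toy network of my own choosing) verifies both symmetries for a small ReLU network: permuting the hidden units, and rescaling one unit's incoming weights while inversely rescaling its outgoing weights, leave the input-output function exactly unchanged.

```python
# Two parameter symmetries of a ReLU MLP: hidden-unit permutation, and
# incoming/outgoing weight rescaling. Both preserve the network's function.
import numpy as np

rng = np.random.default_rng(0)
relu = lambda z: np.maximum(z, 0)

# A small 4-8-2 ReLU network with random weights.
W1, b1 = rng.normal(size=(8, 4)), rng.normal(size=8)
W2, b2 = rng.normal(size=(2, 8)), rng.normal(size=2)
x = rng.normal(size=(100, 4))

def forward(W1, b1, W2, b2, x):
    return relu(x @ W1.T + b1) @ W2.T + b2

y = forward(W1, b1, W2, b2, x)

# Symmetry 1: permute the hidden units (rows of W1/b1, columns of W2).
perm = rng.permutation(8)
y_perm = forward(W1[perm], b1[perm], W2[:, perm], b2, x)

# Symmetry 2: scale hidden unit 0's incoming weights by alpha > 0 and its
# outgoing weights by 1/alpha; relu(alpha * z) = alpha * relu(z) for alpha > 0.
alpha = 3.7
W1s, b1s, W2s = W1.copy(), b1.copy(), W2.copy()
W1s[0] *= alpha
b1s[0] *= alpha
W2s[:, 0] /= alpha
y_scaled = forward(W1s, b1s, W2s, b2, x)

print(np.allclose(y, y_perm))    # True: same function, different parameters
print(np.allclose(y, y_scaled))  # True
```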

Classical learning theory assumes each parameter setting corresponds to a unique algorithm. In linear regression, every different set of weights produces a different line, and every line represents a unique prediction algorithm. This one-to-one mapping allows classical theory to use parameter count as a complexity measure and apply standard mathematical approximations to predict learning outcomes.

These symmetries create a fundamental breakdown in how we measure algorithmic complexity. What looks like a single algorithm behaviorally—like "navigate to the coin"—might correspond to millions of different parameter configurations that all implement exactly the same computation. A network that learns "collect coins" might have millions of parameter settings that all implement precisely the same coin-collection algorithm. The usual approach of measuring the complexity of a learned algorithm by counting parameters fails here: thousands of parameters might collaborate to implement a single computational pattern, while other parameters might be completely redundant. This breakdown is called "singularity" in mathematical terms, referring to degeneracies in the parameter-to-function mapping. The word "singular" means the same thing as in "singular matrix"—something fundamental about the mathematical structure is degenerate. In neural networks, regions of parameter space implementing the same algorithm form complex geometric structures with singularities where standard mathematical tools fail (Murfet et al., 2020, Deep Learning is Singular, and That's Good).

Parametric equivalence adds a third layer to investigate beyond behavioral and algorithmic equivalence. Throughout this chapter, we've focused on distinguishing between behaviorally equivalent but algorithmically different solutions—like "move right" versus "collect coins" in CoinRun. Both produce identical training behavior but represent different internal reasoning processes. Now we encounter a third layer: algorithmically identical solutions that differ only in their specific parameter values. Why does this parametric variation matter if the algorithms are identical? Because parameter space geometry determines what SGD discovers during training. Even when two parameter configurations implement the same algorithm, they occupy different regions of the loss landscape. Some algorithmic solutions might be implementable through millions of parameter configurations (creating wide basins in parameter space), while others require precise parameter coordination (creating narrow regions). SGD is more likely to discover algorithms that correspond to larger regions of parameter space, as the sketch below illustrates.
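
A toy calculation makes "larger regions of parameter space" concrete. The sketch below (with two illustrative losses of my own choosing, not anything from the chapter) compares how much of a small ball around a minimum stays at near-zero loss for a degenerate loss, (ab)^2, whose minimum is an entire cross-shaped set, versus a regular loss, a^2 + b^2, whose minimum is a single point.

```python
# Toy illustration of why degenerate solutions occupy more parameter-space
# volume: compare the fraction of a ball around each minimum whose loss stays
# below a small threshold, for a singular loss (ab)^2 versus a regular loss
# a^2 + b^2. Both losses are illustrative choices.
import numpy as np

rng = np.random.default_rng(0)

singular_loss = lambda w: (w[:, 0] * w[:, 1]) ** 2     # minimum set: ab = 0
regular_loss = lambda w: w[:, 0] ** 2 + w[:, 1] ** 2   # minimum: the origin only

# Sample points uniformly from a ball of radius 1 around the origin (the
# minimum of both losses) and count how many stay under the loss threshold.
n, radius, threshold = 1_000_000, 1.0, 1e-3
w = rng.uniform(-radius, radius, size=(n, 2))
w = w[np.linalg.norm(w, axis=1) <= radius]

frac_singular = np.mean(singular_loss(w) < threshold)
frac_regular = np.mean(regular_loss(w) < threshold)
print(f"near-zero-loss volume fraction, singular minimum: {frac_singular:.4f}")
print(f"near-zero-loss volume fraction, regular  minimum: {frac_regular:.4f}")
# The singular loss's low-loss region (a cross-shaped neighbourhood of the
# axes) is far larger than the regular loss's small disk.
```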

Classical learning theory's tools fail when applied to these singular models. The mathematical techniques that work for linear regression—counting parameters, using standard approximations, making clean theoretical predictions—all assume each parameter contributes independently to algorithmic complexity. In singular models, the parameter-to-function mapping develops exactly the degeneracies and redundancies described above, violating the assumptions underlying classical statistical tools. Standard approximations like the Laplace approximation become invalid, because they model the loss around a minimum as a well-behaved quadratic bowl when the true minimum is a degenerate set. Parameter counting becomes meaningless as a complexity measure. The neat relationship between parameter count and generalization completely disappears (Murfet et al., 2020, Deep Learning is Singular, and That's Good).
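
A two-parameter toy model makes the failure concrete. For f_{a,b}(x) = a·b·x fitting the true function f = 0, every point with ab = 0 is a perfect fit, and the Hessian of the loss is degenerate along that whole set. The sketch below (my own toy, not an example from the cited paper) checks this numerically.

```python
# Toy check: for f_{a,b}(x) = a*b*x with true function f = 0, the population
# loss is proportional to (a*b)^2 and its Hessian is degenerate along the
# entire minimum set {ab = 0}. The Laplace approximation needs a
# positive-definite Hessian at the minimum, so it cannot be applied here.
import numpy as np

def loss(a, b):
    return (a * b) ** 2

def hessian(a, b, eps=1e-4):
    """Numerical Hessian of the loss at (a, b) via central differences."""
    f = lambda w: loss(w[0], w[1])
    w0 = np.array([a, b], dtype=float)
    H = np.zeros((2, 2))
    for i in range(2):
        for j in range(2):
            ei, ej = np.eye(2)[i] * eps, np.eye(2)[j] * eps
            H[i, j] = (f(w0 + ei + ej) - f(w0 + ei - ej)
                       - f(w0 - ei + ej) + f(w0 - ei - ej)) / (4 * eps ** 2)
    return H

# Every one of these points lies on the minimum set ab = 0.
for point in [(0.0, 0.0), (0.0, 2.0), (3.0, 0.0)]:
    eigvals = np.linalg.eigvalsh(hessian(*point))
    print(f"Hessian eigenvalues at {point}: {np.round(eigvals, 6)}")
# At (0, 0) both eigenvalues vanish; at the other points one still vanishes.
# det(H) = 0 everywhere on the minimum set, so the 1/sqrt(det H) factor in
# the Laplace approximation is undefined.
```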

Free energy minimization for machine learning

Sumio Watanabe's Free Energy Formula makes the tradeoff between accuracy and complexity mathematically precise:

F_n ≈ n L_n(w) + λ log n

The first term rewards accuracy: n L_n(w) measures how well the algorithm fits the training data, where n is the number of training samples and L_n(w) is the average training loss at parameters w. Algorithms with lower loss get lower free energy, making them more probable.

The second term penalizes complexity: λ log n is the algorithmic complexity penalty. Here λ is the Real Log Canonical Threshold (RLCT), also called the learning coefficient. It acts as a count of "effective parameters": roughly, how many directions in parameter space must be pinned down to specify this particular algorithm within the neural network architecture (for a regular, non-degenerate model, λ is simply d/2, half the parameter count). Algorithms requiring more effective parameters get higher free energy, making them less probable.

The balance between accuracy and complexity changes as training progresses. Early in training, when n is small, the complexity penalty dominates, so simple algorithms are preferred even if they have higher loss. Later in training, the accuracy term dominates, since it grows linearly in n while the penalty grows only logarithmically, so complex but accurate algorithms become preferred. This creates "phase transitions" where the preferred algorithmic solution suddenly switches.
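
To see the phase transition in numbers, here is a quick calculation with made-up loss and λ values for a simple and a complex candidate solution: for small n the simple, sloppy algorithm has lower free energy, and past a crossover the complex, accurate one takes over.

```python
# Illustrative numbers only: the losses and lambda values below are made up
# to show how the preferred solution flips as n grows.
import numpy as np

def free_energy(n, loss, lam):
    return n * loss + lam * np.log(n)

simple = dict(loss=0.30, lam=2.0)     # e.g. "always move right": high loss, low complexity
complex_ = dict(loss=0.05, lam=40.0)  # e.g. "find and collect the coin": low loss, high complexity

for n in [10, 100, 1000, 2000, 10000]:
    f_s = free_energy(n, **simple)
    f_c = free_energy(n, **complex_)
    winner = "simple" if f_s < f_c else "complex"
    print(f"n={n:6d}  F(simple)={f_s:8.1f}  F(complex)={f_c:8.1f}  -> prefers {winner}")
# With these numbers the preference flips a little above n = 1000, once the
# accuracy gain n*(0.30 - 0.05) outweighs the extra complexity cost
# (40 - 2)*log(n). SLT reads such crossings as phase transitions.
```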

Empirical Evidence: Opposing Staircases. Researchers tracking both loss and estimated complexity during training observe "opposing staircases" - each sudden drop in loss is accompanied by a jump in algorithmic complexity. This matches SLT's prediction that learning proceeds through discrete phase transitions from simple, high-loss solutions to complex, low-loss solutions, rather than through gradual refinement of a single algorithm.

This explains why simple proxy goals are systematically more likely to emerge during training. Algorithms like "always move right" require very few effective parameters - most network weights can vary freely without changing this basic behavior pattern. Complex algorithms like "recognize objects, plan paths, navigate obstacles" require many effective parameters working together in precise coordination. The Free Energy Formula shows that simpler algorithms have systematically higher probability of being discovered during finite training (Watanabe, 2009, Algebraic Geometry and Statistical Learning Theory).

Internal model selection describes how neural networks choose between competing solutions during training. Rather than gradually refining a single algorithm, networks undergo discrete transitions between completely different algorithmic approaches. Each transition represents abandoning one algorithmic solution in favor of another that provides a better accuracy-complexity tradeoff given the current amount of training data.

Path dependence emerges because the sequence of simple solutions constrains which complex solutions become accessible later. Two networks might both start with simple approximations and eventually transition to complex algorithms, but which complex algorithm becomes accessible depends entirely on the path taken through intermediate simple solutions. Small initialization differences can determine which sequence gets traversed, leading to completely different final algorithms.

Goal misgeneralization becomes the systematic default when viewed through SLT, rather than merely a possibility. The internal model selection mechanism shows that during finite training, algorithms with lower effective complexity will be systematically preferred over those with higher complexity, even when complex algorithms better capture intended goals. This transforms our understanding from "goal misgeneralization sometimes happens" to "goal misgeneralization is the default unless actively prevented."

Loss landscape singularities create systematic biases toward certain types of goals based on their algorithmic simplicity. The geometric structures where multiple parameter configurations implement identical algorithms determine which goals are easy versus hard to discover. Goals implementable with fewer effective parameters occupy larger regions of parameter space, making them more likely to be found. This provides mathematical foundations for our counting arguments - the "counting" reflects real geometric volumes.

Goal crystallization refers to the phase transitions where networks abandon simple goal approximations for more complex ones. During training, systems don't gradually refine their goals - they undergo sudden transitions where one goal structure gets replaced by another. Early transitions typically involve simple approximations (like "move right" instead of "collect coins"). Later transitions may lead to more complex goals, but which complex goals become accessible depends on the path through earlier simple approximations.

The complexity of human values creates systematic vulnerabilities. Implementing genuine alignment requires learning context-dependent rules, handling edge cases, and making subtle moral distinctions - all requiring high algorithmic complexity. Simple heuristics producing aligned-looking behavior while missing deeper intent will systematically have lower complexity, making them more probable during finite training regimes.

Phase transitions provide early warning signals for goal misgeneralization. The Local Learning Coefficient can be estimated during training to track complexity changes. Sudden jumps often coincide with goal crystallization - moments where networks abandon one goal structure for another. Monitoring these transitions could provide advance warning when systems shift toward potentially misaligned goals (Hoogland et al., 2024, Loss Landscape Degeneracy and Stagewise Development).
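
As a rough sketch of what such monitoring involves, here is a toy version of the kind of SGLD-based estimator used in this line of work: sample from a tempered posterior localized around a checkpoint w*, and plug the chain's average loss into λ̂ = nβ(E[L_n(w)] − L_n(w*)) with β = 1/log n. Everything concrete below (the two-parameter model f(x) = a·b·x, the data, and the hyperparameters) is an assumption chosen for illustration, not a setting from the cited work.

```python
# Toy sketch of SGLD-based LLC estimation. The model f(x) = a*b*x, the data,
# and every hyperparameter (step size, localization strength gamma, chain
# length) are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

# Data from the true function y = 0*x + noise, so every (a, b) with a*b = 0
# fits perfectly and the origin is a degenerate (singular) minimum.
n = 1000
x = rng.normal(size=n)
y = rng.normal(size=n)

def L_n(w):
    """Average negative log-likelihood (up to a constant) under unit Gaussian noise."""
    a, b = w
    return 0.5 * np.mean((y - a * b * x) ** 2)

def grad_L_n(w):
    a, b = w
    r = a * b * x - y
    return np.array([np.mean(r * b * x), np.mean(r * a * x)])

w_star = np.zeros(2)       # the checkpoint (here: the singular minimum) we probe
beta = 1.0 / np.log(n)     # inverse temperature, as in the WBIC-style estimator
gamma, eps = 10.0, 5e-4    # localization strength and SGLD step size
steps, burn_in = 20_000, 2_000

# SGLD chain targeting the tempered posterior localized around w_star:
# p(w) ~ exp(-n*beta*L_n(w) - (gamma/2)*||w - w_star||^2)
w = w_star.copy()
losses = []
for t in range(steps):
    drift = n * beta * grad_L_n(w) + gamma * (w - w_star)
    w = w - 0.5 * eps * drift + rng.normal(scale=np.sqrt(eps), size=2)
    if t >= burn_in:
        losses.append(L_n(w))

llc_hat = n * beta * (np.mean(losses) - L_n(w_star))
print(f"estimated local learning coefficient at w*: {llc_hat:.2f}")
# SLT gives lambda = 1/2 for this model at the origin, versus d/2 = 1 for a
# regular two-parameter model, so a reasonable estimate lands well below 1;
# the exact value is sensitive to the SGLD settings. Tracking this quantity
# across training checkpoints is how sudden complexity jumps are detected.
```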

Developmental interpretability emerges naturally as a method for understanding goal formation. Rather than reverse-engineering completed models, we can study the sequence of phase transitions through which goals crystallize. Each transition reveals why particular goals were selected over alternatives, providing insights crucial for detection and mitigation. We will talk about this more in our chapter on interpretability.

SLT focuses on parameter space geometry while abstracting away how parameters map to behaviors. Understanding goal misgeneralization requires connecting geometric complexity to actual algorithmic behavior, but SLT provides only partial tools for this connection. Two models with identical geometric properties could learn different goals depending on their parameter-function mappings (Skalse, 2023, My Criticism of Singular Learning Theory).