[[interpolate|interpolation]] / overparameterization does not contradict generalization;
but there is a spike in [[test error]] around the [[interpolate|interpolation threshold]]
![[double descent figure.png]]
- In classical statistics (left half of the figure), the [[bias-variance decomposition]] leads us to expect that more complex models overfit the training data
- In [[deep learning]] or even [[ridge regularization|ridge regression]] (right half): [[model complexity|more complex]] ("overparameterized") [[statistical model]]s generalize even better, even after [[interpolate|interpolating]] the training data!
- We often observe a [[phase transition]] around the interpolation threshold (simulated in the sketch below)
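A minimal simulation sketch of the spike (the setup is an illustrative assumption: isotropic Gaussian covariates, fitting only the first `d` of `D` features, minimum-norm least squares via `np.linalg.pinv`); the test error should peak near the interpolation threshold `d = n`:

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_test, D, noise = 20, 1000, 100, 0.1
beta = rng.normal(size=D) / np.sqrt(D)            # ground-truth coefficients
X_tr, X_te = rng.normal(size=(n, D)), rng.normal(size=(n_test, D))
y_tr = X_tr @ beta + noise * rng.normal(size=n)
y_te = X_te @ beta + noise * rng.normal(size=n_test)

for d in [5, 10, 15, 19, 20, 21, 25, 40, 80, 100]:
    # minimum-norm ("ridgeless") least squares on the first d features;
    # for d < D the model is misspecified, so residuals are nonzero
    beta_hat = np.linalg.pinv(X_tr[:, :d]) @ y_tr
    test_mse = np.mean((X_te[:, :d] @ beta_hat - y_te) ** 2)
    print(f"d = {d:3d}   test MSE = {test_mse:.3f}")  # expect a spike near d = n
```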
# why
[[2024SchaefferEtAlDoubleDescentDemystified]] paints a surprisingly simple picture:
well understood in [[kernel ridge regression]]:
the [[test error]] of [[ridge regularization|ridge regression]] (in the ridgeless limit) explodes around the interpolation threshold when
1. [[test error|test]] dataset [[covariate]]s align with [[principal component]]s of low [[variance]] in the [[train loss|train]]ing dataset, and
2. the residuals are nonzero, from [[architectural bias|model misspecification]] (occasionally [[noise]]),
because the near-zero singular values of the training design matrix then get inverted and amplify those residuals (worked out below)
see [[2024SchaefferEtAlDoubleDescentDemystified]] and [[2024AtanasovEtAlScalingRenormalizationHighdimensional|Scaling and renormalization in high-dimensional regression]] for [[statistical mechanics|statistical physics]] derivations
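A sketch of the mechanism, assuming a linear model $y = X\beta^* + E$ where $E$ collects [[noise]] and [[architectural bias|misspecification]] residuals and $X = \sum_r \sigma_r u_r v_r^\top$ is the SVD of the training design matrix; the minimum-norm ("ridgeless") solution $\hat\beta = X^+ y$ errs at a test point $x$ by

$$
x^\top \hat\beta - x^\top \beta^*
= \underbrace{x^\top\!\left(X^+ X - I\right)\beta^*}_{\text{bias from directions unseen in training}}
\;+\; \sum_{r:\,\sigma_r > 0} \frac{(x^\top v_r)\,(u_r^\top E)}{\sigma_r}.
$$

The second term is the spike: it is large exactly when $x$ overlaps low-[[variance]] right singular directions $v_r$ (factor 1) and $E \neq 0$ (factor 2), since $\sigma_r \to 0$ near the interpolation threshold; [[ridge regularization|ridge]] with $\lambda > 0$ replaces $1/\sigma_r$ by $\sigma_r/(\sigma_r^2 + \lambda) \le 1/(2\sqrt{\lambda})$, which caps the explosion.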
people have found "**multiple** descent curves";
see [[2021ChenEtAlMultipleDescentDesign]].
![[inductive bias#^inductive-bias]]
see [[simplicity bias in neural networks]]
- [[double descent in ridge regression for binary classification]]
- [[cgmt for ridge regression with squared loss for binary classification]]
- [[2019HastieEtAlSurprisesHighdimensionalRidgeless|Surprises in High-Dimensional Ridgeless Least Squares Interpolation]]
# sources
[[2024SchaefferEtAlDoubleDescentDemystified]]
Key [[history|historical]] papers from Belkin et al.:
- [[2018BelkinEtAlDoesDataInterpolation]] (interpolation does not contradict generalization)
- [[2018BelkinEtAlUnderstandDeepLearning]]
- [[2019BelkinEtAlReconcilingModernMachine]]
[[2019HastieEtAlSurprisesHighdimensionalRidgeless|Surprises in High-Dimensional Ridgeless Least Squares Interpolation]]
[[2023HenighanEtAlSuperpositionMemorizationDouble]]
[[CS 229br lec 06 2023-03-02]] Boaz jokes that "more money should give better results... [[gradient descent]] is a capitalist algorithm"