Bibliography

  • [AMS09] P.-A. Absil, R. Mahony and R. Sepulchre “Optimization Algorithms on Matrix Manifolds” Princeton University Press, 2009
  • [AAJ+16] Alekh Agarwal, Animashree Anandkumar, Prateek Jain and Praneeth Netrapalli “Learning Sparsely Used Overcomplete Dictionaries via Alternating Minimization” In SIAM Journal on Optimization 26.4, 2016, pp. 2775–2799 DOI: 10.1137/140979861
  • [AEB06] Michal Aharon, Michael Elad and Alfred Bruckstein “K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation” In IEEE Transactions on Signal Processing 54.11 IEEE, 2006, pp. 4311–4322
  • [AR20] Jason M. Allred and Kaushik Roy “Controlled Forgetting: Targeted Stimulation and Dopaminergic Plasticity Modulation for Unsupervised Lifelong Learning in Spiking Neural Networks” In Frontiers in Neuroscience 14, 2020 DOI: 10.3389/fnins.2020.00007
  • [ADG+16] Marcin Andrychowicz, Misha Denil, Sergio Gomez, Matthew W Hoffman, David Pfau, Tom Schaul, Brendan Shillingford and Nando De Freitas “Learning to learn by gradient descent by gradient descent” In Advances in neural information processing systems, 2016, pp. 3981–3989
  • [ACB17] Martin Arjovsky, Soumith Chintala and Léon Bottou “Wasserstein generative adversarial networks” In International conference on machine learning, 2017, pp. 214–223 PMLR
  • [AGM+15] Sanjeev Arora, Rong Ge, Tengyu Ma and Ankur Moitra “Simple, Efficient, and Neural Algorithms for Sparse Coding” In Proceedings of The 28th Conference on Learning Theory 40, Proceedings of Machine Learning Research Paris, France: PMLR, 2015, pp. 113–149 URL: https://proceedings.mlr.press/v40/Arora15.html
  • [AW18] Aharon Azulay and Yair Weiss “Why do deep convolutional networks generalize so poorly to small image transformations?” In arXiv preprint arXiv:1805.12177, 2018
  • [BJC85] B., J. and C. “Architectures neuromimétiques adaptatives : Détection de primitives” In Cognitiva 2, 1985, pp. 593–597
  • [BKH16] Jimmy Lei Ba, Jamie Ryan Kiros and Geoffrey E Hinton “Layer normalization” In arXiv preprint arXiv:1607.06450, 2016
  • [BM24] Hao Bai and Yi Ma “Improving neuron-level interpretability with white-box language models” In arXiv preprint arXiv:2410.16443, 2024
  • [BGN+17] Bowen Baker, Otkrist Gupta, N. Naik and R. Raskar “Designing Neural Network Architectures using Reinforcement Learning” In ArXiv abs/1611.02167, 2017
  • [Bal11] Pierre Baldi “Autoencoders, unsupervised learning and deep architectures” In Proceedings of the 2011 International Conference on Unsupervised and Transfer Learning Workshop - Volume 27, UTLW’11 Washington, USA: JMLR.org, 2011, pp. 37–50
  • [BH89] Pierre Baldi and Kurt Hornik “Neural networks and principal component analysis: Learning from examples without local minima” In Neural networks 2.1 Elsevier, 1989, pp. 53–58
  • [BLZ+22] Fan Bao, Chongxuan Li, Jun Zhu and Bo Zhang “Analytic-dpm: an analytic estimate of the optimal reverse variance in diffusion probabilistic models” In arXiv preprint arXiv:2201.06503, 2022
  • [BSM+20] Pinglei Bao, Liang She, Mason McGill and Doris Y. Tsao “A map of object space in primate inferotemporal cortex” In Nature 583, 2020, pp. 103–108
  • [BKS15] Boaz Barak, Jonathan A Kelner and David Steurer “Dictionary learning and tensor decomposition via the sum-of-squares method” In Proceedings of the forty-seventh annual ACM symposium on Theory of Computing New York, NY, USA: ACM, 2015 DOI: 10.1145/2746539.2746605
  • [BT09] Amir Beck and Marc Teboulle “A fast iterative shrinkage-thresholding algorithm for linear inverse problems” In SIAM journal on imaging sciences 2.1 SIAM, 2009, pp. 183–202
  • [Bel57] Richard Bellman “Dynamic Programming” Princeton University Press, 1957
  • [BN24] Jeremy Bernstein and Laker Newhouse “Old Optimizer, New Norm: An Anthology”, 2024 arXiv: https://arxiv.org/abs/2409.20325
  • [Bla72] R. Blahut “Computation of channel capacity and rate-distortion functions” In IEEE Transactions on Information Theory 18.4, 1972, pp. 460–473 DOI: 10.1109/TIT.1972.1054855
  • [BBF+01] Lorna Booth, Jehoshua Bruck, M. Franceschetti and Ronald Meester “Covering Algorithms, Continuum Percolation and the Geometry of Wireless Networks” In Ann. Appl. Probab. 13, 2001 DOI: 10.1214/aoap/1050689601
  • [Bor97] Vivek S Borkar “Stochastic approximation with two time scales” In Systems & Control Letters 29.5 Elsevier, 1997, pp. 291–294
  • [BDS16] Vivek S Borkar, Raaz Dwivedi and Neeraja Sahasrabudhe “Gaussian approximations in high dimensional estimation” In Systems & Control Letters 92 Elsevier, 2016, pp. 42–45
  • [Bos50] R. Boscovich “De calculo probabilitatum que respondent diversis valoribus summe errorum post plures observationes, quarum singule possint esse erronee certa quadam quantitate”, 1750
  • [Bou23] Nicolas Boumal “An Introduction to Optimization on Smooth Manifolds” Cambridge University Press, 2023 DOI: 10.1017/9781009166164
  • [BV04] Stephen Boyd and Lieven Vandenberghe “Convex Optimization” Cambridge University Press, 2004
  • [BN24a] Arwen Bradley and Preetum Nakkiran “Classifier-Free Guidance is a Predictor-Corrector” In arXiv [cs.LG], 2024 arXiv: http://arxiv.org/abs/2408.09000
  • [BN20] Guy Bresler and Dheeraj Nagaraj “Sharp representation theorems for relu networks with precise dependence on depth” In Proceedings of the 34th International Conference on Neural Information Processing Systems, NIPS’20 Article 897 Red Hook, NY, USA: Curran Associates Inc., 2020, pp. 10697–10706 URL: https://dl.acm.org/doi/10.5555/3495724.3496621
  • [BB11] Haim Brezis “Functional analysis, Sobolev spaces and partial differential equations” Springer, 2011
  • [BEJ25] Paige Bright, Alan Edelman and Steven G Johnson “Matrix Calculus (for Machine Learning and Beyond)” In arXiv preprint arXiv:2501.14787, 2025
  • [BMR+20] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever and Dario Amodei “Language models are few-shot learners” In arXiv preprint arXiv:2005.14165, 2020
  • [BM13] Joan Bruna and Stéphane Mallat “Invariant Scattering Convolution Networks” In IEEE Transactions on Pattern Analysis and Machine Intelligence 35.8, 2013, pp. 1872–1886
  • [BGW21] Sam Buchanan, Dar Gilboa and John Wright “Deep Networks and the Multiple Manifold Problem” In International Conference on Learning Representations, 2021 URL: https://openreview.net/forum?id=O-6Pm_d_Q-
  • [CD91] F.M. Callier and C.A. Desoer “Linear System Theory” Springer-Verlag, 1991
  • [Can06] E. Candès “Compressive sampling” In Proceedings of the International Congress of Mathematicians, 2006
  • [CT05] E. Candès and T. Tao “Decoding by linear programming” In IEEE Transactions on Information Theory 51.12, 2005
  • [CT05a] E. Candès and T. Tao “Error Correction via Linear Programming” In IEEE Symposium on FOCS, 2005, pp. 295–308
  • [CTM+21] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski and Armand Joulin “Emerging properties in self-supervised vision transformers” In Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 9650–9660
  • [Cha66] Gregory J. Chaitin “On the Length of Programs for Computing Finite Binary Sequences” In J. ACM 13.4 New York, NY, USA: Association for Computing Machinery, 1966, pp. 547–569 DOI: 10.1145/321356.321363
  • [CYY+22] Kwan Ho Ryan Chan, Yaodong Yu, Chong You, Haozhi Qi, John Wright and Yi Ma “ReduNet: A White-box Deep Network from the Principle of Maximizing Rate Reduction” In Journal of Machine Learning Research 23.114, 2022, pp. 1–103 URL: http://jmlr.org/papers/v23/21-0631.html
  • [CJG+15] Tsung-Han Chan, Kui Jia, Shenghua Gao, Jiwen Lu, Zinan Zeng and Yi Ma “PCANet: A simple deep learning baseline for image classification?” In TIP, 2015
  • [CT17] Le Chang and Doris Tsao “The Code for Facial Identity in the Primate Brain” In Cell 169, 2017, pp. 1013–1028.e14 DOI: 10.1016/j.cell.2017.05.011
  • [CRE+19] Arslan Chaudhry, Marcus Rohrbach, Mohamed Elhoseiny, Thalaiyasingam Ajanthan, Puneet K Dokania, Philip HS Torr and Marc’Aurelio Ranzato “On tiny episodic memories in continual learning” In arXiv preprint arXiv:1902.10486, 2019
  • [CHZ+23] Minshuo Chen, Kaixuan Huang, Tuo Zhao and Mengdi Wang “Score approximation, estimation and distribution recovery of diffusion models on low-dimensional data” In International Conference on Machine Learning, 2023, pp. 4672–4712 PMLR
  • [CRB+18] Ricky TQ Chen, Yulia Rubanova, Jesse Bettencourt and David K Duvenaud “Neural ordinary differential equations” In Advances in neural information processing systems 31, 2018
  • [CCL+23] Sitan Chen, Sinho Chewi, Jerry Li, Yuanzhi Li, Adil Salim and Anru Zhang “Sampling is as easy as learning the score: theory for diffusion models with minimal data assumptions” In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023 OpenReview.net, 2023 URL: https://openreview.net/forum?id=zyLVMgsZ0U_
  • [CZG+24] Siyi Chen, Huijie Zhang, Minzhe Guo, Yifu Lu, Peng Wang and Qing Qu “Exploring low-dimensional subspace in diffusion models for controllable image editing” In Advances in neural information processing systems 37, 2024, pp. 27340–27371
  • [CZL+25] Siyi Chen, Yimeng Zhang, Sijia Liu and Qing Qu “The Dual Power of Interpretable Token Embeddings: Jailbreaking Attacks and Defenses for Diffusion Model Unlearning” In arXiv preprint arXiv:2504.21307, 2025
  • [CKN+20] Ting Chen, Simon Kornblith, Mohammad Norouzi and Geoffrey Hinton “A simple framework for contrastive learning of visual representations” In arXiv preprint arXiv:2002.05709, 2020
  • [CLH+24] Xiangning Chen, Chen Liang, Da Huang, Esteban Real, Kaiyuan Wang, Hieu Pham, Xuanyi Dong, Thang Luong, Cho-Jui Hsieh and Yifeng Lu “Symbolic discovery of optimization algorithms” In Advances in neural information processing systems 36, 2024
  • [Cho17] Francois Chollet “Xception: Deep Learning with Depthwise Separable Convolutions” In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 1800–1807
  • [CKM+23] Hyungjin Chung, Jeongsol Kim, Michael Thompson Mccann, Marc Louis Klasky and Jong Chul Ye “Diffusion Posterior Sampling for General Noisy Inverse Problems” In The Eleventh International Conference on Learning Representations, 2023 URL: https://openreview.net/forum?id=OnD9zGAGT0k
  • [CW16] Taco Cohen and Max Welling “Group equivariant convolutional networks” In International Conference on Machine Learning, 2016, pp. 2990–2999
  • [CW16a] Taco Cohen and Max Welling “Group equivariant convolutional networks” In International conference on machine learning, 2016, pp. 2990–2999 PMLR
  • [CW16b] Taco S. Cohen and Max Welling “Group Equivariant Convolutional Networks” In CoRR abs/1602.07576, 2016 arXiv: http://arxiv.org/abs/1602.07576
  • [CV95] Corinna Cortes and Vladimir Vapnik “Support-Vector Networks” In Mach. Learn. 20.3 USA: Kluwer Academic Publishers, 1995, pp. 273–297 DOI: 10.1023/A:1022627411411
  • [CT91] T. Cover and J. Thomas “Elements of Information Theory” Wiley Series in Telecommunications, 1991
  • [Cov64] Thomas Cover “Geometrical and Statistical Properties of Systems of Linear Inequalities with Applications in Pattern Recognition” In IEEE Transactions on Electronic Computers, 1964
  • [Cyb89] George V. Cybenko “Approximation by superpositions of a sigmoidal function” In Mathematics of Control, Signals and Systems 2, 1989, pp. 303–314 URL: https://api.semanticscholar.org/CorpusID:3958369
  • [Don00] David L. Donoho “High-dimensional data analysis: The curses and blessings of dimensionality” In AMS Math Challenges Lecture, 2000
  • [DM03] David L. Donoho and Michael Elad “Optimally sparse representation in general (nonorthogonal) dictionaries via ℓ¹ minimization” In PNAS 100.5, 2003, pp. 2197–2202
  • [DTL+22] Xili Dai, Shengbang Tong, Mingyang Li, Ziyang Wu, Michael Psenka, Kwan Ho Ryan Chan, Pengyuan Zhai, Yaodong Yu, Xiaojun Yuan, Heung-Yeung Shum and Yi Ma “CTRL: Closed-Loop Transcription to an LDR via Minimaxing Rate Reduction” In Entropy 24.4, 2022 DOI: 10.3390/e24040456
  • [Dan02] George B Dantzig “Linear Programming” In Operations Research 50.1 INFORMS, 2002, pp. 42–47 URL: https://www.jstor.org/stable/3088447
  • [DSD+23] Giannis Daras, Kulin Shah, Yuval Dagan, Aravind Gollakota, Alex Dimakis and Adam Klivans “Ambient Diffusion: Learning Clean Distributions from Corrupted Data” In Thirty-seventh Conference on Neural Information Processing Systems, 2023 URL: https://openreview.net/forum?id=wBJBLy9kBY
  • [DSD+23a] Giannis Daras, Kulin Shah, Yuval Dagan, Aravind Gollakota, Alex Dimakis and Adam Klivans “Ambient Diffusion: Learning Clean Distributions from Corrupted Data” In Advances in Neural Information Processing Systems 36 Curran Associates, Inc., 2023, pp. 288–313 URL: https://proceedings.neurips.cc/paper_files/paper/2023/file/012af729c5d14d279581fc8a5db975a1-Paper-Conference.pdf
  • [DGG+25] Valentin De Bortoli, Alexandre Galashov, J Swaroop Guntupalli, Guangyao Zhou, Kevin Murphy, Arthur Gretton and Arnaud Doucet “Distributional Diffusion Models with Scoring Rules” In arXiv preprint arXiv:2502.02483, 2025
  • [DCM+23] Aaron Defazio, Ashok Cutkosky, Harsh Mehta and Konstantin Mishchenko “Optimal linear decay learning rate schedules and further refinements” In arXiv preprint arXiv:2310.07831, 2023
  • [DDS22] Julie Delon, Agnes Desolneux and Antoine Salmona “Gromov–Wasserstein distances between Gaussian distributions” In Journal of Applied Probability 59.4 Cambridge University Press, 2022, pp. 1178–1198
  • [DCL+19] Jacob Devlin, Ming-Wei Chang, Kenton Lee and Kristina Toutanova “Bert: Pre-training of deep bidirectional transformers for language understanding” In Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers), 2019, pp. 4171–4186
  • [DN21] Prafulla Dhariwal and Alexander Quinn Nichol “Diffusion Models Beat GANs on Image Synthesis” In Advances in Neural Information Processing Systems, 2021 URL: https://openreview.net/forum?id=AAWuCvzaVt
  • [Don01] D.L. Donoho “Sparse components of images and optimal atomic decompositions” In Constructive Approximation 17.3 Springer Science+Business Media LLC, 2001, pp. 353–382 DOI: 10.1007/s003650010032
  • [DVD+98] D.L. Donoho, M. Vetterli, R.A. DeVore and I. Daubechies “Data compression and harmonic analysis” In IEEE Transactions on Information Theory 44.6, 1998, pp. 2435–2476 DOI: 10.1109/18.720544
  • [Don05] David L Donoho “Neighborly polytopes and sparse solutions of underdetermined linear equations” In Stanford Technical Report 2005-04 Citeseer, 2005
  • [DBK+21] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit and Neil Houlsby “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale” In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021 OpenReview.net, 2021 URL: https://openreview.net/forum?id=YicbFdNTTy
  • [EY36] Carl Eckart and Gale Young “The approximation of one matrix by another of lower rank” In Psychometrika 1.3, 1936, pp. 211–218 DOI: 10.1007/BF02288367
  • [EAS98] A. Edelman, T. Arias and S. Smith “The Geometry of Algorithms with Orthogonality Constraints” In SIAM Journal on Matrix Analysis and Applications 20.2 Society for Industrial and Applied Mathematics, 1998, pp. 303–353 DOI: 10.1137/S0895479895290954
  • [EA06] Michael Elad and Michal Aharon “Image denoising via sparse and redundant representations over learned dictionaries” In IEEE Transactions on Image Processing 15.12, 2006, pp. 3736–3745 URL: https://www.ncbi.nlm.nih.gov/pubmed/17153947
  • [EHO+22] Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, Shauna Kravec, Zac Hatfield-Dodds, Robert Lasenby, Dawn Drain and Carol Chen “Toy models of superposition” In arXiv preprint arXiv:2209.10652, 2022
  • [EHO+22a] Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, Shauna Kravec, Zac Hatfield-Dodds, Robert Lasenby, Dawn Drain, Carol Chen, Roger Grosse, Sam McCandlish, Jared Kaplan, Dario Amodei, Martin Wattenberg and Christopher Olah “Toy Models of Superposition” In Transformer Circuits Thread, 2022 URL: https://transformer-circuits.pub/2022/toy_model/index.html
  • [ETT+17] Logan Engstrom, Brandon Tran, Dimitris Tsipras, Ludwig Schmidt and Aleksander Madry “A rotation and a translation suffice: Fooling CNNs with simple transformations” In arXiv preprint arXiv:1712.02779, 2017
  • [EKB+24] Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer and Frederic Boesel “Scaling rectified flow transformers for high-resolution image synthesis” In Forty-first international conference on machine learning, 2024
  • [FZS22] William Fedus, Barret Zoph and Noam Shazeer “Switch transformers: scaling to trillion parameter models with simple and efficient sparsity” In J. Mach. Learn. Res. 23.1 JMLR.org, 2022
  • [Fel49] William Feller “On the Theory of Stochastic Processes, with Particular Reference to Applications”, 1949 URL: https://api.semanticscholar.org/CorpusID:121027442
  • [FCR20] Tanner Fiez, Benjamin Chasnov and Lillian Ratliff “Implicit learning dynamics in stackelberg games: Equilibria characterization, convergence analysis, and empirical study” In International Conference on Machine Learning, 2020, pp. 3133–3144 PMLR
  • [FCR19] Tanner Fiez, Benjamin Chasnov and Lillian J Ratliff “Convergence of learning dynamics in stackelberg games” In arXiv preprint arXiv:1906.01217, 2019
  • [Fuk69] Kunihiko Fukushima “Visual Feature Extraction by a Multilayered Network of Analog Threshold Elements” In IEEE Transactions on Systems Science and Cybernetics 5.4, 1969, pp. 322–333 DOI: 10.1109/TSSC.1969.300225
  • [Fuk80] Kunihiko Fukushima “Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position” In Biological Cybernetics 36, 1980, pp. 193–202 URL: https://api.semanticscholar.org/CorpusID:206775608
  • [GTT+25] Leo Gao, Tom Dupre Tour, Henk Tillman, Gabriel Goh, Rajan Troll, Alec Radford, Ilya Sutskever, Jan Leike and Jeffrey Wu “Scaling and evaluating sparse autoencoders” In The Thirteenth International Conference on Learning Representations, 2025 URL: https://openreview.net/forum?id=tcsZt9ZNKD
  • [GG23] Guillaume Garrigos and Robert M Gower “Handbook of convergence theorems for (stochastic) gradient methods” In arXiv preprint arXiv:2301.11235, 2023
  • [Gil61] E.N. Gilbert “Random Plane Networks” In Journal of the Society for Industrial and Applied Mathematics 9.4, 1961, pp. 533–543 DOI: 10.1137/0109045
  • [GC19] Aaron Gokaslan and Vanya Cohen “OpenWebText Corpus”, http://Skylion007.github.io/OpenWebTextCorpus, 2019
  • [GPM+14] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville and Yoshua Bengio “Generative Adversarial Nets” In Advances in Neural Information Processing Systems 27 Curran Associates, Inc., 2014 URL: https://proceedings.neurips.cc/paper_files/paper/2014/file/5ca3e9b122f61f8f06494c97b1afccf3-Paper.pdf
  • [GPM+14a] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville and Yoshua Bengio “Generative adversarial nets” In Advances in neural information processing systems, 2014, pp. 2672–2680
  • [GL10] Karol Gregor and Yann LeCun “Learning fast approximations of sparse coding” In Proceedings of the 27th International Conference on International Conference on Machine Learning, 2010, pp. 399–406
  • [Gri11] Rémi Gribonval “Should Penalized Least Squares Regression be Interpreted as Maximum A Posteriori Estimation?” In IEEE Transactions on Signal Processing 59.5, 2011, pp. 2405–2410 DOI: 10.1109/TSP.2011.2107908
  • [GJB15] Rémi Gribonval, Rodolphe Jenatton and Francis Bach “Sparse and spurious: Dictionary learning with noise and outliers” In IEEE Transactions on Information Theory 61.11 Institute of Electrical and Electronics Engineers (IEEE), 2015, pp. 6298–6319 DOI: 10.1109/tit.2015.2472522
  • [Gro87] Stephen Grossberg “Competitive Learning: From Interactive Activation to Adaptive Resonance” In Cogn. Sci. 11, 1987, pp. 23–63
  • [HY01] M.H. Hansen and B. Yu “Model Selection and the Principle of Minimum Description Length” In Journal of American Statistical Association 96, 2001, pp. 746–774
  • [HTF09] Trevor Hastie, Robert Tibshirani and Jerome Friedman “The Elements of Statistical Learning” Springer, 2009
  • [HCX+22] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár and Ross Girshick “Masked autoencoders are scalable vision learners” In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 16000–16009
  • [HZR+16] Kaiming He, Xiangyu Zhang, Shaoqing Ren and Jian Sun “Deep Residual Learning for Image Recognition” In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016 IEEE Computer Society, 2016, pp. 770–778 DOI: 10.1109/CVPR.2016.90
  • [HZR+16a] Kaiming He, Xiangyu Zhang, Shaoqing Ren and Jian Sun “Deep residual learning for image recognition” In Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778
  • [HS06] G.E. Hinton and R.R. Salakhutdinov “Reducing the Dimensionality of Data with Neural Networks” In Science 313.5786 American Association for the Advancement of Science, 2006, pp. 504–507 DOI: 10.1126/science.1127647
  • [HZ93] Geoffrey E. Hinton and Richard S. Zemel “Autoencoders, minimum description length and Helmholtz free energy” In Proceedings of the 6th International Conference on Neural Information Processing Systems, NIPS’93 Denver, Colorado: Morgan Kaufmann Publishers Inc., 1993, pp. 3–10
  • [HJA20] Jonathan Ho, Ajay Jain and Pieter Abbeel “Denoising diffusion probabilistic models” In Advances in Neural Information Processing Systems 33, 2020, pp. 6840–6851
  • [HS22] Jonathan Ho and Tim Salimans “Classifier-Free Diffusion Guidance” In arXiv [cs.LG], 2022 arXiv: http://arxiv.org/abs/2207.12598
  • [HS97] Sepp Hochreiter and Jürgen Schmidhuber “Long Short-term Memory” In Neural Computation 9.8, 1997, pp. 1735–1780 DOI: 10.1162/neco.1997.9.8.1735
  • [HSD20] David Hong, Yue Sheng and Edgar Dobriban “Selecting the number of components in PCA via random signflips”, 2020 arXiv:2012.02985 [math.ST]
  • [Hot33] H. Hotelling “Analysis of a Complex of Statistical Variables into Principal Components” In Journal of Educational Psychology, 1933
  • [HLV+17] Gao Huang, Zhuang Liu, Laurens Van Der Maaten and Kilian Q Weinberger “Densely Connected Convolutional Networks” In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 2261–2269
  • [HM99] Jinggang Huang and David Mumford “Statistics of natural images and models” In Proceedings. 1999 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Cat. No PR00149) 1, 1999, pp. 541–547 IEEE
  • [HW59] D.H. Hubel and T.N. Wiesel “Receptive fields of single neurones in the cat’s striate cortex” In J. Physiol. 148.3, 1959, pp. 574–591
  • [HCS+24] Robert Huben, Hoagy Cunningham, Logan Riggs Smith, Aidan Ewart and Lee Sharkey “Sparse Autoencoders Find Highly Interpretable Features in Language Models” In The Twelfth International Conference on Learning Representations, 2024 URL: https://openreview.net/forum?id=F76bwRSLeK
  • [HKV19] Frank Hutter, Lars Kotthoff and Joaquin Vanschoren, eds. “Automated Machine Learning: Methods, Systems, Challenges” Springer, 2019
  • [HKJ+21] Uiwon Hwang, Heeseung Kim, Dahuin Jung, Hyemi Jang, Hyungyu Lee and Sungroh Yoon “Stein Latent Optimization for Generative Adversarial Networks” In arXiv preprint arXiv:2106.05319, 2021
  • [Hyv05] Aapo Hyvärinen “Estimation of Non-Normalized Statistical Models by Score Matching” In Journal of Machine Learning Research 6.24, 2005, pp. 695–709 URL: http://jmlr.org/papers/v6/hyvarinen05a.html
  • [HO97] Aapo Hyvärinen and Erkki Oja “A Fast Fixed-Point Algorithm for Independent Component Analysis” In Neural Computation 9.7, 1997, pp. 1483–1492 DOI: 10.1162/neco.1997.9.7.1483
  • [HO00] Aapo Hyvärinen and Erkki Oja “Independent Component Analysis: Algorithms and Applications” In Neural Networks 13.4-5, 2000, pp. 411–430
  • [HO00a] Aapo Hyvärinen and Erkki Oja “Independent component analysis: algorithms and applications” In Neural Networks 13.4 Elsevier Science Ltd., 2000, pp. 411–430 DOI: 10.1016/S0893-6080(00)00026-5
  • [IS15] Sergey Ioffe and Christian Szegedy “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift” In ICML, 2015, pp. 448–456 URL: http://proceedings.mlr.press/v37/ioffe15.html
  • [Jam15] G J O Jameson “A simple proof of Stirling’s formula for the gamma function” In The Mathematical Gazette 99.544 Cambridge University Press, 2015, pp. 68–74 DOI: 10.1017/mag.2014.9
  • [JRR+24] Yibo Jiang, Goutham Rajendran, Pradeep Kumar Ravikumar, Bryon Aragam and Victor Veitch “On the Origins of Linear Representations in Large Language Models” In International Conference on Machine Learning 235 PMLR, 2024, pp. 21879–21911
  • [Jol02] I. Jolliffe “Principal Component Analysis” Springer-Verlag, 2002
  • [Jol86] I. Jolliffe “Principal Component Analysis” New York, NY: Springer-Verlag, 1986
  • [Jon82] Douglas Samuel Jones “The theory of generalised functions” Cambridge University Press, 1982
  • [JJB+] Keller Jordan, Yuchen Jin, Vlado Boza, You Jiacheng, Franz Cecista, Laker Newhouse and Jeremy Bernstein “Muon: An optimizer for hidden layers in neural networks”, 2024 URL: https://kellerjordan.github.io/posts/muon
  • [JT20] Sheena A. Josselyn and Susumu Tonegawa “Memory engrams: Recalling the past and imagining the future” In Science 367, 2020
  • [KS21] Z Kadkhodaie and E P Simoncelli “Stochastic solutions for linear inverse problems using the prior implicit in a denoiser” In Adv. Neural Information Processing Systems (NeurIPS) 34 Curran Associates, Inc., 2021 URL: https://proceedings.neurips.cc/paper_files/paper/2021/file/6e28943943dbed3c7f82fc05f269947a-Paper.pdf
  • [Kal60] Rudolph Emil Kalman “A new approach to linear filtering and prediction problems” In Journal of Basic Engineering 82.1, 1960, pp. 35–45
  • [KG24] Mason Kamb and Surya Ganguli “An analytic theory of creativity in convolutional diffusion models” In arXiv preprint arXiv:2412.20292, 2024
  • [Kar22] Andrej Karpathy “nanoGPT” In GitHub repository GitHub, https://github.com/karpathy/nanoGPT, 2022
  • [Kar22a] Andrej Karpathy “The spelled-out intro to neural networks and backpropagation: building micrograd”, 2022 YouTube URL: https://www.youtube.com/watch?v=VMj-3S1tku0
  • [KK18] Ronald Kemker and Christopher Kanan “FearNet: Brain-Inspired Model for Incremental Learning” In International Conference on Learning Representations, 2018
  • [KB14] Diederik P Kingma and Jimmy Ba “Adam: A method for stochastic optimization” In arXiv preprint arXiv:1412.6980, 2014
  • [KW13] Diederik P Kingma and Max Welling “Auto-Encoding Variational Bayes” In arXiv [stat.ML], 2013 arXiv: http://arxiv.org/abs/1312.6114v11
  • [KW19] Diederik P Kingma and Max Welling “An introduction to variational autoencoders” In Foundations and Trends® in Machine Learning 12.4 Now Publishers, 2019, pp. 307–392 DOI: 10.1561/2200000056
  • [KPR+17] James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho and Agnieszka Grabska-Barwinska “Overcoming catastrophic forgetting in neural networks” In Proceedings of the national academy of sciences 114.13 National Acad Sciences, 2017, pp. 3521–3526
  • [KUM+17] Günter Klambauer, Thomas Unterthiner, Andreas Mayr and Sepp Hochreiter “Self-normalizing neural networks” In Proceedings of the 31st International Conference on Neural Information Processing Systems, 2017, pp. 972–981
  • [KTV18] Artemy Kolchinsky, Brendan D Tracey and Steven Van Kuyk “Caveats for information bottleneck in deterministic scenarios” In arXiv preprint arXiv:1808.07593, 2018
  • [Kol98] Andrei N. Kolmogorov “On Tables of Random Numbers” (reprinted from Sankhyā: The Indian Journal of Statistics, Series A, Vol. 25, Part 4, 1963) In Theor. Comput. Sci. 207, 1998, pp. 387–395 URL: https://api.semanticscholar.org/CorpusID:33390800
  • [KS12] Irwin Kra and Santiago R Simanca “On Circulant Matrices” In Notices of the American Mathematical Society 59, 2012, pp. 368–377
  • [Kra91] Mark A Kramer “Nonlinear principal component analysis using autoassociative neural networks” In AIChE Journal 37.2 Wiley Online Library, 1991, pp. 233–243
  • [KH+09] Alex Krizhevsky and Geoffrey Hinton “Learning multiple layers of features from tiny images” Toronto, ON, Canada, 2009
  • [KNH14] Alex Krizhevsky, Vinod Nair and Geoffrey Hinton “The CIFAR-10 dataset” In online: http://www.cs.toronto.edu/~kriz/cifar.html 55, 2014
  • [KSH12] Alex Krizhevsky, Ilya Sutskever and Geoffrey E Hinton “Imagenet classification with deep convolutional neural networks” In Advances in neural information processing systems, 2012, pp. 1097–1105
  • [LBB+25] Black Forest Labs, Stephen Batifol, Andreas Blattmann, Frederic Boesel, Saksham Consul, Cyril Diagne, Tim Dockhorn, Jack English, Zion English, Patrick Esser, Sumith Kulal, Kyle Lacey, Yam Levi, Cheng Li, Dominik Lorenz, Jonas Müller, Dustin Podell, Robin Rombach, Harry Saini, Axel Sauer and Luke Smith “FLUX.1 Kontext: Flow matching for in-context image generation and editing in latent space” In arXiv [cs.GR], 2025 arXiv: http://arxiv.org/abs/2506.15742
  • [LRM+12] Quoc V. Le, Marc’Aurelio Ranzato, Rajat Monga, Matthieu Devin, Kai Chen, Greg S. Corrado, Jeff Dean and Andrew Y. Ng “Building high-level features using large scale unsupervised learning” In Proceedings of the 29th International Coference on International Conference on Machine Learning, ICML’12 Edinburgh, Scotland: Omnipress, 2012, pp. 507–514
  • [LBD+89] Y. LeCun, B. Boser, J.. Denker, D. Henderson, R.. Howard, W. Hubbard and L.. Jackel “Backpropagation Applied to Handwritten Zip Code Recognition” In Neural Computation 1.4, 1989, pp. 541–551 DOI: 10.1162/neco.1989.1.4.541
  • [LBB+98] Y. Lecun, L. Bottou, Y. Bengio and P. Haffner “Gradient-based learning applied to document recognition” In Proceedings of the IEEE 86.11, 1998, pp. 2278–2324 DOI: 10.1109/5.726791
  • [LBB+98a] Yann LeCun, Léon Bottou, Yoshua Bengio and Patrick Haffner “Gradient-based learning applied to document recognition” In Proceedings of the IEEE 86.11 Ieee, 1998, pp. 2278–2324
  • [LPM03] Ann Lee, Kim Pedersen and David Mumford “The Nonlinear Statistics of High-Contrast Patches in Natural Images” In International Journal of Computer Vision 54, 2003 DOI: 10.1023/A:1023705401078
  • [LSJ+16] Jason D Lee, Max Simchowitz, Michael I Jordan and Benjamin Recht “Gradient descent only converges to minimizers” In Conference on learning theory, 2016, pp. 1246–1257 PMLR
  • [Lee02] John M. Lee “Introduction to Smooth Manifolds”, 2002
  • [LMH+18] Jaakko Lehtinen, Jacob Munkberg, Jon Hasselgren, Samuli Laine, Tero Karras, Miika Aittala and Timo Aila “Noise2Noise: Learning Image Restoration without Clean Data” In Proceedings of the 35th International Conference on Machine Learning 80, Proceedings of Machine Learning Research PMLR, 2018, pp. 2965–2974 URL: https://proceedings.mlr.press/v80/lehtinen18a.html
  • [LY24] Gen Li and Yuling Yan “O (d/T) convergence theory for diffusion probabilistic models under minimal assumptions” In arXiv preprint arXiv:2409.18959, 2024
  • [LFD+22] Haochuan Li, Farzan Farnia, Subhro Das and Ali Jadbabaie “On convergence of gradient descent ascent: A tight local analysis” In International Conference on Machine Learning, 2022, pp. 12717–12740 PMLR
  • [Li17] Xi-Lin Li “Preconditioned stochastic gradient descent” In IEEE transactions on neural networks and learning systems 29.5 IEEE, 2017, pp. 1454–1466
  • [LZQ24] Wenda Li, Huijie Zhang and Qing Qu “Shallow diffuse: Robust and invisible watermarking through low-dimensional subspaces in diffusion models” In arXiv preprint arXiv:2410.21088, 2024
  • [LWQ25] Xiang Li, Rongrong Wang and Qing Qu “Towards Understanding the Mechanisms of Classifier-Free Guidance” In arXiv [cs.CV], 2025 arXiv: http://arxiv.org/abs/2505.19210
  • [LB19] Yanjun Li and Yoram Bresler “Multichannel sparse blind deconvolution on the sphere” In IEEE Transactions on Information Theory 65.11 IEEE, 2019, pp. 7415–7436
  • [LRZ+12] Xiao Liang, Xiang Ren, Zhengdong Zhang and Yi Ma “Repairing Sparse Low-Rank Texture” In Computer Vision – ECCV 2012 Berlin, Heidelberg: Springer Berlin Heidelberg, 2012, pp. 482–495
  • [LMB+14] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár and C Lawrence Zitnick “Microsoft coco: Common objects in context” In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, 2014, pp. 740–755 Springer
  • [LZ94] T Linder and R Zamir “On the asymptotic tightness of the Shannon lower bound” In IEEE transactions on information theory 40.6 Institute of Electrical and Electronics Engineers (IEEE), 1994, pp. 2026–2031 DOI: 10.1109/18.340474
  • [LMZ+24] Jiacheng Liu, Sewon Min, Luke Zettlemoyer, Yejin Choi and Hannaneh Hajishirzi “Infini-gram: Scaling unbounded n-gram language models to a trillion tokens” In arXiv preprint arXiv:2401.17377, 2024
  • [LSY+25] Jingyuan Liu, Jianlin Su, Xingcheng Yao, Zhejun Jiang, Guokun Lai, Yulun Du, Yidao Qin, Weixin Xu, Enzhe Lu and Junjie Yan “Muon is Scalable for LLM Training” In arXiv preprint arXiv:2502.16982, 2025
  • [LV09] Z. Liu and L. Vandenberghe “Semidefinite programming methods for system realization and identification” In Proceedings of the 48th IEEE Conference on Decision and Control (CDC) held jointly with 2009 28th Chinese Control Conference, 2009, pp. 4676–4681 DOI: 10.1109/CDC.2009.5400177
  • [LV10] Zhang Liu and Lieven Vandenberghe “Interior-Point Method for Nuclear Norm Approximation with Application to System Identification” In SIAM Journal on Matrix Analysis and Applications 31.3, 2010, pp. 1235–1256 DOI: 10.1137/090755436
  • [LH17] Ilya Loshchilov and Frank Hutter “Decoupled Weight Decay Regularization” In International Conference on Learning Representations, 2017 URL: https://api.semanticscholar.org/CorpusID:53592270
  • [MYP+10] Michel Journée, Yurii Nesterov, Peter Richtárik and Rodolphe Sepulchre “Generalized power method for sparse principal component analysis” In Journal of Machine Learning Research 11, 2010, pp. 517–553
  • [MDH+07] Y. Ma, H. Derksen, W. Hong and J. Wright “Segmentation of multivariate mixed data via lossy coding and compression” In IEEE Transactions on Pattern Analysis and Machine Intelligence (to appear), 2007
  • [MKS+04] Y. Ma, J. Košecká, S. Soatto and S. Sastry “An Invitation to 3-D Vision, From Images to Models” New York: Springer-Verlag, 2004
  • [MDH+07a] Yi Ma, Harm Derksen, Wei Hong and John Wright “Segmentation of multivariate mixed data via lossy data coding and compression” In IEEE transactions on pattern analysis and machine intelligence 29.9 IEEE, 2007, pp. 1546–1562
  • [MTS22] Yi Ma, Doris Tsao and Heung-Yeung Shum “On the principles of Parsimony and Self-consistency for the emergence of intelligence” In Frontiers Inf. Technol. Electron. Eng. 23.9, 2022, pp. 1298–1323 DOI: 10.1631/FITEE.2200297
  • [MHN13] Andrew L Maas, Awni Y Hannun and Andrew Y Ng “Rectifier nonlinearities improve neural network acoustic models” In Proc. ICML 30, 2013, pp. 3 Citeseer
  • [MBP14] Julien Mairal, Francis Bach and Jean Ponce “Sparse Modeling for Image and Vision Processing” In Foundations and Trends® in Computer Graphics and Vision 8.2-3 Now Publishers, 2014, pp. 85–283 DOI: 10.1561/0600000058
  • [MSM93] Mitchell P. Marcus, Beatrice Santorini and Mary Ann Marcinkiewicz “Building a Large Annotated Corpus of English: The Penn Treebank” In Computational Linguistics 19.2 Cambridge, MA: MIT Press, 1993, pp. 313–330 URL: https://aclanthology.org/J93-2004
  • [Mar06] Andrei Andreevich Markov “An example of statistical investigation of the text Eugene Onegin concerning the connection of samples in chains” In Science in Context 19.4 Cambridge University Press, 2006, pp. 591–600
  • [MBD+21] James Martens, Andy Ballard, Guillaume Desjardins, Grzegorz Swirszcz, Valentin Dalibard, Jascha Sohl-Dickstein and Samuel S Schoenholz “Rapid training of deep neural networks without skip connections or normalization layers using Deep Kernel Shaping” In arXiv [cs.LG], 2021 arXiv: http://arxiv.org/abs/2110.01765
  • [MC89] Michael McCloskey and Neal J Cohen “Catastrophic interference in connectionist networks: The sequential learning problem” In Psychology of learning and motivation 24 Elsevier, 1989, pp. 109–165
  • [MP43] Warren McCulloch and Walter Pitts “A Logical Calculus of the Ideas Immanent in Nervous Activity” In Bulletin of Mathematical Biophysics 5, 1943, pp. 115–133
  • [MM70] Jerry M. Mendel and Robert W. McLaren “Reinforcement-learning control and pattern recognition systems” In Adaptive, Learning and Pattern Recognition Systems: Theory and Applications, edited by J. M. Mendel and K. S. Fu, 1970, pp. 287–318
  • [MXB+16] Stephen Merity, Caiming Xiong, James Bradbury and Richard Socher “Pointer Sentinel Mixture Models”, 2016 arXiv:1609.07843 [cs.CL]
  • [MM12] Stephan Mertens and Cristopher Moore “Continuum percolation thresholds in two dimensions” In Phys. Rev. E 86 American Physical Society, 2012, pp. 061109 DOI: 10.1103/PhysRevE.86.061109
  • [Min54] Marvin Minsky “Theory of Neural-Analog Reinforcement Systems and its Application to the Brain-Model Problem”, 1954
  • [MP69] Marvin Minsky and Seymour Papert “Perceptrons: An Introduction to Computational Geometry” MIT Press, 1969
  • [Mir60] L Mirsky “Symmetric Gauge Functions and Unitarily Invariant Norms” In The Quarterly Journal of Mathematics 11.1 Oxford Academic, 1960, pp. 50–59 DOI: 10.1093/qmath/11.1.50
  • [Miy61] K. Miyasawa “An empirical bayes estimator of the mean of a normal population” In Bull. Inst. Internat. Statist. 38, 1961
  • [MKK+18] Takeru Miyato, Toshiki Kataoka, Masanori Koyama and Yuichi Yoshida “Spectral normalization for generative adversarial networks” In arXiv preprint arXiv:1802.05957, 2018
  • [MRY+11] Hossein Mobahi, Shankar Rao, Allen Yang, Shankar Sastry and Yi Ma “Segmentation of Natural Images by Texture and Boundary Compression” In International Journal of Computer Vision 95.1, 2011, pp. 86–98
  • [MLE19] Vishal Monga, Yuelong Li and Yonina C Eldar “Algorithm unrolling: Interpretable, efficient deep learning for signal and image processing” In arXiv preprint arXiv:1912.10557, 2019
  • [MKH19] Rafael Müller, Simon Kornblith and Geoffrey E Hinton “When does label smoothing help?” In Advances in neural information processing systems 32, 2019
  • [Mum96] David Mumford “The Statistical Description of Visual Signals”, 1996 URL: https://api.semanticscholar.org/CorpusID:14049135
  • [MG99] David Mumford and Basilis Gidas “Stochastic Models for Generic Images” In Quarterly of Applied Mathematics 59, 1999 DOI: 10.1090/qam/1811096
  • [MK07] Joseph F Murray and Kenneth Kreutz-Delgado “Learning sparse overcomplete codes for images” In The Journal of VLSI Signal Processing Systems for Signal Image and Video Technology 46.1 Springer Science+Business Media LLC, 2007, pp. 1–13 DOI: 10.1007/s11265-006-0003-z
  • [NDE+13] S. Nam, M.E. Davies, M. Elad and R. Gribonval “The cosparse analysis model and algorithms” In Applied and Computational Harmonic Analysis 34.1, 2013, pp. 30–56 DOI: 10.1016/j.acha.2012.03.006
  • [NZM+24] Matthew Niedoba, Berend Zwartsenberg, Kevin Murphy and Frank Wood “Towards a Mechanistic Explanation of Diffusion Model Generalization” In arXiv preprint arXiv:2411.19339, 2024
  • [NW06] Jorge Nocedal and Stephen Wright “Numerical optimization” Springer Science & Business Media, 2006
  • [NIG+18] Chigozie Nwankpa, Winifred Ijomah, Anthony Gachagan and Stephen Marshall “Activation functions: Comparison of trends in practice and research for deep learning” In arXiv preprint arXiv:1811.03378, 2018
  • [Oja82] Erkki Oja “A simplified neuron model as a principal component analyzer” In Journal of Mathematical Biology 15, 1982, pp. 267–273 URL: https://api.semanticscholar.org/CorpusID:16577977
  • [OF97] B A Olshausen and D J Field “Sparse coding with an overcomplete basis set: a strategy employed by V1?” In Vision research 37.23, 1997, pp. 3311–3325 URL: https://www.ncbi.nlm.nih.gov/pubmed/9425546
  • [OF96] Bruno A Olshausen and David J Field “Emergence of simple-cell receptive field properties by learning a sparse code for natural images” In Nature 381 Nature Publishing Group, 1996, pp. 607–609 DOI: 10.1038/381607a0
  • [OVK17] Aäron van den Oord, Oriol Vinyals and Koray Kavukcuoglu “Neural discrete representation learning” In arXiv [cs.LG], 2017 arXiv: http://arxiv.org/abs/1711.00937
  • [ODM+23] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa and Alaaeldin El-Nouby “Dinov2: Learning robust visual features without supervision” In arXiv preprint arXiv:2304.07193, 2023
  • [PBW+24] Druv Pai, Sam Buchanan, Ziyang Wu, Yaodong Yu and Yi Ma “Masked Completion via Structured Diffusion with White-Box Transformers” In The Twelfth International Conference on Learning Representations, 2024 URL: https://openreview.net/forum?id=PvyOYleymy
  • [PPC+23] Druv Pai, Michael Psenka, Chih-Yuan Chiu, Manxi Wu, Edgar Dobriban and Yi Ma “Pursuit of a discriminative representation for multiple subspaces via sequential games” In Journal of the Franklin Institute 360.6, 2023, pp. 4135–4171 DOI: 10.1016/j.jfranklin.2023.02.011
  • [PKL+16] Denis Paperno, Germán Kruszewski, Angeliki Lazaridou, Quan Ngoc Pham, Raffaella Bernardi, Sandro Pezzelle, Marco Baroni, Gemma Boleda and Raquel Fernández “The LAMBADA dataset: Word prediction requiring a broad discourse context”, 2016 arXiv: https://arxiv.org/abs/1606.06031
  • [PHD20] Vardan Papyan, XY Han and David L Donoho “Prevalence of Neural Collapse during the terminal phase of deep learning training” In arXiv preprint arXiv:2008.08186, 2020
  • [PRE17] Vardan Papyan, Yaniv Romano and Michael Elad “Convolutional neural networks analyzed via convolutional sparse coding” In The Journal of Machine Learning Research 18.1 JMLR. org, 2017, pp. 2887–2938
  • [PCV24] Kiho Park, Yo Joong Choe and Victor Veitch “The Linear Representation Hypothesis and the Geometry of Large Language Models” In International Conference on Machine Learning, 2024, pp. 39643–39666 PMLR
  • [PSR+23] Georgios Pavlakos, Dandan Shan, Ilija Radosavovic, Angjoo Kanazawa, David Fouhey and Jitendra Malik “Reconstructing Hands in 3D with Transformers” In arXiv preprint, 2023
  • [Pea01] K. Pearson “On Lines and Planes of Closest Fit to Systems of Points in Space” In Philosophical Magazine 2.6, 1901, pp. 559–572
  • [PW22] Yury Polyanskiy and Yihong Wu “Information Theory: From Coding to Learning” Cambridge University Press, 2022
  • [PPR+24] Michael Psenka, Druv Pai, Vishal Raman, Shankar Sastry and Yi Ma “Representation Learning via Manifold Flattening and Reconstruction” In Journal of Machine Learning Research 25.132, 2024, pp. 1–47 URL: http://jmlr.org/papers/v25/23-0615.html
  • [QLZ19] Qing Qu, Xiao Li and Zhihui Zhu “A nonconvex approach for exact and efficient multichannel sparse blind deconvolution” In Advances in Neural Information Processing Systems, 2019, pp. 4017–4028
  • [QLZ20] Qing Qu, Xiao Li and Zhihui Zhu “Exact Recovery of Multichannel Sparse Blind Deconvolution via Gradient Descent” In SIAM Journal on Imaging Sciences 13.3, 2020, pp. 1630–1652
  • [QZL+20] Qing Qu, Yuexiang Zhai, Xiao Li, Yuqian Zhang and Zhihui Zhu “Geometric Analysis of Nonconvex Optimization Landscapes for Overcomplete Learning” In International Conference on Learning Representations, 2020 URL: https://openreview.net/forum?id=rygixkHKDH
  • [QZL+20a] Qing Qu, Zhihui Zhu, Xiao Li, Manolis C. Tsakiris, John Wright and René Vidal “Finding the Sparsest Vectors in a Subspace: Theory, Algorithms, and Applications”, 2020 arXiv: https://arxiv.org/abs/2001.06970
  • [RD03] Ronen Basri and David W. Jacobs “Lambertian reflectance and linear subspaces” In IEEE Transactions on Pattern Analysis and Machine Intelligence 25.2, 2003, pp. 218–233
  • [RKH+21] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger and Ilya Sutskever “Learning Transferable Visual Models From Natural Language Supervision” In Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event 139, Proceedings of Machine Learning Research PMLR, 2021, pp. 8748–8763 URL: http://proceedings.mlr.press/v139/radford21a.html
  • [RMC16] Alec Radford, Luke Metz and Soumith Chintala “Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks” In arXiv preprint arXiv:1511.06434, 2016 arXiv:1511.06434 [cs.LG]
  • [RWC+19] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei and Ilya Sutskever “Language models are unsupervised multitask learners” In OpenAI blog 1.8, 2019, pp. 9
  • [RDN+22] Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu and Mark Chen “Hierarchical Text-Conditional Image Generation with CLIP Latents” In arXiv [cs.CV], 2022 arXiv: http://arxiv.org/abs/2204.06125
  • [RPC+06] Marc’Aurelio Ranzato, Christopher Poultney, Sumit Chopra and Yann Cun “Efficient Learning of Sparse Representations with an Energy-Based Model” In Advances in Neural Information Processing Systems 19 MIT Press, 2006 URL: https://papers.nips.cc/paper/3112-efficient-learning-of-sparse-representations-with-an-energy-based-model
  • [RS11] M Raphan and E P Simoncelli “Least squares estimation without priors or supervision” Published online, Nov 2010. In Neural Computation 23.2, 2011, pp. 374–420 DOI: 10.1162/NECO_a_00076
  • [RKS+17] Sylvestre-Alvise Rebuffi, Alexander Kolesnikov, Georg Sperl and Christoph H Lampert “iCaRL: Incremental classifier and representation learning” In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, 2017, pp. 2001–2010
  • [RBK18] Erwin Riegler, Helmut Bölcskei and Günther Koliander “Rate-distortion theory for general sets and measures” In 2018 IEEE International Symposium on Information Theory (ISIT) IEEE, 2018, pp. 101–105 DOI: 10.1109/isit.2018.8437740
  • [RKB23] Erwin Riegler, Günther Koliander and Helmut Bölcskei “Lossy compression of general random variables” In Information and inference: a journal of the IMA 12.3 Oxford University Press (OUP), 2023, pp. 1759–1829 DOI: 10.1093/imaiai/iaac035
  • [Ris78] J. Rissanen “Modeling by shortest data description” In Automatica 14.5 USA: Pergamon Press, Inc., 1978, pp. 465–471 DOI: 10.1016/0005-1098(78)90005-5
  • [Rob56] Herbert E. Robbins “An Empirical Bayes Approach to Statistics”, 1956 URL: https://api.semanticscholar.org/CorpusID:26161481
  • [RBL+22] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser and Björn Ommer “High-Resolution Image Synthesis with Latent Diffusion Models” In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18-24, 2022 IEEE, 2022, pp. 10674–10685 DOI: 10.1109/CVPR52688.2022.01042
  • [RFB15] Olaf Ronneberger, Philipp Fischer and Thomas Brox “U-net: Convolutional networks for biomedical image segmentation” In International Conference on Medical image computing and computer-assisted intervention, 2015, pp. 234–241 Springer
  • [RAL+24] François Rozet, Gérôme Andry, Francois Lanusse and Gilles Louppe “Learning Diffusion Priors from Observations by Expectation Maximization” In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024 URL: https://openreview.net/forum?id=7v88Fh6iSM
  • [RE14] R. Rubinstein and M. Elad “Dictionary Learning for Analysis-Synthesis Thresholding” In IEEE Transactions on Signal Processing 62.22, 2014, pp. 5962–5972
  • [RHW86] D. E. Rumelhart, G. E. Hinton and R. J. Williams “Learning internal representations by error propagation” In Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1: Foundations Cambridge, MA, USA: MIT Press, 1986, pp. 318–362
  • [RHW86a] David E. Rumelhart, Geoffrey E. Hinton and Ronald J. Williams “Learning representations by back-propagating errors” In Nature 323.6088, 1986, pp. 533–536 DOI: 10.1038/323533a0
  • [SCS+22] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L. Denton, Seyed Kamyar Seyed Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, Jonathan Ho, David J. Fleet and Mohammad Norouzi “Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding” In NeurIPS, 2022 URL: http://papers.nips.cc/paper%5C_files/paper/2022/hash/ec795aeadae0b7d230fa35cbaf04c041-Abstract-Conference.html
  • [Sas99] Shankar Sastry “Nonlinear Systems: Analysis, Stability, and Control” Springer, 1999
  • [SHT+25] Fabian Schaipp, Alexander Hägele, Adrien Taylor, Umut Simsekli and Francis Bach “The Surprising Agreement Between Convex Optimization Theory and Learning-Rate Scheduling for Large Model Training” In arXiv preprint arXiv:2501.18965, 2025
  • [SMB10] Dominik Scherer, Andreas Müller and Sven Behnke “Evaluation of pooling operations in convolutional architectures for object recognition” In International conference on artificial neural networks, 2010, pp. 92–101 Springer
  • [SBZ+25] Jay Shah, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ramani and Tri Dao “Flashattention-3: Fast and accurate attention with asynchrony and low-precision” In Advances in Neural Information Processing Systems 37, 2025, pp. 68658–68685
  • [Sha48] C. E. Shannon “A mathematical theory of communication” In The Bell System Technical Journal 27.3, 1948, pp. 379–423 DOI: 10.1002/j.1538-7305.1948.tb01338.x
  • [Sha59] Claude E Shannon “Coding theorems for a discrete source with a fidelity criterion” In IRE National Convention Record 7.4, 1959, pp. 142–163
  • [SMM+17] Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton and Jeff Dean “Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer” In ICLR, 2017 URL: https://openreview.net/pdf?id=B1ckMDqlg
  • [SZ14] Karen Simonyan and Andrew Zisserman “Very deep convolutional networks for large-scale image recognition” In arXiv preprint arXiv:1409.1556, 2014
  • [SZ15] Karen Simonyan and Andrew Zisserman “Very Deep Convolutional Networks for Large-Scale Image Recognition” In International Conference on Learning Representations, 2015
  • [SWM+15] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan and Surya Ganguli “Deep Unsupervised Learning using Nonequilibrium Thermodynamics” In Proceedings of the 32nd International Conference on Machine Learning 37, Proceedings of Machine Learning Research Lille, France: PMLR, 2015, pp. 2256–2265 URL: https://proceedings.mlr.press/v37/sohl-dickstein15.html
  • [SME20] Jiaming Song, Chenlin Meng and Stefano Ermon “Denoising diffusion implicit models” In arXiv preprint arXiv:2010.02502, 2020
  • [SE19] Yang Song and Stefano Ermon “Generative Modeling by Estimating Gradients of the Data Distribution” In Advances in Neural Information Processing Systems 32 Curran Associates, Inc., 2019 URL: https://proceedings.neurips.cc/paper_files/paper/2019/file/3001ef257407d5a371a96dcd947c7d93-Paper.pdf
  • [SSK+21] Yang Song, Jascha Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon and Ben Poole “Score-Based Generative Modeling through Stochastic Differential Equations” In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021 OpenReview.net, 2021 URL: https://openreview.net/forum?id=PxTIG12RRHS
  • [SWW12] Daniel A Spielman, Huan Wang and John Wright “Exact Recovery of Sparsely-Used Dictionaries” In Proceedings of the 25th Annual Conference on Learning Theory 23, Proceedings of Machine Learning Research Edinburgh, Scotland: PMLR, 2012, pp. 37.1–37.18 URL: https://proceedings.mlr.press/v23/spielman12.html
  • [SHK+14] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever and Ruslan Salakhutdinov “Dropout: a simple way to prevent neural networks from overfitting” In The journal of machine learning research 15.1 JMLR. org, 2014, pp. 1929–1958
  • [SQW15] Ju Sun, Qing Qu and John Wright “When are nonconvex problems not scary?” In arXiv preprint arXiv:1510.06096, 2015
  • [SQW17] Ju Sun, Qing Qu and John Wright “Complete Dictionary Recovery Over the Sphere I: Overview and the Geometric Picture” In IEEE Transactions on Information Theory 63.2, 2017, pp. 853–884
  • [SQW17a] Ju Sun, Qing Qu and John Wright “Complete dictionary recovery over the sphere I: Overview and the geometric picture” In IEEE Transactions on Information Theory 63.2 IEEE, 2017, pp. 853–884
  • [SB18] Richard S. Sutton and Andrew G. Barto “Reinforcement Learning: An Introduction” Cambridge, MA, USA: A Bradford Book, 2018
  • [SLJ+14] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott E. Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke and Andrew Rabinovich “Going deeper with convolutions” In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 1–9 URL: https://api.semanticscholar.org/CorpusID:206592484
  • [Tea] Moonshot AI Team “Kimi K2: Open Agentic Intelligence” Moonshot AI URL: https://moonshotai.github.io/Kimi-K2/
  • [Tel16] Matus Telgarsky “Benefits of depth in neural networks” In 29th Annual Conference on Learning Theory 49, Proceedings of Machine Learning Research Columbia University, New York, New York, USA: PMLR, 2016, pp. 1517–1539 URL: https://proceedings.mlr.press/v49/telgarsky16.html
  • [Til15] Andreas M Tillmann “On the computational intractability of exact and approximate dictionary learning” In IEEE signal processing letters 22.1 Institute of Electrical and Electronics Engineers (IEEE), 2015, pp. 45–49 DOI: 10.1109/lsp.2014.2345761
  • [TB99] M. Tipping and C. Bishop “Probabilistic principal component analysis” In Journal of the Royal Statistical Society: Series B 61.3, 1999, pp. 611–622
  • [TZ15] Naftali Tishby and Noga Zaslavsky “Deep learning and the information bottleneck principle” In 2015 IEEE Information Theory Workshop (ITW), 2015, pp. 1–5 IEEE
  • [TDC+24] Shengbang Tong, Xili Dai, Yubei Chen, Mingyang Li, ZENGYI LI, Brent Yi, Yann LeCun and Yi Ma “Unsupervised Learning of Structured Representation via Closed-Loop Transcription” In Conference on Parsimony and Learning 234, Proceedings of Machine Learning Research PMLR, 2024, pp. 440–457 URL: https://proceedings.mlr.press/v234/tong24a.html
  • [TDW+23] Shengbang Tong, Xili Dai, Ziyang Wu, Mingyang Li, Brent Yi and Yi Ma “Incremental Learning of Structured Memory via Closed-Loop Transcription” In The Eleventh International Conference on Learning Representations, 2023 URL: https://openreview.net/forum?id=XrgjF5-M3xi
  • [TCD+20] Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles and Hervé Jégou “Training data-efficient image transformers & distillation through attention” In arXiv preprint arXiv:2012.12877, 2020
  • [Tu07] Zhuowen Tu “Learning Generative Models via Discriminative Approaches” In 2007 IEEE Conference on Computer Vision and Pattern Recognition, 2007, pp. 1–8 DOI: 10.1109/CVPR.2007.383035
  • [Tur50] Alan Turing “Computing Machinery and Intelligence” In Mind 59, 1950, pp. 433–460
  • [Tur36] Alan M. Turing “On Computable Numbers, with an Application to the Entscheidungsproblem” In Proceedings of the London Mathematical Society 2.42, 1936, pp. 230–265
  • [UVL16] Dmitry Ulyanov, Andrea Vedaldi and Victor Lempitsky “Instance normalization: The missing ingredient for fast stylization” In arXiv preprint arXiv:1607.08022, 2016
  • [VM96] P. Van Overschee and B. De Moor “Subspace Identification for Linear Systems” Kluwer Academic, 1996
  • [VSP+17] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser and Illia Polosukhin “Attention is all you need” In Advances in neural information processing systems 30, 2017
  • [VST+20] Gido M. van de Ven, Hava T Siegelmann and Andreas S Tolias “Brain-inspired replay for continual learning with artificial neural networks” In Nature Communications 11.1 Nature Publishing Group, 2020, pp. 1–14
  • [VJO+21] Luca Venturi, Samy Jelassi, Tristan Ozuch and Joan Bruna “Depth separation beyond radial functions” In Journal of Machine Learning Research 23, 2021, pp. 122:1–122:56 DOI: 10.5555/3586589.3586711
  • [Ver18] Roman Vershynin “High-dimensional probability: An introduction with applications in data science” Cambridge University Press, 2018
  • [VM04] R. Vidal and Y. Ma “A unified algebraic approach to 2-D and 3-D motion segmentation” In Proceedings of the European Conference on Computer Vision, 2004
  • [VMS16] Rene Vidal, Yi Ma and S. S. Sastry “Generalized Principal Component Analysis” Springer Publishing Company, Incorporated, 2016
  • [VMS05] Rene Vidal, Yi Ma and Shankar Sastry “Generalized principal component analysis” In IEEE transactions on pattern analysis and machine intelligence 27.12 IEEE, 2005, pp. 1945–1959
  • [Vin11] Pascal Vincent “A Connection Between Score Matching and Denoising Autoencoders” In Neural Computation 23.7, 2011, pp. 1661–1674 DOI: 10.1162/NECO_a_00142
  • [WWG+12] Andrew Wagner, John Wright, Arvind Ganesh, Zihan Zhou, Hossein Mobahi and Yi Ma “Toward a practical face recognition system: Robust alignment and illumination by sparse representation” In IEEE Transactions on Pattern Analysis and Machine Intelligence 34.2 IEEE, 2012, pp. 372–386
  • [WB68] C. Wallace and D. Boulton “An Information Measure for Classification” In The Computer Journal 11, 1968, pp. 185–194
  • [WD99] C. Wallace and D. Dowe “Minimum message length and Kolmogorov complexity” In The Computer Journal 42.4, 1999, pp. 270–283
  • [WF65] Marion Dwain Waltz and King-Sun Fu “A heuristic approach to reinforcement learning control systems” In IEEE Transactions on Automatic Control 10, 1965, pp. 390–398 URL: https://api.semanticscholar.org/CorpusID:62489576
  • [WLP+24] Peng Wang, Huikang Liu, Druv Pai, Yaodong Yu, Zhihui Zhu, Qing Qu and Yi Ma “A Global Geometric Analysis of Maximal Coding Rate Reduction” In arXiv preprint arXiv:2406.01909, 2024
  • [WLY+25] Peng Wang, Yifu Lu, Yaodong Yu, Druv Pai, Qing Qu and Yi Ma “Attention-Only Transformers via Unrolled Subspace Denoising” In arXiv preprint arXiv:2506.03790, 2025
  • [WZZ+24] Peng Wang, Huijie Zhang, Zekai Zhang, Siyi Chen, Yi Ma and Qing Qu “Diffusion Models Learn Low-Dimensional Distributions via Subspace Clustering” In arXiv preprint arXiv:2409.02426, 2024
  • [WGY+23] Xudong Wang, Rohit Girdhar, Stella X Yu and Ishan Misra “Cut and learn for unsupervised object detection and instance segmentation” In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2023, pp. 3124–3134
  • [Wer74] P.. Werbos “Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences” PhD Thesis, Applied Mathematics Dept., Harvard Univ., 1974
  • [Wer94] Paul J. Werbos “The Roots of Backpropagation: From Ordered Derivatives to Neural Networks and Political Forecasting”, 1994 URL: https://api.semanticscholar.org/CorpusID:60847433
  • [WB18] T. Wiatowski and H. Bölcskei “A Mathematical Theory of Deep Convolutional Neural Networks for Feature Extraction” In IEEE Transactions on Information Theory, 2018
  • [Wie42] Norbert Wiener “The interpolation, extrapolation and smoothing of stationary time series” In Report of the Services 19, Research Project DIC-6037, MIT, 1942
  • [Wie48] Norbert Wiener “Cybernetics: Or Control and Communication in the Animal and the Machine” MIT Press, 1948
  • [Wie49] Norbert Wiener “Extrapolation, Interpolation, and Smoothing of Stationary Time Series” New York: Wiley, 1949
  • [Wie61] Norbert Wiener “Cybernetics: Or Control and Communication in the Animal and the Machine”, second edition MIT Press, 1961
  • [WM21] John Wright and Yi Ma “High-Dimensional Data Analysis with Low-Dimensional Models: Principles, Computation, and Applications” Cambridge University Press, 2021
  • [WM22] John Wright and Yi Ma “High-Dimensional Data Analysis with Low-Dimensional Models: Principles, Computation, and Applications” Cambridge University Press, 2022
  • [WTL+08] John Wright, Yangyu Tao, Zhouchen Lin, Yi Ma and Heung-Yeung Shum “Classification via minimum incremental coding length (MICL)” In Advances in Neural Information Processing Systems, 2008, pp. 1633–1640
  • [WYG+09] John Wright, Allen Y. Yang, Arvind Ganesh, S. Sastry and Yi Ma “Robust Face Recognition via Sparse Representation” In IEEE Trans. Pattern Anal. Mach. Intell. 31.2 Washington, DC, USA: IEEE Computer Society, 2009, pp. 210–227 DOI: 10.1109/TPAMI.2008.79
  • [WX20] Denny Wu and J. Xu “On the Optimal Weighted ℓ₂ Regularization in Overparameterized Linear Regression” In arXiv preprint arXiv:2006.05800, 2020
  • [WTN+23] Luhuan Wu, Brian L. Trippe, Christian A Naesseth, John Patrick Cunningham and David Blei “Practical and Asymptotically Exact Conditional Sampling in Diffusion Models” In Thirty-seventh Conference on Neural Information Processing Systems, 2023 URL: https://openreview.net/forum?id=eWKqr1zcRv
  • [WCL+24] Yuchen Wu, Minshuo Chen, Zihao Li, Mengdi Wang and Yuting Wei “Theoretical Insights for Diffusion Guidance: A Case Study for Gaussian Mixture Models” In arXiv [cs.LG], 2024 arXiv: http://arxiv.org/abs/2403.01639
  • [WH18] Yuxin Wu and Kaiming He “Group normalization” In Proceedings of the European conference on computer vision (ECCV), 2018, pp. 3–19
  • [WDL+25] Ziyang Wu, Tianjiao Ding, Yifu Lu, Druv Pai, Jingyuan Zhang, Weida Wang, Yaodong Yu, Yi Ma and Benjamin David Haeffele “Token Statistics Transformer: Linear-Time Attention via Variational Rate Reduction” In The Thirteenth International Conference on Learning Representations, 2025
  • [XGD+17] Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu and Kaiming He “Aggregated Residual Transformations for Deep Neural Networks” In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 5987–5995
  • [XWC+15] Bing Xu, Naiyan Wang, Tianqi Chen and Mu Li “Empirical evaluation of rectified activations in convolutional network” In arXiv preprint arXiv:1505.00853, 2015
  • [YHB+22] Greg Yang, Edward J Hu, Igor Babuschkin, Szymon Sidor, Xiaodong Liu, David Farhi, Nick Ryder, Jakub Pachocki, Weizhu Chen and Jianfeng Gao “Tensor programs v: Tuning large neural networks via zero-shot hyperparameter transfer” In arXiv preprint arXiv:2203.03466, 2022
  • [YH21] Greg Yang and J E Hu “Tensor Programs IV: Feature learning in infinite-width neural networks” In International Conference on Machine Learning 139 PMLR, 2021, pp. 11727–11737 URL: https://proceedings.mlr.press/v139/yang21c
  • [YB99] Yuhong Yang and Andrew Barron “Information-theoretic determination of minimax rates of convergence” In Annals of statistics 27.5 Institute of Mathematical Statistics, 1999, pp. 1564–1599 DOI: 10.1214/aos/1017939142
  • [YYY+20] Zitong Yang, Yaodong Yu, Chong You, Jacob Steinhardt and Yi Ma “Rethinking Bias-Variance Trade-off for Generalization of Neural Networks” In International Conference on Machine Learning, 2020
  • [YYZ+24] Brent Yi, Vickie Ye, Maya Zheng, Lea Müller, Georgios Pavlakos, Yi Ma, Jitendra Malik and Angjoo Kanazawa “Estimating Body and Hand Motion in an Ego-sensed World” In arXiv preprint arXiv:2410.03665, 2024
  • [YZB+23] Brent Yi, Weijia Zeng, Sam Buchanan and Yi Ma “Canonical Factors for Hybrid Neural Fields” In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023, pp. 3414–3426
  • [YBP+24] Yaodong Yu, Sam Buchanan, Druv Pai, Tianzhe Chu, Ziyang Wu, Shengbang Tong, Hao Bai, Yuexiang Zhai, Benjamin D Haeffele and Yi Ma “White-Box Transformers via Sparse Rate Reduction: Compression Is All There Is?” In Journal of Machine Learning Research 25.300, 2024, pp. 1–128
  • [YBP+23] Yaodong Yu, Sam Buchanan, Druv Pai, Tianzhe Chu, Ziyang Wu, Shengbang Tong, Benjamin Haeffele and Yi Ma “White-box transformers via sparse rate reduction” In Advances in Neural Information Processing Systems 36, 2023
  • [YCY+20] Yaodong Yu, Kwan Ho Ryan Chan, Chong You, Chaobing Song and Yi Ma “Learning Diverse and Discriminative Representations via the Principle of Maximal Coding Rate Reduction” In Advances in neural information processing systems, 2020
  • [YCO+21] Zeyu Yun, Yubei Chen, Bruno A Olshausen and Yann LeCun “Transformer visualization via dictionary learning: contextualized embedding as a linear superposition of transformer factors” In arXiv preprint arXiv:2103.15949, 2021
  • [ZKR+17] Manzil Zaheer, Satwik Kottur, Siamak Ravanbakhsh, Barnabas Poczos, Russ R Salakhutdinov and Alexander J Smola “Deep Sets” In Advances in Neural Information Processing Systems 30 Curran Associates, Inc., 2017, pp. 3391–3401 URL: http://papers.nips.cc/paper/6931-deep-sets.pdf
  • [ZAK24] Moslem Zamani, Hadi Abbaszadehpeivasti and Etienne de Klerk “Convergence rate analysis of the gradient descent–ascent method for convex–concave saddle-point problems” In Optimization Methods and Software 39.5 Taylor & Francis, 2024, pp. 967–989
  • [ZMZ+20] Yuexiang Zhai, Hermish Mehta, Zhengyuan Zhou and Yi Ma “Understanding ℓ4-based Dictionary Learning: Interpretation, Stability, and Robustness” In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020 OpenReview.net, 2020 URL: https://openreview.net/forum?id=SJeY-1BKDS
  • [ZCB+24] Bingliang Zhang, Wenda Chu, Julius Berner, Chenlin Meng, Anima Anandkumar and Yang Song “Improving diffusion inverse problem solving with decoupled noise annealing” In arXiv preprint arXiv:2407.01521, 2024 URL: http://arxiv.org/abs/2407.01521
  • [ZBH+17] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht and Oriol Vinyals “Understanding deep learning requires rethinking generalization” In International Conference on Learning Representations, 2017
  • [ZLG+10] Zhengdong Zhang, Xiao Liang, Arvind Ganesh and Yi Ma “TILT: Transform Invariant Low-Rank Textures” In International Journal of Computer Vision 99, 2010, pp. 1–24 URL: https://api.semanticscholar.org/CorpusID:734744
  • [ZLC+23] Zibo Zhao, Wen Liu, Xin Chen, Xianfang Zeng, Rui Wang, Pei Cheng, Bin Fu, Tao Chen, Gang Yu and Shenghua Gao “Michelangelo: Conditional 3D Shape Generation based on Shape-Image-Text Aligned Latent Representation” In Thirty-seventh Conference on Neural Information Processing Systems, 2023 URL: https://openreview.net/forum?id=xmxgMij3LY
  • [ZCZ+25] Hongkai Zheng, Wenda Chu, Bingliang Zhang, Zihui Wu, Austin Wang, Berthy Feng, Caifeng Zou, Yu Sun, Nikola Borislavov Kovachki, Zachary E Ross, Katherine Bouman and Yisong Yue “InverseBench: Benchmarking Plug-and-Play Diffusion Priors for Inverse Problems in Physical Sciences” In The Thirteenth International Conference on Learning Representations, 2025 URL: https://openreview.net/forum?id=U3PBITXNG6
  • [ZLG+] Ziyan Zheng, Chin Wa Lau, Nian Guo, Xiang Shi and Shao-Lun Huang “White-box error correction code transformer” In The Second Conference on Parsimony and Learning (Proceedings Track)
  • [ZM97] Song Chun Zhu and David Mumford “Prior Learning and Gibbs Reaction-Diffusion” In IEEE Trans. Pattern Anal. Mach. Intell. 19.11 USA: IEEE Computer Society, 1997, pp. 1236–1250 DOI: 10.1109/34.632983
  • [ZWM97] Song Chun Zhu, Ying Nian Wu and David Mumford “Minimax Entropy Principle and Its Application to Texture Modeling” In Neural Computation 9.8, 1997, pp. 1627–1660 DOI: 10.1162/neco.1997.9.8.1627
  • [ZM97a] Song-Chun Zhu and David Mumford “Learning generic prior models for visual computation” In Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 1997, pp. 463–469 URL: https://api.semanticscholar.org/CorpusID:12762065
  • [ZDZ+21] Zhihui Zhu, Tianyu Ding, Jinxin Zhou, Xiao Li, Chong You, Jeremias Sulam and Qing Qu “A geometric analysis of neural collapse with unconstrained features” In Advances in Neural Information Processing Systems 34, 2021, pp. 29820–29834
  • [ZL17] Barret Zoph and Quoc V. Le “Neural Architecture Search with Reinforcement Learning”, 2017 URL: https://arxiv.org/abs/1611.01578