Skip to main content

Showing 1–25 of 25 results for author: Chatterji, N

Searching in archive cs. Search in all archives.
.
  1. arXiv:2407.21783  [pdf, other

    cs.AI cs.CL cs.CV

    The Llama 3 Herd of Models

    Authors: Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, Arun Rao, Aston Zhang, Aurelien Rodriguez, Austen Gregerson, Ava Spataru, Baptiste Roziere, Bethany Biron, Binh Tang , et al. (510 additional authors not shown)

    Abstract: Modern artificial intelligence (AI) systems are powered by foundation models. This paper presents a new set of foundation models, called Llama 3. It is a herd of language models that natively support multilinguality, coding, reasoning, and tool usage. Our largest model is a dense Transformer with 405B parameters and a context window of up to 128K tokens. This paper presents an extensive empirical… ▽ More

    Submitted 15 August, 2024; v1 submitted 31 July, 2024; originally announced July 2024.

  2. arXiv:2310.17623  [pdf, other

    cs.CL cs.LG

    Proving Test Set Contamination in Black Box Language Models

    Authors: Yonatan Oren, Nicole Meister, Niladri Chatterji, Faisal Ladhak, Tatsunori B. Hashimoto

    Abstract: Large language models are trained on vast amounts of internet data, prompting concerns and speculation that they have memorized public benchmarks. Going from speculation to proof of contamination is challenging, as the pretraining data used by proprietary models are often not publicly accessible. We show that it is possible to provide provable guarantees of test set contamination in language model… ▽ More

    Submitted 23 November, 2023; v1 submitted 26 October, 2023; originally announced October 2023.

  3. arXiv:2211.09110  [pdf, other

    cs.CL cs.AI cs.LG

    Holistic Evaluation of Language Models

    Authors: Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, Benjamin Newman, Binhang Yuan, Bobby Yan, Ce Zhang, Christian Cosgrove, Christopher D. Manning, Christopher Ré, Diana Acosta-Navas, Drew A. Hudson, Eric Zelikman, Esin Durmus, Faisal Ladhak, Frieda Rong, Hongyu Ren, Huaxiu Yao , et al. (25 additional authors not shown)

    Abstract: Language models (LMs) are becoming the foundation for almost all major language technologies, but their capabilities, limitations, and risks are not well understood. We present Holistic Evaluation of Language Models (HELM) to improve the transparency of language models. First, we taxonomize the vast space of potential scenarios (i.e. use cases) and metrics (i.e. desiderata) that are of interest fo… ▽ More

    Submitted 1 October, 2023; v1 submitted 16 November, 2022; originally announced November 2022.

    Comments: Authored by the Center for Research on Foundation Models (CRFM) at the Stanford Institute for Human-Centered Artificial Intelligence (HAI). Project page: https://crfm.stanford.edu/helm/v1.0

    Journal ref: Published in Transactions on Machine Learning Research (TMLR), 2023

  4. arXiv:2209.09315  [pdf, other

    cs.LG cs.AI math.ST stat.ML

    Deep Linear Networks can Benignly Overfit when Shallow Ones Do

    Authors: Niladri S. Chatterji, Philip M. Long

    Abstract: We bound the excess risk of interpolating deep linear networks trained using gradient flow. In a setting previously used to establish risk bounds for the minimum $\ell_2$-norm interpolant, we show that randomly initialized deep linear networks can closely approximate or even match known bounds for the minimum $\ell_2$-norm interpolant. Our analysis also reveals that interpolating deep linear model… ▽ More

    Submitted 6 February, 2023; v1 submitted 19 September, 2022; originally announced September 2022.

  5. arXiv:2205.13094  [pdf, other

    cs.LG cs.AI math.ST stat.ML

    Undersampling is a Minimax Optimal Robustness Intervention in Nonparametric Classification

    Authors: Niladri S. Chatterji, Saminul Haque, Tatsunori Hashimoto

    Abstract: While a broad range of techniques have been proposed to tackle distribution shift, the simple baseline of training on an $\textit{undersampled}$ balanced dataset often achieves close to state-of-the-art-accuracy across several popular benchmarks. This is rather surprising, since undersampling algorithms discard excess majority group data. To understand this phenomenon, we ask if learning is fundam… ▽ More

    Submitted 19 June, 2023; v1 submitted 25 May, 2022; originally announced May 2022.

  6. arXiv:2202.07626  [pdf, other

    cs.LG math.ST stat.ML

    Random Feature Amplification: Feature Learning and Generalization in Neural Networks

    Authors: Spencer Frei, Niladri S. Chatterji, Peter L. Bartlett

    Abstract: In this work, we provide a characterization of the feature-learning process in two-layer ReLU networks trained by gradient descent on the logistic loss following random initialization. We consider data with binary labels that are generated by an XOR-like function of the input features. We permit a constant fraction of the training labels to be corrupted by an adversary. We show that, although line… ▽ More

    Submitted 13 September, 2023; v1 submitted 15 February, 2022; originally announced February 2022.

    Comments: 46 pages; JMLR camera ready revision

  7. arXiv:2202.05928  [pdf, ps, other

    cs.LG math.ST stat.ML

    Benign Overfitting without Linearity: Neural Network Classifiers Trained by Gradient Descent for Noisy Linear Data

    Authors: Spencer Frei, Niladri S. Chatterji, Peter L. Bartlett

    Abstract: Benign overfitting, the phenomenon where interpolating models generalize well in the presence of noisy data, was first observed in neural network models trained with gradient descent. To better understand this empirical observation, we consider the generalization error of two-layer neural networks trained to interpolation by gradient descent on the logistic loss following random initialization. We… ▽ More

    Submitted 13 September, 2023; v1 submitted 11 February, 2022; originally announced February 2022.

    Comments: 39 pages; minor corrections

  8. arXiv:2112.12986  [pdf, other

    cs.LG stat.ML

    Is Importance Weighting Incompatible with Interpolating Classifiers?

    Authors: Ke Alexander Wang, Niladri S. Chatterji, Saminul Haque, Tatsunori Hashimoto

    Abstract: Importance weighting is a classic technique to handle distribution shifts. However, prior work has presented strong empirical and theoretical evidence demonstrating that importance weights can have little to no effect on overparameterized neural networks. Is importance weighting truly incompatible with the training of overparameterized neural networks? Our paper answers this in the negative. We sh… ▽ More

    Submitted 4 March, 2022; v1 submitted 24 December, 2021; originally announced December 2021.

    Comments: International Conference on Learning Representations (ICLR), 2022

  9. arXiv:2110.02914  [pdf, ps, other

    stat.ML cs.LG math.ST

    Foolish Crowds Support Benign Overfitting

    Authors: Niladri S. Chatterji, Philip M. Long

    Abstract: We prove a lower bound on the excess risk of sparse interpolating procedures for linear regression with Gaussian data in the overparameterized regime. We apply this result to obtain a lower bound for basis pursuit (the minimum $\ell_1$-norm interpolant) that implies that its excess risk can converge at an exponentially slower rate than OLS (the minimum $\ell_2$-norm interpolant), even when the gro… ▽ More

    Submitted 17 March, 2022; v1 submitted 6 October, 2021; originally announced October 2021.

  10. arXiv:2108.11489  [pdf, ps, other

    stat.ML cs.LG math.ST

    The Interplay Between Implicit Bias and Benign Overfitting in Two-Layer Linear Networks

    Authors: Niladri S. Chatterji, Philip M. Long, Peter L. Bartlett

    Abstract: The recent success of neural network models has shone light on a rather surprising statistical phenomenon: statistical models that perfectly fit noisy data can generalize well to unseen test data. Understanding this phenomenon of $\textit{benign overfitting}$ has attracted intense theoretical and empirical study. In this paper, we consider interpolating two-layer linear neural networks trained wit… ▽ More

    Submitted 9 September, 2022; v1 submitted 25 August, 2021; originally announced August 2021.

    Comments: Accepted for publication at JMLR

  11. arXiv:2108.07258  [pdf, other

    cs.LG cs.AI cs.CY

    On the Opportunities and Risks of Foundation Models

    Authors: Rishi Bommasani, Drew A. Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S. Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, Erik Brynjolfsson, Shyamal Buch, Dallas Card, Rodrigo Castellon, Niladri Chatterji, Annie Chen, Kathleen Creel, Jared Quincy Davis, Dora Demszky, Chris Donahue, Moussa Doumbouya, Esin Durmus, Stefano Ermon, John Etchemendy, Kawin Ethayarajh , et al. (89 additional authors not shown)

    Abstract: AI is undergoing a paradigm shift with the rise of models (e.g., BERT, DALL-E, GPT-3) that are trained on broad data at scale and are adaptable to a wide range of downstream tasks. We call these models foundation models to underscore their critically central yet incomplete character. This report provides a thorough account of the opportunities and risks of foundation models, ranging from their cap… ▽ More

    Submitted 12 July, 2022; v1 submitted 16 August, 2021; originally announced August 2021.

    Comments: Authored by the Center for Research on Foundation Models (CRFM) at the Stanford Institute for Human-Centered Artificial Intelligence (HAI). Report page with citation guidelines: https://crfm.stanford.edu/report.html

  12. arXiv:2105.14363  [pdf, other

    cs.LG cs.AI stat.ML

    On the Theory of Reinforcement Learning with Once-per-Episode Feedback

    Authors: Niladri S. Chatterji, Aldo Pacchiano, Peter L. Bartlett, Michael I. Jordan

    Abstract: We study a theory of reinforcement learning (RL) in which the learner receives binary feedback only once at the end of an episode. While this is an extreme test case for theory, it is also arguably more representative of real-world applications than the traditional requirement in RL practice that the learner receive feedback at every time step. Indeed, in many real-world applications of reinforcem… ▽ More

    Submitted 21 August, 2022; v1 submitted 29 May, 2021; originally announced May 2021.

    Comments: Published at NeurIPS 2022

  13. arXiv:2102.04998  [pdf, ps, other

    stat.ML cs.AI cs.LG math.OC

    When does gradient descent with logistic loss interpolate using deep networks with smoothed ReLU activations?

    Authors: Niladri S. Chatterji, Philip M. Long, Peter L. Bartlett

    Abstract: We establish conditions under which gradient descent applied to fixed-width deep networks drives the logistic loss to zero, and prove bounds on the rate of convergence. Our analysis applies for smoothed approximations to the ReLU, such as Swish and the Huberized ReLU, proposed in previous applied work. We provide two sufficient conditions for convergence. The first is simply a bound on the loss at… ▽ More

    Submitted 1 July, 2021; v1 submitted 9 February, 2021; originally announced February 2021.

  14. arXiv:2012.02409  [pdf, other

    stat.ML cs.LG math.OC

    When does gradient descent with logistic loss find interpolating two-layer networks?

    Authors: Niladri S. Chatterji, Philip M. Long, Peter L. Bartlett

    Abstract: We study the training of finite-width two-layer smoothed ReLU networks for binary classification using the logistic loss. We show that gradient descent drives the training loss to zero if the initial loss is small enough. When the data satisfies certain cluster and separation conditions and the network is wide enough, we show that one step of gradient descent reduces the loss sufficiently that the… ▽ More

    Submitted 1 July, 2021; v1 submitted 4 December, 2020; originally announced December 2020.

  15. arXiv:2004.12019  [pdf, other

    stat.ML cs.LG math.ST

    Finite-sample Analysis of Interpolating Linear Classifiers in the Overparameterized Regime

    Authors: Niladri S. Chatterji, Philip M. Long

    Abstract: We prove bounds on the population risk of the maximum margin algorithm for two-class linear classification. For linearly separable training data, the maximum margin algorithm has been shown in previous work to be equivalent to a limit of training with logistic loss using gradient descent, as the training error is driven to zero. We analyze this algorithm applied to random data including misclassif… ▽ More

    Submitted 1 June, 2021; v1 submitted 24 April, 2020; originally announced April 2020.

    Comments: Corrected typographical errors from the previous version of this paper

  16. arXiv:2002.00291  [pdf, ps, other

    stat.ML cs.LG math.ST

    Oracle Lower Bounds for Stochastic Gradient Sampling Algorithms

    Authors: Niladri S. Chatterji, Peter L. Bartlett, Philip M. Long

    Abstract: We consider the problem of sampling from a strongly log-concave density in $\mathbb{R}^d$, and prove an information theoretic lower bound on the number of stochastic gradient queries of the log density needed. Several popular sampling algorithms (including many Markov chain Monte Carlo methods) operate by using stochastic gradients of the log density to generate a sample; our results establish an… ▽ More

    Submitted 3 July, 2021; v1 submitted 1 February, 2020; originally announced February 2020.

    Comments: 21 pages; accepted for publication at Bernoulli

  17. arXiv:1912.00528  [pdf, other

    cs.LG stat.ML

    The intriguing role of module criticality in the generalization of deep networks

    Authors: Niladri S. Chatterji, Behnam Neyshabur, Hanie Sedghi

    Abstract: We study the phenomenon that some modules of deep neural networks (DNNs) are more critical than others. Meaning that rewinding their parameter values back to initialization, while keeping other modules fixed at the trained parameters, results in a large drop in the network's performance. Our analysis reveals interesting properties of the loss landscape which leads us to propose a complexity measur… ▽ More

    Submitted 14 February, 2020; v1 submitted 1 December, 2019; originally announced December 2019.

  18. arXiv:1905.13285  [pdf, ps, other

    stat.ML cs.LG stat.CO

    Langevin Monte Carlo without smoothness

    Authors: Niladri S. Chatterji, Jelena Diakonikolas, Michael I. Jordan, Peter L. Bartlett

    Abstract: Langevin Monte Carlo (LMC) is an iterative algorithm used to generate samples from a distribution that is known only up to a normalizing constant. The nonasymptotic dependence of its mixing time on the dimension and target accuracy is understood mainly in the setting of smooth (gradient-Lipschitz) log-densities, a serious limitation for applications in machine learning. In this paper, we remove th… ▽ More

    Submitted 24 February, 2020; v1 submitted 30 May, 2019; originally announced May 2019.

    Comments: Updated to match the AISTATS 2020 camera ready version. Some example applications added and typos corrected

  19. arXiv:1905.10040  [pdf, other

    stat.ML cs.LG

    OSOM: A simultaneously optimal algorithm for multi-armed and linear contextual bandits

    Authors: Niladri S. Chatterji, Vidya Muthukumar, Peter L. Bartlett

    Abstract: We consider the stochastic linear (multi-armed) contextual bandit problem with the possibility of hidden simple multi-armed bandit structure in which the rewards are independent of the contextual information. Algorithms that are designed solely for one of the regimes are known to be sub-optimal for the alternate regime. We design a single computationally efficient algorithm that simultaneously obt… ▽ More

    Submitted 5 October, 2020; v1 submitted 24 May, 2019; originally announced May 2019.

  20. arXiv:1902.00996  [pdf, ps, other

    stat.ML cs.LG math.NA

    Is There an Analog of Nesterov Acceleration for MCMC?

    Authors: Yi-An Ma, Niladri Chatterji, Xiang Cheng, Nicolas Flammarion, Peter Bartlett, Michael I. Jordan

    Abstract: We formulate gradient-based Markov chain Monte Carlo (MCMC) sampling as optimization on the space of probability measures, with Kullback-Leibler (KL) divergence as the objective functional. We show that an underdamped form of the Langevin algorithm performs accelerated gradient descent in this metric. To characterize the convergence of the algorithm, we construct a Lyapunov functional and exploit… ▽ More

    Submitted 22 October, 2019; v1 submitted 3 February, 2019; originally announced February 2019.

  21. arXiv:1805.01648  [pdf, other

    stat.ML cs.LG math.PR stat.CO

    Sharp convergence rates for Langevin dynamics in the nonconvex setting

    Authors: Xiang Cheng, Niladri S. Chatterji, Yasin Abbasi-Yadkori, Peter L. Bartlett, Michael I. Jordan

    Abstract: We study the problem of sampling from a distribution $p^*(x) \propto \exp\left(-U(x)\right)$, where the function $U$ is $L$-smooth everywhere and $m$-strongly convex outside a ball of radius $R$, but potentially nonconvex inside this ball. We study both overdamped and underdamped Langevin MCMC and establish upper bounds on the number of steps required to obtain a sample from a distribution that is… ▽ More

    Submitted 6 July, 2020; v1 submitted 4 May, 2018; originally announced May 2018.

    Comments: 78 pages, 2 figures

  22. arXiv:1802.09732  [pdf, other

    stat.ML cs.LG

    Online learning with kernel losses

    Authors: Aldo Pacchiano, Niladri S. Chatterji, Peter L. Bartlett

    Abstract: We present a generalization of the adversarial linear bandits framework, where the underlying losses are kernel functions (with an associated reproducing kernel Hilbert space) rather than linear functions. We study a version of the exponential weights algorithm and bound its regret in this setting. Under conditions on the eigendecay of the kernel we provide a sharp characterization of the regret f… ▽ More

    Submitted 27 February, 2018; originally announced February 2018.

    Comments: 40 pages, 4 figures

  23. arXiv:1802.05431  [pdf, other

    stat.ML cs.LG

    On the Theory of Variance Reduction for Stochastic Gradient Monte Carlo

    Authors: Niladri S. Chatterji, Nicolas Flammarion, Yi-An Ma, Peter L. Bartlett, Michael I. Jordan

    Abstract: We provide convergence guarantees in Wasserstein distance for a variety of variance-reduction methods: SAGA Langevin diffusion, SVRG Langevin diffusion and control-variate underdamped Langevin diffusion. We analyze these methods under a uniform set of assumptions on the log-posterior distribution, assuming it to be smooth, strongly convex and Hessian Lipschitz. This is achieved by a new proof tech… ▽ More

    Submitted 15 February, 2018; originally announced February 2018.

    Comments: 37 pages; 4 figures

  24. arXiv:1711.03634  [pdf, ps, other

    stat.ML cs.LG

    Alternating minimization for dictionary learning: Local Convergence Guarantees

    Authors: Niladri S. Chatterji, Peter L. Bartlett

    Abstract: We present theoretical guarantees for an alternating minimization algorithm for the dictionary learning/sparse coding problem. The dictionary learning problem is to factorize vector samples $y^{1},y^{2},\ldots, y^{n}$ into an appropriate basis (dictionary) $A^*$ and sparse vectors $x^{1*},\ldots,x^{n*}$. Our algorithm is a simple alternating minimization procedure that switches between $\ell_1$ mi… ▽ More

    Submitted 30 July, 2019; v1 submitted 9 November, 2017; originally announced November 2017.

    Comments: Erratum: An earlier version of this paper appeared in NIPS 2017 which had an erroneous claim about convergence guarantees with random initialization. The main result -- Theorem 3 -- has been corrected by adding an assumption about the initialization (Assumption B1)

  25. arXiv:1707.03663  [pdf, ps, other

    stat.ML cs.LG stat.CO

    Underdamped Langevin MCMC: A non-asymptotic analysis

    Authors: Xiang Cheng, Niladri S. Chatterji, Peter L. Bartlett, Michael I. Jordan

    Abstract: We study the underdamped Langevin diffusion when the log of the target distribution is smooth and strongly concave. We present a MCMC algorithm based on its discretization and show that it achieves $\varepsilon$ error (in 2-Wasserstein distance) in $\mathcal{O}(\sqrt{d}/\varepsilon)$ steps. This is a significant improvement over the best known rate for overdamped Langevin MCMC, which is… ▽ More

    Submitted 26 January, 2018; v1 submitted 12 July, 2017; originally announced July 2017.

    Comments: 23 pages; Correction to Corollary 7