Showing 1–22 of 22 results for author: Coleman, B

Searching in archive cs.
  1. arXiv:2402.09668  [pdf, other]

    cs.LG cs.AI cs.CL

    How to Train Data-Efficient LLMs

    Authors: Noveen Sachdeva, Benjamin Coleman, Wang-Cheng Kang, Jianmo Ni, Lichan Hong, Ed H. Chi, James Caverlee, Julian McAuley, Derek Zhiyuan Cheng

    Abstract: The training of large language models (LLMs) is expensive. In this paper, we study data-efficient approaches for pre-training LLMs, i.e., techniques that aim to optimize the Pareto frontier of model quality and training resource/data consumption. We seek to understand the tradeoffs associated with data selection routines based on (i) expensive-to-compute data-quality estimates, and (ii) maximizati…

    Submitted 14 February, 2024; originally announced February 2024.

    Comments: Under review. 44 pages, 30 figures

  2. arXiv:2311.13583  [pdf, other]

    cs.LG

    Adaptive Sampling for Deep Learning via Efficient Nonparametric Proxies

    Authors: Shabnam Daghaghi, Benjamin Coleman, Benito Geordie, Anshumali Shrivastava

    Abstract: Data sampling is an effective method to improve the training speed of neural networks, with recent results demonstrating that it can even break the neural scaling laws. These results critically rely on high-quality scores to estimate the importance of an input to the network. We observe that there are two dominant strategies: static sampling, where the scores are determined before training, and dy…

    Submitted 22 November, 2023; originally announced November 2023.

  3. arXiv:2308.15014  [pdf, other]

    cs.IR

    CAPS: A Practical Partition Index for Filtered Similarity Search

    Authors: Gaurav Gupta, Jonah Yi, Benjamin Coleman, Chen Luo, Vihan Lakshman, Anshumali Shrivastava

    Abstract: With the surging popularity of approximate near-neighbor search (ANNS), driven by advances in neural representation learning, the ability to serve queries accompanied by a set of constraints has become an area of intense interest. While the community has recently proposed several algorithms for constrained ANNS, almost all of these methods focus on integration with graph-based indexes, the predomi…

    Submitted 29 August, 2023; originally announced August 2023.

    Comments: 14 pages

  4. arXiv:2305.16545  [pdf, other]

    cs.DS cs.DB cs.IR

    CARAMEL: A Succinct Read-Only Lookup Table via Compressed Static Functions

    Authors: Benjamin Coleman, David Torres Ramos, Vihan Lakshman, Chen Luo, Anshumali Shrivastava

    Abstract: Lookup tables are a fundamental structure in many data processing and systems applications. Examples include tokenized text in NLP, quantized embedding collections in recommendation systems, integer sketches for streaming data, and hash-based string representations in genomics. With the increasing size of web-scale data, such applications often require compression techniques that support fast rand…

    Submitted 25 May, 2023; originally announced May 2023.

    Comments: 8 pages

  5. arXiv:2305.12102  [pdf, other]

    cs.LG cs.IR

    Unified Embedding: Battle-Tested Feature Representations for Web-Scale ML Systems

    Authors: Benjamin Coleman, Wang-Cheng Kang, Matthew Fahrbach, Ruoxi Wang, Lichan Hong, Ed H. Chi, Derek Zhiyuan Cheng

    Abstract: Learning high-quality feature embeddings efficiently and effectively is critical for the performance of web-scale machine learning systems. A typical model ingests hundreds of features with vocabularies on the order of millions to billions of tokens. The standard approach is to represent each feature value as a d-dimensional embedding, introducing hundreds of billions of parameters for extremely h…

    Submitted 14 November, 2023; v1 submitted 20 May, 2023; originally announced May 2023.

    Comments: NeurIPS'23 Spotlight

    Journal ref: Proceedings of the 37th Annual Conference on Neural Information Processing Systems (NeurIPS 2023) 56234-56255

  6. BOLT: An Automated Deep Learning Framework for Training and Deploying Large-Scale Search and Recommendation Models on Commodity CPU Hardware

    Authors: Nicholas Meisburger, Vihan Lakshman, Benito Geordie, Joshua Engels, David Torres Ramos, Pratik Pranav, Benjamin Coleman, Benjamin Meisburger, Shubh Gupta, Yashwanth Adunukota, Tharun Medini, Anshumali Shrivastava

    Abstract: Efficient large-scale neural network training and inference on commodity CPU hardware is of immense practical significance in democratizing deep learning (DL) capabilities. Presently, the process of training massive models consisting of hundreds of millions to billions of parameters requires the extensive use of specialized hardware accelerators, such as GPUs, which are only accessible to a limite…

    Submitted 12 September, 2023; v1 submitted 30 March, 2023; originally announced March 2023.

    Comments: 6 pages, 5 tables, 3 figures. CIKM 2023 (Applied Research Track)

  7. arXiv:2212.08038  [pdf, ps, other]

    cs.CY

    Redefining Relationships in Music

    Authors: Christian Detweiler, Beth Coleman, Fernando Diaz, Lieke Dom, Chris Donahue, Jesse Engel, Cheng-Zhi Anna Huang, Larry James, Ethan Manilow, Amanda McCroskery, Kyle Pedersen, Pamela Peter-Agbia, Negar Rostamzadeh, Robert Thomas, Marco Zamarato, Ben Zevenbergen

    Abstract: AI tools increasingly shape how we discover, make and experience music. While these tools can have the potential to empower creativity, they may fundamentally redefine relationships between stakeholders, to the benefit of some and the detriment of others. In this position paper, we argue that these tools will fundamentally reshape our music culture, with profound effects (for better and for worse)…

    Submitted 16 December, 2022; v1 submitted 13 December, 2022; originally announced December 2022.

    Comments: Presented at Cultures in AI/AI in Culture workshop at NeurIPS 2022

  8. arXiv:2210.15748  [pdf, other]

    cs.DS

    DESSERT: An Efficient Algorithm for Vector Set Search with Vector Set Queries

    Authors: Joshua Engels, Benjamin Coleman, Vihan Lakshman, Anshumali Shrivastava

    Abstract: We study the problem of $\textit{vector set search}$ with $\textit{vector set queries}$. This task is analogous to traditional near-neighbor search, with the exception that both the query and each element in the collection are $\textit{sets}$ of vectors. We identify this problem as a core subroutine for semantic search applications and find that existing solutions are unacceptably slow. Towards th…

    Submitted 26 October, 2023; v1 submitted 27 October, 2022; originally announced October 2022.

    Comments: Code available, https://github.com/ThirdAIResearch/Dessert

  9. arXiv:2209.04732  [pdf]

    cs.DB cs.AI

    Ontologizing Health Systems Data at Scale: Making Translational Discovery a Reality

    Authors: Tiffany J. Callahan, Adrianne L. Stefanski, Jordan M. Wyrwa, Chenjie Zeng, Anna Ostropolets, Juan M. Banda, William A. Baumgartner Jr., Richard D. Boyce, Elena Casiraghi, Ben D. Coleman, Janine H. Collins, Sara J. Deakyne-Davies, James A. Feinstein, Melissa A. Haendel, Asiyah Y. Lin, Blake Martin, Nicolas A. Matentzoglu, Daniella Meeker, Justin Reese, Jessica Sinclair, Sanya B. Taneja, Katy E. Trinkley, Nicole A. Vasilevsky, Andrew Williams, Xingman A. Zhang, et al. (7 additional authors not shown)

    Abstract: Background: Common data models solve many challenges of standardizing electronic health record (EHR) data, but are unable to semantically integrate all the resources needed for deep phenotyping. Open Biological and Biomedical Ontology (OBO) Foundry ontologies provide computable representations of biological knowledge and enable the integration of heterogeneous data. However, mapping EHR data to OB…

    Submitted 30 January, 2023; v1 submitted 10 September, 2022; originally announced September 2022.

    Comments: Supplementary Material is included at the end of the manuscript

    ACM Class: J.3

  10. arXiv:2206.06444  [pdf]

    cs.AI cs.CY stat.AP

    A method for comparing multiple imputation techniques: a case study on the U.S. National COVID Cohort Collaborative

    Authors: Elena Casiraghi, Rachel Wong, Margaret Hall, Ben Coleman, Marco Notaro, Michael D. Evans, Jena S. Tronieri, Hannah Blau, Bryan Laraway, Tiffany J. Callahan, Lauren E. Chan, Carolyn T. Bramante, John B. Buse, Richard A. Moffitt, Til Sturmer, Steven G. Johnson, Yu Raymond Shao, Justin Reese, Peter N. Robinson, Alberto Paccanaro, Giorgio Valentini, Jared D. Huling, Kenneth Wilkins, Tell Bennet, et al. (12 additional authors not shown)

    Abstract: Healthcare datasets obtained from Electronic Health Records have proven to be extremely useful to assess associations between patients' predictors and outcomes of interest. However, these datasets often suffer from missing values in a high proportion of cases and the simple removal of these cases may introduce severe bias. For these reasons, several multiple imputation algorithms have been propose…

    Submitted 25 September, 2022; v1 submitted 13 June, 2022; originally announced June 2022.

  11. arXiv:2106.11565  [pdf, other]

    cs.DS

    Practical Near Neighbor Search via Group Testing

    Authors: Joshua Engels, Benjamin Coleman, Anshumali Shrivastava

    Abstract: We present a new algorithm for the approximate near neighbor problem that combines classical ideas from group testing with locality-sensitive hashing (LSH). We reduce the near neighbor search problem to a group testing problem by designating neighbors as "positives," non-neighbors as "negatives," and approximate membership queries as group tests. We instantiate this framework using distance-sensit…

    Submitted 22 June, 2021; originally announced June 2021.

    Comments: For source code see https://github.com/JoshuaEng/FLINNG

  12. arXiv:2106.11426  [pdf, other]

    cs.LG cs.AI cs.DS

    Efficient Inference via Universal LSH Kernel

    Authors: Zichang Liu, Benjamin Coleman, Anshumali Shrivastava

    Abstract: Large machine learning models achieve unprecedented performance on various tasks and have evolved as the go-to technique. However, deploying these compute and memory hungry models on resource constraint environments poses new challenges. In this work, we propose mathematically provable Representer Sketch, a concise set of count arrays that can approximate the inference procedure with simple hashin…

    Submitted 21 June, 2021; originally announced June 2021.

  13. arXiv:2104.03221  [pdf, other]

    cs.DS

    Graph Reordering for Cache-Efficient Near Neighbor Search

    Authors: Benjamin Coleman, Santiago Segarra, Anshumali Shrivastava, Alex Smola

    Abstract: Graph search is one of the most successful algorithmic trends in near neighbor search. Several of the most popular and empirically successful algorithms are, at their core, a simple walk along a pruned near neighbor graph. Such algorithms consistently perform at the top of industrial speed benchmarks for applications such as embedding search. However, graph traversal applications often suffer from…

    Submitted 7 April, 2021; originally announced April 2021.

  14. arXiv:2102.12301  [pdf, other]

    cs.DS cs.LG stat.ML

    Density Sketches for Sampling and Estimation

    Authors: Aditya Desai, Benjamin Coleman, Anshumali Shrivastava

    Abstract: We introduce Density sketches (DS): a succinct online summary of the data distribution. DS can accurately estimate point wise probability density. Interestingly, DS also provides a capability to sample unseen novel data from the underlying data distribution. Thus, analogous to popular generative models, DS allows us to succinctly replace the real-data in almost all machine learning pipelines with…

    Submitted 24 February, 2021; originally announced February 2021.

  15. arXiv:2008.02641  [pdf, other]

    cs.LG cs.IT stat.ME stat.ML

    Bloom Origami Assays: Practical Group Testing

    Authors: Louis Abraham, Gary Becigneul, Benjamin Coleman, Bernhard Scholkopf, Anshumali Shrivastava, Alexander Smola

    Abstract: We study the problem usually referred to as group testing in the context of COVID-19. Given n samples collected from patients, how should we select and test mixtures of samples to maximize information and minimize the number of tests? Group testing is a well-studied problem with several appealing solutions, but recent biological studies impose practical constraints for COVID-19 that are incompatib…

    Submitted 21 July, 2020; originally announced August 2020.

    Comments: arXiv admin note: text overlap with arXiv:2005.06413

  16. arXiv:2006.14554  [pdf, other]

    stat.ML cs.LG

    STORM: Foundations of End-to-End Empirical Risk Minimization on the Edge

    Authors: Benjamin Coleman, Gaurav Gupta, John Chen, Anshumali Shrivastava

    Abstract: Empirical risk minimization is perhaps the most influential idea in statistical learning, with applications to nearly all scientific and technical domains in the form of regression and classification models. To analyze massive streaming datasets in distributed computing environments, practitioners increasingly prefer to deploy regression models on edge rather than in the cloud. By keeping data on…

    Submitted 25 June, 2020; originally announced June 2020.

  17. arXiv:2006.09352  [pdf, other]

    cs.DS cs.CR cs.LG stat.ML

    A One-Pass Private Sketch for Most Machine Learning Tasks

    Authors: Benjamin Coleman, Anshumali Shrivastava

    Abstract: Differential privacy (DP) is a compelling privacy definition that explains the privacy-utility tradeoff via formal, provable guarantees. Inspired by recent progress toward general-purpose data release algorithms, we propose a private sketch, or small summary of the dataset, that supports a multitude of machine learning tasks including regression, classification, density estimation, near-neighbor s…

    Submitted 16 June, 2020; originally announced June 2020.

    Comments: 10 pages, 4 figures

  18. arXiv:1912.02283  [pdf, other]

    cs.DS cs.LG

    Sub-linear RACE Sketches for Approximate Kernel Density Estimation on Streaming Data

    Authors: Benjamin Coleman, Anshumali Shrivastava

    Abstract: Kernel density estimation is a simple and effective method that lies at the heart of many important machine learning applications. Unfortunately, kernel methods scale poorly for large, high dimensional datasets. Approximate kernel density estimation has a prohibitively high memory and computation cost, especially in the streaming setting. Recent sampling algorithms for high dimensional densities c…

    Submitted 4 December, 2019; originally announced December 2019.

  19. arXiv:1910.04358  [pdf, other]

    q-bio.GN cs.IR

    Fast Processing and Querying of 170TB of Genomics Data via a Repeated And Merged BloOm Filter (RAMBO)

    Authors: Gaurav Gupta, Minghao Yan, Benjamin Coleman, Bryce Kille, R. A. Leo Elworth, Tharun Medini, Todd Treangen, Anshumali Shrivastava

    Abstract: DNA sequencing, especially of microbial genomes and metagenomes, has been at the core of recent research advances in large-scale comparative genomics. The data deluge has resulted in exponential growth in genomic datasets over the past years and has shown no sign of slowing down. Several recent attempts have been made to tame the computational burden of sequence search on these terabyte and petaby…

    Submitted 30 April, 2022; v1 submitted 10 October, 2019; originally announced October 2019.

    Comments: 9 pages

  20. arXiv:1910.02611  [pdf, other]

    cs.DS cs.IR

    RAMBO: Repeated And Merged BloOm Filter for Ultra-fast Multiple Set Membership Testing (MSMT) on Large-Scale Data

    Authors: Gaurav Gupta, Minghao Yan, Benjamin Coleman, R. A. Leo Elworth, Tharun Medini, Todd Treangen, Anshumali Shrivastava

    Abstract: Multiple Set Membership Testing (MSMT) is a well-known problem in a variety of search and query applications. Given a dataset of K different sets and a query q, it aims to find all of the sets containing the query. Trivially, an MSMT instance can be reduced to K membership testing instances, each with the same q, leading to O(K) query time with a simple array of Bloom Filters. We propose a data-st…

    Submitted 17 July, 2020; v1 submitted 7 October, 2019; originally announced October 2019.

    Comments: 14 pages, 5 figures

  21. arXiv:1908.08762  [pdf, other]

    cs.DS

    Revisiting Consistent Hashing with Bounded Loads

    Authors: John Chen, Ben Coleman, Anshumali Shrivastava

    Abstract: Dynamic load balancing lies at the heart of distributed caching. Here, the goal is to assign objects (load) to servers (computing nodes) in a way that provides load balancing while at the same time dynamically adjusts to the addition or removal of servers. One essential requirement is that the addition or removal of small servers should not require us to recompute the complete assignment. A popula…

    Submitted 16 June, 2020; v1 submitted 23 August, 2019; originally announced August 2019.

  22. arXiv:1902.06687  [pdf, other]

    cs.DS cs.CG cs.LG eess.SP stat.ML

    Sub-linear Memory Sketches for Near Neighbor Search on Streaming Data

    Authors: Benjamin Coleman, Richard G. Baraniuk, Anshumali Shrivastava

    Abstract: We present the first sublinear memory sketch that can be queried to find the nearest neighbors in a dataset. Our online sketching algorithm compresses an N element dataset to a sketch of size $O(N^b \log^3 N)$ in $O(N^{(b+1)} \log^3 N)$ time, where $b < 1$. This sketch can correctly report the nearest neighbors of any query that satisfies a stability condition parameterized by $b$. We achieve subl…

    Submitted 14 September, 2020; v1 submitted 18 February, 2019; originally announced February 2019.

    Comments: Published in ICML2020