Showing 1–50 of 81 results for author: Adi, Y

  1. arXiv:2409.07437  [pdf, other]

    cs.SD cs.CL eess.AS

    A Suite for Acoustic Language Model Evaluation

    Authors: Gallil Maimon, Amit Roth, Yossi Adi

    Abstract: Speech language models have recently demonstrated great potential as universal speech processing systems. Such models have the ability to model the rich acoustic information existing in audio signals, beyond spoken content, such as emotion, background noise, etc. Despite this, evaluation benchmarks that evaluate awareness of a wide range of acoustic aspects are lacking. To help bridge this gap,…

    Submitted 11 September, 2024; originally announced September 2024.

  2. arXiv:2409.03701  [pdf, other]

    cs.CL cs.SD eess.AS

    LAST: Language Model Aware Speech Tokenization

    Authors: Arnon Turetzky, Yossi Adi

    Abstract: Speech tokenization serves as the foundation of speech language models (LMs), enabling them to perform various tasks such as spoken language modeling, text-to-speech, speech-to-text, etc. Most speech tokenizers are trained independently of the LM training process, relying on separate acoustic models and quantization methods. Following such an approach may create a mismatch between the tokenization p…

    Submitted 10 September, 2024; v1 submitted 5 September, 2024; originally announced September 2024.

  3. arXiv:2409.02915  [pdf, other]

    cs.SD eess.AS

    Latent Watermarking of Audio Generative Models

    Authors: Robin San Roman, Pierre Fernandez, Antoine Deleforge, Yossi Adi, Romain Serizel

    Abstract: The advancements in audio generative models have opened up new challenges in their responsible disclosure and the detection of their misuse. In response, we introduce a method to watermark latent generative models by a specific watermarking of their training data. The resulting watermarked models produce latent representations whose decoded outputs are detected with high confidence, regardless of…

    Submitted 4 September, 2024; originally announced September 2024.

  4. arXiv:2408.17434  [pdf, other]

    cs.SD eess.AS

    Audio Enhancement from Multiple Crowdsourced Recordings: A Simple and Effective Baseline

    Authors: Shiran Aziz, Yossi Adi, Shmuel Peleg

    Abstract: With the popularity of cellular phones, events are often recorded by multiple devices from different locations and shared on social media. Several different recordings could be found for many events. Such recordings are usually noisy, where noise for each device is local and unrelated to others. This case of multiple microphones at unknown locations, capturing local, uncorrelated noise, was rarely…

    Submitted 30 August, 2024; originally announced August 2024.

  5. arXiv:2407.21783  [pdf, other]

    cs.AI cs.CL cs.CV

    The Llama 3 Herd of Models

    Authors: Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, Arun Rao, Aston Zhang, Aurelien Rodriguez, Austen Gregerson, Ava Spataru, Baptiste Roziere, Bethany Biron, Binh Tang, et al. (510 additional authors not shown)

    Abstract: Modern artificial intelligence (AI) systems are powered by foundation models. This paper presents a new set of foundation models, called Llama 3. It is a herd of language models that natively support multilinguality, coding, reasoning, and tool usage. Our largest model is a dense Transformer with 405B parameters and a context window of up to 128K tokens. This paper presents an extensive empirical…

    Submitted 15 August, 2024; v1 submitted 31 July, 2024; originally announced July 2024.

  6. arXiv:2407.15595  [pdf, other]

    cs.LG cs.AI

    Discrete Flow Matching

    Authors: Itai Gat, Tal Remez, Neta Shaul, Felix Kreuk, Ricky T. Q. Chen, Gabriel Synnaeve, Yossi Adi, Yaron Lipman

    Abstract: Despite Flow Matching and diffusion models having emerged as powerful generative paradigms for continuous variables such as images and videos, their application to high-dimensional discrete data, such as language, is still limited. In this work, we present Discrete Flow Matching, a novel discrete flow paradigm designed specifically for generating discrete data. Discrete Flow Matching offers severa…

    Submitted 22 July, 2024; originally announced July 2024.

  7. arXiv:2407.12563  [pdf, other]

    cs.SD eess.AS

    Audio Conditioning for Music Generation via Discrete Bottleneck Features

    Authors: Simon Rouard, Yossi Adi, Jade Copet, Axel Roebel, Alexandre Défossez

    Abstract: While most music generation models use textual or parametric conditioning (e.g. tempo, harmony, musical genre), we propose to condition a language model based music generation system with audio input. Our exploration involves two distinct strategies. The first strategy, termed textual inversion, leverages a pre-trained text-to-music model to map audio input to corresponding "pseudowords" in the te…

    Submitted 30 July, 2024; v1 submitted 17 July, 2024; originally announced July 2024.

    Comments: 6 pages, 2 figures, accepted at ISMIR 2024

  8. arXiv:2407.12206  [pdf, other]

    cs.CL cs.SD eess.AS

    A Language Modeling Approach to Diacritic-Free Hebrew TTS

    Authors: Amit Roth, Arnon Turetzky, Yossi Adi

    Abstract: We tackle the task of text-to-speech (TTS) in Hebrew. Traditional Hebrew contains diacritics, which dictate how individuals should pronounce given words; however, modern Hebrew rarely uses them. Because modern Hebrew lacks diacritics, readers are expected to infer the correct pronunciation and understand which phonemes to use based on the context. This imposes a fundamental challenge…

    Submitted 16 July, 2024; originally announced July 2024.

    Comments: Accepted at Interspeech24

  9. arXiv:2407.07566  [pdf, other]

    cs.CL cs.SD eess.AS

    HebDB: a Weakly Supervised Dataset for Hebrew Speech Processing

    Authors: Arnon Turetzky, Or Tal, Yael Segal-Feldman, Yehoshua Dissen, Ella Zeldes, Amit Roth, Eyal Cohen, Yosi Shrem, Bronya R. Chernyak, Olga Seleznova, Joseph Keshet, Yossi Adi

    Abstract: We present HebDB, a weakly supervised dataset for spoken language processing in the Hebrew language. HebDB offers roughly 2500 hours of natural and spontaneous speech recordings in the Hebrew language, consisting of a large variety of speakers and topics. We provide raw recordings together with a pre-processed, weakly supervised, and filtered version. The goal of HebDB is to further enhance resear…

    Submitted 10 July, 2024; originally announced July 2024.

    Comments: Accepted at Interspeech2024

  10. arXiv:2406.13621  [pdf, other]

    cs.CL cs.CV cs.LG

    Improving Visual Commonsense in Language Models via Multiple Image Generation

    Authors: Guy Yariv, Idan Schwartz, Yossi Adi, Sagie Benaim

    Abstract: Commonsense reasoning is fundamentally based on multimodal knowledge. However, existing large language models (LLMs) are primarily trained using textual data only, limiting their ability to incorporate essential visual information. In contrast, Visual Language Models, which excel at visually-oriented tasks, often fail at non-visual tasks such as basic commonsense reasoning. This divergence highlig…

    Submitted 19 June, 2024; originally announced June 2024.

  11. arXiv:2406.11037  [pdf, other]

    cs.SD eess.AS

    NAST: Noise Aware Speech Tokenization for Speech Language Models

    Authors: Shoval Messica, Yossi Adi

    Abstract: Speech tokenization is the task of representing speech signals as a sequence of discrete units. Such representations can be later used for various downstream tasks including automatic speech recognition, text-to-speech, etc. More relevant to this study, such representations serve as the basis of Speech Language Models. In this work, we tackle the task of speech tokenization under the noisy setup a…

    Submitted 16 June, 2024; originally announced June 2024.

    Comments: Accepted at Interspeech 2024

  12. arXiv:2406.10970  [pdf, other]

    cs.SD eess.AS

    Joint Audio and Symbolic Conditioning for Temporally Controlled Text-to-Music Generation

    Authors: Or Tal, Alon Ziv, Itai Gat, Felix Kreuk, Yossi Adi

    Abstract: We present JASCO, a temporally controlled text-to-music generation model utilizing both symbolic and audio-based conditions. JASCO can generate high-quality music samples conditioned on global text descriptions along with fine-grained local controls. JASCO is based on the Flow Matching modeling paradigm together with a novel conditioning method. This allows music generation controlled both locally…

    Submitted 16 June, 2024; originally announced June 2024.

  13. arXiv:2406.07725  [pdf, ps, other]

    cs.SD eess.AS

    The Interspeech 2024 Challenge on Speech Processing Using Discrete Units

    Authors: Xuankai Chang, Jiatong Shi, Jinchuan Tian, Yuning Wu, Yuxun Tang, Yihan Wu, Shinji Watanabe, Yossi Adi, Xie Chen, Qin Jin

    Abstract: Representing speech and audio signals in discrete units has become a compelling alternative to traditional high-dimensional feature vectors. Numerous studies have highlighted the efficacy of discrete units in various applications such as speech compression and restoration, speech recognition, and speech generation. To foster exploration in this domain, we introduce the Interspeech 2024 Challenge,…

    Submitted 11 June, 2024; originally announced June 2024.

    Comments: This manuscript has been accepted by Interspeech2024

  14. arXiv:2406.02315  [pdf, other]

    cs.SD cs.AI cs.LG eess.AS

    An Independence-promoting Loss for Music Generation with Language Models

    Authors: Jean-Marie Lemercier, Simon Rouard, Jade Copet, Yossi Adi, Alexandre Défossez

    Abstract: Music generation schemes using language modeling rely on a vocabulary of audio tokens, generally provided as codes in a discrete latent space learnt by an auto-encoder. Multi-stage quantizers are often employed to produce these tokens, therefore the decoding strategy used for token prediction must be adapted to account for multiple codebooks: either it should model the joint distribution over all…

    Submitted 9 June, 2024; v1 submitted 4 June, 2024; originally announced June 2024.

    Comments: Accepted to ICML 2024

  15. arXiv:2404.00725  [pdf, other]

    cs.SE cs.AI cs.CL cs.LG

    The Larger the Better? Improved LLM Code-Generation via Budget Reallocation

    Authors: Michael Hassid, Tal Remez, Jonas Gehring, Roy Schwartz, Yossi Adi

    Abstract: It is a common belief that large language models (LLMs) are better than smaller-sized ones. However, larger models also require significantly more time and compute during inference. This begs the question: what happens when both models operate under the same budget (e.g., compute, run-time)? To address this question, we analyze code generation LLMs of various sizes and make comparisons such as ru…

    Submitted 25 July, 2024; v1 submitted 31 March, 2024; originally announced April 2024.

    Comments: COLM 2024

  16. arXiv:2401.06104  [pdf, other]

    cs.CL

    Transformers are Multi-State RNNs

    Authors: Matanel Oren, Michael Hassid, Nir Yarden, Yossi Adi, Roy Schwartz

    Abstract: Transformers are considered conceptually different from the previous generation of state-of-the-art NLP models - recurrent neural networks (RNNs). In this work, we demonstrate that decoder-only transformers can in fact be conceptualized as unbounded multi-state RNNs - an RNN variant with unlimited hidden state size. We further show that transformers can be converted into $\textit{bounded}$ multi-s…

    Submitted 18 June, 2024; v1 submitted 11 January, 2024; originally announced January 2024.

    Comments: preprint

  17. arXiv:2401.04577  [pdf, other]

    cs.SD cs.AI cs.LG eess.AS

    Masked Audio Generation using a Single Non-Autoregressive Transformer

    Authors: Alon Ziv, Itai Gat, Gael Le Lan, Tal Remez, Felix Kreuk, Alexandre Défossez, Jade Copet, Gabriel Synnaeve, Yossi Adi

    Abstract: We introduce MAGNeT, a masked generative sequence modeling method that operates directly over several streams of audio tokens. Unlike prior work, MAGNeT is comprised of a single-stage, non-autoregressive transformer. During training, we predict spans of masked tokens obtained from a masking scheduler, while during inference we gradually construct the output sequence using several decoding steps. T…

    Submitted 5 March, 2024; v1 submitted 9 January, 2024; originally announced January 2024.

  18. arXiv:2310.05224  [pdf, other]

    cs.CL cs.LG

    Generative Spoken Language Model based on continuous word-sized audio tokens

    Authors: Robin Algayres, Yossi Adi, Tu Anh Nguyen, Jade Copet, Gabriel Synnaeve, Benoit Sagot, Emmanuel Dupoux

    Abstract: In NLP, text language models based on words or subwords are known to outperform their character-based counterparts. Yet, in the speech community, the standard inputs of spoken LMs are 20ms or 40ms-long discrete units (shorter than a phoneme). Taking inspiration from word-based LMs, we introduce a Generative Spoken Language Model (GSLM) based on word-size continuous-valued audio embeddings that can g…

    Submitted 8 October, 2023; originally announced October 2023.

    Comments: Conference paper at EMNLP 2023

  19. arXiv:2309.17020  [pdf, other]

    eess.AS cs.SD

    Low-Resource Self-Supervised Learning with SSL-Enhanced TTS

    Authors: Po-chun Hsu, Ali Elkahky, Wei-Ning Hsu, Yossi Adi, Tu Anh Nguyen, Jade Copet, Emmanuel Dupoux, Hung-yi Lee, Abdelrahman Mohamed

    Abstract: Self-supervised learning (SSL) techniques have achieved remarkable results in various speech processing tasks. Nonetheless, a significant challenge remains in reducing the reliance on vast amounts of speech data for pre-training. This paper proposes to address this challenge by leveraging synthetic speech to augment a low-resource pre-training corpus. We construct a high-quality text-to-speech (TT…

    Submitted 4 June, 2024; v1 submitted 29 September, 2023; originally announced September 2023.

    Comments: ASRU 2023 SPARKS Workshop

  20. arXiv:2309.16429  [pdf, other]

    cs.LG cs.AI

    Diverse and Aligned Audio-to-Video Generation via Text-to-Video Model Adaptation

    Authors: Guy Yariv, Itai Gat, Sagie Benaim, Lior Wolf, Idan Schwartz, Yossi Adi

    Abstract: We consider the task of generating diverse and realistic videos guided by natural audio samples from a wide variety of semantic classes. For this task, the videos are required to be aligned both globally and temporally with the input audio: globally, the input audio is semantically associated with the entire output video, and temporally, each segment of the input audio is associated with a corresp…

    Submitted 28 September, 2023; originally announced September 2023.

    Comments: 9 pages, 6 figures

  21. arXiv:2308.12950  [pdf, other]

    cs.CL

    Code Llama: Open Foundation Models for Code

    Authors: Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Romain Sauvestre, Tal Remez, Jérémy Rapin, Artyom Kozhevnikov, Ivan Evtimov, Joanna Bitton, Manish Bhatt, Cristian Canton Ferrer, Aaron Grattafiori, Wenhan Xiong, Alexandre Défossez, Jade Copet, Faisal Azhar, Hugo Touvron, Louis Martin, Nicolas Usunier, Thomas Scialom, et al. (1 additional author not shown)

    Abstract: We release Code Llama, a family of large language models for code based on Llama 2 providing state-of-the-art performance among open models, infilling capabilities, support for large input contexts, and zero-shot instruction following ability for programming tasks. We provide multiple flavors to cover a wide range of applications: foundation models (Code Llama), Python specializations (Code Llama…

    Submitted 31 January, 2024; v1 submitted 24 August, 2023; originally announced August 2023.

  22. arXiv:2308.05725  [pdf, ps, other]

    cs.CL cs.LG cs.SD eess.AS

    EXPRESSO: A Benchmark and Analysis of Discrete Expressive Speech Resynthesis

    Authors: Tu Anh Nguyen, Wei-Ning Hsu, Antony D'Avirro, Bowen Shi, Itai Gat, Maryam Fazel-Zarandi, Tal Remez, Jade Copet, Gabriel Synnaeve, Michael Hassid, Felix Kreuk, Yossi Adi, Emmanuel Dupoux

    Abstract: Recent work has shown that it is possible to resynthesize high-quality speech based, not on text, but on low bitrate discrete units that have been learned in a self-supervised fashion and can therefore capture expressive aspects of speech that are hard to transcribe (prosody, voice styles, non-verbal vocalization). The adoption of these methods is still limited by the fact that most speech synthes…

    Submitted 10 August, 2023; originally announced August 2023.

  23. arXiv:2308.02560  [pdf, other]

    cs.SD cs.LG eess.AS

    From Discrete Tokens to High-Fidelity Audio Using Multi-Band Diffusion

    Authors: Robin San Roman, Yossi Adi, Antoine Deleforge, Romain Serizel, Gabriel Synnaeve, Alexandre Défossez

    Abstract: Deep generative models can generate high-fidelity audio conditioned on various types of representations (e.g., mel-spectrograms, Mel-frequency Cepstral Coefficients (MFCC)). Recently, such models have been used to synthesize audio waveforms conditioned on highly compressed representations. Although such methods produce impressive results, they are prone to generate audible artifacts when the condi…

    Submitted 8 November, 2023; v1 submitted 2 August, 2023; originally announced August 2023.

    Comments: 10 pages

    Journal ref: Thirty-seventh Conference on Neural Information Processing Systems (2023)

  24. arXiv:2306.15687  [pdf, other]

    eess.AS cs.CL cs.LG cs.SD

    Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale

    Authors: Matthew Le, Apoorv Vyas, Bowen Shi, Brian Karrer, Leda Sari, Rashel Moritz, Mary Williamson, Vimal Manohar, Yossi Adi, Jay Mahadeokar, Wei-Ning Hsu

    Abstract: Large-scale generative models such as GPT and DALL-E have revolutionized the research community. These models not only generate high fidelity outputs, but are also generalists which can solve tasks not explicitly taught. In contrast, speech generative models are still primitive in terms of scale and task generalization. In this paper, we present Voicebox, the most versatile text-guided generative…

    Submitted 19 October, 2023; v1 submitted 23 June, 2023; originally announced June 2023.

    Comments: Accepted to NeurIPS 2023

  25. arXiv:2306.05284  [pdf, other]

    cs.SD cs.AI cs.LG eess.AS

    Simple and Controllable Music Generation

    Authors: Jade Copet, Felix Kreuk, Itai Gat, Tal Remez, David Kant, Gabriel Synnaeve, Yossi Adi, Alexandre Défossez

    Abstract: We tackle the task of conditional music generation. We introduce MusicGen, a single Language Model (LM) that operates over several streams of compressed discrete music representation, i.e., tokens. Unlike prior work, MusicGen is comprised of a single-stage transformer LM together with efficient token interleaving patterns, which eliminates the need for cascading several models, e.g., hierarchicall…

    Submitted 29 January, 2024; v1 submitted 8 June, 2023; originally announced June 2023.

    Comments: Published at NeurIPS 2023

  26. arXiv:2305.13516  [pdf, other]

    cs.CL cs.SD eess.AS

    Scaling Speech Technology to 1,000+ Languages

    Authors: Vineel Pratap, Andros Tjandra, Bowen Shi, Paden Tomasello, Arun Babu, Sayani Kundu, Ali Elkahky, Zhaoheng Ni, Apoorv Vyas, Maryam Fazel-Zarandi, Alexei Baevski, Yossi Adi, Xiaohui Zhang, Wei-Ning Hsu, Alexis Conneau, Michael Auli

    Abstract: Expanding the language coverage of speech technology has the potential to improve access to information for many more people. However, current speech technology is restricted to about one hundred languages, which is a small fraction of the over 7,000 languages spoken around the world. The Massively Multilingual Speech (MMS) project increases the number of supported languages by 10-40x, depending on…

    Submitted 22 May, 2023; originally announced May 2023.

  27. arXiv:2305.13050  [pdf, other]

    cs.SD cs.CV cs.LG eess.AS

    AudioToken: Adaptation of Text-Conditioned Diffusion Models for Audio-to-Image Generation

    Authors: Guy Yariv, Itai Gat, Lior Wolf, Yossi Adi, Idan Schwartz

    Abstract: In recent years, image generation has shown a great leap in performance, where diffusion models play a central role. Although generating high-quality images, such models are mainly conditioned on textual descriptions. This begs the question: "how can we adopt such models to be conditioned on other modalities?". In this paper, we propose a novel method utilizing latent diffusion models trained for…

    Submitted 22 May, 2023; originally announced May 2023.

    Comments: Accepted to INTERSPEECH 2023

  28. arXiv:2305.13009  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    Textually Pretrained Speech Language Models

    Authors: Michael Hassid, Tal Remez, Tu Anh Nguyen, Itai Gat, Alexis Conneau, Felix Kreuk, Jade Copet, Alexandre Défossez, Gabriel Synnaeve, Emmanuel Dupoux, Roy Schwartz, Yossi Adi

    Abstract: Speech language models (SpeechLMs) process and generate acoustic data only, without textual supervision. In this work, we propose TWIST, a method for training SpeechLMs using a warm-start from a pretrained textual language model. We show using both automatic and human evaluations that TWIST outperforms a cold-start SpeechLM across the board. We empirically analyze the effect of different model de…

    Submitted 30 January, 2024; v1 submitted 22 May, 2023; originally announced May 2023.

    Comments: NeurIPS 2023

  29. arXiv:2305.12393  [pdf, other]

    cs.LG cs.NE

    Layer Collaboration in the Forward-Forward Algorithm

    Authors: Guy Lorberbom, Itai Gat, Yossi Adi, Alex Schwing, Tamir Hazan

    Abstract: Backpropagation, which uses the chain rule, is the de-facto standard algorithm for optimizing neural networks nowadays. Recently, Hinton (2022) proposed the forward-forward algorithm, a promising alternative that optimizes neural nets layer-by-layer, without propagating gradients throughout the network. Although such an approach has several advantages over back-propagation and shows promising resu…

    Submitted 21 May, 2023; originally announced May 2023.

  30. arXiv:2301.10606  [pdf, other]

    cs.CL cs.SD eess.AS

    A Holistic Cascade System, benchmark, and Human Evaluation Protocol for Expressive Speech-to-Speech Translation

    Authors: Wen-Chin Huang, Benjamin Peloquin, Justine Kao, Changhan Wang, Hongyu Gong, Elizabeth Salesky, Yossi Adi, Ann Lee, Peng-Jen Chen

    Abstract: Expressive speech-to-speech translation (S2ST) aims to transfer prosodic attributes of source speech to target speech while maintaining translation accuracy. Existing research in expressive S2ST is limited, typically focusing on a single expressivity aspect at a time. Likewise, this research area lacks standard evaluation protocols and well-curated benchmark datasets. In this work, we propose a ho…

    Submitted 25 January, 2023; originally announced January 2023.

    Comments: This is the full version of our submission to ICASSP 2023

  31. Analysing Discrete Self Supervised Speech Representation for Spoken Language Modeling

    Authors: Amitay Sicherman, Yossi Adi

    Abstract: This work profoundly analyzes discrete self-supervised speech representations (units) through the eyes of Generative Spoken Language Modeling (GSLM). Following the findings of such an analysis, we propose practical improvements to the discrete unit for the GSLM. First, we start comprehending these units by analyzing them in three axes: interpretation, visualization, and resynthesis. Our analysis f…

    Submitted 1 March, 2023; v1 submitted 2 January, 2023; originally announced January 2023.

    Comments: Accepted at ICASSP 2023

  32. arXiv:2212.11377  [pdf, other]

    eess.AS cs.CV cs.LG cs.SD

    ReVISE: Self-Supervised Speech Resynthesis with Visual Input for Universal and Generalized Speech Enhancement

    Authors: Wei-Ning Hsu, Tal Remez, Bowen Shi, Jacob Donley, Yossi Adi

    Abstract: Prior works on improving speech quality with visual input typically study each type of auditory distortion separately (e.g., separation, inpainting, video-to-speech) and present tailored algorithms. This paper proposes to unify these subjects and study Generalized Speech Enhancement, where the goal is not to reconstruct the exact reference clean signal, but to focus on improving certain aspects of…

    Submitted 21 December, 2022; originally announced December 2022.

  33. arXiv:2212.09730  [pdf, other]

    cs.SD cs.CL cs.LG eess.AS

    Speaking Style Conversion in the Waveform Domain Using Discrete Self-Supervised Units

    Authors: Gallil Maimon, Yossi Adi

    Abstract: We introduce DISSC, a novel, lightweight method that converts the rhythm, pitch contour and timbre of a recording to a target speaker in a textless manner. Unlike DISSC, most voice conversion (VC) methods focus primarily on timbre, and ignore people's unique speaking style (prosody). The proposed approach uses a pretrained, self-supervised model for encoding speech to discrete units, which makes i…

    Submitted 18 October, 2023; v1 submitted 19 December, 2022; originally announced December 2022.

    Comments: Accepted at EMNLP 2023

  34. arXiv:2211.12232  [pdf, other]

    cs.SD cs.LG eess.AS

    AERO: Audio Super Resolution in the Spectral Domain

    Authors: Moshe Mandel, Or Tal, Yossi Adi

    Abstract: We present AERO, an audio super-resolution model that processes speech and music signals in the spectral domain. AERO is based on an encoder-decoder architecture with U-Net-like skip connections. We optimize the model using both time and frequency domain loss functions. Specifically, we consider a set of reconstruction losses together with perceptual ones in the form of adversarial and feature disc…

    Submitted 26 February, 2023; v1 submitted 22 November, 2022; originally announced November 2022.

  35. arXiv:2211.03089  [pdf, other]

    cs.SD eess.AS

    I Hear Your True Colors: Image Guided Audio Generation

    Authors: Roy Sheffer, Yossi Adi

    Abstract: We propose Im2Wav, an image guided open-domain audio generation system. Given an input image or a sequence of images, Im2Wav generates a semantically relevant sound. Im2Wav is based on two Transformer language models that operate over a hierarchical discrete audio representation obtained from a VQ-VAE based model. We first produce a low-level audio representation using a language model. Then, we…

    Submitted 27 February, 2023; v1 submitted 6 November, 2022; originally announced November 2022.

    Comments: Accepted at ICASSP 2023

  36. arXiv:2211.01223  [pdf, other]

    cs.SD eess.AS

    Audio Language Modeling using Perceptually-Guided Discrete Representations

    Authors: Felix Kreuk, Yaniv Taigman, Adam Polyak, Jade Copet, Gabriel Synnaeve, Alexandre Défossez, Yossi Adi

    Abstract: In this work, we study the task of Audio Language Modeling, in which we aim at learning probabilistic models for audio that can be used for generation and completion. We use a state-of-the-art perceptually-guided audio compression model to encode audio to discrete representations. Next, we train a transformer-based causal language model using these representations. At inference time, we perform a…

    Submitted 4 November, 2022; v1 submitted 2 November, 2022; originally announced November 2022.

  37. arXiv:2210.13438  [pdf, other]

    eess.AS cs.AI cs.SD stat.ML

    High Fidelity Neural Audio Compression

    Authors: Alexandre Défossez, Jade Copet, Gabriel Synnaeve, Yossi Adi

    Abstract: We introduce a state-of-the-art real-time, high-fidelity audio codec leveraging neural networks. It consists of a streaming encoder-decoder architecture with quantized latent space trained in an end-to-end fashion. We simplify and speed up the training by using a single multiscale spectrogram adversary that efficiently reduces artifacts and produces high-quality samples. We introduce a novel loss…

    Submitted 24 October, 2022; originally announced October 2022.

    Comments: Preprint

  38. arXiv:2210.06143  [pdf, ps, other]

    cs.LG stat.ML

    On the Importance of Gradient Norm in PAC-Bayesian Bounds

    Authors: Itai Gat, Yossi Adi, Alexander Schwing, Tamir Hazan

    Abstract: Generalization bounds, which assess the difference between the true risk and the empirical risk, have been studied extensively. However, to obtain bounds, current techniques use strict assumptions such as a uniformly bounded or a Lipschitz loss function. To avoid these assumptions, in this paper, we follow an alternative approach: we relax uniform bounds assumptions by using on-average bounded loss…

    Submitted 2 November, 2022; v1 submitted 12 October, 2022; originally announced October 2022.

    Comments: NeurIPS 22. arXiv admin note: text overlap with arXiv:2002.09866

  39. arXiv:2209.15483  [pdf, other]

    cs.CL cs.LG eess.AS

    Augmentation Invariant Discrete Representation for Generative Spoken Language Modeling

    Authors: Itai Gat, Felix Kreuk, Tu Anh Nguyen, Ann Lee, Jade Copet, Gabriel Synnaeve, Emmanuel Dupoux, Yossi Adi

    Abstract: Generative Spoken Language Modeling research focuses on optimizing speech Language Models (LMs) using raw audio recordings without accessing any textual supervision. Such speech LMs usually operate over discrete units obtained from quantizing internal representations of self-supervised models. Although such units show impressive modeling results, their robustness capabilities have not been extensi…

    Submitted 29 May, 2023; v1 submitted 30 September, 2022; originally announced September 2022.

  40. arXiv:2209.15352  [pdf, other]

    cs.SD cs.CL cs.LG eess.AS

    AudioGen: Textually Guided Audio Generation

    Authors: Felix Kreuk, Gabriel Synnaeve, Adam Polyak, Uriel Singer, Alexandre Défossez, Jade Copet, Devi Parikh, Yaniv Taigman, Yossi Adi

    Abstract: We tackle the problem of generating audio samples conditioned on descriptive text captions. In this work, we propose AudioGen, an auto-regressive generative model that generates audio samples conditioned on text inputs. AudioGen operates on a learnt discrete audio representation. The task of text-to-audio generation poses multiple challenges. Due to the way audio travels through a medium, differe…

    Submitted 5 March, 2023; v1 submitted 30 September, 2022; originally announced September 2022.

    Comments: Accepted to ICLR 2023

  41. arXiv:2207.10643  [pdf, other]

    cs.CL cs.AI cs.SD eess.AS

    STOP: A dataset for Spoken Task Oriented Semantic Parsing

    Authors: Paden Tomasello, Akshat Shrivastava, Daniel Lazar, Po-Chun Hsu, Duc Le, Adithya Sagar, Ali Elkahky, Jade Copet, Wei-Ning Hsu, Yossi Adi, Robin Algayres, Tu Anh Nguyen, Emmanuel Dupoux, Luke Zettlemoyer, Abdelrahman Mohamed

    Abstract: End-to-end spoken language understanding (SLU) predicts intent directly from audio using a single model. It promises to improve the performance of assistant systems by leveraging acoustic information lost in the intermediate textual representation and preventing cascading errors from Automatic Speech Recognition (ASR). Further, having one unified model has efficiency advantages when deploying assi…

    Submitted 18 October, 2022; v1 submitted 28 June, 2022; originally announced July 2022.

  42. Deep Audio Waveform Prior

    Authors: Arnon Turetzky, Tzvi Michelson, Yossi Adi, Shmuel Peleg

    Abstract: Convolutional neural networks contain strong priors for generating natural looking images [1]. These priors enable image denoising, super resolution, and inpainting in an unsupervised manner. Previous attempts to demonstrate similar ideas in audio, namely deep audio priors, (i) use hand picked architectures such as harmonic convolutions, (ii) only work with spectrogram input, and (iii) have been u…

    Submitted 21 July, 2022; originally announced July 2022.

    Comments: Interspeech 2022

  43. arXiv:2207.00760  [pdf, other]

    cs.SD cs.AI cs.LG eess.AS

    Unsupervised Symbolic Music Segmentation using Ensemble Temporal Prediction Errors

    Authors: Shahaf Bassan, Yossi Adi, Jeffrey S. Rosenschein

    Abstract: Symbolic music segmentation is the process of dividing symbolic melodies into smaller meaningful groups, such as melodic phrases. We propose an unsupervised method for segmenting symbolic music. The proposed model is based on an ensemble of temporal prediction error models. During training, each model predicts the next token to identify musical phrase changes. While at test time, we perform a pea…

    Submitted 2 July, 2022; originally announced July 2022.

  44. arXiv:2206.11000  [pdf, other]

    eess.AS cs.LG cs.SD

    A Systematic Comparison of Phonetic Aware Techniques for Speech Enhancement

    Authors: Or Tal, Moshe Mandel, Felix Kreuk, Yossi Adi

    Abstract: Speech enhancement has seen great improvement in recent years using end-to-end neural networks. However, most models are agnostic to the spoken phonetic content. Recently, several studies suggested phonetic-aware speech enhancement, mostly using perceptual supervision. Yet, injecting phonetic features during model optimization can take additional forms (e.g., model conditioning). In this paper, we…

    Submitted 22 June, 2022; originally announced June 2022.

    Comments: Published @ Interspeech 2022

  45. arXiv:2205.01324  [pdf, other]

    cs.LG cs.NE stat.ML

    Learning Discrete Structured Variational Auto-Encoder using Natural Evolution Strategies

    Authors: Alon Berliner, Guy Rotman, Yossi Adi, Roi Reichart, Tamir Hazan

    Abstract: Discrete variational auto-encoders (VAEs) are able to represent semantic latent spaces in generative learning. In many real-life settings, the discrete latent space consists of high-dimensional structures, and propagating gradients through the relevant structures often requires enumerating over an exponentially large latent space. Recently, various approaches were devised to propagate approximated…

    Submitted 3 May, 2022; originally announced May 2022.

    Comments: Published as a conference paper at ICLR 2022

  46. arXiv:2204.02967  [pdf, other]

    cs.CL cs.SD eess.AS

    Enhanced Direct Speech-to-Speech Translation Using Self-supervised Pre-training and Data Augmentation

    Authors: Sravya Popuri, Peng-Jen Chen, Changhan Wang, Juan Pino, Yossi Adi, Jiatao Gu, Wei-Ning Hsu, Ann Lee

    Abstract: Direct speech-to-speech translation (S2ST) models suffer from data scarcity issues as there exists little parallel S2ST data, compared to the amount of data available for conventional cascaded systems that consist of automatic speech recognition (ASR), machine translation (MT), and text-to-speech (TTS) synthesis. In this work, we explore self-supervised pre-training with unlabeled speech data and…

    Submitted 13 September, 2022; v1 submitted 6 April, 2022; originally announced April 2022.

    Comments: Accepted to be published in the Proceedings of Interspeech 2022

  47. arXiv:2203.16502  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    Generative Spoken Dialogue Language Modeling

    Authors: Tu Anh Nguyen, Eugene Kharitonov, Jade Copet, Yossi Adi, Wei-Ning Hsu, Ali Elkahky, Paden Tomasello, Robin Algayres, Benoit Sagot, Abdelrahman Mohamed, Emmanuel Dupoux

    Abstract: We introduce dGSLM, the first "textless" model able to generate audio samples of naturalistic spoken dialogues. It uses recent work on unsupervised spoken unit discovery coupled with a dual-tower transformer architecture with cross-attention trained on 2000 hours of two-channel raw conversational audio (Fisher dataset) without any text or labels. We show that our model is able to generate speech,…

    Submitted 22 November, 2022; v1 submitted 30 March, 2022; originally announced March 2022.

  48. Probing phoneme, language and speaker information in unsupervised speech representations

    Authors: Maureen de Seyssel, Marvin Lavechin, Yossi Adi, Emmanuel Dupoux, Guillaume Wisniewski

    Abstract: Unsupervised models of representations based on Contrastive Predictive Coding (CPC)[1] are primarily used in spoken language modelling in that they encode phonetic information. In this study, we ask what other types of information are present in CPC speech representations. We focus on three categories: phone class, gender and language, and compare monolingual and bilingual models. Using qualitativ…

    Submitted 30 March, 2022; originally announced March 2022.

    Comments: Submitted to INTERSPEECH 2022, 5 pages, 2 figures

  49. arXiv:2202.08862  [pdf, other]

    cs.SD cs.LG eess.AS

    RemixIT: Continual self-training of speech enhancement models via bootstrapped remixing

    Authors: Efthymios Tzinis, Yossi Adi, Vamsi Krishna Ithapu, Buye Xu, Paris Smaragdis, Anurag Kumar

    Abstract: We present RemixIT, a simple yet effective self-supervised method for training speech enhancement models without needing a single isolated in-domain speech or noise waveform. Our approach overcomes limitations of previous methods that make them dependent on clean in-domain target signals and thus sensitive to any domain mismatch between train and test samples. RemixIT is based on a continuous se…

    Submitted 3 August, 2022; v1 submitted 17 February, 2022; originally announced February 2022.

    Comments: To appear in IEEE Journal of Selected Topics in Signal Processing

    Journal ref: J-STSP-SLSAP-00040-2022

  50. arXiv:2202.07359  [pdf, other]

    cs.CL cs.LG cs.SD eess.AS

    textless-lib: a Library for Textless Spoken Language Processing

    Authors: Eugene Kharitonov, Jade Copet, Kushal Lakhotia, Tu Anh Nguyen, Paden Tomasello, Ann Lee, Ali Elkahky, Wei-Ning Hsu, Abdelrahman Mohamed, Emmanuel Dupoux, Yossi Adi

    Abstract: Textless spoken language processing research aims to extend the applicability of the standard NLP toolset to spoken language and languages with few or no textual resources. In this paper, we introduce textless-lib, a PyTorch-based library aimed at facilitating research in this area. We describe the building blocks that the library provides and demonstrate its usability by discussing three differ…

    Submitted 15 February, 2022; originally announced February 2022.

    Comments: The library is available here https://github.com/facebookresearch/textlesslib/