Showing 1–50 of 160 results for author: Kang, W

  1. arXiv:2409.00819  [pdf, other]

    cs.SD cs.CL eess.AS

    LibriheavyMix: A 20,000-Hour Dataset for Single-Channel Reverberant Multi-Talker Speech Separation, ASR and Speaker Diarization

    Authors: Zengrui Jin, Yifan Yang, Mohan Shi, Wei Kang, Xiaoyu Yang, Zengwei Yao, Fangjun Kuang, Liyong Guo, Lingwei Meng, Long Lin, Yong Xu, Shi-Xiong Zhang, Daniel Povey

    Abstract: The evolving speech processing landscape is increasingly focused on complex scenarios like meetings or cocktail parties with multiple simultaneous speakers and far-field conditions. Existing methodologies for addressing these challenges fall into two categories: multi-channel and single-channel solutions. Single-channel approaches, notable for their generality and convenience, do not require speci…

    Submitted 1 September, 2024; originally announced September 2024.

    Comments: InterSpeech 2024

  2. Anonymization of Voices in Spaces for Civic Dialogue: Measuring Impact on Empathy, Trust, and Feeling Heard

    Authors: Wonjune Kang, Margaret A. Hughes, Deb Roy

    Abstract: Anonymity is a powerful component of many participatory media platforms that can afford people greater freedom of expression and protection from external coercion and interference. However, it can be difficult to effectively implement on platforms that leverage spoken language due to distinct biomarkers present in the human voice. In this work, we explore the use of voice anonymization methods wit…

    Submitted 25 August, 2024; originally announced August 2024.

    Comments: Accepted to CSCW 2024 (Proceedings of the ACM on Human-Computer Interaction, Vol. 8, No. CSCW2)

  3. arXiv:2408.11518  [pdf, other]

    cs.CV

    EmoFace: Emotion-Content Disentangled Speech-Driven 3D Talking Face with Mesh Attention

    Authors: Yihong Lin, Liang Peng, Jianqiao Hu, Xiandong Li, Wenxiong Kang, Songju Lei, Xianjia Wu, Huang Xu

    Abstract: The creation of increasingly vivid 3D virtual digital humans has become a hot topic in recent years. Currently, most speech-driven work focuses on training models to learn the relationship between phonemes and visemes to achieve more realistic lips. However, they fail to capture the correlations between emotions and facial expressions effectively. To solve this problem, we propose a new model, ter…

    Submitted 21 August, 2024; originally announced August 2024.

  4. Mesh deformation-based single-view 3D reconstruction of thin eyeglasses frames with differentiable rendering

    Authors: Fan Zhang, Ziyue Ji, Weiguang Kang, Weiqing Li, Zhiyong Su

    Abstract: With the support of Virtual Reality (VR) and Augmented Reality (AR) technologies, the 3D virtual eyeglasses try-on application is well on its way to becoming a new trending solution that offers a "try on" option to select the perfect pair of eyeglasses at the comfort of your own home. Reconstructing eyeglasses frames from a single image with traditional depth and image-based methods is extremely d…

    Submitted 9 August, 2024; originally announced August 2024.

    Journal ref: Graphical Models, Volume 135, October 2024, 101225

  5. arXiv:2408.01826  [pdf, other]

    cs.CV

    GLDiTalker: Speech-Driven 3D Facial Animation with Graph Latent Diffusion Transformer

    Authors: Yihong Lin, Zhaoxin Fan, Lingyu Xiong, Liang Peng, Xiandong Li, Wenxiong Kang, Xianjia Wu, Songju Lei, Huang Xu

    Abstract: Speech-driven talking head generation is an important but challenging task for many downstream applications such as augmented reality. Existing methods have achieved remarkable performance by utilizing autoregressive models or diffusion models. However, most still suffer from modality inconsistencies, specifically the misalignment between audio and mesh modalities, which causes inconsistencies in…

    Submitted 16 August, 2024; v1 submitted 3 August, 2024; originally announced August 2024.

    Comments: 9 pages, 5 figures

  6. arXiv:2408.01077  [pdf, other]

    cs.CV

    PhysMamba: State Space Duality Model for Remote Physiological Measurement

    Authors: Zhixin Yan, Yan Zhong, Hongbin Xu, Wenjun Zhang, Lin Shu, Hongbin Xu, Wenxiong Kang

    Abstract: Remote Photoplethysmography (rPPG) is a non-contact technique for extracting physiological signals from facial videos, used in applications like emotion monitoring, medical assistance, and anti-face spoofing. Unlike controlled laboratory settings, real-world environments often contain motion artifacts and noise, affecting the performance of existing rPPG methods. To address this, we propose PhysMa…

    Submitted 17 August, 2024; v1 submitted 2 August, 2024; originally announced August 2024.

  7. arXiv:2407.14804  [pdf, other]

    cs.CR

    WiFaKey: Generating Cryptographic Keys from Face in the Wild

    Authors: Xingbo Dong, Hui Zhang, Yen Lung Lai, Zhe Jin, Junduan Huang, Wenxiong Kang, Andrew Beng Jin Teoh

    Abstract: Deriving a unique cryptographic key from biometric measurements is a challenging task due to the existing noise gap between the biometric measurements and error correction coding. Additionally, privacy and security concerns arise as biometric measurements are inherently linked to the user. Biocryptosystems represent a key branch of solutions aimed at addressing these issues. However, many existing…

    Submitted 20 July, 2024; originally announced July 2024.

  8. arXiv:2407.06317  [pdf, other]

    cs.AI cs.CV cs.RO

    Enhanced Safety in Autonomous Driving: Integrating Latent State Diffusion Model for End-to-End Navigation

    Authors: Detian Chu, Linyuan Bai, Jianuo Huang, Zhenlong Fang, Peng Zhang, Wei Kang, Haifeng Lin

    Abstract: With the advancement of autonomous driving, ensuring safety during motion planning and navigation is becoming more and more important. However, most end-to-end planning methods suffer from a lack of safety. This research addresses the safety issue in the control optimization problem of autonomous driving, formulated as Constrained Markov Decision Processes (CMDPs). We propose a novel, model-based…

    Submitted 17 July, 2024; v1 submitted 8 July, 2024; originally announced July 2024.

  9. arXiv:2407.05967  [pdf, other]

    cs.CV

    STMR: Spiral Transformer for Hand Mesh Reconstruction

    Authors: Huilong Xie, Wenwei Song, Wenxiong Kang, Yihong Lin

    Abstract: Recent advancements in both transformer-based methods and spiral neighbor sampling techniques have greatly enhanced hand mesh reconstruction. Transformers excel in capturing complex vertex relationships, and spiral neighbor sampling is vital for utilizing topological structures. This paper ingeniously integrates spiral sampling into the Transformer architecture, enhancing its ability to leverage m…

    Submitted 8 July, 2024; originally announced July 2024.

  10. arXiv:2407.03251  [pdf, other]

    cs.CV

    ACTRESS: Active Retraining for Semi-supervised Visual Grounding

    Authors: Weitai Kang, Mengxue Qu, Yunchao Wei, Yan Yan

    Abstract: Semi-Supervised Visual Grounding (SSVG) is a new challenge for its sparse labeled data with the need for multimodel understanding. A previous study, RefTeacher, makes the first attempt to tackle this task by adopting the teacher-student framework to provide pseudo confidence supervision and attention-based supervision. However, this approach is incompatible with current state-of-the-art visual gro…

    Submitted 6 July, 2024; v1 submitted 3 July, 2024; originally announced July 2024.

  11. arXiv:2407.03243  [pdf, other]

    cs.CV

    Visual Grounding with Attention-Driven Constraint Balancing

    Authors: Weitai Kang, Luowei Zhou, Junyi Wu, Changchang Sun, Yan Yan

    Abstract: Unlike Object Detection, Visual Grounding task necessitates the detection of an object described by complex free-form language. To simultaneously model such complex semantic and visual representations, recent state-of-the-art studies adopt transformer-based models to fuse features from both modalities, further introducing various modules that modulate visual features to align with the language exp…

    Submitted 6 July, 2024; v1 submitted 3 July, 2024; originally announced July 2024.

  12. arXiv:2407.03200  [pdf, other]

    cs.CV

    SegVG: Transferring Object Bounding Box to Segmentation for Visual Grounding

    Authors: Weitai Kang, Gaowen Liu, Mubarak Shah, Yan Yan

    Abstract: Different from Object Detection, Visual Grounding deals with detecting a bounding box for each text-image pair. This one box for each text-image data provides sparse supervision signals. Although previous works achieve impressive results, their passive utilization of annotation, i.e. the sole use of the box annotation as regression ground truth, results in a suboptimal performance. In this paper,…

    Submitted 6 July, 2024; v1 submitted 3 July, 2024; originally announced July 2024.

    Comments: Accepted to ECCV 2024

  13. arXiv:2406.05968  [pdf, other]

    eess.AS cs.CL

    Prompting Large Language Models with Audio for General-Purpose Speech Summarization

    Authors: Wonjune Kang, Deb Roy

    Abstract: In this work, we introduce a framework for speech summarization that leverages the processing and reasoning capabilities of large language models (LLMs). We propose an end-to-end system that combines an instruction-tuned LLM with an audio encoder that converts speech into token representations that the LLM can interpret. Using a dataset with paired speech-text data, the overall system is trained t…

    Submitted 9 June, 2024; originally announced June 2024.

    Comments: Accepted to Interspeech 2024

  14. arXiv:2405.18295  [pdf, other]

    cs.CV

    Intent3D: 3D Object Detection in RGB-D Scans Based on Human Intention

    Authors: Weitai Kang, Mengxue Qu, Jyoti Kini, Yunchao Wei, Mubarak Shah, Yan Yan

    Abstract: In real-life scenarios, humans seek out objects in the 3D world to fulfill their daily needs or intentions. This inspires us to introduce 3D intention grounding, a new task in 3D object detection employing RGB-D, based on human intention, such as "I want something to support my back". Closely related, 3D visual grounding focuses on understanding human reference. To achieve detection based on human…

    Submitted 6 July, 2024; v1 submitted 28 May, 2024; originally announced May 2024.

  15. arXiv:2405.17880  [pdf, other]

    cs.LG

    Diffusion Rejection Sampling

    Authors: Byeonghu Na, Yeongmin Kim, Minsang Park, Donghyeok Shin, Wanmo Kang, Il-Chul Moon

    Abstract: Recent advances in powerful pre-trained diffusion models encourage the development of methods to improve the sampling performance under well-trained diffusion models. This paper introduces Diffusion Rejection Sampling (DiffRS), which uses a rejection sampling scheme that aligns the sampling transition kernels with the true ones at each timestep. The proposed method can be viewed as a mechanism tha…

    Submitted 28 May, 2024; originally announced May 2024.

    Comments: Accepted at ICML 2024

  16. arXiv:2405.13302  [pdf, other]

    stat.ML cs.DM cs.LG math.OC

    Accelerated Evaluation of Ollivier-Ricci Curvature Lower Bounds: Bridging Theory and Computation

    Authors: Wonwoo Kang, Heehyun Park

    Abstract: Curvature serves as a potent and descriptive invariant, with its efficacy validated both theoretically and practically within graph theory. We employ a definition of generalized Ricci curvature proposed by Ollivier, which Lin and Yau later adapted to graph theory, known as Ollivier-Ricci curvature (ORC). ORC measures curvature using the Wasserstein distance, thereby integrating geometric concepts…

    Submitted 21 May, 2024; originally announced May 2024.

  17. arXiv:2405.09131  [pdf, other]

    cs.CV

    RobustMVS: Single Domain Generalized Deep Multi-view Stereo

    Authors: Hongbin Xu, Weitao Chen, Baigui Sun, Xuansong Xie, Wenxiong Kang

    Abstract: Despite the impressive performance of Multi-view Stereo (MVS) approaches given plenty of training samples, the performance degradation when generalizing to unseen domains has not been clearly explored yet. In this work, we focus on the domain generalization problem in MVS. To evaluate the generalization results, we build a novel MVS domain generalization benchmark including synthetic and real-worl…

    Submitted 15 May, 2024; originally announced May 2024.

    Comments: Accepted to TCSVT. Code will be released at: https://github.com/ToughStoneX/Robust-MVS. Benchmark will be released at: https://github.com/ToughStoneX/MVS_Evaluation_Benchmark

  18. arXiv:2404.06884  [pdf, ps, other]

    cs.IT

    Demand Private Coded Caching: the Two-File Case

    Authors: Qinyi Lu, Nan Liu, Wei Kang

    Abstract: We investigate the demand private coded caching problem, which is an $(N,K)$ coded caching problem with $N$ files, $K$ users, each equipped with a cache of size $M$, and an additional privacy constraint on user demands. We first present a new virtual-user-based achievable scheme for arbitrary number of users and files. Then, for the case of 2 files and arbitrary number of users, we derive some new…

    Submitted 6 May, 2024; v1 submitted 10 April, 2024; originally announced April 2024.

  19. arXiv:2404.01415  [pdf, other]

    cs.CV

    On the Faithfulness of Vision Transformer Explanations

    Authors: Junyi Wu, Weitai Kang, Hao Tang, Yuan Hong, Yan Yan

    Abstract: To interpret Vision Transformers, post-hoc explanations assign salience scores to input pixels, providing human-understandable heatmaps. However, whether these interpretations reflect true rationales behind the model's output is still underexplored. To address this gap, we study the faithfulness criterion of explanations: the assigned salience scores should represent the influence of the correspon…

    Submitted 1 April, 2024; originally announced April 2024.

    Comments: CVPR 2024

  20. arXiv:2403.15483  [pdf]

    eess.SP cs.LG

    Rolling bearing fault diagnosis method based on generative adversarial enhanced multi-scale convolutional neural network model

    Authors: Maoxuan Zhou, Wei Kang, Kun He

    Abstract: In order to solve the problem that current convolutional neural networks can not capture the correlation features between the time domain signals of rolling bearings effectively, and the model accuracy is limited by the number and quality of samples, a rolling bearing fault diagnosis method based on generative adversarial enhanced multi-scale convolutional neural network model is proposed. Firstly…

    Submitted 21 March, 2024; originally announced March 2024.

  21. arXiv:2403.14552  [pdf, other]

    cs.CV

    Token Transformation Matters: Towards Faithful Post-hoc Explanation for Vision Transformer

    Authors: Junyi Wu, Bin Duan, Weitai Kang, Hao Tang, Yan Yan

    Abstract: While Transformers have rapidly gained popularity in various computer vision applications, post-hoc explanations of their internal mechanisms remain largely unexplored. Vision Transformers extract visual information by representing image regions as transformed tokens and integrating them via attention weights. However, existing post-hoc explanation methods merely consider these attention weights,…

    Submitted 21 March, 2024; originally announced March 2024.

    Comments: CVPR 2024

  22. arXiv:2403.13380  [pdf, other]

    cs.CE

    A characteristics-based method for shock-ramp data analysis

    Authors: Jingxiang Shen, Wei Kang

    Abstract: For the data analysis problem of shock-ramp compression, i.e., ramp compression after a relatively strong initial shock, a characteristics-based method that strictly deals with the initial hydrodynamic shock is described in detail. Validation of this analysis method using simulated shock-ramp data generated by molecular dynamics and one-dimensional radiation hydrodynamic code is also presented.

    Submitted 20 March, 2024; originally announced March 2024.

  23. arXiv:2403.09468  [pdf, other]

    cs.CV

    Eta Inversion: Designing an Optimal Eta Function for Diffusion-based Real Image Editing

    Authors: Wonjun Kang, Kevin Galim, Hyung Il Koo

    Abstract: Diffusion models have achieved remarkable success in the domain of text-guided image generation and, more recently, in text-guided image editing. A commonly adopted strategy for editing real images involves inverting the diffusion process to obtain a noisy representation of the original image, which is then denoised to achieve the desired edits. However, current methods for diffusion inversion oft…

    Submitted 15 July, 2024; v1 submitted 14 March, 2024; originally announced March 2024.

    Comments: ECCV 2024. Code: https://github.com/furiosa-ai/eta-inversion

  24. arXiv:2403.08310  [pdf, other]

    cs.CV

    StyleDyRF: Zero-shot 4D Style Transfer for Dynamic Neural Radiance Fields

    Authors: Hongbin Xu, Weitao Chen, Feng Xiao, Baigui Sun, Wenxiong Kang

    Abstract: 4D style transfer aims at transferring arbitrary visual style to the synthesized novel views of a dynamic 4D scene with varying viewpoints and times. Existing efforts on 3D style transfer can effectively combine the visual features of style images and neural radiance fields (NeRF) but fail to handle the 4D dynamic scenes limited by the static scene assumption. Consequently, we aim to handle the no…

    Submitted 13 March, 2024; originally announced March 2024.

    Comments: In submission. The code and model are released at: https://github.com/ToughStoneX/StyleDyRF

  25. arXiv:2403.08182  [pdf, other]

    cs.CV

    SeCG: Semantic-Enhanced 3D Visual Grounding via Cross-modal Graph Attention

    Authors: Feng Xiao, Hongbin Xu, Qiuxia Wu, Wenxiong Kang

    Abstract: 3D visual grounding aims to automatically locate the 3D region of the specified object given the corresponding textual description. Existing works fail to distinguish similar objects especially when multiple referred objects are involved in the description. Experiments show that direct matching of language and visual modal has limited capacity to comprehend complex referential relationships in utt…

    Submitted 12 March, 2024; originally announced March 2024.

  26. arXiv:2403.01189  [pdf, other]

    cs.LG cs.CV

    Training Unbiased Diffusion Models From Biased Dataset

    Authors: Yeongmin Kim, Byeonghu Na, Minsang Park, JoonHo Jang, Dongjun Kim, Wanmo Kang, Il-Chul Moon

    Abstract: With significant advancements in diffusion models, addressing the potential risks of dataset bias becomes increasingly important. Since generated outputs directly suffer from dataset bias, mitigating latent bias becomes a key factor in improving sample quality and proportion. This paper proposes time-dependent importance reweighting to mitigate the bias for the diffusion models. We demonstrate tha…

    Submitted 2 March, 2024; originally announced March 2024.

    Comments: International Conference on Learning Representations (ICLR 2024)

  27. arXiv:2402.17517  [pdf, other]

    cs.LG

    Label-Noise Robust Diffusion Models

    Authors: Byeonghu Na, Yeongmin Kim, HeeSun Bae, Jung Hyun Lee, Se Jung Kwon, Wanmo Kang, Il-Chul Moon

    Abstract: Conditional diffusion models have shown remarkable performance in various generative tasks, but training them requires large-scale datasets that often contain noise in conditional inputs, a.k.a. noisy labels. This noise leads to condition mismatch and quality degradation of generated data. This paper proposes Transition-aware weighted Denoising Score Matching (TDSM) for training conditional diffus…

    Submitted 27 February, 2024; originally announced February 2024.

    Comments: Accepted at ICLR 2024

  28. arXiv:2402.09668  [pdf, other]

    cs.LG cs.AI cs.CL

    How to Train Data-Efficient LLMs

    Authors: Noveen Sachdeva, Benjamin Coleman, Wang-Cheng Kang, Jianmo Ni, Lichan Hong, Ed H. Chi, James Caverlee, Julian McAuley, Derek Zhiyuan Cheng

    Abstract: The training of large language models (LLMs) is expensive. In this paper, we study data-efficient approaches for pre-training LLMs, i.e., techniques that aim to optimize the Pareto frontier of model quality and training resource/data consumption. We seek to understand the tradeoffs associated with data selection routines based on (i) expensive-to-compute data-quality estimates, and (ii) maximizati…

    Submitted 14 February, 2024; originally announced February 2024.

    Comments: Under review. 44 pages, 30 figures

  29. arXiv:2402.01293  [pdf, other]

    cs.LG cs.CL

    Can MLLMs Perform Text-to-Image In-Context Learning?

    Authors: Yuchen Zeng, Wonjun Kang, Yicong Chen, Hyung Il Koo, Kangwook Lee

    Abstract: The evolution from Large Language Models (LLMs) to Multimodal Large Language Models (MLLMs) has spurred research into extending In-Context Learning (ICL) to its multimodal counterpart. Existing such studies have primarily concentrated on image-to-text ICL. However, the Text-to-Image ICL (T2I-ICL), with its unique characteristics and potential applications, remains underexplored. To address this ga…

    Submitted 20 July, 2024; v1 submitted 2 February, 2024; originally announced February 2024.

    Comments: Accepted at COLM 2024

  30. Multi-Task Learning for Front-End Text Processing in TTS

    Authors: Wonjune Kang, Yun Wang, Shun Zhang, Arthur Hinsvark, Qing He

    Abstract: We propose a multi-task learning (MTL) model for jointly performing three tasks that are commonly solved in a text-to-speech (TTS) front-end: text normalization (TN), part-of-speech (POS) tagging, and homograph disambiguation (HD). Our framework utilizes a tree-like structure with a trunk that learns shared representations, followed by separate task-specific heads. We further incorporate a pre-tra…

    Submitted 11 January, 2024; originally announced January 2024.

    Comments: ICASSP 2024

  31. arXiv:2312.16392  [pdf, other]

    cs.CV cs.AI

    Adaptive Depth Networks with Skippable Sub-Paths

    Authors: Woochul Kang

    Abstract: Predictable adaptation of network depths can be an effective way to control inference latency and meet the resource condition of various devices. However, previous adaptive depth networks do not provide general principles and a formal explanation on why and which layers can be skipped, and, hence, their approaches are hard to be generalized and require long and complex training steps. In this pape…

    Submitted 13 May, 2024; v1 submitted 26 December, 2023; originally announced December 2023.

    Comments: 15 pages

  32. arXiv:2312.10877  [pdf, other]

    cs.CV

    Mimic: Speaking Style Disentanglement for Speech-Driven 3D Facial Animation

    Authors: Hui Fu, Zeqing Wang, Ke Gong, Keze Wang, Tianshui Chen, Haojie Li, Haifeng Zeng, Wenxiong Kang

    Abstract: Speech-driven 3D facial animation aims to synthesize vivid facial animations that accurately synchronize with speech and match the unique speaking style. However, existing works primarily focus on achieving precise lip synchronization while neglecting to model the subject-specific speaking style, often resulting in unrealistic facial animations. To the best of our knowledge, this work makes the fi…

    Submitted 17 December, 2023; originally announced December 2023.

    Comments: 7 pages, 6 figures, accepted by AAAI-24

  33. arXiv:2312.06742  [pdf, other]

    cs.CV cs.AI cs.CL cs.LG

    Honeybee: Locality-enhanced Projector for Multimodal LLM

    Authors: Junbum Cha, Wooyoung Kang, Jonghwan Mun, Byungseok Roh

    Abstract: In Multimodal Large Language Models (MLLMs), a visual projector plays a crucial role in bridging pre-trained vision encoders with LLMs, enabling profound visual understanding while harnessing the LLMs' robust capabilities. Despite the importance of the visual projector, it has been relatively less explored. In this study, we first identify two essential projector properties: (i) flexibility in man…

    Submitted 31 March, 2024; v1 submitted 11 December, 2023; originally announced December 2023.

    Comments: CVPR 2024 camera-ready

  34. arXiv:2312.05737  [pdf, ps, other]

    cs.IT

    Multiple Threshold Schemes Under The Weak Secure Condition

    Authors: Jiahong Wu, Nan Liu, Wei Kang

    Abstract: In this paper, we consider the case that sharing many secrets among a set of participants using the threshold schemes. All secrets are assumed to be statistically independent and the weak secure condition is focused on. Under such circumstances we investigate the infimum of the (average) information ratio and the (average) randomness ratio for any structure pair which consists of the number of the…

    Submitted 9 December, 2023; originally announced December 2023.

  35. arXiv:2312.04922  [pdf, other]

    cs.IT

    A Cyclic Placement Strategy for Multi-access Coded Caching

    Authors: Zeru Chen, Nan Liu, Wei Kang

    Abstract: We investigate the multi-access coded caching problem, which involves $N$ files, $K$ users, and $K$ caches in this paper. Each user can access $L$ adjacent caches in a cyclic manner. We present a coded placement scheme for the case of cache $M=\frac{K-1}{KL}$, when $\frac{K-1}{L}$ is an integer. The scheme is based on coded placement and involves cyclic placement in caches. In many parameter setti…

    Submitted 8 December, 2023; originally announced December 2023.

  36. arXiv:2310.15747  [pdf, other]

    cs.CV

    Large Language Models are Temporal and Causal Reasoners for Video Question Answering

    Authors: Dohwan Ko, Ji Soo Lee, Wooyoung Kang, Byungseok Roh, Hyunwoo J. Kim

    Abstract: Large Language Models (LLMs) have shown remarkable performances on a wide range of natural language understanding and generation tasks. We observe that the LLMs provide effective priors in exploiting $\textit{linguistic shortcuts}$ for temporal and causal reasoning in Video Question Answering (VideoQA). However, such priors often cause suboptimal results on VideoQA by leading the model to over-rel…

    Submitted 6 November, 2023; v1 submitted 24 October, 2023; originally announced October 2023.

    Comments: Accepted paper at EMNLP 2023 Main

  37. arXiv:2310.14796  [pdf, other]

    cs.SD cs.AI eess.AS

    A Novel Transfer Learning Method Utilizing Acoustic and Vibration Signals for Rotating Machinery Fault Diagnosis

    Authors: Zhongliang Chen, Zhuofei Huang, Wenxiong Kang

    Abstract: Fault diagnosis of rotating machinery plays a important role for the safety and stability of modern industrial systems. However, there is a distribution discrepancy between training data and data of real-world operation scenarios, which causing the decrease of performance of existing systems. This paper proposed a transfer learning based method utilizing acoustic and vibration signal to address th…

    Submitted 20 October, 2023; originally announced October 2023.

  38. arXiv:2310.11230  [pdf, other]

    eess.AS cs.LG cs.SD

    Zipformer: A faster and better encoder for automatic speech recognition

    Authors: Zengwei Yao, Liyong Guo, Xiaoyu Yang, Wei Kang, Fangjun Kuang, Yifan Yang, Zengrui Jin, Long Lin, Daniel Povey

    Abstract: The Conformer has become the most popular encoder model for automatic speech recognition (ASR). It adds convolution modules to a transformer to learn both local and global dependencies. In this work we describe a faster, more memory-efficient, and better-performing transformer, called Zipformer. Modeling changes include: 1) a U-Net-like encoder structure where middle stacks operate at lower frame…

    Submitted 9 April, 2024; v1 submitted 17 October, 2023; originally announced October 2023.

    Comments: Published as a conference paper at ICLR 2024

  39. arXiv:2310.09983  [pdf, other]

    cs.LG cs.AI cs.CL cs.IR

    Farzi Data: Autoregressive Data Distillation

    Authors: Noveen Sachdeva, Zexue He, Wang-Cheng Kang, Jianmo Ni, Derek Zhiyuan Cheng, Julian McAuley

    Abstract: We study data distillation for auto-regressive machine learning tasks, where the input and output have a strict left-to-right causal structure. More specifically, we propose Farzi, which summarizes an event sequence dataset into a small number of synthetic sequences -- Farzi Data -- which are optimized to maintain (if not improve) model performance compared to training on the full dataset. Under t…

    Submitted 15 October, 2023; originally announced October 2023.

    Comments: Under review. 23 pages, 9 figures

  40. arXiv:2310.06765  [pdf, other]

    cs.RO

    Efficient Graduated Non-Convexity for Pose Graph Optimization

    Authors: Wonseok Kang, Jaehyun Kim, Jiseong Chung, Seungwon Choi, Tae-wan Kim

    Abstract: We propose a novel approach to Graduated Non-Convexity (GNC) and demonstrate its efficacy through its application in robust pose graph optimization, a key component in SLAM backends. Traditional GNC methods often rely on heuristic methods for GNC schedule, updating control parameter μ for escalating the non-convexity. In contrast, our approach leverages the properties of convex functions and conve…

    Submitted 10 October, 2023; originally announced October 2023.

    Comments: 6 pages, 6 figures

  41. arXiv:2309.08105  [pdf, other]

    eess.AS cs.SD

    Libriheavy: a 50,000 hours ASR corpus with punctuation casing and context

    Authors: Wei Kang, Xiaoyu Yang, Zengwei Yao, Fangjun Kuang, Yifan Yang, Liyong Guo, Long Lin, Daniel Povey

    Abstract: In this paper, we introduce Libriheavy, a large-scale ASR corpus consisting of 50,000 hours of read English speech derived from LibriVox. To the best of our knowledge, Libriheavy is the largest freely-available corpus of speech with supervisions. Different from other open-sourced datasets that only provide normalized transcriptions, Libriheavy contains richer information such as punctuation, casin…

    Submitted 14 January, 2024; v1 submitted 14 September, 2023; originally announced September 2023.

    Comments: Submitted to ICASSP 2024

  42. arXiv:2309.07414  [pdf, other]

    eess.AS cs.CL cs.SD

    PromptASR for contextualized ASR with controllable style

    Authors: Xiaoyu Yang, Wei Kang, Zengwei Yao, Yifan Yang, Liyong Guo, Fangjun Kuang, Long Lin, Daniel Povey

    Abstract: Prompts are crucial to large language models as they provide context information such as topic or logical relationships. Inspired by this, we propose PromptASR, a framework that integrates prompts in end-to-end automatic speech recognition (E2E ASR) systems to achieve contextualized ASR with controllable style of transcriptions. Specifically, a dedicated text encoder encodes the text prompts and t…

    Submitted 24 January, 2024; v1 submitted 13 September, 2023; originally announced September 2023.

    Comments: Proc. ICASSP 2024

  43. arXiv:2309.04036  [pdf, other]

    cs.CR

    One-to-Multiple Clean-Label Image Camouflage (OmClic) based Backdoor Attack on Deep Learning

    Authors: Guohong Wang, Hua Ma, Yansong Gao, Alsharif Abuadbba, Zhi Zhang, Wei Kang, Said F. Al-Sarawib, Gongxuan Zhang, Derek Abbott

    Abstract: Image camouflage has been utilized to create clean-label poisoned images for implanting backdoor into a DL model. But there exists a crucial limitation that one attack/poisoned image can only fit a single input size of the DL model, which greatly increases its attack budget when attacking multiple commonly adopted input sizes of DL models. This work proposes to constructively craft an attack image…

    Submitted 28 January, 2024; v1 submitted 7 September, 2023; originally announced September 2023.

  44. arXiv:2309.01961  [pdf, other]

    cs.CV

    NICE: CVPR 2023 Challenge on Zero-shot Image Captioning

    Authors: Taehoon Kim, Pyunghwan Ahn, Sangyun Kim, Sihaeng Lee, Mark Marsden, Alessandra Sala, Seung Hwan Kim, Bohyung Han, Kyoung Mu Lee, Honglak Lee, Kyounghoon Bae, Xiangyu Wu, Yi Gao, Hailiang Zhang, Yang Yang, Weili Guo, Jianfeng Lu, Youngtaek Oh, Jae Won Cho, Dong-jin Kim, In So Kweon, Junmo Kim, Wooyoung Kang, Won Young Jhoo, Byungseok Roh, et al. (17 additional authors not shown)

    Abstract: In this report, we introduce NICE (New frontiers for zero-shot Image Captioning Evaluation) project and share the results and outcomes of 2023 challenge. This project is designed to challenge the computer vision community to develop robust image captioning models that advance the state-of-the-art both in terms of accuracy and fairness. Through the challenge, the image captioning models were tested…

    Submitted 10 September, 2023; v1 submitted 5 September, 2023; originally announced September 2023.

    Comments: Tech report, project page https://nice.lgresearch.ai/

  45. arXiv:2308.11444  [pdf, other]

    cs.RO

    Adaptive Graduated Non-Convexity for Pose Graph Optimization

    Authors: Seungwon Choi, Wonseok Kang, Jiseong Chung, Jaehyun Kim, Tae-wan Kim

    Abstract: We present a novel approach to robust pose graph optimization based on Graduated Non-Convexity (GNC). Unlike traditional GNC-based methods, the proposed approach employs an adaptive shape function using B-spline to optimize the shape of the robust kernel. This aims to reduce GNC iterations, boosting computational speed without compromising accuracy. When integrated with the open-source riSAM algor…

    Submitted 23 September, 2023; v1 submitted 22 August, 2023; originally announced August 2023.

    Comments: 4 pages, 3 figures. Accepted for the workshop on Robotic Perception and Mapping (ROPEM): Frontier Vision & Learning Techniques, organized at the 2023 International Conference on Intelligent Robots and Systems (IROS)

  46. arXiv:2308.11018  [pdf]

    cs.RO

    Computational Synthesis of Wearable Robot Mechanisms: Application to Hip-Joint Mechanisms

    Authors: Seok Won Kang, Jegyeong Ryu, Suh In Kim, Youngsoo Kim, Yoon Young Kim

    Abstract: Since wearable linkage mechanisms could control the moment transmission from actuator(s) to wearers, they can help ensure that even low-cost wearable systems provide advanced functionality tailored to users' needs. For example, if a hip mechanism transforms an input torque into a spatially-varying moment, a wearer can get effective assistance both in the sagittal and frontal planes during walking,…

    Submitted 21 August, 2023; originally announced August 2023.

    Comments: 28 pages, 7 figures, Supplementary Materials

  47. arXiv:2308.07778  [pdf, other]

    eess.IV cs.CV

    An Interpretable Machine Learning Model with Deep Learning-based Imaging Biomarkers for Diagnosis of Alzheimer's Disease

    Authors: Wenjie Kang, Bo Li, Janne M. Papma, Lize C. Jiskoot, Peter Paul De Deyn, Geert Jan Biessels, Jurgen A. H. R. Claassen, Huub A. M. Middelkoop, Wiesje M. van der Flier, Inez H. G. B. Ramakers, Stefan Klein, Esther E. Bron

    Abstract: Machine learning methods have shown large potential for the automatic early diagnosis of Alzheimer's Disease (AD). However, some machine learning methods based on imaging data have poor interpretability because it is usually unclear how they make their decisions. Explainable Boosting Machines (EBMs) are interpretable machine learning models based on the statistical framework of generalized additiv…

    Submitted 15 August, 2023; originally announced August 2023.

    Comments: 11 pages, 5 figures

  48. arXiv:2307.07178  [pdf, other]

    math.NA cs.LG eess.SY

    A Surrogate Data Assimilation Model for the Estimation of Dynamical System in a Limited Area

    Authors: Wei Kang, Liang Xu, Hong Zhou

    Abstract: We propose a novel learning-based surrogate data assimilation (DA) model for efficient state estimation in a limited area. Our model employs a feedforward neural network for online computation, eliminating the need for integrating high-dimensional limited-area models. This approach offers significant computational advantages over traditional DA algorithms. Furthermore, our method avoids the requir…

    Submitted 14 July, 2023; originally announced July 2023.

  49. arXiv:2307.04427  [pdf, other]

    astro-ph.HE astro-ph.GA cs.LG

    Observation of high-energy neutrinos from the Galactic plane

    Authors: R. Abbasi, M. Ackermann, J. Adams, J. A. Aguilar, M. Ahlers, M. Ahrens, J. M. Alameddine, A. A. Alves Jr., N. M. Amin, K. Andeen, T. Anderson, G. Anton, C. Argüelles, Y. Ashida, S. Athanasiadou, S. Axani, X. Bai, A. Balagopal V., S. W. Barwick, V. Basu, S. Baur, R. Bay, J. J. Beatty, K. -H. Becker, J. Becker Tjus, et al. (364 additional authors not shown)

    Abstract: The origin of high-energy cosmic rays, atomic nuclei that continuously impact Earth's atmosphere, has been a mystery for over a century. Due to deflection in interstellar magnetic fields, cosmic rays from the Milky Way arrive at Earth from random directions. However, near their sources and during propagation, cosmic rays interact with matter and produce high-energy neutrinos. We search for neutrin…

    Submitted 10 July, 2023; originally announced July 2023.

    Comments: Submitted on May 12th, 2022; Accepted on May 4th, 2023

    Journal ref: Science 380, 6652, 1338-1343 (2023)

  50. arXiv:2306.17567  [pdf, other]

    cs.CV

    Counting Guidance for High Fidelity Text-to-Image Synthesis

    Authors: Wonjun Kang, Kevin Galim, Hyung Il Koo

    Abstract: Recently, the quality and performance of text-to-image generation significantly advanced due to the impressive results of diffusion models. However, text-to-image diffusion models still fail to generate high fidelity content with respect to the input prompt. One problem where text-to-diffusion models struggle is generating the exact number of objects specified in the text prompt. E.g. given a prom…

    Submitted 30 June, 2023; originally announced June 2023.

    Comments: 9 pages, 5 figures