Showing 1–50 of 559 results for author: Peng, Y

Searching in archive cs.
  1. arXiv:2408.08407  [pdf, other]

    physics.optics cs.ET

    Photonic KAN: a Kolmogorov-Arnold network inspired efficient photonic neuromorphic architecture

    Authors: Yiwei Peng, Sean Hooten, Xinling Yu, Thomas Van Vaerenbergh, Yuan Yuan, Xian Xiao, Bassem Tossoun, Stanley Cheung, Marco Fiorentino, Raymond Beausoleil

    Abstract: Kolmogorov-Arnold Networks (KAN) models were recently proposed and claimed to provide improved parameter scaling and interpretability compared to conventional multilayer perceptron (MLP) models. Inspired by the KAN architecture, we propose the Photonic KAN -- an integrated all-optical neuromorphic platform leveraging highly parametric optical nonlinear transfer functions along KAN edges. In this w…

    Submitted 15 August, 2024; originally announced August 2024.

    Comments: 11 pages, 7 figures, 1 table

  2. arXiv:2408.06303  [pdf, other]

    cs.CL cs.CV

    Long-Form Answers to Visual Questions from Blind and Low Vision People

    Authors: Mina Huh, Fangyuan Xu, Yi-Hao Peng, Chongyan Chen, Hansika Murugu, Danna Gurari, Eunsol Choi, Amy Pavel

    Abstract: Vision language models can now generate long-form answers to questions about images - long-form visual question answers (LFVQA). We contribute VizWiz-LF, a dataset of long-form answers to visual questions posed by blind and low vision (BLV) users. VizWiz-LF contains 4.2k long-form answers to 600 visual questions, collected from human expert describers and six VQA models. We develop and annotate fu…

    Submitted 12 August, 2024; originally announced August 2024.

    Comments: COLM 2024

  3. arXiv:2408.05543  [pdf, other]

    cs.CV

    PixelFade: Privacy-preserving Person Re-identification with Noise-guided Progressive Replacement

    Authors: Delong Zhang, Yi-Xing Peng, Xiao-Ming Wu, Ancong Wu, Wei-Shi Zheng

    Abstract: Online person re-identification services face privacy breaches from potential data leakage and recovery attacks, exposing cloud-stored images to malicious attackers and triggering public concern. The privacy protection of pedestrian images is crucial. Previous privacy-preserving person re-identification methods are unable to resist recovery attacks and compromise accuracy. In this paper, we propos…

    Submitted 10 August, 2024; originally announced August 2024.

    Comments: accepted by ACMMM24

  4. arXiv:2408.03616  [pdf, other]

    eess.IV cs.CV

    Distillation Learning Guided by Image Reconstruction for One-Shot Medical Image Segmentation

    Authors: Feng Zhou, Yanjie Zhou, Longjie Wang, Yun Peng, David E. Carlson, Liyun Tu

    Abstract: Traditional one-shot medical image segmentation (MIS) methods use registration networks to propagate labels from a reference atlas or rely on comprehensive sampling strategies to generate synthetic labeled data for training. However, these methods often struggle with registration errors and low-quality synthetic images, leading to poor performance and generalization. To overcome this, we introduce…

    Submitted 7 August, 2024; originally announced August 2024.

  5. arXiv:2408.03505  [pdf, other]

    cs.CL cs.AI cs.DC

    Optimus: Accelerating Large-Scale Multi-Modal LLM Training by Bubble Exploitation

    Authors: Weiqi Feng, Yangrui Chen, Shaoyu Wang, Yanghua Peng, Haibin Lin, Minlan Yu

    Abstract: Multimodal large language models (MLLMs) have extended the success of large language models (LLMs) to multiple data types, such as image, text and audio, achieving significant performance in various domains, including multimodal translation, visual question answering and content generation. Nonetheless, existing systems are inefficient to train MLLMs due to substantial GPU bubbles caused by the he…

    Submitted 6 August, 2024; originally announced August 2024.

  6. arXiv:2408.02484  [pdf, other]

    cs.CV

    Exploring Conditional Multi-Modal Prompts for Zero-shot HOI Detection

    Authors: Ting Lei, Shaofeng Yin, Yuxin Peng, Yang Liu

    Abstract: Zero-shot Human-Object Interaction (HOI) detection has emerged as a frontier topic due to its capability to detect HOIs beyond a predefined set of categories. This task entails not only identifying the interactiveness of human-object pairs and localizing them but also recognizing both seen and unseen interaction categories. In this paper, we introduce a novel framework for zero-shot HOI detection…

    Submitted 5 August, 2024; originally announced August 2024.

  7. arXiv:2408.01430  [pdf, other]

    cs.CV cs.AI

    SUSTechGAN: Image Generation for Object Recognition in Adverse Conditions of Autonomous Driving

    Authors: Gongjin Lan, Yang Peng, Qi Hao, Chengzhong Xu

    Abstract: Autonomous driving significantly benefits from data-driven deep neural networks. However, the data in autonomous driving typically fits the long-tailed distribution, in which the critical driving data in adverse conditions is hard to collect. Although generative adversarial networks (GANs) have been applied to augment data for autonomous driving, generating driving images in adverse conditions is…

    Submitted 18 July, 2024; originally announced August 2024.

    Comments: 10 pages, 9 figures

  8. Deep progressive reinforcement learning-based flexible resource scheduling framework for IRS and UAV-assisted MEC system

    Authors: Li Dong, Feibo Jiang, Minjie Wang, Yubo Peng, Xiaolong Li

    Abstract: The intelligent reflection surface (IRS) and unmanned aerial vehicle (UAV)-assisted mobile edge computing (MEC) system is widely used in temporary and emergency scenarios. Our goal is to minimize the energy consumption of the MEC system by jointly optimizing UAV locations, IRS phase shift, task offloading, and resource allocation with a variable number of UAVs. To this end, we propose a Flexible R…

    Submitted 2 August, 2024; originally announced August 2024.

    Comments: 13 pages, 10 figures

    Journal ref: IEEE Transactions on Neural Networks and Learning Systems, 2024

  9. arXiv:2408.00588  [pdf, other]

    cs.CL cs.AI

    Closing the gap between open-source and commercial large language models for medical evidence summarization

    Authors: Gongbo Zhang, Qiao Jin, Yiliang Zhou, Song Wang, Betina R. Idnay, Yiming Luo, Elizabeth Park, Jordan G. Nestor, Matthew E. Spotnitz, Ali Soroush, Thomas Campion, Zhiyong Lu, Chunhua Weng, Yifan Peng

    Abstract: Large language models (LLMs) hold great promise in summarizing medical evidence. Most recent studies focus on the application of proprietary LLMs. Using proprietary LLMs introduces multiple risk factors, including a lack of transparency and vendor dependency. While open-source LLMs allow better transparency and customization, their performance falls short compared to proprietary ones. In this stud…

    Submitted 25 July, 2024; originally announced August 2024.

  10. arXiv:2407.21416  [pdf, other]

    cs.CV cs.RO

    VIPeR: Visual Incremental Place Recognition with Adaptive Mining and Lifelong Learning

    Authors: Yuhang Ming, Minyang Xu, Xingrui Yang, Weicai Ye, Weihan Wang, Yong Peng, Weichen Dai, Wanzeng Kong

    Abstract: Visual place recognition (VPR) is an essential component of many autonomous and augmented/virtual reality systems. It enables the systems to robustly localize themselves in large-scale environments. Existing VPR methods demonstrate attractive performance at the cost of heavy pre-training and limited generalizability. When deployed in unseen environments, these methods exhibit significant performan…

    Submitted 31 July, 2024; originally announced July 2024.

    Comments: 8 pages, 4 figures

  11. arXiv:2407.20143  [pdf, other]

    cs.AI

    ByteCheckpoint: A Unified Checkpointing System for LLM Development

    Authors: Borui Wan, Mingji Han, Yiyao Sheng, Zhichao Lai, Mofan Zhang, Junda Zhang, Yanghua Peng, Haibin Lin, Xin Liu, Chuan Wu

    Abstract: The development of real-world Large Language Models (LLMs) necessitates checkpointing of training states in persistent storage to mitigate potential software and hardware failures, as well as to facilitate checkpoint transferring within the training pipeline and across various tasks. Due to the immense size of LLMs, saving and loading checkpoints often incur intolerable minute-level stalls, signif…

    Submitted 29 July, 2024; originally announced July 2024.

  12. arXiv:2407.19728  [pdf, other]

    cs.HC cs.CY

    PersonalityScanner: Exploring the Validity of Personality Assessment Based on Multimodal Signals in Virtual Reality

    Authors: Xintong Zhang, Di Lu, Huiqi Hu, Nan Jiang, Xianhao Yu, Jinan Xu, Yujia Peng, Qing Li, Wenjuan Han

    Abstract: Human cognition significantly influences expressed behavior and is intrinsically tied to authentic personality traits. Personality assessment plays a pivotal role in various fields, including psychology, education, social media, etc. However, traditional self-report questionnaires can only provide data based on what individuals are willing and able to disclose, thereby lacking objectivity. Moreover,…

    Submitted 29 July, 2024; originally announced July 2024.

    Comments: Accepted to COGSCI 2024

  13. arXiv:2407.17126  [pdf]

    cs.CL cs.AI

    SDoH-GPT: Using Large Language Models to Extract Social Determinants of Health (SDoH)

    Authors: Bernardo Consoli, Xizhi Wu, Song Wang, Xinyu Zhao, Yanshan Wang, Justin Rousseau, Tom Hartvigsen, Li Shen, Huanmei Wu, Yifan Peng, Qi Long, Tianlong Chen, Ying Ding

    Abstract: Extracting social determinants of health (SDoH) from unstructured medical notes depends heavily on labor-intensive annotations, which are typically task-specific, hampering reusability and limiting sharing. In this study we introduced SDoH-GPT, a simple and effective few-shot Large Language Model (LLM) method leveraging contrastive examples and concise instructions to extract SDoH without relying…

    Submitted 24 July, 2024; originally announced July 2024.

  14. arXiv:2407.16639  [pdf, other]

    cs.SD eess.AS

    Distortion Recovery: A Two-Stage Method for Guitar Effect Removal

    Authors: Ying-Shuo Lee, Yueh-Po Peng, Jui-Te Wu, Ming Cheng, Li Su, Yi-Hsuan Yang

    Abstract: Removing audio effects from electric guitar recordings makes it easier for post-production and sound editing. An audio distortion recovery model not only improves the clarity of the guitar sounds but also opens up new opportunities for creative adjustments in mixing and mastering. While progress has been made in creating such models, previous efforts have largely focused on synthetic distortions…

    Submitted 23 July, 2024; originally announced July 2024.

    Comments: DAFx 2024

  15. arXiv:2407.12117  [pdf, other]

    cs.LG cs.DC

    Efficiently Training 7B LLM with 1 Million Sequence Length on 8 GPUs

    Authors: Pinxue Zhao, Hailin Zhang, Fangcheng Fu, Xiaonan Nie, Qibin Liu, Fang Yang, Yuanbo Peng, Dian Jiao, Shuaipeng Li, Jinbao Xue, Yangyu Tao, Bin Cui

    Abstract: Nowadays, Large Language Models (LLMs) have been trained using extended context lengths to foster more creative applications. However, long context training poses great challenges considering the constraint of GPU memory. It not only leads to substantial activation memory consumption during training, but also incurs considerable memory fragmentation. To facilitate long context training, existing f…

    Submitted 16 July, 2024; originally announced July 2024.

  16. arXiv:2407.09760  [pdf, other]

    cs.CV cs.AI

    ICCV23 Visual-Dialog Emotion Explanation Challenge: SEU_309 Team Technical Report

    Authors: Yixiao Yuan, Yingzhe Peng

    Abstract: The Visual-Dialog Based Emotion Explanation Generation Challenge focuses on generating emotion explanations through visual-dialog interactions in art discussions. Our approach combines state-of-the-art multi-modal models, including Language Model (LM) and Large Vision Language Model (LVLM), to achieve superior performance. By leveraging these models, we outperform existing benchmarks, securing the…

    Submitted 12 July, 2024; originally announced July 2024.

  17. arXiv:2407.09059  [pdf, other]

    cs.CV

    Domain-adaptive Video Deblurring via Test-time Blurring

    Authors: Jin-Ting He, Fu-Jen Tsai, Jia-Hao Wu, Yan-Tsung Peng, Chung-Chi Tsai, Chia-Wen Lin, Yen-Yu Lin

    Abstract: Dynamic scene video deblurring aims to remove undesirable blurry artifacts captured during the exposure process. Although previous video deblurring methods have achieved impressive results, they suffer from significant performance drops due to the domain gap between training and testing videos, especially for those captured in real-world scenarios. To address this issue, we propose a domain adapta…

    Submitted 12 July, 2024; originally announced July 2024.

    Comments: ECCV 2024

  18. arXiv:2407.07468  [pdf, other]

    cs.CV

    Rethinking Few-shot Class-incremental Learning: Learning from Yourself

    Authors: Yu-Ming Tang, Yi-Xing Peng, Jingke Meng, Wei-Shi Zheng

    Abstract: Few-shot class-incremental learning (FSCIL) aims to learn sequential classes with limited samples in a few-shot fashion. Inherited from the classical class-incremental learning setting, the popular benchmark of FSCIL uses averaged accuracy (aAcc) and last-task averaged accuracy (lAcc) as the evaluation metrics. However, we reveal that such evaluation metrics may not provide adequate emphasis on th…

    Submitted 10 July, 2024; originally announced July 2024.

    Comments: Accepted to ECCV 2024

  19. arXiv:2407.06590  [pdf, other]

    cs.RO cs.AI

    Revolutionizing Battery Disassembly: The Design and Implementation of a Battery Disassembly Autonomous Mobile Manipulator Robot(BEAM-1)

    Authors: Yanlong Peng, Zhigang Wang, Yisheng Zhang, Shengmin Zhang, Nan Cai, Fan Wu, Ming Chen

    Abstract: The efficient disassembly of end-of-life electric vehicle batteries (EOL-EVBs) is crucial for green manufacturing and sustainable development. The current pre-programmed disassembly conducted by the Autonomous Mobile Manipulator Robot (AMMR) struggles to meet the disassembly requirements in dynamic environments, complex scenarios, and unstructured processes. In this paper, we propose a Battery Disas…

    Submitted 9 July, 2024; originally announced July 2024.

  20. arXiv:2407.03718  [pdf, other]

    cs.CL cs.AI cs.LG cs.SD eess.AS

    Multi-Convformer: Extending Conformer with Multiple Convolution Kernels

    Authors: Darshan Prabhu, Yifan Peng, Preethi Jyothi, Shinji Watanabe

    Abstract: Convolutions have become essential in state-of-the-art end-to-end Automatic Speech Recognition (ASR) systems due to their efficient modelling of local context. Notably, their use in Conformers has led to superior performance compared to vanilla Transformer-based ASR systems. While components other than the convolution module in the Conformer have been reexamined, altering the convolution module itse…

    Submitted 23 July, 2024; v1 submitted 4 July, 2024; originally announced July 2024.

    Comments: Accepted to INTERSPEECH 2024

  21. arXiv:2407.02327  [pdf, other]

    cs.LG cs.DC

    QSync: Quantization-Minimized Synchronous Distributed Training Across Hybrid Devices

    Authors: Juntao Zhao, Borui Wan, Yanghua Peng, Haibin Lin, Yibo Zhu, Chuan Wu

    Abstract: A number of production deep learning clusters have attempted to explore inference hardware for DNN training, at the off-peak serving hours with many inference GPUs idling. Conducting DNN training with a combination of heterogeneous training and inference GPUs, known as hybrid device training, presents considerable challenges due to disparities in compute capability and significant differences in m…

    Submitted 2 July, 2024; originally announced July 2024.

    Comments: IPDPS 24

  22. arXiv:2407.00837  [pdf, other]

    cs.CL cs.AI cs.SD eess.AS

    Towards Robust Speech Representation Learning for Thousands of Languages

    Authors: William Chen, Wangyou Zhang, Yifan Peng, Xinjian Li, Jinchuan Tian, Jiatong Shi, Xuankai Chang, Soumi Maiti, Karen Livescu, Shinji Watanabe

    Abstract: Self-supervised learning (SSL) has helped extend speech technologies to more languages by reducing the need for labeled data. However, models are still far from supporting the world's 7000+ languages. We propose XEUS, a Cross-lingual Encoder for Universal Speech, trained on over 1 million hours of data across 4057 languages, extending the language coverage of SSL models 4-fold. We combine 1 millio…

    Submitted 2 July, 2024; v1 submitted 30 June, 2024; originally announced July 2024.

    Comments: Updated affiliations; 20 pages

  23. arXiv:2406.19693  [pdf, other]

    cs.RO cs.CV

    MMRo: Are Multimodal LLMs Eligible as the Brain for In-Home Robotics?

    Authors: Jinming Li, Yichen Zhu, Zhiyuan Xu, Jindong Gu, Minjie Zhu, Xin Liu, Ning Liu, Yaxin Peng, Feifei Feng, Jian Tang

    Abstract: It is fundamentally challenging for robots to serve as useful assistants in human environments because this requires addressing a spectrum of sub-problems across robotics, including perception, language understanding, reasoning, and planning. The recent advancements in Multimodal Large Language Models (MLLMs) have demonstrated their exceptional abilities in solving complex mathematical problems, m…

    Submitted 28 June, 2024; originally announced June 2024.

  24. arXiv:2406.16942  [pdf, other]

    eess.IV cs.AI cs.CV

    Enhancing Diagnostic Reliability of Foundation Model with Uncertainty Estimation in OCT Images

    Authors: Yuanyuan Peng, Aidi Lin, Meng Wang, Tian Lin, Ke Zou, Yinglin Cheng, Tingkun Shi, Xulong Liao, Lixia Feng, Zhen Liang, Xinjian Chen, Huazhu Fu, Haoyu Chen

    Abstract: Inability to express the confidence level and detect unseen classes has limited the clinical implementation of artificial intelligence in the real world. We developed a foundation model with uncertainty estimation (FMUE) to detect 11 retinal conditions on optical coherence tomography (OCT). In the internal test set, FMUE achieved a higher F1 score of 96.76% than two state-of-the-art algorithms, RE…

    Submitted 17 June, 2024; originally announced June 2024.

    Comments: All codes are available at https://github.com/yuanyuanpeng0129/FMUE

  25. arXiv:2406.16855  [pdf, other]

    cs.CV

    DreamBench++: A Human-Aligned Benchmark for Personalized Image Generation

    Authors: Yuang Peng, Yuxin Cui, Haomiao Tang, Zekun Qi, Runpei Dong, Jing Bai, Chunrui Han, Zheng Ge, Xiangyu Zhang, Shu-Tao Xia

    Abstract: Personalized image generation holds great promise in assisting humans in everyday work and life due to its impressive function in creatively generating personalized content. However, current evaluations either are automated but misaligned with humans or require human evaluations that are time-consuming and expensive. In this work, we present DreamBench++, a human-aligned benchmark automated by advan…

    Submitted 24 June, 2024; originally announced June 2024.

    Comments: Project page: https://dreambenchplus.github.io/

  26. arXiv:2406.16120  [pdf, other]

    eess.AS cs.CL cs.SD

    Contextualized End-to-end Automatic Speech Recognition with Intermediate Biasing Loss

    Authors: Muhammad Shakeel, Yui Sudo, Yifan Peng, Shinji Watanabe

    Abstract: Contextualized end-to-end automatic speech recognition has been an active research area, with recent efforts focusing on the implicit learning of contextual phrases based on the final loss objective. However, these approaches ignore the useful contextual knowledge encoded in the intermediate layers. We hypothesize that employing explicit biasing loss as an auxiliary task in the encoder intermediat…

    Submitted 23 June, 2024; originally announced June 2024.

    Comments: Accepted to INTERSPEECH 2024

  27. arXiv:2406.15848  [pdf, other]

    cs.CV

    Quality-guided Skin Tone Enhancement for Portrait Photography

    Authors: Shiqi Gao, Huiyu Duan, Xinyue Li, Kang Fu, Yicong Peng, Qihang Xu, Yuanyuan Chang, Jia Wang, Xiongkuo Min, Guangtao Zhai

    Abstract: In recent years, learning-based color and tone enhancement methods for photos have become increasingly popular. However, most learning-based image enhancement methods just learn a mapping from one distribution to another based on one dataset, lacking the ability to adjust images continuously and controllably. It is important to enable the learning-based enhancement models to adjust an image contin…

    Submitted 22 June, 2024; originally announced June 2024.

  28. arXiv:2406.13185  [pdf, other]

    cs.CL

    Learnable In-Context Vector for Visual Question Answering

    Authors: Yingzhe Peng, Chenduo Hao, Xu Yang, Jiawei Peng, Xinting Hu, Xin Geng

    Abstract: As language models continue to scale, Large Language Models (LLMs) have exhibited emerging capabilities in In-Context Learning (ICL), enabling them to solve language tasks by prefixing a few in-context demonstrations (ICDs) as context. Inspired by these advancements, researchers have extended these techniques to develop Large Multimodal Models (LMMs) with ICL capabilities. However, applying ICL us…

    Submitted 18 June, 2024; originally announced June 2024.

  29. arXiv:2406.12726  [pdf, other]

    cs.SD cs.AI eess.AS

    ED-sKWS: Early-Decision Spiking Neural Networks for Rapid,and Energy-Efficient Keyword Spotting

    Authors: Zeyang Song, Qianhui Liu, Qu Yang, Yizhou Peng, Haizhou Li

    Abstract: Keyword Spotting (KWS) is essential in edge computing requiring rapid and energy-efficient responses. Spiking Neural Networks (SNNs) are well-suited for KWS for their efficiency and temporal capacity for speech. To further reduce the latency and energy consumption, this study introduces ED-sKWS, an SNN-based KWS model with an early-decision mechanism that can stop speech processing and output the…

    Submitted 13 June, 2024; originally announced June 2024.

    Comments: Accepted by INTERSPEECH2024

  30. arXiv:2406.11026  [pdf, other]

    cs.CV cs.AI

    Boosting Medical Image Classification with Segmentation Foundation Model

    Authors: Pengfei Gu, Zihan Zhao, Hongxiao Wang, Yaopeng Peng, Yizhe Zhang, Nishchal Sapkota, Chaoli Wang, Danny Z. Chen

    Abstract: The Segment Anything Model (SAM) exhibits impressive capabilities in zero-shot segmentation for natural images. Recently, SAM has gained a great deal of attention for its applications in medical image segmentation. However, to our best knowledge, no studies have shown how to harness the power of SAM for medical image classification. To fill this gap and make SAM a true "foundation model" for med…

    Submitted 16 June, 2024; originally announced June 2024.

  31. arXiv:2406.10303  [pdf, other]

    cs.CL cs.AI

    A Survey on Large Language Models from General Purpose to Medical Applications: Datasets, Methodologies, and Evaluations

    Authors: Jinqiang Wang, Huansheng Ning, Yi Peng, Qikai Wei, Daniel Tesfai, Wenwei Mao, Tao Zhu, Runhe Huang

    Abstract: Large Language Models (LLMs) have demonstrated surprising performance across various natural language processing tasks. Recently, medical LLMs enhanced with domain-specific knowledge have exhibited excellent capabilities in medical consultation and diagnosis. These models can smoothly simulate doctor-patient dialogues and provide professional medical advice. Most medical LLMs are developed through…

    Submitted 13 June, 2024; originally announced June 2024.

    Comments: 20 pages, 3 figures

  32. arXiv:2406.09317  [pdf, other]

    eess.IV cs.CV

    Common and Rare Fundus Diseases Identification Using Vision-Language Foundation Model with Knowledge of Over 400 Diseases

    Authors: Meng Wang, Tian Lin, Aidi Lin, Kai Yu, Yuanyuan Peng, Lianyu Wang, Cheng Chen, Ke Zou, Huiyu Liang, Man Chen, Xue Yao, Meiqin Zhang, Binwei Huang, Chaoxin Zheng, Peixin Zhang, Wei Chen, Yilong Luo, Yifan Chen, Honghe Xia, Tingkun Shi, Qi Zhang, Jinming Guo, Xiaolin Chen, Jingcheng Wang, Yih Chung Tham, et al. (24 additional authors not shown)

    Abstract: Previous foundation models for retinal images were pre-trained with limited disease categories and knowledge base. Here we introduce RetiZero, a vision-language foundation model that leverages knowledge from over 400 fundus diseases. For RetiZero's pre-training, we compiled 341,896 fundus images paired with text descriptions, sourced from public datasets, ophthalmic literature, and online resources…

    Submitted 30 June, 2024; v1 submitted 13 June, 2024; originally announced June 2024.

  33. Less Cybersickness, Please: Demystifying and Detecting Stereoscopic Visual Inconsistencies in VR Apps

    Authors: Shuqing Li, Cuiyun Gao, Jianping Zhang, Yujia Zhang, Yepang Liu, Jiazhen Gu, Yun Peng, Michael R. Lyu

    Abstract: The quality of Virtual Reality (VR) apps is vital, particularly the rendering quality of the VR Graphical User Interface (GUI). Different from traditional 2D apps, VR apps create a 3D digital scene for users, by rendering two distinct 2D images for the user's left and right eyes, respectively. Stereoscopic visual inconsistency (denoted as "SVI") issues, however, undermine the rendering process of…

    Submitted 13 June, 2024; originally announced June 2024.

    Comments: This work has been accepted at the ACM International Conference on the Foundations of Software Engineering (FSE) 2024, Porto de Galinhas, Brazil. DOI: https://doi.org/10.1145/3660803

  34. arXiv:2406.09282  [pdf, other]

    cs.CL cs.SD eess.AS

    On the Effects of Heterogeneous Data Sources on Speech-to-Text Foundation Models

    Authors: Jinchuan Tian, Yifan Peng, William Chen, Kwanghee Choi, Karen Livescu, Shinji Watanabe

    Abstract: The Open Whisper-style Speech Model (OWSM) series was introduced to achieve full transparency in building advanced speech-to-text (S2T) foundation models. To this end, OWSM models are trained on 25 public speech datasets, which are heterogeneous in multiple ways. In this study, we advance the OWSM series by introducing OWSM v3.2, which improves on prior models by investigating and addressing the i…

    Submitted 13 June, 2024; originally announced June 2024.

  35. arXiv:2406.09264  [pdf, other]

    cs.HC cs.AI cs.CL

    Towards Bidirectional Human-AI Alignment: A Systematic Review for Clarifications, Framework, and Future Directions

    Authors: Hua Shen, Tiffany Knearem, Reshmi Ghosh, Kenan Alkiek, Kundan Krishna, Yachuan Liu, Ziqiao Ma, Savvas Petridis, Yi-Hao Peng, Li Qiwei, Sushrita Rakshit, Chenglei Si, Yutong Xie, Jeffrey P. Bigham, Frank Bentley, Joyce Chai, Zachary Lipton, Qiaozhu Mei, Rada Mihalcea, Michael Terry, Diyi Yang, Meredith Ringel Morris, Paul Resnick, David Jurgens

    Abstract: Recent advancements in general-purpose AI have highlighted the importance of guiding AI systems towards the intended goals, ethical principles, and values of individuals and groups, a concept broadly recognized as alignment. However, the lack of clarified definitions and scopes of human-AI alignment poses a significant obstacle, hampering collaborative efforts across research domains to achieve th…

    Submitted 10 August, 2024; v1 submitted 13 June, 2024; originally announced June 2024.

    Comments: proposing "bidirectional human-AI alignment" framework after a systematic review of over 400 alignment papers

  36. arXiv:2406.09201  [pdf, other]

    cs.CV

    Enhanced Object Detection: A Study on Vast Vocabulary Object Detection Track for V3Det Challenge 2024

    Authors: Peixi Wu, Bosong Chai, Xuan Nie, Longquan Yan, Zeyu Wang, Qifan Zhou, Boning Wang, Yansong Peng, Hebei Li

    Abstract: In this technical report, we present our findings from the research conducted on the Vast Vocabulary Visual Detection (V3Det) dataset for the Supervised Vast Vocabulary Visual Detection task. How to deal with complex categories and detection boxes has become a difficulty in this track. The original supervised detector is not suitable for this task. We have designed a series of improvements, including…

    Submitted 21 June, 2024; v1 submitted 13 June, 2024; originally announced June 2024.

    Journal ref: Second Place in CVPR 2024 Vast Vocabulary Visual Detection Challenge

  37. arXiv:2406.08135  [pdf]

    cs.RO

    Design, modeling, and characteristics of ringshaped robot actuated by functional fluid

    Authors: Zebing Mao, Xuehang Bai, Yanhong Peng, Yayi Shen

    Abstract: The controlled actuation of hydraulic and pneumatic actuators has unveiled fresh and thrilling opportunities for designing mobile robots with adaptable structures. Previously reported rolling robots, which were powered by fluidic systems, often relied on complex principles, cumbersome pump and valve systems, and intricate control strategies, limiting their applicability in other fields. In this in…

    Submitted 12 June, 2024; originally announced June 2024.

  38. arXiv:2406.05452  [pdf, other]

    eess.SP cs.IT

    Near-Field Channel Estimation for Extremely Large-Scale Terahertz Communications

    Authors: Songjie Yang, Yizhou Peng, Wanting Lyu, Ya Li, Hongjun He, Zhongpei Zhang, Chau Yuen

    Abstract: Future Terahertz communications exhibit significant potential in accommodating ultra-high-rate services. Employing extremely large-scale array antennas is a key approach to realize this potential, as they can harness substantial beamforming gains to overcome the severe path loss and leverage the electromagnetic advantages in the near field. This paper proposes novel estimation methods designed to…

    Submitted 8 June, 2024; originally announced June 2024.

  39. arXiv:2406.02950  [pdf, other]

    eess.AS cs.CL cs.SD

    4D ASR: Joint Beam Search Integrating CTC, Attention, Transducer, and Mask Predict Decoders

    Authors: Yui Sudo, Muhammad Shakeel, Yosuke Fukumoto, Brian Yan, Jiatong Shi, Yifan Peng, Shinji Watanabe

    Abstract: End-to-end automatic speech recognition (E2E-ASR) can be classified into several network architectures, such as connectionist temporal classification (CTC), recurrent neural network transducer (RNN-T), attention-based encoder-decoder, and mask-predict models. Each network architecture has advantages and disadvantages, leading practitioners to switch between these different models depending on appl…

    Submitted 5 June, 2024; originally announced June 2024.

    Comments: submitted to IEEE/ACM Transactions on Audio Speech and Language Processing

  40. arXiv:2406.00625  [pdf, other]

    cs.CV

    SAM-LAD: Segment Anything Model Meets Zero-Shot Logic Anomaly Detection

    Authors: Yun Peng, Xiao Lin, Nachuan Ma, Jiayuan Du, Chuangwei Liu, Chengju Liu, Qijun Chen

    Abstract: Visual anomaly detection is vital in real-world applications, such as industrial defect detection and medical diagnosis. However, most existing methods focus on local structural anomalies and fail to detect higher-level functional anomalies under logical conditions. Although recent studies have explored logical anomaly detection, they can only address simple anomalies like missing or addition and…

    Submitted 5 June, 2024; v1 submitted 2 June, 2024; originally announced June 2024.

  41. arXiv:2405.16978  [pdf, other]

    cs.LG cs.CR

    OSLO: One-Shot Label-Only Membership Inference Attacks

    Authors: Yuefeng Peng, Jaechul Roh, Subhransu Maji, Amir Houmansadr

    Abstract: We introduce One-Shot Label-Only (OSLO) membership inference attacks (MIAs), which accurately infer a given sample's membership in a target model's training set with high precision using just a single query, where the target model only returns the predicted hard label. This is in contrast to state-of-the-art label-only attacks which require ~6000 queries, yet get attack precisions lowe…

    Submitted 27 May, 2024; originally announced May 2024.

  42. arXiv:2405.16368  [pdf, other]

    cs.LG

    Qsco: A Quantum Scoring Module for Open-set Supervised Anomaly Detection

    Authors: Yifeng Peng, Xinyi Li, Zhiding Liang, Ying Wang

    Abstract: Open set anomaly detection (OSAD) is a crucial task that aims to identify abnormal patterns or behaviors in data sets, especially when the anomalies observed during training do not represent all possible classes of anomalies. The recent advances in quantum computing in handling complex data structures and improving machine learning models herald a paradigm shift in anomaly detection methodologies.…

    Submitted 25 May, 2024; originally announced May 2024.

  43. "This really lets us see the entire world:" Designing a conversational telepresence robot for homebound older adults

    Authors: Yaxin Hu, Laura Stegner, Yasmine Kotturi, Caroline Zhang, Yi-Hao Peng, Faria Huq, Yuhang Zhao, Jeffrey P. Bigham, Bilge Mutlu

    Abstract: In this paper, we explore the design and use of conversational telepresence robots to help homebound older adults interact with the external world. An initial needfinding study (N=8) using video vignettes revealed older adults' experiential needs for robot-mediated remote experiences such as exploration, reminiscence and social participation. We then designed a prototype system to support these go…

    Submitted 23 May, 2024; originally announced May 2024.

    Comments: In proceedings of ACM Designing Interactive Systems (DIS) 2024

    MSC Class: 68-06

  44. arXiv:2405.14334  [pdf, other]

    cs.CV

    Hierarchical Salient Patch Identification for Interpretable Fundus Disease Localization

    Authors: Yitao Peng, Lianghua He, Die Hu

    Abstract: With the widespread application of deep learning technology in medical image analysis, how to effectively explain model decisions and improve diagnosis accuracy has become an urgent problem that needs to be solved. Attribution methods have become a key tool to help doctors better understand the diagnostic basis of models, and they are used to explain and localize diseases in medical images. Howeve…

    Submitted 23 May, 2024; originally announced May 2024.

  45. arXiv:2405.13514  [pdf, other]

    eess.AS cs.CL cs.SD

    Joint Optimization of Streaming and Non-Streaming Automatic Speech Recognition with Multi-Decoder and Knowledge Distillation

    Authors: Muhammad Shakeel, Yui Sudo, Yifan Peng, Shinji Watanabe

    Abstract: End-to-end (E2E) automatic speech recognition (ASR) can operate in two modes: streaming and non-streaming, each with its pros and cons. Streaming ASR processes the speech frames in real-time as it is being received, while non-streaming ASR waits for the entire speech utterance; thus, professionals may have to operate in either mode to satisfy their application. In this work, we present joint optim…

    Submitted 22 May, 2024; originally announced May 2024.

    Comments: Accepted to IEEE ICASSP 2024 workshop Hands-free Speech Communication and Microphone Arrays (HSCMA 2024)

  46. arXiv:2405.13344  [pdf, other]

    eess.AS cs.CL cs.SD

    Contextualized Automatic Speech Recognition with Dynamic Vocabulary

    Authors: Yui Sudo, Yosuke Fukumoto, Muhammad Shakeel, Yifan Peng, Shinji Watanabe

    Abstract: Deep biasing (DB) improves the performance of end-to-end automatic speech recognition (E2E-ASR) for rare words or contextual phrases using a bias list. However, most existing methods treat bias phrases as sequences of subwords in a predefined static vocabulary, which can result in ineffective learning of the dependencies between subwords. More advanced techniques address this problem by incorporat…

    Submitted 22 May, 2024; originally announced May 2024.

  47. arXiv:2405.11841  [pdf, other]

    cs.AI

    Evaluating and Modeling Social Intelligence: A Comparative Study of Human and AI Capabilities

    Authors: Junqi Wang, Chunhui Zhang, Jiapeng Li, Yuxi Ma, Lixing Niu, Jiaheng Han, Yujia Peng, Yixin Zhu, Lifeng Fan

    Abstract: Facing the current debate on whether Large Language Models (LLMs) attain near-human intelligence levels (Mitchell & Krakauer, 2023; Bubeck et al., 2023; Kosinski, 2023; Shiffrin & Mitchell, 2023; Ullman, 2023), the current study introduces a benchmark for evaluating social intelligence, one of the most distinctive aspects of human cognition. We developed a comprehensive theoretical framework for s…

    Submitted 20 May, 2024; originally announced May 2024.

    Comments: Also published in Proceedings of the Annual Meeting of the Cognitive Science Society (CogSci), 2024

  48. arXiv:2405.10591  [pdf, other]

    cs.CV

    GEOcc: Geometrically Enhanced 3D Occupancy Network with Implicit-Explicit Depth Fusion and Contextual Self-Supervision

    Authors: Xin Tan, Wenbin Wu, Zhiwei Zhang, Chaojie Fan, Yong Peng, Zhizhong Zhang, Yuan Xie, Lizhuang Ma

    Abstract: 3D occupancy perception holds a pivotal role in recent vision-centric autonomous driving systems by converting surround-view images into integrated geometric and semantic representations within dense 3D grids. Nevertheless, current models still encounter two main challenges: modeling depth accurately in the 2D-3D view transformation stage, and overcoming the lack of generalizability issues due to…

    Submitted 17 May, 2024; originally announced May 2024.

  49. arXiv:2405.10561  [pdf, other]

    eess.IV cs.CV

    Infrared Image Super-Resolution via Lightweight Information Split Network

    Authors: Shijie Liu, Kang Yan, Feiwei Qin, Changmiao Wang, Ruiquan Ge, Kai Zhang, Jie Huang, Yong Peng, Jin Cao

    Abstract: Single image super-resolution (SR) is an established pixel-level vision task aimed at reconstructing a high-resolution image from its degraded low-resolution counterpart. Despite the notable advancements achieved by leveraging deep neural networks for SR, most existing deep learning architectures feature an extensive number of layers, leading to high computational complexity and substantial memory…

    Submitted 27 May, 2024; v1 submitted 17 May, 2024; originally announced May 2024.

  50. arXiv:2405.09970  [pdf, ps, other]

    math.LO cs.LO

    On the Cut Elimination of Weak Intuitionistic Tense Logic

    Authors: Yiheng Wang, Yu Peng, Zhe Lin

    Abstract: In this paper, we use a new method to prove cut-elimination of weak intuitionistic tense logic. This method focuses on splitting the contraction rule and cut rules. Further general theories and applications of this method shall be developed in the future.

    Submitted 27 May, 2024; v1 submitted 16 May, 2024; originally announced May 2024.