Showing 1–50 of 1,434 results for author: Wu, H

Searching in archive cs.
  1. arXiv:2408.11001  [pdf, other]

    cs.CV

    MegaFusion: Extend Diffusion Models towards Higher-resolution Image Generation without Further Tuning

    Authors: Haoning Wu, Shaocheng Shen, Qiang Hu, Xiaoyun Zhang, Ya Zhang, Yanfeng Wang

    Abstract: Diffusion models have emerged as frontrunners in text-to-image generation for their impressive capabilities. Nonetheless, their fixed image resolution during training often leads to challenges in high-resolution image generation, such as semantic inaccuracies and object replication. This paper introduces MegaFusion, a novel approach that extends existing diffusion-based text-to-image generation mo…

    Submitted 20 August, 2024; originally announced August 2024.

    Comments: Technical Report. Project Page: https://haoningwu3639.github.io/MegaFusion/

  2. arXiv:2408.10198  [pdf, other]

    cs.CV cs.GR

    MeshFormer: High-Quality Mesh Generation with 3D-Guided Reconstruction Model

    Authors: Minghua Liu, Chong Zeng, Xinyue Wei, Ruoxi Shi, Linghao Chen, Chao Xu, Mengqi Zhang, Zhaoning Wang, Xiaoshuai Zhang, Isabella Liu, Hongzhi Wu, Hao Su

    Abstract: Open-world 3D reconstruction models have recently garnered significant attention. However, without sufficient 3D inductive bias, existing methods typically entail expensive training costs and struggle to extract high-quality 3D meshes. In this work, we introduce MeshFormer, a sparse-view reconstruction model that explicitly leverages 3D native structure, input guidance, and training supervision. S…

    Submitted 19 August, 2024; originally announced August 2024.

    Comments: 20 pages, 9 figures

  3. arXiv:2408.09853  [pdf, other]

    cs.CL cs.AI

    Self-Directed Turing Test for Large Language Models

    Authors: Weiqi Wu, Hongqiu Wu, Hai Zhao

    Abstract: The Turing test examines whether AIs can exhibit human-like behaviour in natural language conversations. Traditional Turing tests adopt a rigid dialogue format where each participant sends only one message each time and require continuous human involvement to direct the entire interaction with the test subject. This fails to reflect a natural conversational style and hinders the evaluation of Larg…

    Submitted 19 August, 2024; originally announced August 2024.

  4. arXiv:2408.09655  [pdf, other]

    cs.LG stat.ML

    Contextual Bandits for Unbounded Context Distributions

    Authors: Puning Zhao, Jiafei Wu, Zhe Liu, Huiwen Wu

    Abstract: Nonparametric contextual bandit is an important model of sequential decision making problems. Under $α$-Tsybakov margin condition, existing research has established a regret bound of $\tilde{O}\left(T^{1-\frac{α+1}{d+2}}\right)$ for bounded supports. However, the optimal regret with unbounded contexts has not been analyzed. The challenge of solving contextual bandit problems with unbounded support…

    Submitted 18 August, 2024; originally announced August 2024.

  5. arXiv:2408.09439  [pdf, other]

    cs.IR cs.AI

    Towards Boosting LLMs-driven Relevance Modeling with Progressive Retrieved Behavior-augmented Prompting

    Authors: Zeyuan Chen, Haiyan Wu, Kaixin Wu, Wei Chen, Mingjie Zhong, Jia Xu, Zhongyi Liu, Wei Zhang

    Abstract: Relevance modeling is a critical component for enhancing user experience in search engines, with the primary objective of identifying items that align with users' queries. Traditional models only rely on the semantic congruence between queries and items to ascertain relevance. However, this approach represents merely one aspect of the relevance judgement, and is insufficient in isolation. Even pow…

    Submitted 18 August, 2024; originally announced August 2024.

  6. arXiv:2408.09386  [pdf, other]

    cs.AI cs.CL cs.HC

    Game Development as Human-LLM Interaction

    Authors: Jiale Hong, Hongqiu Wu, Hai Zhao

    Abstract: Game development is a highly specialized task that relies on a complex game engine powered by complex programming languages, preventing many gaming enthusiasts from handling it. This paper introduces the Interaction-driven Game Engine (IGE) powered by LLM, which allows everyone to develop a custom game using natural language through Human-LLM interaction. To enable an LLM to function as an IGE, we…

    Submitted 18 August, 2024; originally announced August 2024.

  7. arXiv:2408.08981  [pdf, other]

    cs.IR cs.CL

    From Lazy to Prolific: Tackling Missing Labels in Open Vocabulary Extreme Classification by Positive-Unlabeled Sequence Learning

    Authors: Haoran Ranran Zhang, Bensu Uçar, Soumik Dey, Hansi Wu, Binbin Li, Rui Zhang

    Abstract: Open-vocabulary Extreme Multi-label Classification (OXMC) extends traditional XMC by allowing prediction beyond an extremely large, predefined label set (typically $10^3$ to $10^{12}$ labels), addressing the dynamic nature of real-world labeling tasks. However, self-selection bias in data annotation leads to significant missing labels in both training and test data, particularly for less popular i…

    Submitted 16 August, 2024; originally announced August 2024.

  8. arXiv:2408.08495  [pdf, other]

    cs.CV

    Achieving Complex Image Edits via Function Aggregation with Diffusion Models

    Authors: Mohammadreza Samadi, Fred X. Han, Mohammad Salameh, Hao Wu, Fengyu Sun, Chunhua Zhou, Di Niu

    Abstract: Diffusion models have demonstrated strong performance in generative tasks, making them ideal candidates for image editing. Recent studies highlight their ability to apply desired edits effectively by following textual instructions, yet two key challenges persist. First, these models struggle to apply multiple edits simultaneously, resulting in computational inefficiencies due to their reliance on…

    Submitted 15 August, 2024; originally announced August 2024.

  9. arXiv:2408.08092  [pdf, other]

    cs.CV cs.AI

    OC3D: Weakly Supervised Outdoor 3D Object Detection with Only Coarse Click Annotation

    Authors: Qiming Xia, Hongwei Lin, Wei Ye, Hai Wu, Yadan Luo, Shijia Zhao, Xin Li, Chenglu Wen

    Abstract: LiDAR-based outdoor 3D object detection has received widespread attention. However, training 3D detectors from the LiDAR point cloud typically relies on expensive bounding box annotations. This paper presents OC3D, an innovative weakly supervised method requiring only coarse clicks on the bird's eye view of the 3D point cloud. A key challenge here is the absence of complete geometric descriptions…

    Submitted 15 August, 2024; v1 submitted 15 August, 2024; originally announced August 2024.

  10. arXiv:2408.05842  [pdf, other]

    cs.AI cs.HC

    Scaling Virtual World with Delta-Engine

    Authors: Hongqiu Wu, Zekai Xu, Tianyang Xu, Jiale Hong, Weiqi Wu, Hai Zhao, Min Zhang, Zhezhi He

    Abstract: In this paper, we focus on the virtual world, a cyberspace where people can live. An ideal virtual world shares great similarity with our real world. One of the crucial aspects is its evolving nature, reflected by the individuals' capacity to grow and thereby influence the objective world. Such dynamics are unpredictable and beyond the reach of existing systems. For this, we propose a special…

    Submitted 11 August, 2024; originally announced August 2024.

  11. arXiv:2408.05834  [pdf, other]

    stat.ML cs.AI cs.LG q-bio.NC

    Divide-and-Conquer Predictive Coding: a structured Bayesian inference algorithm

    Authors: Eli Sennesh, Hao Wu, Tommaso Salvatori

    Abstract: Unexpected stimuli induce "error" or "surprise" signals in the brain. The theory of predictive coding promises to explain these observations in terms of Bayesian inference by suggesting that the cortex implements variational inference in a probabilistic graphical model. However, when applied to machine learning tasks, this family of algorithms has yet to perform on par with other variational appro…

    Submitted 11 August, 2024; originally announced August 2024.

    Comments: 22 pages, 5 figures, submitted to Neural Information Processing Systems (NeurIPS) 2024

  12. arXiv:2408.04838  [pdf]

    cs.LG cs.IR

    Dual-Channel Latent Factor Analysis Enhanced Graph Contrastive Learning for Recommendation

    Authors: Junfeng Long, Hao Wu

    Abstract: Graph Neural Networks (GNNs) are powerful learning methods for recommender systems owing to their robustness in handling complicated user-item interactions. Recently, the integration of contrastive learning with GNNs has demonstrated remarkable performance in recommender systems to handle the issue of highly sparse user-item interaction data. Yet, some available graph contrastive learning (GCL) te…

    Submitted 8 August, 2024; originally announced August 2024.

  13. arXiv:2408.04600  [pdf, other]

    cs.CV

    Improving Network Interpretability via Explanation Consistency Evaluation

    Authors: Hefeng Wu, Hao Jiang, Keze Wang, Ziyi Tang, Xianghuan He, Liang Lin

    Abstract: While deep neural networks have achieved remarkable performance, they tend to lack transparency in prediction. The pursuit of greater interpretability in neural networks often results in a degradation of their original performance. Some works strive to improve both interpretability and performance, but they primarily depend on meticulously imposed conditions. In this paper, we propose a simple yet…

    Submitted 8 August, 2024; originally announced August 2024.

    Comments: To appear in IEEE Transactions on Multimedia

  14. arXiv:2408.03692  [pdf, other]

    cs.MA

    Asynchronous Credit Assignment Framework for Multi-Agent Reinforcement Learning

    Authors: Yongheng Liang, Hejun Wu, Haitao Wang, Hao Cai

    Abstract: Credit assignment is a core problem that distinguishes agents' marginal contributions for optimizing cooperative strategies in multi-agent reinforcement learning (MARL). Current credit assignment methods usually assume synchronous decision-making among agents. However, a prerequisite for many realistic cooperative tasks is asynchronous decision-making by agents, without waiting for others to avoid…

    Submitted 7 August, 2024; originally announced August 2024.

  15. arXiv:2408.03677  [pdf, other]

    cs.CV

    L4DR: LiDAR-4DRadar Fusion for Weather-Robust 3D Object Detection

    Authors: Xun Huang, Ziyu Xu, Hai Wu, Jinlong Wang, Qiming Xia, Yan Xia, Jonathan Li, Kyle Gao, Chenglu Wen, Cheng Wang

    Abstract: LiDAR-based vision systems are integral for 3D object detection, which is crucial for autonomous navigation. However, they suffer from performance degradation in adverse weather conditions due to the quality deterioration of LiDAR point clouds. Fusing LiDAR with the weather-robust 4D radar sensor is expected to solve this problem. However, the fusion of LiDAR and 4D radar is challenging because th…

    Submitted 9 August, 2024; v1 submitted 7 August, 2024; originally announced August 2024.

  16. arXiv:2408.03675  [pdf, other]

    cs.CL

    NACL: A General and Effective KV Cache Eviction Framework for LLMs at Inference Time

    Authors: Yilong Chen, Guoxia Wang, Junyuan Shang, Shiyao Cui, Zhenyu Zhang, Tingwen Liu, Shuohuan Wang, Yu Sun, Dianhai Yu, Hua Wu

    Abstract: Large Language Models (LLMs) have ignited an innovative surge of AI applications, marking a new era of exciting possibilities equipped with extended context windows. However, hosting these models is cost-prohibitive mainly due to the extensive memory consumption of KV Cache involving long-context modeling. Despite several works proposing to evict unnecessary tokens from the KV Cache, most of them…

    Submitted 7 August, 2024; v1 submitted 7 August, 2024; originally announced August 2024.

    Comments: Accepted by ACL 2024 (main conference, long paper)

  17. arXiv:2408.02704  [pdf]

    cs.LG cs.AI

    Spatial-temporal Graph Convolutional Networks with Diversified Transformation for Dynamic Graph Representation Learning

    Authors: Ling Wang, Yixiang Huang, Hao Wu

    Abstract: Dynamic graphs (DG) are often used to describe evolving interactions between nodes in real-world applications. Temporal patterns are a natural feature of DGs and are also key to representation learning. However, existing dynamic GCN models are mostly composed of static GCNs and sequence modules, which results in the separation of spatiotemporal information and cannot effectively capture complex te…

    Submitted 5 August, 2024; originally announced August 2024.

    Comments: 8 pages, 1 figure

  18. arXiv:2408.02128  [pdf, other]

    cs.CL

    Table Transformers for Imputing Textual Attributes

    Authors: Ting-Ruen Wei, Yuan Wang, Yoshitaka Inoue, Hsin-Tai Wu, Yi Fang

    Abstract: Missing data in tabular datasets is a common issue as the performance of downstream tasks usually depends on the completeness of the training dataset. Previous missing data imputation methods focus on numeric and categorical columns, but we propose a novel end-to-end approach called Table Transformers for Imputing Textual Attributes (TTITA) based on the transformer to impute unstructured textual co…

    Submitted 4 August, 2024; originally announced August 2024.

  19. arXiv:2408.01566  [pdf, other]

    cs.CV

    Full-range Head Pose Geometric Data Augmentations

    Authors: Huei-Chung Hu, Xuyang Wu, Haowei Liu, Ting-Ruen Wei, Hsin-Tai Wu

    Abstract: Many head pose estimation (HPE) methods promise the ability to create full-range datasets, theoretically allowing the estimation of the rotation and positioning of the head from various angles. However, these methods are only accurate within a range of head angles; exceeding this specific range leads to significant inaccuracies. This is dominantly explained by unclear specificity of the coordinate s…

    Submitted 2 August, 2024; originally announced August 2024.

    Comments: arXiv admin note: text overlap with arXiv:2403.18104

  20. arXiv:2408.01562  [pdf]

    cs.CY

    Welfare, sustainability, and equity evaluation of the New York City Interborough Express using spatially heterogeneous mode choice models

    Authors: Hai Yang, Hongying Wu, Lauren Whang, Xiyuan Ren, Joseph Y. J. Chow

    Abstract: The Metropolitan Transit Authority (MTA) proposed building a new light rail route called the Interborough Express (IBX) to provide a direct, fast transit linkage between Queens and Brooklyn. An open-access synthetic citywide trip agenda dataset and a block-group-level mode choice model are used to assess the potential impact IBX could bring to New York City (NYC). IBX could save 28.1 minutes to po…

    Submitted 2 August, 2024; originally announced August 2024.

  21. RainMamba: Enhanced Locality Learning with State Space Models for Video Deraining

    Authors: Hongtao Wu, Yijun Yang, Huihui Xu, Weiming Wang, Jinni Zhou, Lei Zhu

    Abstract: Outdoor vision systems are frequently contaminated by rain streaks and raindrops, which significantly degrade the performance of visual tasks and multimedia applications. The nature of videos exhibits redundant temporal cues for rain removal with higher stability. Traditional video deraining methods heavily rely on optical flow estimation and kernel-based methods, which have a limited recep…

    Submitted 31 July, 2024; originally announced July 2024.

    Comments: ACM Multimedia 2024

  22. arXiv:2407.20710  [pdf, other]

    cs.DC

    On-the-fly Communication-and-Computing to Enable Representation Learning for Distributed Point Clouds

    Authors: Xu Chen, Hai Wu, Kaibin Huang

    Abstract: The advent of sixth-generation (6G) mobile networks introduces two groundbreaking capabilities: sensing and artificial intelligence (AI). Sensing leverages multi-modal sensors to capture real-time environmental data, while AI brings powerful models to the network edge, enabling intelligent Internet-of-Things (IoT) applications. These features converge in the Integrated Sensing and Edge AI (ISEA) p…

    Submitted 30 July, 2024; originally announced July 2024.

    Comments: This is an ongoing work under revision

  23. arXiv:2407.19863  [pdf, other]

    cs.DC

    Before and After Blockchain: Development and Principles of Distributed Fault-Tolerant Consensus

    Authors: Huanyu Wu, Chentao Yue, Yixuan Fan, Yonghui Li, Lei Zhang

    Abstract: The concept of distributed consensus gained widespread attention following the publication of "Byzantine Generals Problem" by Leslie Lamport in the 1980s. This research topic has been active and extensively studied over the last four decades, particularly since the advent of blockchain technology in 2009. Blockchain technology employs Proof-of-X (PoX) or Byzantine-fault-tolerant (BFT) systems, whe…

    Submitted 3 August, 2024; v1 submitted 29 July, 2024; originally announced July 2024.

  24. arXiv:2407.18849  [pdf]

    cs.SI cs.CY

    MNTD: An Efficient Dynamic Community Detector Based on Nonnegative Tensor Decomposition

    Authors: Hao Fang, Qu Wang, Qicong Hu, Hao Wu

    Abstract: Dynamic community detection is crucial for elucidating the temporal evolution of social structures, information dissemination, and interactive behaviors within complex networks. Nonnegative matrix factorization provides an efficient framework for identifying communities in static networks but falls short in depicting temporal variations in community affiliations. To solve this problem, this paper p…

    Submitted 26 July, 2024; originally announced July 2024.

    Comments: 10 pages, 5 figures. This paper will be published at the 2024 IEEE International Conference on Systems, Man, and Cybernetics (SMC)

  25. arXiv:2407.17379  [pdf, other]

    cs.CV cs.CL

    MMRA: A Benchmark for Evaluating Multi-Granularity and Multi-Image Relational Association Capabilities in Large Visual Language Models

    Authors: Siwei Wu, Kang Zhu, Yu Bai, Yiming Liang, Yizhi Li, Haoning Wu, J. H. Liu, Ruibo Liu, Xingwei Qu, Xuxin Cheng, Ge Zhang, Wenhao Huang, Chenghua Lin

    Abstract: Given the remarkable success that large visual language models (LVLMs) have achieved in image perception tasks, the endeavor to make LVLMs perceive the world like humans is drawing increasing attention. Current multi-modal benchmarks primarily focus on facts or specific topic-related knowledge contained within individual images. However, they often overlook the associative relations between multip…

    Submitted 5 August, 2024; v1 submitted 24 July, 2024; originally announced July 2024.

    Comments: VLMs, Multi-Image Association

  26. arXiv:2407.17126  [pdf]

    cs.CL cs.AI

    SDoH-GPT: Using Large Language Models to Extract Social Determinants of Health (SDoH)

    Authors: Bernardo Consoli, Xizhi Wu, Song Wang, Xinyu Zhao, Yanshan Wang, Justin Rousseau, Tom Hartvigsen, Li Shen, Huanmei Wu, Yifan Peng, Qi Long, Tianlong Chen, Ying Ding

    Abstract: Extracting social determinants of health (SDoH) from unstructured medical notes depends heavily on labor-intensive annotations, which are typically task-specific, hampering reusability and limiting sharing. In this study we introduced SDoH-GPT, a simple and effective few-shot Large Language Model (LLM) method leveraging contrastive examples and concise instructions to extract SDoH without relying…

    Submitted 24 July, 2024; originally announced July 2024.

  27. arXiv:2407.17035  [pdf, other]

    cs.CV

    Q-Ground: Image Quality Grounding with Large Multi-modality Models

    Authors: Chaofeng Chen, Sensen Yang, Haoning Wu, Liang Liao, Zicheng Zhang, Annan Wang, Wenxiu Sun, Qiong Yan, Weisi Lin

    Abstract: Recent advances in large multi-modality models (LMMs) have greatly improved the ability of image quality assessment (IQA) methods to evaluate and explain the quality of visual content. However, these advancements are mostly focused on overall quality assessment, and the detailed examination of local quality, which is crucial for comprehensive visual understanding, is still largely unexplored. In thi…

    Submitted 24 July, 2024; originally announced July 2024.

    Comments: ACM Multimedia 2024 (Oral)

  28. arXiv:2407.15754  [pdf, other]

    cs.CV cs.CL cs.LG

    LongVideoBench: A Benchmark for Long-context Interleaved Video-Language Understanding

    Authors: Haoning Wu, Dongxu Li, Bei Chen, Junnan Li

    Abstract: Large multimodal models (LMMs) are processing increasingly longer and richer inputs. Despite this progress, few public benchmarks are available to measure such development. To mitigate this gap, we introduce LongVideoBench, a question-answering benchmark that features video-language interleaved inputs up to an hour long. Our benchmark includes 3,763 varying-length web-collected videos with their subti…

    Submitted 22 July, 2024; originally announced July 2024.

    Comments: 29 pages

  29. arXiv:2407.15458  [pdf, other]

    eess.AS cs.SD

    EMO-Codec: An In-Depth Look at Emotion Preservation capacity of Legacy and Neural Codec Models With Subjective and Objective Evaluations

    Authors: Wenze Ren, Yi-Cheng Lin, Huang-Cheng Chou, Haibin Wu, Yi-Chiao Wu, Chi-Chun Lee, Hung-yi Lee, Yu Tsao

    Abstract: The neural codec model reduces speech data transmission delay and serves as the foundational tokenizer for speech language models (speech LMs). Preserving emotional information in codecs is crucial for effective communication and context understanding. However, there is a lack of studies on emotion loss in existing codecs. This paper evaluates neural and legacy codecs using subjective and objectiv…

    Submitted 30 July, 2024; v1 submitted 22 July, 2024; originally announced July 2024.

  30. arXiv:2407.15353  [pdf, other]

    cs.CL cs.AR

    Customized Retrieval Augmented Generation and Benchmarking for EDA Tool Documentation QA

    Authors: Yuan Pu, Zhuolun He, Tairu Qiu, Haoyuan Wu, Bei Yu

    Abstract: Retrieval augmented generation (RAG) enhances the accuracy and reliability of generative AI models by sourcing factual information from external databases, which is extensively employed in document-grounded question-answering (QA) tasks. Off-the-shelf RAG flows are well pretrained on general-purpose documents, yet they encounter significant challenges when being applied to knowledge-intensive vert…

    Submitted 26 July, 2024; v1 submitted 21 July, 2024; originally announced July 2024.

    Comments: Accepted by ICCAD 2024

  31. arXiv:2407.14904  [pdf, other]

    eess.IV cs.AI cs.CL cs.CV

    Large-vocabulary forensic pathological analyses via prototypical cross-modal contrastive learning

    Authors: Chen Shen, Chunfeng Lian, Wanqing Zhang, Fan Wang, Jianhua Zhang, Shuanliang Fan, Xin Wei, Gongji Wang, Kehan Li, Hongshu Mu, Hao Wu, Xinggong Liang, Jianhua Ma, Zhenyuan Wang

    Abstract: Forensic pathology is critical in determining the cause and manner of death through post-mortem examinations, both macroscopic and microscopic. The field, however, grapples with issues such as outcome variability, laborious processes, and a scarcity of trained professionals. This paper presents SongCi, an innovative visual-language model (VLM) designed specifically for forensic pathology. SongCi u…

    Submitted 20 July, 2024; originally announced July 2024.

    Comments: 28 pages, 6 figures, under review

  32. arXiv:2407.13806  [pdf, other]

    cs.LG cs.AI

    Revisiting Attention for Multivariate Time Series Forecasting

    Authors: Haixiang Wu

    Abstract: Current Transformer methods for Multivariate Time-Series Forecasting (MTSF) are all based on the conventional attention mechanism. They involve sequence embedding and performing a linear projection of Q, K, and V, and then computing attention within this latent space. We have never delved into the attention mechanism to explore whether such a mapping space is optimal for MTSF. To investigate this…

    Submitted 18 July, 2024; originally announced July 2024.

  33. arXiv:2407.13278  [pdf, other]

    cs.LG

    Deep Time Series Models: A Comprehensive Survey and Benchmark

    Authors: Yuxuan Wang, Haixu Wu, Jiaxiang Dong, Yong Liu, Mingsheng Long, Jianmin Wang

    Abstract: Time series, characterized by a sequence of data points arranged in a discrete-time order, are ubiquitous in real-world applications. Different from other modalities, time series present unique challenges due to their complex and dynamic nature, including the entanglement of nonlinear patterns and time-variant trends. Analyzing time series data is of great significance in real-world scenarios and…

    Submitted 18 July, 2024; originally announced July 2024.

  34. arXiv:2407.13221  [pdf, other]

    cs.CV

    Multimodal Label Relevance Ranking via Reinforcement Learning

    Authors: Taian Guo, Taolin Zhang, Haoqian Wu, Hanjun Li, Ruizhi Qiao, Xing Sun

    Abstract: Conventional multi-label recognition methods often focus on label confidence, frequently overlooking the pivotal role of partial order relations consistent with human preference. To resolve these issues, we introduce a novel method for multimodal label relevance ranking, named Label Relevance Ranking with Proximal Policy Optimization (LR$^2$PPO), which effectively discerns partial o…

    Submitted 18 July, 2024; originally announced July 2024.

    Comments: Accepted to ECCV2024

  35. arXiv:2407.13193  [pdf, other]

    cs.CL

    Retrieval-Augmented Generation for Natural Language Processing: A Survey

    Authors: Shangyu Wu, Ying Xiong, Yufei Cui, Haolun Wu, Can Chen, Ye Yuan, Lianming Huang, Xue Liu, Tei-Wei Kuo, Nan Guan, Chun Jason Xue

    Abstract: Large language models (LLMs) have demonstrated great success in various fields, benefiting from their huge number of parameters that store knowledge. However, LLMs still suffer from several key issues, such as hallucination problems, knowledge update issues, and a lack of domain-specific expertise. The appearance of retrieval-augmented generation (RAG), which leverages an external knowledge database…

    Submitted 18 July, 2024; v1 submitted 18 July, 2024; originally announced July 2024.

  36. arXiv:2407.12229  [pdf, other]

    eess.AS cs.AI eess.SP

    Laugh Now Cry Later: Controlling Time-Varying Emotional States of Flow-Matching-Based Zero-Shot Text-to-Speech

    Authors: Haibin Wu, Xiaofei Wang, Sefik Emre Eskimez, Manthan Thakker, Daniel Tompkins, Chung-Hsien Tsai, Canrun Li, Zhen Xiao, Sheng Zhao, Jinyu Li, Naoyuki Kanda

    Abstract: People change their tones of voice, often accompanied by nonverbal vocalizations (NVs) such as laughter and cries, to convey rich emotions. However, most text-to-speech (TTS) systems lack the capability to generate speech with rich emotions, including NVs. This paper introduces EmoCtrl-TTS, an emotion-controllable zero-shot TTS that can generate highly emotional speech with NVs for any speaker. Em…

    Submitted 16 July, 2024; originally announced July 2024.

    Comments: See https://aka.ms/emoctrl-tts for demo samples

  37. arXiv:2407.11882  [pdf, other]

    cs.CR

    Enhancing Covert Communication in Relay Systems Using Multi-Antenna Technique

    Authors: He Zhu, Huihui Wu, Wei Su, Xiaohong Jiang

    Abstract: This paper exploits the multi-antenna technique to enhance the covert communication performance in a relay system, where a source S conducts covert communication with a destination D via a relay R, subject to detection of the transmissions in the two hops by a single-antenna warden W. To demonstrate the performance gain from adopting the multi-antenna technique, we first consider the scenari…

    Submitted 16 July, 2024; originally announced July 2024.

  38. arXiv:2407.10956  [pdf, other]

    cs.AI cs.CL

    Spider2-V: How Far Are Multimodal Agents From Automating Data Science and Engineering Workflows?

    Authors: Ruisheng Cao, Fangyu Lei, Haoyuan Wu, Jixuan Chen, Yeqiao Fu, Hongcheng Gao, Xinzhuang Xiong, Hanchong Zhang, Yuchen Mao, Wenjing Hu, Tianbao Xie, Hongshen Xu, Danyang Zhang, Sida Wang, Ruoxi Sun, Pengcheng Yin, Caiming Xiong, Ansong Ni, Qian Liu, Victor Zhong, Lu Chen, Kai Yu, Tao Yu

    Abstract: Data science and engineering workflows often span multiple stages, from warehousing to orchestration, using tools like BigQuery, dbt, and Airbyte. As vision language models (VLMs) advance in multimodal understanding and code generation, VLM-based agents could potentially automate these workflows by generating SQL queries, Python code, and GUI operations. This automation can improve the productivit…

    Submitted 15 July, 2024; originally announced July 2024.

    Comments: 34 pages, 14 figures, 10 tables

  39. arXiv:2407.10810  [pdf, other]

    cs.CV cs.AI cs.AR cs.LG

    FabGPT: An Efficient Large Multimodal Model for Complex Wafer Defect Knowledge Queries

    Authors: Yuqi Jiang, Xudong Lu, Qian Jin, Qi Sun, Hanming Wu, Cheng Zhuo

    Abstract: Intelligence is key to advancing integrated circuit (IC) fabrication. Recent breakthroughs in Large Multimodal Models (LMMs) have unlocked unparalleled abilities in understanding images and text, fostering intelligent fabrication. Leveraging the power of LMMs, we introduce FabGPT, a customized IC fabrication large multimodal model for wafer defect knowledge query. FabGPT manifests expertise in con…

    Submitted 15 July, 2024; originally announced July 2024.

  40. arXiv:2407.10179  [pdf, other]

    cs.CV

    CLIP-Guided Networks for Transferable Targeted Attacks

    Authors: Hao Fang, Jiawei Kong, Bin Chen, Tao Dai, Hao Wu, Shu-Tao Xia

    Abstract: Transferable targeted adversarial attacks aim to mislead models into outputting adversary-specified predictions in black-box scenarios. Recent studies have introduced single-target generative attacks that train a generator for each target class to generate highly transferable perturbations, resulting in substantial computational overhead when handling multiple classes. Multi-targe…

    Submitted 22 July, 2024; v1 submitted 14 July, 2024; originally announced July 2024.

    Comments: ECCV 2024

  41. arXiv:2407.09873  [pdf, other]

    cs.IT cs.AI

    Resource Management for Low-latency Cooperative Fine-tuning of Foundation Models at the Network Edge

    Authors: Hai Wu, Xu Chen, Kaibin Huang

    Abstract: The emergence of large-scale foundation models (FoMo's) that can perform human-like intelligence motivates their deployment at the network edge for devices to access state-of-the-art artificial intelligence. For better user experiences, the pre-trained FoMo's need to be adapted to specialized downstream tasks through fine-tuning techniques. To transcend a single device's memory and computation lim…

    Submitted 13 July, 2024; originally announced July 2024.

    Comments: This work has been submitted to the IEEE for possible publication

  42. arXiv:2407.08958  [pdf, other]

    cs.SE

    Towards Practical and Useful Automated Program Repair for Debugging

    Authors: Qi Xin, Haojun Wu, Steven P. Reiss, Jifeng Xuan

    Abstract: Current automated program repair (APR) techniques are far from being practical and useful enough to be considered for realistic debugging. They rely on unrealistic assumptions including the requirement of a comprehensive suite of test cases as the correctness criterion and frequent program re-execution for patch validation; they are not fast; and their ability to repair the commonly arising com…

    Submitted 11 July, 2024; originally announced July 2024.

  43. arXiv:2407.08561  [pdf, other]

    cs.CV

    MapLocNet: Coarse-to-Fine Feature Registration for Visual Re-Localization in Navigation Maps

    Authors: Hang Wu, Zhenghao Zhang, Siyuan Lin, Xiangru Mu, Qiang Zhao, Ming Yang, Tong Qin

    Abstract: Robust localization is the cornerstone of autonomous driving, especially in challenging urban environments where GPS signals suffer from multipath errors. Traditional localization approaches rely on high-definition (HD) maps, which consist of precisely annotated landmarks. However, building HD map is expensive and challenging to scale up. Given these limitations, leveraging navigation maps has eme…

    Submitted 11 July, 2024; originally announced July 2024.

    Comments: IROS 2024 (Oral)

  44. arXiv:2407.08526  [pdf, other]

    cs.CV

    BLOS-BEV: Navigation Map Enhanced Lane Segmentation Network, Beyond Line of Sight

    Authors: Hang Wu, Zhenghao Zhang, Siyuan Lin, Tong Qin, Jin Pan, Qiang Zhao, Chunjing Xu, Ming Yang

    Abstract: Bird's-eye-view (BEV) representation is crucial for the perception function in autonomous driving tasks. It is difficult to balance the accuracy, efficiency and range of BEV representation. The existing works are restricted to a limited perception range within 50 meters. Extending the BEV representation range can greatly benefit downstream tasks such as topology reasoning, scene understanding, and…

    Submitted 11 July, 2024; originally announced July 2024.

    Comments: IEEE IV 2024

  45. arXiv:2407.07710  [pdf, ps, other]

    cs.CR cs.IT math.NT

    On the differential and Walsh spectra of $x^{2q+1}$ over $\mathbb{F}_{q^2}$

    Authors: Sihem Mesnager, Huawei Wu

    Abstract: Let $q$ be an odd prime power and let $\mathbb{F}_{q^2}$ be the finite field with $q^2$ elements. In this paper, we determine the differential spectrum of the power function $F(x)=x^{2q+1}$ over $\mathbb{F}_{q^2}$. When the characteristic of $\mathbb{F}_{q^2}$ is $3$, we also determine the value distribution of the Walsh spectrum of $F$, showing that it is $4$-valued, and use the obtained result t…

    Submitted 8 July, 2024; originally announced July 2024.

  46. arXiv:2407.07504  [pdf, other]

    cs.CV

    Pan-cancer Histopathology WSI Pre-training with Position-aware Masked Autoencoder

    Authors: Kun Wu, Zhiguo Jiang, Kunming Tang, Jun Shi, Fengying Xie, Wei Wang, Haibo Wu, Yushan Zheng

    Abstract: Large-scale pre-training models have promoted the development of histopathology image analysis. However, existing self-supervised methods for histopathology images focus on learning patch features, while there is still a lack of available pre-training models for WSI-level feature learning. In this paper, we propose a novel self-supervised learning framework for pan-cancer WSI-level representation…

    Submitted 15 July, 2024; v1 submitted 10 July, 2024; originally announced July 2024.

  47. arXiv:2407.07088  [pdf, other]

    cs.AI cs.LO eess.SY

    Safe and Reliable Training of Learning-Based Aerospace Controllers

    Authors: Udayan Mandal, Guy Amir, Haoze Wu, Ieva Daukantas, Fletcher Lee Newell, Umberto Ravaioli, Baoluo Meng, Michael Durling, Kerianne Hobbs, Milan Ganai, Tobey Shim, Guy Katz, Clark Barrett

    Abstract: In recent years, deep reinforcement learning (DRL) approaches have generated highly successful controllers for a myriad of complex domains. However, the opaque nature of these models limits their applicability in aerospace systems and safety-critical domains, in which a single mistake can have dire consequences. In this paper, we present novel advancements in both the training and verification of…

    Submitted 9 July, 2024; originally announced July 2024.

    Comments: 10 pages, 3 figures

  48. arXiv:2407.06546  [pdf, other]

    cs.CV cs.RO

    Exploring the Causality of End-to-End Autonomous Driving

    Authors: Jiankun Li, Hao Li, Jiangjiang Liu, Zhikang Zou, Xiaoqing Ye, Fan Wang, Jizhou Huang, Hua Wu, Haifeng Wang

    Abstract: Deep learning-based models are widely deployed in autonomous driving areas, especially the increasingly noticed end-to-end solutions. However, the black-box property of these models raises concerns about their trustworthiness and safety for autonomous driving, and how to debug the causality has become a pressing concern. Despite some existing research on the explainability of autonomous driving, t…

    Submitted 19 July, 2024; v1 submitted 9 July, 2024; originally announced July 2024.

  49. arXiv:2407.05679  [pdf, other]

    cs.CV cs.AI

    BEVWorld: A Multimodal World Model for Autonomous Driving via Unified BEV Latent Space

    Authors: Yumeng Zhang, Shi Gong, Kaixin Xiong, Xiaoqing Ye, Xiao Tan, Fan Wang, Jizhou Huang, Hua Wu, Haifeng Wang

    Abstract: World models are receiving increasing attention in autonomous driving for their ability to predict potential future scenarios. In this paper, we present BEVWorld, a novel approach that tokenizes multimodal sensor inputs into a unified and compact Bird's Eye View (BEV) latent space for environment modeling. The world model consists of two parts: the multi-modal tokenizer and the latent BEV sequence…

    Submitted 18 July, 2024; v1 submitted 8 July, 2024; originally announced July 2024.

    Comments: 10 pages

  50. arXiv:2407.05413  [pdf, other]

    cs.AI cs.CL cs.LG

    SBoRA: Low-Rank Adaptation with Regional Weight Updates

    Authors: Lai-Man Po, Yuyang Liu, Haoxuan Wu, Tianqi Zhang, Wing-Yin Yu, Zeyu Jiang, Kun Li

    Abstract: This paper introduces Standard Basis LoRA (SBoRA), a novel parameter-efficient fine-tuning approach for Large Language Models that builds upon the pioneering works of Low-Rank Adaptation (LoRA) and Orthogonal Adaptation. SBoRA further reduces the computational and memory requirements of LoRA while enhancing learning performance. By leveraging orthogonal standard basis vectors to initialize one of…

    Submitted 10 July, 2024; v1 submitted 7 July, 2024; originally announced July 2024.

    Comments: 15 pages, 2 figures