-
EIA: Environmental Injection Attack on Generalist Web Agents for Privacy Leakage
Authors:
Zeyi Liao,
Lingbo Mo,
Chejian Xu,
Mintong Kang,
Jiawei Zhang,
Chaowei Xiao,
Yuan Tian,
Bo Li,
Huan Sun
Abstract:
Generalist web agents have evolved rapidly and demonstrated remarkable potential. However, there are unprecedented safety risks associated with these them, which are nearly unexplored so far. In this work, we aim to narrow this gap by conducting the first study on the privacy risks of generalist web agents in adversarial environments. First, we present a threat model that discusses the adversarial…
▽ More
Generalist web agents have evolved rapidly and demonstrated remarkable potential. However, there are unprecedented safety risks associated with these them, which are nearly unexplored so far. In this work, we aim to narrow this gap by conducting the first study on the privacy risks of generalist web agents in adversarial environments. First, we present a threat model that discusses the adversarial targets, constraints, and attack scenarios. Particularly, we consider two types of adversarial targets: stealing users' specific personally identifiable information (PII) or stealing the entire user request. To achieve these objectives, we propose a novel attack method, termed Environmental Injection Attack (EIA). This attack injects malicious content designed to adapt well to different environments where the agents operate, causing them to perform unintended actions. This work instantiates EIA specifically for the privacy scenario. It inserts malicious web elements alongside persuasive instructions that mislead web agents into leaking private information, and can further leverage CSS and JavaScript features to remain stealthy. We collect 177 actions steps that involve diverse PII categories on realistic websites from the Mind2Web dataset, and conduct extensive experiments using one of the most capable generalist web agent frameworks to date, SeeAct. The results demonstrate that EIA achieves up to 70% ASR in stealing users' specific PII. Stealing full user requests is more challenging, but a relaxed version of EIA can still achieve 16% ASR. Despite these concerning results, it is important to note that the attack can still be detectable through careful human inspection, highlighting a trade-off between high autonomy and security. This leads to our detailed discussion on the efficacy of EIA under different levels of human supervision as well as implications on defenses for generalist web agents.
△ Less
Submitted 17 September, 2024;
originally announced September 2024.
-
DrawingSpinUp: 3D Animation from Single Character Drawings
Authors:
Jie Zhou,
Chufeng Xiao,
Miu-Ling Lam,
Hongbo Fu
Abstract:
Animating various character drawings is an engaging visual content creation task. Given a single character drawing, existing animation methods are limited to flat 2D motions and thus lack 3D effects. An alternative solution is to reconstruct a 3D model from a character drawing as a proxy and then retarget 3D motion data onto it. However, the existing image-to-3D methods could not work well for ama…
▽ More
Animating various character drawings is an engaging visual content creation task. Given a single character drawing, existing animation methods are limited to flat 2D motions and thus lack 3D effects. An alternative solution is to reconstruct a 3D model from a character drawing as a proxy and then retarget 3D motion data onto it. However, the existing image-to-3D methods could not work well for amateur character drawings in terms of appearance and geometry. We observe the contour lines, commonly existing in character drawings, would introduce significant ambiguity in texture synthesis due to their view-dependence. Additionally, thin regions represented by single-line contours are difficult to reconstruct (e.g., slim limbs of a stick figure) due to their delicate structures. To address these issues, we propose a novel system, DrawingSpinUp, to produce plausible 3D animations and breathe life into character drawings, allowing them to freely spin up, leap, and even perform a hip-hop dance. For appearance improvement, we adopt a removal-then-restoration strategy to first remove the view-dependent contour lines and then render them back after retargeting the reconstructed character. For geometry refinement, we develop a skeleton-based thinning deformation algorithm to refine the slim structures represented by the single-line contours. The experimental evaluations and a perceptual user study show that our proposed method outperforms the existing 2D and 3D animation methods and generates high-quality 3D animations from a single character drawing. Please refer to our project page (https://lordliang.github.io/DrawingSpinUp) for the code and generated animations.
△ Less
Submitted 13 September, 2024;
originally announced September 2024.
-
RT-DETRv3: Real-time End-to-End Object Detection with Hierarchical Dense Positive Supervision
Authors:
Shuo Wang,
Chunlong Xia,
Feng Lv,
Yifeng Shi
Abstract:
RT-DETR is the first real-time end-to-end transformer-based object detector. Its efficiency comes from the framework design and the Hungarian matching. However, compared to dense supervision detectors like the YOLO series, the Hungarian matching provides much sparser supervision, leading to insufficient model training and difficult to achieve optimal results. To address these issues, we proposed a…
▽ More
RT-DETR is the first real-time end-to-end transformer-based object detector. Its efficiency comes from the framework design and the Hungarian matching. However, compared to dense supervision detectors like the YOLO series, the Hungarian matching provides much sparser supervision, leading to insufficient model training and difficult to achieve optimal results. To address these issues, we proposed a hierarchical dense positive supervision method based on RT-DETR, named RT-DETRv3. Firstly, we introduce a CNN-based auxiliary branch that provides dense supervision that collaborates with the original decoder to enhance the encoder feature representation. Secondly, to address insufficient decoder training, we propose a novel learning strategy involving self-attention perturbation. This strategy diversifies label assignment for positive samples across multiple query groups, thereby enriching positive supervisions. Additionally, we introduce a shared-weight decoder branch for dense positive supervision to ensure more high-quality queries matching each ground truth. Notably, all aforementioned modules are training-only. We conduct extensive experiments to demonstrate the effectiveness of our approach on COCO val2017. RT-DETRv3 significantly outperforms existing real-time detectors, including the RT-DETR series and the YOLO series. For example, RT-DETRv3-R18 achieves 48.1% AP (+1.6%/+1.4%) compared to RT-DETR-R18/RT-DETRv2-R18 while maintaining the same latency. Meanwhile, it requires only half of epochs to attain a comparable performance. Furthermore, RT-DETRv3-R101 can attain an impressive 54.6% AP outperforming YOLOv10-X. Code will be released soon.
△ Less
Submitted 12 September, 2024;
originally announced September 2024.
-
Legal Fact Prediction: Task Definition and Dataset Construction
Authors:
Junkai Liu,
Yujie Tong,
Hui Huang,
Shuyuan Zheng,
Muyun Yang,
Peicheng Wu,
Makoto Onizuka,
Chuan Xiao
Abstract:
Legal facts refer to the facts that can be proven by acknowledged evidence in a trial. They form the basis for the determination of court judgments. This paper introduces a novel NLP task: legal fact prediction, which aims to predict the legal fact based on a list of evidence. The predicted facts can instruct the parties and their lawyers involved in a trial to strengthen their submissions and opt…
▽ More
Legal facts refer to the facts that can be proven by acknowledged evidence in a trial. They form the basis for the determination of court judgments. This paper introduces a novel NLP task: legal fact prediction, which aims to predict the legal fact based on a list of evidence. The predicted facts can instruct the parties and their lawyers involved in a trial to strengthen their submissions and optimize their strategies during the trial. Moreover, since real legal facts are difficult to obtain before the final judgment, the predicted facts also serve as an important basis for legal judgment prediction. We construct a benchmark dataset consisting of evidence lists and ground-truth legal facts for real civil loan cases, LFPLoan. Our experiments on this dataset show that this task is non-trivial and requires further considerable research efforts.
△ Less
Submitted 11 September, 2024;
originally announced September 2024.
-
Enabling Shared-Control for A Riding Ballbot System
Authors:
Yu Chen,
Mahshid Mansouri,
Chenzhang Xiao,
Ze Wang,
Elizabeth T. Hsiao-Wecksler,
William R. Norris
Abstract:
This study introduces a shared-control approach for collision avoidance in a self-balancing riding ballbot, called PURE, marked by its dynamic stability, omnidirectional movement, and hands-free interface. Integrated with a sensor array and a novel Passive Artificial Potential Field (PAPF) method, PURE provides intuitive navigation with deceleration assistance and haptic/audio feedback, effectivel…
▽ More
This study introduces a shared-control approach for collision avoidance in a self-balancing riding ballbot, called PURE, marked by its dynamic stability, omnidirectional movement, and hands-free interface. Integrated with a sensor array and a novel Passive Artificial Potential Field (PAPF) method, PURE provides intuitive navigation with deceleration assistance and haptic/audio feedback, effectively mitigating collision risks. This approach addresses the limitations of traditional APF methods, such as control oscillations and unnecessary speed reduction in challenging scenarios. A human-robot interaction experiment, with 20 manual wheelchair users and able-bodied individuals, was conducted to evaluate the performance of indoor navigation and obstacle avoidance with the proposed shared-control algorithm. Results indicated that shared-control significantly reduced collisions and cognitive load without affecting travel speed, offering intuitive and safe operation. These findings highlight the shared-control system's suitability for enhancing collision avoidance in self-balancing mobility devices, a relatively unexplored area in assistive mobility research.
△ Less
Submitted 11 September, 2024;
originally announced September 2024.
-
Equivariant Filter for Tightly Coupled LiDAR-Inertial Odometry
Authors:
Anbo Tao,
Yarong Luo,
Chunxi Xia,
Chi Guo,
Xingxing Li
Abstract:
Pose estimation is a crucial problem in simultaneous localization and mapping (SLAM). However, developing a robust and consistent state estimator remains a significant challenge, as the traditional extended Kalman filter (EKF) struggles to handle the model nonlinearity, especially for inertial measurement unit (IMU) and light detection and ranging (LiDAR). To provide a consistent and efficient sol…
▽ More
Pose estimation is a crucial problem in simultaneous localization and mapping (SLAM). However, developing a robust and consistent state estimator remains a significant challenge, as the traditional extended Kalman filter (EKF) struggles to handle the model nonlinearity, especially for inertial measurement unit (IMU) and light detection and ranging (LiDAR). To provide a consistent and efficient solution of pose estimation, we propose Eq-LIO, a robust state estimator for tightly coupled LIO systems based on an equivariant filter (EqF). Compared with the invariant Kalman filter based on the $\SE_2(3)$ group structure, the EqF uses the symmetry of the semi-direct product group to couple the system state including IMU bias, navigation state and LiDAR extrinsic calibration state, thereby suppressing linearization error and improving the behavior of the estimator in the event of unexpected state changes. The proposed Eq-LIO owns natural consistency and higher robustness, which is theoretically proven with mathematical derivation and experimentally verified through a series of tests on both public and private datasets.
△ Less
Submitted 10 September, 2024;
originally announced September 2024.
-
From MOOC to MAIC: Reshaping Online Teaching and Learning through LLM-driven Agents
Authors:
Jifan Yu,
Zheyuan Zhang,
Daniel Zhang-li,
Shangqing Tu,
Zhanxin Hao,
Rui Miao Li,
Haoxuan Li,
Yuanchun Wang,
Hanming Li,
Linlu Gong,
Jie Cao,
Jiayin Lin,
Jinchang Zhou,
Fei Qin,
Haohua Wang,
Jianxiao Jiang,
Lijun Deng,
Yisi Zhan,
Chaojun Xiao,
Xusheng Dai,
Xuan Yan,
Nianyi Lin,
Nan Zhang,
Ruixin Ni,
Yang Dang
, et al. (8 additional authors not shown)
Abstract:
Since the first instances of online education, where courses were uploaded to accessible and shared online platforms, this form of scaling the dissemination of human knowledge to reach a broader audience has sparked extensive discussion and widespread adoption. Recognizing that personalized learning still holds significant potential for improvement, new AI technologies have been continuously integ…
▽ More
Since the first instances of online education, where courses were uploaded to accessible and shared online platforms, this form of scaling the dissemination of human knowledge to reach a broader audience has sparked extensive discussion and widespread adoption. Recognizing that personalized learning still holds significant potential for improvement, new AI technologies have been continuously integrated into this learning format, resulting in a variety of educational AI applications such as educational recommendation and intelligent tutoring. The emergence of intelligence in large language models (LLMs) has allowed for these educational enhancements to be built upon a unified foundational model, enabling deeper integration. In this context, we propose MAIC (Massive AI-empowered Course), a new form of online education that leverages LLM-driven multi-agent systems to construct an AI-augmented classroom, balancing scalability with adaptivity. Beyond exploring the conceptual framework and technical innovations, we conduct preliminary experiments at Tsinghua University, one of China's leading universities. Drawing from over 100,000 learning records of more than 500 students, we obtain a series of valuable observations and initial analyses. This project will continue to evolve, ultimately aiming to establish a comprehensive open platform that supports and unifies research, technology, and applications in exploring the possibilities of online education in the era of large model AI. We envision this platform as a collaborative hub, bringing together educators, researchers, and innovators to collectively explore the future of AI-driven online education.
△ Less
Submitted 5 September, 2024;
originally announced September 2024.
-
Configurable Foundation Models: Building LLMs from a Modular Perspective
Authors:
Chaojun Xiao,
Zhengyan Zhang,
Chenyang Song,
Dazhi Jiang,
Feng Yao,
Xu Han,
Xiaozhi Wang,
Shuo Wang,
Yufei Huang,
Guanyu Lin,
Yingfa Chen,
Weilin Zhao,
Yuge Tu,
Zexuan Zhong,
Ao Zhang,
Chenglei Si,
Khai Hao Moo,
Chenyang Zhao,
Huimin Chen,
Yankai Lin,
Zhiyuan Liu,
Jingbo Shang,
Maosong Sun
Abstract:
Advancements in LLMs have recently unveiled challenges tied to computational efficiency and continual scalability due to their requirements of huge parameters, making the applications and evolution of these models on devices with limited computation resources and scenarios requiring various abilities increasingly cumbersome. Inspired by modularity within the human brain, there is a growing tendenc…
▽ More
Advancements in LLMs have recently unveiled challenges tied to computational efficiency and continual scalability due to their requirements of huge parameters, making the applications and evolution of these models on devices with limited computation resources and scenarios requiring various abilities increasingly cumbersome. Inspired by modularity within the human brain, there is a growing tendency to decompose LLMs into numerous functional modules, allowing for inference with part of modules and dynamic assembly of modules to tackle complex tasks, such as mixture-of-experts. To highlight the inherent efficiency and composability of the modular approach, we coin the term brick to represent each functional module, designating the modularized structure as configurable foundation models. In this paper, we offer a comprehensive overview and investigation of the construction, utilization, and limitation of configurable foundation models. We first formalize modules into emergent bricks - functional neuron partitions that emerge during the pre-training phase, and customized bricks - bricks constructed via additional post-training to improve the capabilities and knowledge of LLMs. Based on diverse functional bricks, we further present four brick-oriented operations: retrieval and routing, merging, updating, and growing. These operations allow for dynamic configuration of LLMs based on instructions to handle complex tasks. To verify our perspective, we conduct an empirical analysis on widely-used LLMs. We find that the FFN layers follow modular patterns with functional specialization of neurons and functional neuron partitions. Finally, we highlight several open issues and directions for future research. Overall, this paper aims to offer a fresh modular perspective on existing LLM research and inspire the future creation of more efficient and scalable foundational models.
△ Less
Submitted 4 September, 2024;
originally announced September 2024.
-
GGS: Generalizable Gaussian Splatting for Lane Switching in Autonomous Driving
Authors:
Huasong Han,
Kaixuan Zhou,
Xiaoxiao Long,
Yusen Wang,
Chunxia Xiao
Abstract:
We propose GGS, a Generalizable Gaussian Splatting method for Autonomous Driving which can achieve realistic rendering under large viewpoint changes. Previous generalizable 3D gaussian splatting methods are limited to rendering novel views that are very close to the original pair of images, which cannot handle large differences in viewpoint. Especially in autonomous driving scenarios, images are t…
▽ More
We propose GGS, a Generalizable Gaussian Splatting method for Autonomous Driving which can achieve realistic rendering under large viewpoint changes. Previous generalizable 3D gaussian splatting methods are limited to rendering novel views that are very close to the original pair of images, which cannot handle large differences in viewpoint. Especially in autonomous driving scenarios, images are typically collected from a single lane. The limited training perspective makes rendering images of a different lane very challenging. To further improve the rendering capability of GGS under large viewpoint changes, we introduces a novel virtual lane generation module into GSS method to enables high-quality lane switching even without a multi-lane dataset. Besides, we design a diffusion loss to supervise the generation of virtual lane image to further address the problem of lack of data in the virtual lanes. Finally, we also propose a depth refinement module to optimize depth estimation in the GSS model. Extensive validation of our method, compared to existing approaches, demonstrates state-of-the-art performance.
△ Less
Submitted 3 September, 2024;
originally announced September 2024.
-
Self-Supervised Multi-Scale Network for Blind Image Deblurring via Alternating Optimization
Authors:
Lening Guo,
Jing Yu,
Ning Zhang,
Chuangbai Xiao
Abstract:
Blind image deblurring is a challenging low-level vision task that involves estimating the unblurred image when the blur kernel is unknown. In this paper, we present a self-supervised multi-scale blind image deblurring method to jointly estimate the latent image and the blur kernel via alternating optimization. In the image estimation step, we construct a multi-scale generator network with multipl…
▽ More
Blind image deblurring is a challenging low-level vision task that involves estimating the unblurred image when the blur kernel is unknown. In this paper, we present a self-supervised multi-scale blind image deblurring method to jointly estimate the latent image and the blur kernel via alternating optimization. In the image estimation step, we construct a multi-scale generator network with multiple inputs and multiple outputs to collaboratively estimate latent images at various scales, supervised by an image pyramid constructed from only the blurred image. This generator places architectural constraints on the network and avoids the need for mathematical expression of image priors. In the blur kernel estimation step, the blur kernel at each scale is independently estimated with a direct solution to a quadratic regularized least-squares model for its flexible adaptation to the proposed multi-scale generator for image estimation. Thanks to the collaborative estimation across multiple scales, our method avoids the computationally intensive coarse-to-fine propagation and additional image deblurring processes used in traditional mathematical optimization-based methods. Quantitative and qualitative experimental results on synthetic and realistic datasets demonstrate the superior performance of our method, especially for handling large and real-world blurs.
△ Less
Submitted 2 September, 2024;
originally announced September 2024.
-
OG-Mapping: Octree-based Structured 3D Gaussians for Online Dense Mapping
Authors:
Meng Wang,
Junyi Wang,
Changqun Xia,
Chen Wang,
Yue Qi
Abstract:
3D Gaussian splatting (3DGS) has recently demonstrated promising advancements in RGB-D online dense mapping. Nevertheless, existing methods excessively rely on per-pixel depth cues to perform map densification, which leads to significant redundancy and increased sensitivity to depth noise. Additionally, explicitly storing 3D Gaussian parameters of room-scale scene poses a significant storage chall…
▽ More
3D Gaussian splatting (3DGS) has recently demonstrated promising advancements in RGB-D online dense mapping. Nevertheless, existing methods excessively rely on per-pixel depth cues to perform map densification, which leads to significant redundancy and increased sensitivity to depth noise. Additionally, explicitly storing 3D Gaussian parameters of room-scale scene poses a significant storage challenge. In this paper, we introduce OG-Mapping, which leverages the robust scene structural representation capability of sparse octrees, combined with structured 3D Gaussian representations, to achieve efficient and robust online dense mapping. Moreover, OG-Mapping employs an anchor-based progressive map refinement strategy to recover the scene structures at multiple levels of detail. Instead of maintaining a small number of active keyframes with a fixed keyframe window as previous approaches do, a dynamic keyframe window is employed to allow OG-Mapping to better tackle false local minima and forgetting issues. Experimental results demonstrate that OG-Mapping delivers more robust and superior realism mapping results than existing Gaussian-based RGB-D online mapping methods with a compact model, and no additional post-processing is required.
△ Less
Submitted 30 August, 2024;
originally announced August 2024.
-
xGen-VideoSyn-1: High-fidelity Text-to-Video Synthesis with Compressed Representations
Authors:
Can Qin,
Congying Xia,
Krithika Ramakrishnan,
Michael Ryoo,
Lifu Tu,
Yihao Feng,
Manli Shu,
Honglu Zhou,
Anas Awadalla,
Jun Wang,
Senthil Purushwalkam,
Le Xue,
Yingbo Zhou,
Huan Wang,
Silvio Savarese,
Juan Carlos Niebles,
Zeyuan Chen,
Ran Xu,
Caiming Xiong
Abstract:
We present xGen-VideoSyn-1, a text-to-video (T2V) generation model capable of producing realistic scenes from textual descriptions. Building on recent advancements, such as OpenAI's Sora, we explore the latent diffusion model (LDM) architecture and introduce a video variational autoencoder (VidVAE). VidVAE compresses video data both spatially and temporally, significantly reducing the length of vi…
▽ More
We present xGen-VideoSyn-1, a text-to-video (T2V) generation model capable of producing realistic scenes from textual descriptions. Building on recent advancements, such as OpenAI's Sora, we explore the latent diffusion model (LDM) architecture and introduce a video variational autoencoder (VidVAE). VidVAE compresses video data both spatially and temporally, significantly reducing the length of visual tokens and the computational demands associated with generating long-sequence videos. To further address the computational costs, we propose a divide-and-merge strategy that maintains temporal consistency across video segments. Our Diffusion Transformer (DiT) model incorporates spatial and temporal self-attention layers, enabling robust generalization across different timeframes and aspect ratios. We have devised a data processing pipeline from the very beginning and collected over 13M high-quality video-text pairs. The pipeline includes multiple steps such as clipping, text detection, motion estimation, aesthetics scoring, and dense captioning based on our in-house video-LLM model. Training the VidVAE and DiT models required approximately 40 and 642 H100 days, respectively. Our model supports over 14-second 720p video generation in an end-to-end way and demonstrates competitive performance against state-of-the-art T2V models.
△ Less
Submitted 31 August, 2024; v1 submitted 22 August, 2024;
originally announced August 2024.
-
ViIK: Flow-based Vision Inverse Kinematics Solver with Fusing Collision Checking
Authors:
Qinglong Meng,
Chongkun Xia,
Xueqian Wang
Abstract:
Inverse Kinematics (IK) is to find the robot's configurations that satisfy the target pose of the end effector. In motion planning, diverse configurations were required in case a feasible trajectory was not found. Meanwhile, collision checking (CC), e.g. Oriented bounding box (OBB), Discrete Oriented Polytope (DOP), and Quickhull \cite{quickhull}, needs to be done for each configuration provided b…
▽ More
Inverse Kinematics (IK) is to find the robot's configurations that satisfy the target pose of the end effector. In motion planning, diverse configurations were required in case a feasible trajectory was not found. Meanwhile, collision checking (CC), e.g. Oriented bounding box (OBB), Discrete Oriented Polytope (DOP), and Quickhull \cite{quickhull}, needs to be done for each configuration provided by the IK solver to ensure every goal configuration for motion planning is available. This means the classical IK solver and CC algorithm should be executed repeatedly for every configuration. Thus, the preparation time is long when the required number of goal configurations is large, e.g. motion planning in cluster environments. Moreover, structured maps, which might be difficult to obtain, were required by classical collision-checking algorithms. To sidestep such two issues, we propose a flow-based vision method that can output diverse available configurations by fusing inverse kinematics and collision checking, named Vision Inverse Kinematics solver (ViIK). Moreover, ViIK uses RGB images as the perception of environments. ViIK can output 1000 configurations within 40 ms, and the accuracy is about 3 millimeters and 1.5 degrees. The higher accuracy can be obtained by being refined by the classical IK solver within a few iterations. The self-collision rates can be lower than 2%. The collision-with-env rates can be lower than 10% in most scenes. The code is available at: https://github.com/AdamQLMeng/ViIK.
△ Less
Submitted 28 August, 2024; v1 submitted 20 August, 2024;
originally announced August 2024.
-
MambaTrack: A Simple Baseline for Multiple Object Tracking with State Space Model
Authors:
Changcheng Xiao,
Qiong Cao,
Zhigang Luo,
Long Lan
Abstract:
Tracking by detection has been the prevailing paradigm in the field of Multi-object Tracking (MOT). These methods typically rely on the Kalman Filter to estimate the future locations of objects, assuming linear object motion. However, they fall short when tracking objects exhibiting nonlinear and diverse motion in scenarios like dancing and sports. In addition, there has been limited focus on util…
▽ More
Tracking by detection has been the prevailing paradigm in the field of Multi-object Tracking (MOT). These methods typically rely on the Kalman Filter to estimate the future locations of objects, assuming linear object motion. However, they fall short when tracking objects exhibiting nonlinear and diverse motion in scenarios like dancing and sports. In addition, there has been limited focus on utilizing learning-based motion predictors in MOT. To address these challenges, we resort to exploring data-driven motion prediction methods. Inspired by the great expectation of state space models (SSMs), such as Mamba, in long-term sequence modeling with near-linear complexity, we introduce a Mamba-based motion model named Mamba moTion Predictor (MTP). MTP is designed to model the complex motion patterns of objects like dancers and athletes. Specifically, MTP takes the spatial-temporal location dynamics of objects as input, captures the motion pattern using a bi-Mamba encoding layer, and predicts the next motion. In real-world scenarios, objects may be missed due to occlusion or motion blur, leading to premature termination of their trajectories. To tackle this challenge, we further expand the application of MTP. We employ it in an autoregressive way to compensate for missing observations by utilizing its own predictions as inputs, thereby contributing to more consistent trajectories. Our proposed tracker, MambaTrack, demonstrates advanced performance on benchmarks such as Dancetrack and SportsMOT, which are characterized by complex motion and severe occlusion.
△ Less
Submitted 17 August, 2024;
originally announced August 2024.
-
LoRA$^2$ : Multi-Scale Low-Rank Approximations for Fine-Tuning Large Language Models
Authors:
Jia-Chen Zhang,
Yu-Jie Xiong,
He-Xi Qiu,
Dong-Hai Zhu,
Chun-Ming Xia
Abstract:
Fine-tuning large language models (LLMs) with high parameter efficiency for downstream tasks has become a new paradigm. Low-Rank Adaptation (LoRA) significantly reduces the number of trainable parameters for fine-tuning. Although it has demonstrated commendable performance, updating parameters within a single scale may not be the optimal choice for complex downstream tasks.In this paper, we extend…
▽ More
Fine-tuning large language models (LLMs) with high parameter efficiency for downstream tasks has become a new paradigm. Low-Rank Adaptation (LoRA) significantly reduces the number of trainable parameters for fine-tuning. Although it has demonstrated commendable performance, updating parameters within a single scale may not be the optimal choice for complex downstream tasks.In this paper, we extend the LoRA to multiple scales, dubbed as LoRA$^2$. We first combine orthogonal projection theory to train a set of LoRAs in two mutually orthogonal planes. Then, we improve the importance score algorithm, which reduce parameter sensitivity score calculations by approximately 98.5\%. By pruning singular values with lower importance scores, thereby enhancing adaptability to various downstream tasks. Extensive experiments are conducted on two widely used pre-trained models to validate the effectiveness of LoRA$^2$. Results show that it significantly reduces the number of trainable parameters to just 0.72\% compared to full fine-tuning, while still delivering highly impressive performance. Even when the parameters are further reduced to 0.17M, it still achieves comparable results to the baseline with 8 times more parameters. Our code is available here: https://anonymous.4open.science/r/LoRA-2-5B4C
△ Less
Submitted 13 August, 2024;
originally announced August 2024.
-
Diffusion Model-based Contrastive Learning for Human Activity Recognition
Authors:
Chunjing Xiao,
Yanhui Han,
Wei Yang,
Yane Hou,
Fangzhan Shi,
Kevin Chetty
Abstract:
WiFi Channel State Information (CSI)-based activity recognition has sparked numerous studies due to its widespread availability and privacy protection. However, when applied in practical applications, general CSI-based recognition models may face challenges related to the limited generalization capability, since individuals with different behavior habits will cause various fluctuations in CSI data…
▽ More
WiFi Channel State Information (CSI)-based activity recognition has sparked numerous studies due to its widespread availability and privacy protection. However, when applied in practical applications, general CSI-based recognition models may face challenges related to the limited generalization capability, since individuals with different behavior habits will cause various fluctuations in CSI data and it is difficult to gather enough training data to cover all kinds of motion habits. To tackle this problem, we design a diffusion model-based Contrastive Learning framework for human Activity Recognition (CLAR) using WiFi CSI. On the basis of the contrastive learning framework, we primarily introduce two components for CLAR to enhance CSI-based activity recognition. To generate diverse augmented data and complement limited training data, we propose a diffusion model-based time series-specific augmentation model. In contrast to typical diffusion models that directly apply conditions to the generative process, potentially resulting in distorted CSI data, our tailored model dissects these condition into the high-frequency and low-frequency components, and then applies these conditions to the generative process with varying weights. This can alleviate data distortion and yield high-quality augmented data. To efficiently capture the difference of the sample importance, we present an adaptive weight algorithm. Different from typical contrastive learning methods which equally consider all the training samples, this algorithm adaptively adjusts the weights of positive sample pairs for learning better data representations. The experiments suggest that CLAR achieves significant gains compared to state-of-the-art methods.
△ Less
Submitted 10 August, 2024;
originally announced August 2024.
-
Automatic String Data Validation with Pattern Discovery
Authors:
Xinwei Lin,
Jing Zhao,
Peng Di,
Chuan Xiao,
Rui Mao,
Yan Ji,
Makoto Onizuka,
Zishuo Ding,
Weiyi Shang,
Jianbin Qin
Abstract:
In enterprise data pipelines, data insertions occur periodically and may impact downstream services if data quality issues are not addressed. Typically, such problems can be investigated and fixed by on-call engineers, but locating the cause of such problems and fixing errors are often time-consuming. Therefore, automatic data validation is a better solution to defend the system and downstream ser…
▽ More
In enterprise data pipelines, data insertions occur periodically and may impact downstream services if data quality issues are not addressed. Typically, such problems can be investigated and fixed by on-call engineers, but locating the cause of such problems and fixing errors are often time-consuming. Therefore, automatic data validation is a better solution to defend the system and downstream services by enabling early detection of errors and providing detailed error messages for quick resolution. This paper proposes a self-validate data management system with automatic pattern discovery techniques to verify the correctness of semi-structural string data in enterprise data pipelines. Our solution extracts patterns from historical data and detects erroneous incoming data in a top-down fashion. High-level information of historical data is analyzed to discover the format skeleton of correct values. Fine-grained semantic patterns are then extracted to strike a balance between generalization and specification of the discovered pattern, thus covering as many correct values as possible while avoiding over-fitting. To tackle cold start and rapid data growth, we propose an incremental update strategy and example generalization strategy. Experiments on large-scale industrial and public datasets demonstrate the effectiveness and efficiency of our method compared to alternative solutions. Furthermore, a case study on an industrial platform (Ant Group Inc.) with thousands of applications shows that our system captures meaningful data patterns in daily operations and helps engineers quickly identify errors.
△ Less
Submitted 6 August, 2024;
originally announced August 2024.
-
A Multi-class Ride-hailing Service Subsidy System Utilizing Deep Causal Networks
Authors:
Zhe Yu,
Chi Xia,
Shaosheng Cao,
Lin Zhou
Abstract:
In the ride-hailing industry, subsidies are predominantly employed to incentivize consumers to place more orders, thereby fostering market growth. Causal inference techniques are employed to estimate the consumer elasticity with different subsidy levels. However, the presence of confounding effects poses challenges in achieving an unbiased estimate of the uplift effect. We introduce a consumer sub…
▽ More
In the ride-hailing industry, subsidies are predominantly employed to incentivize consumers to place more orders, thereby fostering market growth. Causal inference techniques are employed to estimate the consumer elasticity with different subsidy levels. However, the presence of confounding effects poses challenges in achieving an unbiased estimate of the uplift effect. We introduce a consumer subsidizing system to capture relationships between subsidy propensity and the treatment effect, which proves effective while maintaining a lightweight online environment.
△ Less
Submitted 4 August, 2024;
originally announced August 2024.
-
IDNet: A Novel Dataset for Identity Document Analysis and Fraud Detection
Authors:
Hong Guan,
Yancheng Wang,
Lulu Xie,
Soham Nag,
Rajeev Goel,
Niranjan Erappa Narayana Swamy,
Yingzhen Yang,
Chaowei Xiao,
Jonathan Prisby,
Ross Maciejewski,
Jia Zou
Abstract:
Effective fraud detection and analysis of government-issued identity documents, such as passports, driver's licenses, and identity cards, are essential in thwarting identity theft and bolstering security on online platforms. The training of accurate fraud detection and analysis tools depends on the availability of extensive identity document datasets. However, current publicly available benchmark…
▽ More
Effective fraud detection and analysis of government-issued identity documents, such as passports, driver's licenses, and identity cards, are essential in thwarting identity theft and bolstering security on online platforms. The training of accurate fraud detection and analysis tools depends on the availability of extensive identity document datasets. However, current publicly available benchmark datasets for identity document analysis, including MIDV-500, MIDV-2020, and FMIDV, fall short in several respects: they offer a limited number of samples, cover insufficient varieties of fraud patterns, and seldom include alterations in critical personal identifying fields like portrait images, limiting their utility in training models capable of detecting realistic frauds while preserving privacy.
In response to these shortcomings, our research introduces a new benchmark dataset, IDNet, designed to advance privacy-preserving fraud detection efforts. The IDNet dataset comprises 837,060 images of synthetically generated identity documents, totaling approximately 490 gigabytes, categorized into 20 types from $10$ U.S. states and 10 European countries. We evaluate the utility and present use cases of the dataset, illustrating how it can aid in training privacy-preserving fraud detection methods, facilitating the generation of camera and video capturing of identity documents, and testing schema unification and other identity document management functionalities.
△ Less
Submitted 3 September, 2024; v1 submitted 3 August, 2024;
originally announced August 2024.
-
PGNeXt: High-Resolution Salient Object Detection via Pyramid Grafting Network
Authors:
Changqun Xia,
Chenxi Xie,
Zhentao He,
Tianshu Yu,
Jia Li
Abstract:
We present an advanced study on more challenging high-resolution salient object detection (HRSOD) from both dataset and network framework perspectives. To compensate for the lack of HRSOD dataset, we thoughtfully collect a large-scale high resolution salient object detection dataset, called UHRSD, containing 5,920 images from real-world complex scenarios at 4K-8K resolutions. All the images are fi…
▽ More
We present an advanced study on more challenging high-resolution salient object detection (HRSOD) from both dataset and network framework perspectives. To compensate for the lack of HRSOD dataset, we thoughtfully collect a large-scale high resolution salient object detection dataset, called UHRSD, containing 5,920 images from real-world complex scenarios at 4K-8K resolutions. All the images are finely annotated in pixel-level, far exceeding previous low-resolution SOD datasets. Aiming at overcoming the contradiction between the sampling depth and the receptive field size in the past methods, we propose a novel one-stage framework for HR-SOD task using pyramid grafting mechanism. In general, transformer-based and CNN-based backbones are adopted to extract features from different resolution images independently and then these features are grafted from transformer branch to CNN branch. An attention-based Cross-Model Grafting Module (CMGM) is proposed to enable CNN branch to combine broken detailed information more holistically, guided by different source feature during decoding process. Moreover, we design an Attention Guided Loss (AGL) to explicitly supervise the attention matrix generated by CMGM to help the network better interact with the attention from different branches. Comprehensive experiments on UHRSD and widely-used SOD datasets demonstrate that our method can simultaneously locate salient object and preserve rich details, outperforming state-of-the-art methods. To verify the generalization ability of the proposed framework, we apply it to the camouflaged object detection (COD) task. Notably, our method performs superior to most state-of-the-art COD methods without bells and whistles.
△ Less
Submitted 2 August, 2024;
originally announced August 2024.
-
Can Editing LLMs Inject Harm?
Authors:
Canyu Chen,
Baixiang Huang,
Zekun Li,
Zhaorun Chen,
Shiyang Lai,
Xiongxiao Xu,
Jia-Chen Gu,
Jindong Gu,
Huaxiu Yao,
Chaowei Xiao,
Xifeng Yan,
William Yang Wang,
Philip Torr,
Dawn Song,
Kai Shu
Abstract:
Knowledge editing has been increasingly adopted to correct the false or outdated knowledge in Large Language Models (LLMs). Meanwhile, one critical but under-explored question is: can knowledge editing be used to inject harm into LLMs? In this paper, we propose to reformulate knowledge editing as a new type of safety threat for LLMs, namely Editing Attack, and conduct a systematic investigation wi…
▽ More
Knowledge editing has been increasingly adopted to correct the false or outdated knowledge in Large Language Models (LLMs). Meanwhile, one critical but under-explored question is: can knowledge editing be used to inject harm into LLMs? In this paper, we propose to reformulate knowledge editing as a new type of safety threat for LLMs, namely Editing Attack, and conduct a systematic investigation with a newly constructed dataset EditAttack. Specifically, we focus on two typical safety risks of Editing Attack including Misinformation Injection and Bias Injection. For the risk of misinformation injection, we first categorize it into commonsense misinformation injection and long-tail misinformation injection. Then, we find that editing attacks can inject both types of misinformation into LLMs, and the effectiveness is particularly high for commonsense misinformation injection. For the risk of bias injection, we discover that not only can biased sentences be injected into LLMs with high effectiveness, but also one single biased sentence injection can cause a bias increase in general outputs of LLMs, which are even highly irrelevant to the injected sentence, indicating a catastrophic impact on the overall fairness of LLMs. Then, we further illustrate the high stealthiness of editing attacks, measured by their impact on the general knowledge and reasoning capacities of LLMs, and show the hardness of defending editing attacks with empirical evidence. Our discoveries demonstrate the emerging misuse risks of knowledge editing techniques on compromising the safety alignment of LLMs and the feasibility of disseminating misinformation or bias with LLMs as new channels.
△ Less
Submitted 16 August, 2024; v1 submitted 29 July, 2024;
originally announced July 2024.
-
Motif-Consistent Counterfactuals with Adversarial Refinement for Graph-Level Anomaly Detection
Authors:
Chunjing Xiao,
Shikang Pang,
Wenxin Tai,
Yanlong Huang,
Goce Trajcevski,
Fan Zhou
Abstract:
Graph-level anomaly detection is significant in diverse domains. To improve detection performance, counterfactual graphs have been exploited to benefit the generalization capacity by learning causal relations. Most existing studies directly introduce perturbations (e.g., flipping edges) to generate counterfactual graphs, which are prone to alter the semantics of generated examples and make them of…
▽ More
Graph-level anomaly detection is significant in diverse domains. To improve detection performance, counterfactual graphs have been exploited to benefit the generalization capacity by learning causal relations. Most existing studies directly introduce perturbations (e.g., flipping edges) to generate counterfactual graphs, which are prone to alter the semantics of generated examples and make them off the data manifold, resulting in sub-optimal performance. To address these issues, we propose a novel approach, Motif-consistent Counterfactuals with Adversarial Refinement (MotifCAR), for graph-level anomaly detection. The model combines the motif of one graph, the core subgraph containing the identification (category) information, and the contextual subgraph (non-motif) of another graph to produce a raw counterfactual graph. However, the produced raw graph might be distorted and cannot satisfy the important counterfactual properties: Realism, Validity, Proximity and Sparsity. Towards that, we present a Generative Adversarial Network (GAN)-based graph optimizer to refine the raw counterfactual graphs. It adopts the discriminator to guide the generator to generate graphs close to realistic data, i.e., meet the property Realism. Further, we design the motif consistency to force the motif of the generated graphs to be consistent with the realistic graphs, meeting the property Validity. Also, we devise the contextual loss and connection loss to control the contextual subgraph and the newly added links to meet the properties Proximity and Sparsity. As a result, the model can generate high-quality counterfactual graphs. Experiments demonstrate the superiority of MotifCAR.
△ Less
Submitted 18 July, 2024;
originally announced July 2024.
-
Translate-and-Revise: Boosting Large Language Models for Constrained Translation
Authors:
Pengcheng Huang,
Yongyu Mu,
Yuzhang Wu,
Bei Li,
Chunyang Xiao,
Tong Xiao,
Jingbo Zhu
Abstract:
Imposing constraints on machine translation systems presents a challenging issue because these systems are not trained to make use of constraints in generating adequate, fluent translations. In this paper, we leverage the capabilities of large language models (LLMs) for constrained translation, given that LLMs can easily adapt to this task by taking translation instructions and constraints as prom…
▽ More
Imposing constraints on machine translation systems presents a challenging issue because these systems are not trained to make use of constraints in generating adequate, fluent translations. In this paper, we leverage the capabilities of large language models (LLMs) for constrained translation, given that LLMs can easily adapt to this task by taking translation instructions and constraints as prompts. However, LLMs cannot always guarantee the adequacy of translation, and, in some cases, ignore the given constraints. This is in part because LLMs might be overly confident in their predictions, overriding the influence of the constraints. To overcome this overiding behaviour, we propose to add a revision process that encourages LLMs to correct the outputs by prompting them about the constraints that have not yet been met. We evaluate our approach on four constrained translation tasks, encompassing both lexical and structural constraints in multiple constraint domains. Experiments show 15\% improvement in constraint-based translation accuracy over standard LLMs and the approach also significantly outperforms neural machine translation (NMT) state-of-the-art methods.
△ Less
Submitted 18 July, 2024;
originally announced July 2024.
-
AgentPoison: Red-teaming LLM Agents via Poisoning Memory or Knowledge Bases
Authors:
Zhaorun Chen,
Zhen Xiang,
Chaowei Xiao,
Dawn Song,
Bo Li
Abstract:
LLM agents have demonstrated remarkable performance across various applications, primarily due to their advanced capabilities in reasoning, utilizing external knowledge and tools, calling APIs, and executing actions to interact with environments. Current agents typically utilize a memory module or a retrieval-augmented generation (RAG) mechanism, retrieving past knowledge and instances with simila…
▽ More
LLM agents have demonstrated remarkable performance across various applications, primarily due to their advanced capabilities in reasoning, utilizing external knowledge and tools, calling APIs, and executing actions to interact with environments. Current agents typically utilize a memory module or a retrieval-augmented generation (RAG) mechanism, retrieving past knowledge and instances with similar embeddings from knowledge bases to inform task planning and execution. However, the reliance on unverified knowledge bases raises significant concerns about their safety and trustworthiness. To uncover such vulnerabilities, we propose a novel red teaming approach AgentPoison, the first backdoor attack targeting generic and RAG-based LLM agents by poisoning their long-term memory or RAG knowledge base. In particular, we form the trigger generation process as a constrained optimization to optimize backdoor triggers by mapping the triggered instances to a unique embedding space, so as to ensure that whenever a user instruction contains the optimized backdoor trigger, the malicious demonstrations are retrieved from the poisoned memory or knowledge base with high probability. In the meantime, benign instructions without the trigger will still maintain normal performance. Unlike conventional backdoor attacks, AgentPoison requires no additional model training or fine-tuning, and the optimized backdoor trigger exhibits superior transferability, in-context coherence, and stealthiness. Extensive experiments demonstrate AgentPoison's effectiveness in attacking three types of real-world LLM agents: RAG-based autonomous driving agent, knowledge-intensive QA agent, and healthcare EHRAgent. On each agent, AgentPoison achieves an average attack success rate higher than 80% with minimal impact on benign performance (less than 1%) with a poison rate less than 0.1%.
△ Less
Submitted 17 July, 2024;
originally announced July 2024.
-
LLMBox: A Comprehensive Library for Large Language Models
Authors:
Tianyi Tang,
Yiwen Hu,
Bingqian Li,
Wenyang Luo,
Zijing Qin,
Haoxiang Sun,
Jiapeng Wang,
Shiyi Xu,
Xiaoxue Cheng,
Geyang Guo,
Han Peng,
Bowen Zheng,
Yiru Tang,
Yingqian Min,
Yushuo Chen,
Jie Chen,
Yuanqian Zhao,
Luran Ding,
Yuhao Wang,
Zican Dong,
Chunxuan Xia,
Junyi Li,
Kun Zhou,
Wayne Xin Zhao,
Ji-Rong Wen
Abstract:
To facilitate the research on large language models (LLMs), this paper presents a comprehensive and unified library, LLMBox, to ease the development, use, and evaluation of LLMs. This library is featured with three main merits: (1) a unified data interface that supports the flexible implementation of various training strategies, (2) a comprehensive evaluation that covers extensive tasks, datasets,…
▽ More
To facilitate the research on large language models (LLMs), this paper presents a comprehensive and unified library, LLMBox, to ease the development, use, and evaluation of LLMs. This library is featured with three main merits: (1) a unified data interface that supports the flexible implementation of various training strategies, (2) a comprehensive evaluation that covers extensive tasks, datasets, and models, and (3) more practical consideration, especially on user-friendliness and efficiency. With our library, users can easily reproduce existing methods, train new models, and conduct comprehensive performance comparisons. To rigorously test LLMBox, we conduct extensive experiments in a diverse coverage of evaluation settings, and experimental results demonstrate the effectiveness and efficiency of our library in supporting various implementations related to LLMs. The detailed introduction and usage guidance can be found at https://github.com/RUCAIBox/LLMBox.
△ Less
Submitted 7 July, 2024;
originally announced July 2024.
-
BFLN: A Blockchain-based Federated Learning Model for Non-IID Data
Authors:
Yang Li,
Chunhe Xia,
Dongchi Huang,
Xiaojian Li,
Tianbo Wang
Abstract:
As the application of federated learning becomes increasingly widespread, the issue of imbalanced training data distribution has emerged as a significant challenge. Federated learning utilizes local data stored on different training clients for model training, rather than centralizing data on a server, thereby greatly enhancing the privacy and security of training data. However, the distribution o…
▽ More
As the application of federated learning becomes increasingly widespread, the issue of imbalanced training data distribution has emerged as a significant challenge. Federated learning utilizes local data stored on different training clients for model training, rather than centralizing data on a server, thereby greatly enhancing the privacy and security of training data. However, the distribution of training data across different clients may be imbalanced, with different categories of data potentially residing on different clients. This presents a challenge to traditional federated learning, which assumes data distribution is independent and identically distributed (IID). This paper proposes a Blockchain-based Federated Learning Model for Non-IID Data (BFLN), which combines federated learning with blockchain technology. By introducing a new aggregation method and incentive algorithm, BFLN enhances the model performance of federated learning on non-IID data. Experiments on public datasets demonstrate that, compared to other state-of-the-art models, BFLN improves training accuracy and provides a sustainable incentive mechanism for personalized federated learning.
△ Less
Submitted 10 July, 2024; v1 submitted 7 July, 2024;
originally announced July 2024.
-
Hindsight Preference Learning for Offline Preference-based Reinforcement Learning
Authors:
Chen-Xiao Gao,
Shengjun Fang,
Chenjun Xiao,
Yang Yu,
Zongzhang Zhang
Abstract:
Offline preference-based reinforcement learning (RL), which focuses on optimizing policies using human preferences between pairs of trajectory segments selected from an offline dataset, has emerged as a practical avenue for RL applications. Existing works rely on extracting step-wise reward signals from trajectory-wise preference annotations, assuming that preferences correlate with the cumulative…
▽ More
Offline preference-based reinforcement learning (RL), which focuses on optimizing policies using human preferences between pairs of trajectory segments selected from an offline dataset, has emerged as a practical avenue for RL applications. Existing works rely on extracting step-wise reward signals from trajectory-wise preference annotations, assuming that preferences correlate with the cumulative Markovian rewards. However, such methods fail to capture the holistic perspective of data annotation: Humans often assess the desirability of a sequence of actions by considering the overall outcome rather than the immediate rewards. To address this challenge, we propose to model human preferences using rewards conditioned on future outcomes of the trajectory segments, i.e. the hindsight information. For downstream RL optimization, the reward of each step is calculated by marginalizing over possible future outcomes, the distribution of which is approximated by a variational auto-encoder trained using the offline dataset. Our proposed method, Hindsight Preference Learning (HPL), can facilitate credit assignment by taking full advantage of vast trajectory data available in massive unlabeled datasets. Comprehensive empirical studies demonstrate the benefits of HPL in delivering robust and advantageous rewards across various domains. Our code is publicly released at https://github.com/typoverflow/WiseRL.
△ Less
Submitted 5 July, 2024;
originally announced July 2024.
-
Counterfactual Data Augmentation with Denoising Diffusion for Graph Anomaly Detection
Authors:
Chunjing Xiao,
Shikang Pang,
Xovee Xu,
Xuan Li,
Goce Trajcevski,
Fan Zhou
Abstract:
A critical aspect of Graph Neural Networks (GNNs) is to enhance the node representations by aggregating node neighborhood information. However, when detecting anomalies, the representations of abnormal nodes are prone to be averaged by normal neighbors, making the learned anomaly representations less distinguishable. To tackle this issue, we propose CAGAD -- an unsupervised Counterfactual data Aug…
▽ More
A critical aspect of Graph Neural Networks (GNNs) is to enhance the node representations by aggregating node neighborhood information. However, when detecting anomalies, the representations of abnormal nodes are prone to be averaged by normal neighbors, making the learned anomaly representations less distinguishable. To tackle this issue, we propose CAGAD -- an unsupervised Counterfactual data Augmentation method for Graph Anomaly Detection -- which introduces a graph pointer neural network as the heterophilic node detector to identify potential anomalies whose neighborhoods are normal-node-dominant. For each identified potential anomaly, we design a graph-specific diffusion model to translate a part of its neighbors, which are probably normal, into anomalous ones. At last, we involve these translated neighbors in GNN neighborhood aggregation to produce counterfactual representations of anomalies. Through aggregating the translated anomalous neighbors, counterfactual representations become more distinguishable and further advocate detection performance. The experimental results on four datasets demonstrate that CAGAD significantly outperforms strong baselines, with an average improvement of 2.35% on F1, 2.53% on AUC-ROC, and 2.79% on AUC-PR.
△ Less
Submitted 2 July, 2024;
originally announced July 2024.
-
Agentless: Demystifying LLM-based Software Engineering Agents
Authors:
Chunqiu Steven Xia,
Yinlin Deng,
Soren Dunn,
Lingming Zhang
Abstract:
Recent advancements in large language models (LLMs) have significantly advanced the automation of software development tasks, including code synthesis, program repair, and test generation. More recently, researchers and industry practitioners have developed various autonomous LLM agents to perform end-to-end software development tasks. These agents are equipped with the ability to use tools, run c…
▽ More
Recent advancements in large language models (LLMs) have significantly advanced the automation of software development tasks, including code synthesis, program repair, and test generation. More recently, researchers and industry practitioners have developed various autonomous LLM agents to perform end-to-end software development tasks. These agents are equipped with the ability to use tools, run commands, observe feedback from the environment, and plan for future actions. However, the complexity of these agent-based approaches, together with the limited abilities of current LLMs, raises the following question: Do we really have to employ complex autonomous software agents? To attempt to answer this question, we build Agentless -- an agentless approach to automatically solve software development problems. Compared to the verbose and complex setup of agent-based approaches, Agentless employs a simplistic two-phase process of localization followed by repair, without letting the LLM decide future actions or operate with complex tools. Our results on the popular SWE-bench Lite benchmark show that surprisingly the simplistic Agentless is able to achieve both the highest performance (27.33%) and lowest cost (\$0.34) compared with all existing open-source software agents! Furthermore, we manually classified the problems in SWE-bench Lite and found problems with exact ground truth patch or insufficient/misleading issue descriptions. As such, we construct SWE-bench Lite-S by excluding such problematic issues to perform more rigorous evaluation and comparison. Our work highlights the current overlooked potential of a simple, interpretable technique in autonomous software development. We hope Agentless will help reset the baseline, starting point, and horizon for autonomous software agents, and inspire future work along this crucial direction.
△ Less
Submitted 1 July, 2024;
originally announced July 2024.
-
TrialBench: Multi-Modal Artificial Intelligence-Ready Clinical Trial Datasets
Authors:
Jintai Chen,
Yaojun Hu,
Yue Wang,
Yingzhou Lu,
Xu Cao,
Miao Lin,
Hongxia Xu,
Jian Wu,
Cao Xiao,
Jimeng Sun,
Lucas Glass,
Kexin Huang,
Marinka Zitnik,
Tianfan Fu
Abstract:
Clinical trials are pivotal for developing new medical treatments, yet they typically pose some risks such as patient mortality, adverse events, and enrollment failure that waste immense efforts spanning over a decade. Applying artificial intelligence (AI) to forecast or simulate key events in clinical trials holds great potential for providing insights to guide trial designs. However, complex dat…
▽ More
Clinical trials are pivotal for developing new medical treatments, yet they typically pose some risks such as patient mortality, adverse events, and enrollment failure that waste immense efforts spanning over a decade. Applying artificial intelligence (AI) to forecast or simulate key events in clinical trials holds great potential for providing insights to guide trial designs. However, complex data collection and question definition requiring medical expertise and a deep understanding of trial designs have hindered the involvement of AI thus far. This paper tackles these challenges by presenting a comprehensive suite of meticulously curated AIready datasets covering multi-modal data (e.g., drug molecule, disease code, text, categorical/numerical features) and 8 crucial prediction challenges in clinical trial design, encompassing prediction of trial duration, patient dropout rate, serious adverse event, mortality rate, trial approval outcome, trial failure reason, drug dose finding, design of eligibility criteria. Furthermore, we provide basic validation methods for each task to ensure the datasets' usability and reliability. We anticipate that the availability of such open-access datasets will catalyze the development of advanced AI approaches for clinical trial design, ultimately advancing clinical trial research and accelerating medical solution development. The curated dataset, metrics, and basic models are publicly available at https://github.com/ML2Health/ML2ClinicalTrials/tree/main/AI4Trial.
△ Less
Submitted 3 September, 2024; v1 submitted 30 June, 2024;
originally announced July 2024.
-
Consistency Purification: Effective and Efficient Diffusion Purification towards Certified Robustness
Authors:
Yiquan Li,
Zhongzhu Chen,
Kun Jin,
Jiongxiao Wang,
Bo Li,
Chaowei Xiao
Abstract:
Diffusion Purification, purifying noised images with diffusion models, has been widely used for enhancing certified robustness via randomized smoothing. However, existing frameworks often grapple with the balance between efficiency and effectiveness. While the Denoising Diffusion Probabilistic Model (DDPM) offers an efficient single-step purification, it falls short in ensuring purified images res…
▽ More
Diffusion Purification, purifying noised images with diffusion models, has been widely used for enhancing certified robustness via randomized smoothing. However, existing frameworks often grapple with the balance between efficiency and effectiveness. While the Denoising Diffusion Probabilistic Model (DDPM) offers an efficient single-step purification, it falls short in ensuring purified images reside on the data manifold. Conversely, the Stochastic Diffusion Model effectively places purified images on the data manifold but demands solving cumbersome stochastic differential equations, while its derivative, the Probability Flow Ordinary Differential Equation (PF-ODE), though solving simpler ordinary differential equations, still requires multiple computational steps. In this work, we demonstrated that an ideal purification pipeline should generate the purified images on the data manifold that are as much semantically aligned to the original images for effectiveness in one step for efficiency. Therefore, we introduced Consistency Purification, an efficiency-effectiveness Pareto superior purifier compared to the previous work. Consistency Purification employs the consistency model, a one-step generative model distilled from PF-ODE, thus can generate on-manifold purified images with a single network evaluation. However, the consistency model is designed not for purification thus it does not inherently ensure semantic alignment between purified and original images. To resolve this issue, we further refine it through Consistency Fine-tuning with LPIPS loss, which enables more aligned semantic meaning while keeping the purified images on data manifold. Our comprehensive experiments demonstrate that our Consistency Purification framework achieves state-of the-art certified robustness and efficiency compared to baseline methods.
△ Less
Submitted 30 June, 2024;
originally announced July 2024.
-
BioMNER: A Dataset for Biomedical Method Entity Recognition
Authors:
Chen Tang,
Bohao Yang,
Kun Zhao,
Bo Lv,
Chenghao Xiao,
Frank Guerin,
Chenghua Lin
Abstract:
Named entity recognition (NER) stands as a fundamental and pivotal task within the realm of Natural Language Processing. Particularly within the domain of Biomedical Method NER, this task presents notable challenges, stemming from the continual influx of domain-specific terminologies in scholarly literature. Current research in Biomedical Method (BioMethod) NER suffers from a scarcity of resources…
▽ More
Named entity recognition (NER) stands as a fundamental and pivotal task within the realm of Natural Language Processing. Particularly within the domain of Biomedical Method NER, this task presents notable challenges, stemming from the continual influx of domain-specific terminologies in scholarly literature. Current research in Biomedical Method (BioMethod) NER suffers from a scarcity of resources, primarily attributed to the intricate nature of methodological concepts, which necessitate a profound understanding for precise delineation. In this study, we propose a novel dataset for biomedical method entity recognition, employing an automated BioMethod entity recognition and information retrieval system to assist human annotation. Furthermore, we comprehensively explore a range of conventional and contemporary open-domain NER methodologies, including the utilization of cutting-edge large-scale language models (LLMs) customised to our dataset. Our empirical findings reveal that the large parameter counts of language models surprisingly inhibit the effective assimilation of entity extraction patterns pertaining to biomedical methods. Remarkably, the approach, leveraging the modestly sized ALBERT model (only 11MB), in conjunction with conditional random fields (CRF), achieves state-of-the-art (SOTA) performance.
△ Less
Submitted 28 June, 2024;
originally announced June 2024.
-
UniGen: A Unified Framework for Textual Dataset Generation Using Large Language Models
Authors:
Siyuan Wu,
Yue Huang,
Chujie Gao,
Dongping Chen,
Qihui Zhang,
Yao Wan,
Tianyi Zhou,
Xiangliang Zhang,
Jianfeng Gao,
Chaowei Xiao,
Lichao Sun
Abstract:
Large Language Models (LLMs) such as GPT-4 and Llama3 have significantly impacted various fields by enabling high-quality synthetic data generation and reducing dependence on expensive human-generated datasets. Despite this, challenges remain in the areas of generalization, controllability, diversity, and truthfulness within the existing generative frameworks. To address these challenges, this pap…
▽ More
Large Language Models (LLMs) such as GPT-4 and Llama3 have significantly impacted various fields by enabling high-quality synthetic data generation and reducing dependence on expensive human-generated datasets. Despite this, challenges remain in the areas of generalization, controllability, diversity, and truthfulness within the existing generative frameworks. To address these challenges, this paper presents UniGen, a comprehensive LLM-powered framework designed to produce diverse, accurate, and highly controllable datasets. UniGen is adaptable, supporting all types of text datasets and enhancing the generative process through innovative mechanisms. To augment data diversity, UniGen incorporates an attribute-guided generation module and a group checking feature. For accuracy, it employs a code-based mathematical assessment for label verification alongside a retrieval-augmented generation technique for factual validation. The framework also allows for user-specified constraints, enabling customization of the data generation process to suit particular requirements. Extensive experiments demonstrate the superior quality of data generated by UniGen, and each module within UniGen plays a critical role in this enhancement. Additionally, UniGen is applied in two practical scenarios: benchmarking LLMs and data augmentation. The results indicate that UniGen effectively supports dynamic and evolving benchmarking, and that data augmentation improves LLM capabilities in various domains, including agent-oriented abilities and reasoning skills.
△ Less
Submitted 22 August, 2024; v1 submitted 27 June, 2024;
originally announced June 2024.
-
CompassDB: Pioneering High-Performance Key-Value Store with Perfect Hash
Authors:
Jin Jiang,
Dongsheng He,
Yu Hu,
Dong Liu,
Chenfan Xiao,
Hongxiao Bi,
Yusong Zhang,
Chaoqu Jiang,
Zhijun Fu
Abstract:
Modern mainstream persistent key-value storage engines utilize Log-Structured Merge tree (LSM-tree) based designs, optimizing read/write performance by leveraging sequential disk I/O. However, the advent of SSDs, with their significant improvements in bandwidth and IOPS, shifts the bottleneck from I/O to CPU. The high compaction cost and large read/write amplification associated with LSM trees hav…
▽ More
Modern mainstream persistent key-value storage engines utilize Log-Structured Merge tree (LSM-tree) based designs, optimizing read/write performance by leveraging sequential disk I/O. However, the advent of SSDs, with their significant improvements in bandwidth and IOPS, shifts the bottleneck from I/O to CPU. The high compaction cost and large read/write amplification associated with LSM trees have become critical bottlenecks. In this paper, we introduce CompassDB, which utilizes a Two-tier Perfect Hash Table (TPH) design to significantly decrease read/write amplification and compaction costs. CompassDB utilizes a perfect hash algorithm for its in-memory index, resulting in an average index cost of about 6 bytes per key-value pair. This compact index reduces the lookup time complexity from $O(log N)$ to $O(1)$ and decreases the overall cost. Consequently, it allows for the storage of more key-value pairs for reads or provides additional memory for the memtable for writes. This results in substantial improvements in both throughput and latency. Our evaluation using the YCSB benchmark tool shows that CompassDB increases throughput by 2.5x to 4x compared to RocksDB, and by 5x to 17x compared to PebblesDB across six typical workloads. Additionally, CompassDB significantly reduces average and 99th percentile read/write latency, achieving a 50% to 85% reduction in comparison to RocksDB.
△ Less
Submitted 26 June, 2024;
originally announced June 2024.
-
Crafting Customisable Characters with LLMs: Introducing SimsChat, a Persona-Driven Role-Playing Agent Framework
Authors:
Bohao Yang,
Dong Liu,
Chen Tang,
Chenghao Xiao,
Kun Zhao,
Chao Li,
Lin Yuan,
Guang Yang,
Lanxiao Huang,
Chenghua Lin
Abstract:
Large Language Models (LLMs) demonstrate a remarkable ability to comprehend human instructions and generate high-quality text. This capability allows LLMs to function as agents that can emulate human beings at a more sophisticated level, beyond the mere replication of basic human behaviours. However, there is a lack of exploring into leveraging LLMs to craft characters from diverse aspects. In thi…
▽ More
Large Language Models (LLMs) demonstrate a remarkable ability to comprehend human instructions and generate high-quality text. This capability allows LLMs to function as agents that can emulate human beings at a more sophisticated level, beyond the mere replication of basic human behaviours. However, there is a lack of exploring into leveraging LLMs to craft characters from diverse aspects. In this work, we introduce the Customisable Conversation Agent Framework, which leverages LLMs to simulate real-world characters that can be freely customised according to various user preferences. This adaptable framework is beneficial for the design of customisable characters and role-playing agents aligned with human preferences. We propose the SimsConv dataset, which encompasses 68 different customised characters, 1,360 multi-turn role-playing dialogues, and a total of 13,971 interaction dialogues. The characters are created from several real-world elements, such as career, aspiration, trait, and skill. Building upon these foundations, we present SimsChat, a freely customisable role-playing agent. It incorporates diverse real-world scenes and topic-specific character interaction dialogues, thereby simulating characters' life experiences in various scenarios and topic-specific interactions with specific emotions. Experimental results indicate that our proposed framework achieves desirable performance and provides a valuable guideline for the construction of more accurate human simulacra in the future. Our data and code are publicly available at https://github.com/Bernard-Yang/SimsChat.
△ Less
Submitted 16 August, 2024; v1 submitted 25 June, 2024;
originally announced June 2024.
-
X-ray Made Simple: Radiology Report Generation and Evaluation with Layman's Terms
Authors:
Kun Zhao,
Chenghao Xiao,
Chen Tang,
Bohao Yang,
Kai Ye,
Noura Al Moubayed,
Liang Zhan,
Chenghua Lin
Abstract:
Radiology Report Generation (RRG) has achieved significant progress with the advancements of multimodal generative models. However, the evaluation in the domain suffers from a lack of fair and robust metrics. We reveal that, high performance on RRG with existing lexical-based metrics (e.g. BLEU) might be more of a mirage - a model can get a high BLEU only by learning the template of reports. This…
▽ More
Radiology Report Generation (RRG) has achieved significant progress with the advancements of multimodal generative models. However, the evaluation in the domain suffers from a lack of fair and robust metrics. We reveal that, high performance on RRG with existing lexical-based metrics (e.g. BLEU) might be more of a mirage - a model can get a high BLEU only by learning the template of reports. This has become an urgent problem for RRG due to the highly patternized nature of these reports. In this work, we un-intuitively approach this problem by proposing the Layman's RRG framework, a layman's terms-based dataset, evaluation and training framework that systematically improves RRG with day-to-day language. We first contribute the translated Layman's terms dataset. Building upon the dataset, we then propose a semantics-based evaluation method, which is proved to mitigate the inflated numbers of BLEU and provides fairer evaluation. Last, we show that training on the layman's terms dataset encourages models to focus on the semantics of the reports, as opposed to overfitting to learning the report templates. We reveal a promising scaling law between the number of training examples and semantics gain provided by our dataset, compared to the inverse pattern brought by the original formats. Our code is available at \url{https://github.com/hegehongcha/LaymanRRG}.
△ Less
Submitted 30 June, 2024; v1 submitted 25 June, 2024;
originally announced June 2024.
-
LLMs Assist NLP Researchers: Critique Paper (Meta-)Reviewing
Authors:
Jiangshu Du,
Yibo Wang,
Wenting Zhao,
Zhongfen Deng,
Shuaiqi Liu,
Renze Lou,
Henry Peng Zou,
Pranav Narayanan Venkit,
Nan Zhang,
Mukund Srinath,
Haoran Ranran Zhang,
Vipul Gupta,
Yinghui Li,
Tao Li,
Fei Wang,
Qin Liu,
Tianlin Liu,
Pengzhi Gao,
Congying Xia,
Chen Xing,
Jiayang Cheng,
Zhaowei Wang,
Ying Su,
Raj Sanjay Shah,
Ruohao Guo
, et al. (15 additional authors not shown)
Abstract:
This work is motivated by two key trends. On one hand, large language models (LLMs) have shown remarkable versatility in various generative tasks such as writing, drawing, and question answering, significantly reducing the time required for many routine tasks. On the other hand, researchers, whose work is not only time-consuming but also highly expertise-demanding, face increasing challenges as th…
▽ More
This work is motivated by two key trends. On one hand, large language models (LLMs) have shown remarkable versatility in various generative tasks such as writing, drawing, and question answering, significantly reducing the time required for many routine tasks. On the other hand, researchers, whose work is not only time-consuming but also highly expertise-demanding, face increasing challenges as they have to spend more time reading, writing, and reviewing papers. This raises the question: how can LLMs potentially assist researchers in alleviating their heavy workload?
This study focuses on the topic of LLMs assist NLP Researchers, particularly examining the effectiveness of LLM in assisting paper (meta-)reviewing and its recognizability. To address this, we constructed the ReviewCritique dataset, which includes two types of information: (i) NLP papers (initial submissions rather than camera-ready) with both human-written and LLM-generated reviews, and (ii) each review comes with "deficiency" labels and corresponding explanations for individual segments, annotated by experts. Using ReviewCritique, this study explores two threads of research questions: (i) "LLMs as Reviewers", how do reviews generated by LLMs compare with those written by humans in terms of quality and distinguishability? (ii) "LLMs as Metareviewers", how effectively can LLMs identify potential issues, such as Deficient or unprofessional review segments, within individual paper reviews? To our knowledge, this is the first work to provide such a comprehensive analysis.
△ Less
Submitted 25 June, 2024; v1 submitted 23 June, 2024;
originally announced June 2024.
-
Diffusion Spectral Representation for Reinforcement Learning
Authors:
Dmitry Shribak,
Chen-Xiao Gao,
Yitong Li,
Chenjun Xiao,
Bo Dai
Abstract:
Diffusion-based models have achieved notable empirical successes in reinforcement learning (RL) due to their expressiveness in modeling complex distributions. Despite existing methods being promising, the key challenge of extending existing methods for broader real-world applications lies in the computational cost at inference time, i.e., sampling from a diffusion model is considerably slow as it…
▽ More
Diffusion-based models have achieved notable empirical successes in reinforcement learning (RL) due to their expressiveness in modeling complex distributions. Despite existing methods being promising, the key challenge of extending existing methods for broader real-world applications lies in the computational cost at inference time, i.e., sampling from a diffusion model is considerably slow as it often requires tens to hundreds of iterations to generate even one sample. To circumvent this issue, we propose to leverage the flexibility of diffusion models for RL from a representation learning perspective. In particular, by exploiting the connection between diffusion model and energy-based model, we develop Diffusion Spectral Representation (Diff-SR), a coherent algorithm framework that enables extracting sufficient representations for value functions in Markov decision processes (MDP) and partially observable Markov decision processes (POMDP). We further demonstrate how Diff-SR facilitates efficient policy optimization and practical algorithms while explicitly bypassing the difficulty and inference cost of sampling from the diffusion model. Finally, we provide comprehensive empirical studies to verify the benefits of Diff-SR in delivering robust and advantageous performance across various benchmarks with both fully and partially observable settings.
△ Less
Submitted 23 June, 2024;
originally announced June 2024.
-
Visible-Thermal Tiny Object Detection: A Benchmark Dataset and Baselines
Authors:
Xinyi Ying,
Chao Xiao,
Ruojing Li,
Xu He,
Boyang Li,
Zhaoxu Li,
Yingqian Wang,
Mingyuan Hu,
Qingyu Xu,
Zaiping Lin,
Miao Li,
Shilin Zhou,
Wei An,
Weidong Sheng,
Li Liu
Abstract:
Small object detection (SOD) has been a longstanding yet challenging task for decades, with numerous datasets and algorithms being developed. However, they mainly focus on either visible or thermal modality, while visible-thermal (RGBT) bimodality is rarely explored. Although some RGBT datasets have been developed recently, the insufficient quantity, limited category, misaligned images and large t…
▽ More
Small object detection (SOD) has been a longstanding yet challenging task for decades, with numerous datasets and algorithms being developed. However, they mainly focus on either visible or thermal modality, while visible-thermal (RGBT) bimodality is rarely explored. Although some RGBT datasets have been developed recently, the insufficient quantity, limited category, misaligned images and large target size cannot provide an impartial benchmark to evaluate multi-category visible-thermal small object detection (RGBT SOD) algorithms. In this paper, we build the first large-scale benchmark with high diversity for RGBT SOD (namely RGBT-Tiny), including 115 paired sequences, 93K frames and 1.2M manual annotations. RGBT-Tiny contains abundant targets (7 categories) and high-diversity scenes (8 types that cover different illumination and density variations). Note that, over 81% of targets are smaller than 16x16, and we provide paired bounding box annotations with tracking ID to offer an extremely challenging benchmark with wide-range applications, such as RGBT fusion, detection and tracking. In addition, we propose a scale adaptive fitness (SAFit) measure that exhibits high robustness on both small and large targets. The proposed SAFit can provide reasonable performance evaluation and promote detection performance. Based on the proposed RGBT-Tiny dataset and SAFit measure, extensive evaluations have been conducted, including 23 recent state-of-the-art algorithms that cover four different types (i.e., visible generic detection, visible SOD, thermal SOD and RGBT object detection). Project is available at https://github.com/XinyiYing24/RGBT-Tiny.
△ Less
Submitted 20 June, 2024;
originally announced June 2024.
-
Synthesizing Multimodal Electronic Health Records via Predictive Diffusion Models
Authors:
Yuan Zhong,
Xiaochen Wang,
Jiaqi Wang,
Xiaokun Zhang,
Yaqing Wang,
Mengdi Huai,
Cao Xiao,
Fenglong Ma
Abstract:
Synthesizing electronic health records (EHR) data has become a preferred strategy to address data scarcity, improve data quality, and model fairness in healthcare. However, existing approaches for EHR data generation predominantly rely on state-of-the-art generative techniques like generative adversarial networks, variational autoencoders, and language models. These methods typically replicate inp…
▽ More
Synthesizing electronic health records (EHR) data has become a preferred strategy to address data scarcity, improve data quality, and model fairness in healthcare. However, existing approaches for EHR data generation predominantly rely on state-of-the-art generative techniques like generative adversarial networks, variational autoencoders, and language models. These methods typically replicate input visits, resulting in inadequate modeling of temporal dependencies between visits and overlooking the generation of time information, a crucial element in EHR data. Moreover, their ability to learn visit representations is limited due to simple linear mapping functions, thus compromising generation quality. To address these limitations, we propose a novel EHR data generation model called EHRPD. It is a diffusion-based model designed to predict the next visit based on the current one while also incorporating time interval estimation. To enhance generation quality and diversity, we introduce a novel time-aware visit embedding module and a pioneering predictive denoising diffusion probabilistic model (PDDPM). Additionally, we devise a predictive U-Net (PU-Net) to optimize P-DDPM.We conduct experiments on two public datasets and evaluate EHRPD from fidelity, privacy, and utility perspectives. The experimental results demonstrate the efficacy and utility of the proposed EHRPD in addressing the aforementioned limitations and advancing EHR data generation.
△ Less
Submitted 19 June, 2024;
originally announced June 2024.
-
MuirBench: A Comprehensive Benchmark for Robust Multi-image Understanding
Authors:
Fei Wang,
Xingyu Fu,
James Y. Huang,
Zekun Li,
Qin Liu,
Xiaogeng Liu,
Mingyu Derek Ma,
Nan Xu,
Wenxuan Zhou,
Kai Zhang,
Tianyi Lorena Yan,
Wenjie Jacky Mo,
Hsiang-Hui Liu,
Pan Lu,
Chunyuan Li,
Chaowei Xiao,
Kai-Wei Chang,
Dan Roth,
Sheng Zhang,
Hoifung Poon,
Muhao Chen
Abstract:
We introduce MuirBench, a comprehensive benchmark that focuses on robust multi-image understanding capabilities of multimodal LLMs. MuirBench consists of 12 diverse multi-image tasks (e.g., scene understanding, ordering) that involve 10 categories of multi-image relations (e.g., multiview, temporal relations). Comprising 11,264 images and 2,600 multiple-choice questions, MuirBench is created in a…
▽ More
We introduce MuirBench, a comprehensive benchmark that focuses on robust multi-image understanding capabilities of multimodal LLMs. MuirBench consists of 12 diverse multi-image tasks (e.g., scene understanding, ordering) that involve 10 categories of multi-image relations (e.g., multiview, temporal relations). Comprising 11,264 images and 2,600 multiple-choice questions, MuirBench is created in a pairwise manner, where each standard instance is paired with an unanswerable variant that has minimal semantic differences, in order for a reliable assessment. Evaluated upon 20 recent multi-modal LLMs, our results reveal that even the best-performing models like GPT-4o and Gemini Pro find it challenging to solve MuirBench, achieving 68.0% and 49.3% in accuracy. Open-source multimodal LLMs trained on single images can hardly generalize to multi-image questions, hovering below 33.3% in accuracy. These results highlight the importance of MuirBench in encouraging the community to develop multimodal LLMs that can look beyond a single image, suggesting potential pathways for future improvements.
△ Less
Submitted 1 July, 2024; v1 submitted 13 June, 2024;
originally announced June 2024.
-
Certifiably Byzantine-Robust Federated Conformal Prediction
Authors:
Mintong Kang,
Zhen Lin,
Jimeng Sun,
Cao Xiao,
Bo Li
Abstract:
Conformal prediction has shown impressive capacity in constructing statistically rigorous prediction sets for machine learning models with exchangeable data samples. The siloed datasets, coupled with the escalating privacy concerns related to local data sharing, have inspired recent innovations extending conformal prediction into federated environments with distributed data samples. However, this…
▽ More
Conformal prediction has shown impressive capacity in constructing statistically rigorous prediction sets for machine learning models with exchangeable data samples. The siloed datasets, coupled with the escalating privacy concerns related to local data sharing, have inspired recent innovations extending conformal prediction into federated environments with distributed data samples. However, this framework for distributed uncertainty quantification is susceptible to Byzantine failures. A minor subset of malicious clients can significantly compromise the practicality of coverage guarantees. To address this vulnerability, we introduce a novel framework Rob-FCP, which executes robust federated conformal prediction, effectively countering malicious clients capable of reporting arbitrary statistics with the conformal calibration process. We theoretically provide the conformal coverage bound of Rob-FCP in the Byzantine setting and show that the coverage of Rob-FCP is asymptotically close to the desired coverage level. We also propose a malicious client number estimator to tackle a more challenging setting where the number of malicious clients is unknown to the defender and theoretically shows its effectiveness. We empirically demonstrate the robustness of Rob-FCP against diverse proportions of malicious clients under a variety of Byzantine attacks on five standard benchmark and real-world healthcare datasets.
△ Less
Submitted 4 June, 2024;
originally announced June 2024.
-
Target Networks and Over-parameterization Stabilize Off-policy Bootstrapping with Function Approximation
Authors:
Fengdi Che,
Chenjun Xiao,
Jincheng Mei,
Bo Dai,
Ramki Gummadi,
Oscar A Ramirez,
Christopher K Harris,
A. Rupam Mahmood,
Dale Schuurmans
Abstract:
We prove that the combination of a target network and over-parameterized linear function approximation establishes a weaker convergence condition for bootstrapped value estimation in certain cases, even with off-policy data. Our condition is naturally satisfied for expected updates over the entire state-action space or learning with a batch of complete trajectories from episodic Markov decision pr…
▽ More
We prove that the combination of a target network and over-parameterized linear function approximation establishes a weaker convergence condition for bootstrapped value estimation in certain cases, even with off-policy data. Our condition is naturally satisfied for expected updates over the entire state-action space or learning with a batch of complete trajectories from episodic Markov decision processes. Notably, using only a target network or an over-parameterized model does not provide such a convergence guarantee. Additionally, we extend our results to learning with truncated trajectories, showing that convergence is achievable for all tasks with minor modifications, akin to value truncation for the final states in trajectories. Our primary result focuses on temporal difference estimation for prediction, providing high-probability value estimation error bounds and empirical analysis on Baird's counterexample and a Four-room task. Furthermore, we explore the control setting, demonstrating that similar convergence conditions apply to Q-learning.
△ Less
Submitted 31 May, 2024;
originally announced May 2024.
-
AI Risk Management Should Incorporate Both Safety and Security
Authors:
Xiangyu Qi,
Yangsibo Huang,
Yi Zeng,
Edoardo Debenedetti,
Jonas Geiping,
Luxi He,
Kaixuan Huang,
Udari Madhushani,
Vikash Sehwag,
Weijia Shi,
Boyi Wei,
Tinghao Xie,
Danqi Chen,
Pin-Yu Chen,
Jeffrey Ding,
Ruoxi Jia,
Jiaqi Ma,
Arvind Narayanan,
Weijie J Su,
Mengdi Wang,
Chaowei Xiao,
Bo Li,
Dawn Song,
Peter Henderson,
Prateek Mittal
Abstract:
The exposure of security vulnerabilities in safety-aligned language models, e.g., susceptibility to adversarial attacks, has shed light on the intricate interplay between AI safety and AI security. Although the two disciplines now come together under the overarching goal of AI risk management, they have historically evolved separately, giving rise to differing perspectives. Therefore, in this pape…
▽ More
The exposure of security vulnerabilities in safety-aligned language models, e.g., susceptibility to adversarial attacks, has shed light on the intricate interplay between AI safety and AI security. Although the two disciplines now come together under the overarching goal of AI risk management, they have historically evolved separately, giving rise to differing perspectives. Therefore, in this paper, we advocate that stakeholders in AI risk management should be aware of the nuances, synergies, and interplay between safety and security, and unambiguously take into account the perspectives of both disciplines in order to devise mostly effective and holistic risk mitigation approaches. Unfortunately, this vision is often obfuscated, as the definitions of the basic concepts of "safety" and "security" themselves are often inconsistent and lack consensus across communities. With AI risk management being increasingly cross-disciplinary, this issue is particularly salient. In light of this conceptual challenge, we introduce a unified reference framework to clarify the differences and interplay between AI safety and AI security, aiming to facilitate a shared understanding and effective collaboration across communities.
△ Less
Submitted 29 May, 2024;
originally announced May 2024.
-
The Power of Next-Frame Prediction for Learning Physical Laws
Authors:
Thomas Winterbottom,
G. Thomas Hudson,
Daniel Kluvanec,
Dean Slack,
Jamie Sterling,
Junjie Shentu,
Chenghao Xiao,
Zheming Zhou,
Noura Al Moubayed
Abstract:
Next-frame prediction is a useful and powerful method for modelling and understanding the dynamics of video data. Inspired by the empirical success of causal language modelling and next-token prediction in language modelling, we explore the extent to which next-frame prediction serves as a strong foundational learning strategy (analogous to language modelling) for inducing an understanding of the…
▽ More
Next-frame prediction is a useful and powerful method for modelling and understanding the dynamics of video data. Inspired by the empirical success of causal language modelling and next-token prediction in language modelling, we explore the extent to which next-frame prediction serves as a strong foundational learning strategy (analogous to language modelling) for inducing an understanding of the visual world. In order to quantify the specific visual understanding induced by next-frame prediction, we introduce six diagnostic simulation video datasets derived from fundamental physical laws created by varying physical constants such as gravity and mass. We demonstrate that our models trained only on next-frame prediction are capable of predicting the value of these physical constants (e.g. gravity) without having been trained directly to learn these constants via a regression task. We find that the generative training phase alone induces a model state that can predict physical constants significantly better than that of a random model, improving the loss by a factor of between 1.28 to 6.24. We conclude that next-frame prediction shows great promise as a general learning strategy to induce understanding of the many `laws' that govern the visual domain without the need for explicit labelling.
△ Less
Submitted 21 May, 2024;
originally announced May 2024.
-
KG-FIT: Knowledge Graph Fine-Tuning Upon Open-World Knowledge
Authors:
Pengcheng Jiang,
Lang Cao,
Cao Xiao,
Parminder Bhatia,
Jimeng Sun,
Jiawei Han
Abstract:
Knowledge Graph Embedding (KGE) techniques are crucial in learning compact representations of entities and relations within a knowledge graph, facilitating efficient reasoning and knowledge discovery. While existing methods typically focus either on training KGE models solely based on graph structure or fine-tuning pre-trained language models with classification data in KG, KG-FIT leverages LLM-gu…
▽ More
Knowledge Graph Embedding (KGE) techniques are crucial in learning compact representations of entities and relations within a knowledge graph, facilitating efficient reasoning and knowledge discovery. While existing methods typically focus either on training KGE models solely based on graph structure or fine-tuning pre-trained language models with classification data in KG, KG-FIT leverages LLM-guided refinement to construct a semantically coherent hierarchical structure of entity clusters. By incorporating this hierarchical knowledge along with textual information during the fine-tuning process, KG-FIT effectively captures both global semantics from the LLM and local semantics from the KG. Extensive experiments on the benchmark datasets FB15K-237, YAGO3-10, and PrimeKG demonstrate the superiority of KG-FIT over state-of-the-art pre-trained language model-based methods, achieving improvements of 14.4%, 13.5%, and 11.9% in the Hits@10 metric for the link prediction task, respectively. Furthermore, KG-FIT yields substantial performance gains of 12.6%, 6.7%, and 17.7% compared to the structure-based base models upon which it is built. These results highlight the effectiveness of KG-FIT in incorporating open-world knowledge from LLMs to significantly enhance the expressiveness and informativeness of KG embeddings.
△ Less
Submitted 4 June, 2024; v1 submitted 25 May, 2024;
originally announced May 2024.
-
Enhancing Visual-Language Modality Alignment in Large Vision Language Models via Self-Improvement
Authors:
Xiyao Wang,
Jiuhai Chen,
Zhaoyang Wang,
Yuhang Zhou,
Yiyang Zhou,
Huaxiu Yao,
Tianyi Zhou,
Tom Goldstein,
Parminder Bhatia,
Furong Huang,
Cao Xiao
Abstract:
Large vision-language models (LVLMs) have achieved impressive results in various visual question-answering and reasoning tasks through vision instruction tuning on specific datasets. However, there is still significant room for improvement in the alignment between visual and language modalities. Previous methods to enhance this alignment typically require external models or data, heavily depending…
▽ More
Large vision-language models (LVLMs) have achieved impressive results in various visual question-answering and reasoning tasks through vision instruction tuning on specific datasets. However, there is still significant room for improvement in the alignment between visual and language modalities. Previous methods to enhance this alignment typically require external models or data, heavily depending on their capabilities and quality, which inevitably sets an upper bound on performance. In this paper, we propose SIMA, a framework that enhances visual and language modality alignment through self-improvement, eliminating the needs for external models or data. SIMA leverages prompts from existing vision instruction tuning datasets to self-generate responses and employs an in-context self-critic mechanism to select response pairs for preference tuning. The key innovation is the introduction of three vision metrics during the in-context self-critic process, which can guide the LVLM in selecting responses that enhance image comprehension. Through experiments across 14 hallucination and comprehensive benchmarks, we demonstrate that SIMA not only improves model performance across all benchmarks but also achieves superior modality alignment, outperforming previous approaches.
△ Less
Submitted 7 June, 2024; v1 submitted 24 May, 2024;
originally announced May 2024.
-
Asynchronous MIMO-OFDM Massive Unsourced Random Access with Codeword Collisions
Authors:
Tianya Li,
Yongpeng Wu,
Junyuan Gao,
Wenjun Zhang,
Xiang-Gen Xia,
Derrick Wing Kwan Ng,
Chengshan Xiao
Abstract:
This paper investigates asynchronous MIMO massive unsourced random access in an orthogonal frequency division multiplexing (OFDM) system over frequency-selective fading channels, with the presence of both timing and carrier frequency offsets (TO and CFO) and non-negligible codeword collisions. The proposed coding framework segregates the data into two components, namely, preamble and coding parts,…
▽ More
This paper investigates asynchronous MIMO massive unsourced random access in an orthogonal frequency division multiplexing (OFDM) system over frequency-selective fading channels, with the presence of both timing and carrier frequency offsets (TO and CFO) and non-negligible codeword collisions. The proposed coding framework segregates the data into two components, namely, preamble and coding parts, with the former being tree-coded and the latter LDPC-coded. By leveraging the dual sparsity of the equivalent channel across both codeword and delay domains (CD and DD), we develop a message passing-based sparse Bayesian learning algorithm, combined with belief propagation and mean field, to iteratively estimate DD channel responses, TO, and delay profiles. Furthermore, we establish a novel graph-based algorithm to iteratively separate the superimposed channels and compensate for the phase rotations. Additionally, the proposed algorithm is applied to the flat fading scenario to estimate both TO and CFO, where the channel and offset estimation is enhanced by leveraging the geometric characteristics of the signal constellation. Simulations reveal that the proposed algorithm achieves superior performance and substantial complexity reduction in both channel and offset estimation compared to the codebook enlarging-based counterparts, and enhanced data recovery performances compared to state-of-the-art URA schemes.
△ Less
Submitted 20 May, 2024;
originally announced May 2024.
-
Safeguarding Vision-Language Models Against Patched Visual Prompt Injectors
Authors:
Jiachen Sun,
Changsheng Wang,
Jiongxiao Wang,
Yiwei Zhang,
Chaowei Xiao
Abstract:
Large language models have become increasingly prominent, also signaling a shift towards multimodality as the next frontier in artificial intelligence, where their embeddings are harnessed as prompts to generate textual content. Vision-language models (VLMs) stand at the forefront of this advancement, offering innovative ways to combine visual and textual data for enhanced understanding and intera…
▽ More
Large language models have become increasingly prominent, also signaling a shift towards multimodality as the next frontier in artificial intelligence, where their embeddings are harnessed as prompts to generate textual content. Vision-language models (VLMs) stand at the forefront of this advancement, offering innovative ways to combine visual and textual data for enhanced understanding and interaction. However, this integration also enlarges the attack surface. Patch-based adversarial attack is considered the most realistic threat model in physical vision applications, as demonstrated in many existing literature. In this paper, we propose to address patched visual prompt injection, where adversaries exploit adversarial patches to generate target content in VLMs. Our investigation reveals that patched adversarial prompts exhibit sensitivity to pixel-wise randomization, a trait that remains robust even against adaptive attacks designed to counteract such defenses. Leveraging this insight, we introduce SmoothVLM, a defense mechanism rooted in smoothing techniques, specifically tailored to protect VLMs from the threat of patched visual prompt injectors. Our framework significantly lowers the attack success rate to a range between 0% and 5.0% on two leading VLMs, while achieving around 67.3% to 95.0% context recovery of the benign images, demonstrating a balance between security and usability.
△ Less
Submitted 24 August, 2024; v1 submitted 17 May, 2024;
originally announced May 2024.
-
SonifyAR: Context-Aware Sound Generation in Augmented Reality
Authors:
Xia Su,
Jon E. Froehlich,
Eunyee Koh,
Chang Xiao
Abstract:
Sound plays a crucial role in enhancing user experience and immersiveness in Augmented Reality (AR). However, current platforms lack support for AR sound authoring due to limited interaction types, challenges in collecting and specifying context information, and difficulty in acquiring matching sound assets. We present SonifyAR, an LLM-based AR sound authoring system that generates context-aware s…
▽ More
Sound plays a crucial role in enhancing user experience and immersiveness in Augmented Reality (AR). However, current platforms lack support for AR sound authoring due to limited interaction types, challenges in collecting and specifying context information, and difficulty in acquiring matching sound assets. We present SonifyAR, an LLM-based AR sound authoring system that generates context-aware sound effects for AR experiences. SonifyAR expands the current design space of AR sound and implements a Programming by Demonstration (PbD) pipeline to automatically collect contextual information of AR events, including virtual content semantics and real world context. This context information is then processed by a large language model to acquire sound effects with Recommendation, Retrieval, Generation, and Transfer methods. To evaluate the usability and performance of our system, we conducted a user study with eight participants and created five example applications, including an AR-based science experiment, an improving case for AR headset safety, and an assisting example for low vision AR users.
△ Less
Submitted 11 August, 2024; v1 submitted 11 May, 2024;
originally announced May 2024.