Showing 1–50 of 2,343 results for author: Zhao, Y

Search v0.5.6 released 2020-02-24

arXiv:2408.11048 [pdf, other]

cs.RO cs.AI cs.LG

RP1M: A Large-Scale Motion Dataset for Piano Playing with Bi-Manual Dexterous Robot Hands

Authors: Yi Zhao, Le Chen, Jan Schneider, Quankai Gao, Juho Kannala, Bernhard Schölkopf, Joni Pajarinen, Dieter Büchler

Abstract: It has been a long-standing research goal to endow robot hands with human-level dexterity. Bi-manual robot piano playing constitutes a task that combines challenges from dynamic tasks, such as generating fast while precise motions, with slower but contact-rich manipulation problems. Although reinforcement learning based approaches have shown promising results in single-task performance, these meth… ▽ More It has been a long-standing research goal to endow robot hands with human-level dexterity. Bi-manual robot piano playing constitutes a task that combines challenges from dynamic tasks, such as generating fast while precise motions, with slower but contact-rich manipulation problems. Although reinforcement learning based approaches have shown promising results in single-task performance, these methods struggle in a multi-song setting. Our work aims to close this gap and, thereby, enable imitation learning approaches for robot piano playing at scale. To this end, we introduce the Robot Piano 1 Million (RP1M) dataset, containing bi-manual robot piano playing motion data of more than one million trajectories. We formulate finger placements as an optimal transport problem, thus, enabling automatic annotation of vast amounts of unlabeled songs. Benchmarking existing imitation learning approaches shows that such approaches reach state-of-the-art robot piano playing performance by leveraging RP1M. △ Less

Submitted 20 August, 2024; originally announced August 2024.

Comments: Project Website: https://rp1m.github.io/
arXiv:2408.11030 [pdf, other]

cs.CV

OpenScan: A Benchmark for Generalized Open-Vocabulary 3D Scene Understanding

Authors: Youjun Zhao, Jiaying Lin, Shuquan Ye, Qianshi Pang, Rynson W. H. Lau

Abstract: Open-vocabulary 3D scene understanding (OV-3D) aims to localize and classify novel objects beyond the closed object classes. However, existing approaches and benchmarks primarily focus on the open vocabulary problem within the context of object classes, which is insufficient to provide a holistic evaluation to what extent a model understands the 3D scene. In this paper, we introduce a more challen… ▽ More Open-vocabulary 3D scene understanding (OV-3D) aims to localize and classify novel objects beyond the closed object classes. However, existing approaches and benchmarks primarily focus on the open vocabulary problem within the context of object classes, which is insufficient to provide a holistic evaluation to what extent a model understands the 3D scene. In this paper, we introduce a more challenging task called Generalized Open-Vocabulary 3D Scene Understanding (GOV-3D) to explore the open vocabulary problem beyond object classes. It encompasses an open and diverse set of generalized knowledge, expressed as linguistic queries of fine-grained and object-specific attributes. To this end, we contribute a new benchmark named OpenScan, which consists of 3D object attributes across eight representative linguistic aspects, including affordance, property, material, and more. We further evaluate state-of-the-art OV-3D methods on our OpenScan benchmark, and discover that these methods struggle to comprehend the abstract vocabularies of the GOV-3D task, a challenge that cannot be addressed by simply scaling up object classes during training. We highlight the limitations of existing methodologies and explore a promising direction to overcome the identified shortcomings. Data and code are available at https://github.com/YoujunZhao/OpenScan △ Less

Submitted 20 August, 2024; originally announced August 2024.
arXiv:2408.10935 [pdf, other]

cs.CV

Large Point-to-Gaussian Model for Image-to-3D Generation

Authors: Longfei Lu, Huachen Gao, Tao Dai, Yaohua Zha, Zhi Hou, Junta Wu, Shu-Tao Xia

Abstract: Recently, image-to-3D approaches have significantly advanced the generation quality and speed of 3D assets based on large reconstruction models, particularly 3D Gaussian reconstruction models. Existing large 3D Gaussian models directly map 2D image to 3D Gaussian parameters, while regressing 2D image to 3D Gaussian representations is challenging without 3D priors. In this paper, we propose a large… ▽ More Recently, image-to-3D approaches have significantly advanced the generation quality and speed of 3D assets based on large reconstruction models, particularly 3D Gaussian reconstruction models. Existing large 3D Gaussian models directly map 2D image to 3D Gaussian parameters, while regressing 2D image to 3D Gaussian representations is challenging without 3D priors. In this paper, we propose a large Point-to-Gaussian model, that inputs the initial point cloud produced from large 3D diffusion model conditional on 2D image to generate the Gaussian parameters, for image-to-3D generation. The point cloud provides initial 3D geometry prior for Gaussian generation, thus significantly facilitating image-to-3D Generation. Moreover, we present the \textbf{A}ttention mechanism, \textbf{P}rojection mechanism, and \textbf{P}oint feature extractor, dubbed as \textbf{APP} block, for fusing the image features with point cloud features. The qualitative and quantitative experiments extensively demonstrate the effectiveness of the proposed approach on GSO and Objaverse datasets, and show the proposed method achieves state-of-the-art performance. △ Less

Submitted 20 August, 2024; originally announced August 2024.

Comments: 10 pages, 9 figures, ACM MM 2024
arXiv:2408.10918 [pdf, other]

cs.CL

CHECKWHY: Causal Fact Verification via Argument Structure

Authors: Jiasheng Si, Yibo Zhao, Yingjie Zhu, Haiyang Zhu, Wenpeng Lu, Deyu Zhou

Abstract: With the growing complexity of fact verification tasks, the concern with "thoughtful" reasoning capabilities is increasing. However, recent fact verification benchmarks mainly focus on checking a narrow scope of semantic factoids within claims and lack an explicit logical reasoning process. In this paper, we introduce CheckWhy, a challenging dataset tailored to a novel causal fact verification tas… ▽ More With the growing complexity of fact verification tasks, the concern with "thoughtful" reasoning capabilities is increasing. However, recent fact verification benchmarks mainly focus on checking a narrow scope of semantic factoids within claims and lack an explicit logical reasoning process. In this paper, we introduce CheckWhy, a challenging dataset tailored to a novel causal fact verification task: checking the truthfulness of the causal relation within claims through rigorous reasoning steps. CheckWhy consists of over 19K "why" claim-evidence-argument structure triplets with supports, refutes, and not enough info labels. Each argument structure is composed of connected evidence, representing the reasoning process that begins with foundational evidence and progresses toward claim establishment. Through extensive experiments on state-of-the-art models, we validate the importance of incorporating the argument structure for causal fact verification. Moreover, the automated and human evaluation of argument structure generation reveals the difficulty in producing satisfying argument structure by fine-tuned models or Chain-of-Thought prompted LLMs, leaving considerable room for future improvements. △ Less

Submitted 20 August, 2024; originally announced August 2024.

Comments: Accepted by ACL2024; Awarded as Outstanding Paper Award and Area Chair Award
arXiv:2408.10912 [pdf, other]

cs.IT

An Achievable Rate-Distortion Region of Joint Identification and Sensing for Multiple Access Channels

Authors: Yaning Zhao, Wafa Labidi, Holger Boche, Eduard Jorswieck, Christian Deppe

Abstract: In contrast to Shannon transmission codes, the size of identification (ID) codes for discrete memoryless channels (DMCs) experiences doubly exponential growth with the block length when randomized encoding is used. Additional enhancements within the ID paradigm can be realized through supplementary resources such as quantum entanglement, common randomness (CR), and feedback. Joint transmission and… ▽ More In contrast to Shannon transmission codes, the size of identification (ID) codes for discrete memoryless channels (DMCs) experiences doubly exponential growth with the block length when randomized encoding is used. Additional enhancements within the ID paradigm can be realized through supplementary resources such as quantum entanglement, common randomness (CR), and feedback. Joint transmission and sensing demonstrate significant benefits over separation-based methods. Inspired by the significant impact of feedback on the ID capacity, our work delves into the realm of joint ID and sensing (JIDAS) for state-dependent multiple access channels (SD-MACs) with noiseless strictly casual feedback. Here, the senders aim to convey ID messages to the receiver while simultaneously sensing the channel states. We establish a lower bound on the capacity-distortion region of the SD-MACs. An example shows that JIDAS outperforms the separation-based approach. △ Less

Submitted 20 August, 2024; originally announced August 2024.
arXiv:2408.10718 [pdf, other]

cs.SE cs.CL

CodeJudge-Eval: Can Large Language Models be Good Judges in Code Understanding?

Authors: Yuwei Zhao, Ziyang Luo, Yuchen Tian, Hongzhan Lin, Weixiang Yan, Annan Li, Jing Ma

Abstract: Recent advancements in large language models (LLMs) have showcased impressive code generation capabilities, primarily evaluated through language-to-code benchmarks. However, these benchmarks may not fully capture a model's code understanding abilities. We introduce CodeJudge-Eval (CJ-Eval), a novel benchmark designed to assess LLMs' code understanding abilities from the perspective of code judging… ▽ More Recent advancements in large language models (LLMs) have showcased impressive code generation capabilities, primarily evaluated through language-to-code benchmarks. However, these benchmarks may not fully capture a model's code understanding abilities. We introduce CodeJudge-Eval (CJ-Eval), a novel benchmark designed to assess LLMs' code understanding abilities from the perspective of code judging rather than code generation. CJ-Eval challenges models to determine the correctness of provided code solutions, encompassing various error types and compilation issues. By leveraging a diverse set of problems and a fine-grained judging system, CJ-Eval addresses the limitations of traditional benchmarks, including the potential memorization of solutions. Evaluation of 12 well-known LLMs on CJ-Eval reveals that even state-of-the-art models struggle, highlighting the benchmark's ability to probe deeper into models' code understanding abilities. Our benchmark will be available at \url{https://github.com/CodeLLM-Research/CodeJudge-Eval}. △ Less

Submitted 20 August, 2024; originally announced August 2024.

Comments: Work in progress
arXiv:2408.10566 [pdf, other]

cs.LG cs.AI

SparseGrow: Addressing Growth-Induced Forgetting in Task-Agnostic Continual Learning

Authors: Yuqing Zhao, Divya Saxena, Jiannong Cao, Xiaoyun Liu, Changlin Song

Abstract: In continual learning (CL), model growth enhances adaptability over new data, improving knowledge retention for more tasks. However, improper model growth can lead to severe degradation of previously learned knowledge, an issue we name as growth-induced forgetting (GIFt), especially in task-agnostic CL using entire grown model for inference. Existing works, despite adopting model growth and random… ▽ More In continual learning (CL), model growth enhances adaptability over new data, improving knowledge retention for more tasks. However, improper model growth can lead to severe degradation of previously learned knowledge, an issue we name as growth-induced forgetting (GIFt), especially in task-agnostic CL using entire grown model for inference. Existing works, despite adopting model growth and random initialization for better adaptability, often fail to recognize the presence of GIFt caused by improper model growth. This oversight limits comprehensive control of forgetting and hinders full utilization of model growth. We are the first in CL to identify this issue and conduct an in-depth study on root cause of GIFt, where layer expansion stands out among model growth strategies, widening layers without affecting model functionality. Yet, direct adoption of layer expansion presents challenges. It lacks data-driven control and initialization of expanded parameters to balance adaptability and knowledge retention. This paper presents a novel SparseGrow approach to overcome the issue of GIFt while enhancing adaptability over new data. SparseGrow employs data-driven sparse layer expansion to control efficient parameter usage during growth, reducing GIFt from excessive growth and functionality changes. It also combines sparse growth with on-data initialization at training late-stage to create partially 0-valued expansions that fit learned distribution, enhancing retention and adaptability. To further minimize forgetting, freezing is applied by calculating the sparse mask, allowing data-driven preservation of important parameters. Through experiments across datasets with various settings, cases and task numbers, we demonstrate the necessity of layer expansion and showcase the effectiveness of SparseGrow in overcoming GIFt, highlighting its adaptability and knowledge retention for incremental tasks. △ Less

Submitted 20 August, 2024; originally announced August 2024.

Comments: This paper has been submitted to the AAAI conference. If accepted, the final version will be updated to reflect the conference proceedings
arXiv:2408.10172 [pdf, other]

cs.DS

Eulerian Graph Sparsification by Effective Resistance Decomposition

Authors: Arun Jambulapati, Sushant Sachdeva, Aaron Sidford, Kevin Tian, Yibin Zhao

Abstract: We provide an algorithm that, given an $n$-vertex $m$-edge Eulerian graph with polynomially bounded weights, computes an $\breve{O}(n\log^{2} n \cdot \varepsilon^{-2})$-edge $\varepsilon$-approximate Eulerian sparsifier with high probability in $\breve{O}(m\log^3 n)$ time (where $\breve{O}(\cdot)$ hides $\text{polyloglog}(n)$ factors). Due to a reduction from [Peng-Song, STOC '22], this yields an… ▽ More We provide an algorithm that, given an $n$-vertex $m$-edge Eulerian graph with polynomially bounded weights, computes an $\breve{O}(n\log^{2} n \cdot \varepsilon^{-2})$-edge $\varepsilon$-approximate Eulerian sparsifier with high probability in $\breve{O}(m\log^3 n)$ time (where $\breve{O}(\cdot)$ hides $\text{polyloglog}(n)$ factors). Due to a reduction from [Peng-Song, STOC '22], this yields an $\breve{O}(m\log^3 n + n\log^6 n)$-time algorithm for solving $n$-vertex $m$-edge Eulerian Laplacian systems with polynomially-bounded weights with high probability, improving upon the previous state-of-the-art runtime of $Ω(m\log^8 n + n\log^{23} n)$. We also give a polynomial-time algorithm that computes $O(\min(n\log n \cdot \varepsilon^{-2} + n\log^{5/3} n \cdot \varepsilon^{-4/3}, n\log^{3/2} n \cdot \varepsilon^{-2}))$-edge sparsifiers, improving the best such sparsity bound of $O(n\log^2 n \cdot \varepsilon^{-2} + n\log^{8/3} n \cdot \varepsilon^{-4/3})$ [Sachdeva-Thudi-Zhao, ICALP '24]. Finally, we show that our techniques extend to yield the first $O(m\cdot\text{polylog}(n))$ time algorithm for computing $O(n\varepsilon^{-1}\cdot\text{polylog}(n))$-edge graphical spectral sketches, as well as a natural Eulerian generalization we introduce. In contrast to prior Eulerian graph sparsification algorithms which used either short cycle or expander decompositions, our algorithms use a simple efficient effective resistance decomposition scheme we introduce. Our algorithms apply a natural sampling scheme and electrical routing (to achieve degree balance) to such decompositions. Our analysis leverages new asymmetric variance bounds specialized to Eulerian Laplacians and tools from discrepancy theory. △ Less

Submitted 19 August, 2024; originally announced August 2024.
arXiv:2408.09647 [pdf, other]

cs.CV

C2P-CLIP: Injecting Category Common Prompt in CLIP to Enhance Generalization in Deepfake Detection

Authors: Chuangchuang Tan, Renshuai Tao, Huan Liu, Guanghua Gu, Baoyuan Wu, Yao Zhao, Yunchao Wei

Abstract: This work focuses on AIGC detection to develop universal detectors capable of identifying various types of forgery images. Recent studies have found large pre-trained models, such as CLIP, are effective for generalizable deepfake detection along with linear classifiers. However, two critical issues remain unresolved: 1) understanding why CLIP features are effective on deepfake detection through a… ▽ More This work focuses on AIGC detection to develop universal detectors capable of identifying various types of forgery images. Recent studies have found large pre-trained models, such as CLIP, are effective for generalizable deepfake detection along with linear classifiers. However, two critical issues remain unresolved: 1) understanding why CLIP features are effective on deepfake detection through a linear classifier; and 2) exploring the detection potential of CLIP. In this study, we delve into the underlying mechanisms of CLIP's detection capabilities by decoding its detection features into text and performing word frequency analysis. Our finding indicates that CLIP detects deepfakes by recognizing similar concepts (Fig. \ref{fig:fig1} a). Building on this insight, we introduce Category Common Prompt CLIP, called C2P-CLIP, which integrates the category common prompt into the text encoder to inject category-related concepts into the image encoder, thereby enhancing detection performance (Fig. \ref{fig:fig1} b). Our method achieves a 12.41\% improvement in detection accuracy compared to the original CLIP, without introducing additional parameters during testing. Comprehensive experiments conducted on two widely-used datasets, encompassing 20 generation models, validate the efficacy of the proposed method, demonstrating state-of-the-art performance. The code is available at \url{https://github.com/chuangchuangtan/C2P-CLIP-DeepfakeDetection} △ Less

Submitted 18 August, 2024; originally announced August 2024.

Comments: 10 pages, 5 figures
arXiv:2408.09533 [pdf, other]

cs.CV

AnomalyFactory: Regard Anomaly Generation as Unsupervised Anomaly Localization

Authors: Ying Zhao

Abstract: Recent advances in anomaly generation approaches alleviate the effect of data insufficiency on task of anomaly localization. While effective, most of them learn multiple large generative models on different datasets and cumbersome anomaly prediction models for different classes. To address the limitations, we propose a novel scalable framework, named AnomalyFactory, that unifies unsupervised anoma… ▽ More Recent advances in anomaly generation approaches alleviate the effect of data insufficiency on task of anomaly localization. While effective, most of them learn multiple large generative models on different datasets and cumbersome anomaly prediction models for different classes. To address the limitations, we propose a novel scalable framework, named AnomalyFactory, that unifies unsupervised anomaly generation and localization with same network architecture. It starts with a BootGenerator that combines structure of a target edge map and appearance of a reference color image with the guidance of a learned heatmap. Then, it proceeds with a FlareGenerator that receives supervision signals from the BootGenerator and reforms the heatmap to indicate anomaly locations in the generated image. Finally, it easily transforms the same network architecture to a BlazeDetector that localizes anomaly pixels with the learned heatmap by converting the anomaly images generated by the FlareGenerator to normal images. By manipulating the target edge maps and combining them with various reference images, AnomalyFactory generates authentic and diversity samples cross domains. Comprehensive experiments carried on 5 datasets, including MVTecAD, VisA, MVTecLOCO, MADSim and RealIAD, demonstrate that our approach is superior to competitors in generation capability and scalability. △ Less

Submitted 18 August, 2024; originally announced August 2024.

Comments: Accepted to the 2nd workshop on Vision-based InduStrial InspectiON (VISION) at ECCV 2024
arXiv:2408.09462 [pdf, other]

cs.MM

SpeechEE: A Novel Benchmark for Speech Event Extraction

Authors: Bin Wang, Meishan Zhang, Hao Fei, Yu Zhao, Bobo Li, Shengqiong Wu, Wei Ji, Min Zhang

Abstract: Event extraction (EE) is a critical direction in the field of information extraction, laying an important foundation for the construction of structured knowledge bases. EE from text has received ample research and attention for years, yet there can be numerous real-world applications that require direct information acquisition from speech signals, online meeting minutes, interview summaries, press… ▽ More Event extraction (EE) is a critical direction in the field of information extraction, laying an important foundation for the construction of structured knowledge bases. EE from text has received ample research and attention for years, yet there can be numerous real-world applications that require direct information acquisition from speech signals, online meeting minutes, interview summaries, press releases, etc. While EE from speech has remained under-explored, this paper fills the gap by pioneering a SpeechEE, defined as detecting the event predicates and arguments from a given audio speech. To benchmark the SpeechEE task, we first construct a large-scale high-quality dataset. Based on textual EE datasets under the sentence, document, and dialogue scenarios, we convert texts into speeches through both manual real-person narration and automatic synthesis, empowering the data with diverse scenarios, languages, domains, ambiences, and speaker styles. Further, to effectively address the key challenges in the task, we tailor an E2E SpeechEE system based on the encoder-decoder architecture, where a novel Shrinking Unit module and a retrieval-aided decoding mechanism are devised. Extensive experimental results on all SpeechEE subsets demonstrate the efficacy of the proposed model, offering a strong baseline for the task. At last, being the first work on this topic, we shed light on key directions for future research. Our codes and the benchmark datasets are open at https://SpeechEE.github.io/ △ Less

Submitted 18 August, 2024; originally announced August 2024.
arXiv:2408.09119 [pdf, ps, other]

cs.IT

Identification via Gaussian Multiple Access Channels in the Presence of Feedback

Authors: Yaning Zhao, Wafa Labidi, Holger Boche, Eduard Jorswieck, Christian Deppe

Abstract: We investigate message identification over a K-sender Gaussian multiple access channel (K-GMAC). Unlike conventional Shannon transmission codes, the size of randomized identification (ID) codes experiences a doubly exponential growth in the code length. Improvements in the ID approach can be attained through additional resources such as quantum entanglement, common randomness (CR), and feedback. I… ▽ More We investigate message identification over a K-sender Gaussian multiple access channel (K-GMAC). Unlike conventional Shannon transmission codes, the size of randomized identification (ID) codes experiences a doubly exponential growth in the code length. Improvements in the ID approach can be attained through additional resources such as quantum entanglement, common randomness (CR), and feedback. It has been demonstrated that an infinite capacity can be attained for a single-user Gaussian channel with noiseless feedback, irrespective of the chosen rate scaling. We establish the capacity region of both the K-sender Gaussian multiple access channel (K-GMAC) and the K-sender state-dependent Gaussian multiple access channel (K-SD-GMAC) when strictly causal noiseless feedback is available. △ Less

Submitted 17 August, 2024; originally announced August 2024.
arXiv:2408.08515 [pdf, other]

cs.SE

Selecting Initial Seeds for Better JVM Fuzzing

Authors: Tianchang Gao, Junjie Chen, Dong Wang, Yile Guo, Yingquan Zhao, Zan Wang

Abstract: Literature in traditional program fuzzing has confirmed that effectiveness is largely impacted by redundancy among initial seeds, thereby proposing a series of seed selection methods. JVM fuzzing, compared to traditional ones, presents unique characteristics, including large-scale and intricate code, and programs with both syntactic and semantic features. However, it remains unclear whether the ex… ▽ More Literature in traditional program fuzzing has confirmed that effectiveness is largely impacted by redundancy among initial seeds, thereby proposing a series of seed selection methods. JVM fuzzing, compared to traditional ones, presents unique characteristics, including large-scale and intricate code, and programs with both syntactic and semantic features. However, it remains unclear whether the existing seed selection methods are suitable for JVM fuzzing and whether utilizing program features can enhance effectiveness. To address this, we devise a total of 10 initial seed selection methods, comprising coverage-based, prefuzz-based, and program-feature-based methods. We then conduct an empirical study on three JVM implementations to extensively evaluate the performance of the seed selection methods within two SOTA fuzzing techniques (JavaTailor and VECT). Specifically, we examine performance from three aspects: (i) effectiveness and efficiency using widely studied initial seeds, (ii) effectiveness using the programs in the wild, and (iii) the ability to detect new bugs. Evaluation results first show that the program-feature-based method that utilizes the control flow graph not only has a significantly lower time overhead (i.e., 30s), but also outperforms other methods, achieving 142% to 269% improvement compared to the full set of initial seeds. Second, results reveal that the initial seed selection greatly improves the quality of wild programs and exhibits complementary effectiveness by detecting new behaviors. Third, results demonstrate that given the same testing period, initial seed selection improves the JVM fuzzing techniques by detecting more unknown bugs. Particularly, 21 out of the 25 detected bugs have been confirmed or fixed by developers. This work takes the first look at initial seed selection in JVM fuzzing, confirming its importance in fuzzing effectiveness and efficiency. △ Less

Submitted 16 August, 2024; originally announced August 2024.
arXiv:2408.08252 [pdf, other]

cs.LG cs.AI q-bio.GN stat.ML

Derivative-Free Guidance in Continuous and Discrete Diffusion Models with Soft Value-Based Decoding

Authors: Xiner Li, Yulai Zhao, Chenyu Wang, Gabriele Scalia, Gokcen Eraslan, Surag Nair, Tommaso Biancalani, Aviv Regev, Sergey Levine, Masatoshi Uehara

Abstract: Diffusion models excel at capturing the natural design spaces of images, molecules, DNA, RNA, and protein sequences. However, rather than merely generating designs that are natural, we often aim to optimize downstream reward functions while preserving the naturalness of these design spaces. Existing methods for achieving this goal often require ``differentiable'' proxy models (\textit{e.g.}, class… ▽ More Diffusion models excel at capturing the natural design spaces of images, molecules, DNA, RNA, and protein sequences. However, rather than merely generating designs that are natural, we often aim to optimize downstream reward functions while preserving the naturalness of these design spaces. Existing methods for achieving this goal often require ``differentiable'' proxy models (\textit{e.g.}, classifier guidance or DPS) or involve computationally expensive fine-tuning of diffusion models (\textit{e.g.}, classifier-free guidance, RL-based fine-tuning). In our work, we propose a new method to address these challenges. Our algorithm is an iterative sampling method that integrates soft value functions, which looks ahead to how intermediate noisy states lead to high rewards in the future, into the standard inference procedure of pre-trained diffusion models. Notably, our approach avoids fine-tuning generative models and eliminates the need to construct differentiable models. This enables us to (1) directly utilize non-differentiable features/reward feedback, commonly used in many scientific domains, and (2) apply our method to recent discrete diffusion models in a principled way. Finally, we demonstrate the effectiveness of our algorithm across several domains, including image generation, molecule generation, and DNA/RNA sequence generation. The code is available at \href{https://github.com/masa-ue/SVDD}{https://github.com/masa-ue/SVDD}. △ Less

Submitted 15 August, 2024; originally announced August 2024.

Comments: The code is available at https://github.com/masa-ue/SVDD
arXiv:2408.07605 [pdf, other]

cs.CV

Panacea+: Panoramic and Controllable Video Generation for Autonomous Driving

Authors: Yuqing Wen, Yucheng Zhao, Yingfei Liu, Binyuan Huang, Fan Jia, Yanhui Wang, Chi Zhang, Tiancai Wang, Xiaoyan Sun, Xiangyu Zhang

Abstract: The field of autonomous driving increasingly demands high-quality annotated video training data. In this paper, we propose Panacea+, a powerful and universally applicable framework for generating video data in driving scenes. Built upon the foundation of our previous work, Panacea, Panacea+ adopts a multi-view appearance noise prior mechanism and a super-resolution module for enhanced consistency… ▽ More The field of autonomous driving increasingly demands high-quality annotated video training data. In this paper, we propose Panacea+, a powerful and universally applicable framework for generating video data in driving scenes. Built upon the foundation of our previous work, Panacea, Panacea+ adopts a multi-view appearance noise prior mechanism and a super-resolution module for enhanced consistency and increased resolution. Extensive experiments show that the generated video samples from Panacea+ greatly benefit a wide range of tasks on different datasets, including 3D object tracking, 3D object detection, and lane detection tasks on the nuScenes and Argoverse 2 dataset. These results strongly prove Panacea+ to be a valuable data generation framework for autonomous driving. △ Less

Submitted 14 August, 2024; originally announced August 2024.

Comments: Project page: https://panacea-ad.github.io/. arXiv admin note: text overlap with arXiv:2311.16813
arXiv:2408.06942 [pdf, other]

cs.HC

doi 10.1145/3663548.3688514

Speech-based Mark for Data Sonification

Authors: Yichun Zhao, Jingyi Lu, Miguel A Nacenta

Abstract: Sonification serves as a powerful tool for data accessibility, especially for people with vision loss. Among various modalities, speech is a familiar means of communication similar to the role of text in visualization. However, speech-based sonification is underexplored. We introduce SpeechTone, a novel speech-based mark for data sonification and extension to the existing Erie declarative grammar… ▽ More Sonification serves as a powerful tool for data accessibility, especially for people with vision loss. Among various modalities, speech is a familiar means of communication similar to the role of text in visualization. However, speech-based sonification is underexplored. We introduce SpeechTone, a novel speech-based mark for data sonification and extension to the existing Erie declarative grammar for sonification. It encodes data into speech attributes such as pitch, speed, voice and speech content. We demonstrate the efficacy of SpeechTone through three examples. △ Less

Submitted 13 August, 2024; originally announced August 2024.

Comments: Accepted in ASSETS '24, October 27-30, 2024, St. John's, NL, Canada

Journal ref: The 26th International ACM SIGACCESS Conference on Computers and Accessibility (ASSETS '24), October 27-30, 2024, St. John's, NL, Canada
arXiv:2408.06834 [pdf, other]

cs.CV

GLGait: A Global-Local Temporal Receptive Field Network for Gait Recognition in the Wild

Authors: Guozhen Peng, Yunhong Wang, Yuwei Zhao, Shaoxiong Zhang, Annan Li

Abstract: Gait recognition has attracted increasing attention from academia and industry as a human recognition technology from a distance in non-intrusive ways without requiring cooperation. Although advanced methods have achieved impressive success in lab scenarios, most of them perform poorly in the wild. Recently, some Convolution Neural Networks (ConvNets) based methods have been proposed to address th… ▽ More Gait recognition has attracted increasing attention from academia and industry as a human recognition technology from a distance in non-intrusive ways without requiring cooperation. Although advanced methods have achieved impressive success in lab scenarios, most of them perform poorly in the wild. Recently, some Convolution Neural Networks (ConvNets) based methods have been proposed to address the issue of gait recognition in the wild. However, the temporal receptive field obtained by convolution operations is limited for long gait sequences. If directly replacing convolution blocks with visual transformer blocks, the model may not enhance a local temporal receptive field, which is important for covering a complete gait cycle. To address this issue, we design a Global-Local Temporal Receptive Field Network (GLGait). GLGait employs a Global-Local Temporal Module (GLTM) to establish a global-local temporal receptive field, which mainly consists of a Pseudo Global Temporal Self-Attention (PGTA) and a temporal convolution operation. Specifically, PGTA is used to obtain a pseudo global temporal receptive field with less memory and computation complexity compared with a multi-head self-attention (MHSA). The temporal convolution operation is used to enhance the local temporal receptive field. Besides, it can also aggregate pseudo global temporal receptive field to a true holistic temporal receptive field. Furthermore, we also propose a Center-Augmented Triplet Loss (CTL) in GLGait to reduce the intra-class distance and expand the positive samples in the training stage. Extensive experiments show that our method obtains state-of-the-art results on in-the-wild datasets, $i.e.$, Gait3D and GREW. The code is available at https://github.com/bgdpgz/GLGait. △ Less

Submitted 13 August, 2024; originally announced August 2024.

Comments: Accepted by ACM MM2024
arXiv:2408.06567 [pdf, other]

cs.CL cs.AI

AquilaMoE: Efficient Training for MoE Models with Scale-Up and Scale-Out Strategies

Authors: Bo-Wen Zhang, Liangdong Wang, Ye Yuan, Jijie Li, Shuhao Gu, Mengdi Zhao, Xinya Wu, Guang Liu, Chengwei Wu, Hanyu Zhao, Li Du, Yiming Ju, Quanyue Ma, Yulong Ao, Yingli Zhao, Songhe Zhu, Zhou Cao, Dong Liang, Yonghua Lin, Ming Zhang, Shunfei Wang, Yanxin Zhou, Min Ye, Xuekai Chen, Xinyang Yu , et al. (2 additional authors not shown)

Abstract: In recent years, with the rapid application of large language models across various fields, the scale of these models has gradually increased, and the resources required for their pre-training have grown exponentially. Training an LLM from scratch will cost a lot of computation resources while scaling up from a smaller model is a more efficient approach and has thus attracted significant attention… ▽ More In recent years, with the rapid application of large language models across various fields, the scale of these models has gradually increased, and the resources required for their pre-training have grown exponentially. Training an LLM from scratch will cost a lot of computation resources while scaling up from a smaller model is a more efficient approach and has thus attracted significant attention. In this paper, we present AquilaMoE, a cutting-edge bilingual 8*16B Mixture of Experts (MoE) language model that has 8 experts with 16 billion parameters each and is developed using an innovative training methodology called EfficientScale. This approach optimizes performance while minimizing data requirements through a two-stage process. The first stage, termed Scale-Up, initializes the larger model with weights from a pre-trained smaller model, enabling substantial knowledge transfer and continuous pretraining with significantly less data. The second stage, Scale-Out, uses a pre-trained dense model to initialize the MoE experts, further enhancing knowledge transfer and performance. Extensive validation experiments on 1.8B and 7B models compared various initialization schemes, achieving models that maintain and reduce loss during continuous pretraining. Utilizing the optimal scheme, we successfully trained a 16B model and subsequently the 8*16B AquilaMoE model, demonstrating significant improvements in performance and training efficiency. △ Less

Submitted 12 August, 2024; originally announced August 2024.
arXiv:2408.06550 [pdf, other]

cs.HC

Stretch or Vibrate? Rendering Spatial Information of Static and Moving Objects in VR via Haptic Feedback for Blind People

Authors: Jiasheng Li, Zining Zhang, Zeyu Yan, Yuhang Zhao, Huaishu Peng

Abstract: Perceiving spatial information of a virtual object (e.g., direction, distance) is critical yet challenging for blind users seeking an immersive virtual reality experience. To facilitate VR accessibility for blind users, in this paper, we investigate the effectiveness of two types of haptic cues--vibrotactile and skin-stretch cues--in conveying the spatial information of a virtual object when appli… ▽ More Perceiving spatial information of a virtual object (e.g., direction, distance) is critical yet challenging for blind users seeking an immersive virtual reality experience. To facilitate VR accessibility for blind users, in this paper, we investigate the effectiveness of two types of haptic cues--vibrotactile and skin-stretch cues--in conveying the spatial information of a virtual object when applied to the dorsal side of a blind user's hand. We conducted a user study with 10 blind users to investigate how they perceive static and moving objects in VR with a custom-made haptic apparatus. Our results reveal that blind users can more accurately understand an object's location and movement when receiving skin-stretch cues, as opposed to vibrotactile cues. We discuss the pros and cons of both types of haptic cues and conclude with design recommendations for future haptic solutions for VR accessibility. △ Less

Submitted 12 August, 2024; originally announced August 2024.
arXiv:2408.06019 [pdf, other]

cs.CV

HeadGAP: Few-shot 3D Head Avatar via Generalizable Gaussian Priors

Authors: Xiaozheng Zheng, Chao Wen, Zhaohu Li, Weiyi Zhang, Zhuo Su, Xu Chang, Yang Zhao, Zheng Lv, Xiaoyuan Zhang, Yongjie Zhang, Guidong Wang, Lan Xu

Abstract: In this paper, we present a novel 3D head avatar creation approach capable of generalizing from few-shot in-the-wild data with high-fidelity and animatable robustness. Given the underconstrained nature of this problem, incorporating prior knowledge is essential. Therefore, we propose a framework comprising prior learning and avatar creation phases. The prior learning phase leverages 3D head priors… ▽ More In this paper, we present a novel 3D head avatar creation approach capable of generalizing from few-shot in-the-wild data with high-fidelity and animatable robustness. Given the underconstrained nature of this problem, incorporating prior knowledge is essential. Therefore, we propose a framework comprising prior learning and avatar creation phases. The prior learning phase leverages 3D head priors derived from a large-scale multi-view dynamic dataset, and the avatar creation phase applies these priors for few-shot personalization. Our approach effectively captures these priors by utilizing a Gaussian Splatting-based auto-decoder network with part-based dynamic modeling. Our method employs identity-shared encoding with personalized latent codes for individual identities to learn the attributes of Gaussian primitives. During the avatar creation phase, we achieve fast head avatar personalization by leveraging inversion and fine-tuning strategies. Extensive experiments demonstrate that our model effectively exploits head priors and successfully generalizes them to few-shot personalization, achieving photo-realistic rendering quality, multi-view consistency, and stable animation. △ Less

Submitted 12 August, 2024; originally announced August 2024.

Comments: Project page: https://headgap.github.io/
arXiv:2408.05758 [pdf, other]

eess.AS cs.AI cs.CL cs.SD

VQ-CTAP: Cross-Modal Fine-Grained Sequence Representation Learning for Speech Processing

Authors: Chunyu Qiang, Wang Geng, Yi Zhao, Ruibo Fu, Tao Wang, Cheng Gong, Tianrui Wang, Qiuyu Liu, Jiangyan Yi, Zhengqi Wen, Chen Zhang, Hao Che, Longbiao Wang, Jianwu Dang, Jianhua Tao

Abstract: Deep learning has brought significant improvements to the field of cross-modal representation learning. For tasks such as text-to-speech (TTS), voice conversion (VC), and automatic speech recognition (ASR), a cross-modal fine-grained (frame-level) sequence representation is desired, emphasizing the semantic content of the text modality while de-emphasizing the paralinguistic information of the spe… ▽ More Deep learning has brought significant improvements to the field of cross-modal representation learning. For tasks such as text-to-speech (TTS), voice conversion (VC), and automatic speech recognition (ASR), a cross-modal fine-grained (frame-level) sequence representation is desired, emphasizing the semantic content of the text modality while de-emphasizing the paralinguistic information of the speech modality. We propose a method called "Vector Quantized Contrastive Token-Acoustic Pre-training (VQ-CTAP)", which uses the cross-modal aligned sequence transcoder to bring text and speech into a joint multimodal space, learning how to connect text and speech at the frame level. The proposed VQ-CTAP is a paradigm for cross-modal sequence representation learning, offering a promising solution for fine-grained generation and recognition tasks in speech processing. The VQ-CTAP can be directly applied to VC and ASR tasks without fine-tuning or additional structures. We propose a sequence-aware semantic connector, which connects multiple frozen pre-trained modules for the TTS task, exhibiting a plug-and-play capability. We design a stepping optimization strategy to ensure effective model convergence by gradually injecting and adjusting the influence of various loss components. Furthermore, we propose a semantic-transfer-wise paralinguistic consistency loss to enhance representational capabilities, allowing the model to better generalize to unseen data and capture the nuances of paralinguistic information. In addition, VQ-CTAP achieves high-compression speech coding at a rate of 25Hz from 24kHz input waveforms, which is a 960-fold reduction in the sampling rate. The audio demo is available at https://qiangchunyu.github.io/VQCTAP/ △ Less

Submitted 11 August, 2024; originally announced August 2024.
arXiv:2408.05702 [pdf, other]

cs.LG eess.SY nlin.CD

Predicting Chaotic System Behavior using Machine Learning Techniques

Authors: Huaiyuan Rao, Yichen Zhao, Qiang Lai

Abstract: Recently, machine learning techniques, particularly deep learning, have demonstrated superior performance over traditional time series forecasting methods across various applications, including both single-variable and multi-variable predictions. This study aims to investigate the capability of i) Next Generation Reservoir Computing (NG-RC) ii) Reservoir Computing (RC) iii) Long short-term Memory… ▽ More Recently, machine learning techniques, particularly deep learning, have demonstrated superior performance over traditional time series forecasting methods across various applications, including both single-variable and multi-variable predictions. This study aims to investigate the capability of i) Next Generation Reservoir Computing (NG-RC) ii) Reservoir Computing (RC) iii) Long short-term Memory (LSTM) for predicting chaotic system behavior, and to compare their performance in terms of accuracy, efficiency, and robustness. These methods are applied to predict time series obtained from four representative chaotic systems including Lorenz, Rössler, Chen, Qi systems. In conclusion, we found that NG-RC is more computationally efficient and offers greater potential for predicting chaotic system behavior. △ Less

Submitted 11 August, 2024; originally announced August 2024.

Comments: 8 pages, 15 figures
arXiv:2408.05686 [pdf, other]

cs.LG cs.MA

The Bandit Whisperer: Communication Learning for Restless Bandits

Authors: Yunfan Zhao, Tonghan Wang, Dheeraj Nagaraj, Aparna Taneja, Milind Tambe

Abstract: Applying Reinforcement Learning (RL) to Restless Multi-Arm Bandits (RMABs) offers a promising avenue for addressing allocation problems with resource constraints and temporal dynamics. However, classic RMAB models largely overlook the challenges of (systematic) data errors - a common occurrence in real-world scenarios due to factors like varying data collection protocols and intentional noise for… ▽ More Applying Reinforcement Learning (RL) to Restless Multi-Arm Bandits (RMABs) offers a promising avenue for addressing allocation problems with resource constraints and temporal dynamics. However, classic RMAB models largely overlook the challenges of (systematic) data errors - a common occurrence in real-world scenarios due to factors like varying data collection protocols and intentional noise for differential privacy. We demonstrate that conventional RL algorithms used to train RMABs can struggle to perform well in such settings. To solve this problem, we propose the first communication learning approach in RMABs, where we study which arms, when involved in communication, are most effective in mitigating the influence of such systematic data errors. In our setup, the arms receive Q-function parameters from similar arms as messages to guide behavioral policies, steering Q-function updates. We learn communication strategies by considering the joint utility of messages across all pairs of arms and using a Q-network architecture that decomposes the joint utility. Both theoretical and empirical evidence validate the effectiveness of our method in significantly improving RMAB performance across diverse problems. △ Less

Submitted 10 August, 2024; originally announced August 2024.
arXiv:2408.05517 [pdf, other]

cs.CL

SWIFT:A Scalable lightWeight Infrastructure for Fine-Tuning

Authors: Yuze Zhao, Jintao Huang, Jinghan Hu, Xingjun Wang, Yunlin Mao, Daoze Zhang, Zeyinzi Jiang, Zhikai Wu, Baole Ai, Ang Wang, Wenmeng Zhou, Yingda Chen

Abstract: Recent development in Large Language Models (LLMs) and Multi-modal Large Language Models (MLLMs) have leverage Attention-based Transformer architectures and achieved superior performance and generalization capabilities. They have since covered extensive areas of traditional learning tasks. For instance, text-based tasks such as text-classification and sequence-labeling, as well as multi-modal task… ▽ More Recent development in Large Language Models (LLMs) and Multi-modal Large Language Models (MLLMs) have leverage Attention-based Transformer architectures and achieved superior performance and generalization capabilities. They have since covered extensive areas of traditional learning tasks. For instance, text-based tasks such as text-classification and sequence-labeling, as well as multi-modal tasks like Visual Question Answering (VQA) and Optical Character Recognition (OCR), which were previously addressed using different models, can now be tackled based on one foundation model. Consequently, the training and lightweight fine-tuning of LLMs and MLLMs, especially those based on Transformer architecture, has become particularly important. In recognition of these overwhelming needs, we develop SWIFT, a customizable one-stop infrastructure for large models. With support of over $300+$ LLMs and $50+$ MLLMs, SWIFT stands as the open-source framework that provide the most comprehensive support for fine-tuning large models. In particular, it is the first training framework that provides systematic support for MLLMs. In addition to the core functionalities of fine-tuning, SWIFT also integrates post-training processes such as inference, evaluation, and model quantization, to facilitate fast adoptions of large models in various application scenarios. With a systematic integration of various training techniques, SWIFT offers helpful utilities such as benchmark comparisons among different training techniques for large models. For fine-tuning models specialized in agent framework, we show that notable improvements on the ToolBench leader-board can be achieved by training with customized dataset on SWIFT, with an increase of 5.2%-21.8% in the Act.EM metric over various baseline models, a reduction in hallucination by 1.6%-14.1%, and an average performance improvement of 8%-17%. △ Less

Submitted 18 August, 2024; v1 submitted 10 August, 2024; originally announced August 2024.
arXiv:2408.05435 [pdf, other]

quant-ph cs.LG

SuperEncoder: Towards Universal Neural Approximate Quantum State Preparation

Authors: Yilun Zhao, Bingmeng Wang, Wenle Jiang, Xiwei Pan, Bing Li, Yinhe Han, Ying Wang

Abstract: Numerous quantum algorithms operate under the assumption that classical data has already been converted into quantum states, a process termed Quantum State Preparation (QSP). However, achieving precise QSP requires a circuit depth that scales exponentially with the number of qubits, making it a substantial obstacle in harnessing quantum advantage. Recent research suggests using a Parameterized Qua… ▽ More Numerous quantum algorithms operate under the assumption that classical data has already been converted into quantum states, a process termed Quantum State Preparation (QSP). However, achieving precise QSP requires a circuit depth that scales exponentially with the number of qubits, making it a substantial obstacle in harnessing quantum advantage. Recent research suggests using a Parameterized Quantum Circuit (PQC) to approximate a target state, offering a more scalable solution with reduced circuit depth compared to precise QSP. Despite this, the need for iterative updates of circuit parameters results in a lengthy runtime, limiting its practical application. In this work, we demonstrate that it is possible to leverage a pre-trained neural network to directly generate the QSP circuit for arbitrary quantum state, thereby eliminating the significant overhead of online iterations. Our study makes a steady step towards a universal neural designer for approximate QSP. △ Less

Submitted 10 August, 2024; originally announced August 2024.
arXiv:2408.05307 [pdf]

cs.CE cs.LG

Audio-visual cross-modality knowledge transfer for machine learning-based in-situ monitoring in laser additive manufacturing

Authors: Jiarui Xie, Mutahar Safdar, Lequn Chen, Seung Ki Moon, Yaoyao Fiona Zhao

Abstract: Various machine learning (ML)-based in-situ monitoring systems have been developed to detect laser additive manufacturing (LAM) process anomalies and defects. Multimodal fusion can improve in-situ monitoring performance by acquiring and integrating data from multiple modalities, including visual and audio data. However, multimodal fusion employs multiple sensors of different types, which leads to… ▽ More Various machine learning (ML)-based in-situ monitoring systems have been developed to detect laser additive manufacturing (LAM) process anomalies and defects. Multimodal fusion can improve in-situ monitoring performance by acquiring and integrating data from multiple modalities, including visual and audio data. However, multimodal fusion employs multiple sensors of different types, which leads to higher hardware, computational, and operational costs. This paper proposes a cross-modality knowledge transfer (CMKT) methodology that transfers knowledge from a source to a target modality for LAM in-situ monitoring. CMKT enhances the usefulness of the features extracted from the target modality during the training phase and removes the sensors of the source modality during the prediction phase. This paper proposes three CMKT methods: semantic alignment, fully supervised mapping, and semi-supervised mapping. Semantic alignment establishes a shared encoded space between modalities to facilitate knowledge transfer. It utilizes a semantic alignment loss to align the distributions of the same classes (e.g., visual defective and audio defective classes) and a separation loss to separate the distributions of different classes (e.g., visual defective and audio defect-free classes). The two mapping methods transfer knowledge by deriving the features of one modality from the other modality using fully supervised and semi-supervised learning. The proposed CMKT methods were implemented and compared with multimodal audio-visual fusion in an LAM in-situ anomaly detection case study. The semantic alignment method achieves a 98.4% accuracy while removing the audio modality during the prediction phase, which is comparable to the accuracy of multimodal fusion (98.2%). △ Less

Submitted 9 August, 2024; originally announced August 2024.

Comments: 36 pages, 12 figures, 6 tables
arXiv:2408.05117 [pdf, other]

eess.IV cs.AI cs.CV

Beyond the Eye: A Relational Model for Early Dementia Detection Using Retinal OCTA Images

Authors: Shouyue Liu, Jinkui Hao, Yonghuai Liu, Huazhu Fu, Xinyu Guo, Shuting Zhang, Yitian Zhao

Abstract: Early detection of dementia, such as Alzheimer's disease (AD) or mild cognitive impairment (MCI), is essential to enable timely intervention and potential treatment. Accurate detection of AD/MCI is challenging due to the high complexity, cost, and often invasive nature of current diagnostic techniques, which limit their suitability for large-scale population screening. Given the shared embryologic… ▽ More Early detection of dementia, such as Alzheimer's disease (AD) or mild cognitive impairment (MCI), is essential to enable timely intervention and potential treatment. Accurate detection of AD/MCI is challenging due to the high complexity, cost, and often invasive nature of current diagnostic techniques, which limit their suitability for large-scale population screening. Given the shared embryological origins and physiological characteristics of the retina and brain, retinal imaging is emerging as a potentially rapid and cost-effective alternative for the identification of individuals with or at high risk of AD. In this paper, we present a novel PolarNet+ that uses retinal optical coherence tomography angiography (OCTA) to discriminate early-onset AD (EOAD) and MCI subjects from controls. Our method first maps OCTA images from Cartesian coordinates to polar coordinates, allowing approximate sub-region calculation to implement the clinician-friendly early treatment of diabetic retinopathy study (ETDRS) grid analysis. We then introduce a multi-view module to serialize and analyze the images along three dimensions for comprehensive, clinically useful information extraction. Finally, we abstract the sequence embedding into a graph, transforming the detection task into a general graph classification problem. A regional relationship module is applied after the multi-view module to excavate the relationship between the sub-regions. Such regional relationship analyses validate known eye-brain links and reveal new discriminative patterns. △ Less

Submitted 9 August, 2024; originally announced August 2024.
arXiv:2408.04863 [pdf, other]

cs.SE

Coding-PTMs: How to Find Optimal Code Pre-trained Models for Code Embedding in Vulnerability Detection?

Authors: Yu Zhao, Lina Gong, Zhiqiu Huang, Yongwei Wang, Mingqiang Wei, Fei Wu

Abstract: Vulnerability detection is garnering increasing attention in software engineering, since code vulnerabilities possibly pose significant security. Recently, reusing various code pre-trained models has become common for code embedding without providing reasonable justifications in vulnerability detection. The premise for casually utilizing pre-trained models (PTMs) is that the code embeddings genera… ▽ More Vulnerability detection is garnering increasing attention in software engineering, since code vulnerabilities possibly pose significant security. Recently, reusing various code pre-trained models has become common for code embedding without providing reasonable justifications in vulnerability detection. The premise for casually utilizing pre-trained models (PTMs) is that the code embeddings generated by different PTMs would generate a similar impact on the performance. Is that TRUE? To answer this important question, we systematically investigate the effects of code embedding generated by ten different code PTMs on the performance of vulnerability detection, and get the answer, i.e., that is NOT true. We observe that code embedding generated by various code PTMs can indeed influence the performance and selecting an embedding technique based on parameter scales and embedding dimension is not reliable. Our findings highlight the necessity of quantifying and evaluating the characteristics of code embedding generated by various code PTMs to understand the effects. To achieve this goal, we analyze the numerical representation and data distribution of code embedding generated by different PTMs to evaluate differences and characteristics. Based on these insights, we propose Coding-PTMs, a recommendation framework to assist engineers in selecting optimal code PTMs for their specific vulnerability detection tasks. Specifically, we define thirteen code embedding metrics across three dimensions (i.e., statistics, norm, and distribution) for constructing a specialized code PTM recommendation dataset. We then employ a Random Forest classifier to train a recommendation model and identify the optimal code PTMs from the candidate model zoo. △ Less

Submitted 9 August, 2024; originally announced August 2024.

Comments: Accepted by ASE 2024
arXiv:2408.03284 [pdf, other]

cs.CV cs.GR cs.MM

ReSyncer: Rewiring Style-based Generator for Unified Audio-Visually Synced Facial Performer

Authors: Jiazhi Guan, Zhiliang Xu, Hang Zhou, Kaisiyuan Wang, Shengyi He, Zhanwang Zhang, Borong Liang, Haocheng Feng, Errui Ding, Jingtuo Liu, Jingdong Wang, Youjian Zhao, Ziwei Liu

Abstract: Lip-syncing videos with given audio is the foundation for various applications including the creation of virtual presenters or performers. While recent studies explore high-fidelity lip-sync with different techniques, their task-orientated models either require long-term videos for clip-specific training or retain visible artifacts. In this paper, we propose a unified and effective framework ReSyn… ▽ More Lip-syncing videos with given audio is the foundation for various applications including the creation of virtual presenters or performers. While recent studies explore high-fidelity lip-sync with different techniques, their task-orientated models either require long-term videos for clip-specific training or retain visible artifacts. In this paper, we propose a unified and effective framework ReSyncer, that synchronizes generalized audio-visual facial information. The key design is revisiting and rewiring the Style-based generator to efficiently adopt 3D facial dynamics predicted by a principled style-injected Transformer. By simply re-configuring the information insertion mechanisms within the noise and style space, our framework fuses motion and appearance with unified training. Extensive experiments demonstrate that ReSyncer not only produces high-fidelity lip-synced videos according to audio, but also supports multiple appealing properties that are suitable for creating virtual presenters and performers, including fast personalized fine-tuning, video-driven lip-syncing, the transfer of speaking styles, and even face swapping. Resources can be found at https://guanjz20.github.io/projects/ReSyncer. △ Less

Submitted 6 August, 2024; originally announced August 2024.

Comments: Accepted to European Conference on Computer Vision (ECCV), 2024. Project page: https://guanjz20.github.io/projects/ReSyncer
arXiv:2408.02993 [pdf, other]

cs.CV

DreamLCM: Towards High-Quality Text-to-3D Generation via Latent Consistency Model

Authors: Yiming Zhong, Xiaolin Zhang, Yao Zhao, Yunchao Wei

Abstract: Recently, the text-to-3D task has developed rapidly due to the appearance of the SDS method. However, the SDS method always generates 3D objects with poor quality due to the over-smooth issue. This issue is attributed to two factors: 1) the DDPM single-step inference produces poor guidance gradients; 2) the randomness from the input noises and timesteps averages the details of the 3D contents. In… ▽ More Recently, the text-to-3D task has developed rapidly due to the appearance of the SDS method. However, the SDS method always generates 3D objects with poor quality due to the over-smooth issue. This issue is attributed to two factors: 1) the DDPM single-step inference produces poor guidance gradients; 2) the randomness from the input noises and timesteps averages the details of the 3D contents. In this paper, to address the issue, we propose DreamLCM which incorporates the Latent Consistency Model (LCM). DreamLCM leverages the powerful image generation capabilities inherent in LCM, enabling generating consistent and high-quality guidance, i.e., predicted noises or images. Powered by the improved guidance, the proposed method can provide accurate and detailed gradients to optimize the target 3D models. In addition, we propose two strategies to enhance the generation quality further. Firstly, we propose a guidance calibration strategy, utilizing Euler Solver to calibrate the guidance distribution to accelerate 3D models to converge. Secondly, we propose a dual timestep strategy, increasing the consistency of guidance and optimizing 3D models from geometry to appearance in DreamLCM. Experiments show that DreamLCM achieves state-of-the-art results in both generation quality and training efficiency. The code is available at https://github.com/1YimingZhong/DreamLCM. △ Less

Submitted 9 August, 2024; v1 submitted 6 August, 2024; originally announced August 2024.

Comments: 15 pages, 9 figures, ACM MM 2024
arXiv:2408.02705 [pdf, other]

cs.LG cs.AI

PSNE: Efficient Spectral Sparsification Algorithms for Scaling Network Embedding

Authors: Longlong Lin, Yunfeng Yu, Zihao Wang, Zeli Wang, Yuying Zhao, Jin Zhao, Tao Jia

Abstract: Network embedding has numerous practical applications and has received extensive attention in graph learning, which aims at mapping vertices into a low-dimensional and continuous dense vector space by preserving the underlying structural properties of the graph. Many network embedding methods have been proposed, among which factorization of the Personalized PageRank (PPR for short) matrix has been… ▽ More Network embedding has numerous practical applications and has received extensive attention in graph learning, which aims at mapping vertices into a low-dimensional and continuous dense vector space by preserving the underlying structural properties of the graph. Many network embedding methods have been proposed, among which factorization of the Personalized PageRank (PPR for short) matrix has been empirically and theoretically well supported recently. However, several fundamental issues cannot be addressed. (1) Existing methods invoke a seminal Local Push subroutine to approximate \textit{a single} row or column of the PPR matrix. Thus, they have to execute $n$ ($n$ is the number of nodes) Local Push subroutines to obtain a provable PPR matrix, resulting in prohibitively high computational costs for large $n$. (2) The PPR matrix has limited power in capturing the structural similarity between vertices, leading to performance degradation. To overcome these dilemmas, we propose PSNE, an efficient spectral s\textbf{P}arsification method for \textbf{S}caling \textbf{N}etwork \textbf{E}mbedding, which can fast obtain the embedding vectors that retain strong structural similarities. Specifically, PSNE first designs a matrix polynomial sparser to accelerate the calculation of the PPR matrix, which has a theoretical guarantee in terms of the Frobenius norm. Subsequently, PSNE proposes a simple but effective multiple-perspective strategy to enhance further the representation power of the obtained approximate PPR matrix. Finally, PSNE applies a randomized singular value decomposition algorithm on the sparse and multiple-perspective PPR matrix to get the target embedding vectors. Experimental evaluation of real-world and synthetic datasets shows that our solutions are indeed more efficient, effective, and scalable compared with ten competitors. △ Less

Submitted 5 August, 2024; originally announced August 2024.
arXiv:2408.02598 [pdf, other]

cs.LG cs.CY

AI-Driven Strategies for Reducing Student Withdrawal -- A Study of EMU Student Stopout

Authors: Yan Zhao, Amy Otteson

Abstract: Not everyone who enrolls in college will leave with a certificate or degree, but the number of people who drop out or take a break is much higher than experts previously believed. In December 2013, there were 29 million people with some college education but no degree. That number jumped to 36 million by December of 2018, according to a new report from the National Student Clearinghouse Research C… ▽ More Not everyone who enrolls in college will leave with a certificate or degree, but the number of people who drop out or take a break is much higher than experts previously believed. In December 2013, there were 29 million people with some college education but no degree. That number jumped to 36 million by December of 2018, according to a new report from the National Student Clearinghouse Research Center[1]. It is imperative to understand the underlying factors contributing to student withdrawal and to assist decision-makers to identify effective strategies to prevent it. By analyzing the characteristics and educational pathways of the stopout student population, our aim is to provide actionable insights that can benefit institutions facing similar challenges. Eastern Michigan University (EMU) faces significant challenges in student retention, with approximately 55% of its undergraduate students not completing their degrees within six years. As an institution committed to student success, EMU conducted a comprehensive study of student withdrawals to understand the influencing factors. And the paper revealed a high correlation between certain factors and withdrawals, even in the early stages of university attendance. Based on these findings, we developed a predictive model that employs artificial intelligence techniques to assess the potential risk that students abandon their studies. These models enable universities to implement early intervention strategies, support at-risk students, and improve overall higher education success. △ Less

Submitted 5 August, 2024; originally announced August 2024.

Comments: 6 pages, 5 figures
arXiv:2408.02446 [pdf, other]

cs.IT

Second 6G life Workshop on Post Shannon Theory

Authors: Yaning Zhao, Christian Deppe

Abstract: The one-day workshop, held prior to the "ZIF Workshop on Information Theory and Related Fields", provided an excellent opportunity for in-depth discussions on several topics within the field of post-Shannon theory. The agenda covered deterministic and randomized identification, focusing on various methods and algorithms for identifying data or signals deterministically and through randomized proce… ▽ More The one-day workshop, held prior to the "ZIF Workshop on Information Theory and Related Fields", provided an excellent opportunity for in-depth discussions on several topics within the field of post-Shannon theory. The agenda covered deterministic and randomized identification, focusing on various methods and algorithms for identifying data or signals deterministically and through randomized processes. It explored the theoretical foundations and practical applications of these techniques. The session on resources for increasing identification capacity examined the different resources and strategies that can be utilized to boost the capacity for identifying information. This included discussions on both hardware and software solutions, as well as innovative approaches to resource allocation and optimization. Participants delved into common randomness generation, essential for various cryptographic protocols and communication systems. The session highlighted recent advancements and practical implementations of common randomness in secure communications. The workshop concluded with a detailed look at the development and practical deployment of identification codes. Experts shared insights on code construction techniques, implementation challenges, and real-world applications in various communication systems. We extend our thanks to the esteemed speakers for their valuable contributions: Caspar von Lengerke, Wafa Labidi, Ilya Vorobyev, Johannes Rosenberger, Jonathan Huffmann, and Pau Colomer. Their presentations and insights significantly enriched the workshop. Additionally, we are grateful to all the participants whose active engagement, constructive comments, and stimulating discussions made the event a success. Your involvement was crucial in fostering a collaborative and intellectually vibrant environment. △ Less

Submitted 5 August, 2024; originally announced August 2024.
arXiv:2408.01906 [pdf, ps, other]

cs.IT

Binary $[n,(n\pm1)/2]$ cyclic codes with good minimum distances from sequences

Authors: Xianhong Xie, Yaxin Zhao, Zhonghua Sun, Xiaobo Zhou

Abstract: Recently, binary cyclic codes with parameters $[n,(n\pm1)/2,\geq \sqrt{n}]$ have been a hot topic since their minimum distances have a square-root bound. In this paper, we construct four classes of binary cyclic codes $\mathcal{C}_{\mathcal{S},0}$, $\mathcal{C}_{\mathcal{S},1}$ and $\mathcal{C}_{\mathcal{D},0}$, $\mathcal{C}_{\mathcal{D},1}$ by using two families of sequences, and obtain some code… ▽ More Recently, binary cyclic codes with parameters $[n,(n\pm1)/2,\geq \sqrt{n}]$ have been a hot topic since their minimum distances have a square-root bound. In this paper, we construct four classes of binary cyclic codes $\mathcal{C}_{\mathcal{S},0}$, $\mathcal{C}_{\mathcal{S},1}$ and $\mathcal{C}_{\mathcal{D},0}$, $\mathcal{C}_{\mathcal{D},1}$ by using two families of sequences, and obtain some codes with parameters $[n,(n\pm1)/2,\geq \sqrt{n}]$. For $m\equiv2\pmod4$, the code $\mathcal{C}_{\mathcal{S},0}$ has parameters $[2^m-1,2^{m-1},\geq2^{\frac{m}{2}}+2]$, and the code $\mathcal{C}_{\mathcal{D},0}$ has parameters $[2^m-1,2^{m-1},\geq2^{\frac{m}{2}}+2]$ if $h=1$ and $[2^m-1,2^{m-1},\geq2^{\frac{m}{2}}]$ if $h=2$. △ Less

Submitted 3 August, 2024; originally announced August 2024.
arXiv:2408.01840 [pdf, other]

cs.CV

E$^3$NeRF: Efficient Event-Enhanced Neural Radiance Fields from Blurry Images

Authors: Yunshan Qi, Jia Li, Yifan Zhao, Yu Zhang, Lin Zhu

Abstract: Neural Radiance Fields (NeRF) achieve impressive rendering performance by learning volumetric 3D representation from several images of different views. However, it is difficult to reconstruct a sharp NeRF from blurry input as it often occurs in the wild. To solve this problem, we propose a novel Efficient Event-Enhanced NeRF (E$^3$NeRF) by utilizing the combination of RGB images and event streams.… ▽ More Neural Radiance Fields (NeRF) achieve impressive rendering performance by learning volumetric 3D representation from several images of different views. However, it is difficult to reconstruct a sharp NeRF from blurry input as it often occurs in the wild. To solve this problem, we propose a novel Efficient Event-Enhanced NeRF (E$^3$NeRF) by utilizing the combination of RGB images and event streams. To effectively introduce event streams into the neural volumetric representation learning process, we propose an event-enhanced blur rendering loss and an event rendering loss, which guide the network via modeling the real blur process and event generation process, respectively. Specifically, we leverage spatial-temporal information from the event stream to evenly distribute learning attention over temporal blur while simultaneously focusing on blurry texture through the spatial attention. Moreover, a camera pose estimation framework for real-world data is built with the guidance of the events to generalize the method to practical applications. Compared to previous image-based or event-based NeRF, our framework makes more profound use of the internal relationship between events and images. Extensive experiments on both synthetic data and real-world data demonstrate that E$^3$NeRF can effectively learn a sharp NeRF from blurry images, especially in non-uniform motion and low-light scenes. △ Less

Submitted 3 August, 2024; originally announced August 2024.
arXiv:2408.01687 [pdf, other]

cs.SE

Voices from the Frontier: A Comprehensive Analysis of the OpenAI Developer Forum

Authors: Xinyi Hou, Yanjie Zhao, Haoyu Wang

Abstract: OpenAI's advanced large language models (LLMs) have revolutionized natural language processing and enabled developers to create innovative applications. As adoption grows, understanding the experiences and challenges of developers working with these technologies is crucial. This paper presents a comprehensive analysis of the OpenAI Developer Forum, focusing on (1) popularity trends and user engage… ▽ More OpenAI's advanced large language models (LLMs) have revolutionized natural language processing and enabled developers to create innovative applications. As adoption grows, understanding the experiences and challenges of developers working with these technologies is crucial. This paper presents a comprehensive analysis of the OpenAI Developer Forum, focusing on (1) popularity trends and user engagement patterns, and (2) a taxonomy of challenges and concerns faced by developers. We first employ a quantitative analysis of the metadata from 29,576 forum topics, investigating temporal trends in topic creation, the popularity of topics across different categories, and user contributions at various trust levels. We then qualitatively analyze content from 9,301 recently active topics on developer concerns. From a sample of 886 topics, we construct a taxonomy of concerns in the OpenAI Developer Forum. Our findings uncover critical concerns raised by developers in creating AI-powered applications and offer targeted recommendations to address them. This work not only advances AI-assisted software engineering but also empowers developer communities to shape the responsible evolution and integration of AI technology in society. △ Less

Submitted 3 August, 2024; originally announced August 2024.
arXiv:2408.01544 [pdf, other]

cs.LG stat.AP

Momentum Capture and Prediction System Based on Wimbledon Open2023 Tournament Data

Authors: Chang Liu, Tongyuan Yang, Yan Zhao

Abstract: There is a hidden energy in tennis, which cannot be seen or touched. It is the force that controls the flow of the game and is present in all types of matches. This mysterious force is Momentum. This study introduces an evaluation model that synergizes the Entropy Weight Method (EWM) and Gray Relation Analysis (GRA) to quantify momentum's impact on match outcomes. Empirical validation was conducte… ▽ More There is a hidden energy in tennis, which cannot be seen or touched. It is the force that controls the flow of the game and is present in all types of matches. This mysterious force is Momentum. This study introduces an evaluation model that synergizes the Entropy Weight Method (EWM) and Gray Relation Analysis (GRA) to quantify momentum's impact on match outcomes. Empirical validation was conducted through Mann-Whitney U and Kolmogorov-Smirnov tests, which yielded p values of 0.0043 and 0.00128,respectively. These results underscore the non-random association between momentum shifts and match outcomes, highlighting the critical role of momentum in tennis. Otherwise, our investigation foucus is the creation of a predictive model that combines the advanced machine learning algorithm XGBoost with the SHAP framework. This model enables precise predictions of match swings with exceptional accuracy (0.999013 for multiple matches and 0.992738 for finals). The model's ability to identify the influence of specific factors on match dynamics,such as bilateral distance run during points, demonstrates its prowess.The model's generalizability was thoroughly evaluated using datasets from the four Grand Slam tournaments. The results demonstrate its remarkable adaptability to different match scenarios,despite minor variations in predictive accuracy. It offers strategic insights that can help players effectively respond to opponents' shifts in momentum,enhancing their competitive edge. △ Less

Submitted 2 August, 2024; originally announced August 2024.
arXiv:2408.00744 [pdf, other]

cs.CV

Collaborative Vision-Text Representation Optimizing for Open-Vocabulary Segmentation

Authors: Siyu Jiao, Hongguang Zhu, Jiannan Huang, Yao Zhao, Yunchao Wei, Humphrey Shi

Abstract: Pre-trained vision-language models, e.g. CLIP, have been increasingly used to address the challenging Open-Vocabulary Segmentation (OVS) task, benefiting from their well-aligned vision-text embedding space. Typical solutions involve either freezing CLIP during training to unilaterally maintain its zero-shot capability, or fine-tuning CLIP vision encoder to achieve perceptual sensitivity to local r… ▽ More Pre-trained vision-language models, e.g. CLIP, have been increasingly used to address the challenging Open-Vocabulary Segmentation (OVS) task, benefiting from their well-aligned vision-text embedding space. Typical solutions involve either freezing CLIP during training to unilaterally maintain its zero-shot capability, or fine-tuning CLIP vision encoder to achieve perceptual sensitivity to local regions. However, few of them incorporate vision-text collaborative optimization. Based on this, we propose the Content-Dependent Transfer to adaptively enhance each text embedding by interacting with the input image, which presents a parameter-efficient way to optimize the text representation. Besides, we additionally introduce a Representation Compensation strategy, reviewing the original CLIP-V representation as compensation to maintain the zero-shot capability of CLIP. In this way, the vision and text representation of CLIP are optimized collaboratively, enhancing the alignment of the vision-text feature space. To the best of our knowledge, we are the first to establish the collaborative vision-text optimizing mechanism within the OVS field. Extensive experiments demonstrate our method achieves superior performance on popular OVS benchmarks. In open-vocabulary semantic segmentation, our method outperforms the previous state-of-the-art approaches by +0.5, +2.3, +3.4, +0.4 and +1.1 mIoU, respectively on A-847, A-150, PC-459, PC-59 and PAS-20. Furthermore, in a panoptic setting on ADE20K, we achieve the performance of 27.1 PQ, 73.5 SQ, and 32.9 RQ. Code will be available at https://github.com/jiaosiyu1999/MAFT-Plus.git . △ Less

Submitted 1 August, 2024; originally announced August 2024.

Comments: ECCV 2024
arXiv:2408.00629 [pdf, other]

cs.CV eess.IV

Empowering Snapshot Compressive Imaging: Spatial-Spectral State Space Model with Across-Scanning and Local Enhancement

Authors: Wenzhe Tian, Haijin Zeng, Yin-Ping Zhao, Yongyong Chen, Zhen Wang, Xuelong Li

Abstract: Snapshot Compressive Imaging (SCI) relies on decoding algorithms such as CNN or Transformer to reconstruct the hyperspectral image (HSI) from its compressed measurement. Although existing CNN and Transformer-based methods have proven effective, CNNs are limited by their inadequate modeling of long-range dependencies, while Transformer ones face high computational costs due to quadratic complexity.… ▽ More Snapshot Compressive Imaging (SCI) relies on decoding algorithms such as CNN or Transformer to reconstruct the hyperspectral image (HSI) from its compressed measurement. Although existing CNN and Transformer-based methods have proven effective, CNNs are limited by their inadequate modeling of long-range dependencies, while Transformer ones face high computational costs due to quadratic complexity. Recent Mamba models have demonstrated superior performance over CNN and Transformer-based architectures in some visual tasks, but these models have not fully utilized the local similarities in both spatial and spectral dimensions. Moreover, the long-sequence modeling capability of SSM may offer an advantage in processing the numerous spectral bands for HSI reconstruction, which has not yet been explored. In this paper, we introduce a State Space Model with Across-Scanning and Local Enhancement, named ASLE-SSM, that employs a Spatial-Spectral SSM for global-local balanced context encoding and cross-channel interaction promoting. Specifically, we introduce local scanning in the spatial dimension to balance the global and local receptive fields, and then propose our across-scanning method based on spatial-spectral local cubes to leverage local similarities between adjacent spectral bands and pixels to guide the reconstruction process. These two scanning mechanisms extract the HSI's local features while balancing the global perspective without any additional costs. Experimental results illustrate ASLE-SSM's superiority over existing state-of-the-art methods, with an inference speed 2.4 times faster than Transformer-based MST and saving 0.12 (M) of parameters, achieving the lowest computational cost and parameter count. △ Less

Submitted 1 August, 2024; originally announced August 2024.

Comments: 12 pages,6 figures
arXiv:2408.00496 [pdf, other]

cs.CV

SegStitch: Multidimensional Transformer for Robust and Efficient Medical Imaging Segmentation

Authors: Shengbo Tan, Zeyu Zhang, Ying Cai, Daji Ergu, Lin Wu, Binbin Hu, Pengzhang Yu, Yang Zhao

Abstract: Medical imaging segmentation plays a significant role in the automatic recognition and analysis of lesions. State-of-the-art methods, particularly those utilizing transformers, have been prominently adopted in 3D semantic segmentation due to their superior performance in scalability and generalizability. However, plain vision transformers encounter challenges due to their neglect of local features… ▽ More Medical imaging segmentation plays a significant role in the automatic recognition and analysis of lesions. State-of-the-art methods, particularly those utilizing transformers, have been prominently adopted in 3D semantic segmentation due to their superior performance in scalability and generalizability. However, plain vision transformers encounter challenges due to their neglect of local features and their high computational complexity. To address these challenges, we introduce three key contributions: Firstly, we proposed SegStitch, an innovative architecture that integrates transformers with denoising ODE blocks. Instead of taking whole 3D volumes as inputs, we adapt axial patches and customize patch-wise queries to ensure semantic consistency. Additionally, we conducted extensive experiments on the BTCV and ACDC datasets, achieving improvements up to 11.48% and 6.71% respectively in mDSC, compared to state-of-the-art methods. Lastly, our proposed method demonstrates outstanding efficiency, reducing the number of parameters by 36.7% and the number of FLOPS by 10.7% compared to UNETR. This advancement holds promising potential for adapting our method to real-world clinical practice. The code will be available at https://github.com/goblin327/SegStitch △ Less

Submitted 1 August, 2024; originally announced August 2024.
arXiv:2408.00083 [pdf, other]

cs.CV

Localized Gaussian Splatting Editing with Contextual Awareness

Authors: Hanyuan Xiao, Yingshu Chen, Huajian Huang, Haolin Xiong, Jing Yang, Pratusha Prasad, Yajie Zhao

Abstract: Recent text-guided generation of individual 3D object has achieved great success using diffusion priors. However, these methods are not suitable for object insertion and replacement tasks as they do not consider the background, leading to illumination mismatches within the environment. To bridge the gap, we introduce an illumination-aware 3D scene editing pipeline for 3D Gaussian Splatting (3DGS)… ▽ More Recent text-guided generation of individual 3D object has achieved great success using diffusion priors. However, these methods are not suitable for object insertion and replacement tasks as they do not consider the background, leading to illumination mismatches within the environment. To bridge the gap, we introduce an illumination-aware 3D scene editing pipeline for 3D Gaussian Splatting (3DGS) representation. Our key observation is that inpainting by the state-of-the-art conditional 2D diffusion model is consistent with background in lighting. To leverage the prior knowledge from the well-trained diffusion models for 3D object generation, our approach employs a coarse-to-fine objection optimization pipeline with inpainted views. In the first coarse step, we achieve image-to-3D lifting given an ideal inpainted view. The process employs 3D-aware diffusion prior from a view-conditioned diffusion model, which preserves illumination present in the conditioning image. To acquire an ideal inpainted image, we introduce an Anchor View Proposal (AVP) algorithm to find a single view that best represents the scene illumination in target region. In the second Texture Enhancement step, we introduce a novel Depth-guided Inpainting Score Distillation Sampling (DI-SDS), which enhances geometry and texture details with the inpainting diffusion prior, beyond the scope of the 3D-aware diffusion prior knowledge in the first coarse step. DI-SDS not only provides fine-grained texture enhancement, but also urges optimization to respect scene lighting. Our approach efficiently achieves local editing with global illumination consistency without explicitly modeling light transport. We demonstrate robustness of our method by evaluating editing in real scenes containing explicit highlight and shadows, and compare against the state-of-the-art text-to-3D editing methods. △ Less

Submitted 31 July, 2024; originally announced August 2024.
arXiv:2407.21293 [pdf, other]

cs.CV cs.AI

SimpleLLM4AD: An End-to-End Vision-Language Model with Graph Visual Question Answering for Autonomous Driving

Authors: Peiru Zheng, Yun Zhao, Zhan Gong, Hong Zhu, Shaohua Wu

Abstract: Many fields could benefit from the rapid development of the large language models (LLMs). The end-to-end autonomous driving (e2eAD) is one of the typically fields facing new opportunities as the LLMs have supported more and more modalities. Here, by utilizing vision-language model (VLM), we proposed an e2eAD method called SimpleLLM4AD. In our method, the e2eAD task are divided into four stages, wh… ▽ More Many fields could benefit from the rapid development of the large language models (LLMs). The end-to-end autonomous driving (e2eAD) is one of the typically fields facing new opportunities as the LLMs have supported more and more modalities. Here, by utilizing vision-language model (VLM), we proposed an e2eAD method called SimpleLLM4AD. In our method, the e2eAD task are divided into four stages, which are perception, prediction, planning, and behavior. Each stage consists of several visual question answering (VQA) pairs and VQA pairs interconnect with each other constructing a graph called Graph VQA (GVQA). By reasoning each VQA pair in the GVQA through VLM stage by stage, our method could achieve e2e driving with language. In our method, vision transformers (ViT) models are employed to process nuScenes visual data, while VLM are utilized to interpret and reason about the information extracted from the visual inputs. In the perception stage, the system identifies and classifies objects from the driving environment. The prediction stage involves forecasting the potential movements of these objects. The planning stage utilizes the gathered information to develop a driving strategy, ensuring the safety and efficiency of the autonomous vehicle. Finally, the behavior stage translates the planned actions into executable commands for the vehicle. Our experiments demonstrate that SimpleLLM4AD achieves competitive performance in complex driving scenarios. △ Less

Submitted 30 July, 2024; originally announced July 2024.

Comments: 16 pages, 3 figures
arXiv:2407.21243 [pdf, other]

cs.LG cs.AI

Informed Correctors for Discrete Diffusion Models

Authors: Yixiu Zhao, Jiaxin Shi, Lester Mackey, Scott Linderman

Abstract: Discrete diffusion modeling is a promising framework for modeling and generating data in discrete spaces. To sample from these models, different strategies present trade-offs between computation and sample quality. A predominant sampling strategy is predictor-corrector $τ$-leaping, which simulates the continuous time generative process with discretized predictor steps and counteracts the accumulat… ▽ More Discrete diffusion modeling is a promising framework for modeling and generating data in discrete spaces. To sample from these models, different strategies present trade-offs between computation and sample quality. A predominant sampling strategy is predictor-corrector $τ$-leaping, which simulates the continuous time generative process with discretized predictor steps and counteracts the accumulation of discretization error via corrector steps. However, for absorbing state diffusion, an important class of discrete diffusion models, the standard forward-backward corrector can be ineffective in fixing such errors, resulting in subpar sample quality. To remedy this problem, we propose a family of informed correctors that more reliably counteracts discretization error by leveraging information learned by the model. For further efficiency gains, we also propose $k$-Gillespie's, a sampling algorithm that better utilizes each model evaluation, while still enjoying the speed and flexibility of $τ$-leaping. Across several real and synthetic datasets, we show that $k$-Gillespie's with informed correctors reliably produces higher quality samples at lower computational cost. △ Less

Submitted 30 July, 2024; originally announced July 2024.
arXiv:2407.21075 [pdf, other]

cs.AI cs.CL cs.LG

Apple Intelligence Foundation Language Models

Authors: Tom Gunter, Zirui Wang, Chong Wang, Ruoming Pang, Andy Narayanan, Aonan Zhang, Bowen Zhang, Chen Chen, Chung-Cheng Chiu, David Qiu, Deepak Gopinath, Dian Ang Yap, Dong Yin, Feng Nan, Floris Weers, Guoli Yin, Haoshuo Huang, Jianyu Wang, Jiarui Lu, John Peebles, Ke Ye, Mark Lee, Nan Du, Qibin Chen, Quentin Keunebroek , et al. (130 additional authors not shown)

Abstract: We present foundation language models developed to power Apple Intelligence features, including a ~3 billion parameter model designed to run efficiently on devices and a large server-based language model designed for Private Cloud Compute. These models are designed to perform a wide range of tasks efficiently, accurately, and responsibly. This report describes the model architecture, the data used… ▽ More We present foundation language models developed to power Apple Intelligence features, including a ~3 billion parameter model designed to run efficiently on devices and a large server-based language model designed for Private Cloud Compute. These models are designed to perform a wide range of tasks efficiently, accurately, and responsibly. This report describes the model architecture, the data used to train the model, the training process, how the models are optimized for inference, and the evaluation results. We highlight our focus on Responsible AI and how the principles are applied throughout the model development. △ Less

Submitted 29 July, 2024; originally announced July 2024.
arXiv:2407.20908 [pdf, other]

cs.CV

Dynamic Scene Understanding through Object-Centric Voxelization and Neural Rendering

Authors: Yanpeng Zhao, Yiwei Hao, Siyu Gao, Yunbo Wang, Xiaokang Yang

Abstract: Learning object-centric representations from unsupervised videos is challenging. Unlike most previous approaches that focus on decomposing 2D images, we present a 3D generative model named DynaVol-S for dynamic scenes that enables object-centric learning within a differentiable volume rendering framework. The key idea is to perform object-centric voxelization to capture the 3D nature of the scene,… ▽ More Learning object-centric representations from unsupervised videos is challenging. Unlike most previous approaches that focus on decomposing 2D images, we present a 3D generative model named DynaVol-S for dynamic scenes that enables object-centric learning within a differentiable volume rendering framework. The key idea is to perform object-centric voxelization to capture the 3D nature of the scene, which infers per-object occupancy probabilities at individual spatial locations. These voxel features evolve through a canonical-space deformation function and are optimized in an inverse rendering pipeline with a compositional NeRF. Additionally, our approach integrates 2D semantic features to create 3D semantic grids, representing the scene through multiple disentangled voxel grids. DynaVol-S significantly outperforms existing models in both novel view synthesis and unsupervised decomposition tasks for dynamic scenes. By jointly considering geometric structures and semantic features, it effectively addresses challenging real-world scenarios involving complex object interactions. Furthermore, once trained, the explicitly meaningful voxel features enable additional capabilities that 2D scene decomposition methods cannot achieve, such as novel scene generation through editing geometric shapes or manipulating the motion trajectories of objects. △ Less

Submitted 30 July, 2024; originally announced July 2024.
arXiv:2407.20600 [pdf, other]

cs.CV

Knowledge Fused Recognition: Fusing Hierarchical Knowledge for Image Recognition through Quantitative Relativity Modeling and Deep Metric Learning

Authors: Yunfeng Zhao, Huiyu Zhou, Fei Wu, Xifeng Wu

Abstract: Image recognition is an essential baseline for deep metric learning. Hierarchical knowledge about image classes depicts inter-class similarities or dissimilarities. Effective fusion of hierarchical knowledge about image classes to enhance image recognition remains a challenging topic to advance. In this paper, we propose a novel deep metric learning based method to effectively fuse hierarchical pr… ▽ More Image recognition is an essential baseline for deep metric learning. Hierarchical knowledge about image classes depicts inter-class similarities or dissimilarities. Effective fusion of hierarchical knowledge about image classes to enhance image recognition remains a challenging topic to advance. In this paper, we propose a novel deep metric learning based method to effectively fuse hierarchical prior knowledge about image classes and enhance image recognition performances in an end-to-end supervised regression manner. Existing deep metric learning incorporated image classification mainly exploits qualitative relativity between image classes, i.e., whether sampled images are from the same class. A new triplet loss function term that exploits quantitative relativity and aligns distances in model latent space with those in knowledge space is also proposed and incorporated in the proposed dual-modality fusion method. Experimental results indicate that the proposed method enhanced image recognition performances and outperformed baseline and existing methods on CIFAR-10, CIFAR-100, Mini-ImageNet, and ImageNet-1K datasets. △ Less

Submitted 30 July, 2024; originally announced July 2024.
arXiv:2407.19941 [pdf, other]

cs.LG

Boosting Graph Foundation Model from Structural Perspective

Authors: Yao Cheng, Yige Zhao, Jianxiang Yu, Xiang Li

Abstract: Graph foundation models have recently attracted significant attention due to its strong generalizability. Although existing methods resort to language models to learn unified semantic representations across domains, they disregard the unique structural characteristics of graphs from different domains. To address the problem, in this paper, we boost graph foundation model from structural perspectiv… ▽ More Graph foundation models have recently attracted significant attention due to its strong generalizability. Although existing methods resort to language models to learn unified semantic representations across domains, they disregard the unique structural characteristics of graphs from different domains. To address the problem, in this paper, we boost graph foundation model from structural perspective and propose BooG. The model constructs virtual super nodes to unify structural characteristics of graph data from different domains. Specifically, the super nodes fuse the information of anchor nodes and class labels, where each anchor node captures the information of a node or a graph instance to be classified. Instead of using the raw graph structure, we connect super nodes to all nodes within their neighborhood by virtual edges. This new structure allows for effective information aggregation while unifying cross-domain structural characteristics. Additionally, we propose a novel pre-training objective based on contrastive learning, which learns more expressive representations for graph data and generalizes effectively to different domains and downstream tasks. Experimental results on various datasets and tasks demonstrate the superior performance of BooG. We provide our code and data here: https://anonymous.4open.science/r/BooG-EE42/. △ Less

Submitted 29 July, 2024; originally announced July 2024.
arXiv:2407.19775 [pdf, other]

cs.AI cs.CL cs.CR cs.DC

Model Agnostic Hybrid Sharding For Heterogeneous Distributed Inference

Authors: Claudio Angione, Yue Zhao, Harry Yang, Ahmad Farhan, Fielding Johnston, James Buban, Patrick Colangelo

Abstract: The rapid growth of large-scale AI models, particularly large language models has brought significant challenges in data privacy, computational resources, and accessibility. Traditional centralized architectures often struggle to meet required data security and scalability needs which hinders the democratization of AI systems. Nesa introduces a model-agnostic sharding framework designed for decent… ▽ More The rapid growth of large-scale AI models, particularly large language models has brought significant challenges in data privacy, computational resources, and accessibility. Traditional centralized architectures often struggle to meet required data security and scalability needs which hinders the democratization of AI systems. Nesa introduces a model-agnostic sharding framework designed for decentralized AI inference. Our framework uses blockchain-based sequential deep neural network sharding to distribute computational tasks across a diverse network of nodes based on a personalised heuristic and routing mechanism. This enables efficient distributed training and inference for recent large-scale models even on consumer-grade hardware. We use compression techniques like dynamic blockwise quantization and mixed matrix decomposition to reduce data transfer and memory needs. We also integrate robust security measures, including hardware-based trusted execution environments to ensure data integrity and confidentiality. Evaluating our system across various natural language processing and vision tasks shows that these compression strategies do not compromise model accuracy. Our results highlight the potential to democratize access to cutting-edge AI technologies by enabling secure and efficient inference on a decentralized network. △ Less

Submitted 29 July, 2024; originally announced July 2024.
arXiv:2407.19711 [pdf, other]

cs.SE

TVDiag: A Task-oriented and View-invariant Failure Diagnosis Framework with Multimodal Data

Authors: Shuaiyu Xie, Jian Wang, Hanbin He, Zhihao Wang, Yuqi Zhao, Neng Zhang, Bing Li

Abstract: Microservice-based systems often suffer from reliability issues due to their intricate interactions and expanding scale. With the rapid growth of observability techniques, various methods have been proposed to achieve failure diagnosis, including root cause localization and failure type identification, by leveraging diverse monitoring data such as logs, metrics, or traces. However, traditional fai… ▽ More Microservice-based systems often suffer from reliability issues due to their intricate interactions and expanding scale. With the rapid growth of observability techniques, various methods have been proposed to achieve failure diagnosis, including root cause localization and failure type identification, by leveraging diverse monitoring data such as logs, metrics, or traces. However, traditional failure diagnosis methods that use single-modal data can hardly cover all failure scenarios due to the restricted information. Several failure diagnosis methods have been recently proposed to integrate multimodal data based on deep learning. These methods, however, tend to combine modalities indiscriminately and treat them equally in failure diagnosis, ignoring the relationship between specific modalities and different diagnostic tasks. This oversight hinders the effective utilization of the unique advantages offered by each modality. To address the limitation, we propose \textit{TVDiag}, a multimodal failure diagnosis framework for locating culprit microservice instances and identifying their failure types (e.g., Net-packets Corruption) in microservice-based systems. \textit{TVDiag} employs task-oriented learning to enhance the potential advantages of each modality and establishes cross-modal associations based on contrastive learning to extract view-invariant failure information. Furthermore, we develop a graph-level data augmentation strategy that randomly inactivates the observability of some normal microservice instances during training to mitigate the shortage of training data. Experimental results show that \textit{TVDiag} outperforms state-of-the-art methods in multimodal failure diagnosis, achieving at least a 55.94\% higher $HR@1$ accuracy and over a 4.08\% increase in F1-score across two datasets. △ Less

Submitted 29 July, 2024; originally announced July 2024.

Comments: 30 pages
arXiv:2407.19672 [pdf, other]

cs.CL

SeaLLMs 3: Open Foundation and Chat Multilingual Large Language Models for Southeast Asian Languages

Authors: Wenxuan Zhang, Hou Pong Chan, Yiran Zhao, Mahani Aljunied, Jianyu Wang, Chaoqun Liu, Yue Deng, Zhiqiang Hu, Weiwen Xu, Yew Ken Chia, Xin Li, Lidong Bing

Abstract: Large Language Models (LLMs) have shown remarkable abilities across various tasks, yet their development has predominantly centered on high-resource languages like English and Chinese, leaving low-resource languages underserved. To address this disparity, we present SeaLLMs 3, the latest iteration of the SeaLLMs model family, tailored for Southeast Asian languages. This region, characterized by it… ▽ More Large Language Models (LLMs) have shown remarkable abilities across various tasks, yet their development has predominantly centered on high-resource languages like English and Chinese, leaving low-resource languages underserved. To address this disparity, we present SeaLLMs 3, the latest iteration of the SeaLLMs model family, tailored for Southeast Asian languages. This region, characterized by its rich linguistic diversity, has lacked adequate language technology support. SeaLLMs 3 aims to bridge this gap by covering a comprehensive range of languages spoken in this region, including English, Chinese, Indonesian, Vietnamese, Thai, Tagalog, Malay, Burmese, Khmer, Lao, Tamil, and Javanese. Leveraging efficient language enhancement techniques and a specially constructed instruction tuning dataset, SeaLLMs 3 significantly reduces training costs while maintaining high performance and versatility. Our model excels in tasks such as world knowledge, mathematical reasoning, translation, and instruction following, achieving state-of-the-art performance among similarly sized models. Additionally, we prioritized safety and reliability by addressing both general and culture-specific considerations and incorporated mechanisms to reduce hallucinations. This work underscores the importance of inclusive AI, showing that advanced LLM capabilities can benefit underserved linguistic and cultural communities. △ Less

Submitted 28 July, 2024; originally announced July 2024.

Search v0.5.6 released 2020-02-24