Search | arXiv e-print repository

Low-Complexity SVM Signal Recovery in Bandwidth-Limited 100Gb/s PAM4 PON Upstream

Authors: Liyan Wu, Yanlu Huang, Kai Jin, Shangya Han, Kun Xu, Yanni Ou

Abstract: We proposed a low-complexity SVM-based signal recovery algorithm and evaluated it in a 100G-PON with 25G-class devices. For the first time, it experimentally achieved a 24 dB power budget at the FEC threshold of 1E-3 over 40 km of SMF, improving receiver sensitivity by over 2 dB compared with FFE&DFE.
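The power budget above is quoted "at the FEC threshold of 1E-3", meaning the pre-FEC bit error rate (BER) must not exceed 1e-3. The sketch below shows one common way to count that BER for PAM4 with hard decisions. It assumes Gray-coded levels {-3, -1, +1, +3} and nearest-level decisions. It does not implement the authors' SVM recovery; a trained classifier would simply replace `hard_decision`.

```python
# Illustrative sketch: Gray-coded PAM4 hard decisions and a BER check
# against a 1e-3 pre-FEC threshold. The level values and bit mapping are
# assumptions for illustration, not taken from the paper.

# Gray mapping: adjacent levels differ in exactly one bit.
GRAY_BITS = {-3: (0, 0), -1: (0, 1), 1: (1, 1), 3: (1, 0)}
LEVELS = sorted(GRAY_BITS)

FEC_THRESHOLD = 1e-3


def hard_decision(sample: float) -> int:
    """Map a received sample to the nearest PAM4 level."""
    return min(LEVELS, key=lambda level: abs(sample - level))


def bit_error_rate(tx_levels, rx_samples) -> float:
    """Count bit errors after hard decisions (2 bits per PAM4 symbol)."""
    errors = 0
    total_bits = 0
    for tx, rx in zip(tx_levels, rx_samples):
        tx_bits = GRAY_BITS[tx]
        rx_bits = GRAY_BITS[hard_decision(rx)]
        errors += sum(a != b for a, b in zip(tx_bits, rx_bits))
        total_bits += 2
    return errors / total_bits


def below_fec_threshold(ber: float) -> bool:
    """True if the pre-FEC BER is at or below the assumed 1e-3 limit."""
    return ber <= FEC_THRESHOLD


if __name__ == "__main__":
    tx = [-3, -1, 1, 3]
    rx = [-2.9, -0.2, 1.1, 2.5]  # mildly distorted samples
    ber = bit_error_rate(tx, rx)
    print(ber, below_fec_threshold(ber))
```

Gray coding is used because the most likely error, a decision into an adjacent level, then costs only one bit.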

Submitted 4 July, 2024; originally announced July 2024.

arXiv:2407.03627 [pdf, other]

DSLR: Document Refinement with Sentence-Level Re-ranking and Reconstruction to Enhance Retrieval-Augmented Generation

Authors: Taeho Hwang, Soyeong Jeong, Sukmin Cho, SeungYoon Han, Jong C. Park

Abstract: Recent advancements in Large Language Models (LLMs) have significantly improved their performance across various Natural Language Processing (NLP) tasks. However, LLMs are still prone to generating non-factual responses due to limitations in their parametric memory. Retrieval-Augmented Generation (RAG) systems address this issue by incorporating external knowledge through a retrieval module. Despite their successes, current RAG systems face challenges with retrieval failures and with the limited ability of LLMs to filter out irrelevant information. Therefore, in this work, we propose DSLR (Document Refinement with Sentence-Level Re-ranking and Reconstruction), an unsupervised framework that decomposes retrieved documents into sentences, filters out irrelevant sentences, and reconstructs the remaining ones into coherent passages. We experimentally validate DSLR on multiple open-domain QA datasets, and the results demonstrate that DSLR significantly enhances RAG performance over conventional fixed-size passages. Furthermore, DSLR enhances performance in specific, yet realistic, scenarios without the need for additional training, providing an effective and efficient solution for refining retrieved documents in RAG systems.
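The decompose, filter, and reconstruct steps described above can be sketched as follows. This is a minimal sketch under stated assumptions. The regex sentence splitter and the word-overlap score are hypothetical stand-ins for the sentence-level re-ranker, which the abstract does not specify. Reconstruction here just keeps surviving sentences in their original order.

```python
import re


def split_sentences(document: str) -> list[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", document.strip())
    return [p for p in parts if p]


def overlap_score(query: str, sentence: str) -> float:
    """Hypothetical relevance score: fraction of query words in the sentence.
    A real system would use a learned re-ranker here instead."""
    q = set(re.findall(r"\w+", query.lower()))
    s = set(re.findall(r"\w+", sentence.lower()))
    return len(q & s) / len(q) if q else 0.0


def refine(query: str, document: str, threshold: float = 0.5) -> str:
    """Decompose into sentences, drop low-scoring ones, and rejoin the rest
    in their original order so the passage stays readable."""
    sentences = split_sentences(document)
    kept = [s for s in sentences if overlap_score(query, s) >= threshold]
    return " ".join(kept)


if __name__ == "__main__":
    doc = ("Paris is the capital of France. The Eiffel Tower is tall. "
           "France uses the euro.")
    print(refine("capital of France", doc))
```

Preserving the original order, rather than sorting by score, is what keeps the reconstructed passage coherent in this sketch.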

Submitted 8 September, 2024; v1 submitted 4 July, 2024; originally announced July 2024.

Comments: 20 pages

Journal ref: KnowledgeNLP@ACL 2024

arXiv:2407.02848 [pdf, other]

Efficiency bounds for bipartite information-driven thermodynamic systems

Authors: Shihao Xia, Shuanglong Han, Ousi Pan, Yuzhuo Pan, Jincan Chen, Shanhe Su

Abstract: This study introduces a novel approach to deriving a lower bound for the entropy production rate of a subsystem using the Cauchy-Schwarz inequality. The approach is then extended to establish comprehensive upper and lower bounds for the efficiency of two subsystems. These bounds apply to a wide range of Markovian stochastic processes and characterize more accurately the range of energy conversion efficiency between subsystems. Validation is conducted using a two-quantum-dot system model, confirming the effectiveness of our inequality in refining the efficiency bounds.

Submitted 3 July, 2024; originally announced July 2024.

Comments: 8 pages, 2 figures

arXiv:2407.02211 [pdf, other]

PromptIntern: Saving Inference Costs by Internalizing Recurrent Prompt during Large Language Model Fine-tuning

Authors: Jiaru Zou, Mengyu Zhou, Tao Li, Shi Han, Dongmei Zhang

Abstract: Large language models (LLMs) have played a fundamental role in various natural language processing tasks, aided by powerful prompting techniques. However, in real-world applications, repeated queries often share similar prompt components, which imposes a significant computational burden during inference. Existing prompt compression and direct fine-tuning methods aim to tackle these challenges, yet they frequently struggle to strike an optimal balance between cost efficiency and performance, especially in complex tasks such as NL2Code. In this paper, we propose a novel method, PromptIntern, that internalizes prompt knowledge into model parameters via progressive fine-tuning. Our method enables LLMs to emulate the human learning process for a new task: detailed templates and examples in the prompt are gradually internalized and phased out as the model grows accustomed to the task. Extensive experiments demonstrate that our method reduces inference tokens by over 90%, speeds up inference by 4.2 times, and saves 88.3% in monetary cost.
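To make "gradually internalized and phased out" concrete, here is a hypothetical sketch of a progressive prompt schedule. The abstract gives no schedule, so two things here are assumptions. First, in-prompt examples decay linearly to zero across fine-tuning stages. Second, the template is dropped only at the last stage. The function names are illustrative, not PromptIntern's API.

```python
def keep_counts(total_examples: int, num_stages: int) -> list[int]:
    """Hypothetical linear schedule: number of in-prompt examples kept at
    each stage, from all of them at stage 0 down to none at the last stage."""
    if num_stages < 2:
        raise ValueError("need at least two stages")
    last = num_stages - 1
    # Integer arithmetic avoids rounding surprises.
    return [total_examples * (last - i) // last for i in range(num_stages)]


def build_prompt(template: str, examples: list[str], query: str,
                 stage: int, num_stages: int) -> str:
    """Assemble the training prompt for a given stage (assumed layout:
    template, then the kept examples, then the query)."""
    n_keep = keep_counts(len(examples), num_stages)[stage]
    parts = []
    if stage < num_stages - 1:  # assumption: template removed at final stage
        parts.append(template)
    parts.extend(examples[:n_keep])
    parts.append(query)
    return "\n".join(parts)


if __name__ == "__main__":
    examples = ["e1", "e2", "e3", "e4"]
    for stage in range(3):
        print(repr(build_prompt("T", examples, "Q", stage, 3)))
```

By the final stage the model sees only the query, which is where the inference-token savings come from.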

Submitted 2 July, 2024; originally announced July 2024.

arXiv:2407.01093 [pdf, other]

IBSEN: Director-Actor Agent Collaboration for Controllable and Interactive Drama Script Generation

Authors: Senyu Han, Lu Chen, Li-Min Lin, Zhengshan Xu, Kai Yu

Abstract: Large language models have demonstrated their capabilities in storyline creation and human-like character role-playing. Current language model agents mainly focus on producing reasonable behaviors at the level of individuals, and their behaviors can be hard to constrain at the level of the whole storyline. In this paper, we introduce IBSEN, a director-actor coordinated agent framework that generates drama scripts and makes the plot played out by agents more controllable. The director agent writes plot outlines that the user wants to see, instructs the actor agents to role-play their characters, and reschedules the plot when human players participate in the scenario, ensuring that the plot progresses towards the objective. To evaluate the framework, we create a novel drama plot involving several actor agents and examine the interactions between them under the instruction of the director agent. Evaluation results show that our framework can generate complete, diverse drama scripts from only a rough outline of plot objectives while maintaining the characteristics of the characters in the drama. Our code and prompts are available at https://github.com/OpenDFM/ibsen.

Submitted 1 July, 2024; originally announced July 2024.

Comments: Accepted by ACL 2024 Main

arXiv:2407.01076 [pdf]

Orbital origin of magnetic moment enhancement induced by charge density wave in kagome FeGe

Authors: Shulun Han, Linyang Li, Chi Sin Tang, Qi Wang, Lingfeng Zhang, Caozheng Diao, Mingwen Zhao, Shuo Sun, Lijun Tian, Mark B. H. Breese, Chuanbing Cai, Milorad V. Milosevic, Yanpeng Qi, Andrew T. S. Wee, Xinmao Yin

Abstract: Interactions among electronic states such as charge density wave (CDW) order, magnetism, and superconductivity are of high significance in strongly correlated systems. While significant progress has been made in understanding the relationship between CDW and superconductivity, the interplay between CDW and magnetic order remains largely elusive. Kagome lattices, which intertwine nontrivial topology, charge order, and magnetism, offer an ideal platform for such studies. The kagome magnet FeGe, which hosts a unique coupling between CDW and magnetism, has recently garnered considerable attention in this respect. Here we reveal the significant role of orbital coupling during the CDW phase transition, highlighting the orbital origin of the magnetic moment enhancement in FeGe. Our X-ray absorption experiments and first-principles calculations illuminate the temperature-dependent behavior of Fe 3d-Ge 4p orbital hybridization and corroborate its pivotal impact on the magnetic properties of FeGe. These findings introduce an orbital dimension to the correlation between charge and magnetic degrees of freedom, advancing our understanding of the intriguing quantum phases resulting from this interplay.

Submitted 1 July, 2024; originally announced July 2024.

arXiv:2407.00896 [pdf, other]

Channel Modeling Aided Dataset Generation for AI-Enabled CSI Feedback: Advances, Challenges, and Solutions

Authors: Yupeng Li, Gang Li, Zirui Wen, Shuangfeng Han, Shijian Gao, Guangyi Liu, Jiangzhou Wang

Abstract: The AI-enabled autoencoder has demonstrated great potential for channel state information (CSI) feedback in frequency division duplex (FDD) multiple-input multiple-output (MIMO) systems. However, this approach completely changes existing feedback strategies, making it impractical to deploy in the near term. To address this issue, this paper proposes a channel modeling aided data augmentation method based on a limited amount of field channel data. Specifically, the user equipment (UE) extracts the primary stochastic parameters of the field channel data and transmits them to the base station (BS). The BS then updates the typical TR 38.901 model parameters with the extracted parameters, and the updated channel model is used to generate the dataset. This strategy comprehensively considers dataset collection, model generalization, and model monitoring, among other aspects. Simulations verify that the proposed strategy significantly improves performance compared with the benchmarks.

Submitted 30 June, 2024; originally announced July 2024.

arXiv:2406.19856 [pdf]

LUT-boosted CDR and Equalization for Burst-mode 50/100 Gbit/s Bandwidth-limited Flexible PON

Authors: Yanlu Huang, Liyan Wu, Shangya Han, Kai Jin, Kun Xu, Yanni Ou

Abstract: We proposed and experimentally demonstrated a look-up table (LUT) boosted fast clock and data recovery (CDR) and equalization scheme for burst-mode 50/100 Gbit/s bandwidth-limited flexible PON. The scheme requires no preamble for convergence while achieving the same bit error rate performance as with long preambles.

Submitted 28 June, 2024; originally announced June 2024.

arXiv:2406.19135 [pdf, other]

DEX-TTS: Diffusion-based EXpressive Text-to-Speech with Style Modeling on Time Variability

Authors: Hyun Joon Park, Jin Sob Kim, Wooseok Shin, Sung Won Han

Abstract: Expressive Text-to-Speech (TTS) using reference speech has been studied extensively to synthesize natural speech, but existing methods remain limited in obtaining well-represented styles and in model generalization. In this study, we present Diffusion-based EXpressive TTS (DEX-TTS), an acoustic model designed for reference-based speech synthesis with enhanced style representations. Built on a general diffusion TTS framework, DEX-TTS includes encoders and adapters to handle styles extracted from reference speech. Key innovations include the separation of styles into time-invariant and time-variant categories for effective style extraction, as well as encoders and adapters designed for high generalization ability. In addition, we introduce overlapping patchify and convolution-frequency patch embedding strategies to improve DiT-based diffusion networks for TTS. DEX-TTS achieves outstanding performance in objective and subjective evaluations on English multi-speaker and emotional multi-speaker datasets, without relying on pre-training strategies. Lastly, comparison results for general TTS on a single-speaker dataset verify the effectiveness of our enhanced diffusion backbone. Demos are available here.

Submitted 27 June, 2024; originally announced June 2024.

Comments: Preprint

arXiv:2406.18925 [pdf, other]

Selective Vision is the Challenge for Visual Reasoning: A Benchmark for Visual Argument Understanding

Authors: Jiwan Chung, Sungjae Lee, Minseo Kim, Seungju Han, Ashkan Yousefpour, Jack Hessel, Youngjae Yu

Abstract: Visual arguments, often used in advertising or for social causes, rely on images to persuade viewers to do or believe something. Understanding these arguments requires selective vision: only specific visual stimuli within an image are relevant to the argument, and relevance can only be understood within the context of a broader argumentative structure. While visual arguments are readily appreciated by human audiences, we ask: are today's AI systems capable of similar understanding? We collect and release VisArgs, an annotated corpus designed to make explicit the (usually implicit) structures underlying visual arguments. VisArgs includes 1,611 images accompanied by three types of textual annotations: 5,112 visual premises (with region annotations), 5,574 commonsense premises, and reasoning trees connecting them to a broader argument. We propose three tasks over VisArgs to probe machine capacity for visual argument understanding: localization of premises, identification of premises, and deduction of conclusions. Experiments demonstrate that 1) machines cannot fully identify the relevant visual cues. The top-performing model, GPT-4-O, achieved an accuracy of only 78.5%, whereas humans reached 98.0%. All models showed a performance drop, with an average decrease in accuracy of 19.5%, when the comparison set was changed from objects outside the image to irrelevant objects within the image. Furthermore, 2) this limitation is the greatest factor affecting their performance in understanding visual arguments: for deducing the conclusion of a visual argument, most models improved the most when given relevant visual premises as additional inputs, compared with other inputs.

Submitted 27 June, 2024; originally announced June 2024.

Comments: 12 pages, 5 figures

arXiv:2406.18819 [pdf, other]

MultiObjMatch: Matching with Optimal Tradeoffs between Multiple Objectives in R

Authors: Shichao Han, Samuel D. Pimentel

Abstract: In an observational study, matching aims to create many small sets of similar treated and control units from initial samples that may differ substantially, in order to permit more credible causal inferences. The problem of constructing matched sets may be formulated as an optimization problem, but it can be challenging to specify a single objective function that adequately captures all the design considerations at work. One solution, proposed by \citet{pimentel2019optimal}, is to explore a family of matched designs that are Pareto optimal for multiple objective functions. We present an R package, MultiObjMatch (https://github.com/ShichaoHan/MultiObjMatch), that implements this multi-objective matching strategy using a network flow algorithm for several common design goals: marginal balance on important covariates, size of the matched sample, and average within-pair multivariate distances. We demonstrate the package's flexibility in exploring user-defined tradeoffs of interest via two case studies: a reanalysis of the canonical National Supported Work dataset and a novel analysis of a clinical dataset estimating the impact of diabetic kidney disease on hospitalization costs.
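"Pareto optimal for multiple objective functions" means no other candidate design is at least as good on every objective and strictly better on at least one. The sketch below filters already-computed candidate designs down to that front. It is not the package's network-flow algorithm and not its R interface. The design names and objective values are hypothetical, with all objectives oriented so that smaller is better: covariate imbalance, treated units left unmatched, and mean within-pair distance.

```python
def dominates(a: tuple[float, ...], b: tuple[float, ...]) -> bool:
    """True if design a is no worse than b on every objective and strictly
    better on at least one (all objectives are minimized)."""
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))


def pareto_front(designs: dict[str, tuple[float, ...]]) -> list[str]:
    """Names of candidate designs not dominated by any other candidate."""
    return [
        name
        for name, obj in designs.items()
        if not any(dominates(other, obj)
                   for other_name, other in designs.items()
                   if other_name != name)
    ]


if __name__ == "__main__":
    # Hypothetical objectives: (imbalance, unmatched treated, mean distance)
    candidates = {
        "A": (0.1, 50, 2.0),
        "B": (0.2, 10, 1.5),
        "C": (0.3, 60, 2.5),  # worse than A on every objective
    }
    print(pareto_front(candidates))
```

A and B survive because each wins on some objective, so choosing between them is exactly the kind of user-defined tradeoff the package is meant to expose.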

Submitted 26 June, 2024; originally announced June 2024.

arXiv:2406.18510 [pdf, other]

WildTeaming at Scale: From In-the-Wild Jailbreaks to (Adversarially) Safer Language Models

Authors: Liwei Jiang, Kavel Rao, Seungju Han, Allyson Ettinger, Faeze Brahman, Sachin Kumar, Niloofar Mireshghallah, Ximing Lu, Maarten Sap, Yejin Choi, Nouha Dziri

Abstract: We introduce WildTeaming, an automatic LLM safety red-teaming framework that mines in-the-wild user-chatbot interactions to discover 5.7K unique clusters of novel jailbreak tactics, and then composes multiple tactics for systematic exploration of novel jailbreaks. Compared to prior work that performed red-teaming via recruited human workers, gradient-based optimization, or iterative revision with LLMs, our work investigates jailbreaks from chatbot users who were not specifically instructed to break the system. WildTeaming reveals previously unidentified vulnerabilities of frontier LLMs, resulting in up to 4.6x more diverse and successful adversarial attacks compared to state-of-the-art jailbreak methods. While many datasets exist for jailbreak evaluation, very few open-source datasets exist for jailbreak training, as safety training data has been closed even when model weights are open. With WildTeaming, we create WildJailbreak, a large-scale open-source synthetic safety dataset with 262K vanilla (direct request) and adversarial (complex jailbreak) prompt-response pairs. To mitigate exaggerated safety behaviors, WildJailbreak provides two contrastive types of queries: 1) harmful queries (vanilla & adversarial) and 2) benign queries that resemble harmful queries in form but contain no harm. As WildJailbreak considerably upgrades the quality and scale of existing safety resources, it uniquely enables us to examine the scaling effects of data and the interplay of data properties and model capabilities during safety training. Through extensive experiments, we identify the training properties that enable an ideal balance of safety behaviors: appropriate safeguarding without over-refusal, effective handling of vanilla and adversarial queries, and minimal, if any, decrease in general capabilities. All components of WildJailbreak contribute to achieving balanced safety behaviors in models.

Submitted 26 June, 2024; originally announced June 2024.

arXiv:2406.18495 [pdf, other]

WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs

Authors: Seungju Han, Kavel Rao, Allyson Ettinger, Liwei Jiang, Bill Yuchen Lin, Nathan Lambert, Yejin Choi, Nouha Dziri

Abstract: We introduce WildGuard, an open, lightweight moderation tool for LLM safety that achieves three goals: (1) identifying malicious intent in user prompts, (2) detecting safety risks of model responses, and (3) determining model refusal rate. Together, these capabilities let WildGuard serve the increasing need for automatic safety moderation and evaluation of LLM interactions, providing a one-stop tool with enhanced accuracy and broad coverage across 13 risk categories. While existing open moderation tools such as Llama-Guard2 score reasonably well in classifying straightforward model interactions, they lag far behind a prompted GPT-4, especially in identifying adversarial jailbreaks and in evaluating models' refusals, a key measure for evaluating safety behaviors in model responses. To address these challenges, we construct WildGuardMix, a large-scale and carefully balanced multi-task safety moderation dataset with 92K labeled examples that cover vanilla (direct) prompts and adversarial jailbreaks, paired with various refusal and compliance responses. WildGuardMix is a combination of WildGuardTrain, the training data of WildGuard, and WildGuardTest, a high-quality human-annotated moderation test set with 5K labeled items covering broad risk scenarios. Through extensive evaluations on WildGuardTest and ten existing public benchmarks, we show that WildGuard establishes state-of-the-art performance in open-source safety moderation across all three tasks compared to ten strong existing open-source moderation models (e.g., up to 26.4% improvement on refusal detection). Importantly, WildGuard matches and sometimes exceeds GPT-4 performance (e.g., up to 3.9% improvement on prompt harmfulness identification). WildGuard serves as a highly effective safety moderator in an LLM interface, reducing the success rate of jailbreak attacks from 79.8% to 2.4%.

Submitted 9 July, 2024; v1 submitted 26 June, 2024; originally announced June 2024.

Comments: First two authors contributed equally. Third and fourth authors contributed equally

arXiv:2406.15734 [pdf, other]

RankAdaptor: Hierarchical Dynamic Low-Rank Adaptation for Structural Pruned LLMs

Authors: Changhai Zhou, Shijie Han, Shiyang Zhang, Shichao Weng, Zekai Liu, Cheng Jin

Abstract: The efficient compression of large language models (LLMs) is becoming increasingly popular. However, recovering the accuracy of compressed LLMs remains a major challenge. Structural pruning with standard Low-Rank Adaptation (LoRA) is a common technique in current LLM compression. Structural pruning modifies the model architecture unevenly, so standard LoRA with a fixed rank yields suboptimal performance on various downstream tasks. To address this problem, we introduce RankAdaptor, an efficient fine-tuning method with hierarchical dynamic rank scheduling for pruned LLMs. We develop an end-to-end automatic optimization flow that uses a lightweight performance model to determine the different ranks used during fine-tuning. Comprehensive experiments on popular benchmarks show that RankAdaptor consistently outperforms standard LoRA with structural pruning across different pruning settings. Without increasing the number of trainable parameters, RankAdaptor further narrows the accuracy gap between the recovered pruned model and the original model compared with standard LoRA.
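A LoRA adapter on a weight of shape d_out x d_in adds r x (d_in + d_out) trainable parameters. Per-layer ranks can therefore differ while the total stays within the budget of a uniform-rank setup. The sketch below shows only that bookkeeping. It allocates ranks in proportion to hypothetical per-layer importance scores, which is a simple stand-in and not RankAdaptor's performance-model-driven scheduling.

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters of one LoRA adapter: B (d_out x r) + A (r x d_in)."""
    return rank * (d_in + d_out)


def allocate_ranks(shapes: list[tuple[int, int]], scores: list[float],
                   uniform_rank: int) -> list[int]:
    """Hypothetical allocation: split the parameter budget of a uniform-rank
    LoRA across layers in proportion to per-layer scores. Flooring keeps
    each layer within its share; the max(1, ...) floor can exceed the
    budget slightly if a layer's share is tiny."""
    budget = sum(lora_params(d_in, d_out, uniform_rank) for d_in, d_out in shapes)
    total_score = sum(scores)
    ranks = []
    for (d_in, d_out), score in zip(shapes, scores):
        share = budget * score / total_score
        ranks.append(max(1, int(share // (d_in + d_out))))
    return ranks


if __name__ == "__main__":
    shapes = [(100, 100), (100, 100)]  # hypothetical pruned layer shapes
    ranks = allocate_ranks(shapes, scores=[3.0, 1.0], uniform_rank=8)
    used = sum(lora_params(i, o, r) for (i, o), r in zip(shapes, ranks))
    print(ranks, used)
```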

Submitted 22 June, 2024; originally announced June 2024.

arXiv:2406.14703 [pdf, other]

Do LLMs Have Distinct and Consistent Personality? TRAIT: Personality Testset designed for LLMs with Psychometrics

Authors: Seungbeen Lee, Seungwon Lim, Seungju Han, Giyeong Oh, Hyungjoo Chae, Jiwan Chung, Minju Kim, Beong-woo Kwak, Yeonsoo Lee, Dongha Lee, Jinyoung Yeo, Youngjae Yu

Abstract: The idea of personality in descriptive psychology, traditionally defined through observable behavior, has now been extended to Large Language Models (LLMs) to better understand their behavior. This raises a question: do LLMs exhibit distinct and consistent personality traits, similar to humans? Existing self-assessment personality tests, while applicable, lack the validity and reliability needed for precise personality measurement. To address this, we introduce TRAIT, a new tool consisting of 8K multiple-choice questions designed to assess the personality of LLMs with validity and reliability. TRAIT is built on two psychometrically validated human questionnaires, the Big Five Inventory (BFI) and the Short Dark Triad (SD-3), enhanced with the ATOMIC10X knowledge graph to test personality in a variety of real-world scenarios. TRAIT overcomes the reliability and validity issues of measuring LLM personality through self-assessment, achieving the best scores across three metrics: refusal rate, prompt sensitivity, and option order sensitivity. It reveals notable insights into LLM personality: 1) LLMs exhibit distinct and consistent personalities, which are highly influenced by their training data (i.e., data used for alignment tuning), and 2) current prompting techniques have limited effectiveness in eliciting certain traits, such as high psychopathy or low conscientiousness, suggesting the need for further research in this direction.

Submitted 20 June, 2024; originally announced June 2024.

Comments: Preprint; Under review

arXiv:2406.14459 [pdf, other]

Healing Powers of BERT: How Task-Specific Fine-Tuning Recovers Corrupted Language Models

Authors: Shijie Han, Zhenyu Zhang, Andrei Arsene Simion

Abstract: Language models like BERT excel at sentence classification tasks due to extensive pre-training on general data, but their robustness to parameter corruption is unexplored. To understand this better, we look at what happens if a language model is "broken", in the sense that some of its parameters are corrupted and then recovered by fine-tuning. Strategically corrupting BERT variants at different levels, we find corrupted models struggle to fully recover their original performance, with higher corruption causing more severe degradation. Notably, bottom-layer corruption affecting fundamental linguistic features is more detrimental than top-layer corruption. Our insights contribute to understanding language model robustness and adaptability under adverse conditions, informing strategies for developing resilient NLP systems against parameter perturbations.

Submitted 20 June, 2024; originally announced June 2024.

arXiv:2406.14380 [pdf, other]

Estimating Treatment Effects under Recommender Interference: A Structured Neural Networks Approach

Authors: Ruohan Zhan, Shichao Han, Yuchen Hu, Zhenling Jiang

Abstract: Recommender systems are essential to content-sharing platforms, curating personalized content for viewers. To evaluate updates to recommender systems that target content creators, platforms frequently rely on creator-side randomized experiments. The treatment effect measures the change in outcomes when a new algorithm is implemented compared to the status quo. We show that the standard difference-in-means estimator can produce biased estimates due to recommender interference, which arises when treated and control creators compete for exposure. We propose a "recommender choice model" that describes which item gets exposed from a pool containing both treated and control items. By combining a structural choice model with neural networks, this framework directly models the interference pathway while accounting for rich viewer-content heterogeneity. We construct a debiased estimator of the treatment effect and prove that it is $\sqrt{n}$-consistent and asymptotically normal with potentially correlated samples. We validate our estimator's empirical performance with a field experiment on the Weixin short-video platform. In addition to the standard creator-side experiment, we conduct a costly double-sided randomization design to obtain a benchmark estimate free from interference bias. We show that the proposed estimator yields results comparable to the benchmark, whereas the standard difference-in-means estimator can exhibit significant bias and even produce reversed signs.
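The interference problem described above can be seen in a deliberately tiny toy model, which is not the paper's recommender choice model or its debiased estimator. Assume a fixed number of exposure slots, with each slot going to the highest-scoring item. Assume also that the treatment adds a constant score boost (both assumptions are hypothetical). If every creator is treated, the ranking is unchanged, so the true global effect on exposure is zero. In a 50/50 creator-side split, however, treated items take slots from control items, and difference-in-means reports a large positive effect.

```python
def exposures(scores: list[float], treated: list[bool], boost: float,
              slots: int) -> list[int]:
    """Toy recommender: the `slots` highest adjusted scores get exposure 1."""
    adjusted = [s + (boost if t else 0.0) for s, t in zip(scores, treated)]
    ranked = sorted(range(len(scores)), key=lambda i: adjusted[i], reverse=True)
    exposed = [0] * len(scores)
    for i in ranked[:slots]:
        exposed[i] = 1
    return exposed


def diff_in_means(outcomes: list[int], treated: list[bool]) -> float:
    """Standard estimator: mean treated outcome minus mean control outcome."""
    t = [y for y, z in zip(outcomes, treated) if z]
    c = [y for y, z in zip(outcomes, treated) if not z]
    return sum(t) / len(t) - sum(c) / len(c)


def global_effect(scores: list[float], boost: float, slots: int) -> float:
    """Ground truth in the toy: everyone treated vs. everyone in control."""
    n = len(scores)
    all_treated = exposures(scores, [True] * n, boost, slots)
    all_control = exposures(scores, [False] * n, boost, slots)
    return sum(all_treated) / n - sum(all_control) / n


if __name__ == "__main__":
    scores = [0.9, 0.8, 0.7, 0.6]
    assignment = [False, True, False, True]  # creator-side 50/50 split
    y = exposures(scores, assignment, boost=0.5, slots=2)
    print("difference-in-means:", diff_in_means(y, assignment))
    print("true global effect: ", global_effect(scores, boost=0.5, slots=2))
```

The gap between the two printed numbers is pure interference bias: the treated items' gain is exactly the control items' loss, not an effect that would persist under full rollout.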

Submitted 5 July, 2024; v1 submitted 20 June, 2024; originally announced June 2024.

arXiv:2406.12904 [pdf, other]

Meent: Differentiable Electromagnetic Simulator for Machine Learning

Authors: Yongha Kim, Anthony W. Jung, Sanmun Kim, Kevin Octavian, Doyoung Heo, Chaejin Park, Jeongmin Shin, Sunghyun Nam, Chanhyung Park, Juho Park, Sangjun Han, Jinmyoung Lee, Seolho Kim, Min Seok Jang, Chan Y. Park

Abstract: Electromagnetic (EM) simulation plays a crucial role in analyzing and designing devices with sub-wavelength scale structures such as solar cells, semiconductor devices, image sensors, future displays and integrated photonic devices. Specifically, optics problems such as estimating semiconductor device structures and designing nanophotonic devices provide intriguing research topics with far-reaching real-world impact. Traditional algorithms for such tasks require iteratively refining parameters through simulations, which often yield sub-optimal results due to the high computational cost of both the algorithms and EM simulations. Machine learning (ML) has emerged as a promising candidate to mitigate these challenges, and the optics research community has increasingly adopted ML algorithms to obtain results surpassing classical methods across various tasks. To foster a synergistic collaboration between the optics and ML communities, it is essential to have EM simulation software that is user-friendly for both research communities. To this end, we present Meent, an EM simulation software that employs rigorous coupled-wave analysis (RCWA). Developed in Python and equipped with automatic differentiation (AD) capabilities, Meent serves as a versatile platform for integrating ML into optics research and vice versa.
To demonstrate its utility as a research platform, we present three applications of Meent: 1) generating a dataset for training a neural operator, 2) serving as an environment for reinforcement learning of nanophotonic device optimization, and 3) providing a solution for inverse problems with gradient-based optimizers. These applications highlight Meent's potential to advance both EM simulation and ML methodologies. The code is available at https://github.com/kc-ml2/meent under the MIT license to promote the cross-pollination of ideas among academic researchers and industry practitioners.

Submitted 11 June, 2024; originally announced June 2024.

Comments: under review

arXiv:2406.12874 [pdf, other]

doi 10.1088/1748-0221/19/08/P08027

The Design, Implementation, and Performance of the LZ Calibration Systems

Authors: J. Aalbers, D. S. Akerib, A. K. Al Musalhi, F. Alder, C. S. Amarasinghe, A. Ames, T. J. Anderson, N. Angelides, H. M. Araújo, J. E. Armstrong, M. Arthurs, A. Baker, S. Balashov, J. Bang, E. E. Barillier, J. W. Bargemann, K. Beattie, T. Benson, A. Bhatti, A. Biekert, T. P. Biesiadzinski, H. J. Birch, E. Bishop, G. M. Blockinger, B. Boxer, et al. (179 additional authors not shown)

Abstract: LUX-ZEPLIN (LZ) is a tonne-scale experiment searching for direct dark matter interactions and other rare events. It is located at the Sanford Underground Research Facility (SURF) in Lead, South Dakota, USA. The core of the LZ detector is a dual-phase xenon time projection chamber (TPC), designed with the primary goal of detecting Weakly Interacting Massive Particles (WIMPs) via their induced low-energy nuclear recoils. Surrounding the TPC, two veto detectors immersed in an ultra-pure water tank reduce background events to enhance the discovery potential. Intricate calibration systems are purposely designed to precisely characterize the responses of these three detector volumes to various types of particle interactions and to demonstrate LZ's ability to discriminate between signals and backgrounds. In this paper, we present a comprehensive discussion of the key features, requirements, and performance of the LZ calibration systems, which play a crucial role in enabling LZ's WIMP search and its broad science program. The thorough description of these calibration systems, with an emphasis on their novel aspects, is valuable for future calibration efforts in direct dark matter and other rare-event search experiments.

Submitted 5 September, 2024; v1 submitted 2 May, 2024; originally announced June 2024.

Journal ref: JINST 19 P08027 (2024)

arXiv:2406.11961 [pdf, other]

Elaborating Higgs to dimuon decay from gluon fusion by decorrelation and jet substructure

Authors: Subin Han, Hyung Do Kim

Abstract: Based on current evidence, discovery of the Higgs boson decay to dimuons is anticipated soon. Precise categorization of the events without affecting the invariant mass shape is crucial in the analysis. Decorrelating the invariant mass from the output of the discriminators (the discriminator score) is essential for a consistent and precise analysis. In this paper, we use distance correlation as an additional loss term to decorrelate the discriminators and examine various analysis methods. Analyses with and without jet substructure variables are presented. Adding jet substructure variables considerably improves the significance of the Higgs to dimuon signal from gluon fusion.
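Distance correlation is zero in the population only when two variables are independent, which makes it a natural decorrelation penalty between a discriminator score and the invariant mass. The NumPy sketch below computes the sample distance correlation from double-centered pairwise-distance matrices. The helper `decorrelated_loss` and its weight `lam` are hypothetical names; training a discriminator with this penalty would require writing the same computation in an autodiff framework, and the paper's exact loss formulation may differ.

```python
import numpy as np


def _double_centered_distances(x):
    """Pairwise Euclidean distances, double-centered (row, column and grand means removed)."""
    x = np.asarray(x, dtype=float).reshape(len(x), -1)
    d = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    return d - d.mean(axis=0, keepdims=True) - d.mean(axis=1, keepdims=True) + d.mean()


def distance_correlation(x, y):
    """Sample distance correlation between two equally long samples."""
    a = _double_centered_distances(x)
    b = _double_centered_distances(y)
    dcov2 = (a * b).mean()
    denom = np.sqrt((a * a).mean() * (b * b).mean())
    if denom == 0.0:  # a constant sample carries no dependence information
        return 0.0
    return float(np.sqrt(max(dcov2, 0.0) / denom))


def decorrelated_loss(classifier_loss, scores, mass, lam=1.0):
    """Hypothetical combined objective: classifier loss plus a dCor^2 penalty."""
    return classifier_loss + lam * distance_correlation(scores, mass) ** 2
```

For x = [-1, 0, 1] and y = x², the Pearson correlation is 0 while the distance correlation is 10^(-1/4) ≈ 0.56, which shows why a distance-correlation penalty catches nonlinear dependence that a linear decorrelation term would miss.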

Submitted 17 June, 2024; originally announced June 2024.

Comments: 25 pages, 7 figures, 7 tables

arXiv:2406.11260 [pdf, other]

Adversarial Style Augmentation via Large Language Model for Robust Fake News Detection

Authors: Sungwon Park, Sungwon Han, Meeyoung Cha

Abstract: The spread of fake news negatively impacts individuals and is regarded as a significant social challenge that needs to be addressed. A number of algorithmic and insightful features have been identified for detecting fake news. However, with recent LLMs and their advanced generation capabilities, many of the detectable features can be altered (e.g., through style-conversion attacks), making fake news more challenging to distinguish from real news. This study proposes adversarial style augmentation, AdStyle, to train a fake news detector that remains robust against various style-conversion attacks. Our model's key mechanism is the careful use of LLMs to automatically generate a diverse yet coherent range of style-conversion attack prompts. This improves the generation of prompts that are particularly difficult for the detector to handle. Experiments show that our augmentation strategy improves robustness and detection performance when tested on fake news benchmark datasets.

Submitted 22 July, 2024; v1 submitted 17 June, 2024; originally announced June 2024.

Comments: 8 pages

arXiv:2406.10847 [pdf, other]

TorchOpera: A Compound AI System for LLM Safety

Authors: Shanshan Han, Yuhang Yao, Zijian Hu, Dimitris Stripelis, Zhaozhuo Xu, Chaoyang He

Abstract: We introduce TorchOpera, a compound AI system for enhancing the safety and quality of prompts and responses for Large Language Models. TorchOpera ensures that all user prompts are safe, contextually grounded, and effectively processed, while enhancing LLM responses to be relevant and high quality. TorchOpera utilizes a vector database for contextual grounding, rule-based wrappers for flexible modifications, and specialized mechanisms for detecting and adjusting unsafe or incorrect content. We also provide a view of the compound AI system to reduce the computational cost. Extensive experiments show that TorchOpera ensures the safety, reliability, and applicability of LLMs in real-world settings while maintaining the efficiency of LLM responses.

Submitted 16 June, 2024; originally announced June 2024.

arXiv:2406.10774 [pdf, other]

Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference

Authors: Jiaming Tang, Yilong Zhao, Kan Zhu, Guangxuan Xiao, Baris Kasikci, Song Han

Abstract: As the demand for long-context large language models (LLMs) increases, models with context windows of up to 128K or 1M tokens are becoming increasingly prevalent. However, long-context LLM inference is challenging since the inference speed decreases significantly as the sequence length grows. This slowdown is primarily caused by loading a large KV cache during self-attention. Previous works have shown that a small portion of critical tokens dominates the attention outcomes. However, we observe that the criticality of a token depends strongly on the query. To this end, we propose Quest, a query-aware KV cache selection algorithm. Quest keeps track of the minimal and maximal Key values in KV cache pages and estimates the criticality of a given page using Query vectors. By loading only the Top-K critical KV cache pages for attention, Quest significantly speeds up self-attention without sacrificing accuracy. We show that Quest can achieve up to 7.03x self-attention speedup, which reduces inference latency by 2.23x while performing well on tasks with long dependencies with negligible accuracy loss. Code is available at http://github.com/mit-han-lab/Quest .
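The page-selection step described above can be sketched in NumPy for one query vector and one attention head. The per-channel bound max(q_d·min_d, q_d·max_d) is one way to turn each page's stored minimum and maximum Keys into an upper bound on q·k for every key in that page; the function names, the page size, and the single-head, unbatched setting are assumptions for illustration, not the released kernels.

```python
import numpy as np


def page_bounds(keys, page_size):
    """Per-page elementwise min and max of the cached key vectors."""
    starts = range(0, keys.shape[0], page_size)
    mins = np.stack([keys[s:s + page_size].min(axis=0) for s in starts])
    maxs = np.stack([keys[s:s + page_size].max(axis=0) for s in starts])
    return mins, maxs


def page_upper_bounds(query, mins, maxs):
    """Upper bound on q.k for any key k inside each page.

    For each channel d, q_d * k_d <= max(q_d * min_d, q_d * max_d),
    so summing over channels bounds the full dot product.
    """
    return np.maximum(query * mins, query * maxs).sum(axis=1)


def _softmax_attend(query, keys, values):
    logits = keys @ query / np.sqrt(keys.shape[1])
    weights = np.exp(logits - logits.max())
    return (weights / weights.sum()) @ values


def sparse_attention(query, keys, values, page_size, top_k):
    """Attend only to tokens in the top_k pages with the highest bound."""
    mins, maxs = page_bounds(keys, page_size)
    chosen = np.argsort(-page_upper_bounds(query, mins, maxs))[:top_k]
    idx = np.concatenate([
        np.arange(p * page_size, min((p + 1) * page_size, len(keys)))
        for p in chosen
    ])
    return _softmax_attend(query, keys[idx], values[idx])


def dense_attention(query, keys, values):
    """Reference: attention over every cached token."""
    return _softmax_attend(query, keys, values)
```

Selecting every page reproduces dense attention up to floating-point rounding; smaller `top_k` values load fewer key/value pages at the risk of dropping tokens that matter for this query.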

Submitted 26 August, 2024; v1 submitted 15 June, 2024; originally announced June 2024.

Comments: ICML 2024

arXiv:2406.10537 [pdf, other]

Scalable Differentiable Causal Discovery in the Presence of Latent Confounders with Skeleton Posterior (Extended Version)

Authors: Pingchuan Ma, Rui Ding, Qiang Fu, Jiaru Zhang, Shuai Wang, Shi Han, Dongmei Zhang

Abstract: Differentiable causal discovery has made significant advancements in the learning of directed acyclic graphs. However, its application to real-world datasets remains restricted due to the ubiquity of latent confounders and the requirement to learn maximal ancestral graphs (MAGs). To date, existing differentiable MAG learning algorithms have been limited to small datasets and failed to scale to larger ones (e.g., with more than 50 variables). The key insight in this paper is that the causal skeleton, which is the undirected version of the causal graph, has potential for improving accuracy and reducing the search space of the optimization procedure, thereby enhancing the performance of differentiable causal discovery. Therefore, we seek to address a two-fold challenge to harness the potential of the causal skeleton for differentiable causal discovery in the presence of latent confounders: (1) scalable and accurate estimation of the skeleton and (2) universal integration of skeleton estimation with differentiable causal discovery. To this end, we propose SPOT (Skeleton Posterior-guided OpTimization), a two-phase framework that harnesses the skeleton posterior for differentiable causal discovery in the presence of latent confounders. In contrast to a "point estimation", SPOT seeks to estimate the posterior distribution of skeletons given the dataset. It first formulates the posterior inference as an amortized inference problem and concretizes it with a supervised causal learning (SCL)-enabled solution to estimate the skeleton posterior.
To incorporate the skeleton posterior with differentiable causal discovery, SPOT then features a skeleton posterior-guided stochastic optimization procedure to guide the optimization of MAGs. [abridged due to length limit]

Submitted 15 June, 2024; originally announced June 2024.

arXiv:2406.09799 [pdf, other]

GeoSEE: Regional Socio-Economic Estimation With a Large Language Model

Authors: Sungwon Han, Donghyun Ahn, Seungeon Lee, Minhyuk Song, Sungwon Park, Sangyoon Park, Jihee Kim, Meeyoung Cha

Abstract: Moving beyond traditional surveys, combining heterogeneous data sources with AI-driven inference models brings new opportunities to measure socio-economic conditions, such as poverty and population, over expansive geographic areas. The current research presents GeoSEE, a method that can estimate various socio-economic indicators using a unified pipeline powered by a large language model (LLM). Presented with a diverse set of information modules, including those pre-constructed from satellite imagery, GeoSEE selects which modules to use in estimation for each indicator and country. This selection is guided by the LLM's prior socio-geographic knowledge, which functions similarly to the insights of a domain expert. The system then computes target indicators via in-context learning after aggregating the results from the selected modules as natural-language text. Comprehensive evaluation across countries at various stages of development reveals that our method outperforms other predictive models in both unsupervised and low-shot contexts. This reliable performance in data-scarce settings in underdeveloped and developing countries, combined with its cost-effectiveness, underscores its potential to continuously support and monitor progress toward the Sustainable Development Goals, such as poverty alleviation and equitable growth, on a global scale.

Submitted 14 June, 2024; originally announced June 2024.

arXiv:2406.09076 [pdf, other]

3M: Multi-modal Multi-task Multi-teacher Learning for Game Event Detection

Authors: Thye Shan Ng, Feiqi Cao, Soyeon Caren Han

Abstract: Esports has rapidly emerged as a global phenomenon with an ever-expanding audience via platforms like YouTube. Due to the inherent complexity of the games, it is challenging for newcomers to comprehend what an event entails. The chaotic nature of online chat, the fast-paced speech of the game commentator, and the game-specific user interface further compound the difficulty for users in comprehending the gameplay. To overcome these challenges, it is crucial to integrate the Multi-Modal (MM) information from the platform and understand the event. The paper introduces a new MM multi-teacher-based game event detection framework, with the ultimate goal of constructing a comprehensive framework that enhances the comprehension of the ongoing game situation. While conventional MM models typically prioritise aligning MM data through concurrent training towards a unified objective, our framework leverages multiple teachers trained independently on different tasks to accomplish Game Event Detection. The experiments show the effectiveness of the proposed MM multi-teacher framework.

Submitted 13 June, 2024; originally announced June 2024.

arXiv:2406.08777

Finite Time Blowup of Integer- and Fractional-Order Time-Delayed Diffusion Equations

Authors: Christopher N. Angstmann, Stuart-James M. Burney, Daniel S. Han, Bruce I. Henry, Boris Z. Huang, Zhuang Xu

Abstract: In this work, exact solutions are derived for an integer- and fractional-order time-delayed diffusion equation with arbitrary initial conditions. The solutions are obtained using Fourier transform methods in conjunction with the known properties of delay functions. It is observed that the solutions do not exhibit infinite speed of propagation for smooth initial conditions that are bounded and positive. Sufficient conditions on the initial condition are also established such that the finite time blowup of the solutions can be explicitly calculated. Examples are provided that highlight the contrasting behaviours of these exact solutions with the known dynamics of solutions to the standard diffusion equation.

Submitted 3 August, 2024; v1 submitted 12 June, 2024; originally announced June 2024.

Comments: Errors were discovered in the analysis; significant revisions are being made to the manuscript

MSC Class: 35R25; 35C10; 34K06; 34K37; 33E20; 42A38

arXiv:2406.08301 [pdf, other]

Jet modification via $π^0$-hadron correlations in Au$+$Au collisions at $\sqrt{s_{_{NN}}}=200$ GeV

Authors: PHENIX Collaboration, N. J. Abdulameer, U. Acharya, A. Adare, S. Afanasiev, C. Aidala, N. N. Ajitanand, Y. Akiba, H. Al-Bataineh, J. Alexander, M. Alfred, K. Aoki, N. Apadula, L. Aphecetche, J. Asai, H. Asano, E. T. Atomssa, R. Averbeck, T. C. Awes, B. Azmoun, V. Babintsev, M. Bai, G. Baksay, L. Baksay, A. Baldisseri, et al. (510 additional authors not shown)

Abstract: High-momentum two-particle correlations are a useful tool for studying jet-quenching effects in the quark-gluon plasma. Angular correlations between neutral-pion triggers and charged hadrons with transverse momenta in the range 4--12~GeV/$c$ and 0.5--7~GeV/$c$, respectively, have been measured by the PHENIX experiment in 2014 for Au$+$Au collisions at $\sqrt{s_{_{NN}}}=200$~GeV. Suppression is observed in the yield of high-momentum jet fragments opposite the trigger particle, which indicates jet suppression stemming from in-medium partonic energy loss, while enhancement is observed for low-momentum particles. The ratio and difference between the yields in Au$+$Au collisions and $p$$+$$p$ collisions, $I_{AA}$ and $Δ_{AA}$, as a function of the trigger-hadron azimuthal separation, $Δφ$, are measured for the first time at the Relativistic Heavy Ion Collider. These results better quantify how the yield of low-$p_T$ associated hadrons is enhanced at wide angles, which is crucial for studying energy loss as well as medium-response effects.
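Assuming $Y$ denotes the per-trigger yield of associated charged hadrons, normalized to the number of $\pi^0$ triggers $N_{\rm trig}$, the two observables named above can be written as follows (the paper's exact normalization and background subtraction are not reproduced in this sketch):

$$
Y(\Delta\varphi) = \frac{1}{N_{\rm trig}}\frac{dN_{\rm pair}}{d\Delta\varphi}, \qquad
I_{AA}(\Delta\varphi) = \frac{Y_{\rm Au+Au}(\Delta\varphi)}{Y_{p+p}(\Delta\varphi)}, \qquad
\Delta_{AA}(\Delta\varphi) = Y_{\rm Au+Au}(\Delta\varphi) - Y_{p+p}(\Delta\varphi)
$$

Under these definitions, $I_{AA}<1$ (equivalently $\Delta_{AA}<0$) signals suppression relative to $p$$+$$p$, while $I_{AA}>1$ ($\Delta_{AA}>0$) signals enhancement, as reported here for low-momentum hadrons at wide angles.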

Submitted 12 June, 2024; originally announced June 2024.

Comments: 534 authors from 83 institutions, 12 pages, 7 figures. v1 is the version submitted to Physical Review C. HEPdata tables for the points plotted in figures for this and previous PHENIX publications are (or will be) publicly available at http://www.phenix.bnl.gov/papers.html

arXiv:2406.08020 [pdf, other]

Generalizable Disaster Damage Assessment via Change Detection with Vision Foundation Model

Authors: Kyeongjin Ahn, Sungwon Han, Sungwon Park, Jihee Kim, Sangyoon Park, Meeyoung Cha

Abstract: The increasing frequency and intensity of natural disasters demand more sophisticated approaches for rapid and precise damage assessment. To tackle this issue, researchers have developed various methods on disaster benchmark datasets from satellite imagery to aid in detecting disaster damage. However, the diverse nature of geographical landscapes and disasters makes it challenging to apply existing methods to regions unseen during training. We present DAVI (Disaster Assessment with VIsion foundation model), which overcomes domain disparities and detects structural damage (e.g., to buildings) without requiring ground-truth labels of the target region. DAVI integrates task-specific knowledge from a model trained on source regions with an image segmentation foundation model to generate pseudo labels of possible damage in the target region. It then employs a two-stage refinement process, targeting both the pixel and overall image, to more accurately pinpoint changes in disaster-struck areas based on before-and-after images. Comprehensive evaluations demonstrate that DAVI achieves exceptional performance across diverse terrains (e.g., USA and Mexico) and disaster types (e.g., wildfires, hurricanes, and earthquakes). This confirms its robustness in assessing disaster impact without dependence on ground-truth labels.

Submitted 12 June, 2024; originally announced June 2024.

Comments: 9 pages, 4 figures, 2 tables

arXiv:2406.05649 [pdf, other]

GTR: Improving Large 3D Reconstruction Models through Geometry and Texture Refinement

Authors: Peiye Zhuang, Songfang Han, Chaoyang Wang, Aliaksandr Siarohin, Jiaxu Zou, Michael Vasilkovsky, Vladislav Shakhrai, Sergey Korolev, Sergey Tulyakov, Hsin-Ying Lee

Abstract: We propose a novel approach for 3D mesh reconstruction from multi-view images. Our method takes inspiration from large reconstruction models like LRM that use a transformer-based triplane generator and a Neural Radiance Field (NeRF) model trained on multi-view images. However, in our method, we introduce several important modifications that allow us to significantly enhance 3D reconstruction quality. First of all, we examine the original LRM architecture and find several shortcomings. Subsequently, we introduce respective modifications to the LRM architecture, which lead to improved multi-view image representation and more computationally efficient training. Second, in order to improve geometry reconstruction and enable supervision at full image resolution, we extract meshes from the NeRF field in a differentiable manner and fine-tune the NeRF model through mesh rendering. These modifications allow us to achieve state-of-the-art performance on both 2D and 3D evaluation metrics, such as a PSNR of 28.67 on the Google Scanned Objects (GSO) dataset. Despite these superior results, our feed-forward model still struggles to reconstruct complex textures, such as text and portraits on assets. To address this, we introduce a lightweight per-instance texture refinement procedure. This procedure fine-tunes the triplane representation and the NeRF color estimation model on the mesh surface using the input multi-view images in just 4 seconds. This refinement improves the PSNR to 29.79 and achieves faithful reconstruction of complex textures, such as text.
Additionally, our approach enables various downstream applications, including text- or image-to-3D generation.

Submitted 13 June, 2024; v1 submitted 9 June, 2024; originally announced June 2024.

Comments: 19 pages, 17 figures. Project page: https://snap-research.github.io/GTR/

arXiv:2406.05431 [pdf]

MaTableGPT: GPT-based Table Data Extractor from Materials Science Literature

Authors: Gyeong Hoon Yi, Jiwoo Choi, Hyeongyun Song, Olivia Miano, Jaewoong Choi, Kihoon Bang, Byungju Lee, Seok Su Sohn, David Buttler, Anna Hiszpanski, Sang Soo Han, Donghun Kim

Abstract: Efficiently extracting data from tables in the scientific literature is pivotal for building large-scale databases. However, the tables reported in materials science papers exist in highly diverse forms; thus, rule-based extraction is ineffective. To overcome this challenge, we present MaTableGPT, a GPT-based table data extractor for the materials science literature. MaTableGPT features key strategies of table data representation and table splitting for better GPT comprehension, and filters hallucinated information through follow-up questions. When applied to a vast volume of water splitting catalysis literature, MaTableGPT achieved an extraction accuracy (total F1 score) of up to 96.8%. Through comprehensive evaluations of the GPT usage cost, labeling cost, and extraction accuracy for the zero-shot, few-shot and fine-tuning learning methods, we present a Pareto-front mapping in which the few-shot learning method was found to be the most balanced solution owing to both its high extraction accuracy (total F1 score > 95%) and low cost (GPT usage cost of 5.97 US dollars and labeling cost of 10 I/O paired examples). The statistical analyses conducted on the database generated by MaTableGPT revealed valuable insights into the distribution of the overpotential and elemental utilization across the reported catalysts in the water splitting literature.
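A Pareto-front mapping like the one described keeps only the options that no other option beats on every criterion at once. The sketch below assumes each option is scored on objectives that are all to be minimized (for example GPT usage cost, labeling cost, and 1 minus the F1 score); the option names and numbers in the example are hypothetical placeholders, not results from the paper.

```python
def dominates(a, b):
    """True if a is no worse than b on every objective and strictly better on at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(options):
    """Names of the options not dominated by any other option (all objectives minimized)."""
    return [
        name
        for name, objectives in options.items()
        if not any(dominates(other, objectives) for other in options.values())
    ]


if __name__ == "__main__":
    # Hypothetical options: (usage cost, labeling cost, 1 - F1). Not data from the paper.
    example = {
        "A": (1.0, 0.0, 0.20),
        "B": (2.0, 10.0, 0.05),
        "C": (3.0, 100.0, 0.10),
        "D": (0.5, 0.0, 0.50),
    }
    print(pareto_front(example))  # ['A', 'B', 'D']
```

Option C is dropped because B is cheaper on both cost axes and more accurate, while the cheap but inaccurate option D stays on the front; choosing the "most balanced" point among the survivors is a separate judgment the front alone does not make.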

Submitted 8 June, 2024; originally announced June 2024.

arXiv:2406.05086 [pdf, other]

Robust Reward Design for Markov Decision Processes

Authors: Shuo Wu, Haoxiang Ma, Jie Fu, Shuo Han

Abstract: The problem of reward design examines the interaction between a leader and a follower, where the leader aims to shape the follower's behavior to maximize the leader's payoff by modifying the follower's reward function. Current approaches to reward design rely on an accurate model of how the follower responds to reward modifications, which can be sensitive to modeling inaccuracies. To address this issue of sensitivity, we present a solution that offers robustness against uncertainties in modeling the follower, including 1) how the follower breaks ties in the presence of nonunique best responses, 2) inexact knowledge of how the follower perceives reward modifications, and 3) bounded rationality of the follower. Our robust solution is guaranteed to exist under mild conditions and can be obtained numerically by solving a mixed-integer linear program. Numerical experiments on multiple test cases demonstrate that our solution improves robustness compared to the standard approach without incurring significant additional computing costs.

Submitted 7 June, 2024; originally announced June 2024.

Comments: 50 pages, 8 figures

arXiv:2406.05078 [pdf, other]

Enhancing LEO Mega-Constellations with Inter-Satellite Links: Vision and Challenges

Authors: Chenyu Wu, Shuai Han, Qian Chen, Yu Wang, Weixiao Meng, Abderrahim Benslimane

Abstract: Low Earth orbit (LEO) satellites have been envisioned as a significant component of the sixth generation (6G) network architecture for achieving ubiquitous coverage and seamless access. However, the implementation of LEO satellites is largely restricted by the deployment of ground stations. Inter-satellite links (ISLs) have been regarded as a promising technique to fully exploit the potential of LEO mega-constellations by concatenating multiple satellites to constitute an autonomous space network. In this article, we present the merits of implementing ISLs in LEO mega-constellations and the representative applications empowered or inspired by ISLs. Moreover, we outline several key technical challenges as well as potential solutions related to LEO satellite networks with ISLs, including performance analysis for system design, routing and load balancing, and resource allocation. In particular, the potential of using ISLs to enhance in-flight connectivity is showcased with a preliminary performance evaluation. Finally, some open issues are discussed to inspire future research.

Submitted 7 June, 2024; originally announced June 2024.

Comments: 7 pages, 4 figures

arXiv:2406.04639 [pdf, other]

Cooperative Meta-Learning with Gradient Augmentation

Authors: Jongyun Shin, Seunjin Han, Jangho Kim

Abstract: Model-agnostic meta-learning (MAML) is one of the most widely used gradient-based meta-learning methods, consisting of two optimization loops: an inner loop and an outer loop. MAML learns a new task from the meta-initialization parameters with an inner update and finds the meta-initialization parameters in the outer loop. In general, injecting noise into the gradient of the model to augment the gradient is a widely used regularization method. In this work, we propose a novel cooperative meta-learning framework dubbed CML, which leverages gradient-level regularization with gradient augmentation. We inject learnable noise into the gradient of the model to improve generalization. The key idea of CML is to introduce a co-learner that is updated only in the outer loop, not in the inner loop, to augment gradients for finding better meta-initialization parameters. Since the co-learner does not update in the inner loop, it can be easily removed after meta-training. Therefore, CML performs inference with only the meta-learner, without additional cost or performance degradation. We demonstrate that CML is easily applicable to gradient-based meta-learning methods and leads to increased performance in few-shot regression, few-shot image classification and few-shot node classification tasks. Our code is at https://github.com/JJongyn/CML.
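The two optimization loops described above can be made concrete with plain MAML on a toy family of tasks. The sketch assumes scalar linear-regression tasks y = a·x with a random slope a, a single inner gradient step with step size `ALPHA`, and exact differentiation through that step, which is easy to derive by hand for this model. It shows only the MAML backbone; CML's co-learner and learnable gradient noise are not implemented here.

```python
import numpy as np

ALPHA = 0.1  # inner-loop step size (assumed)


def loss_and_grad(w, x, y):
    """Mean squared error of the scalar model y_hat = w * x and its gradient in w."""
    r = w * x - y
    return np.mean(r ** 2), 2.0 * np.mean(x * r)


def adapt(w, x_s, y_s, alpha=ALPHA):
    """Inner loop: one gradient step on the task's support set."""
    _, g = loss_and_grad(w, x_s, y_s)
    return w - alpha * g


def meta_loss_and_grad(w, task, alpha=ALPHA):
    """Outer objective: query loss after adaptation, differentiated through the inner step."""
    x_s, y_s, x_q, y_q = task
    loss_q, g_q = loss_and_grad(adapt(w, x_s, y_s, alpha), x_q, y_q)
    d_adapted_dw = 1.0 - 2.0 * alpha * np.mean(x_s ** 2)  # exact for this linear model
    return loss_q, g_q * d_adapted_dw


def sample_task(rng, n=10):
    """Hypothetical task family: y = slope * x, split into support and query sets."""
    slope = rng.uniform(-2.0, 2.0)
    x = rng.uniform(-1.0, 1.0, size=2 * n)
    y = slope * x
    return x[:n], y[:n], x[n:], y[n:]


def meta_train(steps=200, batch=8, beta=0.05, seed=0):
    """Outer loop: update the meta-initialization w with averaged meta-gradients."""
    rng = np.random.default_rng(seed)
    w = 0.0
    for _ in range(steps):
        grads = [meta_loss_and_grad(w, sample_task(rng))[1] for _ in range(batch)]
        w -= beta * np.mean(grads)
    return w
```

Because the meta-objective is quadratic in `w` for this model, a central finite difference matches the hand-derived outer-loop gradient almost exactly, which is a convenient check before moving to an autodiff framework and real networks.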

Submitted 7 June, 2024; originally announced June 2024.

Comments: Accepted to UAI 2024
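To make the two-loop structure described in the CML abstract concrete, here is a minimal first-order MAML sketch on a hypothetical family of 1-D regression tasks (y = a·x), in pure Python. The optional Gaussian noise added to the outer gradient only illustrates generic gradient-noise regularization: it is fixed rather than learnable and there is no co-learner, so this is not CML itself. All function names, tasks, and hyperparameters are illustrative assumptions.

```python
import random
from typing import Sequence


def loss_grad(w: float, a: float, xs: Sequence[float]) -> float:
    """Gradient of the mean squared error of y_hat = w*x against y = a*x."""
    # d/dw mean((w*x - a*x)^2) = mean(2 * (w - a) * x^2)
    return sum(2.0 * (w - a) * x * x for x in xs) / len(xs)


def first_order_maml(
    task_slopes: Sequence[float],
    xs: Sequence[float],
    meta_steps: int = 200,
    inner_lr: float = 0.1,
    outer_lr: float = 0.05,
    noise_std: float = 0.0,
    seed: int = 0,
) -> float:
    """Return a meta-initialization w for a toy family of linear tasks (hypothetical)."""
    rng = random.Random(seed)
    w = 0.0  # meta-initialization parameter
    for _ in range(meta_steps):
        meta_grad = 0.0
        for a in task_slopes:
            # Inner loop: one gradient step adapts w to task a.
            w_task = w - inner_lr * loss_grad(w, a, xs)
            # First-order approximation: the gradient at the adapted
            # parameters is this task's contribution to the outer gradient.
            meta_grad += loss_grad(w_task, a, xs)
        meta_grad /= len(task_slopes)
        # Generic gradient augmentation: fixed Gaussian noise (not learnable).
        meta_grad += rng.gauss(0.0, noise_std)
        # Outer loop: update the meta-initialization.
        w -= outer_lr * meta_grad
    return w


if __name__ == "__main__":
    # Tasks with slopes 1 and 3 are symmetric around 2, so w should approach 2.
    print(first_order_maml([1.0, 3.0], [1.0]))
```

With `noise_std=0.0` the meta-initialization converges to about 2.0 for these two symmetric tasks; a positive `noise_std` makes it fluctuate around that value instead.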

arXiv:2406.03274 [pdf, other]

Enhancing CTC-based speech recognition with diverse modeling units

Authors: Shiyi Han, Zhihong Lei, Mingbin Xu, Xingyu Na, Zhen Huang

Abstract: In recent years, the evolution of end-to-end (E2E) automatic speech recognition (ASR) models has been remarkable, largely due to advances in deep learning architectures such as the Transformer. On top of E2E systems, researchers have achieved substantial accuracy improvements by rescoring an E2E model's N-best hypotheses with a phoneme-based model. This raises an interesting question about where the improvements come from, other than the system combination effect. We examine the underlying mechanisms driving these gains and propose an efficient joint training approach, in which E2E models are trained jointly with diverse modeling units. This methodology not only aligns the strengths of phoneme- and grapheme-based models but also reveals that using these diverse modeling units synergistically can significantly enhance model accuracy. Our findings offer new insights into the optimal integration of heterogeneous modeling units in the development of more robust and accurate ASR systems.

Submitted 11 June, 2024; v1 submitted 5 June, 2024; originally announced June 2024.

arXiv:2406.03019 [pdf, other]

Puzzle Pieces Picker: Deciphering Ancient Chinese Characters with Radical Reconstruction

Authors: Pengjie Wang, Kaile Zhang, Xinyu Wang, Shengwei Han, Yongge Liu, Lianwen Jin, Xiang Bai, Yuliang Liu

Abstract: Oracle Bone Inscriptions (OBI) are one of the oldest existing forms of writing in the world. However, owing to their great antiquity, a large number of OBI remain undeciphered, making their decipherment one of the global challenges in the field of paleography today. This paper introduces a novel approach, namely Puzzle Pieces Picker (P$^3$), to decipher these enigmatic characters through radical reconstruction. We deconstruct OBI into foundational strokes and radicals, then employ a Transformer model to reconstruct them into their modern counterparts, offering a groundbreaking solution to ancient script analysis. To further this endeavor, a new Ancient Chinese Character Puzzles (ACCP) dataset was developed, comprising an extensive collection of character images from seven key historical stages, annotated with detailed radical sequences. The experiments yield promising insights, underscoring the potential and effectiveness of our approach in deciphering the intricacies of ancient Chinese scripts. Through this novel dataset and methodology, we aim to bridge the gap between traditional philology and modern document analysis techniques, offering new insights into the rich history of Chinese linguistic heritage.

Submitted 5 June, 2024; originally announced June 2024.

Comments: ICDAR 2024

arXiv:2406.02441 [pdf, other]

Probing the Scalar WIMP-Pion Coupling with the first LUX-ZEPLIN data

Authors: J. Aalbers, D. S. Akerib, A. K. Al Musalhi, F. Alder, C. S. Amarasinghe, A. Ames, T. J. Anderson, N. Angelides, H. M. Araújo, J. E. Armstrong, M. Arthurs, A. Baker, S. Balashov, J. Bang, E. E. Barillier, J. W. Bargemann, K. Beattie, T. Benson, A. Bhatti, A. Biekert, T. P. Biesiadzinski, H. J. Birch, E. J. Bishop, G. M. Blockinger, B. Boxer, et al. (178 additional authors not shown)

Abstract: Weakly interacting massive particles (WIMPs) may interact with a virtual pion that is exchanged between nucleons. This interaction channel is important to consider in models where the spin-independent isoscalar channel is suppressed. Using data from the first science run of the LUX-ZEPLIN dark matter experiment, containing 60 live days of data in a 5.5 tonne fiducial mass of liquid xenon, we report the results of a search for WIMP-pion interactions. We observe no significant excess and set an upper limit of $1.5\times10^{-46}$ cm$^2$ at a 90% confidence level for a WIMP mass of 33 GeV/c$^2$ for this interaction.

Submitted 4 June, 2024; originally announced June 2024.

arXiv:2406.01963 [pdf]

Diamond molecular balance: Revolutionizing high-resolution mass spectrometry from MDa to TDa at room temperature

Authors: Donggeun Lee, Seung-Woo Jeon, Chang-Hwan Yi, Yang-Hee Kim, Yeeun Choi, Sang-Hun Lee, Jinwoong Cha, Seung-Bo Shim, Junho Suh, Il-Young Kim, Dongyeon Daniel Kang, Hojoong Jung, Cherlhyun Jeong, Jae-pyoung Ahn, Hee Chul Park, Sang-Wook Han, Chulki Kim

Abstract: The significance of mass spectrometry lies in its unparalleled ability to accurately identify and quantify molecules in complex samples, providing invaluable insights into molecular structures and interactions. Here, we leverage diamond nanostructures as highly sensitive mass sensors by utilizing a self-excitation mechanism under an electron beam in a conventional scanning electron microscope (SEM). The diamond molecular balance (DMB) exhibits an exceptional mass resolution of 0.36 MDa, based on its outstanding mechanical quality factor and frequency stability, along with an extensive dynamic range from MDa to TDa. This positions the DMB at the forefront of molecular balances operating at room temperature. Notably, the DMB demonstrates its ability to measure the mass of a single bacteriophage T4 by precisely locating the analyte on the device. These findings highlight the groundbreaking potential of the DMB as a revolutionary tool for mass spectrometry at room temperature.

Submitted 25 July, 2024; v1 submitted 4 June, 2024; originally announced June 2024.

Comments: 16 pages, 4 figures

arXiv:2406.01886 [pdf, other]

Monotone Equilibrium Design for Matching Markets with Signaling

Authors: Seungjin Han, Alex Sam, Youngki Shin

Abstract: We study monotone equilibrium design by a planner who chooses an interval of reactions that receivers take before senders and receivers move in matching markets with signaling. Given the convex efficiency frontier over sender surplus and receiver surplus generated by the interval delegation, the optimal reaction interval crucially depends on the ripple effect of its lower bound and on the trade-off between matching inefficiency and signaling cost savings in the top pooling region generated by its upper bound. Our analysis generates cohesive market design results that integrate the literature on minimum wage, firm size distribution, and relative risk aversion.

Submitted 23 July, 2024; v1 submitted 3 June, 2024; originally announced June 2024.

Comments: 54 pages, 14 figures

arXiv:2406.01681 [pdf, other]

Gross-Neveu-Yukawa theory of $\text{SO}(2N)\rightarrow \text{SO}(N) \times \text{SO}(N)$ spontaneous symmetry breaking

Authors: SangEun Han, Igor F. Herbut

Abstract: We construct and study the relativistic Gross-Neveu-Yukawa field theory for the $\text{SO}(2N)$ real symmetric second-rank tensor order parameter coupled to $N_f$ flavors of $4N$-component Majorana fermions in 2+1 dimensions. Such a tensor order parameter unifies all Lorentz-invariant mass-gap orders for $N$ two-component Dirac fermions in two dimensions except for the $\text{SO}(2N)$-singlet anomalous quantum Hall state. The value $N_f=1$ corresponds to the canonical Gross-Neveu model. Within the leading-order $ε$-expansion around the upper critical dimension of $3+1$, the field theory exhibits a critical fixed point in its renormalization group flow which describes spontaneous symmetry breaking to $\text{SO}(N)\times \text{SO}(N)$ when the number of Majorana fermion flavors exceeds a critical value $N_{f,c2}\approx 2N$. For $N_{f,c1} < N_f < N_{f,c2}$, with $N_{f,c1} \approx 1.080 N$, the critical fixed point resides in the unstable region of the theory where the effective potential is unbounded from below, whereas for $N_f < N_{f,c1}$ there is no real critical fixed point, and the flow runs away. In either case, for $N_f < N_{f,c2}$ the transition should become fluctuation-induced first-order, and we discuss the dependence of its size on the parameters in the theory. One-loop critical exponents for the new universality class at $N_f > N_{f,c2}$ are computed and the flow diagram in various regimes is discussed.

Submitted 13 June, 2024; v1 submitted 3 June, 2024; originally announced June 2024.

Comments: 13 pages, 5 figures

arXiv:2406.01339 [pdf, other]

Recover as It is Designed to Be: Recovering from Compatibility Mobile App Crashes by Reusing User Flows

Authors: Donghwi Kim, Hyungjun Yoon, Chang Min Park, Sujin Han, Youngjin Kwon, Steven Y. Ko, Sung-Ju Lee

Abstract: Android OS is severely fragmented by API updates and device vendors' OS customization, creating a market condition where vastly different OS versions coexist. This gives rise to compatibility crash problems where Android apps crash on certain Android versions but not on others. Although well-known, this problem is extremely challenging for app developers to overcome due to the sheer number of Android versions in the market that must be tested. We present RecoFlow, a framework for enabling app developers to automatically recover an app from a crash by programming user flows with our API and visual tools. RecoFlow tracks app feature usage with the user flows on user devices and recovers an app from a crash by replaying UI actions of the app feature disrupted by the crash. To prevent recurring compatibility crashes, RecoFlow executes a previously crashed app in compatibility mode that is enabled by our novel Android OS virtualization technique. Our evaluation with professional Android developers shows that our API and tools are easy to use and effective in recovering from compatibility crashes.

Submitted 3 June, 2024; originally announced June 2024.

arXiv:2406.00897 [pdf, other]

Exact Solutions of a Time-Delay Advection Equation and a Fractional Time-Delay Advection Equation

Authors: Christopher N. Angstmann, Stuart-James M. Burney, Daniel S. Han, Bruce I. Henry, Boris Z. Huang, Zhuang Xu

Abstract: Exact solutions are derived for a time-delay advection equation and a fractional-order time-delay advection equation with a time-delay in the spatial derivative. Solutions are obtained, for arbitrary separable initial conditions, by incorporating recently introduced delay functions in a separation of variables approach. Examples are provided showing oscillatory and translatory behaviours fundamentally different to standard propagating wave solutions.

Submitted 2 June, 2024; originally announced June 2024.

Comments: Letter

MSC Class: 35C10; 35F10; 34K06; 42A38; 33E20

arXiv:2406.00684 [pdf, other]

Deciphering Oracle Bone Language with Diffusion Models

Authors: Haisu Guan, Huanxin Yang, Xinyu Wang, Shengwei Han, Yongge Liu, Lianwen Jin, Xiang Bai, Yuliang Liu

Abstract: Originating from China's Shang Dynasty approximately 3,000 years ago, the Oracle Bone Script (OBS) is a cornerstone in the annals of linguistic history, predating many established writing systems. Despite the discovery of thousands of inscriptions, a vast expanse of OBS remains undeciphered, casting a veil of mystery over this ancient language. The emergence of modern AI technologies presents a novel frontier for OBS decipherment, challenging traditional NLP methods that rely heavily on large textual corpora, a luxury not afforded by historical languages. This paper introduces a novel approach by adopting image generation techniques, specifically through the development of Oracle Bone Script Decipher (OBSD). Utilizing a conditional diffusion-based strategy, OBSD generates vital clues for decipherment, charting a new course for AI-assisted analysis of ancient languages. To validate its efficacy, extensive experiments were conducted on an oracle bone script dataset, with quantitative results demonstrating the effectiveness of OBSD. Code and decipherment results will be made available at https://github.com/guanhaisu/OBSD.

Submitted 2 June, 2024; originally announced June 2024.

Comments: ACL2024 main conference long paper

arXiv:2405.20610 [pdf, other]

Revisiting and Maximizing Temporal Knowledge in Semi-supervised Semantic Segmentation

Authors: Wooseok Shin, Hyun Joon Park, Jin Sob Kim, Sung Won Han

Abstract: In semi-supervised semantic segmentation, Mean Teacher- and co-training-based approaches are employed to mitigate confirmation bias and coupling problems. However, despite their high performance, these approaches frequently involve complex training pipelines and a substantial computational burden, limiting their scalability and compatibility. In this paper, we propose PrevMatch, a framework that effectively mitigates these limitations by maximizing the utilization of the temporal knowledge obtained during the training process. The PrevMatch framework relies on two core strategies: (1) we reconsider the use of temporal knowledge and directly utilize previous models obtained during training to generate additional pseudo-label guidance, referred to as previous guidance; (2) we design a highly randomized ensemble strategy to maximize the effectiveness of the previous guidance. Experimental results on four benchmark semantic segmentation datasets confirm that the proposed method consistently outperforms existing methods across various evaluation protocols. In particular, with DeepLabV3+ and ResNet-101 network settings, PrevMatch outperforms the existing state-of-the-art method, Diverse Co-training, by +1.6 mIoU on Pascal VOC with only 92 annotated images, while achieving 2.4 times faster training. Furthermore, the results indicate that PrevMatch induces stable optimization, particularly benefiting classes that exhibit poor performance. Code is available at https://github.com/wooseok-shin/PrevMatch.

Submitted 30 May, 2024; originally announced May 2024.

Comments: 14 pages, 5 figures, submitted to IEEE TPAMI. This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible

arXiv:2405.19521 [pdf, other]

Crowdsourcing with Difficulty: A Bayesian Rating Model for Heterogeneous Items

Authors: Seong Woo Han, Ozan Adıgüzel, Bob Carpenter

Abstract: In applied statistics and machine learning, the "gold standards" used for training are often biased and almost always noisy. Dawid and Skene's justifiably popular crowdsourcing model adjusts for rater (coder, annotator) sensitivity and specificity, but fails to capture distributional properties of rating data gathered for training, which in turn biases training. In this study, we introduce a general-purpose measurement-error model with which we can infer consensus categories by adding item-level effects for difficulty, discriminativeness, and guessability. We further show how to constrain the bimodal posterior of these models to avoid (or if necessary, allow) adversarial raters. We validate our model's goodness of fit with posterior predictive checks, the Bayesian analogue of $χ^2$ tests. Dawid and Skene's model is rejected by goodness-of-fit tests, whereas our new model, which adjusts for item heterogeneity, is not rejected. We illustrate our new model with two well-studied data sets, binary rating data for caries in dental X-rays and implication in natural language.

Submitted 29 May, 2024; originally announced May 2024.

arXiv:2405.19335 [pdf, other]

X-VILA: Cross-Modality Alignment for Large Language Model

Authors: Hanrong Ye, De-An Huang, Yao Lu, Zhiding Yu, Wei Ping, Andrew Tao, Jan Kautz, Song Han, Dan Xu, Pavlo Molchanov, Hongxu Yin

Abstract: We introduce X-VILA, an omni-modality model designed to extend the capabilities of large language models (LLMs) by incorporating image, video, and audio modalities. By aligning modality-specific encoders with LLM inputs and diffusion decoders with LLM outputs, X-VILA achieves cross-modality understanding, reasoning, and generation. To facilitate this cross-modality alignment, we curate an effective interleaved any-to-any modality instruction-following dataset. Furthermore, we identify a significant problem with the current cross-modality alignment method, which results in visual information loss. To address the issue, we propose a visual alignment mechanism with a visual embedding highway module. We then introduce a resource-efficient recipe for training X-VILA, which exhibits proficiency in any-to-any modality conversation, surpassing previous approaches by large margins. X-VILA also showcases emergent properties across modalities even in the absence of similar training data. The project will be made open-source.

Submitted 29 May, 2024; originally announced May 2024.

Comments: Technical Report

arXiv:2405.18919 [pdf, other]

Exploiting Inter-Satellite Links for In-Flight Connectivity Scheme in Space-Air-Ground Integrated Networks

Authors: Qian Chen, Chenyu Wu, Shuai Han, Weixiao Meng, Tony Q. S. Quek

Abstract: Space-air-ground integrated networks (SAGIN) are pivotal for achieving uninterrupted in-flight connectivity (IFC). Most existing studies, however, merely treat satellites as transparent forwarding nodes and overlook their caching capabilities in enhancing the IFC data rate. In this paper, we consider an IFC-oriented SAGIN, where the satellites collaboratively deliver content to airborne passengers to facilitate airborne communication. Since cached files are instantaneously accessible via satellites, this work pioneers the integration of multiple inter-satellite links (ISLs) into the IFC framework, thereby innovating the content delivery process. To minimize the average delay of content delivery, we formulate an optimization problem and propose an exact penalty-based method to derive the satellite association scheme. Our proposed framework has low complexity and thus paves the way for high-speed Internet connectivity for aviation passengers. Finally, simulation results are presented to demonstrate the effectiveness of our proposed IFC framework for SAGIN.

Submitted 29 May, 2024; originally announced May 2024.

Comments: 6 pages, 4 figures

arXiv:2405.18698 [pdf, other]

Spectral-Risk Safe Reinforcement Learning with Convergence Guarantees

Authors: Dohyeong Kim, Taehyun Cho, Seungyub Han, Hojun Chung, Kyungjae Lee, Songhwai Oh

Abstract: The field of risk-constrained reinforcement learning (RCRL) has been developed to effectively reduce the likelihood of worst-case scenarios by explicitly handling risk-measure-based constraints. However, the nonlinearity of risk measures makes it challenging to achieve convergence and optimality. To overcome the difficulties posed by the nonlinearity, we propose a spectral risk measure-constrained RL algorithm, spectral-risk-constrained policy optimization (SRCPO), a bilevel optimization approach that utilizes the duality of spectral risk measures. In the bilevel optimization structure, the outer problem involves optimizing dual variables derived from the risk measures, while the inner problem involves finding an optimal policy given these dual variables. The proposed method, to the best of our knowledge, is the first to guarantee convergence to an optimum in the tabular setting. Furthermore, the proposed method has been evaluated on continuous control tasks and showed the best performance among RCRL algorithms that satisfy the constraints.

Submitted 28 May, 2024; originally announced May 2024.

Comments: 26 pages
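SRCPO is built on spectral risk measures. As background, a spectral risk measure weights the quantiles of a loss distribution by a non-negative, non-decreasing spectrum $\phi$ on $[0, 1]$ that integrates to 1, i.e. $M_\phi(X) = \int_0^1 \phi(u)\,F_X^{-1}(u)\,du$; CVaR at level $\alpha$ is the special case $\phi(u) = 1/(1-\alpha)$ for $u \ge \alpha$ and 0 otherwise. The sketch below is a plain empirical estimator of that standard definition, assuming larger values are worse (losses or costs). It does not implement SRCPO's bilevel optimization or its use of duality, and the function names are hypothetical.

```python
from typing import Callable, Sequence


def spectral_risk(losses: Sequence[float], spectrum_cdf: Callable[[float], float]) -> float:
    """Empirical spectral risk of a sample of losses.

    spectrum_cdf(u) is Phi(u), the integral of phi from 0 to u, with
    Phi(0) = 0 and Phi(1) = 1. The i-th smallest loss (0-based) receives
    the weight Phi((i + 1) / n) - Phi(i / n), i.e. the mass that phi puts
    on that loss's quantile bin.
    """
    xs = sorted(losses)  # empirical quantile function, ascending
    n = len(xs)
    return sum(
        (spectrum_cdf((i + 1) / n) - spectrum_cdf(i / n)) * x
        for i, x in enumerate(xs)
    )


def cvar_spectrum_cdf(alpha: float) -> Callable[[float], float]:
    """Phi for CVaR_alpha: phi(u) = 1 / (1 - alpha) on [alpha, 1], else 0."""
    def phi_cdf(u: float) -> float:
        return max(0.0, u - alpha) / (1.0 - alpha)
    return phi_cdf


if __name__ == "__main__":
    losses = [4.0, 1.0, 3.0, 2.0]
    print(spectral_risk(losses, cvar_spectrum_cdf(0.0)))  # risk-neutral mean: 2.5
    print(spectral_risk(losses, cvar_spectrum_cdf(0.5)))  # mean of the worst half: 3.5
```

Passing the integrated spectrum $\Phi$ rather than $\phi$ keeps the bin weights exact for piecewise-constant spectra such as CVaR; any other non-decreasing spectrum can be plugged in through its own $\Phi$.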

arXiv:2405.16493 [pdf, other]

Flow Snapshot Neurons in Action: Deep Neural Networks Generalize to Biological Motion Perception

Authors: Shuangpeng Han, Ziyu Wang, Mengmi Zhang

Abstract: Biological motion perception (BMP) refers to humans' ability to perceive and recognize the actions of living beings solely from their motion patterns, sometimes as minimal as those depicted on point-light displays. While humans excel at these tasks without any prior training, current AI models generalize poorly. To close this research gap, we propose the Motion Perceiver (MP). MP relies solely on patch-level optical flows from video clips as inputs. During training, it learns prototypical flow snapshots through a competitive binding mechanism and integrates invariant motion representations to predict action labels for the given video. During inference, we evaluate the generalization ability of all AI models and humans on 62,656 video stimuli spanning 24 BMP conditions using point-light displays in neuroscience. Remarkably, MP outperforms all existing AI models with a maximum improvement of 29% in top-1 action recognition accuracy on these conditions. Moreover, we benchmark all AI models in point-light displays of two standard video datasets in computer vision. MP also demonstrates superior performance in these cases. More interestingly, via psychophysics experiments, we found that MP recognizes biological movements in a way that aligns with human behavioural data. All data and code will be made public.

Submitted 26 May, 2024; originally announced May 2024.

arXiv:2405.16234 [pdf, other]

Vision Language Models for Spreadsheet Understanding: Challenges and Opportunities

Authors: Shiyu Xia, Junyu Xiong, Haoyu Dong, Jianbo Zhao, Yuzhang Tian, Mengyu Zhou, Yeye He, Shi Han, Dongmei Zhang

Abstract: This paper explores the capabilities of Vision Language Models (VLMs) in spreadsheet comprehension. We propose three self-supervised challenges with corresponding evaluation metrics to comprehensively evaluate VLMs on Optical Character Recognition (OCR), spatial perception, and visual format recognition. Additionally, we utilize the spreadsheet table detection task to assess the overall performance of VLMs by integrating these challenges. To probe VLMs more finely, we propose three spreadsheet-to-image settings: column width adjustment, style change, and address augmentation. We propose variants of prompts to address the above tasks in different settings. Notably, to leverage the strengths of VLMs in understanding text rather than two-dimensional positioning, we propose to decode cell values on the four boundaries of the table in spreadsheet boundary detection. Our findings reveal that VLMs demonstrate promising OCR capabilities but produce unsatisfactory results due to cell omission and misalignment, and they notably exhibit insufficient spatial and format recognition skills, motivating future work to enhance VLMs' spreadsheet data comprehension capabilities using our methods to generate extensive spreadsheet-image pairs in various settings.

Submitted 8 August, 2024; v1 submitted 25 May, 2024; originally announced May 2024.

Showing 51–100 of 1,806 results for author: Han, S