Proxona: Leveraging LLM-Driven Personas to Enhance Creators' Understanding of Their Audience

Authors: Yoonseo Choi, Eun Jeong Kang, Seulgi Choi, Min Kyung Lee, Juho Kim

Abstract: Creators are nothing without their audience, and thereby understanding their audience is the cornerstone of their professional achievement. Yet many creators feel lost while comprehending audiences with existing tools, which offer insufficient insights for tailoring content to audience needs. To address the challenges creators face in understanding their audience, we present Proxona, a system for defining and extracting representative audience personas from the comments. Creators converse with personas to gain insights into their preferences and engagement, solicit feedback, and implement evidence-based improvements to their content. Powered by large language models, Proxona analyzes audience comments, distilling the latent characteristics of audiences into tangible dimensions (classification categories) and values (category attributes). Proxona then clusters these into synthetic personas. Our technical evaluations demonstrated that our pipelines effectively generated relevant and distinct dimensions and values, enabling the deduction of audience-reflecting personas, while minimizing the likelihood of hallucinations in persona responses. Our user evaluation with 11 creators showed that Proxona supported creators to gain new insights about their audience, make informed decisions, and successfully complete content creation with high confidence. Proxona's data-driven audience personas empower creators to seamlessly integrate audience perspectives into their creative processes, fostering a collaborative approach to content creation.

Submitted 20 August, 2024; originally announced August 2024.

Comments: 32 pages (including 14 pages of Appendix)

arXiv:2408.10900 [pdf, other]

Towards Efficient Formal Verification of Spiking Neural Network

Authors: Baekryun Seong, Jieung Kim, Sang-Ki Ko

Abstract: Recently, AI research has primarily focused on large language models (LLMs), and increasing accuracy often involves scaling up and consuming more power. The power consumption of AI has become a significant societal issue; in this context, spiking neural networks (SNNs) offer a promising solution. SNNs operate event-driven, like the human brain, and compress information temporally. These characteristics allow SNNs to significantly reduce power consumption compared to perceptron-based artificial neural networks (ANNs), highlighting them as a next-generation neural network technology. However, societal concerns regarding AI go beyond power consumption, with the reliability of AI models being a global issue. For instance, adversarial attacks on AI models are a well-studied problem in the context of traditional neural networks. Despite their importance, the stability and property verification of SNNs remains in the early stages of research. Most SNN verification methods are time-consuming and barely scalable, making practical applications challenging. In this paper, we introduce temporal encoding to achieve practical performance in verifying the adversarial robustness of SNNs. We conduct a theoretical analysis of this approach and demonstrate its success in verifying SNNs at previously unmanageable scales. Our contribution advances SNN verification to a practical level, facilitating the safer application of SNNs.

Submitted 20 August, 2024; originally announced August 2024.

arXiv:2408.10086 [pdf, other]

ARMADA: Attribute-Based Multimodal Data Augmentation

Authors: Xiaomeng Jin, Jeonghwan Kim, Yu Zhou, Kuan-Hao Huang, Te-Lin Wu, Nanyun Peng, Heng Ji

Abstract: In Multimodal Language Models (MLMs), the cost of manually annotating high-quality image-text pair data for fine-tuning and alignment is extremely high. While existing multimodal data augmentation frameworks propose ways to augment image-text pairs, they either suffer from semantic inconsistency between texts and images, or generate unrealistic images, causing knowledge gap with real world examples. To address these issues, we propose Attribute-based Multimodal Data Augmentation (ARMADA), a novel multimodal data augmentation method via knowledge-guided manipulation of visual attributes of the mentioned entities. Specifically, we extract entities and their visual attributes from the original text data, then search for alternative values for the visual attributes under the guidance of knowledge bases (KBs) and large language models (LLMs). We then utilize an image-editing model to edit the images with the extracted attributes. ARMADA is a novel multimodal data generation framework that: (i) extracts knowledge-grounded attributes from symbolic KBs for semantically consistent yet distinctive image-text pair generation, (ii) generates visually similar images of disparate categories using neighboring entities in the KB hierarchy, and (iii) uses the commonsense knowledge of LLMs to modulate auxiliary visual attributes such as backgrounds for more robust representation of original entities. Our empirical results over four downstream tasks demonstrate the efficacy of our framework to produce high-quality data and enhance the model performance. This also highlights the need to leverage external knowledge proxies for enhanced interpretability and real-world grounding.

Submitted 19 August, 2024; originally announced August 2024.

arXiv:2408.09734 [pdf, other]

Mutually-Aware Feature Learning for Few-Shot Object Counting

Authors: Yerim Jeon, Subeen Lee, Jihwan Kim, Jae-Pil Heo

Abstract: Few-shot object counting has garnered significant attention for its practicality as it aims to count target objects in a query image based on given exemplars without the need for additional training. However, there is a shortcoming in the prevailing extract-and-match approach: query and exemplar features lack interaction during feature extraction since they are extracted unaware of each other and later correlated based on similarity. This can lead to insufficient target awareness of the extracted features, resulting in target confusion in precisely identifying the actual target when multiple class objects coexist. To address this limitation, we propose a novel framework, Mutually-Aware FEAture learning(MAFEA), which encodes query and exemplar features mutually aware of each other from the outset. By encouraging interaction between query and exemplar features throughout the entire pipeline, we can obtain target-aware features that are robust to a multi-category scenario. Furthermore, we introduce a background token to effectively associate the target region of query with exemplars and decouple its background region from them. Our extensive experiments demonstrate that our model reaches a new state-of-the-art performance on the two challenging benchmarks, FSCD-LVIS and FSC-147, with a remarkably reduced degree of the target confusion problem.

Submitted 19 August, 2024; originally announced August 2024.

Comments: Submitted to Pattern Recognition

arXiv:2408.09685 [pdf, ps, other]

Triorthogonal Codes and Self-dual Codes

Authors: Minjia Shi, Haodong Lu, Jon-Lark Kim, Patrick Sole

Abstract: Triorthogonal matrices were introduced in Quantum Information Theory in connection with distillation of magic states (Bravyi and Haah (2012)). We give an algorithm to construct binary triorthogonal matrices from binary self-dual codes. Further, we generalize to this setting the classical coding techniques of shortening and extending. We also give some simple propagation rules.

Submitted 18 August, 2024; originally announced August 2024.

Comments: 21 pages

MSC Class: 94B05

Journal ref: Quantum Inf Process 23, 280 (2024)

arXiv:2408.09354 [pdf, other]

Boundary-Recovering Network for Temporal Action Detection

Authors: Jihwan Kim, Jaehyun Choi, Yerim Jeon, Jae-Pil Heo

Abstract: Temporal action detection (TAD) is challenging, yet fundamental for real-world video applications. Large temporal scale variation of actions is one of the most primary difficulties in TAD. Naturally, multi-scale features have potential in localizing actions of diverse lengths as widely used in object detection. Nevertheless, unlike objects in images, actions have more ambiguity in their boundaries. That is, small neighboring objects are not considered as a large one while short adjoining actions can be misunderstood as a long one. In the coarse-to-fine feature pyramid via pooling, these vague action boundaries can fade out, which we call 'vanishing boundary problem'. To this end, we propose Boundary-Recovering Network (BRN) to address the vanishing boundary problem. BRN constructs scale-time features by introducing a new axis called scale dimension by interpolating multi-scale features to the same temporal length. On top of scale-time features, scale-time blocks learn to exchange features across scale levels, which can effectively settle down the issue. Our extensive experiments demonstrate that our model outperforms the state-of-the-art on the two challenging benchmarks, ActivityNet-v1.3 and THUMOS14, with remarkably reduced degree of the vanishing boundary problem.

Submitted 18 August, 2024; originally announced August 2024.

Comments: Submitted to Pattern Recognition Journal

arXiv:2408.09064 [pdf, other]

MoRA: LoRA Guided Multi-Modal Disease Diagnosis with Missing Modality

Authors: Zhiyi Shi, Junsik Kim, Wanhua Li, Yicong Li, Hanspeter Pfister

Abstract: Multi-modal pre-trained models efficiently extract and fuse features from different modalities with low memory requirements for fine-tuning. Despite this efficiency, their application in disease diagnosis is under-explored. A significant challenge is the frequent occurrence of missing modalities, which impairs performance. Additionally, fine-tuning the entire pre-trained model demands substantial computational resources. To address these issues, we introduce Modality-aware Low-Rank Adaptation (MoRA), a computationally efficient method. MoRA projects each input to a low intrinsic dimension but uses different modality-aware up-projections for modality-specific adaptation in cases of missing modalities. Practically, MoRA integrates into the first block of the model, significantly improving performance when a modality is missing. It requires minimal computational resources, with less than 1.6% of the trainable parameters needed compared to training the entire model. Experimental results show that MoRA outperforms existing techniques in disease diagnosis, demonstrating superior performance, robustness, and training efficiency.

Submitted 16 August, 2024; originally announced August 2024.

Comments: Accepted by MICCAI 2024

arXiv:2408.08990 [pdf, other]

Adaptive Uncertainty Quantification for Generative AI

Authors: Jungeum Kim, Sean O'Hagan, Veronika Rockova

Abstract: This work is concerned with conformal prediction in contemporary applications (including generative AI) where a black-box model has been trained on data that are not accessible to the user. Mirroring split-conformal inference, we design a wrapper around a black-box algorithm which calibrates conformity scores. This calibration is local and proceeds in two stages by first adaptively partitioning the predictor space into groups and then calibrating sectionally group by group. Adaptive partitioning (self-grouping) is achieved by fitting a robust regression tree to the conformity scores on the calibration set. This new tree variant is designed in such a way that adding a single new observation does not change the tree fit with overwhelmingly large probability. This add-one-in robustness property allows us to conclude a finite sample group-conditional coverage guarantee, a refinement of the marginal guarantee. In addition, unlike traditional split-conformal inference, adaptive splitting and within-group calibration yields adaptive bands which can stretch and shrink locally. We demonstrate benefits of local tightening on several simulated as well as real examples using non-parametric regression. Finally, we consider two contemporary classification applications for obtaining uncertainty quantification around GPT-4o predictions. We conformalize skin disease diagnoses based on self-reported symptoms as well as predicted states of U.S. legislators based on summaries of their ideology. We demonstrate substantial local tightening of the uncertainty sets while attaining similar marginal coverage.
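
The second stage described in this abstract, calibrating conformity scores group by group, can be illustrated with a minimal sketch. It assumes the groups are already given as labels; the paper's robust regression tree that produces the partition is not reproduced here, and the function names `conformal_quantile` and `groupwise_thresholds` are hypothetical. The threshold rule is the standard split-conformal one: the ceil((n + 1)(1 - alpha))-th smallest calibration score.

```python
import math
from collections import defaultdict


def conformal_quantile(scores, alpha):
    """Standard split-conformal threshold for one set of calibration scores.

    Returns the k-th smallest score with k = ceil((n + 1) * (1 - alpha)),
    or infinity when the calibration set is too small for the requested level.
    """
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf
    return sorted(scores)[k - 1]


def groupwise_thresholds(scores, groups, alpha):
    """Calibrate separately within each group (groups are assumed given)."""
    by_group = defaultdict(list)
    for score, group in zip(scores, groups):
        by_group[group].append(score)
    return {g: conformal_quantile(v, alpha) for g, v in by_group.items()}


# Hypothetical calibration data: group "b" has much larger scores than "a",
# so its threshold (and hence its prediction band) is wider.
thresholds = groupwise_thresholds(
    [1, 10, 2, 20, 3, 30], ["a", "b", "a", "b", "a", "b"], alpha=0.5
)
```

Because each group gets its own threshold, the resulting bands can be narrow where scores are small and wide where they are large, which is the local stretching and shrinking the abstract refers to.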

Submitted 16 August, 2024; originally announced August 2024.

arXiv:2408.08631 [pdf, other]

Persona is a Double-edged Sword: Enhancing the Zero-shot Reasoning by Ensembling the Role-playing and Neutral Prompts

Authors: Junseok Kim, Nakyeong Yang, Kyomin Jung

Abstract: Recent studies demonstrate that prompting an appropriate role-playing persona to an LLM improves its reasoning capability. However, assigning a proper persona is difficult since an LLM's performance is extremely sensitive to assigned prompts; therefore, personas sometimes hinder LLMs and degrade their reasoning capabilities. In this paper, we propose a novel framework, Jekyll & Hyde, which ensembles the results of role-playing and neutral prompts to eradicate performance degradation via unilateral use of role-playing prompted LLM and enhance the robustness of an LLM's reasoning ability. Specifically, Jekyll & Hyde collects two potential solutions from both role-playing and neutral prompts and selects a better solution after cross-checking via an LLM evaluator. However, LLM-based evaluators tend to be affected by the order of those potential solutions within the prompt when selecting the proper solution; thus, we also propose a robust LLM evaluator to mitigate the position bias. The experimental analysis demonstrates that role-playing prompts distract LLMs and degrade their reasoning abilities in 4 out of 12 datasets, even when using GPT-4. In addition, we reveal that Jekyll & Hyde improves reasoning capabilities by selecting better choices among the potential solutions on twelve widely-used reasoning datasets. We further show that our proposed LLM evaluator outperforms other baselines, proving the LLMs' position bias is successfully mitigated.

Submitted 16 August, 2024; originally announced August 2024.

Comments: 13 pages, 4 figures

arXiv:2408.08591 [pdf, other]

Zero-Shot Dual-Path Integration Framework for Open-Vocabulary 3D Instance Segmentation

Authors: Tri Ton, Ji Woo Hong, SooHwan Eom, Jun Yeop Shim, Junyeong Kim, Chang D. Yoo

Abstract: Open-vocabulary 3D instance segmentation transcends traditional closed-vocabulary methods by enabling the identification of both previously seen and unseen objects in real-world scenarios. It leverages a dual-modality approach, utilizing both 3D point clouds and 2D multi-view images to generate class-agnostic object mask proposals. Previous efforts predominantly focused on enhancing 3D mask proposal models; consequently, the information that could come from 2D association to 3D was not fully exploited. This bias towards 3D data, while effective for familiar indoor objects, limits the system's adaptability to new and varied object types, where 2D models offer greater utility. Addressing this gap, we introduce Zero-Shot Dual-Path Integration Framework that equally values the contributions of both 3D and 2D modalities. Our framework comprises three components: 3D pathway, 2D pathway, and Dual-Path Integration. 3D pathway generates spatially accurate class-agnostic mask proposals of common indoor objects from 3D point cloud data using a pre-trained 3D model, while 2D pathway utilizes pre-trained open-vocabulary instance segmentation model to identify a diverse array of object proposals from multi-view RGB-D images. In Dual-Path Integration, our Conditional Integration process, which operates in two stages, filters and merges the proposals from both pathways adaptively. This process harmonizes output proposals to enhance segmentation capabilities. Our framework, utilizing pre-trained models in a zero-shot manner, is model-agnostic and demonstrates superior performance on both seen and unseen data, as evidenced by comprehensive evaluations on the ScanNet200 and qualitative results on ARKitScenes datasets.

Submitted 16 August, 2024; originally announced August 2024.

Comments: OpenSUN 3D: 2nd Workshop on Open-Vocabulary 3D Scene Understanding (CVPR 2024)

arXiv:2408.08430 [pdf, other]

Random Gradient Masking as a Defensive Measure to Deep Leakage in Federated Learning

Authors: Joon Kim, Sejin Park

Abstract: Federated Learning(FL), in theory, preserves privacy of individual clients' data while producing quality machine learning models. However, attacks such as Deep Leakage from Gradients(DLG) severely question the practicality of FL. In this paper, we empirically evaluate the efficacy of four defensive methods against DLG: Masking, Clipping, Pruning, and Noising. Masking, while only previously studied as a way to compress information during parameter transfer, shows surprisingly robust defensive utility when compared to the other three established methods. Our experimentation is two-fold. We first evaluate the minimum hyperparameter threshold for each method across MNIST, CIFAR-10, and lfw datasets. Then, we train FL clients with each method and their minimum threshold values to investigate the trade-off between DLG defense and training performance. Results reveal that Masking and Clipping show near to none degradation in performance while obfuscating enough information to effectively defend against DLG.
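
As a rough illustration of the Masking defense named in this abstract, the sketch below zeroes a randomly chosen fraction of a gradient's entries before it would be shared. The abstract does not specify the exact masking procedure, so the function `random_gradient_mask`, the flat-list gradient representation, and the `mask_ratio` hyperparameter are all assumptions made for illustration.

```python
import random


def random_gradient_mask(gradient, mask_ratio, seed=None):
    """Zero out a randomly selected fraction of gradient entries.

    `gradient` is a flat list of floats; `mask_ratio` is the fraction of
    entries to hide (an assumed hyperparameter, not taken from the paper).
    """
    if not 0.0 <= mask_ratio <= 1.0:
        raise ValueError("mask_ratio must be between 0 and 1")
    rng = random.Random(seed)
    n = len(gradient)
    n_masked = round(mask_ratio * n)
    masked_idx = set(rng.sample(range(n), n_masked))
    # Masked positions are replaced by 0.0; all other entries pass through.
    return [0.0 if i in masked_idx else g for i, g in enumerate(gradient)]


# Hypothetical client gradient with ten non-zero entries.
grad = [float(i + 1) for i in range(10)]
masked = random_gradient_mask(grad, mask_ratio=0.4, seed=0)
```

Which entries are hidden changes with the seed, so an attacker reconstructing inputs from shared gradients sees a different incomplete view each round, while the unmasked entries still carry the training signal.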

Submitted 15 August, 2024; originally announced August 2024.

Comments: 13 pages, 5 figures, to be submitted to Applied Intelligence

arXiv:2408.07900 [pdf, other]

Network analysis reveals news press landscape and asymmetric user polarization

Authors: Byunghwee Lee, Hyo-sun Ryu, Jae Kook Lee, Hawoong Jeong, Beom Jun Kim

Abstract: Unlike traditional media, online news platforms allow users to consume content that suits their tastes and to facilitate interactions with other people. However, as more personalized consumption of information and interaction with like-minded users increase, ideological bias can inadvertently increase and contribute to the formation of echo chambers, reinforcing the polarization of opinions. Although the structural characteristics of polarization among different ideological groups in online spaces have been extensively studied, research into how these groups emotionally interact with each other has not been as thoroughly explored. From this perspective, we investigate both structural and affective polarization between news media user groups on Naver News, South Korea's largest online news portal, during the period of 2022 Korean presidential election. By utilizing the dataset comprising 333,014 articles and over 36 million user comments, we uncover two distinct groups of users characterized by opposing political leanings and reveal significant bias and polarization among them. Additionally, we reveal the existence of echo chambers within co-commenting networks and investigate the asymmetric affective interaction patterns between the two polarized groups. Classification task of news media articles based on the distinct comment response patterns support the notion that different political groups may employ distinct communication strategies. Our approach based on network analysis on large-scale comment dataset offers novel insights into characteristics of user polarization in the online news platforms and the nuanced interaction nature between user groups.

Submitted 14 August, 2024; originally announced August 2024.

Comments: 21 pages, 6 figures

arXiv:2408.07757 [pdf, other]

Inverse k-visibility for RSSI-based Indoor Geometric Mapping

Authors: Junseo Kim, Matthew Lisondra, Yeganeh Bahoo, Sajad Saeedi

Abstract: In recent years, the increased availability of WiFi in indoor environments has gained an interest in the robotics community to leverage WiFi signals for enhancing indoor SLAM (Simultaneous Localization and Mapping) systems. SLAM technology is widely used, especially for the navigation and control of autonomous robots. This paper discusses various works in developing WiFi-based localization and challenges in achieving high-accuracy geometric maps. This paper introduces the concept of inverse k-visibility developed from the k-visibility algorithm to identify the free space in an unknown environment for planning, navigation, and obstacle avoidance. Comprehensive experiments, including those utilizing single and multiple RSSI signals, were conducted in both simulated and real-world environments to demonstrate the robustness of the proposed algorithm. Additionally, a detailed analysis comparing the resulting maps with ground-truth Lidar-based maps is provided to highlight the algorithm's accuracy and reliability.

Submitted 14 August, 2024; originally announced August 2024.

Comments: This work has been submitted to the IEEE Sensors Journal for possible publication

arXiv:2408.07326 [pdf, other]

LPU: A Latency-Optimized and Highly Scalable Processor for Large Language Model Inference

Authors: Seungjae Moon, Jung-Hoon Kim, Junsoo Kim, Seongmin Hong, Junseo Cha, Minsu Kim, Sukbin Lim, Gyubin Choi, Dongjin Seo, Jongho Kim, Hunjong Lee, Hyunjun Park, Ryeowook Ko, Soongyu Choi, Jongse Park, Jinwon Lee, Joo-Young Kim

Abstract: The explosive arrival of OpenAI's ChatGPT has fueled the globalization of large language model (LLM), which consists of billions of pretrained parameters that embodies the aspects of syntax and semantics. HyperAccel introduces latency processing unit (LPU), a latency-optimized and highly scalable processor architecture for the acceleration of LLM inference. LPU perfectly balances the memory bandwidth and compute logic with streamlined dataflow to maximize performance and efficiency. LPU is equipped with expandable synchronization link (ESL) that hides data synchronization latency between multiple LPUs. HyperDex complements LPU as an intuitive software framework to run LLM applications. LPU achieves 1.25 ms/token and 20.9 ms/token for 1.3B and 66B model, respectively, which is 2.09x and 1.37x faster than the GPU. LPU, synthesized using Samsung 4nm process, has total area of 0.824 mm² and power consumption of 284.31 mW. LPU-based servers achieve 1.33x and 1.32x energy efficiency over NVIDIA H100 and L4 servers, respectively.

Submitted 14 August, 2024; originally announced August 2024.

arXiv:2408.07233 [pdf]

Pan-cancer gene set discovery via scRNA-seq for optimal deep learning based downstream tasks

Authors: Jong Hyun Kim, Jongseong Jang

Abstract: The application of machine learning to transcriptomics data has led to significant advances in cancer research. However, the high dimensionality and complexity of RNA sequencing (RNA-seq) data pose significant challenges in pan-cancer studies. This study hypothesizes that gene sets derived from single-cell RNA sequencing (scRNA-seq) data will outperform those selected using bulk RNA-seq in pan-cancer downstream tasks. We analyzed scRNA-seq data from 181 tumor biopsies across 13 cancer types. High-dimensional weighted gene co-expression network analysis (hdWGCNA) was performed to identify relevant gene sets, which were further refined using XGBoost for feature selection. These gene sets were applied to downstream tasks using TCGA pan-cancer RNA-seq data and compared to six reference gene sets and oncogenes from OncoKB evaluated with deep learning models, including multilayer perceptrons (MLPs) and graph neural networks (GNNs). The XGBoost-refined hdWGCNA gene set demonstrated higher performance in most tasks, including tumor mutation burden assessment, microsatellite instability classification, mutation prediction, cancer subtyping, and grading. In particular, genes such as DPM1, BAD, and FKBP4 emerged as important pan-cancer biomarkers, with DPM1 consistently significant across tasks. This study presents a robust approach for feature selection in cancer genomics by integrating scRNA-seq data and advanced analysis techniques, offering a promising avenue for improving predictive accuracy in cancer research.

Submitted 13 August, 2024; originally announced August 2024.

Comments: 16 pages, 3 figures, 1 tables, and 6 supplementary Table

arXiv:2408.06276 [pdf, other]

Review-driven Personalized Preference Reasoning with Large Language Models for Recommendation

Authors: Jieyong Kim, Hyunseo Kim, Hyunjin Cho, SeongKu Kang, Buru Chang, Jinyoung Yeo, Dongha Lee

Abstract: Recent advancements in Large Language Models (LLMs) have demonstrated exceptional performance across a wide range of tasks, generating significant interest in their application to recommendation systems. However, existing methods have not fully capitalized on the potential of LLMs, often constrained by limited input information or failing to fully utilize their advanced reasoning capabilities. To address these limitations, we introduce EXP3RT, a novel LLM-based recommender designed to leverage rich preference information contained in user and item reviews. EXP3RT is basically fine-tuned through distillation from a teacher LLM to perform three key tasks in order: EXP3RT first extracts and encapsulates essential subjective preferences from raw reviews, aggregates and summarizes them according to specific criteria to create user and item profiles. It then generates detailed step-by-step reasoning followed by predicted rating, i.e., reasoning-enhanced rating prediction, by considering both subjective and objective information from user/item profiles and item descriptions. This personalized preference reasoning from EXP3RT enhances rating prediction accuracy and also provides faithful and reasonable explanations for recommendation. Extensive experiments show that EXP3RT outperforms existing methods on both rating prediction and candidate item reranking for top-k recommendation, while significantly enhancing the explainability of recommendation systems.

Submitted 13 August, 2024; v1 submitted 12 August, 2024; originally announced August 2024.

arXiv:2408.06167 [pdf, other]

doi 10.1145/3627673.3680017

Blind-Match: Efficient Homomorphic Encryption-Based 1:N Matching for Privacy-Preserving Biometric Identification

Authors: Hyunmin Choi, Jiwon Kim, Chiyoung Song, Simon S. Woo, Hyoungshick Kim

Abstract: We present Blind-Match, a novel biometric identification system that leverages homomorphic encryption (HE) for efficient and privacy-preserving 1:N matching. Blind-Match introduces a HE-optimized cosine similarity computation method, where the key idea is to divide the feature vector into smaller parts for processing rather than computing the entire vector at once. By optimizing the number of these parts, Blind-Match minimizes execution time while ensuring data privacy through HE. Blind-Match achieves superior performance compared to state-of-the-art methods across various biometric datasets. On the LFW face dataset, Blind-Match attains a 99.63% Rank-1 accuracy with a 128-dimensional feature vector, demonstrating its robustness in face recognition tasks. For fingerprint identification, Blind-Match achieves a remarkable 99.55% Rank-1 accuracy on the PolyU dataset, even with a compact 16-dimensional feature vector, significantly outperforming the state-of-the-art method, Blind-Touch, which achieves only 59.17%. Furthermore, Blind-Match showcases practical efficiency in large-scale biometric identification scenarios, such as Naver Cloud's FaceSign, by processing 6,144 biometric samples in 0.74 seconds using a 128-dimensional feature vector.

Submitted 12 August, 2024; originally announced August 2024.

Comments: Accepted to CIKM 2024 (Applied Research Track)
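
The chunking idea in the Blind-Match abstract can be illustrated in plaintext. The sketch below is not the authors' method and involves no homomorphic encryption; it only shows that a cosine similarity over normalized vectors equals the sum of partial dot products over equal-size parts. The function names and the `num_parts` parameter are hypothetical.

```python
import math


def normalize(v):
    """Scale a vector to unit length so the dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def chunked_dot(a, b, num_parts):
    """Dot product computed as a sum of partial dot products over equal-size parts."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    if num_parts <= 0 or len(a) % num_parts != 0:
        raise ValueError("vector length must be divisible by num_parts")
    size = len(a) // num_parts
    total = 0.0
    for p in range(num_parts):
        start = p * size
        part_a = a[start:start + size]
        part_b = b[start:start + size]
        # Each part is processed on its own; the partial results are then added.
        total += sum(x * y for x, y in zip(part_a, part_b))
    return total


def cosine_similarity_chunked(a, b, num_parts):
    """Cosine similarity of a and b, evaluated part by part."""
    return chunked_dot(normalize(a), normalize(b), num_parts)
```

The result does not depend on `num_parts`; in an encrypted setting the number of parts would presumably affect only the running time, which is the quantity the abstract says is optimized.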

arXiv:2408.06010 [pdf, other]

DEEPTalk: Dynamic Emotion Embedding for Probabilistic Speech-Driven 3D Face Animation

Authors: Jisoo Kim, Jungbin Cho, Joonho Park, Soonmin Hwang, Da Eun Kim, Geon Kim, Youngjae Yu

Abstract: Speech-driven 3D facial animation has garnered lots of attention thanks to its broad range of applications. Despite recent advancements in achieving realistic lip motion, current methods fail to capture the nuanced emotional undertones conveyed through speech and produce monotonous facial motion. These limitations result in blunt and repetitive facial animations, reducing user engagement and hindering their applicability. To address these challenges, we introduce DEEPTalk, a novel approach that generates diverse and emotionally rich 3D facial expressions directly from speech inputs. To achieve this, we first train DEE (Dynamic Emotion Embedding), which employs probabilistic contrastive learning to forge a joint emotion embedding space for both speech and facial motion. This probabilistic framework captures the uncertainty in interpreting emotions from speech and facial motion, enabling the derivation of emotion vectors from its multifaceted space. Moreover, to generate dynamic facial motion, we design TH-VQVAE (Temporally Hierarchical VQ-VAE) as an expressive and robust motion prior overcoming limitations of VAEs and VQ-VAEs. Utilizing these strong priors, we develop DEEPTalk, a talking head generator that non-autoregressively predicts codebook indices to create dynamic facial motion, incorporating a novel emotion consistency loss. Extensive experiments on various datasets demonstrate the effectiveness of our approach in creating diverse, emotionally expressive talking faces that maintain accurate lip-sync. Source code will be made publicly available soon.

Submitted 12 August, 2024; originally announced August 2024.

Comments: First two authors contributed equally

arXiv:2408.05955 [pdf, other]

Probabilistic Vision-Language Representation for Weakly Supervised Temporal Action Localization

Authors: Geuntaek Lim, Hyunwoo Kim, Joonsoo Kim, Yukyung Choi

Abstract: Weakly supervised temporal action localization (WTAL) aims to detect action instances in untrimmed videos using only video-level annotations. Since many existing works optimize WTAL models based on action classification labels, they encounter the task discrepancy problem (i.e., localization-by-classification). To tackle this issue, recent studies have attempted to utilize action category names as auxiliary semantic knowledge through vision-language pre-training (VLP). However, there are still areas where existing research falls short. Previous approaches primarily focused on leveraging textual information from language models but overlooked the alignment of dynamic human action and VLP knowledge in a joint space. Furthermore, the deterministic representation employed in previous studies struggles to capture fine-grained human motions. To address these problems, we propose a novel framework that aligns human action knowledge and VLP knowledge in a probabilistic embedding space. Moreover, we propose intra- and inter-distribution contrastive learning to enhance the probabilistic embedding space based on statistical similarities. Extensive experiments and ablation studies reveal that our method significantly outperforms all previous state-of-the-art methods. Code is available at https://github.com/sejong-rcv/PVLR.

Submitted 12 August, 2024; originally announced August 2024.

Comments: Accepted to ACM MM 2024

arXiv:2408.04990 [pdf, ps, other]

Stochastic Geometry Analysis of RIS-Assisted Cellular Networks with Reflective Intelligent Surfaces on Roads

Authors: Chang-Sik Choi, Junhyeong Kim, Junil Choi

Abstract: Reconfigurable intelligent surfaces (RISs) provide alternative routes for reflected signals to network users, offering numerous applications. This paper explores an innovative approach of strategically deploying RISs along road areas to leverage various propagation and blockage conditions present in cellular networks with roads. To address the local network geometries shown by such networks, we use a stochastic geometry framework, specifically the Cox point processes, to model the locations of RISs and vehicle users. Then, we define the coverage probability as the chance that either a base station or an RIS is in line of sight (LOS) of the typical user and that the LOS signal has a signal-to-noise ratio (SNR) greater than a threshold. We derive the coverage probability as a function of key parameters such as RIS density and path loss exponent. We observe that the network geometry highly affects the coverage and that the proposed RIS deployment effectively leverages the underlying difference of attenuation and blockage, significantly increasing the coverage of vehicle users in the network. With experimental results addressing the impact of key variables to network performance, this work serves as a versatile tool for designing, analyzing, and optimizing RIS-assisted cellular networks with many vehicles.

Submitted 9 August, 2024; originally announced August 2024.

Comments: accepted to IEEE Transactions on Communications

arXiv:2408.04874 [pdf, other]

DG Comics: Semi-Automatically Authoring Graph Comics for Dynamic Graphs

Authors: Joohee Kim, Hyunwook Lee, Duc M. Nguyen, Minjeong Shin, Bum Chul Kwon, Sungahn Ko, Niklas Elmqvist

Abstract: Comics are an effective method for sequential data-driven storytelling, especially for dynamic graphs -- graphs whose vertices and edges change over time. However, manually creating such comics is currently time-consuming, complex, and error-prone. In this paper, we propose DG Comics, a novel comic authoring tool for dynamic graphs that allows users to semi-automatically build and annotate comics. The tool uses a newly developed hierarchical clustering algorithm to segment consecutive snapshots of dynamic graphs while preserving their chronological order. It also presents rich information on both individuals and communities extracted from dynamic graphs in multiple views, where users can explore dynamic graphs and choose what to tell in comics. For evaluation, we provide an example and report the results of a user study and an expert review.

Submitted 9 August, 2024; originally announced August 2024.

Comments: To appear in IEEE Transactions on Visualization and Computer Graphics

arXiv:2408.04693 [pdf, other]

Understanding the Performance and Estimating the Cost of LLM Fine-Tuning

Authors: Yuchen Xia, Jiho Kim, Yuhan Chen, Haojie Ye, Souvik Kundu, Cong Hao, Nishil Talati

Abstract: Due to the cost-prohibitive nature of training Large Language Models (LLMs), fine-tuning has emerged as an attractive alternative for specializing LLMs for specific tasks using limited compute resources in a cost-effective manner. In this paper, we characterize sparse Mixture of Experts (MoE) based LLM fine-tuning to understand their accuracy and runtime performance on a single GPU. Our evaluation provides unique insights into the training efficacy of sparse and dense versions of MoE models, as well as their runtime characteristics, including maximum batch size, execution time breakdown, end-to-end throughput, GPU hardware utilization, and load distribution. Our study identifies the optimization of the MoE layer as crucial for further improving the performance of LLM fine-tuning. Using our profiling results, we also develop and validate an analytical model to estimate the cost of LLM fine-tuning on the cloud. This model, based on parameters of the model and GPU architecture, estimates LLM throughput and the cost of training, aiding practitioners in industry and academia to budget the cost of fine-tuning a specific model.

Submitted 8 August, 2024; originally announced August 2024.

Comments: 10 pages, conference

arXiv:2408.04266 [pdf, other]

BPMP-Tracker: A Versatile Aerial Target Tracker Using Bernstein Polynomial Motion Primitives

Authors: Yunwoo Lee, Jungwon Park, Boseong Jeon, Seungwoo Jung, H. Jin Kim

Abstract: This letter presents a versatile trajectory planning pipeline for aerial tracking. The proposed tracker is capable of handling various chasing settings such as complex unstructured environments, crowded dynamic obstacles and multiple-target following. Among the entire pipeline, we focus on developing a predictor for future target motion and a chasing trajectory planner. For rapid computation, we employ the sample-check-select strategy: modules sample a set of candidate movements, check multiple constraints, and then select the best trajectory. Also, we leverage the properties of Bernstein polynomials for quick calculations. The prediction module predicts the trajectories of the targets, which do not overlap with static and dynamic obstacles. Then the trajectory planner outputs a trajectory, ensuring various conditions such as occlusion and collision avoidance, the visibility of all targets within a camera image and dynamical limits. We fully test the proposed tracker in simulations and hardware experiments under challenging scenarios, including dual-target following, environments with dozens of dynamic obstacles and complex indoor and outdoor spaces.

Submitted 8 August, 2024; originally announced August 2024.

Comments: 8 pages, 9 figures
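
The sample-check-select strategy named in the BPMP-Tracker abstract can be sketched in one dimension. This is an illustrative toy, not the authors' planner: the bounds, the cost (distance of the final control point to a goal), and all function names are assumptions. The one property it relies on is standard: a polynomial in Bernstein form stays within the range of its coefficients on [0, 1], so a bound check on the coefficients alone is a sufficient check on the whole curve.

```python
import random
from math import comb


def bernstein_eval(coeffs, t):
    """Evaluate a polynomial given by Bernstein coefficients at t in [0, 1]."""
    n = len(coeffs) - 1
    return sum(
        comb(n, i) * (t ** i) * ((1 - t) ** (n - i)) * c
        for i, c in enumerate(coeffs)
    )


def sample_check_select(start, goal, lo, hi, degree=3, num_samples=50, seed=0):
    """Sample candidate curves, keep those inside [lo, hi], return the cheapest."""
    rng = random.Random(seed)
    best, best_cost = None, float("inf")
    for _ in range(num_samples):
        # Sample: the first coefficient is pinned to the start position,
        # the remaining ones are drawn at random (hypothetical sampling scheme).
        coeffs = [start] + [rng.uniform(lo - 1.0, hi + 1.0) for _ in range(degree)]
        # Check: all coefficients inside the bounds implies the curve is inside.
        if not all(lo <= c <= hi for c in coeffs):
            continue
        # Select: keep the candidate whose end point is closest to the goal.
        cost = abs(coeffs[-1] - goal)
        if cost < best_cost:
            best, best_cost = coeffs, cost
    return best
```

Checking coefficients instead of densely sampling the curve is what makes the check step cheap; the check is conservative, so some feasible candidates may be rejected.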

arXiv:2408.03612 [pdf, other]

JARViS: Detecting Actions in Video Using Unified Actor-Scene Context Relation Modeling

Authors: Seok Hwan Lee, Taein Son, Soo Won Seo, Jisong Kim, Jun Won Choi

Abstract: Video action detection (VAD) is a formidable vision task that involves the localization and classification of actions within the spatial and temporal dimensions of a video clip. Among the myriad VAD architectures, two-stage VAD methods utilize a pre-trained person detector to extract the region of interest features, subsequently employing these features for action detection. However, the performance of two-stage VAD methods has been limited as they depend solely on localized actor features to infer action semantics. In this study, we propose a new two-stage VAD framework called Joint Actor-scene context Relation modeling based on Visual Semantics (JARViS), which effectively consolidates cross-modal action semantics distributed globally across spatial and temporal dimensions using Transformer attention. JARViS employs a person detector to produce densely sampled actor features from a keyframe. Concurrently, it uses a video backbone to create spatio-temporal scene features from a video clip. Finally, the fine-grained interactions between actors and scenes are modeled through a Unified Action-Scene Context Transformer to directly output the final set of actions in parallel. Our experimental results demonstrate that JARViS outperforms existing methods by significant margins and achieves state-of-the-art performance on three popular VAD datasets, including AVA, UCF101-24, and JHMDB51-21.

Submitted 7 August, 2024; originally announced August 2024.

Comments: 31 pages, 10 figures

arXiv:2408.03551 [pdf, other]

VPOcc: Exploiting Vanishing Point for Monocular 3D Semantic Occupancy Prediction

Authors: Junsu Kim, Junhee Lee, Ukcheol Shin, Jean Oh, Kyungdon Joo

Abstract: Monocular 3D semantic occupancy prediction is becoming important in robot vision due to the compactness of using a single RGB camera. However, existing methods often do not adequately account for camera perspective geometry, resulting in information imbalance along the depth range of the image. To address this issue, we propose a vanishing point (VP) guided monocular 3D semantic occupancy prediction framework named VPOcc. Our framework consists of three novel modules utilizing VP. First, in the VPZoomer module, we initially utilize VP in feature extraction to achieve information balanced feature extraction across the scene by generating a zoom-in image based on VP. Second, we perform perspective geometry-aware feature aggregation by sampling points towards VP using a VP-guided cross-attention (VPCA) module. Finally, we create an information-balanced feature volume by effectively fusing original and zoom-in voxel feature volumes with a balanced feature volume fusion (BVFV) module. Experiments demonstrate that our method achieves state-of-the-art performance for both IoU and mIoU on SemanticKITTI and SSCBench-KITTI360. These results are obtained by effectively addressing the information imbalance in images through the utilization of VP. Our code will be available at www.github.com/anonymous.

Submitted 7 August, 2024; originally announced August 2024.

arXiv:2408.03541 [pdf, ps, other]

EXAONE 3.0 7.8B Instruction Tuned Language Model

Authors: LG AI Research: Soyoung An, Kyunghoon Bae, Eunbi Choi, Stanley Jungkyu Choi, Yemuk Choi, Seokhee Hong, Yeonjung Hong, Junwon Hwang, Hyojin Jeon, Gerrard Jeongwon Jo, Hyunjik Jo, Jiyeon Jung, Yountae Jung, Euisoon Kim, Hyosang Kim, Joonkee Kim, Seonghwan Kim, Soyeon Kim, Sunkyoung Kim, Yireun Kim, Youchul Kim, Edward Hwayoung Lee, Haeju Lee, et al. (14 additional authors not shown)

Abstract: We introduce EXAONE 3.0 instruction-tuned language model, the first open model in the family of Large Language Models (LLMs) developed by LG AI Research. Among different model sizes, we publicly release the 7.8B instruction-tuned model to promote open research and innovations. Through extensive evaluations across a wide range of public and in-house benchmarks, EXAONE 3.0 demonstrates highly competitive real-world performance with instruction-following capability against other state-of-the-art open models of similar size. Our comparative analysis shows that EXAONE 3.0 excels particularly in Korean, while achieving compelling performance across general tasks and complex reasoning. With its strong real-world effectiveness and bilingual proficiency, we hope that EXAONE keeps contributing to advancements in Expert AI. Our EXAONE 3.0 instruction-tuned model is available at https://huggingface.co/LGAI-EXAONE/EXAONE-3.0-7.8B-Instruct

Submitted 13 August, 2024; v1 submitted 7 August, 2024; originally announced August 2024.

arXiv:2408.02888 [pdf, other]

VizECGNet: Visual ECG Image Network for Cardiovascular Diseases Classification with Multi-Modal Training and Knowledge Distillation

Authors: Ju-Hyeon Nam, Seo-Hyung Park, Su Jung Kim, Sang-Chul Lee

Abstract: An electrocardiogram (ECG) captures the heart's electrical signal to assess various heart conditions. In practice, ECG data is stored as either digitized signals or printed images. Despite the emergence of numerous deep learning models for digitized signals, many hospitals prefer image storage due to cost considerations. Recognizing the unavailability of raw ECG signals in many clinical settings, we propose VizECGNet, which uses only printed ECG graphics to determine the prognosis of multiple cardiovascular diseases. During training, cross-modal attention modules (CMAM) are used to integrate information from two modalities - image and signal, while self-modality attention modules (SMAM) capture inherent long-range dependencies in ECG data of each modality. Additionally, we utilize knowledge distillation to improve the similarity between two distinct predictions from each modality stream. This innovative multi-modal deep learning architecture enables the utilization of only ECG images during inference. VizECGNet with image input achieves higher performance in precision, recall, and F1-Score compared to signal-based ECG classification models, with improvements of 3.50%, 8.21%, and 7.38%, respectively.

Submitted 5 August, 2024; originally announced August 2024.

Comments: Accepted in International Conference on Image Processing (ICIP) 2024

arXiv:2408.02883 [pdf, other]

doi 10.1145/3686909

"Sharing, Not Showing Off": How BeReal Approaches Authentic Self-Presentation on Social Media Through Its Design

Authors: JaeWon Kim, Robert Wolfe, Ishita Chordia, Katie Davis, Alexis Hiniker

Abstract: Adolescents are particularly vulnerable to the pressures created by social media, such as heightened self-consciousness and the need for extensive self-presentation. In this study, we investigate how BeReal, a social media platform designed to counter some of these pressures, influences adolescents' self-presentation behaviors. We interviewed 29 users aged 13-18 to understand their experiences with BeReal. We found that BeReal's design focuses on spontaneous sharing, including randomly timed daily notifications and reciprocal posting, discourages staged posts, encourages careful curation of the audience, and reduces pressure on self-presentation. The space created by BeReal offers benefits such as validating an unfiltered life and reframing social comparison, but its approach to self-presentation is sometimes perceived as limited or unappealing and, at times, even toxic. Drawing on this empirical data, we distill a set of design guidelines for creating platforms that support authentic self-presentation online, such as scaffolding reciprocity and expanding beyond spontaneous photo-sharing to allow users to more accurately and comfortably portray themselves.

Submitted 5 August, 2024; originally announced August 2024.

arXiv:2408.02582 [pdf, other]

Clustering and Mining Accented Speech for Inclusive and Fair Speech Recognition

Authors: Jaeyoung Kim, Han Lu, Soheil Khorram, Anshuman Tripathi, Qian Zhang, Hasim Sak

Abstract: Modern automatic speech recognition (ASR) systems are typically trained on more than tens of thousands of hours of speech data, which is one of the main factors for their great success. However, the distribution of such data is typically biased towards common accents or typical speech patterns. As a result, those systems often perform poorly on atypical accented speech. In this paper, we present accent clustering and mining schemes for fair speech recognition systems which can perform equally well on under-represented accented speech. For accent recognition, we applied three schemes to overcome the limited size of supervised accent data: supervised or unsupervised pre-training, distributionally robust optimization (DRO) and unsupervised clustering. These three schemes can significantly improve the accent recognition model, especially for unbalanced and small accented speech. Fine-tuning ASR on the mined Indian accent speech using the proposed supervised or unsupervised clustering schemes showed 10.0% and 5.3% relative improvements compared to fine-tuning on the randomly sampled speech, respectively.

Submitted 5 August, 2024; originally announced August 2024.

arXiv:2408.01585 [pdf, other]

OpenLogParser: Unsupervised Parsing with Open-Source Large Language Models

Authors: Zeyang Ma, Dong Jae Kim, Tse-Hsun Chen

Abstract: Log parsing is a critical step that transforms unstructured log data into structured formats, facilitating subsequent log-based analysis. Traditional syntax-based log parsers are efficient and effective, but they often experience decreased accuracy when processing logs that deviate from the predefined rules. Recently, large language model (LLM) based log parsers have shown superior parsing accuracy. However, existing LLM-based parsers face three main challenges: 1) time-consuming and labor-intensive manual labeling for fine-tuning or in-context learning, 2) increased parsing costs due to the vast volume of log data and limited context size of LLMs, and 3) privacy risks from using commercial models like ChatGPT with sensitive log information. To overcome these limitations, this paper introduces OpenLogParser, an unsupervised log parsing approach that leverages open-source LLMs (i.e., Llama3-8B) to enhance privacy and reduce operational costs while achieving state-of-the-art parsing accuracy. OpenLogParser first groups logs with similar static text but varying dynamic variables using a fixed-depth grouping tree. It then parses logs within these groups using three components: i) similarity scoring-based retrieval augmented generation: selects diverse logs within each group based on Jaccard similarity, helping the LLM distinguish between static text and dynamic variables; ii) self-reflection: iteratively queries LLMs to refine log templates to improve parsing accuracy; and iii) log template memory: stores parsed templates to reduce LLM queries for improved parsing efficiency. Our evaluation on LogHub-2.0 shows that OpenLogParser achieves 25% higher parsing accuracy and processes logs 2.7 times faster compared to state-of-the-art LLM-based parsers. In short, OpenLogParser addresses privacy and cost concerns of using commercial LLMs while achieving state-of-the-art parsing efficiency and accuracy.

Submitted 2 August, 2024; originally announced August 2024.
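
The OpenLogParser abstract mentions selecting diverse logs within a group based on Jaccard similarity. A minimal sketch of that idea follows; the whitespace tokenization and the greedy "least similar to anything already chosen" rule are assumptions made for illustration, not the procedure from the paper, and the function names are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two log lines, treated as sets of whitespace tokens."""
    sa, sb = set(a.split()), set(b.split())
    if not sa and not sb:
        return 1.0  # two empty lines are considered identical
    return len(sa & sb) / len(sa | sb)


def select_diverse(logs, k):
    """Greedily pick up to k logs that are mutually dissimilar (assumed heuristic)."""
    if not logs or k <= 0:
        return []
    selected = [logs[0]]
    remaining = list(logs[1:])
    while remaining and len(selected) < k:
        # Choose the candidate whose closest already-selected log is least similar.
        best = min(remaining, key=lambda c: max(jaccard(c, s) for s in selected))
        selected.append(best)
        remaining.remove(best)
    return selected
```

Showing an LLM logs that differ in many token positions, rather than near-duplicates, is one plausible way to help it separate static text from dynamic variables, which is the purpose the abstract gives for this component.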

arXiv:2408.01446 [pdf, other]

Estimating Environmental Cost Throughout Model's Adaptive Life Cycle

Authors: Vishwesh Sangarya, Richard Bradford, Jung-Eun Kim

Abstract: With the rapid increase in the research, development, and application of neural networks in the current era, there is a proportional increase in the energy needed to train and use models. Crucially, this is accompanied by the increase in carbon emissions into the environment. A sustainable and socially beneficial approach to reducing the carbon footprint and rising energy demands associated with the modern age of AI/deep learning is the adaptive and continuous reuse of models with regard to changes in the environment of model deployment or variations/changes in the input data. In this paper, we propose PreIndex, a predictive index to estimate the environmental and compute resources associated with model retraining to distributional shifts in data. PreIndex can be used to estimate environmental costs such as carbon emissions and energy usage when retraining from current data distribution to new data distribution. It also correlates with and can be used to estimate other resource indicators associated with deep learning, such as epochs, gradient norm, and magnitude of model parameter change. PreIndex requires only one forward pass of the data, following which it provides a single concise value to estimate resources associated with retraining to the new distribution shifted data. We show that PreIndex can be reliably used across various datasets, model architectures, different types, and intensities of distribution shifts. Thus, PreIndex enables users to make informed decisions for retraining to different distribution shifts and determine the most cost-effective and sustainable option, allowing for the reuse of a model with a much smaller footprint in the environment. The code for this work is available here: https://github.com/JEKimLab/AIES2024PreIndex

Submitted 22 July, 2024; originally announced August 2024.

Comments: Accepted in the AAAI/ACM Conference on Artificial Intelligence, Ethics, and Society, 2024

arXiv:2408.01292 [pdf]

3DPX: Progressive 2D-to-3D Oral Image Reconstruction with Hybrid MLP-CNN Networks

Authors: Xiaoshuang Li, Mingyuan Meng, Zimo Huang, Lei Bi, Eduardo Delamare, Dagan Feng, Bin Sheng, Jinman Kim

Abstract: Panoramic X-ray (PX) is a prevalent modality in dental practice for its wide availability and low cost. However, as a 2D projection image, PX does not contain 3D anatomical information, and therefore has limited use in dental applications that can benefit from 3D information, e.g., tooth angular misalignment detection and classification. Reconstructing 3D structures directly from 2D PX has recently been explored to address limitations with existing methods primarily reliant on Convolutional Neural Networks (CNNs) for direct 2D-to-3D mapping. These methods, however, are unable to correctly infer depth-axis spatial information. In addition, they are limited by the intrinsic locality of convolution operations, as the convolution kernels only capture the information of immediate neighborhood pixels. In this study, we propose a progressive hybrid Multilayer Perceptron (MLP)-CNN pyramid network (3DPX) for 2D-to-3D oral PX reconstruction. We introduce a progressive reconstruction strategy, where 3D images are progressively reconstructed in the 3DPX with guidance imposed on the intermediate reconstruction result at each pyramid level. Further, motivated by the recent advancement of MLPs that show promise in capturing fine-grained long-range dependency, our 3DPX integrates MLPs and CNNs to improve the semantic understanding during reconstruction. Extensive experiments on two large datasets involving 464 studies demonstrate that our 3DPX outperforms state-of-the-art 2D-to-3D oral reconstruction methods, including standalone MLP and transformers, in reconstruction quality, and also improves the performance of downstream angular misalignment classification tasks.

Submitted 2 August, 2024; originally announced August 2024.

Comments: accepted by MICCAI 2024

arXiv:2408.01084 [pdf, other]

Adaptive Contrastive Decoding in Retrieval-Augmented Generation for Handling Noisy Contexts

Authors: Youna Kim, Hyuhng Joon Kim, Cheonbok Park, Choonghyun Park, Hyunsoo Cho, Junyeob Kim, Kang Min Yoo, Sang-goo Lee, Taeuk Kim

Abstract: When using large language models (LLMs) in knowledge-intensive tasks, such as open-domain question answering, external context can bridge a gap between external knowledge and LLM's parametric knowledge. Recent research has been developed to amplify contextual knowledge over the parametric knowledge of LLM with contrastive decoding approaches. While these approaches could yield truthful responses when relevant context is provided, they are prone to vulnerabilities when faced with noisy contexts. We extend the scope of previous studies to encompass noisy contexts and propose adaptive contrastive decoding (ACD) to leverage contextual influence effectively. ACD demonstrates improvements in open-domain question answering tasks compared to baselines, especially in robustness by remaining undistracted by noisy contexts in retrieval-augmented generation.

Submitted 2 August, 2024; originally announced August 2024.

arXiv:2408.00994 [pdf, other]

ArchCode: Incorporating Software Requirements in Code Generation with Large Language Models

Authors: Hojae Han, Jaejin Kim, Jaeseok Yoo, Youngwon Lee, Seung-won Hwang

Abstract: This paper aims to extend the code generation capability of large language models (LLMs) to automatically manage comprehensive software requirements from given textual descriptions. Such requirements include both functional (i.e., achieving expected behavior for inputs) and non-functional (e.g., time/space performance, robustness, maintainability) requirements. However, textual descriptions can either express requirements verbosely or may even omit some of them. We introduce ARCHCODE, a novel framework that leverages in-context learning to organize requirements observed in descriptions and to extrapolate unexpressed requirements from them. ARCHCODE generates requirements from given descriptions, conditioning them to produce code snippets and test cases. Each test case is tailored to one of the requirements, allowing for the ranking of code snippets based on the compliance of their execution results with the requirements. Public benchmarks show that ARCHCODE better satisfies functional requirements, significantly improving Pass@k scores. Furthermore, we introduce HumanEval-NFR, the first evaluation of LLMs' non-functional requirements in code generation, demonstrating ARCHCODE's superiority over baseline methods. The implementation of ARCHCODE and the HumanEval-NFR benchmark are both publicly accessible.

Submitted 1 August, 2024; originally announced August 2024.

Comments: Accepted by ACL 2024 main conference

arXiv:2408.00380 [pdf, other]

Enhancing Whole Slide Pathology Foundation Models through Stain Normalization

Authors: Juseung Yun, Yi Hu, Jinhyung Kim, Jongseong Jang, Soonyoung Lee

Abstract: Recent advancements in digital pathology have led to the development of numerous foundational models that utilize self-supervised learning on patches extracted from gigapixel whole slide images (WSIs). While this approach leverages vast amounts of unlabeled data, we have discovered a significant issue: features extracted from these self-supervised models tend to cluster by individual WSIs, a phenomenon we term WSI-specific feature collapse. This problem can potentially limit the model's generalization ability and performance on various downstream tasks. To address this issue, we introduce Stain Normalized Pathology Foundational Model, a novel foundational model trained on patches that have undergone stain normalization. Stain normalization helps reduce color variability arising from different laboratories and scanners, enabling the model to learn more consistent features. Stain Normalized Pathology Foundational Model is trained using 285,153,903 patches extracted from a total of 34,795 WSIs, combining data from The Cancer Genome Atlas (TCGA) and the Genotype-Tissue Expression (GTEx) project. Our experiments demonstrate that Stain Normalized Pathology Foundational Model significantly mitigates the feature collapse problem, indicating that the model has learned more generalized features rather than overfitting to individual WSI characteristics. We compared Stain Normalized Pathology Foundational Model with state-of-the-art models across six downstream task datasets, and our results show that Stain Normalized Pathology Foundational Model achieves excellent performance relative to the number of WSIs used and the model's parameter count. This suggests that the application of stain normalization has substantially improved the model's efficiency and generalization capabilities.

Submitted 4 August, 2024; v1 submitted 1 August, 2024; originally announced August 2024.

Comments: 13 pages, 8 figures

arXiv:2408.00351 [pdf, other]

Hierarchically Structured Neural Bones for Reconstructing Animatable Objects from Casual Videos

Authors: Subin Jeon, In Cho, Minsu Kim, Woong Oh Cho, Seon Joo Kim

Abstract: We propose a new framework for creating and easily manipulating 3D models of arbitrary objects using casually captured videos. Our core ingredient is a novel hierarchy deformation model, which captures motions of objects with tree-structured bones. Our hierarchy system decomposes motions based on the granularity and reveals the correlations between parts without exploiting any prior structural knowledge. We further propose to regularize the bones to be positioned at the basis of motions, centers of parts, sufficiently covering related surfaces of the part. This is achieved by our bone occupancy function, which identifies whether a given 3D point is placed within the bone. Coupling the proposed components, our framework offers several clear advantages: (1) users can obtain animatable 3D models of arbitrary objects with improved quality from their casual videos, (2) users can manipulate 3D models in an intuitive manner with minimal costs, and (3) users can interactively add or delete control points as necessary. The experimental results demonstrate the efficacy of our framework on diverse instances, in reconstruction quality, interpretability and easier manipulation. Our code is available at https://github.com/subin6/HSNB.

Submitted 1 August, 2024; originally announced August 2024.

Comments: ECCV 2024 accepted

arXiv:2408.00347 [pdf, other]

Advancing Medical Image Segmentation: Morphology-Driven Learning with Diffusion Transformer

Authors: Sungmin Kang, Jaeha Song, Jihie Kim

Abstract: Understanding the morphological structure of medical images and precisely segmenting the region of interest or abnormality is an important task that can assist in diagnosis. However, the unique properties of medical imaging make clear segmentation difficult, and the high cost and time-consuming task of labeling leads to a coarse-grained representation of ground truth. Faced with these problems, we propose a novel Diffusion Transformer Segmentation (DTS) model for robust segmentation in the presence of noise. We propose an alternative to the dominant Denoising U-Net encoder through experiments applying a transformer architecture, which captures global dependency through self-attention. Additionally, we propose k-neighbor label smoothing, reverse boundary attention, and self-supervised learning with morphology-driven learning to improve the ability to identify complex structures. Our model, which analyzes the morphological representation of images, shows better results than the previous models in various medical imaging modalities, including CT, MRI, and lesion images.

Submitted 1 August, 2024; originally announced August 2024.

Comments: Accepted in BMVC 2024

arXiv:2408.00326 [pdf, other]

doi 10.1145/3627673.3679920

Exploiting Preferences in Loss Functions for Sequential Recommendation via Weak Transitivity

Authors: Hyunsoo Chung, Jungtaek Kim, Hyungeun Jo, Hyungwon Choi

Abstract: The choice of optimization objective is pivotal in the design of a recommender system, as it affects the general modeling process of a user's intent from previous interactions. Existing approaches mainly adhere to three categories of loss functions: pairwise, pointwise, and setwise loss functions. Despite their effectiveness, a critical and common drawback of such objectives is viewing the next observed item as a unique positive while considering all remaining items equally negative. Such a binary label assignment is generally limited to assuring a higher recommendation score of the positive item, neglecting potential structures induced by varying preferences between other unobserved items. To alleviate this issue, we propose a novel method that extends original objectives to explicitly leverage the different levels of preferences as relative orders between their scores. Finally, we demonstrate the superior performance of our method compared to baseline objectives.

Submitted 1 August, 2024; originally announced August 2024.

Comments: Accepted to CIKM 2024, Short Research Paper Track

arXiv:2407.21604 [pdf, other]

MicroMIL: Graph-based Contextual Multiple Instance Learning for Patient Diagnosis Using Microscopy Images

Authors: JongWoo Kim, Bryan Wong, YoungSin Ko, MunYong Yi

Abstract: Current histopathology research has primarily focused on using whole-slide images (WSIs) produced by scanners with weakly-supervised multiple instance learning (MIL). However, WSIs are costly, memory-intensive, and require extensive analysis time. As an alternative, microscopy-based analysis offers cost and memory efficiency, though microscopy images face issues with unknown absolute positions and redundant images due to multiple captures from the subjective perspectives of pathologists. To this end, we introduce MicroMIL, a weakly-supervised MIL framework specifically built to address these challenges by dynamically clustering images using deep cluster embedding (DCE) and Gumbel Softmax for representative image extraction. Graph edges are then constructed from the upper triangular similarity matrix, with nodes connected to their most similar neighbors, and a graph neural network (GNN) is utilized to capture local and diverse areas of contextual information. Unlike existing graph-based MIL methods designed for WSIs that require absolute positions, MicroMIL efficiently handles the graph edges without this need. Extensive evaluations on real-world colon cancer (Seegene) and public BreakHis datasets demonstrate that MicroMIL outperforms state-of-the-art (SOTA) methods, offering a robust and efficient solution for patient diagnosis using microscopy images. The code is available at https://anonymous.4open.science/r/MicroMIL-6C7C

Submitted 31 July, 2024; originally announced July 2024.

Comments: The first two authors contributed equally to this work

arXiv:2407.21571 [pdf, other]

PMoE: Progressive Mixture of Experts with Asymmetric Transformer for Continual Learning

Authors: Min Jae Jung, JooHee Kim

Abstract: Large Language Models (LLMs) encounter significant challenges in continual learning due to catastrophic forgetting, where new information overwrites previously acquired knowledge. This limitation leads to substantial environmental and economic waste. In this study, we introduce PMoE, Progressive Mixture of Experts with Asymmetric Transformer, which aims to minimize forgetting by utilizing an asymmetric design with shallow layers dedicated to general knowledge and deep layers for new knowledge. PMoE incorporates progressively added experts in deep layers and a router that allocates new knowledge to the appropriate experts efficiently. The router, positioned adjacent to the deep layers, utilizes deep features aggregating consolidated information. This enables the router to perform efficiently, allocating new knowledge to the appropriate experts, which progressively increase in the deep layers. Extensive experiments on TRACE datasets and general language understanding datasets demonstrate that the proposed PMoE outperforms previous state-of-the-art approaches.

Submitted 31 July, 2024; originally announced July 2024.

arXiv:2407.21448 [pdf, other]

Accelerating Image Super-Resolution Networks with Pixel-Level Classification

Authors: Jinho Jeong, Jinwoo Kim, Younghyun Jo, Seon Joo Kim

Abstract: In recent times, the need for effective super-resolution (SR) techniques has surged, especially for large-scale images ranging from 2K to 8K resolution. For DNN-based SISR, decomposing images into overlapping patches is typically necessary due to computational constraints. In such a patch-decomposing scheme, one can allocate computational resources differently based on each patch's difficulty to further improve efficiency while maintaining SR performance. However, this approach has a limitation: computational resources are uniformly allocated within a patch, leading to lower efficiency when the patch contains pixels with varying levels of restoration difficulty. To address the issue, we propose the Pixel-level Classifier for Single Image Super-Resolution (PCSR), a novel method designed to distribute computational resources adaptively at the pixel level. A PCSR model comprises a backbone, a pixel-level classifier, and a set of pixel-level upsamplers with varying capacities. The pixel-level classifier assigns each pixel to an appropriate upsampler based on its restoration difficulty, thereby optimizing computational resource usage. Our method allows for performance and computational cost balance during inference without re-training. Our experiments demonstrate PCSR's advantage over existing patch-distributing methods in PSNR-FLOP trade-offs across different backbone models and benchmarks. The code is available at https://github.com/3587jjh/PCSR.

Submitted 31 July, 2024; originally announced July 2024.

Comments: Accepted by ECCV 2024

arXiv:2407.21267 [pdf, other]

DEF-oriCORN: efficient 3D scene understanding for robust language-directed manipulation without demonstrations

Authors: Dongwon Son, Sanghyeon Son, Jaehyung Kim, Beomjoon Kim

Abstract: We present DEF-oriCORN, a framework for language-directed manipulation tasks. By leveraging a novel object-based scene representation and diffusion-model-based state estimation algorithm, our framework enables efficient and robust manipulation planning in response to verbal commands, even in tightly packed environments with sparse camera views without any demonstrations. Unlike traditional representations, our representation affords efficient collision checking and language grounding. Compared to state-of-the-art baselines, our framework achieves superior estimation and motion planning performance from sparse RGB images and generalizes zero-shot to real-world scenarios with diverse materials, including transparent and reflective objects, despite being trained exclusively in simulation. Our code for data generation, training, inference, and pre-trained weights are publicly available at: https://sites.google.com/view/def-oricorn/home.

Submitted 30 July, 2024; originally announced July 2024.

arXiv:2407.21035 [pdf, other]

Direct Unlearning Optimization for Robust and Safe Text-to-Image Models

Authors: Yong-Hyun Park, Sangdoo Yun, Jin-Hwa Kim, Junho Kim, Geonhui Jang, Yonghyun Jeong, Junghyo Jo, Gayoung Lee

Abstract: Recent advancements in text-to-image (T2I) models have greatly benefited from large-scale datasets, but they also pose significant risks due to the potential generation of unsafe content. To mitigate this issue, researchers have developed unlearning techniques to remove the model's ability to generate potentially harmful content. However, these methods are easily bypassed by adversarial attacks, making them unreliable for ensuring the safety of generated images. In this paper, we propose Direct Unlearning Optimization (DUO), a novel framework for removing Not Safe For Work (NSFW) content from T2I models while preserving their performance on unrelated topics. DUO employs a preference optimization approach using curated paired image data, ensuring that the model learns to remove unsafe visual concepts while retaining unrelated features. Furthermore, we introduce an output-preserving regularization term to maintain the model's generative capabilities on safe content. Extensive experiments demonstrate that DUO can robustly defend against various state-of-the-art red teaming methods without significant performance degradation on unrelated topics, as measured by FID and CLIP scores. Our work contributes to the development of safer and more reliable T2I models, paving the way for their responsible deployment in both closed-source and open-source scenarios.

Submitted 17 July, 2024; originally announced July 2024.

Comments: Extended abstract accepted in GenLaw 2024 workshop @ ICML2024

arXiv:2407.20648 [pdf, other]

Leveraging Multi-facet Paths for Heterogeneous Graph Representation Learning

Authors: JongWoo Kim, SeongYeub Chu, HyeongMin Park, Bryan Wong, MunYong Yi

Abstract: Recent advancements in graph neural networks (GNNs) and heterogeneous GNNs (HGNNs) have advanced node embeddings and relationship learning for various tasks. However, existing methods often rely on domain-specific predefined meta-paths, which are coarse-grained and focus solely on aspects like node type, limiting their ability to capture complex interactions. We introduce MF2Vec, a model that uses multi-faceted (fine-grained) paths instead of predefined meta-paths. MF2Vec extracts paths via random walks and generates multi-faceted vectors, ignoring predefined schemas. This method learns diverse aspects of nodes and their relationships, constructs a homogeneous network, and creates node embeddings for classification, link prediction, and clustering. Extensive experiments show that MF2Vec outperforms existing methods, offering a more flexible and comprehensive framework for analyzing complex networks. The code is available at https://anonymous.4open.science/r/MF2Vec-6ABC.

Submitted 30 July, 2024; originally announced July 2024.

Comments: 9 pages

arXiv:2407.19900 [pdf, other]

Practical and Reproducible Symbolic Music Generation by Large Language Models with Structural Embeddings

Authors: Seungyeon Rhyu, Kichang Yang, Sungjun Cho, Jaehyeon Kim, Kyogu Lee, Moontae Lee

Abstract: Music generation introduces challenging complexities to large language models. Symbolic structures of music often include vertical harmonization as well as horizontal counterpoint, urging various adaptations and enhancements for large-scale Transformers. However, existing works share three major drawbacks: 1) their tokenization requires domain-specific annotations, such as bars and beats, that are typically missing in raw MIDI data; 2) the pure impact of enhancing token embedding methods is hardly examined without domain-specific annotations; and 3) existing works to overcome the aforementioned drawbacks, such as MuseNet, lack reproducibility. To tackle such limitations, we develop a MIDI-based music generation framework inspired by MuseNet, empirically studying two structural embeddings that do not rely on domain-specific annotations. We provide various metrics and insights that can guide suitable encoding to deploy. We also verify that multiple embedding configurations can selectively boost certain musical aspects. By providing open-source implementations via HuggingFace, our findings shed light on leveraging large language models toward practical and reproducible music generation.

Submitted 29 July, 2024; originally announced July 2024.

Comments: 9 pages, 6 figures, 4 tables

arXiv:2407.19532 [pdf, other]

The Interpretability of Codebooks in Model-Based Reinforcement Learning is Limited

Authors: Kenneth Eaton, Jonathan Balloch, Julia Kim, Mark Riedl

Abstract: Interpretability of deep reinforcement learning systems could assist operators with understanding how they interact with their environment. Vector quantization methods -- also called codebook methods -- discretize a neural network's latent space, which is often suggested to yield emergent interpretability. We investigate whether vector quantization in fact provides interpretability in model-based reinforcement learning. Our experiments, conducted in the reinforcement learning environment Crafter, show that the codes of vector quantization models are inconsistent, have no guarantee of uniqueness, and have a limited impact on concept disentanglement, all of which are necessary traits for interpretability. We share insights on why vector quantization may be fundamentally insufficient for model interpretability.

Submitted 28 July, 2024; originally announced July 2024.

arXiv:2407.19216 [pdf, other]

EaTVul: ChatGPT-based Evasion Attack Against Software Vulnerability Detection

Authors: Shigang Liu, Di Cao, Junae Kim, Tamas Abraham, Paul Montague, Seyit Camtepe, Jun Zhang, Yang Xiang

Abstract: Recently, deep learning has demonstrated promising results in enhancing the accuracy of vulnerability detection and identifying vulnerabilities in software. However, these techniques are still vulnerable to attacks. Adversarial examples can exploit vulnerabilities within deep neural networks, posing a significant threat to system security. This study showcases the susceptibility of deep learning models to adversarial attacks, which can achieve 100% attack success rate (refer to Table 5). The proposed method, EaTVul, encompasses six stages: identification of important samples using support vector machines, identification of important features using the attention mechanism, generation of adversarial data based on these features using ChatGPT, preparation of an adversarial attack pool, selection of seed data using a fuzzy genetic algorithm, and the execution of an evasion attack. Extensive experiments demonstrate the effectiveness of EaTVul, achieving an attack success rate of more than 83% when the snippet size is greater than 2. Furthermore, in most cases with a snippet size of 4, EaTVul achieves a 100% attack success rate. The findings of this research emphasize the necessity of robust defenses against adversarial attacks in software vulnerability detection.

Submitted 27 July, 2024; originally announced July 2024.

arXiv:2407.19156 [pdf, other]

Robust Multimodal 3D Object Detection via Modality-Agnostic Decoding and Proximity-based Modality Ensemble

Authors: Juhan Cha, Minseok Joo, Jihwan Park, Sanghyeok Lee, Injae Kim, Hyunwoo J. Kim

Abstract: Recent advancements in 3D object detection have benefited from multi-modal information from the multi-view cameras and LiDAR sensors. However, the inherent disparities between the modalities pose substantial challenges. We observe that existing multi-modal 3D object detection methods heavily rely on the LiDAR sensor, treating the camera as an auxiliary modality for augmenting semantic details. This often leads to not only underutilization of camera data but also significant performance degradation in scenarios where LiDAR data is unavailable. Additionally, existing fusion methods overlook the detrimental impact on detection performance of sensor noise induced by environmental changes. In this paper, we propose MEFormer to address the LiDAR over-reliance problem by harnessing critical information for 3D object detection from every available modality while concurrently safeguarding against corrupted signals during the fusion process. Specifically, we introduce Modality Agnostic Decoding (MOAD) that extracts geometric and semantic features with a shared transformer decoder regardless of input modalities and provides promising improvement with a single modality as well as multi-modality. Additionally, our Proximity-based Modality Ensemble (PME) module adaptively utilizes the strengths of each modality depending on the environment while mitigating the effects of a noisy sensor. Our MEFormer achieves state-of-the-art performance of 73.9% NDS and 71.5% mAP in the nuScenes validation set. Extensive analyses validate that our MEFormer improves robustness against challenging conditions such as sensor malfunctions or environmental changes. The source code is available at https://github.com/hanchaa/MEFormer

Submitted 19 August, 2024; v1 submitted 26 July, 2024; originally announced July 2024.

arXiv:2407.18574 [pdf, other]

Learning to Enhance Aperture Phasor Field for Non-Line-of-Sight Imaging

Authors: In Cho, Hyunbo Shim, Seon Joo Kim

Abstract: This paper aims to facilitate more practical NLOS imaging by reducing the number of samplings and scan areas. To this end, we introduce a phasor-based enhancement network that is capable of predicting clean and full measurements from noisy partial observations. We leverage a denoising autoencoder scheme to acquire rich and noise-robust representations in the measurement space. Through this pipeline, our enhancement network is trained to accurately reconstruct complete measurements from their corrupted and partial counterparts. However, we observe that the naive application of denoising often yields degraded and over-smoothed results, caused by unnecessary and spurious frequency signals present in measurements. To address this issue, we introduce a phasor-based pipeline designed to limit the spectrum of our network to the frequency range of interest, where the majority of informative signals are detected. The phasor wavefronts at the aperture, which are band-limited signals, are employed as inputs and outputs of the network, guiding our network to learn from the frequency range of interest and discard unnecessary information. The experimental results in more practical acquisition scenarios demonstrate that we can look around the corners with $16\times$ or $64\times$ fewer samplings and $4\times$ smaller apertures. Our code is available at https://github.com/join16/LEAP.

Submitted 28 July, 2024; v1 submitted 26 July, 2024; originally announced July 2024.

arXiv:2407.18550 [pdf, other]

ReALFRED: An Embodied Instruction Following Benchmark in Photo-Realistic Environments

Authors: Taewoong Kim, Cheolhong Min, Byeonghwi Kim, Jinyeon Kim, Wonje Jeung, Jonghyun Choi

Abstract: Simulated virtual environments have been widely used to train robotic agents that perform daily household tasks. These environments have advanced research considerably, but they often provide limited object interactability, a visual appearance that differs from real-world environments, or relatively small environment sizes. This prevents models learned in virtual scenes from being readily deployable. To bridge the gap between these learning environments and deployment (i.e., real) environments, we propose the ReALFRED benchmark, which employs real-world scenes, objects, and room layouts to train agents to complete household tasks by understanding free-form language instructions and interacting with objects in large, multi-room, 3D-captured scenes. Specifically, we extend the ALFRED benchmark with updates for larger environmental spaces with smaller visual domain gaps. With ReALFRED, we analyze methods previously crafted for the ALFRED benchmark and observe that they consistently yield lower performance in all metrics, encouraging the community to develop methods in more realistic environments. Our code and data are publicly available.

Submitted 26 July, 2024; originally announced July 2024.

Comments: ECCV 2024 (Project page: https://twoongg.github.io/projects/realfred)

Showing 1–50 of 2,606 results for author: Kim, J