Showing 1–50 of 1,525 results for author: Zhao, H

Searching in archive cs.

  1. arXiv:2408.10722  [pdf, other]

    cs.CL cs.AI

    MEGen: Generative Backdoor in Large Language Models via Model Editing

    Authors: Jiyang Qiu, Xinbei Ma, Zhuosheng Zhang, Hai Zhao

    Abstract: Large language models (LLMs) have demonstrated remarkable capabilities. Their powerful generative abilities enable flexible responses based on various queries or instructions. Emerging as widely adopted generalists for diverse tasks, LLMs are still vulnerable to backdoors. This paper proposes an editing-based generative backdoor, named MEGen, aiming to create a customized backdoor for NLP tasks wi…

    Submitted 20 August, 2024; originally announced August 2024.

    Comments: Work in progress

  2. arXiv:2408.10285  [pdf, other]

    cs.LG cs.AI cs.CE

    BatGPT-Chem: A Foundation Large Model For Retrosynthesis Prediction

    Authors: Yifei Yang, Runhan Shi, Zuchao Li, Shu Jiang, Bao-Liang Lu, Yang Yang, Hai Zhao

    Abstract: Retrosynthesis analysis is pivotal yet challenging in drug discovery and organic chemistry. Despite the proliferation of computational tools over the past decade, AI-based systems often fall short in generalizing across diverse reaction types and exploring alternative synthetic pathways. This paper presents BatGPT-Chem, a large language model with 15 billion parameters, tailored for enhanced retro…

    Submitted 19 August, 2024; originally announced August 2024.

  3. arXiv:2408.09896  [pdf, other]

    cs.LG physics.chem-ph q-bio.BM

    Instruction-Based Molecular Graph Generation with Unified Text-Graph Diffusion Model

    Authors: Yuran Xiang, Haiteng Zhao, Chang Ma, Zhi-Hong Deng

    Abstract: Recent advancements in computational chemistry have increasingly focused on synthesizing molecules based on textual instructions. Integrating graph generation with these instructions is complex, leading most current methods to use molecular sequences with pre-trained large language models. In response to this challenge, we propose a novel framework, named…

    Submitted 19 August, 2024; originally announced August 2024.

  4. arXiv:2408.09853  [pdf, other]

    cs.CL cs.AI

    Self-Directed Turing Test for Large Language Models

    Authors: Weiqi Wu, Hongqiu Wu, Hai Zhao

    Abstract: The Turing test examines whether AIs can exhibit human-like behaviour in natural language conversations. Traditional Turing tests adopt a rigid dialogue format where each participant sends only one message each time and require continuous human involvement to direct the entire interaction with the test subject. This fails to reflect a natural conversational style and hinders the evaluation of Larg…

    Submitted 19 August, 2024; originally announced August 2024.

  5. arXiv:2408.09671  [pdf, other]

    cs.IR

    GANPrompt: Enhancing Robustness in LLM-Based Recommendations with GAN-Enhanced Diversity Prompts

    Authors: Xinyu Li, Chuang Zhao, Hongke Zhao, Likang Wu, Ming HE

    Abstract: In recent years, LLM has demonstrated remarkable proficiency in comprehending and generating natural language, with a growing prevalence in the domain of recommender systems. However, LLM continues to face a significant challenge in that it is highly susceptible to the influence of prompt words. This inconsistency in response to minor alterations in prompt input may compromise the accuracy and res…

    Submitted 18 August, 2024; originally announced August 2024.

  6. arXiv:2408.09665  [pdf, other]

    cs.CV

    SG-GS: Photo-realistic Animatable Human Avatars with Semantically-Guided Gaussian Splatting

    Authors: Haoyu Zhao, Chen Yang, Hao Wang, Xingyue Zhao, Wei Shen

    Abstract: Reconstructing photo-realistic animatable human avatars from monocular videos remains challenging in computer vision and graphics. Recently, methods using 3D Gaussians to represent the human body have emerged, offering faster optimization and real-time rendering. However, due to ignoring the crucial role of human body semantic information which represents the intrinsic structure and connections wi…

    Submitted 18 August, 2024; originally announced August 2024.

    Comments: 12 pages, 5 figures

  7. arXiv:2408.09663  [pdf, other]

    cs.CV

    CHASE: 3D-Consistent Human Avatars with Sparse Inputs via Gaussian Splatting and Contrastive Learning

    Authors: Haoyu Zhao, Hao Wang, Chen Yang, Wei Shen

    Abstract: Recent advancements in human avatar synthesis have utilized radiance fields to reconstruct photo-realistic animatable human avatars. However, both NeRFs-based and 3DGS-based methods struggle with maintaining 3D consistency and exhibit suboptimal detail reconstruction, especially with sparse inputs. To address this challenge, we propose CHASE, which introduces supervision from intrinsic 3D consiste…

    Submitted 19 August, 2024; v1 submitted 18 August, 2024; originally announced August 2024.

    Comments: 13 pages, 6 figures

  8. arXiv:2408.09452  [pdf, other]

    cs.CL

    Identifying Speakers and Addressees of Quotations in Novels with Prompt Learning

    Authors: Yuchen Yan, Hanjie Zhao, Senbin Zhu, Hongde Liu, Zhihong Zhang, Yuxiang Jia

    Abstract: Quotations in literary works, especially novels, are important to create characters, reflect character relationships, and drive plot development. Current research on quotation extraction in novels primarily focuses on quotation attribution, i.e., identifying the speaker of the quotation. However, the addressee of the quotation is also important to construct the relationship between the speaker and…

    Submitted 18 August, 2024; originally announced August 2024.

    Comments: This paper has been accepted by NLPCC 2024

  9. arXiv:2408.09386  [pdf, other]

    cs.AI cs.CL cs.HC

    Game Development as Human-LLM Interaction

    Authors: Jiale Hong, Hongqiu Wu, Hai Zhao

    Abstract: Game development is a highly specialized task that relies on a complex game engine powered by complex programming languages, preventing many gaming enthusiasts from handling it. This paper introduces the Interaction-driven Game Engine (IGE) powered by LLM, which allows everyone to develop a custom game using natural language through Human-LLM interaction. To enable an LLM to function as an IGE, we…

    Submitted 18 August, 2024; originally announced August 2024.

  10. arXiv:2408.08586  [pdf, other]

    cs.DC

    Rubick: Exploiting Job Reconfigurability for Deep Learning Cluster Scheduling

    Authors: Xinyi Zhang, Hanyu Zhao, Wencong Xiao, Xianyan Jia, Fei Xu, Yong Li, Wei Lin, Fangming Liu

    Abstract: The era of large deep learning models has given rise to advanced training strategies such as 3D parallelism and the ZeRO series. These strategies enable various (re-)configurable execution plans for a training job, which exhibit remarkably different requirements of multiple resource types. Existing cluster scheduling systems, however, treat such reconfigurable training jobs as black boxes: they re…

    Submitted 16 August, 2024; originally announced August 2024.

  11. arXiv:2408.06608  [pdf, other]

    cs.AR cs.GR

    Potamoi: Accelerating Neural Rendering via a Unified Streaming Architecture

    Authors: Yu Feng, Weikai Lin, Zihan Liu, Jingwen Leng, Minyi Guo, Han Zhao, Xiaofeng Hou, Jieru Zhao, Yuhao Zhu

    Abstract: Neural Radiance Field (NeRF) has emerged as a promising alternative for photorealistic rendering. Despite recent algorithmic advancements, achieving real-time performance on today's resource-constrained devices remains challenging. In this paper, we identify the primary bottlenecks in current NeRF algorithms and introduce a unified algorithm-architecture co-design, Potamoi, designed to accommodate…

    Submitted 12 August, 2024; originally announced August 2024.

    Comments: arXiv admin note: substantial text overlap with arXiv:2404.11852

  12. arXiv:2408.06574  [pdf, other]

    cs.CL

    SparkRA: A Retrieval-Augmented Knowledge Service System Based on Spark Large Language Model

    Authors: Dayong Wu, Jiaqi Li, Baoxin Wang, Honghong Zhao, Siyuan Xue, Yanjie Yang, Zhijun Chang, Rui Zhang, Li Qian, Bo Wang, Shijin Wang, Zhixiong Zhang, Guoping Hu

    Abstract: Large language models (LLMs) have shown remarkable achievements across various language tasks. To enhance the performance of LLMs in scientific literature services, we developed the scientific literature LLM (SciLit-LLM) through pre-training and supervised fine-tuning on scientific literature, building upon the iFLYTEK Spark LLM. Furthermore, we present a knowledge service system Spark Research Ass…

    Submitted 12 August, 2024; originally announced August 2024.

  13. arXiv:2408.06567  [pdf, other]

    cs.CL cs.AI

    AquilaMoE: Efficient Training for MoE Models with Scale-Up and Scale-Out Strategies

    Authors: Bo-Wen Zhang, Liangdong Wang, Ye Yuan, Jijie Li, Shuhao Gu, Mengdi Zhao, Xinya Wu, Guang Liu, Chengwei Wu, Hanyu Zhao, Li Du, Yiming Ju, Quanyue Ma, Yulong Ao, Yingli Zhao, Songhe Zhu, Zhou Cao, Dong Liang, Yonghua Lin, Ming Zhang, Shunfei Wang, Yanxin Zhou, Min Ye, Xuekai Chen, Xinyang Yu, et al. (2 additional authors not shown)

    Abstract: In recent years, with the rapid application of large language models across various fields, the scale of these models has gradually increased, and the resources required for their pre-training have grown exponentially. Training an LLM from scratch will cost a lot of computation resources while scaling up from a smaller model is a more efficient approach and has thus attracted significant attention…

    Submitted 12 August, 2024; originally announced August 2024.

  14. arXiv:2408.06327  [pdf, other]

    cs.AI cs.CL cs.CV

    VisualAgentBench: Towards Large Multimodal Models as Visual Foundation Agents

    Authors: Xiao Liu, Tianjie Zhang, Yu Gu, Iat Long Iong, Yifan Xu, Xixuan Song, Shudan Zhang, Hanyu Lai, Xinyi Liu, Hanlin Zhao, Jiadai Sun, Xinyue Yang, Yu Yang, Zehan Qi, Shuntian Yao, Xueqiao Sun, Siyi Cheng, Qinkai Zheng, Hao Yu, Hanchen Zhang, Wenyi Hong, Ming Ding, Lihang Pan, Xiaotao Gu, Aohan Zeng, et al. (5 additional authors not shown)

    Abstract: Large Multimodal Models (LMMs) have ushered in a new era in artificial intelligence, merging capabilities in both language and vision to form highly capable Visual Foundation Agents. These agents are postulated to excel across a myriad of tasks, potentially approaching general artificial intelligence. However, existing benchmarks fail to sufficiently challenge or showcase the full potential of LMM…

    Submitted 12 August, 2024; originally announced August 2024.

  15. arXiv:2408.05891  [pdf, other]

    cs.CV

    CMAB: A First National-Scale Multi-Attribute Building Dataset Derived from Open Source Data and GeoAI

    Authors: Yecheng Zhang, Huimin Zhao, Ying Long

    Abstract: Rapidly acquiring three-dimensional (3D) building data, including geometric attributes like rooftop, height, and structure, as well as indicative attributes like function, quality, and age, is essential for accurate urban analysis, simulations, and policy updates. Existing large-scale building datasets lack accuracy, extensibility and indicative attributes. This paper presents a geospatial artific…

    Submitted 11 August, 2024; originally announced August 2024.

    Comments: 43 pages, 20 figures

    ACM Class: I.4.9

  16. arXiv:2408.05842  [pdf, other]

    cs.AI cs.HC

    Scaling Virtual World with Delta-Engine

    Authors: Hongqiu Wu, Zekai Xu, Tianyang Xu, Jiale Hong, Weiqi Wu, Hai Zhao, Min Zhang, Zhezhi He

    Abstract: In this paper, we focus on virtual world, a cyberspace where people can live in. An ideal virtual world shares great similarity with our real world. One of the crucial aspects is its evolving nature, reflected by the individuals' capacity to grow and thereby influence the objective world. Such dynamics is unpredictable and beyond the reach of existing systems. For this, we propose a special…

    Submitted 11 August, 2024; originally announced August 2024.

  17. arXiv:2408.04914  [pdf, other]

    cs.CV

    GuidedNet: Semi-Supervised Multi-Organ Segmentation via Labeled Data Guide Unlabeled Data

    Authors: Haochen Zhao, Hui Meng, Deqian Yang, Xiaozheng Xie, Xiaoze Wu, Qingfeng Li, Jianwei Niu

    Abstract: Semi-supervised multi-organ medical image segmentation aids physicians in improving disease diagnosis and treatment planning and reduces the time and effort required for organ annotation. Existing state-of-the-art methods train the labeled data with ground truths and train the unlabeled data with pseudo-labels. However, the two training flows are separate, which does not reflect the interrelationsh…

    Submitted 9 August, 2024; originally announced August 2024.

    Comments: Accepted by ACM MM2024, 10 pages, 5 figures

  18. arXiv:2408.03768  [pdf, other]

    cs.RO

    HDPlanner: Advancing Autonomous Deployments in Unknown Environments through Hierarchical Decision Networks

    Authors: Jingsong Liang, Yuhong Cao, Yixiao Ma, Hanqi Zhao, Guillaume Sartoretti

    Abstract: In this paper, we introduce HDPlanner, a deep reinforcement learning (DRL) based framework designed to tackle two core and challenging tasks for mobile robots: autonomous exploration and navigation, where the robot must optimize its trajectory adaptively to achieve the task objective through continuous interactions in unknown environments. Specifically, HDPlanner relies on novel hierarchical atten…

    Submitted 7 August, 2024; originally announced August 2024.

    Comments: Submitted to RA-L

  19. arXiv:2408.02544  [pdf, other]

    cs.CL

    Caution for the Environment: Multimodal Agents are Susceptible to Environmental Distractions

    Authors: Xinbei Ma, Yiting Wang, Yao Yao, Tongxin Yuan, Aston Zhang, Zhuosheng Zhang, Hai Zhao

    Abstract: This paper investigates the faithfulness of multimodal large language model (MLLM) agents in the graphical user interface (GUI) environment, aiming to address the research question of whether multimodal GUI agents can be distracted by environmental context. A general setting is proposed where both the user and the agent are benign, and the environment, while not malicious, contains unrelated conte…

    Submitted 5 August, 2024; originally announced August 2024.

  20. arXiv:2408.02310  [pdf, other]

    cs.CR cs.LG

    On the Robustness of Malware Detectors to Adversarial Samples

    Authors: Muhammad Salman, Benjamin Zi Hao Zhao, Hassan Jameel Asghar, Muhammad Ikram, Sidharth Kaushik, Mohamed Ali Kaafar

    Abstract: Adversarial examples add imperceptible alterations to inputs with the objective to induce misclassification in machine learning models. They have been demonstrated to pose significant challenges in domains like image classification, with results showing that an adversarially perturbed image to evade detection against one classifier is most likely transferable to other classifiers. Adversarial exam…

    Submitted 5 August, 2024; originally announced August 2024.

    Comments: This is the full version of the paper with the same title to appear in the proceedings of the 2024 Workshop on Security and Artificial Intelligence (SECAI 2024)

  21. arXiv:2408.00521  [pdf, other]

    cs.AI

    A new approach for encoding code and assisting code understanding

    Authors: Mengdan Fan, Wei Zhang, Haiyan Zhao, Zhi Jin

    Abstract: Some companies (e.g., Microsoft Research and Google DeepMind) have discovered some of the limitations of GPTs autoregressive paradigm next-word prediction, manifested in the model lack of planning, working memory, backtracking, and reasoning skills. GPTs rely on a local and greedy process of generating the next word, without a global understanding of the task or the output. We have confirmed the abo…

    Submitted 1 August, 2024; originally announced August 2024.

    Comments: 10 pages, 14 figures

  22. arXiv:2407.21783  [pdf, other]

    cs.AI cs.CL cs.CV

    The Llama 3 Herd of Models

    Authors: Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, Arun Rao, Aston Zhang, Aurelien Rodriguez, Austen Gregerson, Ava Spataru, Baptiste Roziere, Bethany Biron, Binh Tang, et al. (510 additional authors not shown)

    Abstract: Modern artificial intelligence (AI) systems are powered by foundation models. This paper presents a new set of foundation models, called Llama 3. It is a herd of language models that natively support multilinguality, coding, reasoning, and tool usage. Our largest model is a dense Transformer with 405B parameters and a context window of up to 128K tokens. This paper presents an extensive empirical…

    Submitted 15 August, 2024; v1 submitted 31 July, 2024; originally announced July 2024.

  23. arXiv:2407.21467  [pdf]

    cs.CV cs.AI

    Deep Learning-Based Longitudinal Prediction of Childhood Myopia Progression Using Fundus Image Sequences and Baseline Refraction Data

    Authors: Mengtian Kang, Yansong Hu, Shuo Gao, Yuanyuan Liu, Hongbei Meng, Xuemeng Li, Xuhang Chen, Hubin Zhao, Jing Fu, Guohua Hu, Wei Wang, Yanning Dai, Arokia Nathan, Peter Smielewski, Ningli Wang, Shiming Li

    Abstract: Childhood myopia constitutes a significant global health concern. It exhibits an escalating prevalence and has the potential to evolve into severe, irreversible conditions that detrimentally impact familial well-being and create substantial economic costs. Contemporary research underscores the importance of precisely predicting myopia progression to enable timely and effective interventions, there…

    Submitted 31 July, 2024; originally announced July 2024.

  24. arXiv:2407.21065  [pdf, other]

    cs.CL cs.IR cs.LG

    LawLLM: Law Large Language Model for the US Legal System

    Authors: Dong Shu, Haoran Zhao, Xukun Liu, David Demeter, Mengnan Du, Yongfeng Zhang

    Abstract: In the rapidly evolving field of legal analytics, finding relevant cases and accurately predicting judicial outcomes are challenging because of the complexity of legal language, which often includes specialized terminology, complex syntax, and historical context. Moreover, the subtle distinctions between similar and precedent cases require a deep understanding of legal knowledge. Researchers often…

    Submitted 27 July, 2024; originally announced July 2024.

    Comments: 21 pages, 2 figures, accepted at the 33rd ACM International Conference on Information and Knowledge Management (CIKM 2024) for the Applied Research Paper track

  25. arXiv:2407.19397  [pdf, other]

    cs.CV

    Domain Adaptive Lung Nodule Detection in X-ray Image

    Authors: Haifeng Zhao, Lixiang Jiang, Leilei Ma, Dengdi Sun, Yanping Fu

    Abstract: Medical images from different healthcare centers exhibit varied data distributions, posing significant challenges for adapting lung nodule detection due to the domain shift between training and application phases. Traditional unsupervised domain adaptive detection methods often struggle with this shift, leading to suboptimal outcomes. To overcome these challenges, we introduce a novel domain adapt…

    Submitted 2 August, 2024; v1 submitted 28 July, 2024; originally announced July 2024.

    Comments: This paper will be submitted to IEEE SMC 2024

  26. arXiv:2407.19256  [pdf]

    cs.AI cs.CL cs.LG

    Stochastic Parrots or ICU Experts? Large Language Models in Critical Care Medicine: A Scoping Review

    Authors: Tongyue Shi, Jun Ma, Zihan Yu, Haowei Xu, Minqi Xiong, Meirong Xiao, Yilin Li, Huiying Zhao, Guilan Kong

    Abstract: With the rapid development of artificial intelligence (AI), large language models (LLMs) have shown strong capabilities in natural language understanding, reasoning, and generation, attracting amounts of research interest in applying LLMs to health and medicine. Critical care medicine (CCM) provides diagnosis and treatment for critically ill patients who often require intensive monitoring and inte…

    Submitted 27 July, 2024; originally announced July 2024.

    Comments: 28 pages, 5 figures

  27. arXiv:2407.19079  [pdf, other]

    cs.CV

    UniForensics: Face Forgery Detection via General Facial Representation

    Authors: Ziyuan Fang, Hanqing Zhao, Tianyi Wei, Wenbo Zhou, Ming Wan, Zhanyi Wang, Weiming Zhang, Nenghai Yu

    Abstract: Previous deepfake detection methods mostly depend on low-level textural features vulnerable to perturbations and fall short of detecting unseen forgery methods. In contrast, high-level semantic features are less susceptible to perturbations and not limited to forgery-specific artifacts, thus having stronger generalization. Motivated by this, we propose a detection method that utilizes high-level s…

    Submitted 26 July, 2024; originally announced July 2024.

  28. arXiv:2407.18962  [pdf]

    cs.RO cs.LG

    Autonomous Navigation of Unmanned Vehicle Through Deep Reinforcement Learning

    Authors: Letian Xu, Jiabei Liu, Haopeng Zhao, Tianyao Zheng, Tongzhou Jiang, Lipeng Liu

    Abstract: This paper explores the method of achieving autonomous navigation of unmanned vehicles through Deep Reinforcement Learning (DRL). The focus is on using the Deep Deterministic Policy Gradient (DDPG) algorithm to address issues in high-dimensional continuous action spaces. The paper details the model of an Ackermann robot and the structure and application of the DDPG algorithm. Experiments were condu…

    Submitted 18 July, 2024; originally announced July 2024.

  29. Text-Region Matching for Multi-Label Image Recognition with Missing Labels

    Authors: Leilei Ma, Hongxing Xie, Lei Wang, Yanping Fu, Dengdi Sun, Haifeng Zhao

    Abstract: Recently, large-scale visual language pre-trained (VLP) models have demonstrated impressive performance across various downstream tasks. Motivated by these advancements, pioneering efforts have emerged in multi-label image recognition with missing labels, leveraging VLP prompt-tuning technology. However, they usually cannot match text and vision features well, due to complicated semantics gaps and…

    Submitted 7 August, 2024; v1 submitted 26 July, 2024; originally announced July 2024.

    Comments: Accepted to ACM International Conference on Multimedia (ACM MM) 2024

  30. arXiv:2407.18232  [pdf, other]

    cs.CV

    LION: Linear Group RNN for 3D Object Detection in Point Clouds

    Authors: Zhe Liu, Jinghua Hou, Xinyu Wang, Xiaoqing Ye, Jingdong Wang, Hengshuang Zhao, Xiang Bai

    Abstract: The benefit of transformers in large-scale 3D point cloud perception tasks, such as 3D object detection, is limited by their quadratic computation cost when modeling long-range relationships. In contrast, linear RNNs have low computational complexity and are suitable for long-range modeling. Toward this goal, we propose a simple and effective window-based framework built on LInear grOup RNN (i.e.,…

    Submitted 25 July, 2024; originally announced July 2024.

    Comments: Project page: https://happinesslz.github.io/projects/LION/

  31. arXiv:2407.18064  [pdf, other]

    cs.HC

    ComPeer: A Generative Conversational Agent for Proactive Peer Support

    Authors: Tianjian Liu, Hongzheng Zhao, Yuheng Liu, Xingbo Wang, Zhenhui Peng

    Abstract: Conversational Agents (CAs) acting as peer supporters have been widely studied and demonstrated beneficial for people's mental health. However, previous peer support CAs either are user-initiated or follow predefined rules to initiate the conversations, which may discourage users to engage and build relationships with the CAs for long-term benefits. In this paper, we develop ComPeer, a generative…

    Submitted 5 August, 2024; v1 submitted 25 July, 2024; originally announced July 2024.

    Comments: To appear at the 2024 ACM Symposium on User Interface Software and Technology (UIST); 22 pages (7 figures, 7 tables)

  32. arXiv:2407.18003  [pdf, other]

    cs.CL

    Keep the Cost Down: A Review on Methods to Optimize LLM's KV-Cache Consumption

    Authors: Luohe Shi, Hongyi Zhang, Yao Yao, Zuchao Li, Hai Zhao

    Abstract: Large Language Models (LLMs), epitomized by ChatGPT's release in late 2022, have revolutionized various industries with their advanced language comprehension. However, their efficiency is challenged by the Transformer architecture's struggle with handling long texts. KV-Cache has emerged as a pivotal solution to this issue, converting the time complexity of token generation from quadratic to lin…

    Submitted 13 August, 2024; v1 submitted 25 July, 2024; originally announced July 2024.

    Comments: to be published in CoLM 2024

  33. arXiv:2407.16412  [pdf, other]

    cs.RO

    Cross Anything: General Quadruped Robot Navigation through Complex Terrains

    Authors: Shaoting Zhu, Derun Li, Yong Liu, Ningyi Xu, Hang Zhao

    Abstract: The application of vision-language models (VLMs) has achieved impressive success in various robotics tasks, but there are few explorations for foundation models used in quadruped robot navigation. We introduce Cross Anything System (CAS), an innovative system composed of a high-level reasoning module and a low-level control policy, enabling the robot to navigate across complex 3D terrains and reac…

    Submitted 23 July, 2024; originally announced July 2024.

  34. arXiv:2407.15886  [pdf, other]

    cs.CV cs.AI

    CatVTON: Concatenation Is All You Need for Virtual Try-On with Diffusion Models

    Authors: Zheng Chong, Xiao Dong, Haoxiang Li, Shiyue Zhang, Wenqing Zhang, Xujie Zhang, Hanqing Zhao, Xiaodan Liang

    Abstract: Virtual try-on methods based on diffusion models achieve realistic try-on effects but often replicate the backbone network as a ReferenceNet or use additional image encoders to process condition inputs, leading to high training and inference costs. In this work, we rethink the necessity of ReferenceNet and image encoders and innovate the interaction between garment and person by proposing CatVTON,…

    Submitted 21 July, 2024; originally announced July 2024.

    Comments: 10 pages, 9 figures, 4 tables

    MSC Class: 68T42 (Primary) 168T45 (Secondary) ACM Class: I.4.9

  35. arXiv:2407.15719  [pdf, other]

    cs.CV cs.AI

    GFE-Mamba: Mamba-based AD Multi-modal Progression Assessment via Generative Feature Extraction from MCI

    Authors: Zhaojie Fang, Shenghao Zhu, Yifei Chen, Binfeng Zou, Fan Jia, Linwei Qiu, Chang Liu, Yiyu Huang, Xiang Feng, Feiwei Qin, Changmiao Wang, Yeru Wang, Jin Fan, Changbiao Chu, Wan-Zhen Wu, Hu Zhao

    Abstract: Alzheimer's Disease (AD) is an irreversible neurodegenerative disorder that often progresses from Mild Cognitive Impairment (MCI), leading to memory loss and significantly impacting patients' lives. Clinical trials indicate that early targeted interventions for MCI patients can potentially slow or halt the development and progression of AD. Previous research has shown that accurate medical classif…

    Submitted 22 July, 2024; originally announced July 2024.

    Comments: 35 pages, 4 figures

  36. arXiv:2407.15661  [pdf, other]

    cs.CV

    DriveDiTFit: Fine-tuning Diffusion Transformers for Autonomous Driving

    Authors: Jiahang Tu, Wei Ji, Hanbin Zhao, Chao Zhang, Roger Zimmermann, Hui Qian

    Abstract: In autonomous driving, deep models have shown remarkable performance across various visual perception tasks with the demand of high-quality and huge-diversity training datasets. Such datasets are expected to cover various driving scenarios with adverse weather, lighting conditions and diverse moving objects. However, manually collecting these data presents huge challenges and expensive cost. With…

    Submitted 22 July, 2024; originally announced July 2024.

  37. arXiv:2407.15431  [pdf, other]

    cs.SI cs.AI cs.LG

    Pre-Training and Prompting for Few-Shot Node Classification on Text-Attributed Graphs

    Authors: Huanjing Zhao, Beining Yang, Yukuo Cen, Junyu Ren, Chenhui Zhang, Yuxiao Dong, Evgeny Kharlamov, Shu Zhao, Jie Tang

    Abstract: The text-attributed graph (TAG) is one kind of important real-world graph-structured data with each node associated with raw texts. For TAGs, traditional few-shot node classification methods directly conduct training on the pre-processed node features and do not consider the raw texts. The performance is highly dependent on the choice of the feature pre-processing method. In this paper, we propose…

    Submitted 22 July, 2024; originally announced July 2024.

    Comments: Accepted to KDD'24

  38. arXiv:2407.15341  [pdf, other]

    cs.CL

    ZZU-NLP at SIGHAN-2024 dimABSA Task: Aspect-Based Sentiment Analysis with Coarse-to-Fine In-context Learning

    Authors: Senbin Zhu, Hanjie Zhao, Xingren Wang, Shanhong Liu, Yuxiang Jia, Hongying Zan

    Abstract: The DimABSA task requires fine-grained sentiment intensity prediction for restaurant reviews, including scores for Valence and Arousal dimensions for each Aspect Term. In this study, we propose a Coarse-to-Fine In-context Learning (CFICL) method based on the Baichuan2-7B model for the DimABSA task in the SIGHAN 2024 workshop. Our method improves prediction accuracy through a two-stage optimization…

    Submitted 21 July, 2024; originally announced July 2024.

  39. arXiv:2407.15282  [pdf, other]

    cs.CV

    Point Transformer V3 Extreme: 1st Place Solution for 2024 Waymo Open Dataset Challenge in Semantic Segmentation

    Authors: Xiaoyang Wu, Xiang Xu, Lingdong Kong, Liang Pan, Ziwei Liu, Tong He, Wanli Ouyang, Hengshuang Zhao

    Abstract: In this technical report, we detail our first-place solution for the 2024 Waymo Open Dataset Challenge's semantic segmentation track. We significantly enhanced the performance of Point Transformer V3 on the Waymo benchmark by implementing cutting-edge, plug-and-play training and inference technologies. Notably, our advanced version, Point Transformer V3 Extreme, leverages multi-frame training and…

    Submitted 21 July, 2024; originally announced July 2024.

    Comments: 1st Place Solution for 2024 Waymo Open Dataset Challenge in Semantic Segmentation

  40. arXiv:2407.14500  [pdf, other]

    cs.CV

    ViLLa: Video Reasoning Segmentation with Large Language Model

    Authors: Rongkun Zheng, Lu Qi, Xi Chen, Yi Wang, Kun Wang, Yu Qiao, Hengshuang Zhao

    Abstract: Although video perception models have made remarkable advancements in recent years, they still heavily rely on explicit text descriptions or pre-defined categories to identify target instances before executing video perception tasks. These models, however, fail to proactively comprehend and reason the user's intentions via textual input. Even though previous works attempt to investigate solutions…

    Submitted 29 July, 2024; v1 submitted 18 July, 2024; originally announced July 2024.

    Comments: 15 pages, 6 figures

  41. arXiv:2407.14020  [pdf, other]

    q-bio.NC cs.LG

    NeuroBind: Towards Unified Multimodal Representations for Neural Signals

    Authors: Fengyu Yang, Chao Feng, Daniel Wang, Tianye Wang, Ziyao Zeng, Zhiyang Xu, Hyoungseob Park, Pengliang Ji, Hanbin Zhao, Yuanning Li, Alex Wong

    Abstract: Understanding neural activity and information representation is crucial for advancing knowledge of brain function and cognition. Neural activity, measured through techniques like electrophysiology and neuroimaging, reflects various aspects of information processing. Recent advances in deep neural networks offer new approaches to analyzing these signals using pre-trained models. However, challenges…

    Submitted 19 July, 2024; originally announced July 2024.

  42. arXiv:2407.13771  [pdf, other]

    cs.CV

    Training-Free Model Merging for Multi-target Domain Adaptation

    Authors: Wenyi Li, Huan-ang Gao, Mingju Gao, Beiwen Tian, Rong Zhi, Hao Zhao

    Abstract: In this paper, we study multi-target domain adaptation of scene understanding models. While previous methods achieved commendable results through inter-domain consistency losses, they often assumed unrealistic simultaneous access to images from all target domains, overlooking constraints such as data transfer bandwidth limitations and data privacy concerns. Given these challenges, we pose the ques…

    Submitted 18 July, 2024; originally announced July 2024.

    Comments: Accepted to ECCV 2024

  43. arXiv:2407.13752  [pdf, other]

    cs.CV

    LogoSticker: Inserting Logos into Diffusion Models for Customized Generation

    Authors: Mingkang Zhu, Xi Chen, Zhongdao Wang, Hengshuang Zhao, Jiaya Jia

    Abstract: Recent advances in text-to-image model customization have underscored the importance of integrating new concepts with a few examples. Yet, these progresses are largely confined to widely recognized subjects, which can be learned with relative ease through models' adequate shared prior knowledge. In contrast, logos, characterized by unique patterns and textual elements, are hard to establish shared…

    Submitted 18 July, 2024; originally announced July 2024.

    Comments: ECCV2024

  44. arXiv:2407.13338  [pdf, other]

    cs.CV

    Learn to Memorize and to Forget: A Continual Learning Perspective of Dynamic SLAM

    Authors: Baicheng Li, Zike Yan, Dong Wu, Hanqing Jiang, Hongbin Zha

    Abstract: Simultaneous localization and mapping (SLAM) with implicit neural representations has received extensive attention due to the expressive representation power and the innovative paradigm of continual learning. However, deploying such a system within a dynamic environment has not been well-studied. Such challenges are intractable even for conventional algorithms since observations from different vie…

    Submitted 18 July, 2024; originally announced July 2024.

  45. arXiv:2407.11895  [pdf, other]

    cs.CV

    OmniBind: Large-scale Omni Multimodal Representation via Binding Spaces

    Authors: Zehan Wang, Ziang Zhang, Hang Zhang, Luping Liu, Rongjie Huang, Xize Cheng, Hengshuang Zhao, Zhou Zhao

    Abstract: Recently, human-computer interaction with various modalities has shown promising applications, like GPT-4o and Gemini. Given the foundational role of multimodal joint representation in understanding and generation pipelines, high-quality omni joint representations would be a step toward co-processing more diverse multimodal information. In this work, we present OmniBind, large-scale multimodal joi…

    Submitted 16 July, 2024; originally announced July 2024.

    Comments: Homepage is http://omnibind.github.io

  46. arXiv:2407.11682  [pdf, other]

    cs.CV

    MapDistill: Boosting Efficient Camera-based HD Map Construction via Camera-LiDAR Fusion Model Distillation

    Authors: Xiaoshuai Hao, Ruikai Li, Hui Zhang, Dingzhe Li, Rong Yin, Sangil Jung, Seung-In Park, ByungIn Yoo, Haimei Zhao, Jing Zhang

    Abstract: Online high-definition (HD) map construction is an important and challenging task in autonomous driving. Recently, there has been a growing interest in cost-effective multi-view camera-based methods without relying on other sensors like LiDAR. However, these methods suffer from a lack of explicit depth information, necessitating the use of large models to achieve satisfactory performance. To addre…

    Submitted 16 July, 2024; originally announced July 2024.

    Comments: Accepted by ECCV2024

  47. arXiv:2407.11478  [pdf, other]

    cs.RO

    Trajectory Optimization under Contact Timing Uncertainties

    Authors: Haizhou Zhao, Majid Khadiv

    Abstract: Most interesting problems in robotics (e.g., locomotion and manipulation) are realized through intermittent contact with the environment. Due to the perception and modeling errors, assuming an exact time for establishing contact with the environment is unrealistic. On the other hand, handling uncertainties in contact timing is notoriously difficult as it gives rise to either handling uncertain com…

    Submitted 16 July, 2024; originally announced July 2024.

  48. arXiv:2407.10999  [pdf, other]

    cs.CL cs.AI

    TALEC: Teach Your LLM to Evaluate in Specific Domain with In-house Criteria by Criteria Division and Zero-shot Plus Few-shot

    Authors: Kaiqi Zhang, Shuai Yuan, Honghan Zhao

    Abstract: With the rapid development of large language models (LLM), the evaluation of LLM becomes increasingly important. Measuring text generation tasks such as summarization and article creation is very difficult. Especially in specific application domains (e.g., to-business or to-customer service), in-house evaluation criteria have to meet not only general standards (correctness, helpfulness and creativ…

    Submitted 25 June, 2024; originally announced July 2024.

  49. arXiv:2407.10701  [pdf, other]

    cs.CL

    DOCBENCH: A Benchmark for Evaluating LLM-based Document Reading Systems

    Authors: Anni Zou, Wenhao Yu, Hongming Zhang, Kaixin Ma, Deng Cai, Zhuosheng Zhang, Hai Zhao, Dong Yu

    Abstract: Recently, there has been a growing interest among large language model (LLM) developers in LLM-based document reading systems, which enable users to upload their own documents and pose questions related to the document contents, going beyond simple reading comprehension tasks. Consequently, these systems have been carefully designed to tackle challenges such as file parsing, metadata extraction, m…

    Submitted 15 July, 2024; originally announced July 2024.

    Comments: Work in progress

  50. arXiv:2407.08706  [pdf, other]

    cs.CV

    HiRes-LLaVA: Restoring Fragmentation Input in High-Resolution Large Vision-Language Models

    Authors: Runhui Huang, Xinpeng Ding, Chunwei Wang, Jianhua Han, Yulong Liu, Hengshuang Zhao, Hang Xu, Lu Hou, Wei Zhang, Xiaodan Liang

    Abstract: High-resolution inputs enable Large Vision-Language Models (LVLMs) to discern finer visual details, enhancing their comprehension capabilities. To reduce the training and computation costs caused by high-resolution input, one promising direction is to use sliding windows to slice the input into uniform patches, each matching the input size of the well-trained vision encoder. Although efficient, th…

    Submitted 11 July, 2024; originally announced July 2024.