Skip to main content

Showing 1–50 of 871 results for author: Zhao, S

Searching in archive cs. Search in all archives.
.
  1. arXiv:2408.10901  [pdf, other

    cs.CV cs.AI cs.LG

    A Grey-box Attack against Latent Diffusion Model-based Image Editing by Posterior Collapse

    Authors: Zhongliang Guo, Lei Fang, Jingyu Lin, Yifei Qian, Shuai Zhao, Zeyu Wang, Junhao Dong, Cunjian Chen, Ognjen Arandjelović, Chun Pong Lau

    Abstract: Recent advancements in generative AI, particularly Latent Diffusion Models (LDMs), have revolutionized image synthesis and manipulation. However, these generative techniques raises concerns about data misappropriation and intellectual property infringement. Adversarial attacks on machine learning models have been extensively studied, and a well-established body of research has extended these techn… ▽ More

    Submitted 20 August, 2024; originally announced August 2024.

    Comments: 21 pages, 7 figures, 10 tables

  2. arXiv:2408.09984  [pdf, other

    cs.CV

    Boosting Open-Domain Continual Learning via Leveraging Intra-domain Category-aware Prototype

    Authors: Yadong Lu, Shitian Zhao, Boxiang Yun, Dongsheng Jiang, Yin Li, Qingli Li, Yan Wang

    Abstract: Despite recent progress in enhancing the efficacy of Open-Domain Continual Learning (ODCL) in Vision-Language Models (VLM), failing to (1) correctly identify the Task-ID of a test image and (2) use only the category set corresponding to the Task-ID, while preserving the knowledge related to each domain, cannot address the two primary challenges of ODCL: forgetting old knowledge and maintaining zer… ▽ More

    Submitted 19 August, 2024; originally announced August 2024.

  3. arXiv:2408.09815  [pdf, other

    cs.LG cs.HC

    A Population-to-individual Tuning Framework for Adapting Pretrained LM to On-device User Intent Prediction

    Authors: Jiahui Gong, Jingtao Ding, Fanjin Meng, Guilong Chen, Hong Chen, Shen Zhao, Haisheng Lu, Yong Li

    Abstract: Mobile devices, especially smartphones, can support rich functions and have developed into indispensable tools in daily life. With the rise of generative AI services, smartphones can potentially transform into personalized assistants, anticipating user needs and scheduling services accordingly. Predicting user intents on smartphones, and reflecting anticipated activities based on past interactions… ▽ More

    Submitted 19 August, 2024; originally announced August 2024.

    Comments: accepted by KDD 2024

  4. arXiv:2408.09278  [pdf, other

    eess.IV cs.CV

    Cross-Species Data Integration for Enhanced Layer Segmentation in Kidney Pathology

    Authors: Junchao Zhu, Mengmeng Yin, Ruining Deng, Yitian Long, Yu Wang, Yaohong Wang, Shilin Zhao, Haichun Yang, Yuankai Huo

    Abstract: Accurate delineation of the boundaries between the renal cortex and medulla is crucial for subsequent functional structural analysis and disease diagnosis. Training high-quality deep-learning models for layer segmentation relies on the availability of large amounts of annotated data. However, due to the patient's privacy of medical data and scarce clinical cases, constructing pathological datasets… ▽ More

    Submitted 17 August, 2024; originally announced August 2024.

  5. arXiv:2408.08575  [pdf, other

    cs.CV

    Tell Codec What Worth Compressing: Semantically Disentangled Image Coding for Machine with LMMs

    Authors: Jinming Liu, Yuntao Wei, Junyan Lin, Shengyang Zhao, Heming Sun, Zhibo Chen, Wenjun Zeng, Xin Jin

    Abstract: We present a new image compression paradigm to achieve ``intelligently coding for machine'' by cleverly leveraging the common sense of Large Multimodal Models (LMMs). We are motivated by the evidence that large language/multimodal models are powerful general-purpose semantics predictors for understanding the real world. Different from traditional image compression typically optimized for human eye… ▽ More

    Submitted 16 August, 2024; originally announced August 2024.

  6. arXiv:2408.08092  [pdf, other

    cs.CV cs.AI

    OC3D: Weakly Supervised Outdoor 3D Object Detection with Only Coarse Click Annotation

    Authors: Qiming Xia, Hongwei Lin, Wei Ye, Hai Wu, Yadan Luo, Shijia Zhao, Xin Li, Chenglu Wen

    Abstract: LiDAR-based outdoor 3D object detection has received widespread attention. However, training 3D detectors from the LiDAR point cloud typically relies on expensive bounding box annotations. This paper presents OC3D, an innovative weakly supervised method requiring only coarse clicks on the bird's eye view of the 3D point cloud. A key challenge here is the absence of complete geometric descriptions… ▽ More

    Submitted 15 August, 2024; v1 submitted 15 August, 2024; originally announced August 2024.

  7. arXiv:2408.07422  [pdf, other

    cs.CV cs.AI

    LLMI3D: Empowering LLM with 3D Perception from a Single 2D Image

    Authors: Fan Yang, Sicheng Zhao, Yanhao Zhang, Haoxiang Chen, Hui Chen, Wenbo Tang, Haonan Lu, Pengfei Xu, Zhenyu Yang, Jungong Han, Guiguang Ding

    Abstract: Recent advancements in autonomous driving, augmented reality, robotics, and embodied intelligence have necessitated 3D perception algorithms. However, current 3D perception methods, particularly small models, struggle with processing logical reasoning, question-answering, and handling open scenario categories. On the other hand, generative multimodal large language models (MLLMs) excel in general… ▽ More

    Submitted 14 August, 2024; originally announced August 2024.

  8. arXiv:2408.07278  [pdf, other

    cs.IR cs.AI cs.CV

    Scene-wise Adaptive Network for Dynamic Cold-start Scenes Optimization in CTR Prediction

    Authors: Wenhao Li, Jie Zhou, Chuan Luo, Chao Tang, Kun Zhang, Shixiong Zhao

    Abstract: In the realm of modern mobile E-commerce, providing users with nearby commercial service recommendations through location-based online services has become increasingly vital. While machine learning approaches have shown promise in multi-scene recommendation, existing methodologies often struggle to address cold-start problems in unprecedented scenes: the increasing diversity of commercial choices,… ▽ More

    Submitted 18 August, 2024; v1 submitted 3 August, 2024; originally announced August 2024.

    Comments: 10 pages, 6 figures, accepted by Recsys 2024

    MSC Class: 68T09 ACM Class: I.2.0

  9. arXiv:2408.07098  [pdf, other

    cs.MA cs.AI

    QTypeMix: Enhancing Multi-Agent Cooperative Strategies through Heterogeneous and Homogeneous Value Decomposition

    Authors: Songchen Fu, Shaojing Zhao, Ta Li, YongHong Yan

    Abstract: In multi-agent cooperative tasks, the presence of heterogeneous agents is familiar. Compared to cooperation among homogeneous agents, collaboration requires considering the best-suited sub-tasks for each agent. However, the operation of multi-agent systems often involves a large amount of complex interaction information, making it more challenging to learn heterogeneous strategies. Related multi-a… ▽ More

    Submitted 12 August, 2024; originally announced August 2024.

    Comments: 16 pages, 8 figures

    ACM Class: I.2.6; I.2.11

  10. arXiv:2408.06904  [pdf, other

    cs.CL

    Re-TASK: Revisiting LLM Tasks from Capability, Skill, and Knowledge Perspectives

    Authors: Zhihu Wang, Shiwan Zhao, Yu Wang, Heyuan Huang, Jiaxin Shi, Sitao Xie, Zhixing Wang, Yubo Zhang, Hongyan Li, Junchi Yan

    Abstract: As large language models (LLMs) continue to scale, their enhanced performance often proves insufficient for solving domain-specific tasks. Systematically analyzing their failures and effectively enhancing their performance remain significant challenges. This paper introduces the Re-TASK framework, a novel theoretical model that Revisits LLM Tasks from cApability, Skill, Knowledge perspectives, gui… ▽ More

    Submitted 13 August, 2024; originally announced August 2024.

    Comments: Work in Progress

  11. arXiv:2408.06878  [pdf, other

    cs.CV cs.GR

    PBIR-NIE: Glossy Object Capture under Non-Distant Lighting

    Authors: Guangyan Cai, Fujun Luan, Miloš Hašan, Kai Zhang, Sai Bi, Zexiang Xu, Iliyan Georgiev, Shuang Zhao

    Abstract: Glossy objects present a significant challenge for 3D reconstruction from multi-view input images under natural lighting. In this paper, we introduce PBIR-NIE, an inverse rendering framework designed to holistically capture the geometry, material attributes, and surrounding illumination of such objects. We propose a novel parallax-aware non-distant environment map as a lightweight and efficient li… ▽ More

    Submitted 13 August, 2024; originally announced August 2024.

  12. arXiv:2408.05706  [pdf, other

    cs.CV

    Decoder Pre-Training with only Text for Scene Text Recognition

    Authors: Shuai Zhao, Yongkun Du, Zhineng Chen, Yu-Gang Jiang

    Abstract: Scene text recognition (STR) pre-training methods have achieved remarkable progress, primarily relying on synthetic datasets. However, the domain gap between synthetic and real images poses a challenge in acquiring feature representations that align well with images on real scenes, thereby limiting the performance of these methods. We note that vision-language models like CLIP, pre-trained on exte… ▽ More

    Submitted 11 August, 2024; originally announced August 2024.

    Comments: Accepted by ACM MM 2024

  13. arXiv:2408.04662  [pdf, other

    cs.CL cs.AI

    Citekit: A Modular Toolkit for Large Language Model Citation Generation

    Authors: Jiajun Shen, Tong Zhou, Suifeng Zhao, Yubo Chen, Kang Liu

    Abstract: Enabling Large Language Models (LLMs) to generate citations in Question-Answering (QA) tasks is an emerging paradigm aimed at enhancing the verifiability of their responses when LLMs are utilizing external references to generate an answer. However, there is currently no unified framework to standardize and fairly compare different citation generation methods, leading to difficulties in reproducing… ▽ More

    Submitted 5 August, 2024; originally announced August 2024.

    Comments: 7 pages, 13 figures

  14. arXiv:2408.04543  [pdf, ps, other

    quant-ph cs.LG

    Quantum Machine Learning: Performance and Security Implications in Real-World Applications

    Authors: Zhengping Jay Luo, Tyler Stewart, Mourya Narasareddygari, Rui Duan, Shangqing Zhao

    Abstract: Quantum computing has garnered significant attention in recent years from both academia and industry due to its potential to achieve a "quantum advantage" over classical computers. The advent of quantum computing introduces new challenges for security and privacy. This poster explores the performance and security implications of quantum computing through a case study of machine learning in a real-… ▽ More

    Submitted 8 August, 2024; originally announced August 2024.

  15. arXiv:2408.03320  [pdf, other

    q-fin.PM cs.LG

    Hedge Fund Portfolio Construction Using PolyModel Theory and iTransformer

    Authors: Siqiao Zhao, Zhikang Dong, Zeyu Cao, Raphael Douady

    Abstract: When constructing portfolios, a key problem is that a lot of financial time series data are sparse, making it challenging to apply machine learning methods. Polymodel theory can solve this issue and demonstrate superiority in portfolio construction from various aspects. To implement the PolyModel theory for constructing a hedge fund portfolio, we begin by identifying an asset pool, utilizing over… ▽ More

    Submitted 15 August, 2024; v1 submitted 6 August, 2024; originally announced August 2024.

  16. arXiv:2408.02657  [pdf, other

    cs.CV

    Lumina-mGPT: Illuminate Flexible Photorealistic Text-to-Image Generation with Multimodal Generative Pretraining

    Authors: Dongyang Liu, Shitian Zhao, Le Zhuo, Weifeng Lin, Yu Qiao, Hongsheng Li, Peng Gao

    Abstract: We present Lumina-mGPT, a family of multimodal autoregressive models capable of various vision and language tasks, particularly excelling in generating flexible photorealistic images from text descriptions. Unlike existing autoregressive image generation approaches, Lumina-mGPT employs a pretrained decoder-only transformer as a unified framework for modeling multimodal token sequences. Our key ins… ▽ More

    Submitted 5 August, 2024; originally announced August 2024.

    Comments: Code available at: https://github.com/Alpha-VLLM/Lumina-mGPT

  17. arXiv:2408.02450  [pdf, other

    cs.SE

    An Evaluation of Requirements Modeling for Cyber-Physical Systems via LLMs

    Authors: Dongming Jin, Shengxin Zhao, Zhi Jin, Xiaohong Chen, Chunhui Wang, Zheng Fang, Hongbin Xiao

    Abstract: Cyber-physical systems (CPSs) integrate cyber and physical components and enable them to interact with each other to meet user needs. The needs for CPSs span rich application domains such as healthcare and medicine, smart home, smart building, etc. This indicates that CPSs are all about solving real-world problems. With the increasing abundance of sensing devices and effectors, the problems wanted… ▽ More

    Submitted 5 August, 2024; originally announced August 2024.

    Comments: 12 pages, 8 figures

  18. arXiv:2408.02302  [pdf, other

    cs.CL

    SNFinLLM: Systematic and Nuanced Financial Domain Adaptation of Chinese Large Language Models

    Authors: Shujuan Zhao, Lingfeng Qiao, Kangyang Luo, Qian-Wen Zhang, Junru Lu, Di Yin

    Abstract: Large language models (LLMs) have become powerful tools for advancing natural language processing applications in the financial industry. However, existing financial LLMs often face challenges such as hallucinations or superficial parameter training, resulting in suboptimal performance, particularly in financial computing and machine reading comprehension (MRC). To address these issues, we propose… ▽ More

    Submitted 5 August, 2024; originally announced August 2024.

  19. arXiv:2408.00325  [pdf, other

    cs.SD eess.AS

    Iterative Prototype Refinement for Ambiguous Speech Emotion Recognition

    Authors: Haoqin Sun, Shiwan Zhao, Xiangyu Kong, Xuechen Wang, Hui Wang, Jiaming Zhou, Yong Qin

    Abstract: Recognizing emotions from speech is a daunting task due to the subtlety and ambiguity of expressions. Traditional speech emotion recognition (SER) systems, which typically rely on a singular, precise emotion label, struggle with this complexity. Therefore, modeling the inherent ambiguity of emotions is an urgent problem. In this paper, we propose an iterative prototype refinement framework (IPR) f… ▽ More

    Submitted 1 August, 2024; originally announced August 2024.

  20. arXiv:2407.21783  [pdf, other

    cs.AI cs.CL cs.CV

    The Llama 3 Herd of Models

    Authors: Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, Arun Rao, Aston Zhang, Aurelien Rodriguez, Austen Gregerson, Ava Spataru, Baptiste Roziere, Bethany Biron, Binh Tang , et al. (510 additional authors not shown)

    Abstract: Modern artificial intelligence (AI) systems are powered by foundation models. This paper presents a new set of foundation models, called Llama 3. It is a herd of language models that natively support multilinguality, coding, reasoning, and tool usage. Our largest model is a dense Transformer with 405B parameters and a context window of up to 128K tokens. This paper presents an extensive empirical… ▽ More

    Submitted 15 August, 2024; v1 submitted 31 July, 2024; originally announced July 2024.

  21. NegotiaToR: Towards A Simple Yet Effective On-demand Reconfigurable Datacenter Network

    Authors: Cong Liang, Xiangli Song, Jing Cheng, Mowei Wang, Yashe Liu, Zhenhua Liu, Shizhen Zhao, Yong Cui

    Abstract: Recent advances in fast optical switching technology show promise in meeting the high goodput and low latency requirements of datacenter networks (DCN). We present NegotiaToR, a simple network architecture for optical reconfigurable DCNs that utilizes on-demand scheduling to handle dynamic traffic. In NegotiaToR, racks exchange scheduling messages through an in-band control plane and distributedly… ▽ More

    Submitted 29 July, 2024; originally announced July 2024.

    Comments: This paper is accepted by ACM SIGCOMM 2024

  22. arXiv:2407.18461  [pdf, other

    cs.SD cs.CL eess.AS

    Enhancing Dysarthric Speech Recognition for Unseen Speakers via Prototype-Based Adaptation

    Authors: Shiyao Wang, Shiwan Zhao, Jiaming Zhou, Aobo Kong, Yong Qin

    Abstract: Dysarthric speech recognition (DSR) presents a formidable challenge due to inherent inter-speaker variability, leading to severe performance degradation when applying DSR models to new dysarthric speakers. Traditional speaker adaptation methodologies typically involve fine-tuning models for each speaker, but this strategy is cost-prohibitive and inconvenient for disabled users, requiring substanti… ▽ More

    Submitted 25 July, 2024; originally announced July 2024.

    Comments: accepted by Interspeech 2024

    Journal ref: INTERSPEECH 2024

  23. arXiv:2407.18390  [pdf, other

    eess.IV cs.CV

    Adapting Mouse Pathological Model to Human Glomerular Lesion Segmentation

    Authors: Lining Yu, Mengmeng Yin, Ruining Deng, Quan Liu, Tianyuan Yao, Can Cui, Yu Wang, Yaohong Wang, Shilin Zhao, Haichun Yang, Yuankai Huo

    Abstract: Moving from animal models to human applications in preclinical research encompasses a broad spectrum of disciplines in medical science. A fundamental element in the development of new drugs, treatments, diagnostic methods, and in deepening our understanding of disease processes is the accurate measurement of kidney tissues. Past studies have demonstrated the viability of translating glomeruli segm… ▽ More

    Submitted 25 July, 2024; originally announced July 2024.

  24. TOM: A Development Platform For Wearable Intelligent Assistants

    Authors: Nuwan Janaka, Shengdong Zhao, David Hsu, Sherisse Tan Jing Wen, Koh Chun Keat

    Abstract: Advanced digital assistants can significantly enhance task performance, reduce user burden, and provide personalized guidance to improve users' abilities. However, the development of such intelligent digital assistants presents a formidable challenge. To address this, we introduce TOM, a conceptual architecture and software platform (https://github.com/TOM-Platform) designed to support the develop… ▽ More

    Submitted 22 July, 2024; originally announced July 2024.

    Comments: 14 pages, 6 figures, 2 tables

    Journal ref: UbiComp Companion 2024

  25. arXiv:2407.15431  [pdf, other

    cs.SI cs.AI cs.LG

    Pre-Training and Prompting for Few-Shot Node Classification on Text-Attributed Graphs

    Authors: Huanjing Zhao, Beining Yang, Yukuo Cen, Junyu Ren, Chenhui Zhang, Yuxiao Dong, Evgeny Kharlamov, Shu Zhao, Jie Tang

    Abstract: The text-attributed graph (TAG) is one kind of important real-world graph-structured data with each node associated with raw texts. For TAGs, traditional few-shot node classification methods directly conduct training on the pre-processed node features and do not consider the raw texts. The performance is highly dependent on the choice of the feature pre-processing method. In this paper, we propose… ▽ More

    Submitted 22 July, 2024; originally announced July 2024.

    Comments: Accepted to KDD'24

  26. arXiv:2407.12592  [pdf, other

    cs.CV

    VegeDiff: Latent Diffusion Model for Geospatial Vegetation Forecasting

    Authors: Sijie Zhao, Hao Chen, Xueliang Zhang, Pengfeng Xiao, Lei Bai, Wanli Ouyang

    Abstract: In the context of global climate change and frequent extreme weather events, forecasting future geospatial vegetation states under these conditions is of significant importance. The vegetation change process is influenced by the complex interplay between dynamic meteorological variables and static environmental variables, leading to high levels of uncertainty. Existing deterministic methods are in… ▽ More

    Submitted 17 July, 2024; originally announced July 2024.

    Comments: 15 pages, 8 figures

  27. arXiv:2407.12415  [pdf, other

    cs.LG

    Not All Frequencies Are Created Equal:Towards a Dynamic Fusion of Frequencies in Time-Series Forecasting

    Authors: Xingyu Zhang, Siyu Zhao, Zeen Song, Huijie Guo, Jianqi Zhang, Changwen Zheng, Wenwen Qiang

    Abstract: Long-term time series forecasting is a long-standing challenge in various applications. A central issue in time series forecasting is that methods should expressively capture long-term dependency. Furthermore, time series forecasting methods should be flexible when applied to different scenarios. Although Fourier analysis offers an alternative to effectively capture reusable and periodic patterns… ▽ More

    Submitted 18 July, 2024; v1 submitted 17 July, 2024; originally announced July 2024.

    Comments: Accpeted by ACMMM2024

  28. Demonstrating PilotAR: A Tool to Assist Wizard-of-Oz Pilot Studies with OHMD

    Authors: Nuwan Janaka, Runze Cai, Shengdong Zhao, David Hsu

    Abstract: While pilot studies help to identify potential interesting research directions, the additional requirements in AR/MR make it challenging to conduct quick and dirty pilot studies efficiently with Optical See-Through Head-Mounted Displays (OST HMDs, OHMDs). To overcome these challenges, including the inability to observe and record in-context user interactions, increased task load, and difficulties… ▽ More

    Submitted 17 July, 2024; originally announced July 2024.

    Comments: 10 pages, 5 figures, 1 table

    Journal ref: UbiComp Companion (2024)

  29. arXiv:2407.12229  [pdf, other

    eess.AS cs.AI eess.SP

    Laugh Now Cry Later: Controlling Time-Varying Emotional States of Flow-Matching-Based Zero-Shot Text-to-Speech

    Authors: Haibin Wu, Xiaofei Wang, Sefik Emre Eskimez, Manthan Thakker, Daniel Tompkins, Chung-Hsien Tsai, Canrun Li, Zhen Xiao, Sheng Zhao, Jinyu Li, Naoyuki Kanda

    Abstract: People change their tones of voice, often accompanied by nonverbal vocalizations (NVs) such as laughter and cries, to convey rich emotions. However, most text-to-speech (TTS) systems lack the capability to generate speech with rich emotions, including NVs. This paper introduces EmoCtrl-TTS, an emotion-controllable zero-shot TTS that can generate highly emotional speech with NVs for any speaker. Em… ▽ More

    Submitted 16 July, 2024; originally announced July 2024.

    Comments: See https://aka.ms/emoctrl-tts for demo samples

  30. arXiv:2407.12019  [pdf, other

    cs.CL cs.AI

    DIM: Dynamic Integration of Multimodal Entity Linking with Large Language Model

    Authors: Shezheng Song, Shasha Li, Jie Yu, Shan Zhao, Xiaopeng Li, Jun Ma, Xiaodong Liu, Zhuo Li, Xiaoguang Mao

    Abstract: Our study delves into Multimodal Entity Linking, aligning the mention in multimodal information with entities in knowledge base. Existing methods are still facing challenges like ambiguous entity representations and limited image information utilization. Thus, we propose dynamic entity extraction using ChatGPT, which dynamically extracts entities and enhances datasets. We also propose a method: Dy… ▽ More

    Submitted 27 June, 2024; originally announced July 2024.

    Comments: Published on PRCV24

  31. arXiv:2407.11921  [pdf, other

    cs.CV cs.CR

    IPA-NeRF: Illusory Poisoning Attack Against Neural Radiance Fields

    Authors: Wenxiang Jiang, Hanwei Zhang, Shuo Zhao, Zhongwen Guo, Hao Wang

    Abstract: Neural Radiance Field (NeRF) represents a significant advancement in computer vision, offering implicit neural network-based scene representation and novel view synthesis capabilities. Its applications span diverse fields including robotics, urban mapping, autonomous navigation, virtual reality/augmented reality, etc., some of which are considered high-risk AI applications. However, despite its wi… ▽ More

    Submitted 18 July, 2024; v1 submitted 16 July, 2024; originally announced July 2024.

  32. arXiv:2407.11044  [pdf, other

    cs.LG cs.AI

    Generalizing soft actor-critic algorithms to discrete action spaces

    Authors: Le Zhang, Yong Gu, Xin Zhao, Yanshuo Zhang, Shu Zhao, Yifei Jin, Xinxin Wu

    Abstract: ATARI is a suite of video games used by reinforcement learning (RL) researchers to test the effectiveness of the learning algorithm. Receiving only the raw pixels and the game score, the agent learns to develop sophisticated strategies, even to the comparable level of a professional human games tester. Ideally, we also want an agent requiring very few interactions with the environment. Previous co… ▽ More

    Submitted 7 July, 2024; originally announced July 2024.

    Comments: Chinese Conference on Pattern Recognition and Computer Vision (PRCV) 2024. GitHub Repo https://github.com/lezhang-thu/bigger-better-faster-SAC

  33. arXiv:2407.10943  [pdf, other

    cs.RO cs.CV

    GRUtopia: Dream General Robots in a City at Scale

    Authors: Hanqing Wang, Jiahe Chen, Wensi Huang, Qingwei Ben, Tai Wang, Boyu Mi, Tao Huang, Siheng Zhao, Yilun Chen, Sizhe Yang, Peizhou Cao, Wenye Yu, Zichao Ye, Jialun Li, Junfeng Long, Zirui Wang, Huiling Wang, Ying Zhao, Zhongying Tu, Yu Qiao, Dahua Lin, Jiangmiao Pang

    Abstract: Recent works have been exploring the scaling laws in the field of Embodied AI. Given the prohibitive costs of collecting real-world data, we believe the Simulation-to-Real (Sim2Real) paradigm is a crucial step for scaling the learning of embodied models. This paper introduces project GRUtopia, the first simulated interactive 3D society designed for various robots. It features several advancements:… ▽ More

    Submitted 15 July, 2024; originally announced July 2024.

  34. arXiv:2407.10876  [pdf, other

    cs.CV

    RepVF: A Unified Vector Fields Representation for Multi-task 3D Perception

    Authors: Chunliang Li, Wencheng Han, Junbo Yin, Sanyuan Zhao, Jianbing Shen

    Abstract: Concurrent processing of multiple autonomous driving 3D perception tasks within the same spatiotemporal scene poses a significant challenge, in particular due to the computational inefficiencies and feature competition between tasks when using traditional multi-task learning approaches. This paper addresses these issues by proposing a novel unified representation, RepVF, which harmonizes the repre… ▽ More

    Submitted 20 July, 2024; v1 submitted 15 July, 2024; originally announced July 2024.

    Comments: Accepted by ECCV 2024

  35. arXiv:2407.10833  [pdf, other

    eess.IV cs.CV

    MoE-DiffIR: Task-customized Diffusion Priors for Universal Compressed Image Restoration

    Authors: Yulin Ren, Xin Li, Bingchen Li, Xingrui Wang, Mengxi Guo, Shijie Zhao, Li Zhang, Zhibo Chen

    Abstract: We present MoE-DiffIR, an innovative universal compressed image restoration (CIR) method with task-customized diffusion priors. This intends to handle two pivotal challenges in the existing CIR methods: (i) lacking adaptability and universality for different image codecs, e.g., JPEG and WebP; (ii) poor texture generation capability, particularly at low bitrates. Specifically, our MoE-DiffIR develo… ▽ More

    Submitted 15 July, 2024; originally announced July 2024.

    Comments: Accepted by ECCV 2024

  36. arXiv:2407.09029  [pdf, other

    cs.MM cs.CV cs.SD eess.AS

    Enhancing Emotion Recognition in Incomplete Data: A Novel Cross-Modal Alignment, Reconstruction, and Refinement Framework

    Authors: Haoqin Sun, Shiwan Zhao, Shaokai Li, Xiangyu Kong, Xuechen Wang, Aobo Kong, Jiaming Zhou, Yong Chen, Wenjia Zeng, Yong Qin

    Abstract: Multimodal emotion recognition systems rely heavily on the full availability of modalities, suffering significant performance declines when modal data is incomplete. To tackle this issue, we present the Cross-Modal Alignment, Reconstruction, and Refinement (CM-ARR) framework, an innovative approach that sequentially engages in cross-modal alignment, reconstruction, and refinement phases to handle… ▽ More

    Submitted 12 July, 2024; originally announced July 2024.

  37. arXiv:2407.08995  [pdf, other

    cs.CL

    Self-Prompt Tuning: Enable Autonomous Role-Playing in LLMs

    Authors: Aobo Kong, Shiwan Zhao, Hao Chen, Qicheng Li, Yong Qin, Ruiqi Sun, Xin Zhou, Jiaming Zhou, Haoqin Sun

    Abstract: Recent advancements in LLMs have showcased their remarkable role-playing capabilities, able to accurately simulate the dialogue styles and cognitive processes of various roles based on different instructions and contexts. Studies indicate that assigning LLMs the roles of experts, a strategy known as role-play prompting, can enhance their performance in the corresponding domains. However, the promp… ▽ More

    Submitted 12 July, 2024; originally announced July 2024.

  38. arXiv:2407.08551  [pdf, other

    cs.CL cs.SD eess.AS

    Autoregressive Speech Synthesis without Vector Quantization

    Authors: Lingwei Meng, Long Zhou, Shujie Liu, Sanyuan Chen, Bing Han, Shujie Hu, Yanqing Liu, Jinyu Li, Sheng Zhao, Xixin Wu, Helen Meng, Furu Wei

    Abstract: We present MELLE, a novel continuous-valued tokens based language modeling approach for text to speech synthesis (TTS). MELLE autoregressively generates continuous mel-spectrogram frames directly from text condition, bypassing the need for vector quantization, which are originally designed for audio compression and sacrifice fidelity compared to mel-spectrograms. Specifically, (i) instead of cross… ▽ More

    Submitted 11 July, 2024; originally announced July 2024.

  39. arXiv:2407.08501  [pdf, other

    cs.HC

    SelfIE: Self-Initiated Explorable Instructions Towards Enhanced User Experience

    Authors: Hyeongcheol Kim, Katherine Fennedy, Georgia Zhang, Can Liu, Shengdong Zhao

    Abstract: Given the widespread use of procedural instructions with non-linear access (situational information retrieval), there has been a proposal to accommodate both linear and non-linear usage in instructional design. However, it has received inadequate scholarly attention, leading to limited exploration. This paper introduces Self-Initiated Explorable (SelfIE) instructions, a new design concept aiming a… ▽ More

    Submitted 29 July, 2024; v1 submitted 10 July, 2024; originally announced July 2024.

  40. arXiv:2407.07077  [pdf, other

    cs.CV cs.AI

    ConceptExpress: Harnessing Diffusion Models for Single-image Unsupervised Concept Extraction

    Authors: Shaozhe Hao, Kai Han, Zhengyao Lv, Shihao Zhao, Kwan-Yee K. Wong

    Abstract: While personalized text-to-image generation has enabled the learning of a single concept from multiple images, a more practical yet challenging scenario involves learning multiple concepts within a single image. However, existing works tackling this scenario heavily rely on extensive human annotations. In this paper, we introduce a novel task named Unsupervised Concept Extraction (UCE) that consid… ▽ More

    Submitted 9 July, 2024; originally announced July 2024.

    Comments: ECCV 2024, Project page: https://haoosz.github.io/ConceptExpress/

  41. arXiv:2407.05934  [pdf, other

    cs.LG cs.AI

    Graph Anomaly Detection with Noisy Labels by Reinforcement Learning

    Authors: Zhu Wang, Shuang Zhou, Junnan Dong, Chang Yang, Xiao Huang, Shengjie Zhao

    Abstract: Graph anomaly detection (GAD) has been widely applied in many areas, e.g., fraud detection in finance and robot accounts in social networks. Existing methods are dedicated to identifying the outlier nodes that deviate from normal ones. While they heavily rely on high-quality annotation, which is hard to obtain in real-world scenarios, this could lead to severely degraded performance based on noisy… ▽ More

    Submitted 8 July, 2024; originally announced July 2024.

  42. arXiv:2407.05758  [pdf, other

    eess.IV cs.AI cs.CV

    Potential of Multimodal Large Language Models for Data Mining of Medical Images and Free-text Reports

    Authors: Yutong Zhang, Yi Pan, Tianyang Zhong, Peixin Dong, Kangni Xie, Yuxiao Liu, Hanqi Jiang, Zhengliang Liu, Shijie Zhao, Tuo Zhang, Xi Jiang, Dinggang Shen, Tianming Liu, Xin Zhang

    Abstract: Medical images and radiology reports are crucial for diagnosing medical conditions, highlighting the importance of quantitative analysis for clinical decision-making. However, the diversity and cross-source heterogeneity of these data challenge the generalizability of current data-mining methods. Multimodal large language models (MLLMs) have recently transformed many domains, significantly affecti… ▽ More

    Submitted 8 July, 2024; originally announced July 2024.

  43. arXiv:2407.03245  [pdf, other

    cs.RO cs.AI eess.SY

    TieBot: Learning to Knot a Tie from Visual Demonstration through a Real-to-Sim-to-Real Approach

    Authors: Weikun Peng, Jun Lv, Yuwei Zeng, Haonan Chen, Siheng Zhao, Jichen Sun, Cewu Lu, Lin Shao

    Abstract: The tie-knotting task is highly challenging due to the tie's high deformation and long-horizon manipulation actions. This work presents TieBot, a Real-to-Sim-to-Real learning from visual demonstration system for the robots to learn to knot a tie. We introduce the Hierarchical Feature Matching approach to estimate a sequence of tie's meshes from the demonstration video. With these estimated meshes… ▽ More

    Submitted 3 July, 2024; v1 submitted 3 July, 2024; originally announced July 2024.

    Comments: fix few typos

  44. arXiv:2407.00596  [pdf, other

    eess.IV cs.CV

    HATs: Hierarchical Adaptive Taxonomy Segmentation for Panoramic Pathology Image Analysis

    Authors: Ruining Deng, Quan Liu, Can Cui, Tianyuan Yao, Juming Xiong, Shunxing Bao, Hao Li, Mengmeng Yin, Yu Wang, Shilin Zhao, Yucheng Tang, Haichun Yang, Yuankai Huo

    Abstract: Panoramic image segmentation in computational pathology presents a remarkable challenge due to the morphologically complex and variably scaled anatomy. For instance, the intricate organization in kidney pathology spans multiple layers, from regions like the cortex and medulla to functional units such as glomeruli, tubules, and vessels, down to various cell types. In this paper, we propose a novel… ▽ More

    Submitted 30 June, 2024; originally announced July 2024.

    Comments: arXiv admin note: text overlap with arXiv:2402.19286

  45. arXiv:2406.19247  [pdf, other

    cs.CV

    Local Manifold Learning for No-Reference Image Quality Assessment

    Authors: Timin Gao, Wensheng Pan, Yan Zhang, Sicheng Zhao, Shengchuan Zhang, Xiawu Zheng, Ke Li, Liujuan Cao, Rongrong Ji

    Abstract: Contrastive learning has considerably advanced the field of Image Quality Assessment (IQA), emerging as a widely adopted technique. The core mechanism of contrastive learning involves minimizing the distance between quality-similar (positive) examples while maximizing the distance between quality-dissimilar (negative) examples. Despite its successes, current contrastive learning methods often negl… ▽ More

    Submitted 27 June, 2024; originally announced June 2024.

  46. arXiv:2406.18485  [pdf, other

    cs.DC

    LoongTrain: Efficient Training of Long-Sequence LLMs with Head-Context Parallelism

    Authors: Diandian Gu, Peng Sun, Qinghao Hu, Ting Huang, Xun Chen, Yingtong Xiong, Guoteng Wang, Qiaoling Chen, Shangchun Zhao, Jiarui Fang, Yonggang Wen, Tianwei Zhang, Xin Jin, Xuanzhe Liu

    Abstract: Efficiently training LLMs with long sequences is important yet challenged by the massive computation and memory requirements. Sequence parallelism has been proposed to tackle these problems, but existing methods suffer from scalability or efficiency issues. We propose LoongTrain, a novel system to efficiently train LLMs with long sequences at scale. The core of LoongTrain is the 2D-Attention mecha… ▽ More

    Submitted 26 June, 2024; originally announced June 2024.

  47. arXiv:2406.18020  [pdf, other

    cs.LG cs.AI physics.chem-ph

    MolFusion: Multimodal Fusion Learning for Molecular Representations via Multi-granularity Views

    Authors: Muzhen Cai, Sendong Zhao, Haochun Wang, Yanrui Du, Zewen Qiang, Bing Qin, Ting Liu

    Abstract: Artificial Intelligence predicts drug properties by encoding drug molecules, aiding in the rapid screening of candidates. Different molecular representations, such as SMILES and molecule graphs, contain complementary information for molecular encoding. Thus exploiting complementary information from different molecular representations is one of the research priorities in molecular encoding. Most ex… ▽ More

    Submitted 25 June, 2024; originally announced June 2024.

    Comments: 8 pages, 5 figures

  48. arXiv:2406.18009  [pdf, other

    eess.AS cs.SD

    E2 TTS: Embarrassingly Easy Fully Non-Autoregressive Zero-Shot TTS

    Authors: Sefik Emre Eskimez, Xiaofei Wang, Manthan Thakker, Canrun Li, Chung-Hsien Tsai, Zhen Xiao, Hemin Yang, Zirun Zhu, Min Tang, Xu Tan, Yanqing Liu, Sheng Zhao, Naoyuki Kanda

    Abstract: This paper introduces Embarrassingly Easy Text-to-Speech (E2 TTS), a fully non-autoregressive zero-shot text-to-speech system that offers human-level naturalness and state-of-the-art speaker similarity and intelligibility. In the E2 TTS framework, the text input is converted into a character sequence with filler tokens. The flow-matching-based mel spectrogram generator is then trained based on the… ▽ More

    Submitted 25 June, 2024; originally announced June 2024.

  49. arXiv:2406.16605  [pdf, other

    cs.CL cs.AI cs.LG stat.ME

    CLEAR: Can Language Models Really Understand Causal Graphs?

    Authors: Sirui Chen, Mengying Xu, Kun Wang, Xingyu Zeng, Rui Zhao, Shengjie Zhao, Chaochao Lu

    Abstract: Causal reasoning is a cornerstone of how humans interpret the world. To model and reason about causality, causal graphs offer a concise yet effective solution. Given the impressive advancements in language models, a crucial question arises: can they really understand causal graphs? To this end, we pioneer an investigation into language models' understanding of causal graphs. Specifically, we devel… ▽ More

    Submitted 24 June, 2024; originally announced June 2024.

  50. arXiv:2406.15570  [pdf, other

    cs.CL cs.LG

    DEM: Distribution Edited Model for Training with Mixed Data Distributions

    Authors: Dhananjay Ram, Aditya Rawal, Momchil Hardalov, Nikolaos Pappas, Sheng Zha

    Abstract: Training with mixed data distributions is a common and important part of creating multi-task and instruction-following models. The diversity of the data distributions and cost of joint training makes the optimization procedure extremely challenging. Data mixing methods partially address this problem, albeit having a sub-optimal performance across data sources and require multiple expensive trainin… ▽ More

    Submitted 21 June, 2024; originally announced June 2024.

    MSC Class: 68T50 ACM Class: F.2.2; I.2.7