Showing 1–50 of 653 results for author: Zhou, K

  1. arXiv:2408.09680  [pdf, other]

    cs.CV cs.AI

    MambaLoc: Efficient Camera Localisation via State Space Model

    Authors: Jialu Wang, Kaichen Zhou, Andrew Markham, Niki Trigoni

    Abstract: Location information is pivotal for the automation and intelligence of terminal devices and edge-cloud IoT systems, such as autonomous vehicles and augmented reality. However, achieving reliable positioning across diverse IoT applications remains challenging due to significant training costs and the necessity of densely collected data. To tackle these issues, we have innovatively applied the selec…

    Submitted 20 August, 2024; v1 submitted 18 August, 2024; originally announced August 2024.

  2. arXiv:2408.09240  [pdf, other]

    cs.CV

    RepControlNet: ControlNet Reparameterization

    Authors: Zhaoli Deng, Kaibin Zhou, Fanyi Wang, Zhenpeng Mi

    Abstract: With the wide application of diffusion models, the high cost of inference resources has become an important bottleneck for their universal application. Controllable generation, such as ControlNet, is one of the key research directions of diffusion models, and the research related to inference acceleration and model compression is more important. In order to solve this problem, this paper proposes a mo…

    Submitted 17 August, 2024; originally announced August 2024.

  3. arXiv:2408.08070  [pdf, other]

    cs.CV

    MambaMIM: Pre-training Mamba with State Space Token-interpolation

    Authors: Fenghe Tang, Bingkun Nian, Yingtai Li, Jie Yang, Liu Wei, S. Kevin Zhou

    Abstract: Generative self-supervised learning demonstrates outstanding representation learning capabilities in both Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs). However, there are currently no generative pre-training methods related to selective state space models (Mamba) that can handle long-range dependencies effectively. To address this challenge, we introduce a generative self-su…

    Submitted 15 August, 2024; originally announced August 2024.

    Comments: 10 pages, 7 figures

  4. arXiv:2408.07595  [pdf, other]

    cs.CV

    Progressive Radiance Distillation for Inverse Rendering with Gaussian Splatting

    Authors: Keyang Ye, Qiming Hou, Kun Zhou

    Abstract: We propose progressive radiance distillation, an inverse rendering method that combines physically-based rendering with Gaussian-based radiance field rendering using a distillation progress map. Taking multi-view images as input, our method starts from a pre-trained radiance field guidance, and distills physically-based light and material parameters from the radiance field using an image-fitting p…

    Submitted 14 August, 2024; originally announced August 2024.

  5. arXiv:2408.05936  [pdf, other]

    cs.CV

    Multi-scale Contrastive Adaptor Learning for Segmenting Anything in Underperformed Scenes

    Authors: Ke Zhou, Zhongwei Qiu, Dongmei Fu

    Abstract: Foundational vision models, such as the Segment Anything Model (SAM), have achieved significant breakthroughs through extensive pre-training on large-scale visual datasets. Despite their general success, these models may fall short in specialized tasks with limited data, and fine-tuning such large-scale models is often not feasible. Current strategies involve incorporating adaptors into the pre-tr…

    Submitted 12 August, 2024; originally announced August 2024.

  6. arXiv:2408.05815  [pdf, other]

    cs.CV

    HySparK: Hybrid Sparse Masking for Large Scale Medical Image Pre-Training

    Authors: Fenghe Tang, Ronghao Xu, Qingsong Yao, Xueming Fu, Quan Quan, Heqin Zhu, Zaiyi Liu, S. Kevin Zhou

    Abstract: The generative self-supervised learning strategy exhibits remarkable learning representational capabilities. However, there is limited attention to end-to-end pre-training methods based on a hybrid architecture of CNN and Transformer, which can learn strong local and global representations simultaneously. To address this issue, we propose a generative pre-training strategy called Hybrid Sparse mas…

    Submitted 11 August, 2024; originally announced August 2024.

    Comments: Early accept at MICCAI 2024

    ACM Class: I.4.10; I.4.6

  7. arXiv:2408.05711  [pdf, other]

    cs.CV

    Contrastive masked auto-encoders based self-supervised hashing for 2D image and 3D point cloud cross-modal retrieval

    Authors: Rukai Wei, Heng Cui, Yu Liu, Yufeng Hou, Yanzhao Xie, Ke Zhou

    Abstract: Implementing cross-modal hashing between 2D images and 3D point-cloud data is a growing concern in real-world retrieval systems. Simply applying existing cross-modal approaches to this new task fails to adequately capture latent multi-modal semantics and effectively bridge the modality gap between 2D and 3D. To address these issues without relying on hand-crafted labels, we propose contrastive mas…

    Submitted 11 August, 2024; originally announced August 2024.

    Comments: Accepted by ICME 2024

  8. SAT3D: Image-driven Semantic Attribute Transfer in 3D

    Authors: Zhijun Zhai, Zengmao Wang, Xiaoxiao Long, Kaixuan Zhou, Bo Du

    Abstract: GAN-based image editing task aims at manipulating image attributes in the latent space of generative models. Most of the previous 2D and 3D-aware approaches mainly focus on editing attributes in images with ambiguous semantics or regions from a reference image, which fail to achieve photographic semantic attribute transfer, such as the beard from a photo of a man. In this paper, we propose an imag…

    Submitted 3 August, 2024; originally announced August 2024.

    Journal ref: In Proceedings of the 32nd ACM International Conference on Multimedia, 2024

  9. arXiv:2408.00796  [pdf, ps, other]

    cs.DS cs.CC math-ph math.PR

    Discrepancy Algorithms for the Binary Perceptron

    Authors: Shuangping Li, Tselil Schramm, Kangjie Zhou

    Abstract: The binary perceptron problem asks us to find a sign vector in the intersection of independently chosen random halfspaces with intercept $-κ$. We analyze the performance of the canonical discrepancy minimization algorithms of Lovett-Meka and Rothvoss/Eldan-Singh for the asymmetric binary perceptron problem. We obtain new algorithmic results in the $κ= 0$ case and in the large-$|κ|$ case. In the…

    Submitted 18 July, 2024; originally announced August 2024.

    Comments: 58 pages

  10. arXiv:2408.00254  [pdf, other]

    cs.CV

    LoopSparseGS: Loop Based Sparse-View Friendly Gaussian Splatting

    Authors: Zhenyu Bao, Guibiao Liao, Kaichen Zhou, Kanglin Liu, Qing Li, Guoping Qiu

    Abstract: Despite the photorealistic novel view synthesis (NVS) performance achieved by the original 3D Gaussian splatting (3DGS), its rendering quality significantly degrades with sparse input views. This performance drop is mainly caused by the limited number of initial points generated from the sparse input, insufficient supervision during the training process, and inadequate regularization of the oversi…

    Submitted 31 July, 2024; originally announced August 2024.

    Comments: 13 pages, 10 figures

  11. arXiv:2407.20937  [pdf, other]

    eess.IV cs.CV

    EAR: Edge-Aware Reconstruction of 3-D vertebrae structures from bi-planar X-ray images

    Authors: Lixing Tan, Shuang Song, Yaofeng He, Kangneng Zhou, Tong Lu, Ruoxiu Xiao

    Abstract: X-ray images ease the diagnosis and treatment process due to their rapid imaging speed and high resolution. However, due to the projection process of X-ray imaging, much spatial information has been lost. To accurately provide efficient spinal morphological and structural information, reconstructing the 3-D structures of the spine from the 2-D X-ray images is essential. It is challenging for curre…

    Submitted 4 August, 2024; v1 submitted 30 July, 2024; originally announced July 2024.

    Comments: 13 pages, 11 figures, 3 tables

  12. arXiv:2407.18743  [pdf, other]

    cs.CL

    Towards Effective and Efficient Continual Pre-training of Large Language Models

    Authors: Jie Chen, Zhipeng Chen, Jiapeng Wang, Kun Zhou, Yutao Zhu, Jinhao Jiang, Yingqian Min, Wayne Xin Zhao, Zhicheng Dou, Jiaxin Mao, Yankai Lin, Ruihua Song, Jun Xu, Xu Chen, Rui Yan, Zhewei Wei, Di Hu, Wenbing Huang, Ji-Rong Wen

    Abstract: Continual pre-training (CPT) has been an important approach for adapting language models to specific domains or tasks. To make the CPT approach more traceable, this paper presents a technical report for continually pre-training Llama-3 (8B), which significantly enhances the Chinese language ability and scientific reasoning ability of the backbone model. To enhance the new abilities while retaining…

    Submitted 26 July, 2024; originally announced July 2024.

    Comments: 16 pages, 10 figures, 16 tables

    MSC Class: 68T50 ACM Class: I.2.7

  13. arXiv:2407.18035  [pdf, other]

    cs.CV cs.AI cs.CL

    RestoreAgent: Autonomous Image Restoration Agent via Multimodal Large Language Models

    Authors: Haoyu Chen, Wenbo Li, Jinjin Gu, Jingjing Ren, Sixiang Chen, Tian Ye, Renjing Pei, Kaiwen Zhou, Fenglong Song, Lei Zhu

    Abstract: Natural images captured by mobile devices often suffer from multiple types of degradation, such as noise, blur, and low light. Traditional image restoration methods require manual selection of specific tasks, algorithms, and execution sequences, which is time-consuming and may yield suboptimal results. All-in-one models, though capable of handling multiple tasks, typically support only a limited r…

    Submitted 25 July, 2024; originally announced July 2024.

  14. arXiv:2407.17996  [pdf, other]

    cs.CV

    Joint RGB-Spectral Decomposition Model Guided Image Enhancement in Mobile Photography

    Authors: Kailai Zhou, Lijing Cai, Yibo Wang, Mengya Zhang, Bihan Wen, Qiu Shen, Xun Cao

    Abstract: The integration of miniaturized spectrometers into mobile devices offers new avenues for image quality enhancement and facilitates novel downstream tasks. However, the broader application of spectral sensors in mobile photography is hindered by the inherent complexity of spectral images and the constraints of spectral imaging capabilities. To overcome these challenges, we propose a joint RGB-Spect…

    Submitted 25 July, 2024; originally announced July 2024.

  15. arXiv:2407.16237  [pdf, other]

    cs.AR cs.AI cs.LG

    OriGen: Enhancing RTL Code Generation with Code-to-Code Augmentation and Self-Reflection

    Authors: Fan Cui, Chenyang Yin, Kexing Zhou, Youwei Xiao, Guangyu Sun, Qiang Xu, Qipeng Guo, Demin Song, Dahua Lin, Xingcheng Zhang, Yun Liang

    Abstract: Recent studies have illuminated that Large Language Models (LLMs) exhibit substantial potential in the realm of RTL (Register Transfer Level) code generation, with notable advancements evidenced by commercial models such as GPT-4 and Claude3-Opus. Despite their proficiency, these commercial LLMs often raise concerns regarding privacy and security. Conversely, open-source LLMs, which offer solution…

    Submitted 23 July, 2024; originally announced July 2024.

  16. arXiv:2407.15770  [pdf, other]

    cs.CY cs.CE

    Examining Inequality in Park Quality for Promoting Health Across 35 Global Cities

    Authors: Linus W. Dietz, Sanja Šćepanović, Ke Zhou, André Felipe Zanella, Daniele Quercia

    Abstract: Urban parks provide significant health benefits by offering spaces and facilities for various recreational and leisure activities. However, the capacity of specific park spaces and elements to foster health remains underexamined. Traditional studies have focused on parks' size, greenery, and accessibility, often overlooking their ability to facilitate specific health-promoting activities. To addre…

    Submitted 22 July, 2024; originally announced July 2024.

    Comments: 29 pages main paper, 10 pages appendix

  17. arXiv:2407.13328  [pdf, other]

    cs.CV

    Unsupervised Domain Adaptive Lane Detection via Contextual Contrast and Aggregation

    Authors: Kunyang Zhou, Yunjian Feng, Jun Li

    Abstract: This paper focuses on two crucial issues in domain-adaptive lane detection, i.e., how to effectively learn discriminative features and transfer knowledge across domains. Existing lane detection methods usually exploit a pixel-wise cross-entropy loss to train detection models. However, the loss ignores the difference in feature representation among lanes, which leads to inefficient feature learning…

    Submitted 18 July, 2024; originally announced July 2024.

  18. arXiv:2407.11550  [pdf, other]

    cs.CL cs.AI

    Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference

    Authors: Yuan Feng, Junlin Lv, Yukun Cao, Xike Xie, S. Kevin Zhou

    Abstract: Large Language Models have excelled in various fields but encounter challenges in memory and time efficiency due to the expanding Key-Value (KV) cache required for long-sequence inference. Recent efforts try to reduce KV cache size to a given memory budget by evicting vast non-critical cache elements during runtime, while preserving generation quality. Our revisiting of current eviction methods re…

    Submitted 16 August, 2024; v1 submitted 16 July, 2024; originally announced July 2024.

  19. arXiv:2407.10707  [pdf, other]

    cs.CV

    Interactive Rendering of Relightable and Animatable Gaussian Avatars

    Authors: Youyi Zhan, Tianjia Shao, He Wang, Yin Yang, Kun Zhou

    Abstract: Creating relightable and animatable avatars from multi-view or monocular videos is a challenging task for digital human creation and virtual reality applications. Previous methods rely on neural radiance fields or ray tracing, resulting in slow training and rendering processes. By utilizing Gaussian Splatting, we propose a simple and efficient method to decouple body materials and lighting from sp…

    Submitted 15 July, 2024; originally announced July 2024.

  20. arXiv:2407.10275  [pdf, other]

    cs.CL cs.AI

    Cross-Lingual Multi-Hop Knowledge Editing -- Benchmarks, Analysis and a Simple Contrastive Learning based Approach

    Authors: Aditi Khandelwal, Harman Singh, Hengrui Gu, Tianlong Chen, Kaixiong Zhou

    Abstract: Large language models are often expected to constantly adapt to new sources of knowledge and knowledge editing techniques aim to efficiently patch the outdated model knowledge, with minimal modification. Most prior works focus on monolingual knowledge editing in English, even though new information can emerge in any language from any part of the world. We propose the Cross-Lingual Multi-Hop Knowle…

    Submitted 14 July, 2024; originally announced July 2024.

    Comments: Paper on Cross-Lingual Multi-Hop Knowledge Editing

  21. Pattern Guided UV Recovery for Realistic Video Garment Texturing

    Authors: Youyi Zhan, Tuanfeng Y. Wang, Tianjia Shao, Kun Zhou

    Abstract: The fast growth of E-Commerce creates a global market worth USD 821 billion for online fashion shopping. What is unique about fashion presentation is that the same design can usually be offered with different cloth textures. However, only real video capturing or manual per-frame editing can be used for virtual showcase on the same design with different textures, both of which are heavily labor inte…

    Submitted 14 July, 2024; originally announced July 2024.

    Comments: Accepted to IEEE Transactions on Visualization and Computer Graphics

  22. arXiv:2407.07950  [pdf, other]

    cs.CL cs.AI cs.HC

    Rel-A.I.: An Interaction-Centered Approach To Measuring Human-LM Reliance

    Authors: Kaitlyn Zhou, Jena D. Hwang, Xiang Ren, Nouha Dziri, Dan Jurafsky, Maarten Sap

    Abstract: The reconfiguration of human-LM interactions from simple sentence completions to complex, multi-domain, humanlike engagements necessitates new methodologies to understand how humans choose to rely on LMs. In our work, we contend that reliance is influenced by numerous factors within the interactional context of a generation, a departure from prior work that used verbalized confidence (e.g., "I'm c…

    Submitted 10 July, 2024; originally announced July 2024.

    Comments: Preprint

  23. arXiv:2407.07406  [pdf, other]

    cs.CV cs.AI

    Weakly-supervised Medical Image Segmentation with Gaze Annotations

    Authors: Yuan Zhong, Chenhui Tang, Yumeng Yang, Ruoxi Qi, Kang Zhou, Yuqi Gong, Pheng Ann Heng, Janet H. Hsiao, Qi Dou

    Abstract: Eye gaze that reveals human observational patterns has increasingly been incorporated into solutions for vision tasks. Despite recent explorations on leveraging gaze to aid deep networks, few studies exploit gaze as an efficient annotation approach for medical image segmentation which typically entails heavy annotating costs. In this paper, we propose to collect dense weak supervision for medical…

    Submitted 10 July, 2024; originally announced July 2024.

    Comments: MICCAI 2024

  24. arXiv:2407.06168  [pdf, other]

    cs.RO cs.CV

    TARGO: Benchmarking Target-driven Object Grasping under Occlusions

    Authors: Yan Xia, Ran Ding, Ziyuan Qin, Guanqi Zhan, Kaichen Zhou, Long Yang, Hao Dong, Daniel Cremers

    Abstract: Recent advances in predicting 6D grasp poses from a single depth image have led to promising performance in robotic grasping. However, previous grasping models face challenges in cluttered environments where nearby objects impact the target object's grasp. In this paper, we first establish a new benchmark dataset for TARget-driven Grasping under Occlusions, named TARGO. We make the following contr…

    Submitted 8 July, 2024; originally announced July 2024.

    Comments: 19 pages, 17 figures

  25. arXiv:2407.05563  [pdf, other]

    cs.CL

    LLMBox: A Comprehensive Library for Large Language Models

    Authors: Tianyi Tang, Yiwen Hu, Bingqian Li, Wenyang Luo, Zijing Qin, Haoxiang Sun, Jiapeng Wang, Shiyi Xu, Xiaoxue Cheng, Geyang Guo, Han Peng, Bowen Zheng, Yiru Tang, Yingqian Min, Yushuo Chen, Jie Chen, Yuanqian Zhao, Luran Ding, Yuhao Wang, Zican Dong, Chunxuan Xia, Junyi Li, Kun Zhou, Wayne Xin Zhao, Ji-Rong Wen

    Abstract: To facilitate the research on large language models (LLMs), this paper presents a comprehensive and unified library, LLMBox, to ease the development, use, and evaluation of LLMs. This library is featured with three main merits: (1) a unified data interface that supports the flexible implementation of various training strategies, (2) a comprehensive evaluation that covers extensive tasks, datasets,…

    Submitted 7 July, 2024; originally announced July 2024.

    Comments: Accepted by ACL 2024 Demo

  26. arXiv:2407.04055  [pdf, other]

    q-bio.QM cs.AI cs.LG

    Benchmark on Drug Target Interaction Modeling from a Structure Perspective

    Authors: Xinnan Zhang, Jialin Wu, Junyi Xie, Tianlong Chen, Kaixiong Zhou

    Abstract: The prediction modeling of drug-target interactions is crucial to drug discovery and design, which has seen rapid advancements owing to deep learning technologies. Recently developed methods, such as those based on graph neural networks (GNNs) and Transformers, demonstrate exceptional performance across various datasets by effectively extracting structural information. However, the benchmarking of…

    Submitted 4 July, 2024; originally announced July 2024.

    Comments: Submitted to NIPS 2024 Dataset and Benchmark

  27. arXiv:2407.01697  [pdf, other]

    cs.CL cs.AI cs.HC

    NLPGuard: A Framework for Mitigating the Use of Protected Attributes by NLP Classifiers

    Authors: Salvatore Greco, Ke Zhou, Licia Capra, Tania Cerquitelli, Daniele Quercia

    Abstract: AI regulations are expected to prohibit machine learning models from using sensitive attributes during training. However, the latest Natural Language Processing (NLP) classifiers, which rely on deep learning, operate as black-box systems, complicating the detection and remediation of such misuse. Traditional bias mitigation methods in NLP aim for comparable performance across different groups base…

    Submitted 1 July, 2024; originally announced July 2024.

    Comments: Paper accepted at CSCW 2024

  28. arXiv:2407.01595  [pdf, other]

    cs.LG cs.CY cs.SE

    Fairpriori: Improving Biased Subgroup Discovery for Deep Neural Network Fairness

    Authors: Kacy Zhou, Jiawen Wen, Nan Yang, Dong Yuan, Qinghua Lu, Huaming Chen

    Abstract: While deep learning has become a core functional module of most software systems, concerns regarding the fairness of ML predictions have emerged as a significant issue that affects prediction results due to discrimination. Intersectional bias, which disproportionately affects members of subgroups, is a prime example of this. For instance, a machine learning model might exhibit bias against darker-…

    Submitted 24 June, 2024; originally announced July 2024.

    Comments: 11 pages

  29. arXiv:2407.00983  [pdf, other]

    cs.CV

    FairMedFM: Fairness Benchmarking for Medical Imaging Foundation Models

    Authors: Ruinan Jin, Zikang Xu, Yuan Zhong, Qiongsong Yao, Qi Dou, S. Kevin Zhou, Xiaoxiao Li

    Abstract: The advent of foundation models (FMs) in healthcare offers unprecedented opportunities to enhance medical diagnostics through automated classification and segmentation tasks. However, these models also raise significant concerns about their fairness, especially when applied to diverse and underrepresented populations in healthcare applications. Currently, there is a lack of comprehensive benchmark…

    Submitted 3 July, 2024; v1 submitted 1 July, 2024; originally announced July 2024.

    Comments: 29 pages, 17 figures

  30. arXiv:2407.00632  [pdf, other]

    cs.RO cs.CL cs.CV cs.MA

    CAMON: Cooperative Agents for Multi-Object Navigation with LLM-based Conversations

    Authors: Pengying Wu, Yao Mu, Kangjie Zhou, Ji Ma, Junting Chen, Chang Liu

    Abstract: Visual navigation tasks are critical for household service robots. As these tasks become increasingly complex, effective communication and collaboration among multiple robots become imperative to ensure successful completion. In recent years, large language models (LLMs) have exhibited remarkable comprehension and planning abilities in the context of embodied agents. However, their application in…

    Submitted 30 June, 2024; originally announced July 2024.

    Comments: Accepted to the RSS 2024 Workshop: GROUND

  31. arXiv:2406.19853  [pdf, other]

    cs.CL cs.AI

    YuLan: An Open-source Large Language Model

    Authors: Yutao Zhu, Kun Zhou, Kelong Mao, Wentong Chen, Yiding Sun, Zhipeng Chen, Qian Cao, Yihan Wu, Yushuo Chen, Feng Wang, Lei Zhang, Junyi Li, Xiaolei Wang, Lei Wang, Beichen Zhang, Zican Dong, Xiaoxue Cheng, Yuhan Chen, Xinyu Tang, Yupeng Hou, Qiangqiang Ren, Xincheng Pang, Shufang Xie, Wayne Xin Zhao, Zhicheng Dou, et al. (13 additional authors not shown)

    Abstract: Large language models (LLMs) have become the foundation of many applications, leveraging their extensive capabilities in processing and understanding natural language. While many open-source LLMs have been released with technical reports, the lack of training details hinders further research and development. This paper presents the development of YuLan, a series of open-source LLMs with $12$ billi…

    Submitted 28 June, 2024; originally announced June 2024.

  32. arXiv:2406.14129  [pdf, other]

    cs.CV cs.CL cs.MM

    Towards Event-oriented Long Video Understanding

    Authors: Yifan Du, Kun Zhou, Yuqi Huo, Yifan Li, Wayne Xin Zhao, Haoyu Lu, Zijia Zhao, Bingning Wang, Weipeng Chen, Ji-Rong Wen

    Abstract: With the rapid development of video Multimodal Large Language Models (MLLMs), numerous benchmarks have been proposed to assess their video understanding capability. However, due to the lack of rich events in the videos, these datasets may suffer from the short-cut bias that the answers can be deduced from a few frames, without the need to watch the entire video. To address this issue, we introduce…

    Submitted 20 June, 2024; originally announced June 2024.

    Comments: Work in progress

  33. arXiv:2406.13137  [pdf, other]

    cs.LG

    Efficient Sharpness-Aware Minimization for Molecular Graph Transformer Models

    Authors: Yili Wang, Kaixiong Zhou, Ninghao Liu, Ying Wang, Xin Wang

    Abstract: Sharpness-aware minimization (SAM) has received increasing attention in computer vision since it can effectively eliminate the sharp local minima from the training trajectory and mitigate generalization degradation. However, SAM requires two sequential gradient computations during the optimization of each step: one to obtain the perturbation gradient and the other to obtain the updating gradient.…

    Submitted 18 June, 2024; originally announced June 2024.

  34. arXiv:2406.12606  [pdf, other]

    cs.CL

    Low-Redundant Optimization for Large Language Model Alignment

    Authors: Zhipeng Chen, Kun Zhou, Wayne Xin Zhao, Jingyuan Wang, Ji-Rong Wen

    Abstract: Large language models (LLMs) are still struggling in aligning with human preference in complex tasks and scenarios. They are prone to overfit into the unexpected patterns or superficial styles in the training data. We conduct an empirical study that only selects the top-10\% most updated parameters in LLMs for alignment training, and see improvements in the convergence process and final performanc…

    Submitted 18 June, 2024; originally announced June 2024.

    Comments: 14 pages, work in progress

  35. arXiv:2406.11859  [pdf]

    cs.CY cs.AI cs.ET cs.HC

    "Sora is Incredible and Scary": Emerging Governance Challenges of Text-to-Video Generative AI Models

    Authors: Kyrie Zhixuan Zhou, Abhinav Choudhry, Ece Gumusel, Madelyn Rose Sanfilippo

    Abstract: Text-to-video generative AI models such as OpenAI's Sora have the potential to disrupt multiple industries. In this paper, we report a qualitative social media analysis aiming to uncover people's perceived impact of and concerns about Sora's integration. We collected and analyzed comments (N=292) under popular posts about Sora-generated videos, comparison between Sora videos and Midjourney images, a…

    Submitted 9 April, 2024; originally announced June 2024.

  36. arXiv:2406.11548  [pdf, other]

    cs.RO cs.AI cs.CV

    AIC MLLM: Autonomous Interactive Correction MLLM for Robust Robotic Manipulation

    Authors: Chuyan Xiong, Chengyu Shen, Xiaoqi Li, Kaichen Zhou, Jiaming Liu, Ruiping Wang, Hao Dong

    Abstract: The ability to reflect on and correct failures is crucial for robotic systems to interact stably with real-life objects. Observing the generalization and reasoning capabilities of Multimodal Large Language Models (MLLMs), previous approaches have aimed to utilize these models to enhance robotic systems accordingly. However, these methods typically focus on high-level planning corrections using an ad…

    Submitted 23 June, 2024; v1 submitted 17 June, 2024; originally announced June 2024.

  37. arXiv:2406.10522  [pdf, other]

    cs.LG cs.AI cs.CL

    Humor in AI: Massive Scale Crowd-Sourced Preferences and Benchmarks for Cartoon Captioning

    Authors: Jifan Zhang, Lalit Jain, Yang Guo, Jiayi Chen, Kuan Lok Zhou, Siddharth Suresh, Andrew Wagenmaker, Scott Sievert, Timothy Rogers, Kevin Jamieson, Robert Mankoff, Robert Nowak

    Abstract: We present a novel multimodal preference dataset for creative tasks, consisting of over 250 million human ratings on more than 2.2 million captions, collected through crowdsourcing rating data for The New Yorker's weekly cartoon caption contest over the past eight years. This unique dataset supports the development and evaluation of multimodal large language models and preference-based fine-tuning…

    Submitted 15 June, 2024; originally announced June 2024.

  38. arXiv:2406.07471  [pdf, other]

    cs.CV

    OphNet: A Large-Scale Video Benchmark for Ophthalmic Surgical Workflow Understanding

    Authors: Ming Hu, Peng Xia, Lin Wang, Siyuan Yan, Feilong Tang, Zhongxing Xu, Yimin Luo, Kaimin Song, Jurgen Leitner, Xuelian Cheng, Jun Cheng, Chi Liu, Kaijing Zhou, Zongyuan Ge

    Abstract: Surgical scene perception via videos is critical for advancing robotic surgery, telesurgery, and AI-assisted surgery, particularly in ophthalmology. However, the scarcity of diverse and richly annotated video datasets has hindered the development of intelligent systems for surgical workflow analysis. Existing datasets face challenges such as small scale, lack of diversity in surgery and phase cate…

    Submitted 19 July, 2024; v1 submitted 11 June, 2024; originally announced June 2024.

    Comments: Accepted by ECCV 2024

  39. arXiv:2406.06564  [pdf, ps, other]

    cs.LG cs.AI cs.CL

    Revolutionizing Large Language Model Training through Dynamic Parameter Adjustment

    Authors: Kaiye Zhou, Shucheng Wang

    Abstract: In the era of large language models, the demand for efficient use of computational resources has become critically important. Although parameter-efficient fine-tuning techniques have achieved results comparable to full fine-tuning, their application during the pre-training phase poses significant challenges. Specifically, employing parameter-efficient strategies at the onset of pre-training can se…

    Submitted 3 June, 2024; originally announced June 2024.

    Comments: This paper introduces an innovative parameter-efficient training method that dynamically switches parameters throughout the entire training period, achieving significant memory and computational savings

  40. arXiv:2406.04637  [pdf]

    cs.HC cs.CY

    Accessible Adventures: Teaching Accessibility to High School Students Through Games

    Authors: Kyrie Zhixuan Zhou, Chunyu Liu, Jingwen Shan, Devorah Kletenik, Rachel F. Adler

    Abstract: Accessibility education has been rarely incorporated into the high school curricula. This is a missed opportunity to equip next-generation software designers and decision-makers with knowledge, awareness, and empathy regarding accessibility and disabilities. We taught accessibility to students (N=93) in a midwestern high school through empathy-driven games and interviewed three Computer Science hi…

    Submitted 7 June, 2024; originally announced June 2024.

    Comments: 87th Annual Meeting of the Association for Information Science and Technology (ASIS&T)

  41. arXiv:2406.04339  [pdf, other]

    cs.CV

    RoboMamba: Multimodal State Space Model for Efficient Robot Reasoning and Manipulation

    Authors: Jiaming Liu, Mengzhen Liu, Zhenyu Wang, Lily Lee, Kaichen Zhou, Pengju An, Senqiao Yang, Renrui Zhang, Yandong Guo, Shanghang Zhang

    Abstract: A fundamental objective in robot manipulation is to enable models to comprehend visual scenes and execute actions. Although existing robot Multimodal Large Language Models (MLLMs) can handle a range of basic tasks, they still face challenges in two areas: 1) inadequate reasoning ability to tackle complex tasks, and 2) high computational costs for MLLM fine-tuning and inference. The recently propos…

    Submitted 6 June, 2024; originally announced June 2024.

  42. arXiv:2406.04301  [pdf, other]

    cs.CV

    Neural Surface Reconstruction from Sparse Views Using Epipolar Geometry

    Authors: Kaichen Zhou

    Abstract: This paper addresses the challenge of reconstructing surfaces from sparse view inputs, where ambiguity and occlusions due to missing information pose significant hurdles. We present a novel approach, named EpiS, that incorporates Epipolar information into the reconstruction process. Existing methods in sparse-view neural surface learning have mainly focused on mean and variance considerations usin…

    Submitted 6 June, 2024; originally announced June 2024.

  43. arXiv:2406.02970  [pdf, ps, other]

    math.PR cs.LG math.OC

    Which exceptional low-dimensional projections of a Gaussian point cloud can be found in polynomial time?

    Authors: Andrea Montanari, Kangjie Zhou

    Abstract: Given $d$-dimensional standard Gaussian vectors $\boldsymbol{x}_1,\dots, \boldsymbol{x}_n$, we consider the set of all empirical distributions of its $m$-dimensional projections, for $m$ a fixed constant. Diaconis and Freedman (1984) proved that, if $n/d\to \infty$, all such distributions converge to the standard Gaussian distribution. In contrast, we study the proportional asymptotics, whereby…

    Submitted 5 June, 2024; originally announced June 2024.

    Comments: 83 pages

  44. arXiv:2406.02009  [pdf, other]

    eess.AS cs.CL cs.SD

    Phonetic Enhanced Language Modeling for Text-to-Speech Synthesis

    Authors: Kun Zhou, Shengkui Zhao, Yukun Ma, Chong Zhang, Hao Wang, Dianwen Ng, Chongjia Ni, Nguyen Trung Hieu, Jia Qi Yip, Bin Ma

    Abstract: Recent language model-based text-to-speech (TTS) frameworks demonstrate scalability and in-context learning capabilities. However, they suffer from robustness issues due to the accumulation of errors in speech unit predictions during autoregressive language modeling. In this paper, we propose a phonetic enhanced language modeling method to improve the performance of TTS models. We leverage self-su…

    Submitted 11 June, 2024; v1 submitted 4 June, 2024; originally announced June 2024.

    Comments: Accepted by Interspeech 2024

  45. arXiv:2406.00663  [pdf, other]

    cs.CV cs.AI cs.LG

    SimSAM: Zero-shot Medical Image Segmentation via Simulated Interaction

    Authors: Benjamin Towle, Xin Chen, Ke Zhou

    Abstract: The recently released Segment Anything Model (SAM) has shown powerful zero-shot segmentation capabilities through a semi-automatic annotation setup in which the user can provide a prompt in the form of clicks or bounding boxes. There is growing interest around applying this to medical imaging, where the cost of obtaining expert annotations is high, privacy restrictions may limit sharing of patient…

    Submitted 2 June, 2024; originally announced June 2024.

    Comments: Published at ISBI 2024. Awarded Top 12 Oral Presentation

  46. arXiv:2406.00262  [pdf, other]

    cs.LG cs.AI

    Contrastive Learning Via Equivariant Representation

    Authors: Sifan Song, Jinfeng Wang, Qiaochu Zhao, Xiang Li, Dufan Wu, Angelos Stefanidis, Jionglong Su, S. Kevin Zhou, Quanzheng Li

    Abstract: Invariant-based Contrastive Learning (ICL) methods have achieved impressive performance across various domains. However, the absence of latent space representation for distortion (augmentation)-related information in the latent space makes ICL sub-optimal regarding training efficiency and robustness in downstream tasks. Recent studies suggest that introducing equivariance into Contrastive Learning…

    Submitted 31 May, 2024; originally announced June 2024.

    Comments: Preprint. Under review

  47. arXiv:2405.20072  [pdf, other]

    cs.CV

    Faces of the Mind: Unveiling Mental Health States Through Facial Expressions in 11,427 Adolescents

    Authors: Xiao Xu, Keyin Zhou, Yan Zhang, Yang Wang, Fei Wang, Xizhe Zhang

    Abstract: Mood disorders, including depression and anxiety, often manifest through facial expressions. While previous research has explored the connection between facial features and emotions, machine learning algorithms for estimating mood disorder severity have been hindered by small datasets and limited real-world application. To address this gap, we analyzed facial videos of 11,427 participants, a datas…

    Submitted 30 May, 2024; originally announced May 2024.

  48. arXiv:2405.19188  [pdf, other]

    cs.HC

    Personalized Interiors at Scale: Leveraging AI for Efficient and Customizable Design Solutions

    Authors: Kaiwen Zhou, Tianyu Wang

    Abstract: In this paper, we introduce an innovative application of artificial intelligence in the realm of interior design through the integration of Stable Diffusion and Dreambooth models. This paper explores the potential of these advanced generative models to streamline and democratize the process of room interior generation, offering a significant departure from conventional, labor-intensive techniques.…

    Submitted 29 May, 2024; originally announced May 2024.

    Comments: 18 pages, 4 figures

  49. arXiv:2405.17418  [pdf, other]

    cs.CV

    Self-Corrected Multimodal Large Language Model for End-to-End Robot Manipulation

    Authors: Jiaming Liu, Chenxuan Li, Guanqun Wang, Lily Lee, Kaichen Zhou, Sixiang Chen, Chuyan Xiong, Jiaxin Ge, Renrui Zhang, Shanghang Zhang

    Abstract: Robot manipulation policies have shown unsatisfactory action performance when confronted with novel task or object instances. Hence, the capability to automatically detect and self-correct failure action is essential for a practical robotic system. Recently, Multimodal Large Language Models (MLLMs) have shown promise in visual instruction following and demonstrated strong reasoning abilities in va…

    Submitted 27 May, 2024; originally announced May 2024.

  50. arXiv:2405.15660  [pdf, other]

    cs.CV

    Low-Light Video Enhancement via Spatial-Temporal Consistent Illumination and Reflection Decomposition

    Authors: Xiaogang Xu, Kun Zhou, Tao Hu, Ruixing Wang, Hujun Bao

    Abstract: Low-Light Video Enhancement (LLVE) seeks to restore dynamic and static scenes plagued by severe invisibility and noise. One critical aspect is formulating a consistency constraint specifically for temporal-spatial illumination and appearance enhanced versions, a dimension overlooked in existing methods. In this paper, we present an innovative video Retinex-based decomposition strategy that operate…

    Submitted 24 May, 2024; originally announced May 2024.