Search | arXiv e-print repository

QuadraNet: Improving High-Order Neural Interaction Efficiency with Hardware-Aware Quadratic Neural Networks

Authors: Chenhui Xu, Fuxun Yu, Zirui Xu, Chenchen Liu, Jinjun Xiong, Xiang Chen

Abstract: Recent progress in computer vision-oriented neural network designs is mostly driven by capturing high-order neural interactions among inputs and features. A variety of approaches have emerged to accomplish this, such as Transformers and their variants. However, these interactions generate a large amount of intermediate state and/or strong data dependency, leading to considerable memory consumption and computing cost, and therefore compromising the overall runtime performance. To address this challenge, we rethink the high-order interactive neural network design with a quadratic computing approach. Specifically, we propose QuadraNet -- a comprehensive model design methodology from neuron reconstruction to structural block and eventually to the overall neural network implementation. Leveraging quadratic neurons' intrinsic high-order advantages and dedicated computation optimization schemes, QuadraNet could effectively achieve optimal cognition and computation performance. Incorporating state-of-the-art hardware-aware neural architecture search and system integration techniques, QuadraNet could also be well generalized in different hardware constraint settings and deployment scenarios. The experiment shows that QuadraNet achieves up to 1.5$\times$ throughput, 30% less memory footprint, and similar cognition performance, compared with the state-of-the-art high-order approaches.

Submitted 29 November, 2023; originally announced November 2023.

Comments: ASP-DAC 2024 Best Paper Nomination

arXiv:2311.16434 [pdf, ps, other]

Observational signature of continuously operating drivers of decayless kink oscillation

Authors: Dong Li, Zhentong Li, Fanpeng Shi, Yang Su, Wei Chen, Fu Yu, Chuan Li, Ye Qiu, Yu Huang, Zongjun Ning

Abstract: Decayless kink oscillations, which are nearly omnipresent in the solar corona, are believed to be driven by continuously operating energy supply. In this letter, we investigate an external continuous excitation of an apparent decayless oscillation during an X1.1 flare on June 20, 2023 (SOL2023-06-20T16:42). The decayless kink oscillation was identified in the coronal loop at extreme ultraviolet (EUV) wavelengths and the associated flare quasi-periodic pulsations (QPPs) were simultaneously observed in passbands of hard X-ray (HXR), microwave, and ultraviolet (UV) emissions. The kink oscillation is detected as a transverse oscillation of the coronal loop, which reveals five apparent cycles with an average period of about 130 ± 10 s. The oscillation amplitude does not show any significant decay, suggesting a decayless oscillation. At the same time, the solar flare occurs in the vicinity of the oscillating loop and exhibits five main pulses in HXR, microwave, and UV emissions, which could be regarded as flare QPPs. They have similar periods of about 100-130 s, which may indicate successive and repetitive energy releases during the flare impulsive phase. The peak of each loop oscillation cycle appears to follow the pulse of the QPPs, suggesting that the transverse oscillation is closely associated with flare QPPs. Our observations support the scenario where the repetitive energy released following flare QPPs could be invoked as external, continuously operating drivers of the apparent decayless kink oscillation.

Submitted 27 November, 2023; originally announced November 2023.

Comments: accepted by A&A

arXiv:2311.13233 [pdf, other]

A Survey of Adversarial CAPTCHAs on its History, Classification and Generation

Authors: Zisheng Xu, Qiao Yan, F. Richard Yu, Victor C. M. Leung

Abstract: Completely Automated Public Turing test to tell Computers and Humans Apart, short for CAPTCHA, is an essential and relatively easy way to defend against malicious attacks implemented by bots. The security and usability trade-off limits the use of massive geometric transformations to interfere with deep model recognition, and deep models have even outperformed humans in complex CAPTCHAs. The discovery of adversarial examples provides an ideal solution to the security and usability trade-off by integrating adversarial examples and CAPTCHAs to generate adversarial CAPTCHAs that can fool the deep models. In this paper, we extend the definition of adversarial CAPTCHAs and propose a classification method for adversarial CAPTCHAs. Then we systematically review some commonly used methods to generate adversarial examples and methods that are successfully used to generate adversarial CAPTCHAs. Also, we analyze some defense methods that can be used to defend against adversarial CAPTCHAs, indicating potential threats to adversarial CAPTCHAs. Finally, we discuss some possible future research directions for adversarial CAPTCHAs.

Submitted 22 November, 2023; originally announced November 2023.

Comments: Submitted to ACM Computing Surveys (Under Review)

arXiv:2311.12345 [pdf, other]

Stable Diffusion For Aerial Object Detection

Authors: Yanan Jian, Fuxun Yu, Simranjit Singh, Dimitrios Stamoulis

Abstract: Aerial object detection is a challenging task, in which one major obstacle lies in the limitations of large-scale data collection and the long-tail distribution of certain classes. Synthetic data offers a promising solution, especially with recent advances in diffusion-based methods like stable diffusion (SD). However, the direct application of diffusion methods to aerial domains poses unique challenges: stable diffusion's optimization for rich ground-level semantics doesn't align with the sparse nature of aerial objects, and the extraction of post-synthesis object coordinates remains problematic. To address these challenges, we introduce a synthetic data augmentation framework tailored for aerial images. It encompasses sparse-to-dense region of interest (ROI) extraction to bridge the semantic gap, fine-tuning the diffusion model with low-rank adaptation (LORA) to circumvent exhaustive retraining, and finally, a Copy-Paste method to compose synthesized objects with backgrounds, providing a nuanced approach to aerial object detection through synthetic data.

Submitted 20 November, 2023; originally announced November 2023.

Comments: Accepted at NeurIPS 2023 Synthetic Data Generation with Generative AI workshop

arXiv:2311.10117 [pdf, other]

Automatic Engineering of Long Prompts

Authors: Cho-Jui Hsieh, Si Si, Felix X. Yu, Inderjit S. Dhillon

Abstract: Large language models (LLMs) have demonstrated remarkable capabilities in solving complex open-domain tasks, guided by comprehensive instructions and demonstrations provided in the form of prompts. However, these prompts can be lengthy, often comprising hundreds of lines and thousands of tokens, and their design often requires considerable human effort. Recent research has explored automatic prompt engineering for short prompts, typically consisting of one or a few sentences. However, the automatic design of long prompts remains a challenging problem due to its immense search space. In this paper, we investigate the performance of greedy algorithms and genetic algorithms for automatic long prompt engineering. We demonstrate that a simple greedy approach with beam search outperforms other methods in terms of search efficiency. Moreover, we introduce two novel techniques that utilize search history to enhance the effectiveness of LLM-based mutation in our search algorithm. Our results show that the proposed automatic long prompt engineering algorithm achieves an average of 9.2% accuracy gain on eight tasks in Big Bench Hard, highlighting the significance of automating prompt designs to fully harness the capabilities of LLMs.

Submitted 16 November, 2023; originally announced November 2023.

arXiv:2311.09724 [pdf, other]

OVM, Outcome-supervised Value Models for Planning in Mathematical Reasoning

Authors: Fei Yu, Anningzhe Gao, Benyou Wang

Abstract: Large language models (LLMs) often struggle with maintaining accuracy throughout multiple reasoning steps, especially in mathematical reasoning where an error in earlier steps can propagate to subsequent ones, ultimately leading to an incorrect answer. To reduce error propagation, guided decoding is employed to direct the LM decoding on a step-by-step basis. We argue that in guided decoding, assessing the potential of an incomplete reasoning path can be more advantageous than simply ensuring per-step correctness, as the former approach leads towards a correct final answer. This transforms the task into a $\textit{value estimation}$ problem in planning. Inspired by the findings that $\textit{outcome supervision for guided decoding essentially acts as a value model}$, we propose Outcome-supervised Value Model (OVM) that employs outcome supervision for training a value model, which prioritizes steps that lead to accurate conclusions. Furthermore, the OVM eliminates the need for labor-intensive annotations of step-level correctness, thereby significantly enhancing its scalability. Our experiments on two multi-step mathematical reasoning datasets, GSM8K and Game of 24, demonstrate the superior performance of the OVM model. Notably, in GSM8K, our $\textbf{OVM-7B model achieves state-of-the-art results among LLMs up to 13B parameters}$; in particular, it does not utilize GPT-4 or code execution. These findings offer a novel perspective on the role of outcome supervision in training value models for multi-step reasoning tasks and provide theoretical justification for its advantage in value estimation for guided decoding.

Submitted 1 April, 2024; v1 submitted 16 November, 2023; originally announced November 2023.

Comments: Accepted to NAACL Findings. https://github.com/FreedomIntelligence/OVM

arXiv:2311.04471 [pdf, ps, other]

Multiple blowing-up solutions for a slightly critical Lane-Emden system with non-power nonlinearity

Authors: Shengbing Deng, Fang Yu

Abstract: In this paper, we study the following Lane-Emden system with nearly critical non-power nonlinearity \begin{eqnarray*} \left\{ \arraycolsep=1.5pt \begin{array}{lll} -Δu =\frac{|v|^{p-1}v}{[\ln(e+|v|)]^ε}\ \ &{\rm in}\ Ω, \\[2mm] -Δv =\frac{|u|^{q-1}u}{[\ln(e+|u|)]^ε}\ \ &{\rm in}\ Ω, \\[2mm] u= v=0 \ \ & {\rm on}\ \partialΩ, \end{array} \right. \end{eqnarray*} where $Ω$ is a bounded smooth domain in $\mathbb{R}^N$, $N\geq 3$, $ε>0$ is a small parameter, and $p$ and $q$ lie on the critical Sobolev hyperbola $\frac{1}{p+1}+\frac{1}{q+1}=\frac{N-2}{N}$. We construct multiple blowing-up solutions based on the finite dimensional Lyapunov-Schmidt reduction method as $ε$ goes to zero.

Submitted 8 November, 2023; originally announced November 2023.

arXiv:2310.20242 [pdf, other]

Intelligent-Reflecting-Surface-Assisted UAV Communications for 6G Networks

Authors: Zhaolong Ning, Tengfeng Li, Yu Wu, Xiaojie Wang, Qingqing Wu, Fei Richard Yu, Song Guo

Abstract: In 6th-Generation (6G) mobile networks, Intelligent Reflective Surfaces (IRSs) and Unmanned Aerial Vehicles (UAVs) have emerged as promising technologies to address the coverage difficulties and resource constraints faced by terrestrial networks. UAVs, with their mobility and low costs, offer diverse connectivity options for mobile users and a novel deployment paradigm for 6G networks. However, the limited battery capacity of UAVs, dynamic and unpredictable channel environments, and communication resource constraints result in poor performance of traditional UAV-based networks. IRSs can not only reconstruct the wireless environment in a unique way, but also achieve wireless network relay in a cost-effective manner. Hence, they have received significant attention as a promising solution to the above challenges. In this article, we conduct a comprehensive survey on IRS-assisted UAV communications for 6G networks. First, primary issues, key technologies, and application scenarios of IRS-assisted UAV communications for 6G networks are introduced. Then, we put forward specific solutions to the issues of IRS-assisted UAV communications. Finally, we discuss some open issues and future research directions to guide researchers in related fields.

Submitted 31 October, 2023; originally announced October 2023.

arXiv:2310.19617 [pdf, other]

Data-driven Modeling of a Coronal Magnetic Flux Rope: from Birth to Death

Authors: J. H. Guo, Y. W. Ni, Y. Guo, C. Xia, B. Schmieder, S. Poedts, Z. Zhong, Y. H. Zhou, F. Yu, P. F. Chen

Abstract: Magnetic flux ropes are bundles of twisted magnetic field lines produced by internal electric currents, which are responsible for solar eruptions and are the major drivers of geomagnetic storms. As such, it is crucial to develop a numerical model that can capture the entire evolution of a flux rope, from its birth to death, in order to predict whether adverse space weather events might occur or not. In this paper, we develop a data-driven model that combines a time-dependent magneto-frictional approach with a thermodynamic magnetohydrodynamic model. Our numerical modeling successfully reproduces the formation and confined eruption of an observed flux rope, and unveils the physical details behind the observations. Regarding the long-term evolution of the active region, our simulation results indicate that the flux cancellation due to collisional shearing plays a critical role in the formation of the flux rope, corresponding to a substantial increase in magnetic free energy and helicity. Regarding the eruption stage, the deformation of the flux rope during its eruption can cause an increase in the downward tension force, which suppresses it from further rising. This finding may shed light on why some torus-unstable flux ropes lead to failed eruptions after large-angle rotations. Moreover, we find that twisted fluxes can accumulate during the confined eruptions, which would breed the subsequent eruptive flares.

Submitted 30 October, 2023; originally announced October 2023.

Comments: 30 pages, 10 figures, Accepted for ApJ

arXiv:2310.17944 [pdf, other]

A Survey on Trustworthy Edge Intelligence: From Security and Reliability To Transparency and Sustainability

Authors: Xiaojie Wang, Beibei Wang, Yu Wu, Zhaolong Ning, Song Guo, Fei Richard Yu

Abstract: Edge Intelligence (EI) integrates Edge Computing (EC) and Artificial Intelligence (AI) to push the capabilities of AI to the network edge for real-time, efficient and secure intelligent decision-making and computation. However, EI faces various challenges due to resource constraints, heterogeneous network environments, and diverse service requirements of different applications, which together affect the trustworthiness of EI in the eyes of stakeholders. This survey comprehensively summarizes the characteristics, architecture, technologies, and solutions of trustworthy EI. Specifically, we first emphasize the need for trustworthy EI in the context of the trend toward large models. We then provide an initial definition of trustworthy EI, explore its key characteristics and give a multi-layered architecture for trustworthy EI. Then, we summarize several important issues that hinder the achievement of trustworthy EI. Subsequently, we present enabling technologies for trustworthy EI systems and provide an in-depth literature review of the state-of-the-art solutions for realizing the trustworthiness of EI. Finally, we discuss the corresponding research challenges and open issues.

Submitted 25 January, 2024; v1 submitted 27 October, 2023; originally announced October 2023.

Comments: 25 pages, 6 figures, 8 tables

arXiv:2310.17784 [pdf, other]

Data-Centric Financial Large Language Models

Authors: Zhixuan Chu, Huaiyu Guo, Xinyuan Zhou, Yijia Wang, Fei Yu, Hong Chen, Wanqing Xu, Xin Lu, Qing Cui, Longfei Li, Jun Zhou, Sheng Li

Abstract: Large language models (LLMs) show promise for natural language tasks but struggle when applied directly to complex domains like finance. LLMs have difficulty reasoning about and integrating all relevant information. We propose a data-centric approach to enable LLMs to better handle financial tasks. Our key insight is that rather than overloading the LLM with everything at once, it is more effective to preprocess and pre-understand the data. We create a financial LLM (FLLM) using multitask prompt-based finetuning to achieve data pre-processing and pre-understanding. However, labeled data is scarce for each task. To overcome manual annotation costs, we employ abductive augmentation reasoning (AAR) to automatically generate training data by modifying the pseudo labels from FLLM's own outputs. Experiments show our data-centric FLLM with AAR substantially outperforms baseline financial LLMs designed for raw text, achieving state-of-the-art on financial analysis and interpretation tasks. We also open source a new benchmark for financial analysis and interpretation. Our methodology provides a promising path to unlock LLMs' potential for complex real-world domains.

Submitted 13 November, 2023; v1 submitted 7 October, 2023; originally announced October 2023.

arXiv:2310.15301 [pdf, other]

ADMarker: A Multi-Modal Federated Learning System for Monitoring Digital Biomarkers of Alzheimer's Disease

Authors: Xiaomin Ouyang, Xian Shuai, Yang Li, Li Pan, Xifan Zhang, Heming Fu, Sitong Cheng, Xinyan Wang, Shihua Cao, Jiang Xin, Hazel Mok, Zhenyu Yan, Doris Sau Fung Yu, Timothy Kwok, Guoliang Xing

Abstract: Alzheimer's Disease (AD) and related dementia are a growing global health challenge due to the aging population. In this paper, we present ADMarker, the first end-to-end system that integrates multi-modal sensors and new federated learning algorithms for detecting multidimensional AD digital biomarkers in natural living environments. ADMarker features a novel three-stage multi-modal federated learning architecture that can accurately detect digital biomarkers in a privacy-preserving manner. Our approach collectively addresses several major real-world challenges, such as limited data labels, data heterogeneity, and limited computing resources. We built a compact multi-modality hardware system and deployed it in a four-week clinical trial involving 91 elderly participants. The results indicate that ADMarker can accurately detect a comprehensive set of digital biomarkers with up to 93.8% accuracy and identify early AD with an average of 88.9% accuracy. ADMarker offers a new platform that can allow AD clinicians to characterize and track the complex correlation between multidimensional interpretable digital biomarkers, demographic factors of patients, and AD diagnosis in a longitudinal manner.

Submitted 12 April, 2024; v1 submitted 23 October, 2023; originally announced October 2023.

arXiv:2310.15141 [pdf, other]

SpecTr: Fast Speculative Decoding via Optimal Transport

Authors: Ziteng Sun, Ananda Theertha Suresh, Jae Hun Ro, Ahmad Beirami, Himanshu Jain, Felix Yu

Abstract: Autoregressive sampling from large language models has led to state-of-the-art results in several natural language tasks. However, autoregressive sampling generates tokens one at a time, making it slow, and even prohibitive in certain tasks. One way to speed up sampling is $\textit{speculative decoding}$: use a small model to sample a $\textit{draft}$ (block or sequence of tokens), and then score all tokens in the draft by the large language model in parallel. A subset of the tokens in the draft are accepted (and the rest rejected) based on a statistical method to guarantee that the final output follows the distribution of the large model. In this work, we provide a principled understanding of speculative decoding through the lens of optimal transport (OT) with $\textit{membership cost}$. This framework can be viewed as an extension of the well-known $\textit{maximal-coupling}$ problem. This new formulation enables us to generalize the speculative decoding method to allow for a set of $k$ candidates at the token-level, which leads to an improved optimal membership cost. We show that the optimal draft selection algorithm (transport plan) can be computed via linear programming, whose best-known runtime is exponential in $k$. We then propose a valid draft selection algorithm whose acceptance probability is $(1-1/e)$-optimal multiplicatively. Moreover, it can be computed in time almost linear in the size of the domain of a single token. Using this new draft selection algorithm, we develop a new autoregressive sampling algorithm called $\textit{SpecTr}$, which provides speedup in decoding while ensuring that there is no quality degradation in the decoded output. We experimentally demonstrate that for state-of-the-art large language models, the proposed approach achieves a wall clock speedup of 2.13X, a further 1.37X speedup over speculative decoding on standard benchmarks.
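The accept/reject step mentioned in the abstract can be illustrated with a minimal sketch of the widely used single-draft rule (the maximal-coupling case), not SpecTr's $k$-candidate algorithm. The function name `speculative_step` and the toy distributions are hypothetical, chosen only for illustration: a token is drawn from the draft distribution `q`, accepted with probability `min(1, p/q)`, and otherwise replaced by a sample from the normalized residual `max(p - q, 0)`, so the output should follow the target distribution `p`.

```python
import random


def speculative_step(p, q, rng):
    """One token-level accept/reject step (single draft token).

    p: target (large-model) probabilities over the vocabulary.
    q: draft (small-model) probabilities over the same vocabulary.
    Returns (token, accepted); the token is distributed according to p.
    Hypothetical helper for illustration, not code from the paper.
    """
    vocab = list(range(len(p)))
    # Sample a draft token from the small model.
    x = rng.choices(vocab, weights=q)[0]
    # Accept the draft token with probability min(1, p(x) / q(x)).
    if rng.random() < min(1.0, p[x] / q[x]):
        return x, True
    # On rejection, resample from the residual max(p - q, 0).
    # A rejection implies p(x) < q(x), so the residual has positive mass.
    residual = [max(pi - qi, 0.0) for pi, qi in zip(p, q)]
    return rng.choices(vocab, weights=residual)[0], False


def empirical_stats(p, q, n, seed=0):
    """Run n independent steps; return (token frequencies, acceptance rate)."""
    rng = random.Random(seed)
    counts = [0] * len(p)
    accepted = 0
    for _ in range(n):
        token, ok = speculative_step(p, q, rng)
        counts[token] += 1
        accepted += ok
    return [c / n for c in counts], accepted / n
```

Under this rule the acceptance probability equals the sum over tokens of `min(p, q)`, which is the quantity the abstract's optimal-transport view generalizes to several candidates.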

Submitted 17 January, 2024; v1 submitted 23 October, 2023; originally announced October 2023.

Comments: NeurIPS 2023

arXiv:2310.13810 [pdf]

A Better Match for Drivers and Riders: Reinforcement Learning at Lyft

Authors: Xabi Azagirre, Akshay Balwally, Guillaume Candeli, Nicholas Chamandy, Benjamin Han, Alona King, Hyungjun Lee, Martin Loncaric, Sebastien Martin, Vijay Narasiman, Zhiwei Qin, Baptiste Richard, Sara Smoot, Sean Taylor, Garrett van Ryzin, Di Wu, Fei Yu, Alex Zamoshchin

Abstract: To better match drivers to riders in our ridesharing application, we revised Lyft's core matching algorithm. We use a novel online reinforcement learning approach that estimates the future earnings of drivers in real time and uses this information to find more efficient matches. This change was the first documented implementation of a ridesharing matching algorithm that can learn and improve in real time. We evaluated the new approach during weeks of switchback experimentation in most Lyft markets, and estimated how it benefited drivers, riders, and the platform. In particular, it enabled our drivers to serve millions of additional riders each year, leading to more than $30 million per year in incremental revenue. Lyft rolled out the algorithm globally in 2021.

Submitted 13 November, 2023; v1 submitted 20 October, 2023; originally announced October 2023.

arXiv:2310.12970 [pdf, other]

Real-Time Motion Prediction via Heterogeneous Polyline Transformer with Relative Pose Encoding

Authors: Zhejun Zhang, Alexander Liniger, Christos Sakaridis, Fisher Yu, Luc Van Gool

Abstract: The real-world deployment of an autonomous driving system requires its components to run on-board and in real-time, including the motion prediction module that predicts the future trajectories of surrounding traffic participants. Existing agent-centric methods have demonstrated outstanding performance on public benchmarks. However, they suffer from high computational overhead and poor scalability as the number of agents to be predicted increases. To address this problem, we introduce the K-nearest neighbor attention with relative pose encoding (KNARPE), a novel attention mechanism allowing the pairwise-relative representation to be used by Transformers. Then, based on KNARPE we present the Heterogeneous Polyline Transformer with Relative pose encoding (HPTR), a hierarchical framework enabling asynchronous token update during the online inference. By sharing contexts among agents and reusing the unchanged contexts, our approach is as efficient as scene-centric methods, while performing on par with state-of-the-art agent-centric methods. Experiments on Waymo and Argoverse-2 datasets show that HPTR achieves superior performance among end-to-end methods that do not apply expensive post-processing or model ensembling. The code is available at https://github.com/zhejz/HPTR.

Submitted 19 October, 2023; originally announced October 2023.

Comments: Accepted by NeurIPS 2023

arXiv:2310.10068 [pdf, other]

Generalizable Person Search on Open-world User-Generated Video Content

Authors: Junjie Li, Guanshuo Wang, Yichao Yan, Fufu Yu, Qiong Jia, Jie Qin, Shouhong Ding, Xiaokang Yang

Abstract: Person search is a challenging task that involves detecting and retrieving individuals from a large set of un-cropped scene images. Existing person search applications are mostly trained and deployed in the same-origin scenarios. However, collecting and annotating training samples for each scene is often difficult due to the limitation of resources and the labor cost. Moreover, large-scale intra-domain data for training are generally not legally available for common developers, due to the regulation of privacy and public security. Leveraging easily accessible large-scale User Generated Video Contents (i.e., UGC videos) to train person search models can fit the open-world distribution, but the models still suffer a performance gap caused by the domain difference from surveillance scenes. In this work, we explore enhancing the out-of-domain generalization capabilities of person search models, and propose a generalizable framework on both feature-level and data-level generalization to facilitate downstream tasks in arbitrary scenarios. Specifically, we focus on learning domain-invariant representations for both detection and ReID by introducing a multi-task prototype-based domain-specific batch normalization, and a channel-wise ID-relevant feature decorrelation strategy. We also identify and address typical sources of noise in open-world training frames, including inaccurate bounding boxes, the omission of identity labels, and the absence of cross-camera data. Our framework achieves promising performance on two challenging person search benchmarks without using any human annotation or samples from the target domain.

Submitted 16 October, 2023; originally announced October 2023.

arXiv:2310.04863 [pdf, other]

SA-Paraformer: Non-autoregressive End-to-End Speaker-Attributed ASR

Authors: Yangze Li, Fan Yu, Yuhao Liang, Pengcheng Guo, Mohan Shi, Zhihao Du, Shiliang Zhang, Lei Xie

Abstract: Joint modeling of multi-speaker ASR and speaker diarization has recently shown promising results in speaker-attributed automatic speech recognition (SA-ASR). Although they obtain state-of-the-art (SOTA) performance, most of the studies are based on an autoregressive (AR) decoder, which generates tokens one by one and results in a large real-time factor (RTF). To speed up inference, we introduce a recently proposed non-autoregressive model, Paraformer, as an acoustic model in the SA-ASR model. Paraformer uses a single-step decoder to enable parallel generation, obtaining comparable performance to the SOTA AR transformer models. Besides, we propose a speaker-filling strategy to reduce speaker identification errors and adopt an inter-CTC strategy to enhance the encoder's ability in acoustic modeling. Experiments on the AliMeeting corpus show that our model outperforms the cascaded SA-ASR model by a 6.1% relative speaker-dependent character error rate (SD-CER) reduction on the test set. Moreover, our model achieves a comparable SD-CER of 34.8% with only 1/10 RTF compared with the SOTA joint AR SA-ASR model.

Submitted 7 October, 2023; originally announced October 2023.
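
The abstract above reports two kinds of figures: a real-time factor (RTF) and a relative error-rate reduction. As a minimal sketch of how such figures are conventionally computed (the function names and the sample numbers are hypothetical and are not taken from the paper):

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent decoding / duration of the audio decoded.

    An RTF below 1.0 means the system runs faster than real time.
    """
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def relative_reduction(baseline_rate: float, new_rate: float) -> float:
    """Relative reduction of an error rate with respect to a baseline."""
    if baseline_rate <= 0:
        raise ValueError("baseline_rate must be positive")
    return (baseline_rate - new_rate) / baseline_rate


if __name__ == "__main__":
    # Hypothetical numbers, chosen only to illustrate the arithmetic.
    print(real_time_factor(processing_seconds=6.0, audio_seconds=60.0))
    print(relative_reduction(baseline_rate=40.0, new_rate=30.0))
```

A relative reduction is measured against the baseline's own error rate, so it is not the same as the difference in percentage points between the two systems.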

arXiv:2310.03006 [pdf, other]

COOLer: Class-Incremental Learning for Appearance-Based Multiple Object Tracking

Authors: Zhizheng Liu, Mattia Segu, Fisher Yu

Abstract: Continual learning allows a model to learn multiple tasks sequentially while retaining the old knowledge without the training data of the preceding tasks. This paper extends the scope of continual learning research to class-incremental learning for multiple object tracking (MOT), which is desirable to accommodate the continuously evolving needs of autonomous systems. Previous solutions for continual learning of object detectors do not address the data association stage of appearance-based trackers, leading to catastrophic forgetting of previous classes' re-identification features. We introduce COOLer, a COntrastive- and cOntinual-Learning-based tracker, which incrementally learns to track new categories while preserving past knowledge by training on a combination of currently available ground truth labels and pseudo-labels generated by the past tracker. To further enhance the disentanglement of instance representations, we introduce a novel contrastive class-incremental instance representation learning technique. Finally, we propose a practical evaluation protocol for continual learning for MOT and conduct experiments on the BDD100K and SHIFT datasets. Experimental results demonstrate that COOLer continually learns while effectively addressing catastrophic forgetting of both tracking and detection. The code is available at https://github.com/BoSmallEar/COOLer.

Submitted 5 October, 2023; v1 submitted 4 October, 2023; originally announced October 2023.

Comments: GCPR 2023 Oral

arXiv:2310.02690 [pdf, other]

Multi-Dimension-Embedding-Aware Modality Fusion Transformer for Psychiatric Disorder Clasification

Authors: Guoxin Wang, Xuyang Cao, Shan An, Fengmei Fan, Chao Zhang, Jinsong Wang, Feng Yu, Zhiren Wang

Abstract: Deep learning approaches, together with neuroimaging techniques, play an important role in psychiatric disorder classification. Previous studies on psychiatric disorder diagnosis mainly focus on using functional connectivity matrices of resting-state functional magnetic resonance imaging (rs-fMRI) as input, which does not yet fully utilize the rich temporal information of the time series of rs-fMRI data. In this work, we propose a multi-dimension-embedding-aware modality fusion transformer (MFFormer) for schizophrenia and bipolar disorder classification using rs-fMRI and T1 weighted structural MRI (T1w sMRI). Concretely, to fully utilize the temporal information of rs-fMRI and spatial information of sMRI, we construct a deep learning architecture that takes as input 2D time series of rs-fMRI and 3D T1w volumes. Furthermore, to promote intra-modality attention and information fusion across different modalities, a fusion transformer module (FTM) is designed through extensive self-attention of hybrid multi-modality feature maps. In addition, a dimension-up and dimension-down strategy is suggested to properly align multi-dimensional feature maps from different modalities. Experimental results on our private and public OpenfMRI datasets show that our proposed MFFormer performs better than that using a single modality or multi-modality MRI on schizophrenia and bipolar disorder diagnosis.

Submitted 4 October, 2023; originally announced October 2023.

arXiv:2310.02629 [pdf, other]

BA-MoE: Boundary-Aware Mixture-of-Experts Adapter for Code-Switching Speech Recognition

Authors: Peikun Chen, Fan Yu, Yuhao Lian, Hongfei Xue, Xucheng Wan, Naijun Zheng, Huan Zhou, Lei Xie

Abstract: Mixture-of-experts based models, which use language experts to extract language-specific representations effectively, have been well applied in code-switching automatic speech recognition. However, there is still substantial space to improve as similar pronunciation across languages may result in ineffective multi-language modeling and inaccurate language boundary estimation. To eliminate these drawbacks, we propose a cross-layer language adapter and a boundary-aware training method, namely Boundary-Aware Mixture-of-Experts (BA-MoE). Specifically, we introduce language-specific adapters to separate language-specific representations and a unified gating layer to fuse representations within each encoder layer. Second, we compute language adaptation loss of the mean output of each language-specific adapter to improve the adapter module's language-specific representation learning. Besides, we utilize a boundary-aware predictor to learn boundary representations for dealing with language boundary confusion. Our approach achieves significant performance improvement, reducing the mixture error rate by 16.55% compared to the baseline on the ASRU 2019 Mandarin-English code-switching challenge dataset.

Submitted 7 October, 2023; v1 submitted 4 October, 2023; originally announced October 2023.

Comments: Accepted by ASRU2023

arXiv:2310.01926 [pdf, other]

DARTH: Holistic Test-time Adaptation for Multiple Object Tracking

Authors: Mattia Segu, Bernt Schiele, Fisher Yu

Abstract: Multiple object tracking (MOT) is a fundamental component of perception systems for autonomous driving, and its robustness to unseen conditions is a requirement to avoid life-critical failures. Despite the urgency of safety in driving systems, no solution to the MOT adaptation problem to domain shift in test-time conditions has ever been proposed. However, the nature of a MOT system is manifold - requiring object detection and instance association - and adapting all its components is non-trivial. In this paper, we analyze the effect of domain shift on appearance-based trackers, and introduce DARTH, a holistic test-time adaptation framework for MOT. We propose a detection consistency formulation to adapt object detection in a self-supervised fashion, while adapting the instance appearance representations via our novel patch contrastive loss. We evaluate our method on a variety of domain shifts - including sim-to-real, outdoor-to-indoor, indoor-to-outdoor - and substantially improve the source model performance on all metrics. Code: https://github.com/mattiasegu/darth.

Submitted 3 October, 2023; originally announced October 2023.

Comments: Proceedings of the IEEE/CVF International Conference on Computer Vision

arXiv:2309.16421 [pdf, other]

Distilling ODE Solvers of Diffusion Models into Smaller Steps

Authors: Sanghwan Kim, Hao Tang, Fisher Yu

Abstract: Diffusion models have recently gained prominence as a novel category of generative models. Despite their success, these models face a notable drawback in terms of slow sampling speeds, requiring a high number of function evaluations (NFE) in the order of hundreds or thousands. In response, both learning-free and learning-based sampling strategies have been explored to expedite the sampling process. Learning-free sampling employs various ordinary differential equation (ODE) solvers based on the formulation of diffusion ODEs. However, it encounters challenges in faithfully tracking the true sampling trajectory, particularly for small NFE. Conversely, learning-based sampling methods, such as knowledge distillation, demand extensive additional training, limiting their practical applicability. To overcome these limitations, we introduce Distilled-ODE solvers (D-ODE solvers), a straightforward distillation approach grounded in ODE solver formulations. Our method seamlessly integrates the strengths of both learning-free and learning-based sampling. D-ODE solvers are constructed by introducing a single parameter adjustment to existing ODE solvers. Furthermore, we optimize D-ODE solvers with smaller steps using knowledge distillation from ODE solvers with larger steps across a batch of samples. Comprehensive experiments demonstrate the superior performance of D-ODE solvers compared to existing ODE solvers, including DDIM, PNDM, DPM-Solver, DEIS, and EDM, particularly in scenarios with fewer NFE. Notably, our method incurs negligible computational overhead compared to previous distillation techniques, facilitating straightforward and rapid integration with existing samplers. Qualitative analysis reveals that D-ODE solvers not only enhance image quality but also faithfully follow the target ODE trajectory.

Submitted 26 March, 2024; v1 submitted 28 September, 2023; originally announced September 2023.

arXiv:2309.13573 [pdf, other]

The second multi-channel multi-party meeting transcription challenge (M2MeT) 2.0): A benchmark for speaker-attributed ASR

Authors: Yuhao Liang, Mohan Shi, Fan Yu, Yangze Li, Shiliang Zhang, Zhihao Du, Qian Chen, Lei Xie, Yanmin Qian, Jian Wu, Zhuo Chen, Kong Aik Lee, Zhijie Yan, Hui Bu

Abstract: With the success of the first Multi-channel Multi-party Meeting Transcription challenge (M2MeT), the second M2MeT challenge (M2MeT 2.0), held in ASRU2023, particularly aims to tackle the complex task of speaker-attributed ASR (SA-ASR), which directly addresses the practical and challenging problem of "who spoke what at when" in typical meeting scenarios. We established two sub-tracks: the fixed training condition sub-track, where the training data is constrained to predetermined datasets but participants can use any open-source pre-trained model, and the open training condition sub-track, which allows for the use of all available data and models without limitation. In addition, we release a new 10-hour test set for challenge ranking. This paper provides an overview of the dataset, track settings, results, and analysis of submitted systems, as a benchmark to show the current state of speaker-attributed ASR.

Submitted 5 October, 2023; v1 submitted 24 September, 2023; originally announced September 2023.

Comments: 8 pages, Accepted by ASRU2023

arXiv:2309.12053 [pdf, other]

AceGPT, Localizing Large Language Models in Arabic

Authors: Huang Huang, Fei Yu, Jianqing Zhu, Xuening Sun, Hao Cheng, Dingjie Song, Zhihong Chen, Abdulmohsen Alharthi, Bang An, Juncai He, Ziche Liu, Zhiyi Zhang, Junying Chen, Jianquan Li, Benyou Wang, Lian Zhang, Ruoyu Sun, Xiang Wan, Haizhou Li, Jinchao Xu

Abstract: This paper is devoted to the development of a localized Large Language Model (LLM) specifically for Arabic, a language imbued with unique cultural characteristics inadequately addressed by current mainstream models. Significant concerns emerge when addressing cultural sensitivity and local values. To address this, the paper proposes a comprehensive solution that includes further pre-training with Arabic texts, Supervised Fine-Tuning (SFT) utilizing native Arabic instructions, and GPT-4 responses in Arabic, alongside Reinforcement Learning with AI Feedback (RLAIF) employing a reward model attuned to local culture and values. The goal is to cultivate culturally cognizant and value-aligned Arabic LLMs capable of accommodating the diverse, application-specific needs of Arabic-speaking communities. Comprehensive evaluations reveal that the resulting model, dubbed 'AceGPT', sets the state-of-the-art standard for open Arabic LLMs across various benchmarks. Code, data, and models are available at https://github.com/FreedomIntelligence/AceGPT.

Submitted 2 April, 2024; v1 submitted 21 September, 2023; originally announced September 2023.

Comments: Accepted to NAACL main conference. https://github.com/FreedomIntelligence/AceGPT

arXiv:2309.06006 [pdf, ps, other]

SoccerNet 2023 Challenges Results

Authors: Anthony Cioppa, Silvio Giancola, Vladimir Somers, Floriane Magera, Xin Zhou, Hassan Mkhallati, Adrien Deliège, Jan Held, Carlos Hinojosa, Amir M. Mansourian, Pierre Miralles, Olivier Barnich, Christophe De Vleeschouwer, Alexandre Alahi, Bernard Ghanem, Marc Van Droogenbroeck, Abdullah Kamal, Adrien Maglo, Albert Clapés, Amr Abdelaziz, Artur Xarles, Astrid Orcesi, Atom Scott, Bin Liu, Byoungkwon Lim, et al. (77 additional authors not shown)

Abstract: The SoccerNet 2023 challenges were the third annual video understanding challenges organized by the SoccerNet team. For this third edition, the challenges were composed of seven vision-based tasks split into three main themes. The first theme, broadcast video understanding, is composed of three high-level tasks related to describing events occurring in the video broadcasts: (1) action spotting, focusing on retrieving all timestamps related to global actions in soccer, (2) ball action spotting, focusing on retrieving all timestamps related to the soccer ball change of state, and (3) dense video captioning, focusing on describing the broadcast with natural language and anchored timestamps. The second theme, field understanding, relates to the single task of (4) camera calibration, focusing on retrieving the intrinsic and extrinsic camera parameters from images. The third and last theme, player understanding, is composed of three low-level tasks related to extracting information about the players: (5) re-identification, focusing on retrieving the same players across multiple views, (6) multiple object tracking, focusing on tracking players and the ball through unedited video streams, and (7) jersey number recognition, focusing on recognizing the jersey number of players from tracklets. Compared to the previous editions of the SoccerNet challenges, tasks (2-3-7) are novel, including new annotations and data, task (4) was enhanced with more data and annotations, and task (6) now focuses on end-to-end approaches. More information on the tasks, challenges, and leaderboards is available at https://www.soccer-net.org. Baselines and development kits can be found at https://github.com/SoccerNet.

Submitted 12 September, 2023; originally announced September 2023.

arXiv:2309.05396 [pdf, other]

SlideSpeech: A Large-Scale Slide-Enriched Audio-Visual Corpus

Authors: Haoxu Wang, Fan Yu, Xian Shi, Yuezhang Wang, Shiliang Zhang, Ming Li

Abstract: Multi-Modal automatic speech recognition (ASR) techniques aim to leverage additional modalities to improve the performance of speech recognition systems. While existing approaches primarily focus on video or contextual information, the utilization of extra supplementary textual information has been overlooked. Recognizing the abundance of online conference videos with slides, which provide rich domain-specific information in the form of text and images, we release SlideSpeech, a large-scale audio-visual corpus enriched with slides. The corpus contains 1,705 videos, 1,000+ hours, with 473 hours of high-quality transcribed speech. Moreover, the corpus contains a significant amount of real-time synchronized slides. In this work, we present the pipeline for constructing the corpus and propose baseline methods for utilizing text information in the visual slide context. Through the application of keyword extraction and contextual ASR methods in the benchmark system, we demonstrate the potential of improving speech recognition performance by incorporating textual information from supplementary video slides.

Submitted 25 December, 2023; v1 submitted 11 September, 2023; originally announced September 2023.

Comments: Accepted by ICASSP 2024

arXiv:2309.04707 [pdf, other]

Advantage Actor-Critic with Reasoner: Explaining the Agent's Behavior from an Exploratory Perspective

Authors: Muzhe Guo, Feixu Yu, Tian Lan, Fang Jin

Abstract: Reinforcement learning (RL) is a powerful tool for solving complex decision-making problems, but its lack of transparency and interpretability has been a major challenge in domains where decisions have significant real-world consequences. In this paper, we propose a novel Advantage Actor-Critic with Reasoner (A2CR), which can be easily applied to Actor-Critic-based RL models and make them interpretable. A2CR consists of three interconnected networks: the Policy Network, the Value Network, and the Reasoner Network. By predefining and classifying the underlying purpose of the actor's actions, A2CR automatically generates a more comprehensive and interpretable paradigm for understanding the agent's decision-making process. It offers a range of functionalities such as purpose-based saliency, early failure detection, and model supervision, thereby promoting responsible and trustworthy RL. Evaluations conducted in action-rich Super Mario Bros environments yield intriguing findings: Reasoner-predicted label proportions decrease for "Breakout" and increase for "Hovering" as the exploration level of the RL algorithm intensifies. Additionally, purpose-based saliencies are more focused and comprehensible.

Submitted 9 September, 2023; originally announced September 2023.

arXiv:2309.04422 [pdf, other]

Video Task Decathlon: Unifying Image and Video Tasks in Autonomous Driving

Authors: Thomas E. Huang, Yifan Liu, Luc Van Gool, Fisher Yu

Abstract: Performing multiple heterogeneous visual tasks in dynamic scenes is a hallmark of human perception capability. Despite remarkable progress in image and video recognition via representation learning, current research still focuses on designing specialized networks for singular, homogeneous, or simple combination of tasks. We instead explore the construction of a unified model for major image and video recognition tasks in autonomous driving with diverse input and output structures. To enable such an investigation, we design a new challenge, Video Task Decathlon (VTD), which includes ten representative image and video tasks spanning classification, segmentation, localization, and association of objects and pixels. On VTD, we develop our unified network, VTDNet, that uses a single structure and a single set of weights for all ten tasks. VTDNet groups similar tasks and employs task interaction stages to exchange information within and between task groups. Given the impracticality of labeling all tasks on all frames, and the performance degradation associated with joint training of many tasks, we design a Curriculum training, Pseudo-labeling, and Fine-tuning (CPF) scheme to successfully train VTDNet on all tasks and mitigate performance loss. Armed with CPF, VTDNet significantly outperforms its single-task counterparts on most tasks with only 20% overall computations. VTD is a promising new direction for exploring the unification of perception tasks in autonomous driving.

Submitted 26 November, 2023; v1 submitted 8 September, 2023; originally announced September 2023.

Comments: ICCV 2023, project page at https://www.vis.xyz/pub/vtd

arXiv:2308.15726 [pdf]

AGS: An Dataset and Taxonomy for Domestic Scene Sound Event Recognition

Authors: Nan Che, Chenrui Liu, Fei Yu

Abstract: Environmental sound scene and sound event recognition is important for the recognition of suspicious events in indoor and outdoor environments (such as nurseries, smart homes, nursing homes, etc.) and is a fundamental task involved in many audio surveillance applications. In particular, there is no common public data set for sound event recognition in indoor environmental sound scenes. Therefore, this paper proposes a data set (called AGS) for home environment sound. This data set considers various types of overlapping audio in the scene as well as background noise. Moreover, based on the proposed data set, this paper compares and analyzes advanced methods for sound event recognition, illustrates the reliability of the proposed data set, and studies the challenges raised by the new data set. Our proposed AGS and the source code of the corresponding baselines are available at https://github.com/taolunzu11/AGS.

Submitted 29 August, 2023; originally announced August 2023.

arXiv:2308.15070 [pdf, other]

DiffBIR: Towards Blind Image Restoration with Generative Diffusion Prior

Authors: Xinqi Lin, Jingwen He, Ziyan Chen, Zhaoyang Lyu, Bo Dai, Fanghua Yu, Wanli Ouyang, Yu Qiao, Chao Dong

Abstract: We present DiffBIR, a general restoration pipeline that could handle different blind image restoration tasks in a unified framework. DiffBIR decouples the blind image restoration problem into two stages: 1) degradation removal: removing image-independent content; 2) information regeneration: generating the lost image content. Each stage is developed independently but they work seamlessly in a cascaded manner. In the first stage, we use restoration modules to remove degradations and obtain high-fidelity restored results. For the second stage, we propose IRControlNet that leverages the generative ability of latent diffusion models to generate realistic details. Specifically, IRControlNet is trained based on specially produced condition images without distracting noisy content for stable generation performance. Moreover, we design a region-adaptive restoration guidance that can modify the denoising process during inference without model re-training, allowing users to balance realness and fidelity through a tunable guidance scale. Extensive experiments have demonstrated DiffBIR's superiority over state-of-the-art approaches for blind image super-resolution, blind face restoration and blind image denoising tasks on both synthetic and real-world datasets. The code is available at https://github.com/XPixelGroup/DiffBIR.

Submitted 12 April, 2024; v1 submitted 29 August, 2023; originally announced August 2023.

arXiv:2308.14713 [pdf, other]

R3D3: Dense 3D Reconstruction of Dynamic Scenes from Multiple Cameras

Authors: Aron Schmied, Tobias Fischer, Martin Danelljan, Marc Pollefeys, Fisher Yu

Abstract: Dense 3D reconstruction and ego-motion estimation are key challenges in autonomous driving and robotics. Compared to the complex, multi-modal systems deployed today, multi-camera systems provide a simpler, low-cost alternative. However, camera-based 3D reconstruction of complex dynamic scenes has proven extremely difficult, as existing solutions often produce incomplete or incoherent results. We propose R3D3, a multi-camera system for dense 3D reconstruction and ego-motion estimation. Our approach iterates between geometric estimation that exploits spatial-temporal information from multiple cameras, and monocular depth refinement. We integrate multi-camera feature correlation and dense bundle adjustment operators that yield robust geometric depth and pose estimates. To improve reconstruction where geometric depth is unreliable, e.g. for moving objects or low-textured regions, we introduce learnable scene priors via a depth refinement network. We show that this design enables a dense, consistent 3D reconstruction of challenging, dynamic outdoor environments. Consequently, we achieve state-of-the-art dense depth prediction on the DDAD and NuScenes benchmarks.

Submitted 28 August, 2023; originally announced August 2023.

Comments: Accepted to ICCV 2023. Project page is available at https://www.vis.xyz/pub/r3d3/

arXiv:2308.12581 [pdf, other]

A Huber Loss Minimization Approach to Byzantine Robust Federated Learning

Authors: Puning Zhao, Fei Yu, Zhiguo Wan

Abstract: Federated learning systems are susceptible to adversarial attacks. To combat this, we introduce a novel aggregator based on Huber loss minimization, and provide a comprehensive theoretical analysis. Under the independent and identically distributed (i.i.d.) assumption, our approach has several advantages compared to existing methods. Firstly, it has optimal dependence on $ε$, which stands for the ratio of attacked clients. Secondly, our approach does not need precise knowledge of $ε$. Thirdly, it allows different clients to have unequal data sizes. We then broaden our analysis to include non-i.i.d. data, such that clients have slightly different distributions.

Submitted 25 March, 2024; v1 submitted 24 August, 2023; originally announced August 2023.
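
The abstract above does not spell out the aggregator, so the following is only a minimal sketch of a generic multi-dimensional Huber location estimate, not the authors' method. It assumes each client update is a plain list of floats and solves the minimization by iteratively reweighted averaging; the function name `huber_aggregate`, the threshold `delta`, and the coordinate-wise median initialization are illustrative assumptions.

```python
import math
from statistics import median


def huber_aggregate(updates, delta=1.0, iters=100, tol=1e-9):
    """Approximate argmin over c of sum_i huber_delta(||c - u_i||).

    Hypothetical helper: a generic Huber location estimate solved by
    iteratively reweighted averaging, not the paper's exact algorithm.
    """
    dim = len(updates[0])
    # Start from the coordinate-wise median, which is already outlier-resistant.
    center = [median(u[k] for u in updates) for k in range(dim)]
    for _ in range(iters):
        weights = []
        for u in updates:
            r = math.sqrt(sum((a - c) ** 2 for a, c in zip(u, center)))
            # Quadratic region keeps full weight; linear region is down-weighted.
            weights.append(1.0 if r <= delta else delta / r)
        total = sum(weights)
        new_center = [
            sum(w * u[k] for w, u in zip(weights, updates)) / total
            for k in range(dim)
        ]
        shift = math.sqrt(sum((a - b) ** 2 for a, b in zip(new_center, center)))
        center = new_center
        if shift < tol:
            break
    return center


# Four honest clients near (1, 1) and one attacked client far away.
honest = [[1.0, 1.0], [1.1, 0.9], [0.9, 1.1], [1.0, 1.0]]
attacked = [[100.0, -100.0]]
aggregate = huber_aggregate(honest + attacked, delta=1.0)
naive_mean = [sum(u[k] for u in honest + attacked) / 5 for k in range(2)]
```

In this toy setup the plain mean is dragged far from the honest cluster, while the Huber estimate stays close to it because the attacked update's influence is capped at `delta`. The weights also never require knowing the fraction of attacked clients.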

arXiv:2308.12234 [pdf, other]

doi 10.1109/iccv51070.2023.01791

MolGrapher: Graph-based Visual Recognition of Chemical Structures

Authors: Lucas Morin, Martin Danelljan, Maria Isabel Agea, Ahmed Nassar, Valery Weber, Ingmar Meijer, Peter Staar, Fisher Yu

Abstract: The automatic analysis of chemical literature has immense potential to accelerate the discovery of new materials and drugs. Much of the critical information in patent documents and scientific articles is contained in figures, depicting the molecule structures. However, automatically parsing the exact chemical structure is a formidable challenge, due to the amount of detailed information, the diversity of drawing styles, and the need for training data. In this work, we introduce MolGrapher to recognize chemical structures visually. First, a deep keypoint detector detects the atoms. Second, we treat all candidate atoms and bonds as nodes and put them in a graph. This construct allows a natural graph representation of the molecule. Last, we classify atom and bond nodes in the graph with a Graph Neural Network. To address the lack of real training data, we propose a synthetic data generation pipeline producing diverse and realistic results. In addition, we introduce a large-scale benchmark of annotated real molecule images, USPTO-30K, to spur research on this critical topic. Extensive experiments on five datasets show that our approach significantly outperforms classical and learning-based methods in most settings. Code, models, and datasets are available.

Submitted 23 August, 2023; originally announced August 2023.

arXiv:2308.11093 [pdf, other]

Video OWL-ViT: Temporally-consistent open-world localization in video

Authors: Georg Heigold, Matthias Minderer, Alexey Gritsenko, Alex Bewley, Daniel Keysers, Mario Lučić, Fisher Yu, Thomas Kipf

Abstract: We present an architecture and a training recipe that adapts pre-trained open-world image models to localization in videos. Understanding the open visual world (without being constrained by fixed label spaces) is crucial for many real-world vision tasks. Contrastive pre-training on large image-text datasets has recently led to significant improvements for image-level tasks. For more structured tasks involving object localization, applying pre-trained models is more challenging. This is particularly true for video tasks, where task-specific data is limited. We show successful transfer of open-world models by building on the OWL-ViT open-vocabulary detection model and adapting it to video by adding a transformer decoder. The decoder propagates object representations recurrently through time by using the output tokens for one frame as the object queries for the next. Our model is end-to-end trainable on video data and enjoys improved temporal consistency compared to tracking-by-detection baselines, while retaining the open-world capabilities of the backbone detector. We evaluate our model on the challenging TAO-OW benchmark and demonstrate that open-world capabilities, learned from large-scale image-text pre-training, can be transferred successfully to open-world localization across diverse videos.

Submitted 21 August, 2023; originally announced August 2023.

Comments: ICCV 2023

arXiv:2308.08612 [pdf, other]

doi 10.1002/andp.202300106

Primer on Axion Physics

Authors: Felix Yu

Abstract: I review the canonical axion potential, with an emphasis on the field theory underlying radial and angular modes of complex scalar fields. I present the explicit calculation of the instanton-induced breaking of the Goldstone field direction necessary to derive the canonical axion mass and decay constant relation. The primer is intended to serve an audience with elementary quantum field theory expertise.

Submitted 16 August, 2023; originally announced August 2023.

Comments: 15 pages, 1 figure; invited contribution to Annalen der Physik

Report number: MITP-23-038

Journal ref: Ann. Phys.(Berlin) 2023, 2300106

arXiv:2308.06272 [pdf, other]

Beyond Reality: The Pivotal Role of Generative AI in the Metaverse

Authors: Vinay Chamola, Gaurang Bansal, Tridib Kumar Das, Vikas Hassija, Naga Siva Sai Reddy, Jiacheng Wang, Sherali Zeadally, Amir Hussain, F. Richard Yu, Mohsen Guizani, Dusit Niyato

Abstract: Imagine stepping into a virtual world that's as rich, dynamic, and interactive as our physical one. This is the promise of the Metaverse, and it's being brought to life by the transformative power of Generative Artificial Intelligence (AI). This paper offers a comprehensive exploration of how generative AI technologies are shaping the Metaverse, transforming it into a dynamic, immersive, and interactive virtual world. We delve into the applications of text generation models like ChatGPT and GPT-3, which are enhancing conversational interfaces with AI-generated characters. We explore the role of image generation models such as DALL-E and MidJourney in creating visually stunning and diverse content. We also examine the potential of 3D model generation technologies like Point-E and Lumirithmic in creating realistic virtual objects that enrich the Metaverse experience. But the journey doesn't stop there. We also address the challenges and ethical considerations of implementing these technologies in the Metaverse, offering insights into the balance between user control and AI automation. This paper is not just a study, but a guide to the future of the Metaverse, offering readers a roadmap to harnessing the power of generative AI in creating immersive virtual worlds.

Submitted 28 July, 2023; originally announced August 2023.

Comments: 8 pages, 4 figures

arXiv:2308.05023 [pdf]

High-energy nitrogen rings stabilized by superatom properties

Authors: Zhen Gong, Rui Wang, Famin Yu, Chenxi Wan, Xinrui Yang, Zhigang Wang

Abstract: How to stabilize nitrogen-rich high-energy-density molecules under conventional conditions is particularly important for the energy storage and conversion of such systems and has attracted extensive attention. In this work, our theoretical study showed for the first time that the stabilization mechanism of the nitrogen ring conformed to the superatomic properties at the atomic level. This result occurred because the stabilized anionic nitrogen rings generally showed planar high symmetry and the injected electrons occupied the superatomic molecular orbitals (SAMOs) of the nitrogen rings. According to these results, we identified the typical stabilized anionic nitrogen ring structures $\mathrm{N}_6^{4-}$, $\mathrm{N}_5^{-}$ and $\mathrm{N}_4^{2-}$, and their superatomic electronic configurations were $1S^{2}1P^{4}1D^{4}1F^{2}2S^{2}1P^{2}1F^{2}1D^{4}2P^{4}1G^{4}1F^{4}$, $1S^{2}1P^{4}1D^{4}1P^{2}2S^{2}1F^{4}1D^{4}2P^{4}$ and $1S^{2}1P^{4}1D^{2}1P^{2}1D^{2}2S^{2}2P^{4}1D^{4}$, respectively. On this basis, we further designed a pathway to stabilize nitrogen rings by introducing metal atoms as electron donors to form neutral ThN$_6$, LiN$_5$ and MgN$_4$ structures, thereby replacing the anionization of systems. Our study highlights the importance of developing nitrogen-rich energetic materials from the perspective of superatoms.

Submitted 9 August, 2023; originally announced August 2023.

Comments: 6 pages, 3 figures

arXiv:2308.03422 [pdf, other]

Prompt Guided Copy Mechanism for Conversational Question Answering

Authors: Yong Zhang, Zhitao Li, Jianzong Wang, Yiming Gao, Ning Cheng, Fengying Yu, Jing Xiao

Abstract: Conversational Question Answering (CQA) is a challenging task that aims to generate natural answers for conversational flow questions. In this paper, we propose a pluggable approach for extractive methods that introduces a novel prompt-guided copy mechanism to improve the fluency and appropriateness of the extracted answers. Our approach uses prompts to link questions to answers and employs attention to guide the copy mechanism to verify the naturalness of extracted answers, making necessary edits to ensure that the answers are fluent and appropriate. The three prompts, including a question-rationale relationship prompt, a question description prompt, and a conversation history prompt, enhance the copy mechanism's performance. Our experiments demonstrate that this approach effectively promotes the generation of natural answers and achieves good results in the CoQA challenge.

Submitted 7 August, 2023; originally announced August 2023.

Comments: Accepted by 24th Annual Conference of the International Speech Communication Association (INTERSPEECH 2023)

arXiv:2308.03364 [pdf, other]

Dual Aggregation Transformer for Image Super-Resolution

Authors: Zheng Chen, Yulun Zhang, Jinjin Gu, Linghe Kong, Xiaokang Yang, Fisher Yu

Abstract: Transformer has recently gained considerable popularity in low-level vision tasks, including image super-resolution (SR). These networks utilize self-attention along different dimensions, spatial or channel, and achieve impressive performance. This inspires us to combine the two dimensions in Transformer for a more powerful representation capability. Based on the above idea, we propose a novel Transformer model, Dual Aggregation Transformer (DAT), for image SR. Our DAT aggregates features across spatial and channel dimensions, in the inter-block and intra-block dual manner. Specifically, we alternately apply spatial and channel self-attention in consecutive Transformer blocks. The alternate strategy enables DAT to capture the global context and realize inter-block feature aggregation. Furthermore, we propose the adaptive interaction module (AIM) and the spatial-gate feed-forward network (SGFN) to achieve intra-block feature aggregation. AIM complements two self-attention mechanisms from corresponding dimensions. Meanwhile, SGFN introduces additional non-linear spatial information in the feed-forward network. Extensive experiments show that our DAT surpasses current methods. Code and models are obtainable at https://github.com/zhengchen1999/DAT.

Submitted 11 August, 2023; v1 submitted 7 August, 2023; originally announced August 2023.

Comments: Accepted to ICCV 2023. Code is available at https://github.com/zhengchen1999/DAT

arXiv:2308.03166 [pdf, other]

Strategic Preys Make Acute Predators: Enhancing Camouflaged Object Detectors by Generating Camouflaged Objects

Authors: Chunming He, Kai Li, Yachao Zhang, Yulun Zhang, Zhenhua Guo, Xiu Li, Martin Danelljan, Fisher Yu

Abstract: Camouflaged object detection (COD) is the challenging task of identifying camouflaged objects visually blended into surroundings. Albeit achieving remarkable success, existing COD detectors still struggle to obtain precise results in some challenging cases. To handle this problem, we draw inspiration from the prey-vs-predator game that leads preys to develop better camouflage and predators to acquire more acute vision systems and develop algorithms from both the prey side and the predator side. On the prey side, we propose an adversarial training framework, Camouflageator, which introduces an auxiliary generator to generate more camouflaged objects that are harder for a COD method to detect. Camouflageator trains the generator and detector in an adversarial way such that the enhanced auxiliary generator helps produce a stronger detector. On the predator side, we introduce a novel COD method, called Internal Coherence and Edge Guidance (ICEG), which introduces a camouflaged feature coherence module to excavate the internal coherence of camouflaged objects, striving to obtain more complete segmentation results. Additionally, ICEG proposes a novel edge-guided separated calibration module to remove false predictions to avoid obtaining ambiguous boundaries. Extensive experiments show that ICEG outperforms existing COD detectors and Camouflageator is flexible to improve various COD detectors, including ICEG, which brings state-of-the-art COD performance.

Submitted 10 March, 2024; v1 submitted 6 August, 2023; originally announced August 2023.

Comments: Accepted at ICLR 2024

arXiv:2308.02621 [pdf, other]

Color Image Recovery Using Generalized Matrix Completion over Higher-Order Finite Dimensional Algebra

Authors: Liang Liao, Zhuang Guo, Qi Gao, Yan Wang, Fajun Yu, Qifeng Zhao, Stephen Johh Maybank

Abstract: To improve the accuracy of color image completion with missing entries, we present a recovery method based on generalized higher-order scalars. We extend the traditional second-order matrix model to a more comprehensive higher-order matrix equivalent, called the "t-matrix" model, which incorporates a pixel neighborhood expansion strategy to characterize the local pixel constraints. This "t-matrix" model is then used to extend some commonly used matrix and tensor completion algorithms to their higher-order versions. We perform extensive experiments on various algorithms using simulated data and publicly available images and compare their performance. The results show that our generalized matrix completion model and the corresponding algorithm compare favorably with their lower-order tensor and conventional matrix counterparts.

Submitted 4 August, 2023; originally announced August 2023.

Comments: 24 pages; 9 figures

arXiv:2308.01244 [pdf, ps, other]

Quantum Imprint of the Anharmonic Oscillator

Authors: Prisco Lo Chiatto, Sebastian Schenk, Felix Yu

Abstract: We study the anharmonic double well in quantum mechanics using exact Wentzel-Kramers-Brillouin (WKB) methods in a 't Hooft-like double scaling limit where classical behavior is expected to dominate. We compute the tunneling action in this double scaling limit, and compare it to the transition amplitude from the vacuum to a highly excited state. Our results, exact in the semiclassical limit, show that the two expressions coincide, apart from an irreducible and surprising instanton contribution. Thus, the semiclassical limit of the anharmonic oscillator betrays its quantum origin as a rule, which we dub the "quantum imprint rule," showing that the quantum theory is intrinsically gapped from classical behavior. Besides an example of the failure of reductionism and an example of a resurgent connection between perturbative and nonperturbative physics, this work provides a possible classification of theories according to their quantum imprints.

Submitted 2 August, 2023; originally announced August 2023.

Comments: 23 pages

Report number: MITP-23-034

arXiv:2307.16147 [pdf]

Broadband Dispersive-Wave Emission Coupled with Two-Stage Soliton Self-Compression in Gas-Filled Anti-Resonant Hollow-Core Fibers

Authors: Jinyu Pan, Zhiyuan Huang, Yifei Chen, Fei Yu, Dakun Wu, Tiandao Chen, Donghan Liu, Yue Yu, Xin Jiang, Meng Pang, Yuxin Leng, Ruxin Li

Abstract: We studied the underlying mechanism of broadband dispersive-wave emission within a resonance band of gas-filled anti-resonant hollow-core fiber. Both theoretical and experimental results unveiled that the high-order soliton, launched into the hollow-core fiber, experienced two stages of pulse compression, resulting in a multi-peak structure of the dispersive-wave spectrum. Over the first-stage pulse compression, a sharp increase of the pulse peak power triggered the first dispersive-wave emission, and simultaneously caused ionization of the noble gas filled in the fiber core. Strong soliton-plasma interactions led to blue shifting of the pump pulse, and the blue-shifted pulse experienced a decreasing dispersion value in the fiber waveguide, resulting in an increase of its soliton order. Then, the second-stage pulse compression due to the high-order soliton effect triggered the second dispersive-wave emission at a phase-matched frequency slightly lower than that in the first stage. Multi-peak spectra of the output dispersive-waves and their formation dynamics were clearly observed in our experiments, which can be understood using a delicate coupling mechanism among three nonlinear effects including high-order-soliton compression, soliton-plasma interaction and phase-matched dispersive-wave emission. The output broadband dispersive-wave could be potentially compressed to sub-30 fs duration using precise chirp-compensation technique.

Submitted 30 July, 2023; originally announced July 2023.

arXiv:2307.16000 [pdf, other]

Automated Hit-frame Detection for Badminton Match Analysis

Authors: Yu-Hang Chien, Fang Yu

Abstract: Sports professionals constantly under pressure to perform at the highest level can benefit from sports analysis, which allows coaches and players to reduce manual efforts and systematically evaluate their performance using automated tools. This research aims to advance sports analysis in badminton, systematically detecting hit-frames automatically from match videos using modern deep learning techniques. The data included in hit-frames can subsequently be utilized to synthesize players' strokes and on-court movement, as well as for other downstream applications such as analyzing training tasks and competition strategy. The proposed approach in this study comprises several automated procedures like rally-wise video trimming, player and court keypoints detection, shuttlecock flying direction prediction, and hit-frame detection. In the study, we achieved 99% accuracy on shot angle recognition for video trimming, over 92% accuracy for applying player keypoints sequences on shuttlecock flying direction prediction, and reported the evaluation results of rally-wise video trimming and hit-frame detection.

Submitted 2 August, 2023; v1 submitted 29 July, 2023; originally announced July 2023.

arXiv:2307.14918 [pdf, other]

GET3D--: Learning GET3D from Unconstrained Image Collections

Authors: Fanghua Yu, Xintao Wang, Zheyuan Li, Yan-Pei Cao, Ying Shan, Chao Dong

Abstract: The demand for efficient 3D model generation techniques has grown exponentially, as manual creation of 3D models is time-consuming and requires specialized expertise. While generative models have shown potential in creating 3D textured shapes from 2D images, their applicability in 3D industries is limited due to the lack of a well-defined camera distribution in real-world scenarios, resulting in low-quality shapes. To overcome this limitation, we propose GET3D--, the first method that directly generates textured 3D shapes from 2D images with unknown pose and scale. GET3D-- comprises a 3D shape generator and a learnable camera sampler that captures the 6D external changes on the camera. In addition, we propose a novel training schedule to stably optimize both the shape generator and camera sampler in a unified framework. By controlling external variations using the learnable camera sampler, our method can generate aligned shapes with clear textures. Extensive experiments demonstrate the efficacy of GET3D--, which precisely fits the 6D camera pose distribution and generates high-quality shapes on both synthetic and realistic unconstrained datasets.

Submitted 27 July, 2023; originally announced July 2023.

arXiv:2307.13048

The IceCube-Gen2 Collaboration -- Contributions to the 38th International Cosmic Ray Conference (ICRC2023)

Authors: IceCube-Gen2: R. Abbasi, M. Ackermann, J. Adams, S. K. Agarwalla, J. A. Aguilar, M. Ahlers, J. M. Alameddine, N. M. Amin, K. Andeen, G. Anton, C. Argüelles, Y. Ashida, S. Athanasiadou, J. Audehm, S. N. Axani, X. Bai, A. Balagopal V., M. Baricevic, S. W. Barwick, V. Basu, R. Bay, J. Becker Tjus, J. Beise, et al. (432 additional authors not shown)

Abstract: IceCube-Gen2 is a planned next-generation neutrino observatory at the South Pole that builds upon the successful design of IceCube. Integrating two complementary detection technologies for neutrinos, optical and radio Cherenkov emission, in combination with a surface array for cosmic ray air shower detection, IceCube-Gen2 will cover a broad neutrino energy range from MeV to EeV. This index of contributions to the 38th International Cosmic Ray Conference in Nagoya, Japan (July 26 - August 3, 2023) describes research and development efforts for IceCube-Gen2. Included are summaries of the design, status, and sensitivity of the IceCube-Gen2 optical, surface, and radio components; performance studies of next-generation optical sensors detecting optical Cherenkov radiation from cosmic ray and neutrino events; reconstruction techniques of radio and optical events in terms of energy, direction, and neutrino flavor; and sensitivity studies of astrophysical neutrino flavors, diffuse neutrino fluxes, and cosmic ray anisotropies. Contributions related to IceCube and the scheduled IceCube Upgrade are available in a separate collection.

Submitted 24 July, 2023; originally announced July 2023.

Comments: To access the list of contributions, please follow the "HTML" link. Links to individual contributions will fill in as authors upload their material

arXiv:2307.13047

The IceCube Collaboration -- Contributions to the 38th International Cosmic Ray Conference (ICRC2023)

Authors: IceCube: R. Abbasi, M. Ackermann, J. Adams, S. K. Agarwalla, J. A. Aguilar, M. Ahlers, J. M. Alameddine, N. M. Amin, K. Andeen, G. Anton, C. Argüelles, Y. Ashida, S. Athanasiadou, S. N. Axani, X. Bai, A. Balagopal V., M. Baricevic, S. W. Barwick, V. Basu, R. Bay, J. J. Beatty, J. Becker Tjus, J. Beise, et al. (382 additional authors not shown)

Abstract: The IceCube Observatory at the South Pole has been operating in its full configuration since May 2011 with a duty cycle of about 99%. Its main component consists of a cubic-kilometer array of optical sensors deployed deep in the glacial ice designed for the detection of high-energy astrophysical neutrinos. A surface array for cosmic ray air shower detection, IceTop, and a denser inner subdetector, DeepCore, significantly enhance the capabilities of the observatory, making it a multipurpose facility. This list of contributions to the 38th International Cosmic Ray Conference in Nagoya, Japan (July 26 - August 3, 2023) summarizes the latest results from IceCube covering a broad set of key questions in physics and astrophysics. The papers in this index are grouped topically to highlight IceCube contributions related to high-energy neutrino and multi-messenger astrophysics, cosmic-ray physics, low-energy neutrino transients such as Galactic supernovae, fundamental physics, detector calibration and event reconstruction, education and public outreach, and research and development for the IceCube Upgrade, a scheduled dense sensor infill complemented by calibration devices. Contributions related to IceCube-Gen2, the future extension of IceCube, are available in a separate collection.

Submitted 24 July, 2023; originally announced July 2023.

Comments: To access the list of contributions, please follow the "HTML" link. Links to individual contributions will fill in as authors upload their material

arXiv:2307.12862 [pdf, other]

Stochastic Step-wise Feature Selection for Exponential Random Graph Models (ERGMs)

Authors: Helal El-Zaatari, Fei Yu, Michael R Kosorok

Abstract: Statistical analysis of social networks provides valuable insights into complex network interactions across various scientific disciplines. However, accurate modeling of networks remains challenging due to the heavy computational burden and the need to account for observed network dependencies. Exponential Random Graph Models (ERGMs) have emerged as a promising technique used in social network modeling to capture network dependencies by incorporating endogenous variables. Nevertheless, using ERGMs poses multiple challenges, including the occurrence of ERGM degeneracy, which generates unrealistic and meaningless network structures. To address these challenges and enhance the modeling of collaboration networks, we propose and test a novel approach that focuses on endogenous variable selection within ERGMs. Our method aims to overcome the computational burden and improve the accommodation of observed network dependencies, thereby facilitating more accurate and meaningful interpretations of network phenomena in various scientific fields. We conduct empirical testing and rigorous analysis to contribute to the advancement of statistical techniques and offer practical insights for network analysis.

Submitted 24 July, 2023; originally announced July 2023.

Comments: 23 pages, 6 tables and 18 figures

arXiv:2307.12191 [pdf, other]

Effects of Coronal Magnetic Field Configuration on Particle Acceleration and Release during the Ground Level Enhancement Events in Solar Cycle 24

Authors: Wenlong Liu, Xiangliang Kong, Fan Guo, Lulu Zhao, Shiwei Feng, Feiyu Yu, Zelong Jiang, Yao Chen, Joe Giacalone

Abstract: Ground level enhancements (GLEs) are extreme solar energetic particle (SEP) events that are of particular importance in space weather. In solar cycle 24, two GLEs were recorded on 2012 May 17 (GLE 71) and 2017 September 10 (GLE 72), respectively, by a range of advanced modern instruments. Here we conduct a comparative analysis of the two events by focusing on the effects of large-scale magnetic field configuration near active regions on particle acceleration and release. Although the active regions were both located near the western limb, temporal variations of SEP intensities and energy spectra measured in-situ display different behaviors at early stages. By combining a potential field model, we find the CME in GLE 71 originated below the streamer belt, while in GLE 72 near the edge of the streamer belt. We reconstruct the CME shock fronts with an ellipsoid model based on nearly simultaneous coronagraph images from multi-viewpoints, and further derive the 3D shock geometry at the GLE onset. The highest-energy particles are primarily accelerated in the shock-streamer interaction regions, i.e., likely at the nose of the shock in GLE 71 and the eastern flank in GLE 72, due to quasi-perpendicular shock geometry and confinement of closed fields. Subsequently, they are released to the field lines connecting to near-Earth spacecraft when the shocks move through the streamer cusp region. This suggests that magnetic structures in the corona, especially shock-streamer interactions, may have played an important role in the acceleration and release of the highest-energy particles in the two events.

Submitted 22 July, 2023; originally announced July 2023.

Comments: Accepted for publication in ApJ

arXiv:2307.11035 [pdf, other]

Cascade-DETR: Delving into High-Quality Universal Object Detection

Authors: Mingqiao Ye, Lei Ke, Siyuan Li, Yu-Wing Tai, Chi-Keung Tang, Martin Danelljan, Fisher Yu

Abstract: Object localization in general environments is a fundamental part of vision systems. While dominating on the COCO benchmark, recent Transformer-based detection methods are not competitive in diverse domains. Moreover, these methods still struggle to estimate object bounding boxes with high accuracy in complex environments. We introduce Cascade-DETR for high-quality universal object detection. We jointly tackle the generalization to diverse domains and localization accuracy by proposing the Cascade Attention layer, which explicitly integrates object-centric information into the detection decoder by limiting the attention to the previous box prediction. To further enhance accuracy, we also revisit the scoring of queries. Instead of relying on classification scores, we predict the expected IoU of the query, leading to substantially better-calibrated confidences. Lastly, we introduce a universal object detection benchmark, UDB10, that contains 10 datasets from diverse domains. While also advancing the state-of-the-art on COCO, Cascade-DETR substantially improves DETR-based detectors on all datasets in UDB10, even by over 10 mAP in some cases. The improvements under stringent quality requirements are even more pronounced. Our code and models will be released at https://github.com/SysCV/cascade-detr.

Submitted 20 July, 2023; originally announced July 2023.

Comments: Accepted in ICCV 2023. Our code and models will be released at https://github.com/SysCV/cascade-detr

Showing 101–150 of 747 results for author: Yu, F