Search | arXiv e-print repository

Equivariant Filter for Tightly Coupled LiDAR-Inertial Odometry

Authors: Anbo Tao, Yarong Luo, Chunxi Xia, Chi Guo, Xingxing Li

Abstract: Pose estimation is a crucial problem in simultaneous localization and mapping (SLAM). However, developing a robust and consistent state estimator remains a significant challenge, as the traditional extended Kalman filter (EKF) struggles to handle the model nonlinearity, especially for inertial measurement unit (IMU) and light detection and ranging (LiDAR). To provide a consistent and efficient solution of pose estimation, we propose Eq-LIO, a robust state estimator for tightly coupled LIO systems based on an equivariant filter (EqF). Compared with the invariant Kalman filter based on the $\SE_2(3)$ group structure, the EqF uses the symmetry of the semi-direct product group to couple the system state including IMU bias, navigation state and LiDAR extrinsic calibration state, thereby suppressing linearization error and improving the behavior of the estimator in the event of unexpected state changes. The proposed Eq-LIO owns natural consistency and higher robustness, which is theoretically proven with mathematical derivation and experimentally verified through a series of tests on both public and private datasets.

Submitted 10 September, 2024; originally announced September 2024.

arXiv:2409.01597 [pdf]

Bacteria exhibit optimal diffusivity near surfaces

Authors: Antai Tao, Guangzhe Liu, Rongjing Zhang, Junhua Yuan

Abstract: In natural environments, solid surfaces present both opportunities and challenges for bacteria. On one hand, they serve as platforms for biofilm formation, crucial for bacterial colonization and resilience in harsh conditions. On the other hand, surfaces can entrap bacteria, constraining their environmental exploration compared to the freedom they experience in bulk liquid. Here, through systematic single-cell behavioral measurements, phenomenological modeling, and theoretical analysis, we reveal how bacteria strategically navigate these factors. We observe that bacterial surface residence time decreases sharply with increasing tumble bias, transitioning to a plateau at a tumble bias of around 0.25, consistent with the mean tumble bias of wild-type Escherichia coli. Furthermore, we find that bacterial surface diffusivity peaks near the mean tumble bias of wild-type E. coli. This reflects a bet-hedging strategy: some bacteria swiftly escape from the surface, while others, with longer surface residence times, explore this two-dimensional environment most efficiently.

Submitted 3 September, 2024; originally announced September 2024.

arXiv:2408.17034 [pdf, other]

MakeWay: Object-Aware Costmaps for Proactive Indoor Navigation Using LiDAR

Authors: Binbin Xu, Allen Tao, Hugues Thomas, Jian Zhang, Timothy D. Barfoot

Abstract: In this paper, we introduce a LiDAR-based robot navigation system, based on novel object-aware affordance-based costmaps. Utilizing a 3D object detection network, our system identifies objects of interest in LiDAR keyframes, refines their 3D poses with the Iterative Closest Point (ICP) algorithm, and tracks them via Kalman filters and the Hungarian algorithm for data association. It then updates existing object poses with new associated detections and creates new object maps for unmatched detections. Using the maintained object-level mapping system, our system creates affordance-driven object costmaps for proactive collision avoidance in path planning. Additionally, we address the scarcity of indoor semantic LiDAR data by introducing an automated labeling technique. This method utilizes a CAD model database for accurate ground-truth annotations, encompassing bounding boxes, positions, orientations, and point-wise semantics of each object in LiDAR sequences. Our extensive evaluations, conducted in both simulated and real-world robot platforms, highlight the effectiveness of proactive object avoidance by using object affordance costmaps, enhancing robotic navigation safety and efficiency. The system can operate in real-time onboard and we intend to release our code and data for public use.

Submitted 30 August, 2024; originally announced August 2024.

Comments: 8 pages, 11 figures

arXiv:2408.16192 [pdf]

Molecular-Scale Insights into the Heterogeneous Interactions Between an m-Terphenyl Isocyanide Ligand and Noble Metal Nanoparticles

Authors: Liya Bi, Yufei Wang, Zhe Wang, Alexandria Do, Alexander Fuqua, Krista P. Balto, Yanning Zhang, Joshua S. Figueroa, Tod A. Pascal, Andrea R. Tao, Shaowei Li

Abstract: The structural and chemical properties of metal nanoparticles are often dictated by their interactions with molecular ligand shells. These interactions are highly material-specific and can vary significantly even among elements within the same group or materials with similar crystal structure. Precise characterization of ligand-metal interactions is crucial for the rational design of ligands and the functionalization of nanoparticles. In this study, we found that the ligation behavior with m-terphenyl isocyanide molecule differs significantly between Au and Ag nanoparticles, with distinct ligand extraction efficiencies and size dependencies. Surface-enhanced Raman spectroscopy measurements revealed unique enhancement factors for two molecular vibrational modes, indicating different ligand binding geometries on these metal surfaces. Molecular-level characterization using scanning tunneling microscopy allowed us to directly visualize these variations between Ag and Au surfaces, which we assign as two distinct binding mechanisms. This molecular-scale visualization provides clear insights into the different ligand-metal interactions, as well as the chemical behavior and spectroscopic characterization of isocyanide-functionalized nanoparticles.

Submitted 28 August, 2024; originally announced August 2024.

arXiv:2408.15998 [pdf, other]

Eagle: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders

Authors: Min Shi, Fuxiao Liu, Shihao Wang, Shijia Liao, Subhashree Radhakrishnan, De-An Huang, Hongxu Yin, Karan Sapra, Yaser Yacoob, Humphrey Shi, Bryan Catanzaro, Andrew Tao, Jan Kautz, Zhiding Yu, Guilin Liu

Abstract: The ability to accurately interpret complex visual information is a crucial topic of multimodal large language models (MLLMs). Recent work indicates that enhanced visual perception significantly reduces hallucinations and improves performance on resolution-sensitive tasks, such as optical character recognition and document analysis. A number of recent MLLMs achieve this goal using a mixture of vision encoders. Despite their success, there is a lack of systematic comparisons and detailed ablation studies addressing critical aspects, such as expert selection and the integration of multiple vision experts. This study provides an extensive exploration of the design space for MLLMs using a mixture of vision encoders and resolutions. Our findings reveal several underlying principles common to various existing strategies, leading to a streamlined yet effective design approach. We discover that simply concatenating visual tokens from a set of complementary vision encoders is as effective as more complex mixing architectures or strategies. We additionally introduce Pre-Alignment to bridge the gap between vision-focused encoders and language tokens, enhancing model coherence. The resulting family of MLLMs, Eagle, surpasses other leading open-source models on major MLLM benchmarks. Models and code: https://github.com/NVlabs/Eagle

Submitted 28 August, 2024; originally announced August 2024.

Comments: Github: https://github.com/NVlabs/Eagle, HuggingFace: https://huggingface.co/NVEagle

arXiv:2408.02925 [pdf, other]

Competitive Facility Location under Cross-Nested Logit Customer Choice Model: Hardness and Exact Approaches

Authors: Ba Luat Le, Tien Mai, Thuy Anh Ta, Minh Hoang Ha, Duc Minh Vu

Abstract: We study the competitive facility location problem, where a firm aims to establish new facilities in a market already occupied by competitors. In this problem, customer behavior is crucial for making optimal location decisions. We explore a general class of customer choice models, known as the cross-nested logit (CNL) model, which is recognized for its flexibility and generality in predicting people's choice behavior. To explore the problem, we first demonstrate that it is NP-hard, even when there is only one customer class. We further show that this hardness result is tight, as the facility location problem under any simpler choice models (such as the logit or nested logit) is polynomial-time solvable when there is one customer class. To tackle the resulting facility location problem, we demonstrate that the objective function under a general cross-nested structure is not concave. Interestingly, we show that by a change of variables, the objective function can be converted to a convex program (i.e., a maximization problem with a concave objective and convex constraints), enabling it to be solved to optimality via an outer-approximation algorithm. Extensive experiments show the efficiency of our approach and provide analyses on the benefits of using the cross-nested model in the facility location context.

Submitted 5 August, 2024; originally announced August 2024.

arXiv:2407.20019 [pdf, ps, other]

Bi-Lipschitz embedding metric triangles in the plane

Authors: Xinyuan Luo, Matthew Romney, Alexandria L. Tao

Abstract: A metric polygon is a metric space comprised of a finite number of closed intervals joined cyclically. The second-named author and Ntalampekos recently found a method to bi-Lipschitz embed an arbitrary metric triangle in the Euclidean plane with uniformly bounded distortion, which we call here the tripodal embedding. In this paper, we prove the sharp distortion bound $4\sqrt{7/3}$ for the tripodal embedding. We also give a detailed analysis of four representative examples of metric triangles: the intrinsic circle, the three-petal rose, tripods and the twisted heart. In particular, our examples show the sharpness of the tripodal embedding distortion bound and give a lower bound for the optimal distortion bound in general. Finally, we show the triangle embedding theorem does not generalize to metric quadrilaterals by giving a family of examples of metric quadrilaterals that are not bi-Lipschitz embeddable in the plane with uniform distortion.

Submitted 29 July, 2024; originally announced July 2024.

Comments: 21 pages, 6 figures

MSC Class: 51F30 (Primary); 30L05 (Secondary)

arXiv:2407.18908 [pdf, other]

Wolf: Captioning Everything with a World Summarization Framework

Authors: Boyi Li, Ligeng Zhu, Ran Tian, Shuhan Tan, Yuxiao Chen, Yao Lu, Yin Cui, Sushant Veer, Max Ehrlich, Jonah Philion, Xinshuo Weng, Fuzhao Xue, Andrew Tao, Ming-Yu Liu, Sanja Fidler, Boris Ivanovic, Trevor Darrell, Jitendra Malik, Song Han, Marco Pavone

Abstract: We propose Wolf, a WOrLd summarization Framework for accurate video captioning. Wolf is an automated captioning framework that adopts a mixture-of-experts approach, leveraging complementary strengths of Vision Language Models (VLMs). By utilizing both image and video models, our framework captures different levels of information and summarizes them efficiently. Our approach can be applied to enhance video understanding, auto-labeling, and captioning. To evaluate caption quality, we introduce CapScore, an LLM-based metric to assess the similarity and quality of generated captions compared to the ground truth captions. We further build four human-annotated datasets in three domains: autonomous driving, general scenes, and robotics, to facilitate comprehensive comparisons. We show that Wolf achieves superior captioning performance compared to state-of-the-art approaches from the research community (VILA1.5, CogAgent) and commercial solutions (Gemini-Pro-1.5, GPT-4V). For instance, in comparison with GPT-4V, Wolf improves CapScore both quality-wise by 55.6% and similarity-wise by 77.4% on challenging driving videos. Finally, we establish a benchmark for video captioning and introduce a leaderboard, aiming to accelerate advancements in video understanding, captioning, and data alignment. Leaderboard: https://wolfv0.github.io/leaderboard.html.

Submitted 26 July, 2024; originally announced July 2024.

arXiv:2405.19335 [pdf, other]

X-VILA: Cross-Modality Alignment for Large Language Model

Authors: Hanrong Ye, De-An Huang, Yao Lu, Zhiding Yu, Wei Ping, Andrew Tao, Jan Kautz, Song Han, Dan Xu, Pavlo Molchanov, Hongxu Yin

Abstract: We introduce X-VILA, an omni-modality model designed to extend the capabilities of large language models (LLMs) by incorporating image, video, and audio modalities. By aligning modality-specific encoders with LLM inputs and diffusion decoders with LLM outputs, X-VILA achieves cross-modality understanding, reasoning, and generation. To facilitate this cross-modality alignment, we curate an effective interleaved any-to-any modality instruction-following dataset. Furthermore, we identify a significant problem with the current cross-modality alignment method, which results in visual information loss. To address the issue, we propose a visual alignment mechanism with a visual embedding highway module. We then introduce a resource-efficient recipe for training X-VILA, that exhibits proficiency in any-to-any modality conversation, surpassing previous approaches by large margins. X-VILA also showcases emergent properties across modalities even in the absence of similar training data. The project will be made open-source.

Submitted 29 May, 2024; originally announced May 2024.

Comments: Technical Report

arXiv:2405.13899 [pdf, ps, other]

Symmetric Linear Bandits with Hidden Symmetry

Authors: Nam Phuong Tran, The Anh Ta, Debmalya Mandal, Long Tran-Thanh

Abstract: High-dimensional linear bandits with low-dimensional structure have received considerable attention in recent studies due to their practical significance. The most common structure in the literature is sparsity. However, it may not be available in practice. Symmetry, where the reward is invariant under certain groups of transformations on the set of arms, is another important inductive bias in the high-dimensional case that covers many standard structures, including sparsity. In this work, we study high-dimensional symmetric linear bandits where the symmetry is hidden from the learner, and the correct symmetry needs to be learned in an online setting. We examine the structure of a collection of hidden symmetry and provide a method based on model selection within the collection of low-dimensional subspaces. Our algorithm achieves a regret bound of $ O(d_0^{1/3} T^{2/3} \log(d))$, where $d$ is the ambient dimension which is potentially very large, and $d_0$ is the dimension of the true low-dimensional subspace such that $d_0 \ll d$. With an extra assumption on well-separated models, we can further improve the regret to $ O(d_0\sqrt{T\log(d)} )$.

Submitted 22 May, 2024; originally announced May 2024.

arXiv:2402.07067 [pdf, other]

Learning the Expected Core of Strictly Convex Stochastic Cooperative Games

Authors: Nam Phuong Tran, The Anh Ta, Shuqing Shi, Debmalya Mandal, Yali Du, Long Tran-Thanh

Abstract: Reward allocation, also known as the credit assignment problem, has been an important topic in economics, engineering, and machine learning. An important concept in reward allocation is the core, which is the set of stable allocations where no agent has the motivation to deviate from the grand coalition. In previous works, computing the core requires either knowledge of the reward function in deterministic games or the reward distribution in stochastic games. However, this is unrealistic, as the reward function or distribution is often only partially known and may be subject to uncertainty. In this paper, we consider the core learning problem in stochastic cooperative games, where the reward distribution is unknown. Our goal is to learn the expected core, that is, the set of allocations that are stable in expectation, given an oracle that returns a stochastic reward for an enquired coalition each round. Within the class of strictly convex games, we present an algorithm named \texttt{Common-Points-Picking} that returns a point in the expected core given a polynomial number of samples, with high probability. To analyse the algorithm, we develop a new extension of the separation hyperplane theorem for multiple convex sets.

Submitted 22 May, 2024; v1 submitted 10 February, 2024; originally announced February 2024.

arXiv:2312.07533 [pdf, other]

VILA: On Pre-training for Visual Language Models

Authors: Ji Lin, Hongxu Yin, Wei Ping, Yao Lu, Pavlo Molchanov, Andrew Tao, Huizi Mao, Jan Kautz, Mohammad Shoeybi, Song Han

Abstract: Visual language models (VLMs) rapidly progressed with the recent success of large language models. There have been growing efforts on visual instruction tuning to extend the LLM with visual inputs, but lacks an in-depth study of the visual language pre-training process, where the model learns to perform joint modeling on both modalities. In this work, we examine the design options for VLM pre-training by augmenting LLM towards VLM through step-by-step controllable comparisons. We introduce three main findings: (1) freezing LLMs during pre-training can achieve decent zero-shot performance, but lack in-context learning capability, which requires unfreezing the LLM; (2) interleaved pre-training data is beneficial whereas image-text pairs alone are not optimal; (3) re-blending text-only instruction data to image-text data during instruction fine-tuning not only remedies the degradation of text-only tasks, but also boosts VLM task accuracy. With an enhanced pre-training recipe we build VILA, a Visual Language model family that consistently outperforms the state-of-the-art models, e.g., LLaVA-1.5, across main benchmarks without bells and whistles. Multi-modal pre-training also helps unveil appealing properties of VILA, including multi-image reasoning, enhanced in-context learning, and better world knowledge.

Submitted 16 May, 2024; v1 submitted 12 December, 2023; originally announced December 2023.

Comments: CVPR 2024

arXiv:2308.10008 [pdf, ps, other]

What is the Impact of Releasing Code with Publications? Statistics from the Machine Learning, Robotics, and Control Communities

Authors: Siqi Zhou, Lukas Brunke, Allen Tao, Adam W. Hall, Federico Pizarro Bejarano, Jacopo Panerati, Angela P. Schoellig

Abstract: Open-sourcing research publications is a key enabler for the reproducibility of studies and the collective scientific progress of a research community. As all fields of science develop more advanced algorithms, we become more dependent on complex computational toolboxes -- sharing research ideas solely through equations and proofs is no longer sufficient to communicate scientific developments. Over the past years, several efforts have highlighted the importance and challenges of transparent and reproducible research; code sharing is one of the key necessities in such efforts. In this article, we study the impact of code release on scientific research and present statistics from three research communities: machine learning, robotics, and control. We found that, over a six-year period (2016-2021), the percentages of papers with code at major machine learning, robotics, and control conferences have at least doubled. Moreover, high-impact papers were generally supported by open-source codes. As an example, the top 1% of most cited papers at the Conference on Neural Information Processing Systems (NeurIPS) consistently included open-source codes. In addition, our analysis shows that popular code repositories generally come with high paper citations, which further highlights the coupling between code sharing and the impact of scientific research.
While the trends are encouraging, we would like to continue to promote and increase our efforts toward transparent, reproducible research that accelerates innovation -- releasing code with our papers is a clear first step.

Submitted 19 August, 2023; originally announced August 2023.

arXiv:2307.00440 [pdf, other]

Friezes over $\mathbb Z[\sqrt{2}]$

Authors: Esther Banaian, Libby Farrell, Amy Tao, Kayla Wright, Joy Zhichun Zhang

Abstract: A frieze on a polygon is a map from the diagonals of the polygon to an integral domain which respects the Ptolemy relation. Conway and Coxeter previously studied positive friezes over $\mathbb{Z}$ and showed that they are in bijection with triangulations of a polygon. We extend their work by studying friezes over $\mathbb Z[\sqrt{2}]$ and their relationships to dissections of polygons. We largely focus on the characterization of unitary friezes that arise from dissecting a polygon into triangles and quadrilaterals. We identify a family of dissections that give rise to unitary friezes and conjecture that this gives a complete classification of dissections which admit a unitary frieze.

Submitted 25 July, 2024; v1 submitted 1 July, 2023; originally announced July 2023.

arXiv:2306.15840 [pdf]

Molecular-Scale Visualization of Steric Effects of Ligand Binding to Reconstructed Au(111) Surfaces

Authors: Liya Bi, Sasawat Jamnuch, Amanda Chen, Alexandria Do, Krista P. Balto, Zhe Wang, Qingyi Zhu, Yufei Wang, Yanning Zhang, Andrea R. Tao, Tod A. Pascal, Joshua S. Figueroa, Shaowei Li

Abstract: Direct imaging of single molecules at nanostructured interfaces is a grand challenge, with potential to enable new, precise material architectures and technologies. Of particular interest are the structural morphology and spectroscopic signatures of the adsorbed molecule, where modern probes are only now being developed with the necessary spatial and energetic resolution to provide detailed information at molecule-surface interface. Here, we directly visualize the binding of individual m-terphenyl isocyanide ligands to a reconstructed Au(111) surface through scanning tunneling microscopy (STM) and inelastic electron tunneling spectroscopy (IETS). The site-dependent steric pressure of the various surface features alters the vibrational fingerprints of the m-terphenyl isocyanides, which is characterized with single-molecule precision through joint experimental and theoretical approaches. This study for the first time provides molecular-level insights into the steric-pressure-enabled surface binding selectivity, as well as its effect on the chemical properties of individual surface-binding ligands.

Submitted 28 November, 2023; v1 submitted 27 June, 2023; originally announced June 2023.

arXiv:2306.11071 [pdf, other]

ColabFit Exchange: open-access datasets for data-driven interatomic potentials

Authors: Joshua A. Vita, Eric G. Fuemmeler, Amit Gupta, Gregory P. Wolfe, Alexander Quanming Tao, Ryan S. Elliott, Stefano Martiniani, Ellad B. Tadmor

Abstract: Data-driven (DD) interatomic potentials (IPs) trained on large collections of first principles calculations are rapidly becoming essential tools in the fields of computational materials science and chemistry for performing atomic-scale simulations. Despite this, apart from a few notable exceptions, there is a distinct lack of well-organized, public datasets in common formats available for use with IP development. This deficiency precludes the research community from implementing widespread benchmarking, which is essential for gaining insight into model performance and transferability, and also limits the development of more general, or even universal, IPs. To address this issue, we introduce the ColabFit Exchange, the first database providing open access to a large collection of systematically organized datasets from multiple domains that is especially designed for IP development. The ColabFit Exchange is publicly available at \url{https://colabfit.org/}, providing a web-based interface for exploring, downloading, and contributing datasets. Composed of data collected from the literature or provided by community researchers, the ColabFit Exchange currently (September 2023) consists of 139 datasets spanning nearly 70,000 unique chemistries, and is intended to continuously grow.
In addition to outlining the software framework used for constructing and accessing the ColabFit Exchange, we also provide analyses of the data, quantifying the diversity of the database and proposing metrics for assessing the relative diversity of multiple datasets. Finally, we demonstrate an end-to-end IP development pipeline, utilizing datasets from the ColabFit Exchange, fitting tools from the KLIFF software package, and validation tests provided by the OpenKIM framework.

Submitted 6 September, 2023; v1 submitted 19 June, 2023; originally announced June 2023.

arXiv:2306.06189 [pdf, other]

FasterViT: Fast Vision Transformers with Hierarchical Attention

Authors: Ali Hatamizadeh, Greg Heinrich, Hongxu Yin, Andrew Tao, Jose M. Alvarez, Jan Kautz, Pavlo Molchanov

Abstract: We design a new family of hybrid CNN-ViT neural networks, named FasterViT, with a focus on high image throughput for computer vision (CV) applications. FasterViT combines the benefits of fast local representation learning in CNNs and global modeling properties in ViT. Our newly introduced Hierarchical Attention (HAT) approach decomposes global self-attention with quadratic complexity into a multi-level attention with reduced computational costs. We benefit from efficient window-based self-attention. Each window has access to dedicated carrier tokens that participate in local and global representation learning. At a high level, global self-attentions enable the efficient cross-window communication at lower costs. FasterViT achieves a SOTA Pareto-front in terms of accuracy and image throughput. We have extensively validated its effectiveness on various CV tasks including classification, object detection and segmentation. We also show that HAT can be used as a plug-and-play module for existing networks and enhance them. We further demonstrate significantly faster and more accurate performance than competitive counterparts for images with high resolution. Code is available at https://github.com/NVlabs/FasterViT.

Submitted 1 April, 2024; v1 submitted 9 June, 2023; originally announced June 2023.

Comments: ICLR'24 Accepted Paper

arXiv:2306.02991 [pdf, other]

Second-scale rotational coherence and dipolar interactions in a gas of ultracold polar molecules

Authors: Philip D. Gregory, Luke M. Fernley, Albert Li Tao, Sarah L. Bromley, Jonathan Stepp, Zewen Zhang, Svetlana Kotochigova, Kaden R. A. Hazzard, Simon L. Cornish

Abstract: Ultracold polar molecules uniquely combine a rich structure of long-lived internal states with access to controllable long-range, anisotropic dipole-dipole interactions. In particular, the rotational states of polar molecules confined in optical tweezers or optical lattices may be used to encode interacting qubits for quantum computation or pseudo-spins for simulating quantum magnetism. As with all quantum platforms, the engineering of robust coherent superpositions of states is vital. However, for optically trapped molecules, the coherence time between rotational states is typically limited by inhomogeneous light shifts. Here we demonstrate a rotationally-magic optical trap for RbCs molecules that supports a Ramsey coherence time of 0.78(4) seconds in the absence of dipole-dipole interactions. This extends to >1.4 seconds at the 95% confidence level using a single spin-echo pulse. In our magic trap, dipolar interactions become the dominant mechanism by which Ramsey contrast is lost for superpositions that generate oscillating dipoles. By changing the states forming the superposition, we tune the effective dipole moment and show that the coherence time is inversely proportional to the strength of the dipolar interaction. Our work unlocks the full potential of the rotational degree of freedom in molecules for quantum computation and quantum simulation.

Submitted 11 August, 2023; v1 submitted 5 June, 2023; originally announced June 2023.

Comments: 12 pages, 7 figures (main text and supplementary information combined)

arXiv:2305.11102 [pdf, other]

Progressive Learning of 3D Reconstruction Network from 2D GAN Data

Authors: Aysegul Dundar, Jun Gao, Andrew Tao, Bryan Catanzaro

Abstract: This paper presents a method to reconstruct high-quality textured 3D models from single images. Current methods rely on datasets with expensive annotations: multi-view images and their camera parameters. Our method relies on GAN-generated multi-view image datasets, which have a negligible annotation cost. However, they are not strictly multi-view consistent, and sometimes GANs output distorted images. This results in degraded reconstruction quality. In this work, to overcome these limitations of generated datasets, we make two main contributions which lead us to achieve state-of-the-art results on challenging objects: 1) a robust multi-stage learning scheme that gradually relies more on the model's own predictions when calculating losses, and 2) a novel adversarial learning pipeline with online pseudo-ground truth generation to achieve fine details. Our work provides a bridge from 2D supervision of GAN models to 3D reconstruction models and removes the expensive annotation effort. We show significant improvements over previous methods, whether they were trained on GAN-generated multi-view images or on real images with expensive annotations. Please visit our web-page for 3D visuals: https://research.nvidia.com/labs/adlr/progressive-3d-learning

Submitted 18 May, 2023; originally announced May 2023.

Comments: Web-page: https://research.nvidia.com/labs/adlr/progressive-3d-learning. arXiv admin note: text overlap with arXiv:2203.09362

arXiv:2305.10474 [pdf, other]

Preserve Your Own Correlation: A Noise Prior for Video Diffusion Models

Authors: Songwei Ge, Seungjun Nah, Guilin Liu, Tyler Poon, Andrew Tao, Bryan Catanzaro, David Jacobs, Jia-Bin Huang, Ming-Yu Liu, Yogesh Balaji

Abstract: Despite tremendous progress in generating high-quality images using diffusion models, synthesizing a sequence of animated frames that are both photorealistic and temporally coherent is still in its infancy. While off-the-shelf billion-scale datasets for image generation are available, collecting similar video data of the same scale is still challenging. Also, training a video diffusion model is computationally much more expensive than its image counterpart. In this work, we explore finetuning a pretrained image diffusion model with video data as a practical solution for the video synthesis task. We find that naively extending the image noise prior to video noise prior in video diffusion leads to sub-optimal performance. Our carefully designed video noise prior leads to substantially better performance. Extensive experimental validation shows that our model, Preserve Your Own Correlation (PYoCo), attains SOTA zero-shot text-to-video results on the UCF-101 and MSR-VTT benchmarks. It also achieves SOTA video generation quality on the small-scale UCF-101 benchmark with a $10\times$ smaller model using significantly less computation than the prior art.

Submitted 25 March, 2024; v1 submitted 17 May, 2023; originally announced May 2023.

Comments: ICCV 2023. Project webpage: https://research.nvidia.com/labs/dir/pyoco

arXiv:2304.04869 [pdf, other]

doi 10.1088/1538-3873/acd1b5

The James Webb Space Telescope Mission

Authors: Jonathan P. Gardner, John C. Mather, Randy Abbott, James S. Abell, Mark Abernathy, Faith E. Abney, John G. Abraham, Roberto Abraham, Yasin M. Abul-Huda, Scott Acton, Cynthia K. Adams, Evan Adams, David S. Adler, Maarten Adriaensen, Jonathan Albert Aguilar, Mansoor Ahmed, Nasif S. Ahmed, Tanjira Ahmed, Rüdeger Albat, Loïc Albert, Stacey Alberts, David Aldridge, Mary Marsha Allen, Shaune S. Allen, Martin Altenburg, et al. (983 additional authors not shown)

Abstract: Twenty-six years ago a small committee report, building on earlier studies, expounded a compelling and poetic vision for the future of astronomy, calling for an infrared-optimized space telescope with an aperture of at least $4m$. With the support of their governments in the US, Europe, and Canada, 20,000 people realized that vision as the $6.5m$ James Webb Space Telescope. A generation of astronomers will celebrate their accomplishments for the life of the mission, potentially as long as 20 years, and beyond. This report and the scientific discoveries that follow are extended thank-you notes to the 20,000 team members. The telescope is working perfectly, with much better image quality than expected. In this and accompanying papers, we give a brief history, describe the observatory, outline its objectives and current observing program, and discuss the inventions and people who made it possible. We cite detailed reports on the design and the measured performance on orbit.

Submitted 10 April, 2023; originally announced April 2023.

Comments: Accepted by PASP for the special issue on The James Webb Space Telescope Overview, 29 pages, 4 figures

arXiv:2211.02152 [pdf, other]

Binary-Continuous Sum-of-ratios Optimization: Discretization, Approximations, and Convex Reformulations

Authors: Tien Mai, Ngan Ha Duong, Thuy Anh Ta

Abstract: We study a class of non-convex sum-of-ratios programs which can be used for decision-making in prominent areas such as product assortment and price optimization, facility location, and security games. Such an optimization problem involves both continuous and binary decision variables and is known to be highly non-convex and intractable to solve. We explore a discretization approach to approximate the optimization problem and show that the approximate program can be reformulated as mixed-integer linear or second-order cone programs, which can be conveniently handled by an off-the-shelf solver (e.g., CPLEX or GUROBI). We further establish (mild) conditions under which solutions to the approximate problem converge to optimal solutions as the number of discretization points increases. We also provide approximation bounds for solutions obtained from the approximate problem. We show how our approach applies to product assortment and price optimization, maximum covering facility location, and Bayesian Stackelberg security games and provide experimental results to evaluate the efficiency of our approach.

Submitted 3 November, 2022; originally announced November 2022.

arXiv:2210.08159 [pdf, other]

doi 10.1109/TCSVT.2024.3351680

Dynamics-aware Adversarial Attack of Adaptive Neural Networks

Authors: An Tao, Yueqi Duan, Yingqi Wang, Jiwen Lu, Jie Zhou

Abstract: In this paper, we investigate the dynamics-aware adversarial attack problem of adaptive neural networks. Most existing adversarial attack algorithms are designed under a basic assumption -- the network architecture is fixed throughout the attack process. However, this assumption does not hold for many recently proposed adaptive neural networks, which adaptively deactivate unnecessary execution units based on inputs to improve computational efficiency. It results in a serious issue of lagged gradient, making the learned attack at the current step ineffective due to the architecture change afterward. To address this issue, we propose a Leaded Gradient Method (LGM) and show the significant effects of the lagged gradient. More specifically, we reformulate the gradients to be aware of the potential dynamic changes of network architectures, so that the learned attack better "leads" the next step than the dynamics-unaware methods when network architecture changes dynamically. Extensive experiments on representative types of adaptive neural networks for both 2D images and 3D point clouds show that our LGM achieves impressive adversarial attack performance compared with the dynamic-unaware attack methods. Code is available at https://github.com/antao97/LGM.

Submitted 10 January, 2024; v1 submitted 14 October, 2022; originally announced October 2022.

Comments: arXiv admin note: text overlap with arXiv:2112.09428

Journal ref: IEEE Transactions on Circuits and Systems for Video Technology, 2024

arXiv:2205.07345 [pdf, other]

Joint Location and Cost Planning in Maximum Capture Facility Location under Multiplicative Random Utility Maximization

Authors: Ngan Ha Duong, Tien Thanh Dam, Thuy Anh Ta, Tien Mai

Abstract: We study a joint facility location and cost planning problem in a competitive market under random utility maximization (RUM) models. The objective is to locate new facilities and make decisions on the costs (or budgets) to spend on the new facilities, aiming to maximize an expected captured customer demand, assuming that customers choose a facility among all available facilities according to a RUM model. We examine two RUM frameworks in the discrete choice literature, namely, the additive and multiplicative RUM. While the former has been widely used in facility location problems, we are the first to explore the latter in this context. We numerically show that the two RUM frameworks can well approximate each other in the context of the cost optimization problem. In addition, we show that, under the additive RUM framework, the resultant cost optimization problem becomes highly non-convex and may have several local optima. In contrast, the use of the multiplicative RUM brings several advantages to the competitive facility location problem. For instance, the cost optimization problem under the multiplicative RUM can be solved efficiently by a general convex optimization solver or can be reformulated as a conic quadratic program and handled by a conic solver available in some off-the-shelf solvers such as CPLEX or GUROBI.
Furthermore, we consider a joint location and cost optimization problem under the multiplicative RUM and propose three approaches to solve the problem, namely, an equivalent conic reformulation, a multi-cut outer-approximation algorithm, and a local search heuristic. We provide numerical experiments based on synthetic instances of various sizes to evaluate the performances of the proposed algorithms in solving the cost optimization, and the joint location and cost optimization problems.

Submitted 11 February, 2023; v1 submitted 15 May, 2022; originally announced May 2022.

Journal ref: Computer and Operations Research (2023)

arXiv:2203.09362 [pdf, other]

Fine Detailed Texture Learning for 3D Meshes with Generative Models

Authors: Aysegul Dundar, Jun Gao, Andrew Tao, Bryan Catanzaro

Abstract: This paper presents a method to reconstruct high-quality textured 3D models from both multi-view and single-view images. The reconstruction is posed as an adaptation problem and is done progressively: in the first stage, we focus on learning accurate geometry, whereas in the second stage, we focus on learning the texture with a generative adversarial network. In the generative learning pipeline, we propose two improvements. First, since the learned textures should be spatially aligned, we propose an attention mechanism that relies on the learnable positions of pixels. Second, since the discriminator receives aligned texture maps, we augment its input with a learnable embedding, which improves the feedback to the generator. We achieve significant improvements on multi-view sequences from the Tripod dataset as well as on the single-view image datasets Pascal 3D+ and CUB. We demonstrate that our method achieves superior 3D textured models compared to previous works. Please visit our web-page for 3D visuals.

Submitted 17 March, 2022; originally announced March 2022.

arXiv:2202.00011 [pdf, other]

Leveraging Bitstream Metadata for Fast, Accurate, Generalized Compressed Video Quality Enhancement

Authors: Max Ehrlich, Jon Barker, Namitha Padmanabhan, Larry Davis, Andrew Tao, Bryan Catanzaro, Abhinav Shrivastava

Abstract: Video compression is a central feature of the modern internet powering technologies from social media to video conferencing. While video compression continues to mature, for many compression settings, quality loss is still noticeable. These settings nevertheless have important applications to the efficient transmission of videos over bandwidth constrained or otherwise unstable connections. In this work, we develop a deep learning architecture capable of restoring detail to compressed videos which leverages the underlying structure and motion information embedded in the video bitstream. We show that this improves restoration accuracy compared to prior compression correction methods and is competitive when compared with recent deep-learning-based video compression methods on rate-distortion while achieving higher throughput. Furthermore, we condition our model on quantization data which is readily available in the bitstream. This allows our single model to handle a variety of different compression quality settings which required an ensemble of models in prior work.

Submitted 30 October, 2023; v1 submitted 31 January, 2022; originally announced February 2022.

Comments: WACV 2024

arXiv:2112.09428

Dynamics-aware Adversarial Attack of 3D Sparse Convolution Network

Authors: An Tao, Yueqi Duan, He Wang, Ziyi Wu, Pengliang Ji, Haowen Sun, Jie Zhou, Jiwen Lu

Abstract: In this paper, we investigate the dynamics-aware adversarial attack problem in deep neural networks. Most existing adversarial attack algorithms are designed under a basic assumption -- the network architecture is fixed throughout the attack process. However, this assumption does not hold for many recently proposed networks, e.g. 3D sparse convolution network, which contains input-dependent execution to improve computational efficiency. It results in a serious issue of lagged gradient, making the learned attack at the current step ineffective due to the architecture changes afterward. To address this issue, we propose a Leaded Gradient Method (LGM) and show the significant effects of the lagged gradient. More specifically, we re-formulate the gradients to be aware of the potential dynamic changes of network architectures, so that the learned attack better "leads" the next step than the dynamics-unaware methods when network architecture changes dynamically. Extensive experiments on various datasets show that our LGM achieves impressive performance on semantic segmentation and classification. Compared with the dynamic-unaware methods, LGM achieves about 20% lower mIoU on average on the ScanNet and S3DIS datasets. LGM also outperforms the recent point cloud attacks.

Submitted 20 January, 2023; v1 submitted 17 December, 2021; originally announced December 2021.

Comments: We have improved the quality of this work and updated a new version to address the limitations of the proposed method

arXiv:2111.13587 [pdf, other]

Adaptive Fourier Neural Operators: Efficient Token Mixers for Transformers

Authors: John Guibas, Morteza Mardani, Zongyi Li, Andrew Tao, Anima Anandkumar, Bryan Catanzaro

Abstract: Vision transformers have delivered tremendous success in representation learning. This is primarily due to effective token mixing through self-attention. However, this scales quadratically with the number of pixels, which becomes infeasible for high-resolution inputs. To cope with this challenge, we propose Adaptive Fourier Neural Operator (AFNO) as an efficient token mixer that learns to mix in the Fourier domain. AFNO is based on a principled foundation of operator learning which allows us to frame token mixing as a continuous global convolution without any dependence on the input resolution. This principle was previously used to design FNO, which solves global convolution efficiently in the Fourier domain and has shown promise in learning challenging PDEs. To handle challenges in visual representation learning such as discontinuities in images and high-resolution inputs, we propose principled architectural modifications to FNO which result in memory and computational efficiency. This includes imposing a block-diagonal structure on the channel mixing weights, adaptively sharing weights across tokens, and sparsifying the frequency modes via soft-thresholding and shrinkage. The resulting model is highly parallel with a quasi-linear complexity and has linear memory in the sequence size. AFNO outperforms self-attention mechanisms for few-shot segmentation in terms of both efficiency and accuracy. For Cityscapes segmentation with the Segformer-B3 backbone, AFNO can handle a sequence size of 65k and outperforms other efficient self-attention mechanisms.

Submitted 27 March, 2022; v1 submitted 24 November, 2021; originally announced November 2021.

arXiv:2110.08497 [pdf, other]

Robust Maximum Capture Facility Location under Random Utility Maximization Models

Authors: Anh Thuy Ta, Tien Thanh Dam, Tien Mai

Abstract: We study a robust version of the maximum capture facility location problem in a competitive market, assuming that each customer chooses among all available facilities according to a random utility maximization (RUM) model. We employ the generalized extreme value (GEV) family of models and assume that the parameters of the RUM model are not given exactly but lie in convex uncertainty sets. The problem is to locate new facilities to maximize the worst-case captured user demand. We show that, interestingly, our robust model preserves the monotonicity and submodularity from its deterministic counterpart, implying that a simple greedy heuristic can guarantee a (1-1/e) approximation solution. We further show the concavity of the objective function under the classical multinomial logit (MNL) model, suggesting that an outer-approximation algorithm can be used to solve the robust model under MNL to optimality. We conduct experiments comparing our robust method to other deterministic and sampling approaches, using instances from different discrete choice models. Our results clearly demonstrate the advantages of our robust model in protecting the decision-maker from bad-case scenarios.

Submitted 11 February, 2023; v1 submitted 16 October, 2021; originally announced October 2021.

Journal ref: European Journal of Operational Research (2023)

arXiv:2108.13394 [pdf, ps, other]

Topology of augmented Bergman complexes

Authors: Elisabeth Bullock, Aidan Kelley, Victor Reiner, Kevin Ren, Gahl Shemy, Dawei Shen, Brian Sun, Amy Tao, Zhichun Joy Zhang

Abstract: The augmented Bergman complex of a matroid is a simplicial complex introduced recently in work of Braden, Huh, Matherne, Proudfoot and Wang. It may be viewed as a hybrid of two well-studied pure shellable simplicial complexes associated to matroids: the independent set complex and Bergman complex. It is shown here that the augmented Bergman complex is also shellable, via two different families of shelling orders. Furthermore, comparing the description of its homotopy type induced from the two shellings re-interprets a known convolution formula counting bases of the matroid. The representation of the automorphism group of the matroid on the homology of the augmented Bergman complex turns out to have a surprisingly simple description. This last fact is generalized to closures beyond those coming from a matroid.

Submitted 16 September, 2021; v1 submitted 30 August, 2021; originally announced August 2021.

Comments: Very minor edits

MSC Class: 05B35; 52B22; 06A07

arXiv:2106.06533 [pdf, other]

View Generalization for Single Image Textured 3D Models

Authors: Anand Bhattad, Aysegul Dundar, Guilin Liu, Andrew Tao, Bryan Catanzaro

Abstract: Humans can easily infer the underlying 3D geometry and texture of an object only from a single 2D image. Current computer vision methods can do this, too, but suffer from view generalization problems - the models inferred tend to make poor predictions of appearance in novel views. As for generalization problems in machine learning, the difficulty is balancing single-view accuracy (cf. training error; bias) with novel view accuracy (cf. test error; variance). We describe a class of models whose geometric rigidity is easily controlled to manage this tradeoff. We describe a cycle consistency loss that improves view generalization (roughly, a model from a generated view should predict the original view well). View generalization of textures requires that models share texture information, so a car seen from the back still has headlights because other cars have headlights. We describe a cycle consistency loss that encourages model textures to be aligned, so as to encourage sharing. We compare our method against the state-of-the-art method and show both qualitative and quantitative improvements.

Submitted 10 June, 2021; originally announced June 2021.

Comments: CVPR 2021. Project website: https://nv-adlr.github.io/view-generalization

arXiv:2104.02983 [pdf, other]

Optimal fire allocation in a combat model of mixed NCW type

Authors: My A. Vu, Nam H. Nguyen, Hanh Le T. Nguyen, Anh N. Ta, Mong H. Nguyen

Abstract: In this work, we introduce a nonlinear Lanchester model of NCW-type and study the problem of finding the optimal fire allocation for this model. A Blue party $B$ fights against a Red party consisting of $A$ and $R$, where $A$ is an independent force and $R$ fights with support from a supply unit $N$. A battle may consist of several stages, but we consider the problem of finding the optimal fire allocation for $B$ in the first stage only. An optimal fire allocation is a set of three non-negative numbers whose sum equals one, such that the remaining force of $B$ is maximal at any instant. In order to tackle this problem, we introduce the notion of \textit{threatening rates}, which are computed for $A, R, N$ at the beginning of the battle. Numerical illustrations are presented to justify the theoretical findings.

Submitted 7 April, 2021; originally announced April 2021.

arXiv:2103.16748 [pdf, other]

Dual Contrastive Loss and Attention for GANs

Authors: Ning Yu, Guilin Liu, Aysegul Dundar, Andrew Tao, Bryan Catanzaro, Larry Davis, Mario Fritz

Abstract: Generative Adversarial Networks (GANs) produce impressive results on unconditional image generation when powered with large-scale image datasets. Yet generated images are still easy to spot, especially on datasets with high variance (e.g. bedroom, church). In this paper, we propose various improvements to further push the boundaries in image generation. Specifically, we propose a novel dual contrastive loss and show that, with this loss, the discriminator learns more generalized and distinguishable representations to incentivize generation. In addition, we revisit attention and extensively experiment with different attention blocks in the generator. We find attention to be still an important module for successful image generation even though it was not used in the recent state-of-the-art models. Lastly, we study different attention architectures in the discriminator, and propose a reference attention mechanism. By combining the strengths of these remedies, we improve the compelling state-of-the-art Fréchet Inception Distance (FID) by at least 17.5% on several benchmark datasets. We obtain even more significant improvements on compositional synthetic scenes (up to 47.5% in FID). Code and models are available at https://github.com/ningyu1991/AttentionDualContrastGAN.

Submitted 17 March, 2022; v1 submitted 30 March, 2021; originally announced March 2021.

Comments: Accepted to ICCV'21

arXiv:2102.05754 [pdf, ps, other]

doi 10.1016/j.ejor.2021.09.006

Submodularity and Local Search Approaches for Maximum Capture Problems under Generalized Extreme Value Models

Authors: Tien Thanh Dam, Thuy Anh Ta, Tien Mai

Abstract: We study the maximum capture problem in facility location under random utility models, i.e., the problem of seeking to locate new facilities in a competitive market such that the captured user demand is maximized, assuming that each customer chooses among all available facilities according to a random utility maximization model. We employ the generalized extreme value (GEV) family of discrete choice models and show that the objective function in this context is monotonic and submodular. This finding implies that a simple greedy heuristic can always guarantee a (1-1/e) approximation solution. We further develop a new algorithm combining a greedy heuristic, a gradient-based local search and an exchanging procedure to efficiently solve the problem. We conduct experiments using instances of different sizes and under different discrete choice models, and we show that our approach significantly outperforms prior approaches in terms of both returned objective value and CPU time. Our algorithm and theoretical findings can be applied to the maximum capture problems under various random utility models in the literature, including the popular multinomial logit, nested logit, cross nested logit, and the mixed logit models.

Submitted 10 February, 2021; originally announced February 2021.

Journal ref: European Journal of Operational Research 300 (2022) 953-965

arXiv:2012.10217 [pdf, other]

doi 10.1109/TIP.2022.3190709

SegGroup: Seg-Level Supervision for 3D Instance and Semantic Segmentation

Authors: An Tao, Yueqi Duan, Yi Wei, Jiwen Lu, Jie Zhou

Abstract: Most existing point cloud instance and semantic segmentation methods rely heavily on strong supervision signals, which require point-level labels for every point in the scene. However, such strong supervision suffers from large annotation costs, arousing the need to study efficient annotating. In this paper, we discover that the locations of instances matter for both instance and semantic 3D scene segmentation. By fully taking advantage of locations, we design a weakly-supervised point cloud segmentation method that only requires clicking on one point per instance to indicate its location for annotation. With over-segmentation for pre-processing, we extend these location annotations into segments as seg-level labels. We further design a segment grouping network (SegGroup) to generate point-level pseudo labels under seg-level labels by hierarchically grouping the unlabeled segments into the relevant nearby labeled segments, so that existing point-level supervised segmentation models can directly consume these pseudo labels for training. Experimental results show that our seg-level supervised method (SegGroup) achieves comparable results with the fully annotated point-level supervised methods. Moreover, it outperforms the recent weakly-supervised methods given a fixed annotation budget. Code is available at https://github.com/AnTao97/SegGroup.

Submitted 24 July, 2022; v1 submitted 18 December, 2020; originally announced December 2020.

Journal ref: IEEE Transactions on Image Processing, vol. 31, pp. 4952-4965, 2022

arXiv:2009.00197 [pdf, other]

Deep unsupervised learning for Microscopy-Based Malaria detection

Authors: Alexander Tao, Boran Han

Abstract: Malaria, a mosquito-borne disease caused by a parasite, kills over 1 million people globally each year. People, if left untreated, may develop severe complications, leading to death. Effective and accurate diagnosis is important for the management and control of malaria. Our research focuses on utilizing machine learning to improve the efficiency of malaria diagnosis. We utilize a modified U-net architecture, as an unsupervised learning model, to conduct cell boundary detection. The blood cells infected by malaria are then identified in chromatic space by a Mahalanobis distance algorithm. Both the cell segmentation and malaria detection processes often require intensive manual labeling, which we hope to eliminate via the unsupervised workflow.

Submitted 31 August, 2020; originally announced September 2020.

arXiv:2008.05250 [pdf, ps, other]

Optimizing fire allocation in a NCW-type model

Authors: Nam Hong Nguyen, My Anh Vu, Dinh Van Bui, Anh Ngoc Ta, Manh Duc Hy

Abstract: In this paper, we introduce a non-linear Lanchester model of NCW-type and investigate an optimization problem for this model, where only the Red force is supplied by several supply agents. Optimal fire allocation of the Blue force is sought in the form of a piece-wise constant function of time. A threatening rate is computed for the Red force and each of its supply agents at the beginning of each stage of the combat. These rates can be used to derive the optimal decision for the Blue force to focus its firepower to the Red force itself or one of its supply agents. This optimal fire allocation is derived and proved by considering an optimization problem of number of Blue force troops. Numerical experiments are included to demonstrate the theoretical results.

Submitted 12 August, 2020; originally announced August 2020.

Comments: 6 pages on NCW-type model

arXiv:2007.07243 [pdf, other]

Transposer: Universal Texture Synthesis Using Feature Maps as Transposed Convolution Filter

Authors: Guilin Liu, Rohan Taori, Ting-Chun Wang, Zhiding Yu, Shiqiu Liu, Fitsum A. Reda, Karan Sapra, Andrew Tao, Bryan Catanzaro

Abstract: Conventional CNNs for texture synthesis consist of a sequence of (de)-convolution and up/down-sampling layers, where each layer operates locally and lacks the ability to capture the long-term structural dependency required by texture synthesis. Thus, they often simply enlarge the input texture, rather than perform reasonable synthesis. As a compromise, many recent methods sacrifice generalizability by training and testing on the same single (or fixed set of) texture image(s), resulting in huge re-training time costs for unseen images. In this work, based on the discovery that the assembling/stitching operation in traditional texture synthesis is analogous to a transposed convolution operation, we propose a novel way of using the transposed convolution operation. Specifically, we directly treat the whole encoded feature map of the input texture as transposed convolution filters and the features' self-similarity map, which captures the auto-correlation information, as input to the transposed convolution. Such a design allows our framework, once trained, to be generalizable to perform synthesis of unseen textures with a single forward pass in nearly real-time. Our method achieves state-of-the-art texture synthesis quality based on various metrics. While self-similarity helps preserve the input textures' regular structural patterns, our framework can also take random noise maps for irregular input textures instead of self-similarity maps as transposed convolution inputs. This yields more diverse results and allows arbitrarily large texture outputs to be generated by directly sampling large noise maps in a single pass.

Submitted 14 July, 2020; originally announced July 2020.

arXiv:2005.10821 [pdf, other]

Hierarchical Multi-Scale Attention for Semantic Segmentation

Authors: Andrew Tao, Karan Sapra, Bryan Catanzaro

Abstract: Multi-scale inference is commonly used to improve the results of semantic segmentation. Multiple image scales are passed through a network and then the results are combined with averaging or max pooling. In this work, we present an attention-based approach to combining multi-scale predictions. We show that predictions at certain scales are better at resolving particular failure modes, and that the network learns to favor those scales for such cases in order to generate better predictions. Our attention mechanism is hierarchical, which enables it to be roughly 4x more memory efficient to train than other recent approaches. In addition to enabling faster training, this allows us to train with larger crop sizes, which leads to greater model accuracy. We demonstrate the result of our method on two datasets: Cityscapes and Mapillary Vistas. For Cityscapes, which has a large number of weakly labelled images, we also leverage auto-labelling to improve generalization. Using our approach we achieve new state-of-the-art results in both Mapillary (61.1 IOU val) and Cityscapes (85.1 IOU test).

Submitted 21 May, 2020; originally announced May 2020.

Comments: 11 pages, 5 figures

arXiv:2004.10289 [pdf, other]

Panoptic-based Image Synthesis

Authors: Aysegul Dundar, Karan Sapra, Guilin Liu, Andrew Tao, Bryan Catanzaro

Abstract: Conditional image synthesis for generating photorealistic images serves various applications, from content editing to content generation. Previous conditional image synthesis algorithms mostly rely on semantic maps, and often fail in complex environments where multiple instances occlude each other. We propose a panoptic aware image synthesis network to generate high fidelity and photorealistic images conditioned on panoptic maps which unify semantic and instance information. To achieve this, we efficiently use panoptic maps in convolution and upsampling layers. We show that with the proposed changes to the generator, we can improve on the previous state-of-the-art methods by generating images in complex instance interaction environments in higher fidelity and tiny objects in more detail. Furthermore, our proposed method also outperforms the previous state-of-the-art methods in metrics of mean IoU (Intersection over Union), and detAP (Detection Average Precision).

Submitted 21 April, 2020; originally announced April 2020.

Comments: CVPR 2020

arXiv:2001.09518 [pdf, other]

Unsupervised Disentanglement of Pose, Appearance and Background from Images and Videos

Authors: Aysegul Dundar, Kevin J. Shih, Animesh Garg, Robert Pottorf, Andrew Tao, Bryan Catanzaro

Abstract: Unsupervised landmark learning is the task of learning semantic keypoint-like representations without the use of expensive input keypoint-level annotations. A popular approach is to factorize an image into a pose and appearance data stream, then to reconstruct the image from the factorized components. The pose representation should capture a set of consistent and tightly localized landmarks in order to facilitate reconstruction of the input image. Ultimately, we wish for our learned landmarks to focus on the foreground object of interest. However, the reconstruction task of the entire image forces the model to allocate landmarks to model the background. This work explores the effects of factorizing the reconstruction task into separate foreground and background reconstructions, conditioning only the foreground reconstruction on the unsupervised landmarks. Our experiments demonstrate that the proposed factorization results in landmarks that are focused on the foreground object of interest. Furthermore, the rendered background quality is also improved, as the background rendering pipeline no longer requires the ill-suited landmarks to model its pose and appearance. We demonstrate this improvement in the context of the video-prediction task.

Submitted 26 January, 2020; originally announced January 2020.

arXiv:1912.11683 [pdf, other]

Neural ODEs for Image Segmentation with Level Sets

Authors: Rafael Valle, Fitsum Reda, Mohammad Shoeybi, Patrick Legresley, Andrew Tao, Bryan Catanzaro

Abstract: We propose a novel approach for image segmentation that combines Neural Ordinary Differential Equations (NODEs) and the Level Set method. Our approach parametrizes the evolution of an initial contour with a NODE that implicitly learns from data a speed function describing the evolution. In addition, for cases where an initial contour is not available and to alleviate the need for careful choice or design of contour embedding functions, we propose a NODE-based method that evolves an image embedding into a dense per-pixel semantic label space. We evaluate our methods on kidney segmentation (KiTS19) and on salient object detection (PASCAL-S, ECSSD and HKU-IS). In addition to improving initial contours provided by deep learning models while using a fraction of their number of parameters, our approach achieves F scores that are higher than several state-of-the-art deep learning algorithms.

Submitted 25 December, 2019; originally announced December 2019.

arXiv:1910.12713 [pdf, other]

Few-shot Video-to-Video Synthesis

Authors: Ting-Chun Wang, Ming-Yu Liu, Andrew Tao, Guilin Liu, Jan Kautz, Bryan Catanzaro

Abstract: Video-to-video synthesis (vid2vid) aims at converting an input semantic video, such as videos of human poses or segmentation masks, to an output photorealistic video. While the state-of-the-art of vid2vid has advanced significantly, existing approaches share two major limitations. First, they are data-hungry. Numerous images of a target human subject or a scene are required for training. Second, a learned model has limited generalization capability. A pose-to-human vid2vid model can only synthesize poses of the single person in the training set. It does not generalize to other humans that are not in the training set. To address the limitations, we propose a few-shot vid2vid framework, which learns to synthesize videos of previously unseen subjects or scenes by leveraging few example images of the target at test time. Our model achieves this few-shot generalization capability via a novel network weight generation module utilizing an attention mechanism. We conduct extensive experimental validations with comparisons to strong baselines using several large-scale video datasets including human-dancing videos, talking-head videos, and street-scene videos. The experimental results verify the effectiveness of the proposed framework in addressing the two limitations of existing vid2vid approaches.

Submitted 28 October, 2019; originally announced October 2019.

Comments: In NeurIPS, 2019

arXiv:1909.02749 [pdf, other]

Video Interpolation and Prediction with Unsupervised Landmarks

Authors: Kevin J. Shih, Aysegul Dundar, Animesh Garg, Robert Pottorf, Andrew Tao, Bryan Catanzaro

Abstract: Prediction and interpolation for long-range video data involve the complex task of modeling motion trajectories for each visible object, occlusions and dis-occlusions, as well as appearance changes due to viewpoint and lighting. Optical flow based techniques generalize but are suitable only for short temporal ranges. Many methods opt to project the video frames to a low dimensional latent space, achieving long-range predictions. However, these latent representations are often non-interpretable, and therefore difficult to manipulate. This work poses video prediction and interpolation as unsupervised latent structure inference followed by a temporal prediction in this latent space. The latent representations capture foreground semantics without explicit supervision such as keypoints or poses. Further, as each landmark can be mapped to a coordinate indicating where a semantic part is positioned, we can reliably interpolate within the coordinate domain to achieve predictable motion interpolation. Given an image decoder capable of mapping these landmarks back to the image domain, we are able to achieve high-quality long-range video interpolation and extrapolation by operating on the landmark representation space.

Submitted 6 September, 2019; originally announced September 2019.

Comments: Technical Report

arXiv:1906.05928 [pdf, other]

Unsupervised Video Interpolation Using Cycle Consistency

Authors: Fitsum A. Reda, Deqing Sun, Aysegul Dundar, Mohammad Shoeybi, Guilin Liu, Kevin J. Shih, Andrew Tao, Jan Kautz, Bryan Catanzaro

Abstract: Learning to synthesize high frame rate videos via interpolation requires large quantities of high frame rate training videos, which, however, are scarce, especially at high resolutions. Here, we propose unsupervised techniques to synthesize high frame rate videos directly from low frame rate videos using cycle consistency. For a triplet of consecutive frames, we optimize models to minimize the discrepancy between the center frame and its cycle reconstruction, obtained by interpolating back from interpolated intermediate frames. This simple unsupervised constraint alone achieves results comparable with supervision using the ground truth intermediate frames. We further introduce a pseudo supervised loss term that enforces the interpolated frames to be consistent with predictions of a pre-trained interpolation model. The pseudo supervised loss term, used together with cycle consistency, can effectively adapt a pre-trained model to a new target domain. With no additional data and in a completely unsupervised fashion, our techniques significantly improve pre-trained models on new target domains, increasing PSNR values from 32.84dB to 33.05dB on the Slowflow and from 31.82dB to 32.53dB on the Sintel evaluation datasets.

Submitted 27 March, 2021; v1 submitted 13 June, 2019; originally announced June 2019.

Comments: Published in ICCV 2019. Code is available at https://github.com/NVIDIA/unsupervised-video-interpolation. Project website: https://nv-adlr.github.io/publication/2019-UnsupervisedVideoInterpolation

arXiv:1905.10914 [pdf, ps, other]

doi 10.1007/s00373-020-02176-7

Consecutive Detecting Arrays for Interaction Faults

Authors: Ce Shi, Ling Jiang, Aiyuan Tao

Abstract: The concept of detecting arrays was developed to locate and detect interaction faults arising between the factors in a component-based system during software testing. In this paper, we propose a family of consecutive detecting arrays (CDAs) in which the interactions between factors are considered to be ordered. CDAs can be used to generate test suites for locating and detecting interaction faults between neighboring factors. We establish a general criterion for measuring the optimality of CDAs in terms of their size. Based on this optimality criterion, the equivalence between optimum CDAs and consecutive orthogonal arrays with prescribed properties is explored. Using the advantages of this equivalence, a great number of optimum CDAs are presented. In particular, the existence of optimum CDAs with few factors is almost completely determined.

Submitted 25 January, 2024; v1 submitted 26 May, 2019; originally announced May 2019.

MSC Class: 05B15; 05B20; 62K15; 94C12

Journal ref: Graphs and Combinatorics 36, 1203-1218 (2020)

arXiv:1903.02728 [pdf, other]

Graphical Contrastive Losses for Scene Graph Parsing

Authors: Ji Zhang, Kevin J. Shih, Ahmed Elgammal, Andrew Tao, Bryan Catanzaro

Abstract: Most scene graph parsers use a two-stage pipeline to detect visual relationships: the first stage detects entities, and the second predicts the predicate for each entity pair using a softmax distribution. We find that such pipelines, trained with only a cross entropy loss over predicate classes, suffer from two common errors. The first, Entity Instance Confusion, occurs when the model confuses multiple instances of the same type of entity (e.g. multiple cups). The second, Proximal Relationship Ambiguity, arises when multiple subject-predicate-object triplets appear in close proximity with the same predicate, and the model struggles to infer the correct subject-object pairings (e.g. mis-pairing musicians and their instruments). We propose a set of contrastive loss formulations that specifically target these types of errors within the scene graph parsing problem, collectively termed the Graphical Contrastive Losses. These losses explicitly force the model to disambiguate related and unrelated instances through margin constraints specific to each type of confusion. We further construct a relationship detector, called RelDN, using the aforementioned pipeline to demonstrate the efficacy of our proposed losses. Our model outperforms the winning method of the OpenImages Relationship Detection Challenge by 4.7% (16.5% relative) on the test set. We also show improved results over the best previous methods on the Visual Genome and Visual Relationship Detection datasets.

Submitted 16 August, 2019; v1 submitted 7 March, 2019; originally announced March 2019.

arXiv:1812.01593 [pdf, other]

Improving Semantic Segmentation via Video Propagation and Label Relaxation

Authors: Yi Zhu, Karan Sapra, Fitsum A. Reda, Kevin J. Shih, Shawn Newsam, Andrew Tao, Bryan Catanzaro

Abstract: Semantic segmentation requires large amounts of pixel-wise annotations to learn accurate models. In this paper, we present a video prediction-based methodology to scale up training sets by synthesizing new training samples in order to improve the accuracy of semantic segmentation networks. We exploit video prediction models' ability to predict future frames in order to also predict future labels. A joint propagation strategy is also proposed to alleviate mis-alignments in synthesized samples. We demonstrate that training segmentation models on datasets augmented by the synthesized samples leads to significant improvements in accuracy. Furthermore, we introduce a novel boundary label relaxation technique that makes training robust to annotation noise and propagation artifacts along object boundaries. Our proposed methods achieve state-of-the-art mIoUs of 83.5% on Cityscapes and 82.9% on CamVid. Our single model, without model ensembles, achieves 72.8% mIoU on the KITTI semantic segmentation test set, which surpasses the winning entry of the ROB challenge 2018. Our code and videos can be found at https://nv-adlr.github.io/publication/2018-Segmentation.

Submitted 2 July, 2019; v1 submitted 4 December, 2018; originally announced December 2018.

Comments: CVPR 2019 Oral. Code link: https://github.com/NVIDIA/semantic-segmentation. YouTube link: https://www.youtube.com/watch?v=aEbXjGZDZSQ

arXiv:1811.11718 [pdf, other]

Partial Convolution based Padding

Authors: Guilin Liu, Kevin J. Shih, Ting-Chun Wang, Fitsum A. Reda, Karan Sapra, Zhiding Yu, Andrew Tao, Bryan Catanzaro

Abstract: In this paper, we present a simple yet effective padding scheme that can be used as a drop-in module for existing convolutional neural networks. We call it partial convolution based padding, with the intuition that the padded region can be treated as holes and the original input as non-holes. Specifically, during the convolution operation, the convolution results are re-weighted near image borders based on the ratios between the padded area and the convolution sliding window area. Extensive experiments with various deep network models on ImageNet classification and semantic segmentation demonstrate that the proposed padding scheme consistently outperforms standard zero padding with better accuracy.

Submitted 28 November, 2018; originally announced November 2018.

Comments: 11 pages; code is available at https://github.com/NVIDIA/partialconv

arXiv:1811.09543 [pdf, other]

An Interpretable Model for Scene Graph Generation

Authors: Ji Zhang, Kevin Shih, Andrew Tao, Bryan Catanzaro, Ahmed Elgammal

Abstract: We propose an efficient and interpretable scene graph generator. We consider three types of features: visual, spatial and semantic, and we use a late fusion strategy such that each feature's contribution can be explicitly investigated. We study the key factors about these features that have the most impact on the performance, and also visualize the learned visual features for relationships and investigate the efficacy of our model. We won the OpenImages Visual Relationship Detection Challenge on Kaggle, outperforming the 2nd place by 5% (20% relative). We believe an accurate scene graph generator is a fundamental stepping stone for higher-level vision-language tasks such as image captioning and visual QA, since it provides a semantic, structured comprehension of an image that is beyond pixels and objects.

Submitted 21 November, 2018; originally announced November 2018.

Comments: arXiv admin note: substantial text overlap with arXiv:1811.00662

Showing 1–50 of 59 results for author: Tao, A