Search | arXiv e-print repository

When Learning Meets Dynamics: Distributed User Connectivity Maximization in UAV-Based Communication Networks

Authors: Bowei Li, Saugat Tripathi, Salman Hosain, Ran Zhang, Jiang Xie, Miao Wang

Abstract: Distributed management over Unmanned Aerial Vehicle (UAV) based communication networks (UCNs) has attracted increasing research attention. In this work, we study a distributed user connectivity maximization problem in a UCN. The work features a horizontal study over different levels of information exchange during the distributed iteration and a consideration of dynamics in the UAV set and user distribution, which are not well addressed in existing works. Specifically, the studied problem is first formulated as a time-coupled mixed-integer non-convex optimization problem. A heuristic two-stage UAV-user association policy is proposed to determine the user connectivity faster. To tackle the NP-hard problem in a scalable manner, the distributed user connectivity maximization algorithm 1 (DUCM-1) is proposed under the multi-agent deep Q learning (MA-DQL) framework. DUCM-1 emphasizes designing different information exchange levels and evaluating how they impact the learning convergence with stationary and dynamic user distribution. To comply with the UAV dynamics, the DUCM-2 algorithm is developed, which is devoted to autonomously handling arbitrary quits and join-ins of UAVs in a considered time horizon. Extensive simulations are conducted i) to conclude that exchanging state information with a deliberated task-specific reward function design yields the best convergence performance, and ii) to show the efficacy and robustness of DUCM-2 against the dynamics.

Submitted 9 September, 2024; originally announced September 2024.

Comments: 12 pages, 12 figures, journal draft

arXiv:2409.03944 [pdf, other]

HUMOS: Human Motion Model Conditioned on Body Shape

Authors: Shashank Tripathi, Omid Taheri, Christoph Lassner, Michael J. Black, Daniel Holden, Carsten Stoll

Abstract: Generating realistic human motion is essential for many computer vision and graphics applications. The wide variety of human body shapes and sizes greatly impacts how people move. However, most existing motion models ignore these differences, relying on a standardized, average body. This leads to uniform motion across different body types, where movements don't match their physical characteristics, limiting diversity. To solve this, we introduce a new approach to develop a generative motion model based on body shape. We show that it's possible to train this model using unpaired data by applying cycle consistency, intuitive physics, and stability constraints, which capture the relationship between identity and movement. The resulting model generates diverse, physically plausible, and dynamically stable human motions that are both quantitatively and qualitatively more realistic than current state-of-the-art methods. More details are available on our project page https://CarstenEpic.github.io/humos/.

Submitted 5 September, 2024; originally announced September 2024.

Comments: Accepted in ECCV'24. Project page: https://CarstenEpic.github.io/humos/

arXiv:2407.19520 [pdf, other]

Ego-VPA: Egocentric Video Understanding with Parameter-efficient Adaptation

Authors: Tz-Ying Wu, Kyle Min, Subarna Tripathi, Nuno Vasconcelos

Abstract: Video understanding typically requires fine-tuning the large backbone when adapting to new domains. In this paper, we leverage the egocentric video foundation models (Ego-VFMs) based on video-language pre-training and propose a parameter-efficient adaptation for egocentric video tasks, namely Ego-VPA. It employs a local sparse approximation for each video frame/text feature using the basis prompts, and the selected basis prompts are used to synthesize video/text prompts. Since the basis prompts are shared across frames and modalities, it models context fusion and cross-modal transfer in an efficient fashion. Experiments show that Ego-VPA excels in lightweight adaptation (with only 0.84% learnable parameters), largely improving over baselines and reaching the performance of full fine-tuning.

Submitted 28 July, 2024; originally announced July 2024.

arXiv:2406.09462 [pdf, other]

SViTT-Ego: A Sparse Video-Text Transformer for Egocentric Video

Authors: Hector A. Valdez, Kyle Min, Subarna Tripathi

Abstract: Pretraining egocentric vision-language models has become essential to improving downstream egocentric video-text tasks. These egocentric foundation models commonly use the transformer architecture. The memory footprint of these models during pretraining can be substantial. Therefore, we pretrain SViTT-Ego, the first sparse egocentric video-text transformer model integrating edge and node sparsification. We pretrain on the EgoClip dataset and incorporate the egocentric-friendly objective EgoNCE, instead of the frequently used InfoNCE. Most notably, SViTT-Ego obtains a +2.8% gain on EgoMCQ (intra-video) accuracy compared to LAVILA large, with no additional data augmentation techniques other than standard image augmentations, yet pretrainable on memory-limited devices.

Submitted 12 June, 2024; originally announced June 2024.

arXiv:2406.09017 [pdf, other]

doi 10.1007/978-3-031-45170-6_85

A PCA based Keypoint Tracking Approach to Automated Facial Expressions Encoding

Authors: Shivansh Chandra Tripathi, Rahul Garg

Abstract: The Facial Action Coding System (FACS) for studying facial expressions is manual and requires significant effort and expertise. This paper explores the use of automated techniques to generate Action Units (AUs) for studying facial expressions. We propose an unsupervised approach based on Principal Component Analysis (PCA) and facial keypoint tracking to generate data-driven AUs called PCA AUs using the publicly available DISFA dataset. The PCA AUs comply with the direction of facial muscle movements and are capable of explaining over 92.83 percent of the variance in other public test datasets (BP4D-Spontaneous and CK+), indicating their capability to generalize facial expressions. The PCA AUs are also comparable to a keypoint-based equivalence of FACS AUs in terms of variance explained on the test datasets. In conclusion, our research demonstrates the potential of automated techniques to be an alternative to manual FACS labeling, which could lead to efficient real-time analysis of facial expressions in psychology and related fields. To promote further research, we have made the code repository publicly available.

Submitted 13 June, 2024; originally announced June 2024.

Comments: This preprint has not undergone peer review or any post-submission improvements or corrections. The Version of Record of this contribution is published in [LNCS, volume 14301], and is available online at https://doi.org/10.1007/978-3-031-45170-6_85

arXiv:2406.05434 [pdf, other]

Unsupervised learning of Data-driven Facial Expression Coding System (DFECS) using keypoint tracking

Authors: Shivansh Chandra Tripathi, Rahul Garg

Abstract: The development of existing facial coding systems, such as the Facial Action Coding System (FACS), relied on manual examination of facial expression videos for defining Action Units (AUs). To overcome the labor-intensive nature of this process, we propose the unsupervised learning of an automated facial coding system by leveraging computer-vision-based facial keypoint tracking. In this novel facial coding system called the Data-driven Facial Expression Coding System (DFECS), the AUs are estimated by applying dimensionality reduction to facial keypoint movements from a neutral frame through a proposed Full Face Model (FFM). FFM employs a two-level decomposition using advanced dimensionality reduction techniques such as dictionary learning (DL) and non-negative matrix factorization (NMF). These techniques enhance the interpretability of AUs by introducing constraints such as sparsity and positivity to the encoding matrix. Results show that DFECS AUs estimated from the DISFA dataset can account for an average variance of up to 91.29 percent in test datasets (CK+ and BP4D-Spontaneous) and also surpass the variance explained by keypoint-based equivalents of FACS AUs in these datasets. Additionally, 87.5 percent of DFECS AUs are interpretable, i.e., align with the direction of facial muscle movements. In summary, advancements in automated facial coding systems can accelerate facial expression analysis across diverse fields such as security, healthcare, and entertainment. These advancements offer numerous benefits, including enhanced detection of abnormal behavior, improved pain analysis in healthcare settings, and enriched emotion-driven interactions. To facilitate further research, the code repository of DFECS has been made publicly accessible.

Submitted 8 June, 2024; originally announced June 2024.

arXiv:2406.02631 [pdf, other]

Contrastive Language Video Time Pre-training

Authors: Hengyue Liu, Kyle Min, Hector A. Valdez, Subarna Tripathi

Abstract: We introduce LAVITI, a novel approach to learning language, video, and temporal representations in long-form videos via contrastive learning. Different from pre-training on video-text pairs like EgoVLP, LAVITI aims to align language, video, and temporal features by extracting meaningful moments in untrimmed videos. Our model employs a set of learnable moment queries to decode clip-level visual, language, and temporal features. In addition to vision and language alignment, we introduce relative temporal embeddings (TE) to represent timestamps in videos, which enables contrastive learning of time. Significantly different from traditional approaches, the prediction of a particular timestamp is transformed by computing the similarity score between the predicted TE and all TEs. Furthermore, existing approaches for video understanding are mainly designed for short videos due to high computational complexity and memory footprint. Our method can be trained on the Ego4D dataset with only 8 NVIDIA RTX-3090 GPUs in a day. We validated our method on CharadesEgo action recognition, achieving state-of-the-art results.

Submitted 3 June, 2024; originally announced June 2024.

Comments: CVPR EgoVis Workshop 2024 extended abstract

arXiv:2405.15392 [pdf, other]

D-VRE: From a Jupyter-enabled Private Research Environment to Decentralized Collaborative Research Ecosystem

Authors: Yuandou Wang, Sheejan Tripathi, Siamak Farshidi, Zhiming Zhao

Abstract: Today, scientific research is increasingly data-centric and compute-intensive, relying on data and models across distributed sources. However, it still faces challenges in the traditional cooperation mode, due to the high storage and computing cost, geo-location barriers, and local confidentiality regulations. The Jupyter environment has recently emerged and evolved as a vital virtual research environment for scientific computing, which researchers can use to scale computational analyses up to larger datasets and high-performance computing resources. Nevertheless, existing approaches lack robust support of a decentralized cooperation mode to unlock the full potential of decentralized collaborative scientific research, e.g., seamlessly secure data sharing. In this work, we change the basic structure and legacy norms of current research environments via the seamless integration of Jupyter with Ethereum blockchain capabilities. As such, it creates a Decentralized Virtual Research Environment (D-VRE) from private computational notebooks to a decentralized collaborative research ecosystem. We propose a novel architecture for the D-VRE and prototype some essential D-VRE elements for enabling secure data sharing with decentralized identity, user-centric agreement-making, membership, and research asset management. To validate our method, we conducted an experimental study to test all functionalities of D-VRE smart contracts and their gas consumption. In addition, we deployed the D-VRE prototype on a test net of the Ethereum blockchain for demonstration. The feedback from the studies showcases the current prototype's usability, ease of use, and potential and suggests further improvements.

Submitted 26 June, 2024; v1 submitted 24 May, 2024; originally announced May 2024.

Comments: We revised the manuscript draft and submitted the revised manuscript to the journal Blockchain: Research and Applications

arXiv:2404.10539 [pdf, other]

VideoSAGE: Video Summarization with Graph Representation Learning

Authors: Jose M. Rojas Chaves, Subarna Tripathi

Abstract: We propose a graph-based representation learning framework for video summarization. First, we convert an input video to a graph where nodes correspond to each of the video frames. Then, we impose sparsity on the graph by connecting only those pairs of nodes that are within a specified temporal distance. We then formulate the video summarization task as a binary node classification problem: precisely, classifying whether each video frame should belong to the output summary video. A graph constructed this way aims to capture long-range interactions among video frames, and the sparsity ensures the model trains without hitting the memory and compute bottleneck. Experiments on two datasets (SumMe and TVSum) demonstrate the effectiveness of the proposed nimble model compared to existing state-of-the-art summarization approaches while being one order of magnitude more efficient in compute time and memory.

Submitted 14 April, 2024; originally announced April 2024.

Comments: arXiv admin note: text overlap with arXiv:2207.07783

arXiv:2403.13695 [pdf, other]

doi 10.1109/ICEFEET59656.2023.10452217

Loss Regularizing Robotic Terrain Classification

Authors: Shakti Deo Kumar, Sudhanshu Tripathi, Krishna Ujjwal, Sarvada Sakshi Jha, Suddhasil De

Abstract: Locomotion mechanics of legged robots are suitable when pacing through difficult terrains. Recognising terrains is important for such robots to fully harness the versatility of their movements. Consequently, robotic terrain classification becomes significant to classify terrains in real time with high accuracy. The conventional classifiers suffer from overfitting, low accuracy, and high variance, and are not suitable for live datasets. On the other hand, classifying a growing dataset is difficult for convolution-based terrain classification. Supervised recurrent models are also not practical for this classification. Further, the existing recurrent architectures are still evolving to improve the accuracy of terrain classification based on live variable-length sensory data collected from legged robots. This paper proposes a new semi-supervised method for terrain classification of legged robots, avoiding preprocessing of long variable-length datasets. The proposed method has a stacked Long Short-Term Memory architecture, including a new loss regularization. The proposed method solves the existing problems and improves accuracy. Comparison with the existing architectures shows the improvements.

Submitted 20 March, 2024; originally announced March 2024.

Comments: Preliminary draft of the work published in IEEE conference 2023

arXiv:2403.00788 [pdf]

PRECISE Framework: GPT-based Text For Improved Readability, Reliability, and Understandability of Radiology Reports For Patient-Centered Care

Authors: Satvik Tripathi, Liam Mutter, Meghana Muppuri, Suhani Dheer, Emiliano Garza-Frias, Komal Awan, Aakash Jha, Michael Dezube, Azadeh Tabari, Christopher P. Bridge, Dania Daye

Abstract: This study introduces and evaluates the PRECISE framework, utilizing OpenAI's GPT-4 to enhance patient engagement by providing clearer and more accessible chest X-ray reports at a sixth-grade reading level. The framework was tested on 500 reports, demonstrating significant improvements in readability, reliability, and understandability. Statistical analyses confirmed the effectiveness of the PRECISE approach, highlighting its potential to foster patient-centric care delivery in healthcare decision-making.

Submitted 19 February, 2024; originally announced March 2024.

arXiv:2312.05432 [pdf, other]

Fusing Multiple Algorithms for Heterogeneous Online Learning

Authors: Darshan Gadginmath, Shivanshu Tripathi, Fabio Pasqualetti

Abstract: This study addresses the challenge of online learning in contexts where agents accumulate disparate data, face resource constraints, and use different local algorithms. This paper introduces the Switched Online Learning Algorithm (SOLA), designed to solve the heterogeneous online learning problem by amalgamating updates from diverse agents through a dynamic switching mechanism contingent upon their respective performance and available resources. We theoretically analyze the design of the selecting mechanism to ensure that the regret of SOLA is bounded. Our findings show that the number of changes in selection needs to be bounded by a parameter dependent on the performance of the different local algorithms. Additionally, two test cases are presented to emphasize the effectiveness of SOLA, first on an online linear regression problem and then on an online classification problem with the MNIST dataset.

Submitted 8 December, 2023; originally announced December 2023.

Comments: 13 pages, 3 figures

arXiv:2312.03391 [pdf, other]

Action Scene Graphs for Long-Form Understanding of Egocentric Videos

Authors: Ivan Rodin, Antonino Furnari, Kyle Min, Subarna Tripathi, Giovanni Maria Farinella

Abstract: We present Egocentric Action Scene Graphs (EASGs), a new representation for long-form understanding of egocentric videos. EASGs extend standard manually-annotated representations of egocentric videos, such as verb-noun action labels, by providing a temporally evolving graph-based description of the actions performed by the camera wearer, including interacted objects, their relationships, and how actions unfold in time. Through a novel annotation procedure, we extend the Ego4D dataset by adding manually labeled Egocentric Action Scene Graphs offering a rich set of annotations designed for long-form egocentric video understanding. We hence define the EASG generation task and provide a baseline approach, establishing preliminary benchmarks. Experiments on two downstream tasks, egocentric action anticipation and egocentric activity summarization, highlight the effectiveness of EASGs for long-form egocentric video understanding. We will release the dataset and the code to replicate experiments and annotations.

Submitted 6 December, 2023; originally announced December 2023.

arXiv:2311.10476 [pdf, other]

FRCSyn Challenge at WACV 2024: Face Recognition Challenge in the Era of Synthetic Data

Authors: Pietro Melzi, Ruben Tolosana, Ruben Vera-Rodriguez, Minchul Kim, Christian Rathgeb, Xiaoming Liu, Ivan DeAndres-Tame, Aythami Morales, Julian Fierrez, Javier Ortega-Garcia, Weisong Zhao, Xiangyu Zhu, Zheyu Yan, Xiao-Yu Zhang, Jinlin Wu, Zhen Lei, Suvidha Tripathi, Mahak Kothari, Md Haider Zama, Debayan Deb, Bernardo Biesseck, Pedro Vidal, Roger Granada, Guilherme Fickel, Gustavo Führ, et al. (22 additional authors not shown)

Abstract: Despite the widespread adoption of face recognition technology around the world, and its remarkable performance on current benchmarks, there are still several challenges that must be covered in more detail. This paper offers an overview of the Face Recognition Challenge in the Era of Synthetic Data (FRCSyn) organized at WACV 2024. This is the first international challenge aiming to explore the use of synthetic data in face recognition to address existing limitations in the technology. Specifically, the FRCSyn Challenge targets concerns related to data privacy issues, demographic biases, generalization to unseen scenarios, and performance limitations in challenging scenarios, including significant age disparities between enrollment and testing, pose variations, and occlusions. The results achieved in the FRCSyn Challenge, together with the proposed benchmark, contribute significantly to the application of synthetic data to improve face recognition technology.

Submitted 17 November, 2023; originally announced November 2023.

Comments: 10 pages, 1 figure, WACV 2024 Workshops

arXiv:2310.02753 [pdf, other]

MUNCH: Modelling Unique 'N Controllable Heads

Authors: Debayan Deb, Suvidha Tripathi, Pranit Puri

Abstract: The automated generation of 3D human heads has been an intriguing and challenging task for computer vision researchers. Prevailing methods synthesize realistic avatars but with limited control over the diversity and quality of rendered outputs and suffer from limited correlation between shape and texture of the character. We propose a method that offers quality, diversity, control, and realism along with explainable network design, all desirable features to game-design artists in the domain. First, our proposed Geometry Generator identifies disentangled latent directions and generates novel and diverse samples. A Render Map Generator then learns to synthesize multiple high-fidelity physically-based render maps including Albedo, Glossiness, Specular, and Normals. For artists preferring fine-grained control over the output, we introduce a novel Color Transformer Model that allows semantic color control over generated maps. We also introduce quantifiable metrics called Uniqueness and Novelty and a combined metric to test the overall performance of our model. A demo for both shapes and textures can be found at https://munch-seven.vercel.app/. We will release our model along with the synthetic dataset.

Submitted 4 October, 2023; originally announced October 2023.

arXiv:2309.15273 [pdf, other]

DECO: Dense Estimation of 3D Human-Scene Contact In The Wild

Authors: Shashank Tripathi, Agniv Chatterjee, Jean-Claude Passy, Hongwei Yi, Dimitrios Tzionas, Michael J. Black

Abstract: Understanding how humans use physical contact to interact with the world is key to enabling human-centric artificial intelligence. While inferring 3D contact is crucial for modeling realistic and physically-plausible human-object interactions, existing methods either focus on 2D, consider body joints rather than the surface, use coarse 3D body regions, or do not generalize to in-the-wild images. In contrast, we focus on inferring dense, 3D contact between the full body surface and objects in arbitrary images. To achieve this, we first collect DAMON, a new dataset containing dense vertex-level contact annotations paired with RGB images containing complex human-object and human-scene contact. Second, we train DECO, a novel 3D contact detector that uses both body-part-driven and scene-context-driven attention to estimate vertex-level contact on the SMPL body. DECO builds on the insight that human observers recognize contact by reasoning about the contacting body parts, their proximity to scene objects, and the surrounding scene context. We perform extensive evaluations of our detector on DAMON as well as on the RICH and BEHAVE datasets. We significantly outperform existing SOTA methods across all benchmarks. We also show qualitatively that DECO generalizes well to diverse and challenging real-world human interactions in natural images. The code, data, and models are available at https://deco.is.tue.mpg.de.

Submitted 26 September, 2023; originally announced September 2023.

Comments: Accepted as Oral in ICCV'23. Project page: https://deco.is.tue.mpg.de

arXiv:2307.16195 [pdf, ps, other]

Implementation of Fast and Power Efficient SEC-DAEC and SEC-DAEC-TAEC Codecs on FPGA

Authors: Sayan Tripathi, Jhilam Jana, Jaydeb Bhaumik

Abstract: The reliability of memory devices is affected by radiation induced soft errors. Multiple cell upsets (MCUs) caused by radiation corrupt data stored in multiple cells within memories. Error correction codes (ECCs) are typically used to mitigate the effects of MCUs. Single error correction-double error detection (SEC-DED) codes are not the right choice against MCUs, but are more suitable for protecting memory against single cell upset (SCU). Single error correction-double adjacent error correction (SEC-DAEC) and single error correction-double adjacent error correction-triple adjacent error correction (SEC-DAEC-TAEC) codes are more suitable due to the increasing tendency of adjacent errors. This paper presents the implementation of fast and low power multi-bit adjacent error correction codes for protecting memories. Related SEC-DAEC and SEC-DAEC-TAEC codecs with data length of 16-bit, 32-bit and 64-bit have been implemented. It is found from FPGA based implementation results that the modified designs have comparable area and have less delay and power consumption.

Submitted 30 July, 2023; originally announced July 2023.

Comments: 9 pages, 2 figures, 2 tables

arXiv:2306.08990 [pdf, other]

doi 10.1145/3610548.3618183

Emotional Speech-Driven Animation with Content-Emotion Disentanglement

Authors: Radek Daněček, Kiran Chhatre, Shashank Tripathi, Yandong Wen, Michael J. Black, Timo Bolkart

Abstract: To be widely adopted, 3D facial avatars must be animated easily, realistically, and directly from speech signals. While the best recent methods generate 3D animations that are synchronized with the input audio, they largely ignore the impact of emotions on facial expressions. Realistic facial animation requires lip-sync together with the natural expression of emotion. To that end, we propose EMOTE (Expressive Model Optimized for Talking with Emotion), which generates 3D talking-head avatars that maintain lip-sync from speech while enabling explicit control over the expression of emotion. To achieve this, we supervise EMOTE with decoupled losses for speech (i.e., lip-sync) and emotion. These losses are based on two key observations: (1) deformations of the face due to speech are spatially localized around the mouth and have high temporal frequency, whereas (2) facial expressions may deform the whole face and occur over longer intervals. Thus, we train EMOTE with a per-frame lip-reading loss to preserve the speech-dependent content, while supervising emotion at the sequence level. Furthermore, we employ a content-emotion exchange mechanism in order to supervise different emotions on the same audio, while maintaining the lip motion synchronized with the speech. To employ deep perceptual losses without getting undesirable artifacts, we devise a motion prior in the form of a temporal VAE. Due to the absence of high-quality aligned emotional 3D face datasets with speech, EMOTE is trained with 3D pseudo-ground-truth extracted from an emotional video dataset (i.e., MEAD). Extensive qualitative and perceptual evaluations demonstrate that EMOTE produces speech-driven facial animations with better lip-sync than state-of-the-art methods trained on the same data, while offering additional, high-quality emotional control.

Submitted 26 September, 2023; v1 submitted 15 June, 2023; originally announced June 2023.

Comments: SIGGRAPH Asia 2023 Conference Paper

arXiv:2306.05689 [pdf, other]

Single-Stage Visual Relationship Learning using Conditional Queries

Authors: Alakh Desai, Tz-Ying Wu, Subarna Tripathi, Nuno Vasconcelos

Abstract: Research in scene graph generation (SGG) usually considers two-stage models, that is, detecting a set of entities, followed by combining them and labeling all possible relationships. While showing promising results, the pipeline structure induces large parameter and computation overhead, and typically hinders end-to-end optimizations. To address this, recent research attempts to train single-stage models that are computationally efficient. With the advent of DETR, a set based detection model, one-stage models attempt to predict a set of subject-predicate-object triplets directly in a single shot. However, SGG is inherently a multi-task learning problem that requires modeling entity and predicate distributions simultaneously. In this paper, we propose Transformers with conditional queries for SGG, namely, TraCQ with a new formulation for SGG that avoids the multi-task learning problem and the combinatorial entity pair distribution. We employ a DETR-based encoder-decoder design and leverage conditional queries to significantly reduce the entity label space as well, which leads to 20% fewer parameters compared to state-of-the-art single-stage models. Experimental results show that TraCQ not only outperforms existing single-stage scene graph generation methods, it also beats many state-of-the-art two-stage methods on the Visual Genome dataset, yet is capable of end-to-end training and faster inference.

Submitted 9 June, 2023; originally announced June 2023.

Comments: Accepted to NeurIPS 2022

arXiv:2306.01652 [pdf, other]

On the Coverage of Cognitive mmWave Networks with Directional Sensing and Communication

Authors: Shuchi Tripathi, Abhishek K. Gupta, SaiDhiraj Amuru

Abstract: Millimeter-waves' propagation characteristics create prospects for spatial and temporal spectrum sharing in a variety of contexts, including cognitive spectrum sharing (CSS). However, CSS along with omnidirectional sensing, is not efficient at mmWave frequencies due to their directional nature of transmission, as this limits secondary networks' ability to access the spectrum. This inspired us to create an analytical approach using stochastic geometry to examine the implications of directional cognitive sensing in mmWave networks. We explore a scenario where multiple secondary transmitter-receiver pairs coexist with a primary transmitter-receiver pair, forming a cognitive network. The positions of the secondary transmitters are modelled using a homogeneous Poisson point process (PPP) with corresponding secondary receivers located around them. A threshold on directional transmission is imposed on each secondary transmitter in order to limit its interference at the primary receiver. We derive the medium-access-probability of a secondary user along with the fraction of the secondary transmitters active at a time-instant. To understand cognition's feasibility, we derive the coverage probabilities of primary and secondary links. We provide various design insights via numerical results. For example, we investigate the interference-threshold's optimal value while ensuring coverage for both links and its dependence on various parameters. We find that directionality improves both links' performance as a key factor. Further, allowing location-aware secondary directionality can help achieve similar coverage for all secondary links.

Submitted 2 June, 2023; originally announced June 2023.

Comments: 30 pages, 12 figures

arXiv:2304.11827 [pdf]

Safe and Secure Smart Home using Cisco Packet Tracer

Authors: Shivansh Walia, Tejas Iyer, Shubham Tripathi, Akshith Vanaparthy

Abstract: This project presents an implementation and designing of safe, secure and smart home with enhanced levels of security features which uses IoT-based technology. We got our motivation for this project after learning about movement of west towards smart homes and designs. This galvanized us to engage in this work as we wanted for homeowners to have a greater control over their in-house environment while also promising more safety and security features for the denizen. This contrivance of smart-home archetype has been intended to assimilate many kinds of sensors, boards along with advanced IoT devices and programming languages all of which in conjunction validate control and monitoring prowess over discrete electronic items present in home.

Submitted 24 April, 2023; originally announced April 2023.

Comments: 11 pages

arXiv:2304.08809 [pdf, other]

SViTT: Temporal Learning of Sparse Video-Text Transformers

Authors: Yi Li, Kyle Min, Subarna Tripathi, Nuno Vasconcelos

Abstract: Do video-text transformers learn to model temporal relationships across frames? Despite their immense capacity and the abundance of multimodal training data, recent work has revealed the strong tendency of video-text models towards frame-based spatial representations, while temporal reasoning remains largely unsolved. In this work, we identify several key challenges in temporal learning of video-text transformers: the spatiotemporal trade-off from limited network size; the curse of dimensionality for multi-frame modeling; and the diminishing returns of semantic information by extending clip length. Guided by these findings, we propose SViTT, a sparse video-text architecture that performs multi-frame reasoning with significantly lower cost than naive transformers with dense attention. Analogous to graph-based networks, SViTT employs two forms of sparsity: edge sparsity that limits the query-key communications between tokens in self-attention, and node sparsity that discards uninformative visual tokens. Trained with a curriculum which increases model sparsity with the clip length, SViTT outperforms dense transformer baselines on multiple video-text retrieval and question answering benchmarks, with a fraction of computational cost. Project page: http://svcl.ucsd.edu/projects/svitt.

Submitted 18 April, 2023; originally announced April 2023.

Comments: CVPR 2023

arXiv:2304.00733 [pdf, other]

Unbiased Scene Graph Generation in Videos

Authors: Sayak Nag, Kyle Min, Subarna Tripathi, Amit K. Roy Chowdhury

Abstract: The task of dynamic scene graph generation (SGG) from videos is complicated and challenging due to the inherent dynamics of a scene, temporal fluctuation of model predictions, and the long-tailed distribution of the visual relationships in addition to the already existing challenges in image-based SGG. Existing methods for dynamic SGG have primarily focused on capturing spatio-temporal context using complex architectures without addressing the challenges mentioned above, especially the long-tailed distribution of relationships. This often leads to the generation of biased scene graphs. To address these challenges, we introduce a new framework called TEMPURA: TEmporal consistency and Memory Prototype guided UnceRtainty Attenuation for unbiased dynamic SGG. TEMPURA employs object-level temporal consistencies via transformer-based sequence modeling, learns to synthesize unbiased relationship representations using memory-guided training, and attenuates the predictive uncertainty of visual relations using a Gaussian Mixture Model (GMM). Extensive experiments demonstrate that our method achieves significant (up to 10% in some cases) performance gain over existing methods highlighting its superiority in generating more unbiased scene graphs.

Submitted 29 June, 2023; v1 submitted 3 April, 2023; originally announced April 2023.

Comments: Published in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2023

arXiv:2303.18246 [pdf, other]

3D Human Pose Estimation via Intuitive Physics

Authors: Shashank Tripathi, Lea Müller, Chun-Hao P. Huang, Omid Taheri, Michael J. Black, Dimitrios Tzionas

Abstract: Estimating 3D humans from images often produces implausible bodies that lean, float, or penetrate the floor. Such methods ignore the fact that bodies are typically supported by the scene. A physics engine can be used to enforce physical plausibility, but these are not differentiable, rely on unrealistic proxy bodies, and are difficult to integrate into existing optimization and learning frameworks. In contrast, we exploit novel intuitive-physics (IP) terms that can be inferred from a 3D SMPL body interacting with the scene. Inspired by biomechanics, we infer the pressure heatmap on the body, the Center of Pressure (CoP) from the heatmap, and the SMPL body's Center of Mass (CoM). With these, we develop IPMAN, to estimate a 3D body from a color image in a "stable" configuration by encouraging plausible floor contact and overlapping CoP and CoM. Our IP terms are intuitive, easy to implement, fast to compute, differentiable, and can be integrated into existing optimization and regression methods. We evaluate IPMAN on standard datasets and MoYo, a new dataset with synchronized multi-view images, ground-truth 3D bodies with complex poses, body-floor contact, CoM and pressure. IPMAN produces more plausible results than the state of the art, improving accuracy for static poses, while not hurting dynamic ones. Code and data are available for research at https://ipman.is.tue.mpg.de.

Submitted 24 July, 2023; v1 submitted 31 March, 2023; originally announced March 2023.

Comments: Accepted in CVPR'23. Project page: https://ipman.is.tue.mpg.de

arXiv:2303.17499 [pdf, other]

Fuzzified advanced robust hashes for identification of digital and physical objects

Authors: Shashank Tripathi, Volker Skwarek

Abstract: With the rising numbers for IoT objects, it is becoming easier to penetrate counterfeit objects into the mainstream market by adversaries. Such infiltration of bogus products can be addressed with third-party-verifiable identification. Generally, state-of-the-art identification schemes do not guarantee that an identifier e.g. barcodes or RFID itself cannot be forged. This paper introduces identification patterns representing the objects intrinsic identity by robust hashes and not only by generated identification patterns. Inspired by these two notions, a collection of uniquely identifiable attributes called quasi-identifiers (QI) can be used to identify an object. Since all attributes do not contribute equally towards an object's identity, each QI has a different contribution towards the identifier. A robust hash developed utilising the QI has been named fuzzified robust hashes (FaR hashes), which can be used as an object identifier. Although the FaR hash is a single hash string, selected bits change in response to the modification of QI. On the other hand, other QIs in the object are more important for the object's identity. If these QIs change, the complete FaR hash is going to change. The calculation of FaR hash using attributes should allow third parties to generate the identifier and compare it with the current one to verify the genuineness of the object.

Submitted 30 March, 2023; originally announced March 2023.

Comments: 9 pages, 6 figures, 3 tables

ACM Class: E.3; E.4; H.1

arXiv:2212.04360 [pdf, other]

MIME: Human-Aware 3D Scene Generation

Authors: Hongwei Yi, Chun-Hao P. Huang, Shashank Tripathi, Lea Hering, Justus Thies, Michael J. Black

Abstract: Generating realistic 3D worlds occupied by moving humans has many applications in games, architecture, and synthetic data creation. But generating such scenes is expensive and labor intensive. Recent work generates human poses and motions given a 3D scene. Here, we take the opposite approach and generate 3D indoor scenes given 3D human motion. Such motions can come from archival motion capture or from IMU sensors worn on the body, effectively turning human movement in a "scanner" of the 3D world. Intuitively, human movement indicates the free-space in a room and human contact indicates surfaces or objects that support activities such as sitting, lying or touching. We propose MIME (Mining Interaction and Movement to infer 3D Environments), which is a generative model of indoor scenes that produces furniture layouts that are consistent with the human movement. MIME uses an auto-regressive transformer architecture that takes the already generated objects in the scene as well as the human motion as input, and outputs the next plausible object. To train MIME, we build a dataset by populating the 3D FRONT scene dataset with 3D humans. Our experiments show that MIME produces more diverse and plausible 3D scenes than a recent generative scene method that does not know about human movement. Code and data will be available for research at https://mime.is.tue.mpg.de.

Submitted 8 December, 2022; originally announced December 2022.

Comments: Project Page: https://mime.is.tue.mpg.de

arXiv:2211.04442 [pdf, other]

Algorithmic Bias in Machine Learning Based Delirium Prediction

Authors: Sandhya Tripathi, Bradley A Fritz, Michael S Avidan, Yixin Chen, Christopher R King

Abstract: Although prediction models for delirium, a commonly occurring condition during general hospitalization or post-surgery, have not gained huge popularity, their algorithmic bias evaluation is crucial due to the existing association between social determinants of health and delirium risk. In this context, using MIMIC-III and another academic hospital dataset, we present some initial experimental evidence showing how sociodemographic features such as sex and race can impact the model performance across subgroups. With this work, our intent is to initiate a discussion about the intersectionality effects of old age, race and socioeconomic factors on the early-stage detection and prevention of delirium using ML.

Submitted 26 November, 2022; v1 submitted 8 November, 2022; originally announced November 2022.

Comments: Extended Abstract presented at Machine Learning for Health (ML4H) symposium 2022, November 28th, 2022, New Orleans, United States & Virtual, http://www.ml4h.cc, 14 pages

arXiv:2210.15923 [pdf, other]

DELFI: Deep Mixture Models for Long-term Air Quality Forecasting in the Delhi National Capital Region

Authors: Naishadh Parmar, Raunak Shah, Tushar Goswamy, Vatsalya Tandon, Ravi Sahu, Ronak Sutaria, Purushottam Kar, Sachchida Nand Tripathi

Abstract: The identification and control of human factors in climate change is a rapidly growing concern and robust, real-time air-quality monitoring and forecasting plays a critical role in allowing effective policy formulation and implementation. This paper presents DELFI, a novel deep learning-based mixture model to make effective long-term predictions of Particulate Matter (PM) 2.5 concentrations. A key novelty in DELFI is its multi-scale approach to the forecasting problem. The observation that point predictions are more suitable in the short-term and probabilistic predictions in the long-term allows accurate predictions to be made as much as 24 hours in advance. DELFI incorporates meteorological data as well as pollutant-based features to ensure a robust model that is divided into two parts: (i) a stack of three Long Short-Term Memory (LSTM) networks that perform differential modelling of the same window of past data, and (ii) a fully-connected layer enabling attention to each of the components. Experimental evaluation based on deployment of 13 stations in the Delhi National Capital Region (Delhi-NCR) in India establishes that DELFI offers far superior predictions especially in the long-term as compared to even non-parametric baselines. The Delhi-NCR recorded the 3rd highest PM levels amongst 39 mega-cities across the world during 2011-2015 and DELFI's performance establishes it as a potential tool for effective long-term forecasting of PM levels to enable public health management and environment protection.

Submitted 28 October, 2022; originally announced October 2022.

Comments: 6 pages

arXiv:2210.10130 [pdf, other]

PERI: Part Aware Emotion Recognition In The Wild

Authors: Akshita Mittel, Shashank Tripathi

Abstract: Emotion recognition aims to interpret the emotional states of a person based on various inputs including audio, visual, and textual cues. This paper focuses on emotion recognition using visual features. To leverage the correlation between facial expression and the emotional state of a person, pioneering methods rely primarily on facial features. However, facial features are often unreliable in natural unconstrained scenarios, such as in crowded scenes, as the face lacks pixel resolution and contains artifacts due to occlusion and blur. To address this, in the wild emotion recognition exploits full-body person crops as well as the surrounding scene context. In a bid to use body pose for emotion recognition, such methods fail to realize the potential that facial expressions, when available, offer. Thus, the aim of this paper is two-fold. First, we demonstrate our method, PERI, to leverage both body pose and facial landmarks. We create part aware spatial (PAS) images by extracting key regions from the input image using a mask generated from both body pose and facial landmarks. This allows us to exploit body pose in addition to facial context whenever available. Second, to reason from the PAS images, we introduce context infusion (Cont-In) blocks. These blocks attend to part-specific information, and pass them onto the intermediate features of an emotion recognition network. Our approach is conceptually simple and can be applied to any existing emotion recognition method. We provide our results on the publicly available in the wild EMOTIC dataset. Compared to existing methods, PERI achieves superior performance and leads to significant improvements in the mAP of emotion categories, while decreasing Valence, Arousal and Dominance errors. Importantly, we observe that our method improves performance in both images with fully visible faces as well as in images with occluded or blurred faces.

Submitted 18 October, 2022; originally announced October 2022.

Comments: Accepted at ECCVW 2022

arXiv:2210.00521 [pdf, other]

Leveraging unsupervised data and domain adaptation for deep regression in low-cost sensor calibration

Authors: Swapnil Dey, Vipul Arora, Sachchida Nand Tripathi

Abstract: Air quality monitoring is becoming an essential task with rising awareness about air quality. Low cost air quality sensors are easy to deploy but are not as reliable as the costly and bulky reference monitors. The low quality sensors can be calibrated against the reference monitors with the help of deep learning. In this paper, we translate the task of sensor calibration into a semi-supervised domain adaptation problem and propose a novel solution for the same. The problem is challenging because it is a regression problem with covariate shift and label gap. We use histogram loss instead of mean squared or mean absolute error, which is commonly used for regression, and find it useful against covariate shift. To handle the label gap, we propose weighting of samples for adversarial entropy optimization. In experimental evaluations, the proposed scheme outperforms many competitive baselines, which are based on semi-supervised and supervised domain adaptation, in terms of R2 score and mean absolute error. Ablation studies show the relevance of each proposed component in the entire scheme.

Submitted 2 October, 2022; originally announced October 2022.

Comments: submitted to IEEE Trans. on Neural Networks and Learning Systems as a regular article

arXiv:2208.01953 [pdf, ps, other]

Maximum Minimal Feedback Vertex Set: A Parameterized Perspective

Authors: Ajinkya Gaikwad, Hitendra Kumar, Soumen Maity, Saket Saurabh, Shuvam Kant Tripathi

Abstract: In this paper we study a maximization version of the classical Feedback Vertex Set (FVS) problem, namely, the Max Min FVS problem, in the realm of parameterized complexity. In this problem, given an undirected graph $G$, a positive integer $k$, the question is to check whether $G$ has a minimal feedback vertex set of size at least $k$. We obtain following results for Max Min FVS. 1) We first design a fixed parameter tractable (FPT) algorithm for Max Min FVS running in time $10^kn^{\mathcal{O}(1)}$. 2) Next, we consider the problem parameterized by the vertex cover number of the input graph (denoted by $\mathsf{vc}(G)$), and design an algorithm with running time $2^{\mathcal{O}(\mathsf{vc}(G)\log \mathsf{vc}(G))}n^{\mathcal{O}(1)}$. We complement this result by showing that the problem parameterized by $\mathsf{vc}(G)$ does not admit a polynomial compression unless coNP $\subseteq$ NP/poly. 3) Finally, we give an FPT-approximation scheme (fpt-AS) parameterized by $\mathsf{vc}(G)$. That is, we design an algorithm that for every $ε>0$, runs in time $2^{\mathcal{O}\left(\frac{\mathsf{vc}(G)}ε\right)} n^{\mathcal{O}(1)}$ and returns a minimal feedback vertex set of size at least $(1-ε){\sf opt}$.

Submitted 3 August, 2022; originally announced August 2022.
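
Illustrative note (not part of the listing or the paper): the Max Min FVS problem defined in the abstract above can be made concrete with a small brute-force checker. The sketch below is an exponential-time illustration only and is not the FPT algorithm the authors describe; the function names `is_acyclic`, `is_minimal_fvs`, and `max_min_fvs` are hypothetical, and vertices are assumed to be labelled 0..n-1.

```python
from itertools import combinations


def is_acyclic(n, edges, removed):
    """Return True if the graph with `removed` vertices deleted is a forest."""
    parent = list(range(n))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for u, v in edges:
        if u in removed or v in removed:
            continue
        ru, rv = find(u), find(v)
        if ru == rv:
            return False  # this edge closes a cycle
        parent[ru] = rv
    return True


def is_minimal_fvs(n, edges, s):
    """Check that `s` is a feedback vertex set and that no vertex can be dropped.

    Supersets of a feedback vertex set are feedback vertex sets, so testing
    single-vertex removals is enough to establish minimality.
    """
    s = set(s)
    if not is_acyclic(n, edges, s):
        return False
    return all(not is_acyclic(n, edges, s - {v}) for v in s)


def max_min_fvs(n, edges):
    """Brute force: return a largest minimal feedback vertex set."""
    for size in range(n, -1, -1):
        for cand in combinations(range(n), size):
            if is_minimal_fvs(n, edges, cand):
                return set(cand)
    return set()


# Two triangles sharing vertex 0.
edges = [(0, 1), (1, 2), (0, 2), (0, 3), (3, 4), (0, 4)]
best = max_min_fvs(5, edges)
```

On this example graph, `{0}` is a minimal feedback vertex set of size 1, while the largest minimal one has size 2 (one non-shared vertex from each triangle), which is the kind of gap the maximization version asks about.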

arXiv:2207.07783 [pdf, other]

Learning Long-Term Spatial-Temporal Graphs for Active Speaker Detection

Authors: Kyle Min, Sourya Roy, Subarna Tripathi, Tanaya Guha, Somdeb Majumdar

Abstract: Active speaker detection (ASD) in videos with multiple speakers is a challenging task as it requires learning effective audiovisual features and spatial-temporal correlations over long temporal windows. In this paper, we present SPELL, a novel spatial-temporal graph learning framework that can solve complex tasks such as ASD. To this end, each person in a video frame is first encoded in a unique node for that frame. Nodes corresponding to a single person across frames are connected to encode their temporal dynamics. Nodes within a frame are also connected to encode inter-person relationships. Thus, SPELL reduces ASD to a node classification task. Importantly, SPELL is able to reason over long temporal contexts for all nodes without relying on computationally expensive fully connected graph neural networks. Through extensive experiments on the AVA-ActiveSpeaker dataset, we demonstrate that learning graph-based representations can significantly improve the active speaker detection performance owing to its explicit spatial and temporal structure. SPELL outperforms all previous state-of-the-art approaches while requiring significantly lower memory and computational resources. Our code is publicly available at https://github.com/SRA2/SPELL

Submitted 12 October, 2022; v1 submitted 15 July, 2022; originally announced July 2022.

Comments: ECCV 2022 camera ready (Supplementary videos: on ECVA soon). This paper supersedes arXiv:2112.01479

arXiv:2207.03536 [pdf, other]

Deep Learning to Jointly Schema Match, Impute, and Transform Databases

Authors: Sandhya Tripathi, Bradley A. Fritz, Mohamed Abdelhack, Michael S. Avidan, Yixin Chen, Christopher R. King

Abstract: An applied problem facing all areas of data science is harmonizing data sources. Joining data from multiple origins with unmapped and only partially overlapping features is a prerequisite to developing and testing robust, generalizable algorithms, especially in health care. We approach this issue in the common but difficult case of numeric features such as nearly Gaussian and binary features, where unit changes and variable shift make simple matching of univariate summaries unsuccessful. We develop two novel procedures to address this problem. First, we demonstrate multiple methods of "fingerprinting" a feature based on its associations to other features. In the setting of even modest prior information, this allows most shared features to be accurately identified. Second, we demonstrate a deep learning algorithm for translation between databases. Unlike prior approaches, our algorithm takes advantage of discovered mappings while identifying surrogates for unshared features and learning transformations. In synthetic and real-world experiments using two electronic health record databases, our algorithms outperform existing baselines for matching variable sets, while jointly learning to impute unshared or transformed variables.

Submitted 22 June, 2022; originally announced July 2022.

arXiv:2205.08440 [pdf, other]

Moving Smart Contracts -- A Privacy Preserving Method for Off-Chain Data Trust

Authors: Simon Tschirner, Shashank Shekher Tripathi, Mathias Roeper, Markus M. Becker, Volker Skwarek

Abstract: Blockchains provide environments where parties can interact transparently and securely peer-to-peer without needing a trusted third party. Parties can trust the integrity and correctness of transactions and the verifiable execution of binary code on the blockchain (smart contracts) inside the system. Including information from outside of the blockchain remains challenging. One challenge is data privacy. In a public system, shared data becomes public and, coming from a single source, often lacks credibility. A private system gives the parties control over their data and sources but trades away positive aspects such as transparency. Often, the most critical information is not the data itself but the result of a computation performed on it. An example is research data certification. To keep data private but still prove data provenance, researchers can store a hash value of that data on the blockchain. This hash value is either calculated locally on private data without the chance for validation or is calculated on the blockchain, meaning that data must be published and stored on the blockchain -- a problem for the overall amount of data stored on and distributed with the ledger. A system we call moving smart contracts bypasses this problem: data remain local, but trusted nodes can access them and execute trusted smart contract code stored on the blockchain. This method avoids the system-wide distribution of research data and makes it accessible and verifiable with trusted software.

Submitted 18 May, 2022; v1 submitted 17 May, 2022; originally announced May 2022.

Comments: 10 pages, 6 figures

ACM Class: C.2.4; E.2; E.3

arXiv:2204.08695 [pdf, other]

Automated Application Processing

Authors: Eshita Sharma, Keshav Gupta, Lubaina Machinewala, Samaksh Dhingra, Shrey Tripathi, Shreyas V S, Sujit Kumar Chakrabarti

Abstract: Recruitment in large organisations often involves interviewing a large number of candidates. The process is resource intensive and complex. Therefore, it is important to carry it out efficiently and effectively. Planning the selection process consists of several problems, each of which maps to one or another well-known computing problem. Research that looks at each of these problems in isolation is rich and mature. However, research that takes an integrated view of the problem is not common. In this paper, we take two of the most important aspects of the application processing problem, namely review/interview panel creation and interview scheduling. We have implemented our approach as a prototype system and have used it to automatically plan the interview process of a real-life data set. Our system provides a distinctly better plan than the existing practice, which is predominantly manual. We have explored various algorithmic options and have customised them to solve these panel creation and interview scheduling problems. We have evaluated these design options experimentally on a real data set and have presented our observations. Our prototype, experimental process, and results may be a very good starting point for a full-fledged development project for automating the application processing process.

Submitted 19 April, 2022; originally announced April 2022.

arXiv:2204.07066 [pdf, other]

EvoSTS Forecasting: Evolutionary Sparse Time-Series Forecasting

Authors: Ethan Jacob Moyer, Alisha Isabelle Augustin, Satvik Tripathi, Ansh Aashish Dholakia, Andy Nguyen, Isamu Mclean Isozaki, Daniel Schwartz, Edward Kim

Abstract: In this work, we highlight our novel evolutionary sparse time-series forecasting algorithm, also known as EvoSTS. The algorithm attempts to evolutionarily prioritize the weights of a Long Short-Term Memory (LSTM) network that best minimize the reconstruction loss of a predicted signal using a learned sparse coded dictionary. In each generation of our evolutionary algorithm, a set number of children with the same initial weights are spawned. Each child undergoes a training step and adjusts its weights on the same data. Due to stochastic back-propagation, the set of children has a variety of weights with different levels of performance. The weights that best minimize the reconstruction loss with a given signal dictionary are passed to the next generation. The predictions from the best-performing weights of the first and last generation are compared. We found improvements while comparing the weights of these two generations. However, due to several confounding parameters and hyperparameter limitations, some of the weights had negligible improvements. To the best of our knowledge, this is the first attempt to use sparse coding in this way to optimize time series forecasting model weights, such as those of an LSTM network.

Submitted 14 April, 2022; originally announced April 2022.

Comments: 5 pages, 2 figures, 2 tables

arXiv:2204.01918 [pdf, other]

Text Spotting Transformers

Authors: Xiang Zhang, Yongwen Su, Subarna Tripathi, Zhuowen Tu

Abstract: In this paper, we present TExt Spotting TRansformers (TESTR), a generic end-to-end text spotting framework using Transformers for text detection and recognition in the wild. TESTR builds upon a single encoder and dual decoders for the joint text-box control point regression and character recognition. Unlike most existing literature, our method is free from Region-of-Interest operations and heuristics-driven post-processing procedures; TESTR is particularly effective when dealing with curved text-boxes, where special care is needed to adapt the traditional bounding-box representations. We show our canonical representation of control points suitable for text instances in both Bezier curve and polygon annotations. In addition, we design a bounding-box guided polygon detection (box-to-polygon) process. Experiments on curved and arbitrarily shaped datasets demonstrate state-of-the-art performance of the proposed TESTR algorithm.

Submitted 4 April, 2022; originally announced April 2022.

Comments: Accepted to CVPR 2022

arXiv:2204.01696 [pdf, other]

Joint Hand Motion and Interaction Hotspots Prediction from Egocentric Videos

Authors: Shaowei Liu, Subarna Tripathi, Somdeb Majumdar, Xiaolong Wang

Abstract: We propose to forecast future hand-object interactions given an egocentric video. Instead of predicting action labels or pixels, we directly predict the hand motion trajectory and the future contact points on the next active object (i.e., interaction hotspots). This relatively low-dimensional representation provides a concrete description of future interactions. To tackle this task, we first provide an automatic way to collect trajectory and hotspots labels on large-scale data. We then use this data to train an Object-Centric Transformer (OCT) model for prediction. Our model performs hand and object interaction reasoning via the self-attention mechanism in Transformers. OCT also provides a probabilistic framework to sample the future trajectory and hotspots to handle uncertainty in prediction. We perform experiments on the Epic-Kitchens-55, Epic-Kitchens-100, and EGTEA Gaze+ datasets, and show that OCT significantly outperforms state-of-the-art approaches by a large margin. Project page is available at https://stevenlsw.github.io/hoi-forecast.

Submitted 4 April, 2022; originally announced April 2022.

Comments: CVPR 2022, Project page: https://stevenlsw.github.io/hoi-forecast

arXiv:2203.13349 [pdf, other]

Occluded Human Mesh Recovery

Authors: Rawal Khirodkar, Shashank Tripathi, Kris Kitani

Abstract: Top-down methods for monocular human mesh recovery have two stages: (1) detect human bounding boxes; (2) treat each bounding box as an independent single-human mesh recovery task. Unfortunately, the single-human assumption does not hold in images with multi-human occlusion and crowding. Consequently, top-down methods have difficulties in recovering accurate 3D human meshes under severe person-person occlusion. To address this, we present Occluded Human Mesh Recovery (OCHMR) - a novel top-down mesh recovery approach that incorporates image spatial context to overcome the limitations of the single-human assumption. The approach is conceptually simple and can be applied to any existing top-down architecture. Along with the input image, we condition the top-down model on spatial context from the image in the form of body-center heatmaps. To reason from the predicted body centermaps, we introduce Contextual Normalization (CoNorm) blocks to adaptively modulate intermediate features of the top-down model. The contextual conditioning helps our model disambiguate between two severely overlapping human bounding-boxes, making it robust to multi-person occlusion. Compared with state-of-the-art methods, OCHMR achieves superior performance on challenging multi-person benchmarks like 3DPW, CrowdPose and OCHuman. Specifically, our proposed contextual reasoning architecture applied to the SPIN model with ResNet-50 backbone results in 75.2 PMPJPE on 3DPW-PC, 23.6 AP on CrowdPose and 37.7 AP on OCHuman datasets, a significant improvement of 6.9 mm, 6.4 AP and 20.8 AP respectively over the baseline. Code and models will be released.

Submitted 24 March, 2022; originally announced March 2022.

arXiv:2203.10636 [pdf, other]

Transform your Smartphone into a DSLR Camera: Learning the ISP in the Wild

Authors: Ardhendu Shekhar Tripathi, Martin Danelljan, Samarth Shukla, Radu Timofte, Luc Van Gool

Abstract: We propose a trainable Image Signal Processing (ISP) framework that produces DSLR quality images given RAW images captured by a smartphone. To address the color misalignments between training image pairs, we employ a color-conditional ISP network and optimize a novel parametric color mapping between each input RAW and reference DSLR image. During inference, we predict the target color image by designing a color prediction network with efficient Global Context Transformer modules. The latter effectively leverage global information to learn consistent color and tone mappings. We further propose a robust masked aligned loss to identify and discard regions with inaccurate motion estimation during training. Lastly, we introduce the ISP in the Wild (ISPW) dataset, consisting of weakly paired phone RAW and DSLR sRGB images. We extensively evaluate our method, setting a new state-of-the-art on two datasets.

Submitted 12 July, 2022; v1 submitted 20 March, 2022; originally announced March 2022.

Comments: Accepted at ECCV 2022

arXiv:2202.10701 [pdf, other]

Bag of Visual Words (BoVW) with Deep Features -- Patch Classification Model for Limited Dataset of Breast Tumours

Authors: Suvidha Tripathi, Satish Kumar Singh, Lee Hwee Kuan

Abstract: Currently, computational complexity limits the training of Convolutional Neural Networks on high-resolution gigapixel images. Therefore, such images are divided into patches or tiles. Since these high-resolution patches are encoded with discriminative information, CNNs are trained on them to perform patch-level predictions. However, the problem with patch-level prediction is that pathologists generally annotate at the image level and not at the patch level. Due to this limitation, most of the patches may not contain enough class-relevant features. Through this work, we tried to incorporate patch descriptive capability within the deep framework by using Bag of Visual Words (BoVW) as a kind of regularisation to improve generalizability. Using this hypothesis, we aim to build a patch-based classifier to discriminate between four classes of breast biopsy image patches (normal, benign, in situ carcinoma, invasive carcinoma). The task is to incorporate quality deep features using a CNN to describe relevant information in the images while simultaneously discarding irrelevant information using BoVW. The proposed method passes patches obtained from WSI and microscopy images through a pre-trained CNN to extract features. BoVW is used as a feature selector to select the most discriminative features among the CNN features. Finally, the selected feature sets are classified as one of the four classes. The hybrid model provides flexibility in terms of choice of pre-trained models for feature extraction. The pipeline is end-to-end since it does not require post-processing of patch predictions to select discriminative patches. We compared our observations with state-of-the-art methods like ResNet50, DenseNet169, and InceptionV3 on the BACH-2018 challenge dataset. Our proposed method shows better performance than all three methods.

Submitted 22 February, 2022; originally announced February 2022.

arXiv:2202.10694 [pdf, other]

doi 10.1007/s11042-020-08891-w

Ensembling Handcrafted Features with Deep Features: An Analytical Study for Classification of Routine Colon Cancer Histopathological Nuclei Images

Authors: Suvidha Tripathi, Satish Kumar Singh

Abstract: The use of Deep Learning (DL) based methods in medical histopathology images has been one of the most sought after solutions to classify, segment, and detect diseased biopsy samples. However, given the complex nature of medical datasets due to the presence of intra-class variability and heterogeneity, the use of complex DL models might not give the optimal performance up to the level which is suitable for assisting pathologists. Therefore, ensemble DL methods with the scope of including domain agnostic handcrafted Features (HC-F) inspired this work. We have, through experiments, tried to highlight that a single DL network (domain-specific or state of the art pre-trained models) cannot be directly used as the base model without proper analysis with the relevant dataset. We have used F1-measure, Precision, Recall, AUC, and Cross-Entropy Loss to analyse the performance of our approaches. We observed from the results that the ensemble of DL features brings a marked improvement in the overall performance of the model, whereas domain agnostic HC-F remains dormant on the performance of the DL models.

Submitted 22 February, 2022; originally announced February 2022.

Journal ref: Multimedia Tools Application 79 34931-34954 2020

arXiv:2202.10691 [pdf, other]

An Object Aware Hybrid U-Net for Breast Tumour Annotation

Authors: Suvidha Tripathi, Satish Kumar Singh

Abstract: In clinical settings, during digital examination of histopathological slides, pathologists annotate the slides by marking the rough boundary around the suspected tumour region. The marking or annotation is generally represented as a polygonal boundary that covers the extent of the tumour in the slide. These polygonal markings are difficult to imitate through CAD techniques since the tumour regions are heterogeneous and hence segmenting them would require exhaustive pixel-wise ground truth annotation. Therefore, for CAD analysis, the ground truths are generally annotated by pathologists explicitly for research purposes. However, this kind of annotation, which is generally required for semantic or instance segmentation, is time consuming and tedious. In this proposed work, therefore, we have tried to imitate pathologist-like annotation by segmenting tumour extents by polygonal boundaries. For polygon-like annotation or segmentation, we have used Active Contours whose vertices or snake points move towards the boundary of the object of interest to find the region of minimum energy. To penalize the Active Contour, we used a modified U-Net architecture for learning penalization values. The proposed hybrid deep learning model fuses the modern deep learning segmentation algorithm with the traditional Active Contours segmentation technique. The model is tested against both state-of-the-art semantic segmentation and hybrid models for performance evaluation against contemporary work. The results obtained show that pathologist-like annotation could be achieved by developing such hybrid models that integrate the domain knowledge through classical segmentation methods like Active Contours and global knowledge through semantic segmentation deep learning models.

Submitted 22 February, 2022; originally announced February 2022.

arXiv:2202.10177 [pdf, other]

doi 10.1145/3345318

Cell nuclei classification in histopathological images using hybrid OLConvNet

Authors: Suvidha Tripathi, Satish Kumar Singh

Abstract: Computer-aided histopathological image analysis for cancer detection is a major research challenge in the medical domain. Automatic detection and classification of nuclei for cancer diagnosis pose many challenges in developing state of the art algorithms due to the heterogeneity of cell nuclei and data set variability. Recently, a multitude of classification algorithms has used complex deep learning models for their dataset. However, most of these methods are rigid and their architectural arrangement suffers from inflexibility and non-interpretability. In this research article, we have proposed a hybrid and flexible deep learning architecture, OLConvNet, that integrates the interpretability of traditional object-level features and the generalization of deep learning features by using a shallower Convolutional Neural Network (CNN) named $CNN_{3L}$. $CNN_{3L}$ reduces the training time by training fewer parameters, hence eliminating space constraints imposed by deeper algorithms. We used F1-score and multiclass Area Under the Curve (AUC) performance parameters to compare the results. To further strengthen the viability of our architectural approach, we tested our proposed methodology with state of the art deep learning architectures AlexNet, VGG16, VGG19, ResNet50, InceptionV3, and DenseNet121 as backbone networks. After a comprehensive analysis of classification results from all four architectures, we observed that our proposed model works well and performs better than contemporary complex algorithms.

Submitted 21 February, 2022; originally announced February 2022.

Journal ref: ACM Trans. Multimedia Comput. Commun. Appl., vol. 16, no. 1s, article 32, 22 pages, 2020 (ISSN 1551-6857)

arXiv:2112.09828 [pdf, other]

Exploiting Long-Term Dependencies for Generating Dynamic Scene Graphs

Authors: Shengyu Feng, Subarna Tripathi, Hesham Mostafa, Marcel Nassar, Somdeb Majumdar

Abstract: Dynamic scene graph generation from a video is challenging due to the temporal dynamics of the scene and the inherent temporal fluctuations of predictions. We hypothesize that capturing long-term temporal dependencies is the key to effective generation of dynamic scene graphs. We propose to learn the long-term dependencies in a video by capturing the object-level consistency and inter-object relationship dynamics over object-level long-term tracklets using transformers. Experimental results demonstrate that our Dynamic Scene Graph Detection Transformer (DSG-DETR) outperforms state-of-the-art methods by a significant margin on the benchmark dataset Action Genome. Our ablation studies validate the effectiveness of each component of the proposed approach. The source code is available at https://github.com/Shengyu-Feng/DSG-DETR.

Submitted 19 October, 2022; v1 submitted 17 December, 2021; originally announced December 2021.

Comments: WACV 2023

arXiv:2112.01479 [pdf, other]

Learning Spatial-Temporal Graphs for Active Speaker Detection

Authors: Sourya Roy, Kyle Min, Subarna Tripathi, Tanaya Guha, Somdeb Majumdar

Abstract: We address the problem of active speaker detection through a new framework, called SPELL, that learns long-range multimodal graphs to encode the inter-modal relationship between audio and visual data. We cast active speaker detection as a node classification task that is aware of longer-term dependencies. We first construct a graph from a video so that each node corresponds to one person. Nodes representing the same identity share edges between them within a defined temporal window. Nodes within the same video frame are also connected to encode inter-person interactions. Through extensive experiments on the Ava-ActiveSpeaker dataset, we demonstrate that learning graph-based representation, owing to its explicit spatial and temporal structure, significantly improves the overall performance. SPELL outperforms several relevant baselines and performs at par with state of the art models while requiring an order of magnitude lower computation cost.

Submitted 3 December, 2021; v1 submitted 2 December, 2021; originally announced December 2021.

Comments: 10 pages

arXiv:2111.03039 [pdf, other]

Towards Panoptic 3D Parsing for Single Image in the Wild

Authors: Sainan Liu, Vincent Nguyen, Yuan Gao, Subarna Tripathi, Zhuowen Tu

Abstract: Performing single image holistic understanding and 3D reconstruction is a central task in computer vision. This paper presents an integrated system that performs dense scene labeling, object detection, instance segmentation, depth estimation, 3D shape reconstruction, and 3D layout estimation for indoor and outdoor scenes from a single RGB image. We name our system panoptic 3D parsing (Panoptic3D) in which panoptic segmentation ("stuff" segmentation and "things" detection/segmentation) with 3D reconstruction is performed. We design a stage-wise system, Panoptic3D (stage-wise), where a complete set of annotations is absent. Additionally, we present an end-to-end pipeline, Panoptic3D (end-to-end), trained on a synthetic dataset with a full set of annotations. We show results on both indoor (3D-FRONT) and outdoor (COCO and Cityscapes) scenes. Our proposed panoptic 3D parsing framework points to a promising direction in computer vision. Panoptic3D can be applied to a variety of applications, including autonomous driving, mapping, robotics, design, computer graphics, robotics, human-computer interaction, and augmented reality.

Submitted 29 November, 2021; v1 submitted 4 November, 2021; originally announced November 2021.

arXiv:2111.01414 [pdf, other]

A Review of Dialogue Systems: From Trained Monkeys to Stochastic Parrots

Authors: Atharv Singh Patlan, Shiven Tripathi, Shubham Korde

Abstract: In spoken dialogue systems, we aim to deploy artificial intelligence to build automated dialogue agents that can converse with humans. Dialogue systems are increasingly being designed to move beyond just imitating conversation and also improve from such interactions over time. In this survey, we present a broad overview of methods developed to build dialogue systems over the years. Different use cases for dialogue systems ranging from task-based systems to open domain chatbots motivate and necessitate specific systems. Starting from simple rule-based systems, research has progressed towards increasingly complex architectures trained on a massive corpus of datasets, like deep learning systems. Motivated with the intuition of resembling human dialogues, progress has been made towards incorporating emotions into the natural language generator, using reinforcement learning. While we see a trend of highly marginal improvement on some metrics, we find that limited justification exists for the metrics, and evaluation practices are not uniform. To conclude, we flag these concerns and highlight possible research directions.

Submitted 2 November, 2021; originally announced November 2021.

arXiv:2110.09436 [pdf]

Early Diagnostic Prediction of Covid-19 using Gradient-Boosting Machine Model

Authors: Satvik Tripathi

Abstract: With the huge spike in COVID-19 cases across the globe, the reverse transcriptase-polymerase chain reaction (RT-PCR) test remains a key component for rapid and accurate detection of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). In recent months there has been an acute shortage of medical supplies in developing countries, especially a lack of RT-PCR testing, resulting in delayed patient care and high infection rates. We present a gradient-boosting machine model that predicts the diagnostic result of SARS-CoV-2 in an RT-PCR test by utilizing eight binary features. We used the publicly available nationwide dataset released by the Israeli Ministry of Health.

Submitted 18 October, 2021; v1 submitted 12 October, 2021; originally announced October 2021.

Comments: Presented at the Drexel Society of Artificial Intelligence Research Conference, 2021 (arXiv:2110.05263)

Report number: drexelai/2021/06

arXiv:2109.10868 [pdf, other]

doi 10.1109/TCCN.2021.3115098

A Context-aware Radio Resource Management in Heterogeneous Virtual RANs

Authors: Sharda Tripathi, Corrado Puligheddu, Carla Fabiana Chiasserini, Federico Mungari

Abstract: New-generation wireless networks are designed to support a wide range of services with diverse key performance indicator (KPI) requirements. A fundamental component of such networks, and a pivotal factor in the fulfillment of the target KPIs, is the virtual radio access network (vRAN), which allows high flexibility in the control of the radio link. However, to fully exploit the potential of vRANs, an efficient mapping of the rapidly varying context to radio control decisions is not only essential, but also challenging owing to the interdependence of user traffic demand, channel conditions, and resource allocation. Here, we propose CAREM, a reinforcement learning framework for dynamic radio resource allocation in heterogeneous vRANs, which selects the best available link and transmission parameters for packet transfer, so as to meet the KPI requirements. To show its effectiveness, we develop a testbed for proof-of-concept. Experimental results demonstrate that CAREM enables an efficient radio resource allocation under different settings and traffic demand. Also, compared to the closest existing scheme based on neural network and the standard LTE, CAREM exhibits an improvement of one order of magnitude in packet loss and latency, while it provides a 65% latency improvement relative to the contextual bandit approach.

Submitted 23 September, 2021; v1 submitted 22 September, 2021; originally announced September 2021.

Comments: Accepted for publication in IEEE Transactions on Cognitive Communications and Networking

Showing 1–50 of 115 results for author: Tripathi, S