Memory-Maze: Scenario Driven Benchmark and Visual Language Navigation Model for Guiding Blind People

Authors: Masaki Kuribayashi, Kohei Uehara, Allan Wang, Daisuke Sato, Simon Chu, Shigeo Morishima

Abstract: Visual Language Navigation (VLN) powered navigation robots have the potential to guide blind people by understanding and executing route instructions provided by sighted passersby. This capability allows robots to operate in environments that are often unknown a priori. Existing VLN models are insufficient for the scenario of navigation guidance for blind people, as they need to understand routes described from human memory, which frequently contain stutters, errors, and omission of details as opposed to those obtained by thinking out loud, such as in the Room-to-Room dataset. However, currently, there is no benchmark that simulates instructions that were obtained from human memory in environments where blind people navigate. To this end, we present our benchmark, Memory-Maze, which simulates the scenario of seeking route instructions for guiding blind people. Our benchmark contains a maze-like structured virtual environment and novel route instruction data from human memory. To collect natural language instructions, we conducted two studies from sighted passersby onsite and annotators online. Our analysis demonstrates that instructions data collected onsite were more lengthy and contained more varied wording. Alongside our benchmark, we propose a VLN model better equipped to handle the scenario. Our proposed VLN model uses Large Language Models (LLM) to parse instructions and generate Python codes for robot control. We further show that the existing state-of-the-art model performed suboptimally on our benchmark. In contrast, our proposed method outperformed the state-of-the-art model by a fair margin. We found that future research should exercise caution when considering VLN technology for practical applications, as real-world scenarios have different characteristics than ones collected in traditional settings.

Submitted 11 May, 2024; originally announced May 2024.

arXiv:2401.10005

Advancing Large Multi-modal Models with Explicit Chain-of-Reasoning and Visual Question Generation

Authors: Kohei Uehara, Nabarun Goswami, Hanqin Wang, Toshiaki Baba, Kohtaro Tanaka, Tomohiro Hashimoto, Kai Wang, Rei Ito, Takagi Naoya, Ryo Umagami, Yingyi Wen, Tanachai Anakewat, Tatsuya Harada

Abstract: The increasing demand for intelligent systems capable of interpreting and reasoning about visual content requires the development of large Vision-and-Language Models (VLMs) that are not only accurate but also have explicit reasoning capabilities. This paper presents a novel approach to develop a VLM with the ability to conduct explicit reasoning based on visual content and textual instructions. We introduce a system that can ask a question to acquire necessary knowledge, thereby enhancing the robustness and explicability of the reasoning process. To this end, we developed a novel dataset generated by a Large Language Model (LLM), designed to promote chain-of-thought reasoning combined with a question-asking mechanism. The dataset covers a range of tasks, from common ones like caption generation to specialized VQA tasks that require expert knowledge. Furthermore, using the dataset we created, we fine-tuned an existing VLM. This training enabled the models to generate questions and perform iterative reasoning during inference. The results demonstrated a stride toward a more robust, accurate, and interpretable VLM, capable of reasoning explicitly and seeking information proactively when confronted with ambiguous visual input.

Submitted 17 July, 2024; v1 submitted 18 January, 2024; originally announced January 2024.

arXiv:2312.00989

doi 10.14722/ndss.2024.24445

Scrappy: SeCure Rate Assuring Protocol with PrivacY

Authors: Kosei Akama, Yoshimichi Nakatsuka, Masaaki Sato, Keisuke Uehara

Abstract: Preventing abusive activities caused by adversaries accessing online services at a rate exceeding that expected by websites has become an ever-increasing problem. CAPTCHAs and SMS authentication are widely used to provide a solution by implementing rate limiting, although they are becoming less effective, and some are considered privacy-invasive. In light of this, many studies have proposed better rate-limiting systems that protect the privacy of legitimate users while blocking malicious actors. However, they suffer from one or more shortcomings: (1) assume trust in the underlying hardware and (2) are vulnerable to side-channel attacks. Motivated by the aforementioned issues, this paper proposes Scrappy: SeCure Rate Assuring Protocol with PrivacY. Scrappy allows clients to generate unforgeable yet unlinkable rate-assuring proofs, which provides the server with cryptographic guarantees that the client is not misbehaving. We design Scrappy using a combination of DAA and hardware security devices. Scrappy is implemented over three types of devices, including one that can immediately be deployed in the real world. Our baseline evaluation shows that the end-to-end latency of Scrappy is minimal, taking only 0.32 seconds, and uses only 679 bytes of bandwidth when transferring necessary data. We also conduct an extensive security evaluation, showing that the rate-limiting capability of Scrappy is unaffected even if the hardware security device is compromised.

Submitted 1 December, 2023; originally announced December 2023.

Journal ref: Network and Distributed System Security (NDSS) Symposium 2024

arXiv:2210.05879

Learning by Asking Questions for Knowledge-based Novel Object Recognition

Authors: Kohei Uehara, Tatsuya Harada

Abstract: In real-world object recognition, there are numerous object classes to be recognized. Conventional image recognition based on supervised learning can only recognize object classes that exist in the training data, and thus has limited applicability in the real world. On the other hand, humans can recognize novel objects by asking questions and acquiring knowledge about them. Inspired by this, we study a framework for acquiring external knowledge through question generation that would help the model instantly recognize novel objects. Our pipeline consists of two components: the Object Classifier, which performs knowledge-based object recognition, and the Question Generator, which generates knowledge-aware questions to acquire novel knowledge. We also propose a question generation strategy based on the confidence of the knowledge-aware prediction of the Object Classifier. To train the Question Generator, we construct a dataset that contains knowledge-aware questions about objects in the images. Our experiments show that the proposed pipeline effectively acquires knowledge about novel objects compared to several baselines.

Submitted 11 October, 2022; originally announced October 2022.

arXiv:2203.07890

K-VQG: Knowledge-aware Visual Question Generation for Common-sense Acquisition

Authors: Kohei Uehara, Tatsuya Harada

Abstract: Visual Question Generation (VQG) is a task to generate questions from images. When humans ask questions about an image, their goal is often to acquire some new knowledge. However, existing studies on VQG have mainly addressed question generation from answers or question categories, overlooking the objectives of knowledge acquisition. To introduce a knowledge acquisition perspective into VQG, we constructed a novel knowledge-aware VQG dataset called K-VQG. This is the first large, humanly annotated dataset in which questions regarding images are tied to structured knowledge. We also developed a new VQG model that can encode and use knowledge as the target for a question. The experiment results show that our model outperforms existing models on the K-VQG dataset.

Submitted 15 March, 2022; originally announced March 2022.

arXiv:2202.07305

doi 10.1145/3487553.3524649

ViNTER: Image Narrative Generation with Emotion-Arc-Aware Transformer

Authors: Kohei Uehara, Yusuke Mori, Yusuke Mukuta, Tatsuya Harada

Abstract: Image narrative generation is a task to create a story from an image with a subjective viewpoint. Given the importance of the subjective feelings of writers, readers, and characters in storytelling, an image narrative generation method should consider human emotion. In this study, we propose a novel method of image narrative generation called ViNTER (Visual Narrative Transformer with Emotion arc Representation), which takes "emotion arc" as input to capture a sequence of emotional changes. Since emotion arcs represent the trajectory of emotional change, it is expected that we can include detailed information about the emotional changes in the story to the model. We present experimental results of both automatic and manual evaluations on the Image Narrative dataset and demonstrate the effectiveness of the proposed approach.

Submitted 7 April, 2022; v1 submitted 15 February, 2022; originally announced February 2022.

arXiv:2012.02346

doi 10.1145/3474085.3475589

ChartPointFlow for Topology-Aware 3D Point Cloud Generation

Authors: Takumi Kimura, Takashi Matsubara, Kuniaki Uehara

Abstract: A point cloud serves as a representation of the surface of a three-dimensional (3D) shape. Deep generative models have been adapted to model their variations typically using a map from a ball-like set of latent variables. However, previous approaches did not pay much attention to the topological structure of a point cloud, despite that a continuous map cannot express the varying numbers of holes and intersections. Moreover, a point cloud is often composed of multiple subparts, and it is also difficult to express. In this study, we propose ChartPointFlow, a flow-based generative model with multiple latent labels for 3D point clouds. Each label is assigned to points in an unsupervised manner. Then, a map conditioned on a label is assigned to a continuous subset of a point cloud, similar to a chart of a manifold. This enables our proposed model to preserve the topological structure with clear boundaries, whereas previous approaches tend to generate blurry point clouds and fail to generate holes. The experimental results demonstrate that ChartPointFlow achieves state-of-the-art performance in terms of generation and reconstruction compared with other point cloud generators. Moreover, ChartPointFlow divides an object into semantic subparts using charts, and it demonstrates superior performance in case of unsupervised segmentation.

Submitted 7 August, 2021; v1 submitted 3 December, 2020; originally announced December 2020.

Comments: Accepted to ACM International Conference on Multimedia (ACMMM2021) as an oral presentation

Journal ref: ACM International Conference on Multimedia (ACMMM2021)

arXiv:1911.10354

Unsupervised Keyword Extraction for Full-sentence VQA

Authors: Kohei Uehara, Tatsuya Harada

Abstract: In the majority of the existing Visual Question Answering (VQA) research, the answers consist of short, often single words, as per instructions given to the annotators during dataset construction. This study envisions a VQA task for natural situations, where the answers are more likely to be sentences rather than single words. To bridge the gap between this natural VQA and existing VQA approaches, a novel unsupervised keyword extraction method is proposed. The method is based on the principle that the full-sentence answers can be decomposed into two parts: one that contains new information answering the question (i.e., keywords), and one that contains information already included in the question. Discriminative decoders were designed to achieve such decomposition, and the method was experimentally implemented on VQA datasets containing full-sentence answers. The results show that the proposed model can accurately extract the keywords without being given explicit annotations describing them.

Submitted 12 October, 2020; v1 submitted 23 November, 2019; originally announced November 2019.

Comments: EMNLP 2020 workshop: NLP Beyond Text (NLPBT)

arXiv:1905.02442

Interactive Video Retrieval with Dialog

Authors: Sho Maeoki, Kohei Uehara, Tatsuya Harada

Abstract: Now that everyone can easily record videos, the quantity of which is continuously increasing, research on methods for improved video retrieval is important in the contemporary world. In cases where target videos are to be identified within a large collection gathered by individuals, the appropriate information must be obtained to retrieve the correct video within a large number of similar items in the target database. The purpose of this research is to retrieve target videos in such cases by introducing an interaction, or a dialog, between the system and the user. We propose a system to retrieve videos by asking questions about the content of the videos and leveraging the user's responses to the questions. Additionally, we confirmed the usefulness of the proposed system through experiments using the dataset called AVSD which includes videos and dialogs about the videos.

Submitted 7 May, 2019; originally announced May 2019.

arXiv:1904.08504

Exploring Uncertainty Measures for Image-Caption Embedding-and-Retrieval Task

Authors: Kenta Hama, Takashi Matsubara, Kuniaki Uehara, Jianfei Cai

Abstract: With the wide development of black-box machine learning algorithms, particularly deep neural network (DNN), the practical demand for the reliability assessment is rapidly rising. On the basis of the concept that 'Bayesian deep learning knows what it does not know,' the uncertainty of DNN outputs has been investigated as a reliability measure for the classification and regression tasks. However, in the image-caption retrieval task, well-known samples are not always easy-to-retrieve samples. This study investigates two aspects of image-caption embedding-and-retrieval systems. On one hand, we quantify feature uncertainty by considering image-caption embedding as a regression task, and use it for model averaging, which can improve the retrieval performance. On the other hand, we further quantify posterior uncertainty by considering the retrieval as a classification task, and use it as a reliability measure, which can greatly improve the retrieval performance by rejecting uncertain queries. The consistent performance of two uncertainty measures is observed with different datasets (MS COCO and Flickr30k), different deep learning architectures (dropout and batch normalization), and different similarity functions.

Submitted 9 April, 2019; originally announced April 2019.

arXiv:1811.09030

doi 10.1109/TCSVT.2019.2935128

Data Augmentation using Random Image Cropping and Patching for Deep CNNs

Authors: Ryo Takahashi, Takashi Matsubara, Kuniaki Uehara

Abstract: Deep convolutional neural networks (CNNs) have achieved remarkable results in image processing tasks. However, their high expression ability risks overfitting. Consequently, data augmentation techniques have been proposed to prevent overfitting while enriching datasets. Recent CNN architectures with more parameters are rendering traditional data augmentation techniques insufficient. In this study, we propose a new data augmentation technique called random image cropping and patching (RICAP) which randomly crops four images and patches them to create a new training image. Moreover, RICAP mixes the class labels of the four images, resulting in an advantage similar to label smoothing. We evaluated RICAP with current state-of-the-art CNNs (e.g., the shake-shake regularization model) by comparison with competitive data augmentation techniques such as cutout and mixup. RICAP achieves a new state-of-the-art test error of 2.19% on CIFAR-10. We also confirmed that deep CNNs with RICAP achieve better results on classification tasks using CIFAR-100 and ImageNet and an image-caption retrieval task using Microsoft COCO.

Submitted 27 August, 2019; v1 submitted 22 November, 2018; originally announced November 2018.

Comments: accepted version, 16 pages

Journal ref: IEEE Transactions on Circuits and Systems for Video Technology, 2019
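
The cropping, patching, and label mixing that the RICAP abstract describes can be sketched roughly as follows. This is an illustrative reading of the abstract, not the authors' implementation: the function name `ricap`, the `beta` parameter, drawing the boundary point from a Beta distribution, and weighting each label by patch area are all assumptions made for the sketch.

```python
import random


def ricap(images, labels, num_classes, beta=0.3, rng=None):
    """Patch random crops of four images into one training image.

    images: four images, each a list of H rows of W pixel values (same H, W).
    labels: four integer class labels, one per image.
    Returns (patched_image, soft_label).
    """
    rng = rng or random.Random()
    height, width = len(images[0]), len(images[0][0])

    # Assumption: the boundary point (w, h) is drawn from Beta(beta, beta).
    w = round(width * rng.betavariate(beta, beta))
    h = round(height * rng.betavariate(beta, beta))

    # Region sizes: top-left, top-right, bottom-left, bottom-right.
    sizes = [(h, w), (h, width - w), (height - h, w), (height - h, width - w)]
    # Top-left corner of each region in the output image.
    offsets = [(0, 0), (0, w), (h, 0), (h, w)]

    patched = [[None] * width for _ in range(height)]
    soft_label = [0.0] * num_classes
    for image, label, (ph, pw), (oy, ox) in zip(images, labels, sizes, offsets):
        # Choose a random crop of size (ph, pw) inside the source image.
        y0 = rng.randint(0, height - ph)
        x0 = rng.randint(0, width - pw)
        for dy in range(ph):
            for dx in range(pw):
                patched[oy + dy][ox + dx] = image[y0 + dy][x0 + dx]
        # Assumption: each label is weighted by its patch's share of the area.
        soft_label[label] += (ph * pw) / (height * width)
    return patched, soft_label
```

Because the four regions tile the output exactly, the soft label sums to 1, which is what gives the effect the abstract compares to label smoothing. Plain nested lists are used here only to keep the sketch dependency-free; an array library would be the natural choice in practice.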

arXiv:1808.02996

Object Detection in Satellite Imagery using 2-Step Convolutional Neural Networks

Authors: Hiroki Miyamoto, Kazuki Uehara, Masahiro Murakawa, Hidenori Sakanashi, Hirokazu Nosato, Toru Kouyama, Ryosuke Nakamura

Abstract: This paper presents an efficient object detection method from satellite imagery. Among a number of machine learning algorithms, we proposed a combination of two convolutional neural networks (CNN) aimed at high precision and high recall, respectively. We validated our models using golf courses as target objects. The proposed deep learning method demonstrated higher accuracy than previous object identification methods.

Submitted 8 August, 2018; originally announced August 2018.

Comments: 4 pages, 5 figures

arXiv:1808.01821

Visual Question Generation for Class Acquisition of Unknown Objects

Authors: Kohei Uehara, Antonio Tejero-De-Pablos, Yoshitaka Ushiku, Tatsuya Harada

Abstract: Traditional image recognition methods only consider objects belonging to already learned classes. However, since training a recognition model with every object class in the world is unfeasible, a way of getting information on unknown objects (i.e., objects whose class has not been learned) is necessary. A way for an image recognition system to learn new classes could be asking a human about objects that are unknown. In this paper, we propose a method for generating questions about unknown objects in an image, as means to get information about classes that have not been learned. Our method consists of a module for proposing objects, a module for identifying unknown objects, and a module for generating questions about unknown objects. The experimental results via human evaluation show that our method can successfully get information about unknown objects in an image dataset. Our code and dataset are available at https://github.com/mil-tokyo/vqg-unknown.

Submitted 6 August, 2018; originally announced August 2018.

arXiv:1807.05800

Deep Generative Model using Unregularized Score for Anomaly Detection with Heterogeneous Complexity

Authors: Takashi Matsubara, Kenta Hama, Ryosuke Tachibana, Kuniaki Uehara

Abstract: Accurate and automated detection of anomalous samples in a natural image dataset can be accomplished with a probabilistic model for end-to-end modeling of images. Such images have heterogeneous complexity, however, and a probabilistic model overlooks simply shaped objects with small anomalies. This is because the probabilistic model assigns undesirably lower likelihoods to complexly shaped objects that are nevertheless consistent with set standards. To overcome this difficulty, we propose an unregularized score for deep generative models (DGMs), which are generative models leveraging deep neural networks. We found that the regularization terms of the DGMs considerably influence the anomaly score depending on the complexity of the samples. By removing these terms, we obtain an unregularized score, which we evaluated on a toy dataset and real-world manufacturing datasets. Empirical results demonstrate that the unregularized score is robust to the inherent complexity of samples and can be used to better detect anomalies.

Submitted 4 September, 2018; v1 submitted 16 July, 2018; originally announced July 2018.

Comments: An extended version of a manuscript in Proc. of The 2018 International Joint Conference on Neural Networks (IJCNN2018)

arXiv:1712.06260

doi 10.1109/TBME.2019.2895663

Deep Neural Generative Model of Functional MRI Images for Psychiatric Disorder Diagnosis

Authors: Takashi Matsubara, Tetsuo Tashiro, Kuniaki Uehara

Abstract: Accurate diagnosis of psychiatric disorders plays a critical role in improving the quality of life for patients and potentially supports the development of new treatments. Many studies have been conducted on machine learning techniques that seek brain imaging data for specific biomarkers of disorders. These studies have encountered the following dilemma: A direct classification overfits to a small number of high-dimensional samples but unsupervised feature-extraction has the risk of extracting a signal of no interest. In addition, such studies often provided only diagnoses for patients without presenting the reasons for these diagnoses. This study proposed a deep neural generative model of resting-state functional magnetic resonance imaging (fMRI) data. The proposed model is conditioned by the assumption of the subject's state and estimates the posterior probability of the subject's state given the imaging data, using Bayes' rule. This study applied the proposed model to diagnose schizophrenia and bipolar disorders. Diagnostic accuracy was improved by a large margin over competitive approaches, namely classifications of functional connectivity, discriminative/generative models of region-wise signals, and those with unsupervised feature-extractors. The proposed model visualizes brain regions largely related to the disorders, thus motivating further biological investigation.

Submitted 11 April, 2019; v1 submitted 18 December, 2017; originally announced December 2017.

Comments: accepted version, 12 pages

Journal ref: IEEE Transactions on Biomedical Engineering, 2019
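
The Bayes' rule step mentioned in the abstract above can be written out as follows. The symbols are assumptions introduced here for illustration, not notation taken from the paper: x stands for the imaging data, y for the subject's state (for example, patient or control), and p(x | y) for the likelihood given by a generative model conditioned on the state.

```latex
p(y \mid x) = \frac{p(x \mid y)\, p(y)}{\sum_{y'} p(x \mid y')\, p(y')}
```

Under this reading, a diagnosis would correspond to the state y with the largest posterior p(y | x).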

arXiv:1707.09099

Object Detection of Satellite Images Using Multi-Channel Higher-order Local Autocorrelation

Authors: Kazuki Uehara, Hidenori Sakanashi, Hirokazu Nosato, Masahiro Murakawa, Hiroki Miyamoto, Ryosuke Nakamura

Abstract: The Earth observation satellites have been monitoring the earth's surface for a long time, and the images taken by the satellites contain large amounts of valuable data. However, it is extremely hard work to manually analyze such huge data. Thus, a method of automatic object detection is needed for satellite images to facilitate efficient data analyses. This paper describes a new image feature extended from higher-order local autocorrelation to the object detection of multispectral satellite images. The feature has been extended to extract spectral inter-relationships in addition to spatial relationships to fully exploit multispectral information. The results of experiments with object detection tasks conducted to evaluate the effectiveness of the proposed feature extension indicate that the feature realized a higher performance compared to existing methods.

Submitted 27 July, 2017; originally announced July 2017.

Comments: 6 pages, 2 column, 7 figures, Accepted by IEEE International Conference on Systems, Man, and Cybernetics (SMC) 2017

arXiv:1702.03505

doi 10.1109/TCSVT.2018.2822773

A Novel Weight-Shared Multi-Stage CNN for Scale Robustness

Authors: Ryo Takahashi, Takashi Matsubara, Kuniaki Uehara

Abstract: Convolutional neural networks (CNNs) have demonstrated remarkable results in image classification for benchmark tasks and practical applications. The CNNs with deeper architectures have achieved even higher performance recently thanks to their robustness to the parallel shift of objects in images as well as their numerous parameters and the resulting high expression ability. However, CNNs have a limited robustness to other geometric transformations such as scaling and rotation. This limits the performance improvement of the deep CNNs, but there is no established solution. This study focuses on scale transformation and proposes a network architecture called the weight-shared multi-stage network (WSMS-Net), which consists of multiple stages of CNNs. The proposed WSMS-Net is easily combined with existing deep CNNs such as ResNet and DenseNet and enables them to acquire robustness to object scaling. Experimental results on the CIFAR-10, CIFAR-100, and ImageNet datasets demonstrate that existing deep CNNs combined with the proposed WSMS-Net achieve higher accuracies for image classification tasks with only a minor increase in the number of parameters and computation time.

Submitted 11 April, 2019; v1 submitted 12 February, 2017; originally announced February 2017.

Comments: accepted version, 13 pages

Journal ref: IEEE Transactions on Circuits and Systems for Video Technology, vol. 29, no. 4, 2019, pp. 1090-1101

Showing 1–17 of 17 results for author: Uehara, K