DM2RM: Dual-Mode Multimodal Ranking for Target Objects and
Receptacles Based on Open-Vocabulary Instructions

Ryosuke Korekata, Kanta Kaneda, Shunya Nagashima, Yuto Imai, Komei Sugiura

CONTACT Ryosuke Korekata. Email: [email protected]

Keio University, 3-14-1 Hiyoshi, Kohoku, Yokohama, Kanagawa 223-8522, Japan

This is a preprint of an article submitted for consideration in ADVANCED ROBOTICS, copyright Taylor & Francis and Robotics Society of Japan; ADVANCED ROBOTICS is available online at http://www.tandfonline.com/.
Abstract

In this study, we aim to develop a domestic service robot (DSR) that, guided by open-vocabulary instructions, can carry everyday objects to the specified pieces of furniture. Few existing methods handle mobile manipulation tasks with open-vocabulary instructions in the image retrieval setting, and most do not identify both the target objects and the receptacles. We propose the Dual-Mode Multimodal Ranking model (DM2RM), which enables images of both the target objects and receptacles to be retrieved using a single model based on multimodal foundation models. We introduce a switching mechanism that leverages a mode token and phrase identification via a large language model to switch the embedding space based on the prediction target. To evaluate the DM2RM, we construct a novel dataset including real-world images collected from hundreds of building-scale environments and crowd-sourced instructions with referring expressions. The evaluation results show that the proposed DM2RM outperforms previous approaches in terms of standard metrics in image retrieval settings. Furthermore, we demonstrate the application of the DM2RM on a standardized real-world DSR platform including fetch-and-carry actions, where it achieves a task success rate of 82% despite the zero-shot transfer setting. Demonstration videos, code, and more materials are available at https://kkrr10.github.io/dm2rm/.

keywords:
Domestic Service Robot; Mobile Manipulation; Deep Learning; Large Language Models; Multimodal Foundation Models

1 Introduction

In today’s aging society, the shortage of caregivers at home has become a serious problem. A promising solution to this problem is the use of domestic service robots (DSRs) to physically assist care recipients [1]. Although natural language interfaces are user-friendly, the ability of DSRs to comprehend the instructions given by humans regarding household tasks (e.g., fetch-and-carry) remains insufficient.

Figure 1: Overview of our method. First, the DSR collects images of the environment through pre-exploration. Given the open-vocabulary instruction, it is required to retrieve the red and green framed images as the target object image and receptacle image from the collected images, respectively. Subsequently, the DSR carries the target object to the receptacle, based on the user-selected images.

In this study, we aim to develop a DSR system that, guided by open-vocabulary instructions, can carry everyday objects to specified pieces of furniture by retrieving images of the target objects and receptacles from the collected images of an environment. Fig. 1 shows an overview of our method. First, the DSR collects images of the indoor environment through pre-exploration. Next, an instruction such as “Could you carry the wooden utensils on the shelf to the table with the banana on it?” is given to the DSR. In this case, the target object and the receptacle are ‘the wooden utensils on the shelf’ and ‘the table with the banana on it,’ respectively. The DSR is required to retrieve the target object image and receptacle image from the collected images. Subsequently, the DSR should carry the wooden utensils to the table on which the banana is placed, based on the target object image and the receptacle image selected by the users. In this framework, it is crucial to rank these specific images higher than irrelevant images, because presenting a limited number of images can reduce the cognitive load on the users.

It is challenging for DSRs to identify the target object or the receptacle in the environment because open-vocabulary instructions given by humans are often complex and/or ambiguous. Furthermore, if either the target object or the receptacle is misidentified, the entire task is considered unsuccessful. In a recent open-vocabulary mobile manipulation competition [2], the overall success rate of the winning team was just 10.8% [3].

Fetch-and-carry tasks based on user instructions, which are closely related to our task, have been widely studied [4, 5, 6]. However, few existing methods handle mobile manipulation tasks with open-vocabulary instructions in the image retrieval setting (e.g., [7]). Moreover, most such methods do not identify both the target objects and the receptacles. Simply applying these methods to our task would be inefficient and would yield insufficient performance, because separate models specialized for the target objects and receptacles would have to be trained.

In this study, we propose the Dual-Mode Multimodal Ranking model (DM2RM), a novel method that enables the retrieval of images of both the target objects and receptacles using a single model. Unlike existing methods, the DM2RM switches between target mode and receptacle mode using a single model. This is achieved by employing a switching mechanism that leverages multimodal foundation models [8, 9]. By utilizing a mode token and phrase identification using a large language model (LLM) to switch the embedding space according to the prediction target, the DM2RM enhances the similarity between the instruction and the correct image in each mode.

Please see our project page (https://kkrr10.github.io/dm2rm/) for code, dataset, and videos demonstrating the DM2RM on a standardized real-world DSR platform. The main contributions of this study are as follows:

  • We propose the DM2RM, a novel method that individually retrieves images of both target objects and receptacles using a single model.

  • We introduce the Switching Phrase Encoder (SPE) module, which employs a mode token and phrase identification through an LLM to switch the embedding space based on the prediction target.

  • To handle open-vocabulary and redundant instructions, we introduce the Task Paraphraser (TP) module, designed to paraphrase the input instructions into a standardized format suitable for fetch-and-carry tasks.

  • We introduce the Segment Anything Region Encoder (SARE) module, which enhances visual features regarding the shape and contour of objects by utilizing images overlaid with segmentation masks obtained by SAM [9].

2 Related Work

2.1 Language-Guided Embodied AI

There have been many studies in the field of embodied AI, which combines robotics, computer vision, and natural language processing [10, 11]. For example, several benchmark competitions have been conducted in which DSRs must execute fetch-and-carry tasks in standardized real-world environments, following user instructions [4, 5, 6]. Although these tasks are closely related to our task, we do not use template-based instructions, but instead allow free-form open-vocabulary instructions with referring expressions.

Vision-and-language navigation (VLN [12]) is a representative embodied AI task involving natural language instructions. For VLN tasks, most standard datasets [13, 14] use images of real-world reconstructions from the Matterport3D (MP3D) dataset [15, 12]. However, MP3D lacks environmental diversity with only a few tens of discrete environments. In contrast, the Habitat-Matterport 3D (HM3D) dataset [16, 17] provides hundreds of building-scale continuous environments. Representative methods for retrieving images of target objects from images obtained through pre-exploration have been successfully applied not only in VLN (e.g., [18]) but also in mobile manipulation tasks (e.g., [19]). Unlike these methods, the proposed DM2RM employs a more practical setting that allows users to select the correct images from the top-$K$ retrieved images.

Recently, many studies have considered the application of foundation models such as LLMs and vision-language models to robotic tasks [20, 21, 22]. Most existing methods utilize LLMs for commonsense reasoning [23, 24], hierarchical planning [25, 26, 27], or code generation [28, 29]. Unlike these existing methods, our approach leverages an LLM for the switching mechanism in the SPE module, which conditions the model by identifying relevant phrases from instructions (see Section 4.3).

2.2 Multimodal Language Understanding

There have been numerous studies in the field of multimodal language understanding [30, 31]. In this subsection, we focus on referring expression comprehension (REC), object manipulation instruction understanding, and image retrieval.

REC tasks involve grounding the target object in a single image based on a single referring expression (e.g., [32]). However, our focus is on identifying a set of target objects and receptacles from multiple images in an environment. Thus, most existing methods for REC tasks are not directly applicable to our task.

Most existing methods for understanding object manipulation instructions identify the target objects with bounding boxes [33, 34] or segmentation masks (e.g., [35]) specified by referring expressions. Image retrieval settings that provide multiple candidates are relatively practical, and such methods based on template-based [36, 37] or open-vocabulary instructions (e.g., [7]) have been proposed. In [7], a method that handles the learning-to-rank physical objects (LTRPO) task is introduced. LTRPO is similar to our task, but does not consider referring expressions regarding receptacles.

The standard datasets for image retrieval tasks take only text (e.g., [38]) or a pair of images and the associated modification text [39, 40] as input. For image retrieval tasks, large-scale vision-and-language pre-trained models (e.g., CLIP [8], [41]) have recently achieved performance improvements in the zero-shot transfer setting. However, most methods have not been designed for inputs containing complex referring expressions. To address this problem, we introduce the TP module to paraphrase instructions suitable for fetch-and-carry tasks and the SPE module to obtain fine-grained text features (see Sections 4.2 and 4.3).

3 Problem Statement

In this paper, we define the Image Retrieval-based Open-Vocabulary Fetch-and-Carry (IROV-FC) task as follows: given an open-vocabulary instruction for a fetch-and-carry task from a user, the DSR retrieves images of the target object and the receptacle and subsequently transports the target object to the designated location. This task comprises two sub-tasks: image retrieval and action execution. In the image retrieval phase, it is desirable for the images of the target object and the receptacle to be ranked highly in their respective output ranked lists. In the action execution phase, the DSR is expected to grasp the target object and carry it to the receptacle. Note that the target object and the receptacle are identified from each user-selected image.

Fig. 1 shows a typical scene of the IROV-FC task. First, the DSR collects images of the indoor environment through pre-exploration. Given the instruction “Could you carry the wooden utensils on the shelf to the table with the banana on it?,” the DSR is required to retrieve from the set of collected images the red and green framed images in Fig. 1 as the target object image and receptacle image, respectively. The DSR subsequently carries the wooden utensils to the table on which the banana is placed, based on the target object image and the receptacle image selected by the user.

The input and output of the IROV-FC task are defined as follows:

  • Input: an instruction and images taken in an indoor environment.

  • Output: two image lists ranked based on the target object and receptacle, respectively.

The terminology used in this paper is defined as follows:

  • Instruction: an open-vocabulary instruction for a fetch-and-carry task.

  • Target object: an everyday object identified as the target in the instruction.

  • Target object image: an image containing the target object.

  • Receptacle: a piece of furniture identified as the designated placement location in the instruction.

  • Receptacle image: an image containing the receptacle.

In this study, we assume that images of the indoor environment have already been collected through pre-exploration. This is a realistic setting because DSRs are typically used in the same indoor environment for long periods of time. It is also assumed that trajectory generation regarding navigation, object grasping, and object placement is based on heuristic methods (see Section 6.2).

4 Proposed Method

Figure 2: Architecture of the DM2RM. ‘MLP,’ ‘Sim,’ and ‘$\oplus$’ represent the multi-layer perceptron, cosine similarity, and concatenation, respectively.

Fig. 2 shows the structure of the proposed method, which mainly consists of three modules: Task Paraphraser (TP), Switching Phrase Encoder (SPE), and Segment Anything Region Encoder (SARE). The proposed method is closely related to fetch-and-carry tasks with natural language instructions. In these tasks, the DSR carries the target object to the receptacle following user instructions [4, 5, 6]. We focus on a setting in which the users select the target object or receptacle from the presented image lists, which are ranked according to open-vocabulary instructions.

In this study, we employ the SPE module to handle object manipulation instructions with a set of target objects and receptacles. Our approach is broadly applicable to multimodal language comprehension tasks involving, for example, a single target object, a single receptacle, and multiple sets of target objects and receptacles. The novelties of our method are as follows:

  • The proposed DM2RM is a novel approach that retrieves images of both target objects and receptacles individually using a single model.

  • We introduce the SPE module, which leverages a mode token and phrase identification via an LLM to switch the embedding space according to the prediction target.

  • To handle open-vocabulary and redundant instructions, we introduce the TP module, which paraphrases the input instructions into a standardized format suitable for fetch-and-carry tasks.

  • We introduce the SARE module, which utilizes images overlaid with segmentation masks obtained by SAM [9] to enhance visual features regarding the shape and contour of objects.

4.1 Input

The input $\bm{x}$ to our model is defined as follows:

$$\bm{x} = \left\{m, \bm{x}_{\mathrm{txt}}, X_{\mathrm{img}}\right\},$$
$$X_{\mathrm{img}} = \left\{\bm{x}_{\mathrm{img}}^{(i)}\right\}_{i=1}^{N_{\mathrm{img}}},$$

where $m \in \{\langle\mathrm{target}\rangle, \langle\mathrm{receptacle}\rangle\}$, $\bm{x}_{\mathrm{txt}} \in \{0,1\}^{V \times L}$, and $\bm{x}_{\mathrm{img}}^{(i)} \in \mathbb{R}^{3 \times W \times H}$ denote the mode token indicating the basis for the ranking, a tokenized instruction, and an image with width $W$ and height $H$, respectively. Here, $V$, $L$, and $N_{\mathrm{img}}$ denote the vocabulary size, maximum token length, and number of images to be ranked, respectively.
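As a reading aid, the input triple can be pictured as a small container. The following is only a sketch under stated assumptions: the names `RankingInput` and `both_modes` are hypothetical, the instruction is kept as a raw string rather than the one-hot tokenized form $\bm{x}_{\mathrm{txt}}$, and images are referenced by file path instead of pixel arrays.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical spellings of the two mode tokens <target> and <receptacle>.
MODES = ("<target>", "<receptacle>")


@dataclass
class RankingInput:
    mode: str               # m: the basis for the ranking
    instruction: str        # x_txt, kept here as an untokenized string
    image_paths: List[str]  # X_img: the N_img candidate images

    def __post_init__(self) -> None:
        if self.mode not in MODES:
            raise ValueError(f"unknown mode token: {self.mode!r}")
        if not self.image_paths:
            raise ValueError("at least one candidate image is required")


def both_modes(instruction: str, image_paths: List[str]) -> List[RankingInput]:
    """Build one input per mode; both share the same instruction and image set."""
    return [RankingInput(mode, instruction, image_paths) for mode in MODES]
```

Calling `both_modes(...)` yields the two inputs that differ only in $m$, which mirrors the fact that the same image set is ranked once per mode.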

4.2 Task Paraphraser

The TP module paraphrases $\bm{x}_{\mathrm{txt}}$ into the standardized format $\bm{x}'_{\mathrm{txt}}$ suitable for the IROV-FC task using an LLM (GPT-3.5). Open-vocabulary instructions sometimes include redundancy or grammatical errors, making it unclear which phrases should be focused on. This module enables such instructions to be handled in a unified manner.

We obtain $\bm{x}'_{\mathrm{txt}}$ by identifying the phrases related to the target object and receptacle using an LLM. For instance, when $\bm{x}_{\mathrm{txt}}$ is “Could you, if you does not mind, to pick up the cardboard box and move it over towards the couch next to the fireplace?,” the TP module outputs $\bm{x}'_{\mathrm{txt}}$ as “Carry the cardboard box to the couch next to the fireplace.” Note that $\bm{x}'_{\mathrm{txt}}$ is used as the auxiliary input to the SPE module, as explained in Section 4.3. Moreover, this module is expected to be effective for other household tasks (e.g., open/close) because of the flexible design of standard formats.
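The paraphrasing step can be pictured with the sketch below. It is not the prompt used in this study: the prompt wording, the template string, and the function names are all hypothetical, and the regular-expression split is only a stand-in for the LLM-based phrase identification. No LLM is called here; the code only builds the request text and parses a sentence that already follows the standardized format.

```python
import re
from typing import Optional, Tuple

# Assumed template, inferred from the single example given in the text.
STANDARD_FORMAT = "Carry <target object phrase> to <receptacle phrase>."


def build_paraphrase_prompt(instruction: str) -> str:
    """Compose a hypothetical prompt asking an LLM to standardize the instruction."""
    return (
        "Rewrite the following fetch-and-carry instruction in the format "
        f"'{STANDARD_FORMAT}' Keep all referring expressions and drop filler words.\n"
        f"Instruction: {instruction}\n"
        "Rewritten:"
    )


_PATTERN = re.compile(r"^Carry (?P<target>.+?) to (?P<receptacle>.+?)\.?$")


def split_standardized(paraphrase: str) -> Optional[Tuple[str, str]]:
    """Split a standardized sentence at its first ' to ' into (target, receptacle)."""
    match = _PATTERN.match(paraphrase.strip())
    if match is None:
        return None
    return match.group("target"), match.group("receptacle")
```

Because the split happens at the first " to ", a target phrase that itself contains "to" (for example "the cup next to the sink") would be cut in the wrong place. This illustrates why a rule-based split is a weak substitute for identifying the phrases with an LLM.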

4.3 Switching Phrase Encoder

The SPE module switches the embedding space of text features according to $m$. In the IROV-FC task, it is necessary to predict both the target object and receptacle from a single instruction. However, it is inefficient to train separate models specialized for various prediction tasks.

To solve this problem, we adopt a switching mechanism that enables training and inference using a single model. Our method has two modes, target mode and receptacle mode, determined by $m$. In the target mode, the target object image is expected to be ranked highly, whereas in the receptacle mode, the receptacle image should be ranked highly. Note that $X_{\mathrm{img}}$ is the same regardless of mode.

The input to the module consists of $m$, $\bm{x}_{\mathrm{txt}}$, and $\bm{x}'_{\mathrm{txt}}$. First, we concatenate $m$ at the head of $\bm{x}_{\mathrm{txt}}$ to condition the model. Similar to the TP module, an LLM is used to identify the phrases $\bm{x}_{\mathrm{targ}}$ and $\bm{x}_{\mathrm{rec}}$ regarding the target object and receptacle, respectively. To avoid focusing on irrelevant expressions, we select one of them as $\bm{x}_{\mathrm{p}}$, depending on the mode. Next, noun phrases $\{\bm{x}_{\mathrm{np}}^{(i)}\}_{i=1}^{N_{\mathrm{np}}}$ are extracted from $\bm{x}_{\mathrm{txt}}$ using a parser [42] to obtain fine-grained text features from instructions containing multiple referring expressions. Here, $N_{\mathrm{np}}$ denotes the maximum number of noun phrases. We obtain text features $\bm{l}_{\mathrm{txt}} \in \mathbb{R}^{d_{\mathrm{ct}}}$, $\bm{l}'_{\mathrm{txt}} \in \mathbb{R}^{d_{\mathrm{ct}}}$, $\bm{l}_{\mathrm{p}} \in \mathbb{R}^{d_{\mathrm{ct}}}$, and $\{\bm{l}_{\mathrm{np}}^{(i)} \in \mathbb{R}^{d_{\mathrm{ct}}}\}_{i=1}^{N_{\mathrm{np}}}$ from $\bm{x}_{\mathrm{txt}}$, $\bm{x}'_{\mathrm{txt}}$, $\bm{x}_{\mathrm{p}}$, and $\{\bm{x}_{\mathrm{np}}^{(i)}\}_{i=1}^{N_{\mathrm{np}}}$, respectively, using the pre-trained CLIP text encoder [8]. Here, $d_{\mathrm{ct}}$ denotes the output dimension. Finally, the output $\bm{h}_{\mathrm{txt}} \in \mathbb{R}^{d_{\mathrm{txt}}}$ is obtained as follows:

$$\bm{h}_{\mathrm{txt}} = \mathrm{MLP}\left(\left[\bm{l}_{\mathrm{p}}; \bm{l}_{\mathrm{txt}}; \bm{l}'_{\mathrm{txt}}; \mathrm{Transformer}\left(\left[\bm{l}_{\mathrm{p}}; \bm{l}_{\mathrm{np}}^{(1)}; \ldots; \bm{l}_{\mathrm{np}}^{(N_{\mathrm{np}})}\right]\right)\right]\right),$$

where $d_{\mathrm{txt}}$, $\mathrm{MLP}(\cdot)$, and $\mathrm{Transformer}(\cdot)$ denote the output dimension, a multi-layer perceptron, and a transformer encoder [43], respectively.
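The mode-dependent part of the SPE module happens before any encoder is applied, so it can be sketched without a neural network. The snippet below is a minimal illustration, assuming that the phrases and noun phrases have already been produced by the LLM and the parser. The function name, the dictionary keys, and the default value of `max_noun_phrases` are hypothetical, and the mode token is prepended at the string level rather than at the token level.

```python
from typing import Dict, List, Union


def prepare_spe_text_inputs(
    mode: str,
    instruction: str,
    paraphrase: str,
    target_phrase: str,
    receptacle_phrase: str,
    noun_phrases: List[str],
    max_noun_phrases: int = 8,  # N_np; the value 8 is an assumption
) -> Dict[str, Union[str, List[str]]]:
    """Assemble the four text inputs that would be fed to the text encoder."""
    if mode == "<target>":
        phrase = target_phrase        # x_p = x_targ in target mode
    elif mode == "<receptacle>":
        phrase = receptacle_phrase    # x_p = x_rec in receptacle mode
    else:
        raise ValueError(f"unknown mode token: {mode!r}")
    return {
        "x_txt": f"{mode} {instruction}",           # mode token at the head
        "x_txt_prime": paraphrase,                  # output of the TP module
        "x_p": phrase,                              # mode-selected phrase
        "x_np": noun_phrases[:max_noun_phrases],    # at most N_np noun phrases
    }
```

Only `x_txt` and `x_p` change between the two modes, while the paraphrase and the noun phrases are shared. Each entry would then be encoded and fused as in the equation above.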

4.4 Segment Anything Region Encoder

In the SARE module, the visual features of $X_{\mathrm{img}}$ and images overlaid with segmentation masks are obtained in parallel using foundation models. Most existing methods that simply extract features from the entire image sometimes misrecognize objects with similar colors or textures. Therefore, we introduce auxiliary images related to the segmentation masks to enhance the visual features related to the shape and contour of objects.

The input to this module is $X_{\mathrm{img}}$ and the outputs are the visual features $\bm{h}_{\mathrm{img}} \in \mathbb{R}^{d_{\mathrm{img}}}$ for each image $\bm{x}_{\mathrm{img}}^{(i)}$. Here, $d_{\mathrm{img}}$ denotes the output dimension. First, $\bm{x}_{\mathrm{sar}}^{(i)} \in \mathbb{R}^{3 \times W \times H}$ are obtained by overlaying the segmentation masks derived from SAM on $\bm{x}_{\mathrm{img}}^{(i)}$. We obtain visual features $\bm{v}_{\mathrm{img}}^{(i)} \in \mathbb{R}^{d_{\mathrm{ci}}}$ and $\bm{v}_{\mathrm{sar}}^{(i)} \in \mathbb{R}^{d_{\mathrm{ci}}}$ from $\bm{x}_{\mathrm{img}}^{(i)}$ and $\bm{x}_{\mathrm{sar}}^{(i)}$, respectively, using the pre-trained CLIP image encoder (ViT-L/14). Here, $d_{\mathrm{ci}}$ denotes the output dimension. These features are concatenated and input to the multi-layer perceptron to obtain $\bm{h}_{\mathrm{img}}$. Finally, the similarity score between $\bm{h}_{\mathrm{txt}}$ and $\bm{h}_{\mathrm{img}}$ is calculated as follows:

$$\mathrm{sim}\left(\bm{x}_{\mathrm{txt}}, \bm{x}_{\mathrm{img}}^{(i)}\right) = \frac{\bm{h}_{\mathrm{txt}} \cdot \bm{h}_{\mathrm{img}}}{\|\bm{h}_{\mathrm{txt}}\| \, \|\bm{h}_{\mathrm{img}}\|}.$$

The output is the ranked list of $X_{\mathrm{img}}$ arranged in descending order based on $\mathrm{sim}(\bm{x}_{\mathrm{txt}}, \bm{x}_{\mathrm{img}}^{(i)})$. Two image lists, $\hat{Y}_{\mathrm{targ}}$ and $\hat{Y}_{\mathrm{rec}}$, are obtained through a total of two inferences, with $m = \langle\mathrm{target}\rangle$ and $m = \langle\mathrm{receptacle}\rangle$ specified in the input, respectively.
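The ranking step itself reduces to sorting images by cosine similarity. The following sketch uses plain Python lists as stand-ins for the learned features; the function names and the toy vectors are illustrative only and do not come from the trained model.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / norm


def rank_images(h_txt: Sequence[float], h_imgs: Sequence[Sequence[float]]) -> List[int]:
    """Return image indices sorted by descending similarity to the text feature."""
    scores = [cosine_similarity(h_txt, h_img) for h_img in h_imgs]
    return sorted(range(len(h_imgs)), key=lambda i: scores[i], reverse=True)


# Toy features: the image features are shared, the text feature depends on the mode.
image_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ranking_target = rank_images([1.0, 0.1], image_features)      # stand-in for target mode
ranking_receptacle = rank_images([0.1, 1.0], image_features)  # stand-in for receptacle mode
```

In this toy case the two text features produce different orderings over the same three images, which is the behavior the two inferences rely on. Presenting only the first $K$ indices of each list corresponds to the top-$K$ selection offered to the user.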

We use the loss function for each batch $\mathcal{L}_{\mathcal{B}}$ as follows:

$$\mathcal{L}_{\mathcal{B}} = -\frac{1}{|\mathcal{B}|} \sum_{\bm{x}_{\mathrm{img}}^{(i)} \in \mathcal{B}} \log \frac{\exp\left(\mathrm{sim}\left(\bm{x}_{\mathrm{txt}}, \bm{x}_{\mathrm{img}}^{(i)}\right)\right)}{\sum_{\bm{x}_{\mathrm{img}}^{(j)} \in \mathcal{B}} \exp\left(\mathrm{sim}\left(\bm{x}_{\mathrm{txt}}, \bm{x}_{\mathrm{img}}^{(j)}\right)\right)},$$

where $|\mathcal{B}|$ denotes the batch size. $\mathcal{L}_{\mathcal{B}}$ is equivalent to the scenario where only $\bm{x}_{\mathrm{txt}}$ is considered in InfoNCE [44].
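The batch loss can be sketched in a few lines of plain Python. This is an illustration under stated assumptions rather than the authors' implementation: the similarity matrix `sim[i][j]` is taken as given, the instruction paired with image `i` is assumed to sit at row `i`, and no temperature scaling is applied.

```python
import math


def batch_loss(sim):
    """Text-to-image contrastive loss for one batch.

    sim[i][j] is the similarity between instruction i and image j;
    the matching image for instruction i is assumed to be at index i.
    """
    total = 0.0
    for i, row in enumerate(sim):
        # Numerically stable log-sum-exp over all images (the denominator).
        m = max(row)
        log_denominator = m + math.log(sum(math.exp(s - m) for s in row))
        # Log of the softmax probability assigned to the matching image.
        total += row[i] - log_denominator
    return -total / len(sim)


# Toy similarity matrices (hypothetical values).
uniform = [[0.0, 0.0], [0.0, 0.0]]      # model cannot tell images apart
separated = [[5.0, 0.0], [0.0, 5.0]]    # matching pairs score much higher
```

With uniform similarities the loss equals the logarithm of the batch size, and it approaches zero as the matching pairs are separated from the rest of the batch.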

5 Experiments

5.1 Dataset

We built the novel Learning-To-Rank in Real Indoor Environments for Fetch-and-Carry (LTRRIE-FC) dataset for the IROV-FC task. The LTRRIE-FC dataset is based on the HM3D [16, 17] and MP3D [15, 12] datasets. To the best of our knowledge, there is no standard dataset for the IROV-FC task. The standard datasets for VLN tasks (e.g., [13]) and LTRPO tasks (e.g., [7]) are not suitable for IROV-FC because they do not consider the task of transporting the target object to the receptacle. Furthermore, most existing datasets were constructed from the MP3D dataset, which lacks environmental diversity as it only includes a few tens of environments. In contrast, HM3D is a large-scale dataset containing hundreds of building-scale environments of 3D real-world reconstructions. However, there is no standard dataset that contains natural language instructions annotated by humans for the HM3D dataset. Therefore, we annotated instructions for images collected from both the HM3D and MP3D datasets.

Figure 3: Annotation interface. Annotators were required to give instructions for the DSR to carry the target object (a red bounding box) to the receptacle (a green bounding box). These instructions were input in the text box below the images.

To collect the images from the continuous environments in HM3D, we used [45] to simulate the exploration of the environments by a DSR. The map of each environment was provided by HM3D. The DSR captured images at randomly selected viewpoints defined on grid points. At each viewpoint, the camera pose was set towards the nearest object or piece of furniture. The DSR also captured images of the surroundings by rotating the camera pose 60° to the left and right at a fixed height. These steps were conducted on each floor in the environment. The procedure for collecting images from MP3D followed that described in [7].

In the LTRRIE-FC dataset, a sample consists of an instruction, a target object image, and a receptacle image. To extract target object images and receptacle images from the collected images, we used Detic [46], an open-vocabulary object detector, as follows: First, we defined 121 and 48 target object classes (e.g., ‘pillow,’ ‘book,’ ‘cup’) and receptacle classes (e.g., ‘shelf,’ ‘table,’ ‘bed’), respectively. These categories were selected from the classes listed in [17]. Next, Detic was applied to each image. The images in which the target object class was detected were used as target object images. We selected receptacle images using the same procedure. For data cleansing, we manually removed samples for which the detected bounding box was extremely small, the detected object did not fit within the image, or there were significant mesh reconstruction artifacts. Finally, a target object image and a receptacle image in the same environment were combined to create a sample.

The instructions in the LTRRIE-FC dataset were collected by 226 annotators using the SoSci Survey service (https://www.soscisurvey.de/). Fig. 3 shows the annotation interface. The annotators were presented with two images, one of a target object and one of a receptacle. They were then asked to give instructions for transporting the target object (in the red bounding box) to the receptacle (in the green bounding box) (e.g., “Pick up the white cushion on the sofa and place it on the brown armchair near the bed.”). However, the depicted bounding boxes sometimes enclosed areas that were inappropriate for object manipulation tasks because of misdetection (e.g., an ungraspable object such as a window was detected as the target object). In such cases, the annotators were allowed to select a more appropriate object from the left image and a more appropriate piece of furniture from the right image instead. Data from annotators who repeatedly input the same instruction or had short response times were excluded to improve the quality of the dataset.

The LTRRIE-FC dataset consists of 6,581 English instructions and 7,148 images collected from 774 real-world indoor environments. It has a vocabulary size of 2,491, a total of 103,263 words, and an average sentence length of 15.69 words. The LTRRIE-FC dataset includes 5,814, 354, and 413 samples in the training, validation, and test sets, respectively. These sets contain 690, 42, and 42 environments, respectively, without duplication of environments. Therefore, the objects in the test sets can be regarded as unseen. We built two subsets for the test set, HM3D-FC and MP3D-FC, depending on the type of environment from which the samples were obtained. We used the training set to train the model, the validation set to tune the hyperparameters, and the test sets to evaluate the model.
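The paper does not describe how the environment-level split was implemented. As a hedged sketch, a split in which no environment is shared between sets could be produced as follows; the sample schema (a dict with an `env` key) and the function name are hypothetical.

```python
import random
from collections import defaultdict


def split_by_environment(samples, n_val_envs, n_test_envs, seed=0):
    """Split samples so that no environment appears in more than one set.

    samples: list of dicts with an "env" key (hypothetical schema).
    """
    by_env = defaultdict(list)
    for sample in samples:
        by_env[sample["env"]].append(sample)

    # Shuffle environment ids reproducibly, then carve out test and val.
    envs = sorted(by_env)
    random.Random(seed).shuffle(envs)
    test_envs = set(envs[:n_test_envs])
    val_envs = set(envs[n_test_envs:n_test_envs + n_val_envs])

    splits = {"train": [], "val": [], "test": []}
    for env, env_samples in by_env.items():
        if env in test_envs:
            splits["test"].extend(env_samples)
        elif env in val_envs:
            splits["val"].extend(env_samples)
        else:
            splits["train"].extend(env_samples)
    return splits


# Toy data: 10 environments with 3 samples each.
toy = [{"env": f"env{e}", "id": (e, k)} for e in range(10) for k in range(3)]
splits = split_by_environment(toy, n_val_envs=2, n_test_envs=2)
```

Splitting by environment rather than by sample is what allows the objects in the test sets to be treated as unseen.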

5.2 Parameter Settings

Table 1: Experimental settings of DM2RM. Here, #$L_{\mathrm{t}}$, #$H$, and #$A$ denote the number of layers, hidden size, and attention heads of the transformer encoder in the SPE module, respectively.

| Setting | Value |
| --- | --- |
| Optimizer | Adam ($\beta_1 = 0.9$, $\beta_2 = 0.98$) |
| Learning rate | $1 \times 10^{-4}$ |
| Batch size | 128 |
| #Epoch | 20 |
| Dropout | 0.4 |
| Transformer | #$L_{\mathrm{t}}$: 5, #$H$: 768, #$A$: 4 |

Table 1 shows the experimental settings of the proposed method. Our model had approximately 71M trainable parameters and required approximately 309G multiply-add operations. We trained our model on a GeForce RTX 3090 with 24 GB of GPU memory and an Intel Core i9-10900KF with 64 GB of RAM. Training our model for 20 epochs took approximately 1 h. The inference time for computing the similarity between a single instruction and a single image was approximately 14.8 ms.

During every epoch, we measured the mean reciprocal rank (MRR) and recall@10 of the model on the validation set. The final performance on the test sets was based on the model that achieved the maximum sum of recall@10 and MRR on the validation set.
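The selection rule above can be sketched as follows. The validation values are hypothetical and are not results from the paper; resolving ties in favor of the earliest epoch is an assumption.

```python
def select_best_epoch(val_metrics):
    """Return the index of the epoch maximizing MRR + recall@10.

    val_metrics: list of (mrr, recall_at_10) tuples, one per epoch.
    Ties are resolved in favor of the earliest epoch (an assumption).
    """
    sums = [mrr + r10 for mrr, r10 in val_metrics]
    return sums.index(max(sums))


# Hypothetical validation curves (not results from the paper).
history = [(0.20, 0.50), (0.28, 0.61), (0.30, 0.58), (0.27, 0.60)]
best = select_best_epoch(history)
```

Here the second epoch is selected even though the third has the higher MRR, because the criterion is the sum of the two metrics.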

5.3 Quantitative Results

Table 2: Quantitative comparison between the DM2RM and baseline methods on the HM3D-FC test set. The best score for each metric is in bold. denotes reproduced results.
Columns Targ. and Rec. indicate the prediction targets; metrics are reported in % on HM3D-FC (unseen).

| [%] | Method | Targ. | Rec. | MRR↑ | R@5↑ | R@10↑ | R@20↑ |
| --- | --- | --- | --- | --- | --- | --- | --- |
| (i) | CLIP [8] | ✓ | ✓ | $10.8$ | $13.7$ | $24.9$ | $49.5$ |
| (ii) | NLMap [19] | ✓ | ✓ | $11.8$ | $14.1$ | $26.1$ | $48.4$ |
| (iii-a) | MultiRankIt [7] | ✓ | | $20.5 \pm 2.3$ | $30.1 \pm 3.4$ | $48.2 \pm 1.4$ | $73.2 \pm 2.8$ |
| (iii-b) | | | ✓ | $19.8 \pm 1.1$ | $27.1 \pm 3.2$ | $49.1 \pm 5.9$ | $74.6 \pm 3.1$ |
| (iv) | DM2RM (ours) | ✓ | ✓ | $\mathbf{32.0} \pm 0.5$ | $\mathbf{47.7} \pm 1.4$ | $\mathbf{67.9} \pm 0.8$ | $\mathbf{87.3} \pm 1.1$ |

Table 3: Quantitative comparison between the DM2RM and baseline methods on the MP3D-FC test set. The best score for each metric is in bold. denotes reproduced results.
Columns Targ. and Rec. indicate the prediction targets; metrics are reported in % on MP3D-FC (unseen).

| [%] | Method | Targ. | Rec. | MRR↑ | R@5↑ | R@10↑ | R@20↑ |
| --- | --- | --- | --- | --- | --- | --- | --- |
| (i) | CLIP [8] | ✓ | ✓ | $15.0$ | $14.6$ | $28.5$ | $59.9$ |
| (ii) | NLMap [19] | ✓ | ✓ | $11.5$ | $14.3$ | $25.7$ | $52.3$ |
| (iii-a) | MultiRankIt [7] | ✓ | | $26.7 \pm 2.4$ | $35.9 \pm 4.0$ | $52.8 \pm 5.3$ | $71.1 \pm 2.7$ |
| (iii-b) | | | ✓ | $16.4 \pm 1.6$ | $23.3 \pm 2.0$ | $39.7 \pm 5.3$ | $60.1 \pm 3.7$ |
| (iv) | DM2RM (ours) | ✓ | ✓ | $\mathbf{36.8} \pm 1.5$ | $\mathbf{46.5} \pm 2.8$ | $\mathbf{63.5} \pm 2.8$ | $\mathbf{76.3} \pm 1.5$ |

Tables 2 and 3 show the quantitative results. The performance of the proposed method is compared with that of several baseline methods on the HM3D-FC and MP3D-FC test sets. The tables report the mean and standard deviation over five trials. The ‘Prediction’ column indicates whether the model handled target objects and/or receptacles.

We used CLIP [8], NLMap [19], and MultiRankIt [7] as the baseline methods. We selected CLIP because it has been successfully applied to image retrieval tasks without fine-tuning. NLMap was selected due to its similarity to the proposed method, as it also employs a CLIP-based approach for object retrieval from images collected during pre-exploration. The scores shown for CLIP and NLMap were obtained from a single trial because the use of the pre-trained frozen model provides consistent results across multiple trials. MultiRankIt was selected as a baseline method because of its effective application in the LTRPO task, which is related to the IROV-FC task. We trained two separate MultiRankIt models for target objects and receptacles, because MultiRankIt cannot output ranked lists for both target objects and receptacles with a single model.

We used MRR and recall@$K$ as evaluation metrics, with MRR as the primary metric. This is because they are standard metrics in image retrieval settings [47]. MRR is defined as follows:

$$\mathrm{MRR} = \frac{1}{N_{\mathrm{txt}}} \sum_{i=1}^{N_{\mathrm{txt}}} \frac{1}{r_1^{(i)}},$$

where $N_{\mathrm{txt}}$ and $r_1^{(i)}$ denote the number of instructions and the highest rank among the relevant images, respectively. Recall@$K$ is defined as follows:

$$\mathrm{Recall@}K = \frac{1}{N_{\mathrm{txt}}} \sum_{i=1}^{N_{\mathrm{txt}}} \frac{|A_i \cap B_i|}{|A_i|},$$

where $A_i$ and $B_i$ denote the set of relevant images to be retrieved and the top-$K$ retrieved images, respectively.
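Both metrics can be computed directly from the ranked lists. The sketch below follows the two definitions, with hypothetical function names and toy data; it assumes that every instruction has at least one relevant image somewhere in its ranked list.

```python
def mean_reciprocal_rank(rankings, relevant):
    """rankings[i]: ranked image ids for instruction i; relevant[i]: set of GT ids."""
    total = 0.0
    for ranked, gt in zip(rankings, relevant):
        # r_1: 1-based rank of the highest-ranked relevant image.
        # Assumes at least one relevant image appears in the ranking.
        r1 = next(pos for pos, img in enumerate(ranked, start=1) if img in gt)
        total += 1.0 / r1
    return total / len(rankings)


def recall_at_k(rankings, relevant, k):
    """Average fraction of relevant images found in the top-k results."""
    total = 0.0
    for ranked, gt in zip(rankings, relevant):
        top_k = set(ranked[:k])             # B_i
        total += len(gt & top_k) / len(gt)  # |A_i ∩ B_i| / |A_i|
    return total / len(rankings)


# Toy example with two instructions (hypothetical data).
rankings = [["a", "b", "c", "d"], ["c", "a", "d", "b"]]
relevant = [{"b"}, {"c", "b"}]
```

For this toy data the first instruction contributes a reciprocal rank of 1/2 and the second contributes 1, so the MRR is 0.75.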

Table 2 indicates that, on the HM3D-FC test set, the proposed method (iv) achieved an MRR of 32.0%, whereas baseline methods (i), (ii), (iii-a), and (iii-b) achieved MRRs of 10.8%, 11.8%, 20.5%, and 19.8%, respectively. Furthermore, Table 3 indicates that, on the MP3D-FC test set, the proposed method (iv) and baseline methods (i), (ii), (iii-a), and (iii-b) achieved MRRs of 36.8%, 15.0%, 11.5%, 26.7%, and 16.4%, respectively. Therefore, the proposed method outperformed the best baseline method by 11.5 points and 10.1 points in terms of MRR on the HM3D-FC and MP3D-FC test sets, respectively. Similarly, the proposed method outperformed the baseline methods in terms of recall@$K$ on both test sets. The differences in performance between our method and the baseline methods were statistically significant for all evaluation metrics ($p$-value $< 0.01$).

5.4 Qualitative Results

Figure 4: Qualitative comparison between our method and a baseline method [7]. For each sample, $\bm{x}_{\mathrm{txt}}$ and/or $\bm{x}'_{\mathrm{txt}}$, the top-3 retrieved images, and the GT image are shown. The results regarding the target object and receptacle are shown on the left (*-a) and right (*-b), respectively. The target object images and receptacle images are highlighted in the red and green frames, respectively. The words underlined in red, green, and black indicate $\bm{x}_{\mathrm{targ}}$, $\bm{x}_{\mathrm{rec}}$, and grammatical errors, respectively.
Figure 5: A failure sample on the HM3D-FC test set. Rows (a) and (b) show the qualitative results in the target and receptacle modes, respectively. From left to right: GT images and top-3 retrieved images. The words highlighted in red and green indicate $\bm{x}_{\mathrm{targ}}$ and $\bm{x}_{\mathrm{rec}}$, respectively.

Fig. 4 shows a qualitative comparison between the proposed method and one of the baseline methods [7]. The ground truth (GT) image and the top-3 retrieved images are shown for each mode. Fig. 4 (i-a) and (i-b) show a sample from the HM3D-FC test set, where $\bm{x}_{\mathrm{txt}}$ was “Take the white lamp on the desk near the bed, then move it to the white desk near the black chair.” In the proposed method, $\bm{x}_{\mathrm{targ}}$ and $\bm{x}_{\mathrm{rec}}$ were ‘white lamp on the desk near the bed’ and ‘white desk near the black chair,’ respectively. For this sample, the MRR of our method was 100%, whereas that of the baseline method was 30%. The baseline method incorrectly retrieved the same irrelevant image as the top-1 result in both Fig. 4 (i-a) and (i-b). In contrast, the proposed method successfully retrieved the correct image as the top-1 result in each mode. This indicates that the switching mechanism in the SPE module works effectively.

Similarly, Fig. 4 (ii-a) and (ii-b) show a sample from the MP3D-FC test set, where $\bm{x}_{\mathrm{txt}}$ was “Go bathroom to pick up the white hand soap bottle and put it on the black table with books and flowers.” In the proposed method, $\bm{x}_{\mathrm{targ}}$ and $\bm{x}_{\mathrm{rec}}$ were ‘white hand soap bottle’ and ‘black table with books and flowers,’ respectively. For this sample, the proposed method and the baseline method achieved MRRs of 100% and 20%, respectively. In Fig. 4 (ii-a), the baseline method mistakenly retrieved images of tables, influenced by the words describing the receptacle, whereas our method correctly retrieved images of hand soap at ranks 1 and 2. In Fig. 4 (ii-b), the proposed method appropriately handled the referring expressions regarding the color of the receptacle and its surrounding objects, whereas the baseline method did not. These results indicate that the introduction of LLM-based phrase identification in the SPE module enhanced the similarity between the instruction and the correct image in each mode. In addition, we believe that obtaining $\bm{x}'_{\mathrm{txt}}$ using the TP module was beneficial to the proper handling of $\bm{x}_{\mathrm{txt}}$ with grammatical errors (e.g., “Go bathroom …” should be “Go to the bathroom …”).

Fig. 5 shows a failure sample of the proposed method on the HM3D-FC test set. For this sample, $\bm{x}_{\mathrm{txt}}$ was “Could you please move the ceiling white light into the white shelf?” In each mode, the top-3 retrieved images did not match the GT image, resulting in an MRR of 5%. However, these images are not technically incorrect because they contain a white ceiling light and a white shelf in the target mode and the receptacle mode, respectively, as required by the instruction. Therefore, we hypothesize that the failure was caused by an ambiguous instruction containing insufficient referring expressions.

5.5 Ablation Studies

Table 4: Ablation studies of the DM2RM on the HM3D-FC test set (unseen). The best score for each metric is in bold.

| | Model | MRR↑ [%] | R@5↑ [%] | R@10↑ [%] | R@20↑ [%] |
| --- | --- | --- | --- | --- | --- |
| (a) | DM2RM (full) | $\mathbf{32.0} \pm 0.5$ | $\mathbf{47.7} \pm 1.4$ | $\mathbf{67.9} \pm 0.8$ | $\mathbf{87.3} \pm 1.1$ |
| (b) | w/o SPE | $22.5 \pm 1.4$ | $33.2 \pm 1.8$ | $53.0 \pm 2.5$ | $78.9 \pm 2.7$ |
| (c) | w/o TP | $28.4 \pm 1.4$ | $44.7 \pm 2.0$ | $66.3 \pm 0.7$ | $85.2 \pm 1.1$ |
| (d) | w/o SARE | $29.7 \pm 0.6$ | $45.0 \pm 2.7$ | $64.9 \pm 1.4$ | $86.6 \pm 1.4$ |

Table 5: Ablation studies of the DM2RM on the MP3D-FC test set (unseen). The best score for each metric is in bold.

| | Model | MRR↑ [%] | R@5↑ [%] | R@10↑ [%] | R@20↑ [%] |
| --- | --- | --- | --- | --- | --- |
| (a) | DM2RM (full) | $\mathbf{36.8} \pm 1.5$ | $\mathbf{46.5} \pm 2.8$ | $\mathbf{63.5} \pm 2.8$ | $\mathbf{76.3} \pm 1.5$ |
| (b) | w/o SPE | $21.3 \pm 0.9$ | $25.6 \pm 1.8$ | $42.0 \pm 1.7$ | $63.2 \pm 1.0$ |
| (c) | w/o TP | $31.4 \pm 2.2$ | $40.5 \pm 2.6$ | $56.8 \pm 3.8$ | $75.0 \pm 0.7$ |
| (d) | w/o SARE | $33.2 \pm 1.2$ | $43.4 \pm 1.6$ | $60.0 \pm 2.5$ | $75.0 \pm 2.0$ |

Tables 4 and 5 show the results of ablation studies. As ablation studies, we set the following three conditions:
SPE ablation: To investigate the impact of the SPE module on performance improvement, we removed $m$ and handled both target objects and receptacles without using the switching mechanism. Tables 4 and 5 show that the MRR decreased by 9.5 points and 15.5 points for model (b) compared with model (a) on the HM3D-FC and MP3D-FC test sets, respectively. This indicates that the introduction of SPE is beneficial for the IROV-FC task.
TP ablation: We removed the TP module to investigate its influence on the performance. Tables 4 and 5 show that there was a decrease of 3.6 points and 5.4 points in MRR for model (c) compared with model (a) on the HM3D-FC and MP3D-FC test sets, respectively. This suggests that paraphrasing redundant instructions into a standardized format suitable for the task is an effective approach.
SARE ablation: We removed $\bm{v}_{\mathrm{sar}}^{(i)}$ to investigate the impact of the SARE module on the performance. Tables 4 and 5 indicate that the MRR decreased by 2.3 points and 3.6 points for model (d) compared with model (a) on the HM3D-FC and MP3D-FC test sets, respectively. Thus, enhancing visual features associated with the shape and contour of objects is beneficial to the performance of the proposed method.

5.6 Error Analysis

Table 6: Categorization of failure cases. We selected a total of 20 samples (10 from each test set) and manually conducted a detailed error analysis on the top-5 images for each mode.

| Error Type | Target Mode | Receptacle Mode |
| --- | --- | --- |
| Ambiguous Instruction | 8 | 8 |
| Referring Expression Comprehension Error | 7 | 2 |
| Phrase Selection Error | 1 | 7 |
| Object Grounding Error | 4 | 3 |
| Total | 20 | 20 |

We define a failure case as a sample for which the MRR fell below 10.0%. There were 98 failure cases (21 and 77 cases from the HM3D-FC and MP3D-FC test sets, respectively). We selected a total of 20 samples (10 from each test set) and manually conducted a detailed error analysis on the top-5 images for each mode.

Table 6 categorizes the failure cases in each mode. The causes of failure could be divided into four types: ambiguous instruction (AI), referring expression comprehension error (RE), phrase selection error (PS), and object grounding error (OG). The AI category refers to cases in which the retrieved images cannot be considered entirely incorrect because of some ambiguity in $\bm{x}_{\mathrm{txt}}$, such as the failure sample shown in Fig. 5. The RE category refers to cases in which the category of the retrieved objects or pieces of furniture was correct, but the images did not match the referring expressions contained in $\bm{x}_{\mathrm{txt}}$ (e.g., images containing a cushion on the red sofa were incorrectly retrieved when the target object was ‘the cushion on the black bed’). The PS category refers to cases in which the model retrieved images of landmark objects or pieces of furniture specified in $\bm{x}_{\mathrm{txt}}$ instead of the target object or receptacle (e.g., when the target object was ‘the desk next to the sofa,’ the model mistakenly retrieved images containing only the sofa). The OG category refers to cases in which the object grounding performance was insufficient, and so the model retrieved irrelevant images.

Table 6 indicates that AI was the main bottleneck in both the target and receptacle modes. A possible solution to these errors is to introduce multi-target settings, where a single expression can refer to an arbitrary number of objects (e.g., [48]).

5.7 Discussion

Figure 6: t-SNE [49] visualization of $\bm{h}_{\mathrm{txt}}$ for all instructions on the test sets. Each point represents the embedding of an instruction in the target mode (red) or the receptacle mode (blue), with the color brightness reflecting the sentence length.

To investigate the influence of the switching mechanism in the SPE module, we visualized the embedding space of text features when the same instructions were input with different $m$. Fig. 6 shows the t-SNE [49] visualization of $\bm{h}_{\mathrm{txt}}$ for all instructions on the test sets in each mode, with the color brightness reflecting the sentence length. The results demonstrate that the clusters of the two modes are distinctly separated, indicating that the embedding space is effectively switched according to the prediction target. Importantly, the nonlinear distribution of the clusters implies that the switching is not purely the result of simple translations or rotations in the embedding space.

6 Physical Experiments

We validated the proposed method in a real-world environment using a DSR. We did not fine-tune the model by using the physical environment. The DSR executed fetch-and-carry tasks based on the instructions given by the users.

6.1 Settings

Figure 7: Experimental settings. (i) Domestic environment standardized in WRS2020RS [5]. (ii) Everyday objects used in the physical experiments. These objects are (a) ‘Food items,’ (b) ‘Tool items,’ (c) ‘Shape items,’ and (d) ‘Kitchen items’ from the YCB object set [50], and (e) general unseen objects.

Fig. 7 (i) shows the experimental environment. The environment replicated the standardized environment of the World Robot Summit 2020 Partner Robot Challenge/Real Space (WRS2020RS [5]), which was an international contest focusing on benchmark tidy-up tasks in home environments. The size of this environment was $6.0 \times 4.0\ \mathrm{m}^2$, and it featured nine pieces of furniture arranged as shown in Fig. 7 (i). Users randomly selected one piece of furniture as the receptacle when providing instructions to the DSR. We used the Human Support Robot [1] developed by the Toyota Motor Corporation. This mobile manipulator has been used as the standard platform of the RoboCup@Home competition [4] since 2017.

Fig. 7 (ii) shows the 50 everyday objects used in the physical experiments. These objects are part of the YCB object set [50], which includes standard objects for manipulation research. Furthermore, we included general unseen objects to increase the diversity in appearance and size. We conducted experiments with 10 unique object placement patterns. In each object placement pattern, 20–30 unseen objects selected from Fig. 7 (ii) were placed in random positions on randomly selected pieces of furniture. Several small objects (e.g., toothbrush) were placed in the markerless NICT cases [51].

6.2 Implementation

During the pre-exploration phase, the DSR collected images of the environment at 17 predefined viewpoints using an Asus Xtion Pro camera. Path planning and navigation were based on standard methods using a map created in advance.

Next, the users gave English open-vocabulary instructions to the DSR. The users were required to provide instructions with referring expressions to carry an arbitrary object (see Fig. 7 (ii)) to an arbitrary receptacle (see Fig. 7 (i)), such as “Please pick up the sponge near the cleanser and put it in the blue box.” For each object placement pattern, 10 instructions were given, resulting in a total of 100 trials.

The behavior of the DSR after receiving the instructions was designed as follows: The DSR first retrieved images of the target object and the receptacle from the latest stored images and presented the respective top-10 images to the users using the WebUI. We adopted the zero-shot transfer setting using the model trained on the LTRRIE-FC dataset to test the robustness of the proposed method to unseen objects. Next, the users selected the target object image and the receptacle image from the presented images. If the target object image was not included in the top-10 images in the target mode, the trial was regarded as a failure and the fetching action was not conducted. Subsequently, the DSR moved to the location at which the target object image was captured and grasped the target object. The grasp point was determined based on the point cloud obtained from the depth image and the segmentation mask of the target object. The segmentation mask was obtained using SAM [9] by inputting the point prompt given by the users regarding the target object. Finally, the DSR attempted to carry the target object to the receptacle only if both of the following conditions were met: the receptacle image was within the top-10 images in the receptacle mode, and the fetching action was successful. We did not employ a learning-based approach for trajectory generation regarding object grasping and placing because this is beyond the scope of this study.
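The gating conditions in this procedure can be summarized as a small decision function. This is only an illustrative sketch: the function and argument names are hypothetical, and the boolean inputs stand in for the retrieval outcome and the physical outcome of each action.

```python
def run_trial(target_in_top10, receptacle_in_top10, fetch_ok, carry_ok):
    """Return (fetch_attempted, fetch_success, carry_attempted, carry_success).

    fetch_ok / carry_ok describe whether the physical action would succeed
    if it were attempted (hypothetical inputs for illustration).
    """
    # Fetching is attempted only if the target image is among the top-10.
    fetch_attempted = target_in_top10
    fetch_success = fetch_attempted and fetch_ok
    # Carrying requires a successful fetch and the receptacle in the top-10.
    carry_attempted = fetch_success and receptacle_in_top10
    carry_success = carry_attempted and carry_ok
    return fetch_attempted, fetch_success, carry_attempted, carry_success
```

Under this logic the number of carrying attempts can never exceed the number of fetching successes, and a trial counts as an overall success only if the carrying action succeeds.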

6.3 Quantitative Results

Table 7: Quantitative results of the physical experiments. The numbers in parentheses indicate $N_{\mathrm{s}}$ / $N_{\mathrm{a}}$.

| MRR↑ [%] | R@10↑ [%] | SR↑ [%] (Fetching) | SR↑ [%] (Carrying) | SR↑ [%] (Overall) |
| --- | --- | --- | --- | --- |
| 39 | 96 | 92 (89 / 97) | 95 (82 / 86) | 82 (82 / 100) |

We used MRR, recall@10, and the task success rate (SR) as evaluation metrics in the physical experiments. SR is defined as $\mathrm{SR} = \frac{N_{\mathrm{s}}}{N_{\mathrm{a}}}$, where $N_{\mathrm{s}}$ and $N_{\mathrm{a}}$ denote the number of successes and attempts, respectively. When calculating the MRR, $\frac{1}{r_1^{(i)}}$ was considered to be 0 if $r_1^{(i)}$ was greater than 10. This is because only the top-10 images were presented to the users, following the standard UI configuration.
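As a short worked check, the SR definition and the truncated MRR can be written as follows. The success and attempt counts are those reported in Table 7; the function names and the example ranks passed to `truncated_mrr` are hypothetical.

```python
def success_rate(n_success, n_attempt):
    """SR = N_s / N_a."""
    return n_success / n_attempt


def truncated_mrr(first_relevant_ranks, cutoff=10):
    """MRR in which ranks beyond the cutoff contribute 0."""
    rr = [1.0 / r if r <= cutoff else 0.0 for r in first_relevant_ranks]
    return sum(rr) / len(rr)


# (N_s, N_a) pairs reported in Table 7.
table7 = {"fetching": (89, 97), "carrying": (82, 86), "overall": (82, 100)}
sr_percent = {name: round(100 * success_rate(s, a)) for name, (s, a) in table7.items()}
```

The rounded values reproduce the reported fetching, carrying, and overall SRs of 92%, 95%, and 82%.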

Table 7 shows the quantitative results of the physical experiments. The results show that the MRR, recall@10, fetching SR, carrying SR, and overall SR were 39%, 96%, 92%, 95%, and 82%, respectively. Despite the zero-shot transfer setting using our model trained on the LTRRIE-FC dataset, these results indicate that the performance of the proposed method remained robust when handling unseen objects in real-world environments. These results also indicate that our model can be successfully integrated into the DSR to perform the entire scenario, including fetching and carrying actions.

6.4 Qualitative Results

Figure 8: Qualitative results of the physical experiments. The target object images and receptacle images are framed in red and green, respectively. The words underlined in red and green indicate $\bm{x}_{\mathrm{targ}}$ and $\bm{x}_{\mathrm{rec}}$, respectively.

Fig. 8 shows a successful sample from the physical experiments. For this sample, $\bm{x}_{\mathrm{txt}}$ was “Can you take the mustard container on the shelf to the black box?” The proposed method retrieved the correct image as the top-1 result in each mode, resulting in an MRR of 100%. Subsequently, the DSR grasped the mustard container and placed it in the black box. More information, along with demonstration videos and other qualitative results, is available on our project page (https://kkrr10.github.io/dm2rm/).

7 Conclusions

In this study, we focused on the IROV-FC task, in which a DSR retrieves images of the target object and the receptacle from stored images based on an open-vocabulary instruction, and subsequently transports the target object to the receptacle. Our contributions are as follows:

  • We proposed the DM2RM, a novel approach that retrieves images of both target objects and receptacles individually using a single model.

  • We introduced the SPE module, which leverages a mode token and phrase identification via an LLM to switch the embedding space according to the prediction target.

  • To handle open-vocabulary and redundant instructions, we introduced the TP module to paraphrase the input instructions into a standardized format suitable for fetch-and-carry tasks.

  • We also introduced the SARE module, which utilizes images overlaid with segmentation masks obtained by SAM [9] to enhance visual features regarding the shape and contour of objects.

  • The DM2RM outperformed the baseline methods in terms of the standard metrics on the LTRRIE-FC dataset, a novel dataset based on HM3D [16, 17] and MP3D [15, 12].

  • In physical experiments, our method achieved a task success rate of more than 80% in the standardized environment, despite the zero-shot transfer setting. These results indicate that the DM2RM can be successfully integrated into the DSR to perform the entire scenario, including fetch-and-carry actions.

In future work, we plan to introduce multi-target settings (e.g., [48]) to handle cases in which a single expression refers to an arbitrary number of objects.

ACKNOWLEDGMENT

This work was partially supported by JSPS KAKENHI Grant Number 23H03478, JST Moonshot, and NEDO.

References

  • [1] Yamamoto T, Terada K, Ochiai A, et al. Development of Human Support Robot as the Research Platform of a Domestic Mobile Manipulator. ROBOMECH Journal. 2019;6(1):1–15.
  • [2] Yenamandra S, Ramachandran A, Khanna M, et al. The HomeRobot Open Vocab Mobile Manipulation Challenge. In: NeurIPS; 2023.
  • [3] Melnik A, Büttner M, Harz L, et al. UniTeam: Open Vocabulary Mobile Manipulation Challenge. arXiv preprint arXiv:2312.08611. 2023.
  • [4] Iocchi L, Holz D, Ruiz-del Solar J, et al. RoboCup@Home: Analysis and Results of Evolving Competitions for Domestic and Service Robots. AIJ. 2015;229:258–281.
  • [5] Okada H, Inamura T, Wada K. What Competitions were Conducted in the Service Categories of the World Robot Summit? AR. 2019;33(17):900–910.
  • [6] Yenamandra S, Ramachandran A, Yadav K, et al. HomeRobot: Open-Vocabulary Mobile Manipulation. In: CoRL; 2023. p. 1975–2011.
  • [7] Kaneda K, Nagashima S, Korekata R, et al. Learning-To-Rank Approach for Identifying Everyday Objects Using a Physical-World Search Engine. IEEE RA-L. 2024;9(3):2088–2095.
  • [8] Radford A, Kim J, Hallacy C, et al. Learning Transferable Visual Models From Natural Language Supervision. In: ICML; 2021. p. 8748–8763.
  • [9] Kirillov A, Mintun E, Ravi N, et al. Segment Anything. In: ICCV; 2023. p. 4015–4026.
  • [10] Liu Y, Chen W, Bai Y, et al. Aligning Cyber Space with Physical World: A Comprehensive Survey on Embodied AI. arXiv preprint arXiv:2407.06886. 2024.
  • [11] Duan J, Yu S, Tan H, et al. A Survey of Embodied AI: From Simulators to Research Tasks. IEEE TETCI. 2022;6(2):230–244.
  • [12] Anderson P, Wu Q, Teney D, et al. Vision-and-Language Navigation: Interpreting Visually-Grounded Navigation Instructions in Real Environments. In: CVPR; 2018. p. 3674–3683.
  • [13] Qi Y, Wu Q, Anderson P, et al. REVERIE: Remote Embodied Visual Referring Expression in Real Indoor Environments. In: CVPR; 2020. p. 9982–9991.
  • [14] Zhu F, Liang X, Zhu Y, et al. SOON: Scenario Oriented Object Navigation with Graph-based Exploration. In: CVPR; 2021. p. 12689–12699.
  • [15] Chang A, Dai A, Funkhouser T, et al. Matterport3D: Learning from RGB-D Data in Indoor Environments. In: 3DV; 2017. p. 667–676.
  • [16] Ramakrishnan S, Gokaslan A, Wijmans E, et al. Habitat-Matterport 3D Dataset (HM3D): 1000 Large-scale 3D Environments for Embodied AI. In: NeurIPS; 2021.
  • [17] Yadav K, Ramrakhya R, Ramakrishnan S, et al. Habitat-Matterport 3D Semantics Dataset. In: CVPR; 2023. p. 4927–4936.
  • [18] Sigurdsson G, Thomason J, Sukhatme G, et al. RREx-BoT: Remote Referring Expressions with a Bag of Tricks. In: IROS; 2023. p. 5203–5210.
  • [19] Chen B, Xia F, Ichter B, et al. Open-Vocabulary Queryable Scene Representations for Real World Planning. In: ICRA; 2023. p. 11509–11522.
  • [20] Hu Y, Xie Q, Jain V, et al. Toward General-Purpose Robots via Foundation Models: A Survey and Meta-Analysis. arXiv preprint arXiv:2312.08782. 2023.
  • [21] Firoozi R, Tucker J, Tian S, et al. Foundation Models in Robotics: Applications, Challenges, and the Future. arXiv preprint arXiv:2312.07843. 2023.
  • [22] Kawaharazuka K, Matsushima T, Gambardella A, et al. Real-World Robot Applications of Foundation Models: A Review. arXiv preprint arXiv:2402.05741. 2024.
  • [23] Wu J, Antonova R, Kan A, et al. TidyBot: Personalized Robot Assistance with Large Language Models. AuRo. 2023;47(8):1087–1102.
  • [24] Driess D, Xia F, Sajjadi M, et al. PaLM-E: An Embodied Multimodal Language Model. In: ICML; 2023. p. 8469–8488.
  • [25] Ichter B, Brohan A, Chebotar Y, et al. Do As I Can, Not As I Say: Grounding Language in Robotic Affordances. In: CoRL; 2023. p. 287–318.
  • [26] Song C, Wu J, Washington C, et al. LLM-Planner: Few-Shot Grounded Planning for Embodied Agents with Large Language Models. In: ICCV; 2023. p. 2998–3009.
  • [27] Hazra R, Martires P, Raedts L. SayCanPay: Heuristic Planning with Large Language Models using Learnable Domain Knowledge. In: AAAI; 2024. p. 20123–20133.
  • [28] Singh I, Blukis V, Mousavian A, et al. ProgPrompt: Generating Situated Robot Task Plans using Large Language Models. In: ICRA; 2023. p. 11523–11530.
  • [29] Liang J, Huang W, Xia F, et al. Code as Policies: Language Model Programs for Embodied Control. In: ICRA; 2023. p. 9493–9500.
  • [30] Uppal S, Bhagat S, Hazarika D, et al. Multimodal Research in Vision and Language: A Review of Current and Emerging Trends. Information Fusion. 2022;77:149–171.
  • [31] Chen F, Zhang D, Han M, et al. VLP: A Survey on Vision-language Pre-training. MIR. 2023;20(1):38–56.
  • [32] Yu L, Poirson P, Yang S, et al. Modeling Context in Referring Expressions. In: ECCV; 2016. p. 69–85.
  • [33] Hatori J, Kikuchi Y, Kobayashi S, et al. Interactively Picking Real-World Objects with Unconstrained Spoken Language Instructions. In: ICRA; 2018. p. 3774–3781.
  • [34] Korekata R, Kambara M, Yoshida Y, et al. Switching Head–Tail Funnel UNITER for Dual Referring Expression Comprehension with Fetch-and-Carry Tasks. In: IROS; 2023. p. 3865–3872.
  • [35] Iioka Y, Yoshida Y, Wada Y, et al. Multimodal Diffusion Segmentation Model for Object Segmentation from Manipulation Instructions. In: IROS; 2023. p. 7590–7597.
  • [36] Guadarrama S, Rodner E, Saenko K, et al. Open-vocabulary Object Retrieval. In: RSS; 2014. p. 1–9.
  • [37] Nguyen T, Gopalan N, Patel R, et al. Robot Object Retrieval with Contextual Natural Language Queries. In: RSS; 2020.
  • [38] Young P, Lai A, Hodosh M, et al. From Image Descriptions to Visual Denotations: New Similarity Metrics for Semantic Inference over Event Descriptions. TACL. 2014;2:67–78.
  • [39] Liu Z, Rodriguez-Opazo C, Teney D, et al. Image Retrieval on Real-life Images with Pre-trained Vision-and-Language Models. In: ICCV; 2021. p. 2125–2134.
  • [40] Han X, Wu Z, Huang P, et al. Automatic Spatially-Aware Fashion Concept Discovery. In: ICCV; 2017. p. 1463–1471.
  • [41] Chen Y, Li L, Yu L, et al. UNITER: UNiversal Image-TExt Representation Learning. In: ECCV; 2020. p. 104–120.
  • [42] Schuster S, Manning C. Enhanced English Universal Dependencies: An Improved Representation for Natural Language Understanding Tasks. In: LREC; 2016. p. 2371–2378.
  • [43] Vaswani A, Shazeer N, Parmar N, et al. Attention Is All You Need. In: NIPS; 2017.
  • [44] Oord A, Li Y, Vinyals O. Representation Learning with Contrastive Predictive Coding. arXiv preprint arXiv:1807.03748. 2018.
  • [45] Savva M, Kadian A, Maksymets O, et al. Habitat: A Platform for Embodied AI Research. In: ICCV; 2019. p. 9339–9347.
  • [46] Zhou X, Girdhar R, Joulin A, et al. Detecting Twenty-Thousand Classes using Image-Level Supervision. In: ECCV; 2022. p. 350–368.
  • [47] Liu T. Learning to Rank for Information Retrieval. FNTIR. 2009;3(3):225–331.
  • [48] Liu C, Ding H, Jiang X. GRES: Generalized Referring Expression Segmentation. In: CVPR; 2023. p. 23592–23601.
  • [49] Van der Maaten L, Hinton G. Visualizing Data Using t-SNE. JMLR. 2008;9(11):2579–2605.
  • [50] Calli B, Walsman A, Singh A, et al. Benchmarking in Manipulation Research: Using the Yale-CMU-Berkeley Object and Model Set. IEEE RAM. 2015;22(3):36–52.
  • [51] Magassouba A, Sugiura K, Kawai H. A Multimodal Classifier Generative Adversarial Network for Carry and Place Tasks from Ambiguous Language Instructions. IEEE RA-L. 2018;3(4):3113–3120.