
Int J Soc Robot (2009) 1: 71–81

DOI 10.1007/s12369-008-0001-3

ORIGINAL PAPER

Measurement Instruments for the Anthropomorphism, Animacy, Likeability, Perceived Intelligence, and Perceived Safety of Robots

Christoph Bartneck · Dana Kulić · Elizabeth Croft · Susana Zoghbi

Accepted: 28 October 2008 / Published online: 20 November 2008


© The Author(s) 2008. This article is published with open access at Springerlink.com

Abstract This study emphasizes the need for standardized measurement tools for human robot interaction (HRI). If we are to make progress in this field then we must be able to compare the results from different studies. A literature review has been performed on the measurements of five key concepts in HRI: anthropomorphism, animacy, likeability, perceived intelligence, and perceived safety. The results have been distilled into five consistent questionnaires using semantic differential scales. We report reliability and validity indicators based on several empirical studies that used these questionnaires. It is our hope that these questionnaires can be used by robot developers to monitor their progress. Psychologists are invited to further develop the questionnaires by adding new concepts, and to conduct further validations where it appears necessary.

Keywords Human factors · Robot · Perception · Measurement

C. Bartneck (✉)
Department of Industrial Design, Eindhoven University of Technology, Den Dolech 2, 5600 Eindhoven, The Netherlands
e-mail: [email protected]

D. Kulić
Nakamura & Yamane Lab, Department of Mechano-Informatics, University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo 113-8656, Japan
e-mail: [email protected]

E. Croft · S. Zoghbi
Department of Mechanical Engineering, University of British Columbia, 6250 Applied Science Lane, Room 2054, Vancouver, V6T 1Z4, Canada
E. Croft e-mail: [email protected]
S. Zoghbi e-mail: [email protected]

1 Introduction

The success of service robots and, in particular, of entertainment robots cannot be assessed only by performance criteria typically found for industrial robots. The number of processed pieces and their accordance with quality standards are not necessarily the prime objectives for an entertainment robot such as Aibo [1], or a communication platform such as iCat [2]. The performance criteria of service robots lie within the satisfaction of their users. Therefore, it is necessary to measure the users' perception of service robots, since these can not be measured within the robots themselves.

Measuring human perception and cognition has its own pitfalls, and psychologists have developed extensive methodologies and statistical tests to objectify the acquired data. Most engineers who develop robots are often unaware of this large body of knowledge, and sometimes run naïve experiments in order to verify their designs. But the same naivety can also be expected of psychologists when confronted with the task of building a robot. Human-Robot Interaction (HRI) is a multidisciplinary field, but it can not be expected that everyone masters all skills equally well. We do not intend to investigate the structure of the HRI community and the problems it is facing in the cooperation of its members. The interested reader may consult Bartneck and Rauterberg [3] who reflected on the structure of the Human-Computer Interaction community. This may also apply to the HRI community. This study is intended for the technical developers of interactive robots who want to evaluate their creations without having to take a degree in experimental psychology.
However, it is advisable to at least consult with a psychologist over the overall methodology of the experiment.

A typical pitfall in the measurement of psychological concepts is to break them down into smaller, presumably better-known, components. This is common practice, and we do not intend to single out a particular author, but we still feel the need to present an example. Kiesler and Goetz [4] divided the concept of anthropomorphism into the sub components sociability, intellect, and personality. They measured each concept with the help of a questionnaire. This breaking down into sub components makes sense if the relationship and relative importance of the sub components are known and can therefore be calculated back into the original concept. Otherwise, a presumably vague concept is simply replaced by a series of just as vague concepts. There is no reason to believe that it would be easier for the users of robots to evaluate their sociability rather than their anthropomorphism. Caution is therefore necessary so as not to over-decompose concepts. Still, it is good practice to at least decompose the concept under investigation into several items¹ so as to have richer and more reliable data, as was suggested by Fink, Vol. 8, p. 20 [5].

¹ In the social sciences the term "item" refers to a single question or response.

A much more reliable and possibly objective method for measuring the users' perception and cognition is to observe their behavior [6]. If, for example, the intention of a certain robot is to play a game with the user, then the fun experienced can be deduced from the time the user spends playing it. The longer the user plays, the more fun it is. However, not all internal states of a user manifest themselves in observable behavior. From a practical point of view it can also be very laborious to score the users' behaviors on the basis of video recordings.

Physiological measurements form a second group of measurement tools. Skin conductivity, heart rate, and heart variance are three popular measurements that provide a good indication of the user's arousal in real time. The measurement can be taken during the interaction with the robot. Unfortunately, these measurements can not distinguish the arousal that stems from anger from that which may originate from joy. To gain better insight into the user's state, these measurements can be complemented by other physiological measurements, such as the recognition of facial expression. In combination, they can provide real time data, but the effort of setting up and maintaining the equipment and software should not be underestimated.

A third measurement technique is questionnaires, which are often used to measure the users' attitudes. While this method is rather quick to conduct, its conceptual pitfalls are often underestimated. One of its prime limitations is, of course, that the questionnaire can be administered only after the actual experience. Subjects have to reflect on their experience afterwards, which might bias their response. They could, for example, adapt their response to the socially acceptable response.

The development of a validated questionnaire involves a considerable amount of work, and extensive guidelines are available to help with the process [5, 7]. Development will typically begin with a large number of items, which are intended to cover the different facets of the theoretical construct to be measured; next, empirical data is collected from a sample of the population to which the measurement is to be applied. After appropriate analysis of this data, a subset of the original list of items is then selected and becomes the actual multi-indicator measurement. This measurement will then be formally assessed with regard to its reliability, dimensionality, and validity.

Due to their naivety and the amount of work necessary to create a validated questionnaire, developers of robots have a tendency to quickly cook up their own questionnaires. This conduct results in two main problems. Firstly, the validity and reliability of these questionnaires has often not been evaluated. An engineer is unlikely to trust a voltmeter developed by a psychologist unless its proper function has been shown. In the same manner, psychologists will have little trust in the results from a questionnaire developed by an engineer unless information about its validity and reliability is available. Despite the fact that we may trust experts in the field, at some point each instrument needs to be tested. Secondly, the absence of standard questionnaires makes it difficult to compare the results from different researchers. If we are to make progress in the field of human-robot interaction then we shall have to develop standardized measurement tools similar to the ITC-SOPI questionnaire that was developed to measure presence [8]. The need for standardized measurements has been acknowledged and a workshop on this topic has been conducted at the HRI2008 conference in Amsterdam.

This study attempts to make a start in the development of standardized measurement tools for human-robot interaction by first presenting a literature review on existing questionnaires, and then presenting empirical studies that give an indication of the validity and reliability of these new questionnaires. This study will take the often-used concepts of anthropomorphism, animacy, likeability, perceived intelligence, and perceived safety as starting points to propose a consistent set of five questionnaires for these concepts. We can not offer an exhaustive framework for the perception of robots similar to the frameworks that have already been developed for social robots [9–11] that would justify the selection of these five concepts. We can only recognize that the concepts proposed have been necessary for our own research and that they are likely to have relationships with each other. A highly anthropomorphic and intelligent robot
is likely to be perceived to be more animate and possibly also more likeable. The verification of such a model does require appropriate measurement instruments. The discussion of whether it is good practice to first develop a theory and then the observation method or vice versa has not reached a conclusion [12], but every journey begins with a first step. The proposed set of questionnaires can later be extended to cover other relevant concepts, and their relationships can be further explored. The emphasis is on presenting questionnaires that can be used directly in the development of interactive robots. Many robots are being built right now, and the engineers cannot wait for a mature model to emerge. We even seriously consider the position that such a framework can be created only once we have the robots and measurement tools in place.

Unfortunately, the literature review revealed questionnaires that used different types of items, namely Likert scales [13] and semantic differential scales [14]. If more than one questionnaire is to be used for the evaluation of a certain robot, it is beneficial if the questionnaires use the same type of items. This consistency makes it easy for the participants to learn the method and thereby avoids errors in their responses. It was therefore decided to transfer Likert type scales to semantic differential scales. We shall now discuss briefly the differences between these two types of items.

In semantic differential scales the respondent is asked to indicate his or her position on a scale between two bipolar words, the anchors (see Fig. 1, top). In Likert scales (see Fig. 1, bottom), subjects are asked to respond to a stem, often in the form of a statement, such as "I like ice cream". The scale is frequently anchored with choices of "agree"–"disagree" or "like"–"dislike".

                    Strong 1 2 3 4 5 Weak

  I like ice cream  Disagree 1 2 3 4 5 Agree

Fig. 1 Example of a semantic differential scale (top) and a Likert scale (bottom). The participant would be asked to rate the stimulus on this scale by circling one of the numbers

Both are rating scales, and provided that response distributions are not forced, semantic differential data can be treated just as any other rating data [7]. The statistical analysis is identical. However, a semantic differential format may effectively reduce acquiescence bias without lowering psychometric quality [15]. A common objection to Osgood's semantic differential method is that it appears to assume that the adjectives chosen as anchors mean the same to everyone. Thus, the method becomes self-contradictory; it starts from the presumption that different people interpret the same word differently, but has to rely on the assumption that this is not true for the anchors. However, this study proposes to use the semantic differential scales to evaluate not the meaning of words, but the attitude towards robots. Powers and Kiesler [16] report a negative correlation (−.23) between "Human-likeness" and "Machine-likeness", which strengthens our view that semantic differentials are a useful tool for measuring the users' perception of robots, while we remain aware of the fact that every method has its limitations.

Some information on the validity and reliability of the questionnaires is already available from the original studies on which they are based. However, the transformation from Likert scales to semantic differential scales may compromise these indicators to a certain degree. We shall compensate for this possible loss by reporting on complementary empirical studies later in the text. First, we would like to discuss the different types of validity and reliability.

Fink, Vol. 8, pp. 5–44 [5], discusses several forms of reliability and validity. Among the scientific forms of validity we find content validity, criterion validity, and construct validity. The latter, which determines the degree to which the instrument works in comparison with others, can only be assessed after years of experience with a questionnaire, and construct validity is often not calculated as a quantifiable statistic. Given the short history of research in HRI it would appear difficult to achieve construct validity. The same holds true for criterion validity. There is a scarcity of validated questionnaires with which our proposed questionnaires can be compared. We can make an argument for content validity since experts in the field carried out the original studies, and measurements of the validity and reliability have even been published from time to time. The researchers involved in the transformation of the proposed questionnaires were also in close contact with relevant experts in the field with regard to the questionnaires. The proposed questionnaires can therefore be considered to have content validity.

It is easier to evaluate the reliability of the questionnaire, and Fink describes three forms: test-retest reliability, alternate form reliability, and internal consistency reliability. The latter is a measurement of how well the different items measure the same concept, and it is of particular importance to the questionnaires proposed because they are designed to be homogeneous in content. Internal consistency involves the calculation of a statistic known as Cronbach's Alpha. It measures the internal consistency reliability among a group of items that are combined to form a single scale. It reflects the homogeneity of the scale. Given the choice of homogeneous semantic differential scales, alternate form reliability appears difficult to achieve. The items cannot simply be negated and asked again because semantic differential scales already include dichotomous pairs of adjectives. Test-retest reliability can even be tested within the same experiment by splitting the participants randomly into two groups. This procedure requires a sufficiently large number of participants and unfortunately none of the studies that we have
access to have had enough participants to allow for a meaningful test-retest analysis. For both test-retest reliability and internal consistency reliability, Nunnally [17] recommends a minimum value of 0.7. We would now like to discuss the five concepts of anthropomorphism, animacy, likeability, perceived intelligence, and perceived safety in more detail, and describe a questionnaire for each of them.

2 Anthropomorphism

Anthropomorphism refers to the attribution of a human form, human characteristics, or human behavior to nonhuman things such as robots, computers, and animals. Hiroshi Ishiguro, for example, develops androids that, for a short period, are indistinguishable from human beings [18]. His highly anthropomorphic androids struggle with the so-called 'uncanny valley', a theory that states that as a robot is made more humanlike in its appearance and movements, the emotional response from a human being to the robot becomes increasingly positive and empathic, until a point is reached beyond which the response quickly becomes that of intense repulsion. However, as the appearance and movements continue to become less distinguishable from those of a human being, the emotional response becomes positive once more and approaches human-human empathy levels.

Even if it is not the intention of the design of a certain robot to be as humanlike as possible, it still remains important to match the appearance of the robot with its abilities. A too anthropomorphic appearance can evoke expectations that the robot might not be able to fulfill. If, for example, the robot has a human-shaped face then the naïve user will expect that the robot is able to listen and to talk. To prevent disappointment it is necessary for all developers to pay close attention to the anthropomorphism level of their robots.

An interesting behavioral measurement for anthropomorphism has been presented by Minato et al. [19]. They attempted to analyze differences in where the participants were looking when they looked at either a human or an android. The hypothesis is that people look differently at humans compared to robots. They have not been able to produce reliable conclusions yet, but their approach could turn out to be very useful, assuming that they can overcome the technical difficulties.

MacDorman [20] presents an example of a naïve questionnaire. A single question is asked to assess the human-likeness of what is being viewed (9-point semantic differential, mechanical versus humanlike). It is good practice in the social sciences to ask multiple questions about the same concept in order to be able to check the participants' consistency and the questionnaire's reliability. Powers and Kiesler [16], in comparison, used six items and are able to report a Cronbach's Alpha of 0.85. Their questionnaire therefore appears to be more suitable. It was necessary to transform the items used by Powers and Kiesler into semantic differentials: Fake/Natural, Machinelike/Humanlike, Unconscious/Conscious, Artificial/Lifelike, and Moving rigidly/Moving elegantly.

Two studies are available in which this new anthropomorphism questionnaire was used. The first one reports a Cronbach's Alpha of 0.878 [21], and we would like to report the Cronbach's Alphas for the second study [22] in this paper. The study consisted of three within-subjects conditions for which the Cronbach's Alphas must be reported separately. We can report a Cronbach's Alpha of 0.929 for the human condition, 0.923 for the android condition, and 0.856 for the masked android condition. The alpha values are well above 0.7, so we can conclude that the anthropomorphism questionnaire has sufficient internal consistency reliability.

3 Animacy

The goal of many robotics researchers is to make their robots lifelike. Computer games, such as The Sims, Creatures, or Nintendogs, show that lifelike creatures can deeply involve users emotionally. This involvement can then be used to influence users [23]. Since Heider and Simmel [24], a considerable amount of research has been devoted to the perceived animacy and "intentions" of geometric shapes on computer screens. Scholl and Tremoulet [25] offer a good summary of the research field, but, on examining the list of references, it becomes apparent that only two of the 79 references deal directly with animacy. Most of the reviewed work focuses on causality and intention. This may indicate that the measurement of animacy is difficult.

The classic perception of life, which is often referred to as animacy, is based on the Piagetian framework centred on "moving of one's own accord". Observing children in the world of "traditional" (that is, non-computational) objects, Piaget found that at first they considered everything that moved to be alive, but later, only things that moved without an external push or pull. Gradually, children refined the notion to mean "life motions," namely only those things that breathed and grew were taken to be alive. This framework has been widely used, and even the study of artificial life has been considered as an opportunity to extend his original framework [26]. Piaget's framework emphasizes the importance of movement and intentional behaviour for the perception of animacy.

This framework is supported by the observation that abstract geometrical shapes that move on a computer screen are already being perceived as being alive [25], especially if they change their trajectory nonlinearly or if they seem to interact with their environments, for example, by avoiding obstacles or seeking goals [27]. Being alive is one of the
major criteria that distinguish human beings from machines, but since robots exhibit movement and intentional behaviour, it is not obvious how human beings perceive them. The category of "sort of alive" becomes increasingly used [28]. This gradient of "alive" is reflected by the recently proposed psychological benchmarks of autonomy, imitation, intrinsic moral value, moral accountability, privacy, and reciprocity that in the future may help to deal with the question of what constitutes the essential features of being human in comparison with being a robot [29].

First discussions on a robot's moral accountability have already started [30], and an analogy between animal rights and android science has been discussed [31]. Other benchmarks for life, such as the ability to reproduce, have been challenged by the first attempts at robotic self-reproduction [32].

Returning to the discussion of how to measure animacy, we observe that Tremoulet and Feldman [33] only asked their participants to evaluate the animacy of 'particles' under a microscope on a single scale (7-point Likert scale, 1 = definitely not alive, 7 = definitely alive). It is questionable how much sense it makes to ask participants about the animacy of particles. By definition they cannot be alive since particles tend to be even smaller than the simplest organisms.

Asking about the perceived animacy of a certain stimulus makes sense only if there is a possibility for it to be alive. Robots can show physical behavior, reactions to stimuli, and even language skills. These are typically attributed only to animals, and hence it can be argued that it makes sense to ask participants about their perception of the animacy of robots.

McAleer et al. [34] claim to have analyzed the perceived animacy of modern dancers and their abstractions on a computer screen, but only qualitative data of the perceived arousal is presented. Animacy was measured with free responses. They looked for terms and statements that indicated that subjects had attributed human movements and characteristics to the shapes. These were terms such as "touched", "chased", and "followed", and emotions such as "happy" or "angry". Other guides to animacy were when the shapes were generally being described in active roles, as opposed to being controlled in a passive role. However, they do not present any quantitative data for their analysis.

A better approach has been presented by Lee, Park, and Song [35]. With their four items (10-point Likert scale; lifelike, machine-like, interactive, and responsive) they have been able to achieve a Cronbach's Alpha of 0.76. For the questionnaires in this study, their items have been transformed into semantic differentials: Dead/Alive, Stagnant/Lively, Mechanical/Organic, Artificial/Lifelike, Inert/Interactive, Apathetic/Responsive. One study used this new questionnaire [36] and reported a Cronbach's Alpha of 0.702, which is sufficiently high for us to conclude that the new animacy questionnaire has sufficient internal consistency reliability.

4 Likeability

It has been reported that the way in which people form positive impressions of others is to some degree dependent on the visual and vocal behavior of the targets [37], and that positive first impressions (e.g., likeability) of a person often lead to more positive evaluations of that person [38]. Interviewers report knowing within 1 to 2 minutes whether a potential job applicant is a winner, and people report knowing within the first 30 seconds the likelihood that a blind date will be a success [39]. There is a growing body of research indicating that people often make important judgments within seconds of meeting a person, sometimes remaining quite unaware of both the obvious and subtle cues that may be influencing their judgments. Since computers, and thereby robots in particular, are to some degree treated as social actors [40], it can be assumed that people are able to judge robots in a similar way.

Jennifer Monahan [41] complemented her "liking" question with 5-point semantic differential scales: nice/awful, friendly/unfriendly, kind/unkind, and pleasant/unpleasant, because these judgments tend to demonstrate considerable variance in common with "liking" judgments [42]. Monahan later eliminated the kind/unkind and pleasant/unpleasant items in her own analysis since they did not load sufficiently in a factor analysis that also included items from three other factors. The Cronbach's Alpha of 0.68 therefore relates only to this reduced scale. Her experimental focus is different from the intended use of her questionnaire in the field of HRI. She also included concepts of physical attraction, conversational skills, and other orientations, which might become an element of the questionnaire series at a later stage. In particular, physical attraction might require additional conceptual and social consideration, since it may also entail sexuality. No reports on successful human-robot reproduction are available yet, and hopefully never will be. We decided to include all five items, since it is always possible to exclude items in cases where they would not contribute to the reliability and validity of the questionnaire.

Two studies used this new likeability questionnaire. The first reports a Cronbach's Alpha of 0.865 [21], and we report the Cronbach's Alpha for the second [22] in this paper. The study consisted of three within-subjects conditions for which the Cronbach's Alpha must be reported separately. Without going into too much detail of the study, we can report a Cronbach's Alpha of 0.923 for the human condition, 0.878 for the android condition, and 0.842 for the masked android condition. The alpha values are well above 0.7, and hence we can conclude that the likeability questionnaire has sufficient internal consistency reliability.
5 Perceived Intelligence

Interactive robots face a tremendous challenge in acting intelligently. The reasons can be traced back to the field of artificial intelligence (AI). The robots' behaviors are based on methods and knowledge that were developed by AI. Many of the past promises of AI have not been fulfilled, and AI has been criticized extensively [43–46].

One of the main problems that AI is struggling with is the difficulty of formalizing human behavior, for example, in expert systems. Computers require this formalization to generate intelligent and human-like behavior. And as long as the field of AI has not made considerable progress on these issues, robot intelligence will remain at a very limited level. So far, we have been using many Wizard-of-Oz methods to fake intelligent robotic behavior, but this is possible only in the confines of the research environment. Once the robots are deployed in the complex world of everyday users, their limitations will become apparent. Moreover, when the users are interacting with the robot for years rather than minutes, they will become aware of the limited abilities of most robots.

Evasion strategies have also been utilized. The robot would show more or less random behavior while interacting with the user, and the user in turn sees patterns in this behavior which he/she interprets as intelligence. Such a strategy will not lead to a solution of the problem, and its success is limited to short interactions. Given sufficient time the user will give up his/her hypothesized patterns of the robot's intelligent behavior and become bored with its limited random vocabulary of behaviors. In the end, the perceived intelligence of a robot will depend on its competence [47]. To monitor the progress being made in robotic intelligence it is important to have a good measurement tool.

Warner and Sugarman [48] developed an intellectual evaluation scale that consists of five seven-point semantic differential items: Incompetent/Competent, Ignorant/Knowledgeable, Irresponsible/Responsible, Unintelligent/Intelligent, Foolish/Sensible. Parise et al. [49] excluded one question from this scale, and reported a Cronbach's Alpha of 0.92. The questionnaire was again used by Kiesler, Sproull and Waters [50], but no alpha was reported. Three other studies used the perceived intelligence questionnaire, and reported Cronbach's Alpha values of 0.75 [22], 0.769 [51], and 0.763 [36]. These values are above the suggested 0.7 threshold, and hence the perceived intelligence questionnaire can be considered to have satisfactory internal consistency reliability.

6 Perceived Safety

A key issue for robots interacting with humans is safety. The issue has received considerable attention in the robotics literature, both in systems and standards established for industrial robots and for service robots intended for use in the home. The proposed approaches can be classified into three broad categories: (i) reduce the hazard through mechanical redesign, (ii) control the hazard through electronic or physical safeguards, and (iii) warn the operator/user, either during operation or through training [52]. Examples of mechanical redesign include using a whole-body robot visco-elastic covering [53, 54], the use of spherical and compliant joints [54–56], and distributed parallel actuation mechanisms to lower the effective inertia of the robot near the end effector [57, 58]. Control approaches have included impact force control and passive control [59–61], as well as control strategies based on either discrete [54, 62] or continuous safeguarding zones [63, 64]. Recent work has also focused on measurement and analysis of forces and injury during human robot collisions [65]. However, the focus of these works is on safety based on the robot's perception; they do not consider the human's perception of safety during the interaction. Perceived safety describes the user's perception of the level of danger when interacting with a robot, and the user's level of comfort during the interaction. Achieving a positive perception of safety is a key requirement if robots are to be accepted as partners and co-workers in human environments.

Perceived safety and user comfort have rarely been measured directly. Instead, indirect measures have been used: the measurement of the affective state of the user through the use of physiological sensors [66–68], questionnaires [66, 69, 70], and direct input devices [71]. That is, instead of asking subjects to evaluate the robot, researchers frequently use affective state estimation or questionnaires asking how the subject feels in order to measure the perceived safety and comfort level indirectly.

For example, Sarkar proposes the use of multiple physiological signals to estimate affective state, and to use this estimate to modify robotic actions to make the user more comfortable [72]. Rani et al. [67, 68] use heart-rate analysis and multiple physiological signals to estimate human stress levels. In Rani et al. [67], an autonomous mobile robot monitors the stress level of the user, and if the level exceeds a certain value, the robot returns to the user in a simulated rescue attempt. However, in their study, the robot does not interact directly with the human; instead, pre-recorded physiological information is used to allow the robot to assess the human's condition.

Koay et al. [72] describe an early study where human reaction to robot motions was measured online. In this study, 28 subjects interacted with a robot in a simulated living room environment. The robot motion was controlled by the experimenters in a "Wizard of Oz" setup. The subjects were asked to indicate their level of comfort with the robot by means of a handheld device. The device consisted of a single slider control to indicate comfort level, and a radio signal
data link. Data from only 7 subjects was considered reliable, and was included in subsequent analysis. Analysis of the device data with the video of the experiment found that subjects indicated discomfort when the robot was blocking their path, the robot was moving behind them, or the robot was on a collision course with them.

Nonaka et al. [73] describe a set of experiments where human response to pick-and-place motions of a virtual humanoid robot is evaluated. In their experiment, a virtual reality display is used to depict the robot. Human response is measured through heart rate measurements and subjective responses. A 6-level scale is used from 1 = "never" to 6 = "very much", for the categories of "surprise", "fear", "disgust", and "unpleasantness". No relationship was found between the heart rate and robot motion, but a correlation was reported between the robot velocity and the subject's rating of "fear" and "surprise". In a subsequent study [69], a physical mobile manipulator was used to validate the results obtained with the virtual robot. In this case, subjects are asked to rate their responses on the following (5-point) direction levels: "secure–anxious", "restless–calm", "comfortable–unpleasant", "unapproachable–accessible", "favorable–unfavorable", "tense–relaxed", "unfriendly–friendly", "interesting–tedious", and "unreliable–reliable". They are also asked to rate their level of "intimidated" and "surprised" on a 5-point Likert scale. The study finds that similar results are obtained regardless of whether a physical or a virtual robot is used. Unfortunately, no information about the reliability or validity of their scales is available.

There is a very large number of different questions that can be asked on the topic of safety and comfort in response to physical robot motion. This underlines the need for a careful and studied set of baseline questions for eliciting comparable results from research efforts, especially in concert with physiological measurement tools. It becomes apparent that two approaches can be taken to assess perceived safety. On the one hand, the users can be asked to evaluate their impression of the robot; on the other hand, they can be asked to assess their own affective state. It is assumed that if the robot is perceived to be dangerous, then the user's affective state would be tense.

Kulic and Croft [66, 74] combined a questionnaire with physiological sensors to estimate the user's level of anxiety and surprise during sample interactions with an industrial robot. They ask the user to rate their level of anxiety, surprise, and calmness during each sample robot motion. A 5-point Likert scale is used. The Cronbach's Alpha for the affective state portion of the questionnaire is 0.91. In addition, the subject is asked to rate their level of attention during the robot motion, to ensure that the elicited affective state was caused by the robot rather than by some other internal or external distraction. In their work, the effect of robot movement on the human response, both in terms of safety and trajectory employed, is examined. They show that motion planning can be used to reduce the perceived anxiety and surprise felt by subjects during high speed movements. This and later work [75, 76] by the same authors showed a strong statistical correlation between the affective state reported by the subjects and their physiological responses.

The scales they produced were then transformed to the following semantic differential scales: Anxious/Relaxed, Agitated/Calm, Quiescent/Surprised. This revised questionnaire was utilized with a new set of 16 subjects (10 males and 6 females) using the same robot and physiological sensor system and the same experimental protocol as in the previous study [74]. In the experiment, the user is shown a robot manipulator performing various motions and asked to rate their responses to the robot behavior. The robot performs two different tasks, a pick and place task and a reach and retract task. These tasks were chosen to represent typical motions a robot could be asked to perform during human-robot interaction. Two planning strategies were used to plan the path of the robot for each task, a safe planning strategy [77] and the nominal potential field approach [78]. Each motion was presented at three different speeds, with the fastest being the maximum velocity of the robot, for a total of 12 trajectories. The trajectories were presented to each subject in random order.

Table 1 shows the correlation analysis between the new measuring scales and speed. In correspondence with previous results, strong correlation coefficients were obtained between speed and reported levels of Anxiety, Agitation and Surprise. All correlation coefficients were significant at the 0.01 level for 2-tailed t-tests.

Table 1 Correlation analysis

                      Speed    Anxious/Relaxed   Agitated/Calm   Quiescent/Surprised
Speed                  1
Anxious/Relaxed       −0.530    1
Agitated/Calm         −0.553    0.842             1
Quiescent/Surprised    0.695   −0.711            −0.732           1

Table 2 presents a 3-factor ANOVA table for the reported levels of Anxiety/Relaxation. There was a significant effect of all factors—speed, task, and type of planning strategy—at the 0.05 level, while all interactions were not significant.

Utilizing this semantic differential scaled questionnaire yielded the same statistical outcomes as the previous
5-point Likert scale questionnaire for Anxiety, Surprise and Calmness. The results previously obtained do have considerable relevance to the results of the new semantic differential questionnaire. The correlation of the previous questionnaire with the physiological measurements suggests a strong validity of that Likert-style questionnaire. Given the similar results for the semantic differential version of the questionnaire, it is highly likely that the correlation to the physiological measurements still exists, and hence the validity of the questionnaire may be assumed. These results show that the new semantic differential questionnaire can also provide a repeatable and reliable measure for assessing the user's perceived safety in response to robot motion.

Table 2 ANOVA table for Anxiety/Relaxation

Source of variation       df    MS      F       p

Speed                      2    42.09   40.96   .000
Task                       1     5.56    5.41   .021
Planning strategy          1     7.33    7.13   .008
Speed ∗ Task               2      .001    .001  .999
Speed ∗ Planner            2      .25     .25   .783
Task ∗ Planner             1     1.44    1.40   .238
Speed ∗ Task ∗ Planner     2      .15     .15   .863
Error                    181     1.03

7 Conclusions

The study proposes a series of questionnaires to measure the users' perception of robots. This series will be called "Godspeed" because it is intended to help creators of robots on their development journey. Appendix shows the application of the five Godspeed questionnaires using 5-point scales. It is important to notice that there is a certain overlap between anthropomorphism and animacy. The item artificial/lifelike appears in both sections. This is to be expected, since being alive is an essential part of being human-like. An additional correlation analysis is therefore recommended when both questionnaires are being administered in the same study. We also have to point out that the sensitivity of the Godspeed questionnaire series is not completely known. There may very well be a small difference in perception between two almost identical robots, but this difference might be too small to be picked up by the questionnaire with a small number of participants. If the experimenter suspects such a situation, then we recommend increasing the number of participants, based on a power analysis.

When one of these questionnaires is used by itself in a study, it would be useful to mask the questionnaire's intention by adding dummy items, such as optimistic/pessimistic. If multiple questionnaires are used, then the items should be mixed so as to mask the intention. Of course, each semantic differential needs to be headed with an instruction, such as "Please rate your impression of the robot". The interested reader may consult [5] to learn more about designing questionnaires. Before calculating the mean scores for anthropomorphism, animacy, likeability, or perceived intelligence, it is good practice to perform a reliability test and report the resulting Cronbach's Alpha.

The interpretation of the results has, of course, some limitations. First, it is extremely difficult to determine the ground truth. In other words, it is complicated to determine objectively, for example, how anthropomorphic a certain robot is. Many factors, such as the cultural backgrounds of the participants, prior experiences with robots, and personality, may influence the measurements. Taking all the possible biases into account would require a complex and therefore impracticable experiment. The resulting values of the measurements should therefore be interpreted not as absolute values, but rather as a tool for comparison. Robot developers can, for example, use the questionnaires to compare different configurations of a robot. The results may then help the developers to choose one option over the other. In the future, this set of questionnaires could be extended to also include the believability of a robot, the enjoyment of interacting with it, and the robot's social presence. However, we have to point out that the perceptions of humans are not stable. The more humans get used to the presence of robots, the more their knowledge and expectations might change. The questionnaires can therefore only offer a snapshot, and it is likely that if the experiment were repeated in twenty years, it would yield different results.

It is the hope of the authors that robot developers may find this collection of measurement tools useful. Using these tools would make the results in HRI research more comparable and could therefore increase our progress. Interested readers, in particular experimental psychologists, are invited to continue to develop these questionnaires, and to validate them further.

A necessary development would be translation into different languages. Only native speakers can understand the true meanings of the adjectives in their language. It is therefore necessary to translate the questionnaires into the mother language of the participants. Appendix includes the Japanese translation of the adjectives that we created using the back translation method. It is advisable to use the same method to translate the questionnaire into other languages. It would be appreciated if other translations are reported back to the authors of this study. They will then be collected and posted on this website: http://www.bartneck.de/2008/03/11/the-godspeed-questionnaire-series/.

Acknowledgement The Intelligent Robotics and Communication Laboratories at the Advanced Telecommunications Institute International (Kyoto, Japan) supported this study.
Open Access This article is distributed under the terms of the Cre-
ative Commons Attribution Noncommercial License which permits
any noncommercial use, distribution, and reproduction in any medium,
provided the original author(s) and source are credited.
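As a practical companion to the questionnaires in the Appendix, the two analyses recommended in the text (a Cronbach's Alpha reliability check before averaging item scores, and a Pearson correlation of ratings against an experimental factor such as robot speed) can be scripted in a few lines. The sketch below is illustrative only: all rating data are hypothetical placeholders, not the data behind Tables 1 and 2.

```python
import math


def cronbach_alpha(items):
    """Internal consistency of a scale.

    `items` holds one list of scores per questionnaire item, aligned across
    participants: alpha = k/(k-1) * (1 - sum(item variances) / var(totals)).
    """
    k = len(items)
    n = len(items[0])

    def var(xs):  # sample variance
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

    totals = [sum(item[p] for item in items) for p in range(n)]
    return k / (k - 1) * (1 - sum(var(i) for i in items) / var(totals))


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length samples."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


# Hypothetical 5-point ratings from six participants on the five perceived
# intelligence items (Incompetent/Competent, Ignorant/Knowledgeable, ...).
ratings = [
    [4, 5, 3, 4, 2, 4],
    [4, 4, 3, 5, 2, 3],
    [5, 4, 2, 4, 1, 3],
    [4, 5, 3, 4, 2, 4],
    [5, 5, 2, 4, 2, 3],
]

alpha = cronbach_alpha(ratings)
if alpha > 0.7:  # threshold suggested in the text
    # Only average the items into one concept score per participant
    # once reliability is adequate.
    scores = [sum(col) / len(ratings) for col in zip(*ratings)]
    print(round(alpha, 3), scores)

# Hypothetical per-trial data: robot speed level vs. Anxious/Relaxed rating.
speed = [1, 1, 1, 2, 2, 2, 3, 3, 3]
relaxed = [5, 4, 5, 3, 4, 3, 2, 1, 2]
r = pearson_r(speed, relaxed)
# Significance of r follows from t = r * sqrt((n-2)/(1-r^2)), df = n - 2.
t = r * math.sqrt((len(speed) - 2) / (1 - r ** 2))
print(round(r, 3), round(t, 2))
```

A negative r here mirrors the sign pattern of Table 1 (higher speed, lower Anxious/Relaxed rating); comparing t against the critical value for n − 2 degrees of freedom gives the 2-tailed test used for the reported coefficients.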

Appendix
The five Godspeed questionnaires, using 5-point semantic differential scales, with Japanese translations of the adjectives.

References

1. Sony (1999) Aibo, vol 1999
2. Breemen A, Yan X, Meerbeek B (2005) iCat: an animated user-interface robot with personality. In: Fourth international conference on autonomous agents & multi agent systems, Utrecht
3. Bartneck C, Rauterberg M (2007) HCI reality—an unreal tournament. Int J Hum Comput Stud 65:737–743
4. Kiesler S, Goetz J (2002) Mental models of robotic assistants. In: CHI'02 extended abstracts on human factors in computing systems, Minneapolis, Minnesota, USA
5. Fink A (2003) The survey kit, 2nd edn. Sage, Thousand Oaks
6. Kooijmans T, Kanda T, Bartneck C, Ishiguro H, Hagita N (2007) Accelerating robot development through integral analysis of human-robot interaction. IEEE Trans Robot 23:1001–1012
7. Dawis RV (1987) Scale construction. J Counsel Psychol 34:481–489
8. Lessiter J, Freeman J, Keogh E, Davidoff J (2001) A cross-media presence questionnaire: The ITC sense of presence inventory. Presence 10:282–297
9. Fong T, Nourbakhsh I, Dautenhahn K (2003) A survey of socially interactive robots. Robot Auton Syst 42:143–166
10. Bartneck C, Forlizzi J (2004) A design-centred framework for social human-robot interaction. In: Ro-Man2004, Kurashiki, pp 591–594
11. Dautenhahn K (2007) Socially intelligent robots: dimensions of human–robot interaction. Philos Trans R Soc B Biol Sci 362:679–704
12. Chalmers AF (1999) What is this thing called science? 3rd edn. Hackett, Indianapolis
13. Likert R (1932) A technique for the measurement of attitudes. Arch Psychol 140:1–55
14. Osgood CE, Suci GJ, Tannenbaum PH (1957) The measurement of meaning. University of Illinois Press, Champaign
15. Friborg O, Martinussen M, Rosenvinge JH (2006) Likert-based vs. semantic differential-based scorings of positive psychological constructs: A psychometric comparison of two versions of a scale measuring resilience. Pers Individ Differ 40:873–884
16. Powers A, Kiesler S (2006) The advisor robot: tracing people's mental model from a robot's physical attributes. In: 1st ACM SIGCHI/SIGART conference on human-robot interaction, Salt Lake City, Utah, USA
17. Nunnally JC (1978) Psychometric theory, 2nd edn. McGraw-Hill, New York
18. Ishiguro H (2005) Android science—towards a new cross-interdisciplinary framework. In: CogSci workshop towards social mechanisms of android science, Stresa, pp 1–6
19. Minato T, Shimada M, Itakura S, Lee K, Ishiguro H (2005) Does gaze reveal the human likeness of an android? In: 4th IEEE international conference on development and learning, Osaka
20. MacDorman KF (2006) Subjective ratings of robot video clips for human likeness, familiarity, and eeriness: An exploration of the uncanny valley. In: ICCS/CogSci-2006 long symposium: toward social mechanisms of android science, Vancouver
21. Bartneck C, Kanda T, Ishiguro H, Hagita N (2007) Is the uncanny valley an uncanny cliff? In: 16th IEEE international symposium on robot and human interactive communication, RO-MAN 2007, Jeju, Korea, pp 368–373
22. Bartneck C, Kanda T, Ishiguro H, Hagita N (2008) My robotic doppelganger—a critical look at the uncanny valley theory. In: Interaction studies—social behaviour and communication in biological and artificial systems
23. Fogg BJ (2003) Persuasive technology: using computers to change what we think and do. Morgan Kaufmann, San Mateo
24. Heider F, Simmel M (1944) An experimental study of apparent behavior. Am J Psychol 57:243–249
25. Scholl B, Tremoulet PD (2000) Perceptual causality and animacy. Trends Cogn Sci 4:299–309
26. Parisi D, Schlesinger M (2002) Artificial life and Piaget. Cogn Dev 17:1301–1321
27. Blythe P, Miller GF, Todd PM (1999) How motion reveals intention: Categorizing social interactions. In: Gigerenzer G, Todd P (eds) Simple heuristics that make us smart. Oxford University Press, London, pp 257–285
28. Turkle S (1998) Cyborg babies and cy-dough-plasm: ideas about life in the culture of simulation. In: Davis-Floyd R, Dumit J (eds) Cyborg babies: from techno-sex to techno-tots. Routledge, New York, pp 317–329
29. Kahn P, Ishiguro H, Friedman B, Kanda T (2006) What is a human?—Toward psychological benchmarks in the field of human-robot interaction. In: The 15th IEEE international symposium on robot and human interactive communication, ROMAN 2006, Salt Lake City, pp 364–371
30. Calverley DJ (2005) Toward a method for determining the legal status of a conscious machine. In: AISB 2005 symposium on next generation approaches to machine consciousness: imagination, development, intersubjectivity, and embodiment, Hatfield
31. Calverley DJ (2006) Android science and animal rights, does an analogy exist? Connect Sci 18:403–417
32. Zykov V, Mytilinaios E, Adams B, Lipson H (2005) Self-reproducing machines. Nature 435:163–164
33. Tremoulet PD, Feldman J (2000) Perception of animacy from the motion of a single object. Perception 29:943–951
34. McAleer P, Mazzarino B, Volpe G, Camurri A, Patterson H, Pollick F (2004) Perceiving animacy and arousal in transformed displays of human interaction. J Vis 4:230–230
35. Lee KM, Park N, Song H (2005) Can a robot be perceived as a developing creature? Hum Commun Res 31:538–563
36. Bartneck C, Kanda T, Mubin O, Mahmud AA (2007) The perception of animacy and intelligence based on a robot's embodiment. In: Humanoids 2007, Pittsburgh
37. Clark N, Rutter D (1985) Social categorization, visual cues and social judgments. Eur J Soc Psychol 15:105–119
38. Robbins T, DeNisi A (1994) A closer look at interpersonal affect as a distinct influence on cognitive processing in performance evaluations. J Appl Psychol 79:341–353
39. Berg JH, Piner K (1990) Social relationships and the lack of social relationship. In: Duck W, Silver RC (eds) Personal relationships and social support. Sage, Thousand Oaks, pp 104–221
40. Nass C, Reeves B (1996) The media equation. CSLI Publications/Cambridge University Press, Cambridge
41. Monahan JL (1998) I don't know it but I like you—the influence of non-conscious affect on person perception. Hum Commun Res 24:480–500
42. Burgoon JK, Hale JL (1987) Validation and measurement of the fundamental themes for relational communication. Commun Monogr 54:19–41
43. Dreyfus HL, Dreyfus SE (1992) What computers still can't do: a critique of artificial reason. MIT Press, Cambridge
44. Dreyfus HL, Dreyfus SE, Athanasiou T (1986) Mind over machine: the power of human intuition and expertise in the era of the computer. Free Press, New York
45. Weizenbaum J (1976) Computer power and human reason: from judgment to calculation. Freeman, San Francisco
46. Searle JR (1980) Minds, brains and programs. Behav Brain Sci 3:417–457
47. Koda T (1996) Agents with faces: a study on the effect of personification of software agents. MIT Media Lab, Cambridge
48. Warner RM, Sugarman DB (1986) Attributions of personality based on physical appearance, speech, and handwriting. J Pers Soc Psychol 50:792–799
49. Parise S, Kiesler S, Sproull LD, Waters K (1996) My partner is a real dog: cooperation with social agents. In: 1996 ACM conference on computer supported cooperative work, Boston, Massachusetts, United States, pp 399–408
50. Kiesler S, Sproull L, Waters K (1996) A prisoner's dilemma experiment on cooperation with people and human-like computers. J Pers Soc Psychol 70:47–65
51. Bartneck C, Verbunt M, Mubin O, Mahmud AA (2007) To kill a mockingbird robot. In: 2nd ACM/IEEE international conference on human-robot interaction, Washington DC, pp 81–87
52. American National Standards Institute (1999) RIA/ANSI R15.06—1999 American national standard for industrial robots and robot systems—safety requirements. American National Standards Institute, New York
53. Yamada Y, Hirasawa Y, Huang S, Umetani Y, Suita K (1997) Human-robot contact in the safeguarding space. IEEE/ASME Trans Mechatron 2:230–236
54. Yamada Y, Yamamoto T, Morizono T, Umetani Y (1999) FTA-based issues on securing human safety in a human/robot coexistence system. In: IEEE international conference on systems, man, and cybernetics, IEEE SMC'99 conference proceedings, vol 2, pp 1058–1063
55. Bicchi A, Rizzini SL, Tonietti G (2001) Compliant design for intrinsic safety: general issues and preliminary design. In: 2001 IEEE/RSJ international conference on intelligent robots and systems, proceedings, vol 4, pp 1864–1869
56. Bicchi A, Tonietti G (2004) Fast and "soft-arm" tactics [robot arm design]. IEEE Robot Autom Mag 11:22–33
57. Zinn M, Khatib O, Roth B, Salisbury JK (2002) Towards a human-centered intrinsically safe robotic manipulator. In: IARP/IEEE-RAS joint workshop on technical challenges for dependable robots in human environments, Toulouse, France
58. Zinn M, Khatib O, Roth B (2004) A new actuation approach for human friendly robot design. In: 2004 IEEE international conference on robotics and automation, proceedings, ICRA'04, vol 1, pp 249–254
59. Heinzmann J, Zelinsky A (1999) Building human-friendly robot systems. Int Symp Robot Res 305–312
60. Heinzmann J, Zelinsky A (2003) Quantitative safety guarantees for physical human-robot interaction. Int J Robot Res 22:479–504
61. Lew JY, Yung-Tsan J, Pasic H (2000) Interactive control of human/robot sharing same workspace. In: 2000 IEEE/RSJ international conference on intelligent robots and systems (IROS 2000), proceedings, pp 535–540
62. Zurada J, Wright AL, Graham JH (2001) A neuro-fuzzy approach for robot system safety. IEEE Trans Syst Man Cybern Part C Appl Rev 31:49–64
63. Traver VJ, del Pobil AP, Perez-Francisco M (2000) Making service robots human-safe. In: 2000 IEEE/RSJ international conference on intelligent robots and systems (IROS 2000), proceedings, vol 1, pp 696–701
64. Ikuta K, Ishii H, Nokata M (2003) Safety evaluation method of design and control for human-care robots. Int J Robot Res 22:281–297
65. Haddadin S, Albu-Schaffer A, Hirzinger G (2007) Safe physical human–robot interaction: Measurements, analysis & new insights. In: International symposium on robotics research (ISRR2007), Hiroshima, Japan
66. Kulic D, Croft E (2005) Anxiety detection during human-robot interaction. In: IEEE international conference on intelligent robots and systems, Edmonton, Canada, pp 389–394
67. Rani P, Sarkar N, Smith CA, Kirby LD (2004) Anxiety detecting robotic system—towards implicit human-robot collaboration. Robotica 22:85–95
68. Rani P, Sims J, Brackin R, Sarkar N (2002) Online stress detection using psychophysiological signals for implicit human-robot cooperation. Robotica 20:673–685
69. Inoue K, Nonaka S, Ujiie Y, Takubo T, Arai T (2005) Comparison of human psychology for real and virtual mobile manipulators. In: IEEE international conference on robot and human interactive communication, pp 73–78
70. Wada K, Shibata T, Saito T, Tanie K (2004) Effects of robot-assisted activity for elderly people and nurses at a day service center. Proc IEEE 92:1780–1788
71. Koay KL, Walters ML, Dautenhahn K (2005) Methodological issues using a comfort level device in human-robot interactions. In: IEEE RO-MAN, pp 359–364
72. Sarkar N (2002) Psychophysiological control architecture for human-robot coordination—concepts and initial experiments. In: IEEE international conference on robotics and automation, Washington, DC, USA, pp 3719–3724
73. Nonaka S, Inoue K, Arai T, Mae Y (2004) Evaluation of human sense of security for coexisting robots using virtual reality. In: IEEE international conference on robotics and automation, New Orleans, LA, USA, pp 2770–2775
74. Kulic D, Croft E (2007) Physiological and subjective responses to articulated robot motion. Robotica 25:13–27
75. Kulic D, Croft E (2006) Estimating robot induced affective state using hidden Markov models. In: RO-MAN 2006—the 15th IEEE international symposium on robot and human interactive communication, Hatfield, pp 257–262
76. Kulic D, Croft EA (2007) Affective state estimation for human-robot interaction. IEEE Trans Robot 23:991–1000
77. Kulic D, Croft E (2005) Safe planning for human-robot interaction. J Robot Syst 22:383–396
78. Khatib O (1986) Real-time obstacle avoidance for manipulators and mobile robots. Int J Robot Res 5:90–98
