WildHallucinations: Evaluating Long-form Factuality in LLMs with Real-World Entity Queries

Authors: Wenting Zhao, Tanya Goyal, Yu Ying Chiu, Liwei Jiang, Benjamin Newman, Abhilasha Ravichander, Khyathi Chandu, Ronan Le Bras, Claire Cardie, Yuntian Deng, Yejin Choi

Abstract: While hallucinations of large language models (LLMs) prevail as a major challenge, existing evaluation benchmarks on factuality do not cover the diverse domains of knowledge that the real-world users of LLMs seek information about. To bridge this gap, we introduce WildHallucinations, a benchmark that evaluates factuality. It does so by prompting LLMs to generate information about entities mined from user-chatbot conversations in the wild. These generations are then automatically fact-checked against a systematically curated knowledge source collected from web search. Notably, half of these real-world entities do not have associated Wikipedia pages. We evaluate 118,785 generations from 15 LLMs on 7,919 entities. We find that LLMs consistently hallucinate more on entities without Wikipedia pages and exhibit varying hallucination rates across different domains. Finally, given the same base models, adding a retrieval component only slightly reduces hallucinations but does not eliminate hallucinations.

Submitted 24 July, 2024; originally announced July 2024.

arXiv:2407.08876 [pdf, other]

DegustaBot: Zero-Shot Visual Preference Estimation for Personalized Multi-Object Rearrangement

Authors: Benjamin A. Newman, Pranay Gupta, Kris Kitani, Yonatan Bisk, Henny Admoni, Chris Paxton

Abstract: De gustibus non est disputandum ("there is no accounting for others' tastes") is a common Latin maxim describing how many solutions in life are determined by people's personal preferences. Many household tasks, in particular, can only be considered fully successful when they account for personal preferences such as the visual aesthetic of the scene. For example, setting a table could be optimized by arranging utensils according to traditional rules of Western table setting decorum, without considering the color, shape, or material of each object, but this may not be a completely satisfying solution for a given person. Toward this end, we present DegustaBot, an algorithm for visual preference learning that solves household multi-object rearrangement tasks according to personal preference. To do this, we use internet-scale pre-trained vision-and-language foundation models (VLMs) with novel zero-shot visual prompting techniques. To evaluate our method, we collect a large dataset of naturalistic personal preferences in a simulated table-setting task, and conduct a user study in order to develop two novel metrics for determining success based on personal preference. This is a challenging problem and we find that 50% of our model's predictions are likely to be found acceptable by at least 20% of people.

Submitted 11 July, 2024; originally announced July 2024.

Comments: 19 pages, 10 figures

arXiv:2404.10733 [pdf, other]

Bootstrapping Linear Models for Fast Online Adaptation in Human-Agent Collaboration

Authors: Benjamin A Newman, Chris Paxton, Kris Kitani, Henny Admoni

Abstract: Agents that assist people need to have well-initialized policies that can adapt quickly to align with their partners' reward functions. Initializing policies to maximize performance with unknown partners can be achieved by bootstrapping nonlinear models using imitation learning over large, offline datasets. Such policies can require prohibitive computation to fine-tune in-situ and therefore may miss critical run-time information about a partner's reward function as expressed through their immediate behavior. In contrast, online logistic regression using low-capacity models performs rapid inference and fine-tuning updates and thus can make effective use of immediate in-task behavior for reward function alignment. However, these low-capacity models cannot be bootstrapped as effectively by offline datasets and thus have poor initializations. We propose BLR-HAC, Bootstrapped Logistic Regression for Human Agent Collaboration, which bootstraps large nonlinear models to learn the parameters of a low-capacity model which then uses online logistic regression for updates during collaboration. We test BLR-HAC in a simulated surface rearrangement task and demonstrate that it achieves higher zero-shot accuracy than shallow methods and takes far less computation to adapt online while still achieving similar performance to fine-tuned, large nonlinear models. For code, please see our project page https://sites.google.com/view/blr-hac.

Submitted 16 April, 2024; originally announced April 2024.

Comments: 10 pages, 4 figures, Accepted to AAMAS 2024

arXiv:2401.13045 [pdf]

Assessment of Sports Concussion in Female Athletes: A Role for Neuroinformatics?

Authors: Rachel Edelstein, Sterling Gutterman, Benjamin Newman, John Darrell Van Horn

Abstract: Over the past decade, the intricacies of sports-related concussions among female athletes have become readily apparent. Traditional clinical methods for diagnosing concussions suffer limitations when applied to female athletes, often failing to capture subtle changes in brain structure and function. Advanced neuroinformatics techniques and machine learning models have become invaluable assets in this endeavor. While these technologies have been extensively employed in understanding concussion in male athletes, there remains a significant gap in our comprehension of their effectiveness for female athletes. With its remarkable data analysis capacity, machine learning offers a promising avenue to bridge this deficit. By harnessing the power of machine learning, researchers can link observed phenotypic neuroimaging data to sex-specific biological mechanisms, unraveling the mysteries of concussions in female athletes. Furthermore, embedding methods within machine learning enable examining brain architecture and its alterations beyond the conventional anatomical reference frame. This, in turn, allows researchers to gain deeper insights into the dynamics of concussions, treatment responses, and recovery processes. To guarantee that female athletes receive the optimal care they deserve, researchers must employ advanced neuroimaging techniques and sophisticated machine-learning models. These tools enable an in-depth investigation of the underlying mechanisms responsible for concussion symptoms stemming from neuronal dysfunction in female athletes.
This paper endeavors to address the crucial issue of sex differences in multimodal neuroimaging experimental design and machine learning approaches within female athlete populations, ultimately ensuring that they receive the tailored care they require when facing the challenges of concussions.

Submitted 9 March, 2024; v1 submitted 23 January, 2024; originally announced January 2024.

arXiv:2311.00059 [pdf, other]

The Generative AI Paradox: "What It Can Create, It May Not Understand"

Authors: Peter West, Ximing Lu, Nouha Dziri, Faeze Brahman, Linjie Li, Jena D. Hwang, Liwei Jiang, Jillian Fisher, Abhilasha Ravichander, Khyathi Chandu, Benjamin Newman, Pang Wei Koh, Allyson Ettinger, Yejin Choi

Abstract: The recent wave of generative AI has sparked unprecedented global attention, with both excitement and concern over potentially superhuman levels of artificial intelligence: models now take only seconds to produce outputs that would challenge or exceed the capabilities even of expert humans. At the same time, models still show basic errors in understanding that would not be expected even in non-expert humans. This presents us with an apparent paradox: how do we reconcile seemingly superhuman capabilities with the persistence of errors that few humans would make? In this work, we posit that this tension reflects a divergence in the configuration of intelligence in today's generative models relative to intelligence in humans. Specifically, we propose and test the Generative AI Paradox hypothesis: generative models, having been trained directly to reproduce expert-like outputs, acquire generative capabilities that are not contingent upon -- and can therefore exceed -- their ability to understand those same types of outputs. This contrasts with humans, for whom basic understanding almost always precedes the ability to generate expert-level outputs. We test this hypothesis through controlled experiments analyzing generation vs. understanding in generative models, across both language and image modalities.
Our results show that although models can outperform humans in generation, they consistently fall short of human capabilities in measures of understanding, and they show weaker correlation between generation and understanding performance and more brittleness to adversarial inputs. Our findings support the hypothesis that models' generative capability may not be contingent upon understanding capability, and call for caution in interpreting artificial intelligence by analogy to human intelligence.

Submitted 31 October, 2023; originally announced November 2023.

arXiv:2305.14772 [pdf, other]

A Question Answering Framework for Decontextualizing User-facing Snippets from Scientific Documents

Authors: Benjamin Newman, Luca Soldaini, Raymond Fok, Arman Cohan, Kyle Lo

Abstract: Many real-world applications (e.g., note taking, search) require extracting a sentence or paragraph from a document and showing that snippet to a human outside of the source document. Yet, users may find snippets difficult to understand as they lack context from the original document. In this work, we use language models to rewrite snippets from scientific documents to be read on their own. First, we define the requirements and challenges for this user-facing decontextualization task, such as clarifying where edits occur and handling references to other documents. Second, we propose a framework that decomposes the task into three stages: question generation, question answering, and rewriting. Using this framework, we collect gold decontextualizations from experienced scientific article readers. We then conduct a range of experiments across state-of-the-art commercial and open-source language models to identify how to best provide missing-but-relevant information to models for our task. Finally, we develop QaDecontext, a simple prompting strategy inspired by our framework that improves over end-to-end prompting. We conclude with analysis that finds, while rewriting is easy, question generation and answering remain challenging for today's models.

Submitted 30 November, 2023; v1 submitted 24 May, 2023; originally announced May 2023.

Comments: 19 pages, 2 figures, 8 tables, EMNLP 2023

arXiv:2302.13382 [pdf, other]

doi 10.1145/3544548.3581351

Comparing Sentence-Level Suggestions to Message-Level Suggestions in AI-Mediated Communication

Authors: Liye Fu, Benjamin Newman, Maurice Jakesch, Sarah Kreps

Abstract: Traditionally, writing assistance systems have focused on short or even single-word suggestions. Recently, large language models like GPT-3 have made it possible to generate significantly longer natural-sounding suggestions, offering more advanced assistance opportunities. This study explores the trade-offs between sentence- vs. message-level suggestions for AI-mediated communication. We recruited 120 participants to act as staffers from legislators' offices who often need to respond to large volumes of constituent concerns. Participants were asked to reply to emails with different types of assistance. The results show that participants receiving message-level suggestions responded faster and were more satisfied with the experience, as they mainly edited the suggested drafts. In addition, the texts they wrote were evaluated as more helpful by others. In comparison, participants receiving sentence-level assistance retained a higher sense of agency, but took longer for the task as they needed to plan the flow of their responses and decide when to use suggestions. Our findings have implications for designing task-appropriate communication assistance systems.

Submitted 26 February, 2023; originally announced February 2023.

Comments: 13 pages, 10 figures

arXiv:2211.09110 [pdf, other]

Holistic Evaluation of Language Models

Authors: Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, Benjamin Newman, Binhang Yuan, Bobby Yan, Ce Zhang, Christian Cosgrove, Christopher D. Manning, Christopher Ré, Diana Acosta-Navas, Drew A. Hudson, Eric Zelikman, Esin Durmus, Faisal Ladhak, Frieda Rong, Hongyu Ren, Huaxiu Yao , et al. (25 additional authors not shown)

Abstract: Language models (LMs) are becoming the foundation for almost all major language technologies, but their capabilities, limitations, and risks are not well understood. We present Holistic Evaluation of Language Models (HELM) to improve the transparency of language models. First, we taxonomize the vast space of potential scenarios (i.e. use cases) and metrics (i.e. desiderata) that are of interest for LMs. Then we select a broad subset based on coverage and feasibility, noting what's missing or underrepresented (e.g. question answering for neglected English dialects, metrics for trustworthiness). Second, we adopt a multi-metric approach: We measure 7 metrics (accuracy, calibration, robustness, fairness, bias, toxicity, and efficiency) for each of 16 core scenarios when possible (87.5% of the time). This ensures metrics beyond accuracy don't fall by the wayside, and that trade-offs are clearly exposed. We also perform 7 targeted evaluations, based on 26 targeted scenarios, to analyze specific aspects (e.g. reasoning, disinformation). Third, we conduct a large-scale evaluation of 30 prominent language models (spanning open, limited-access, and closed models) on all 42 scenarios, 21 of which were not previously used in mainstream LM evaluation. Prior to HELM, models on average were evaluated on just 17.9% of the core HELM scenarios, with some prominent models not sharing a single scenario in common. We improve this to 96.0%: now all 30 models have been densely benchmarked on the same core scenarios and metrics under standardized conditions.
Our evaluation surfaces 25 top-level findings. For full transparency, we release all raw model prompts and completions publicly for further analysis, as well as a general modular toolkit. We intend for HELM to be a living benchmark for the community, continuously updated with new scenarios, metrics, and models.

Submitted 1 October, 2023; v1 submitted 16 November, 2022; originally announced November 2022.

Comments: Authored by the Center for Research on Foundation Models (CRFM) at the Stanford Institute for Human-Centered Artificial Intelligence (HAI). Project page: https://crfm.stanford.edu/helm/v1.0

Journal ref: Published in Transactions on Machine Learning Research (TMLR), 2023

arXiv:2110.07280 [pdf, other]

P-Adapters: Robustly Extracting Factual Information from Language Models with Diverse Prompts

Authors: Benjamin Newman, Prafulla Kumar Choubey, Nazneen Rajani

Abstract: Recent work (e.g. LAMA (Petroni et al., 2019)) has found that the quality of the factual information extracted from Large Language Models (LLMs) depends on the prompts used to query them. This inconsistency is problematic because different users will query LLMs for the same information using different wording, but should receive the same, accurate responses regardless. In this work we aim to address this shortcoming by introducing P-Adapters: lightweight models that sit between the embedding layer and first attention layer of LLMs. They take LLM embeddings as input and output continuous prompts that are used to query the LLM. Additionally, we investigate Mixture of Experts (MoE) models that learn a set of continuous prompts ("experts") and select one to query the LLM. They require a separate classifier trained on human-annotated data to map natural language prompts to the continuous ones. P-Adapters perform comparably to the more complex MoE models in extracting factual information from BERT and RoBERTa while eliminating the need for additional annotations. P-Adapters show between 12-26% absolute improvement in precision and 36-50% absolute improvement in consistency over a baseline of only using natural language queries. Finally, we investigate what makes P-Adapters successful and conclude that a significant factor is access to the LLM's embeddings of the original natural language prompt, particularly the subject of the entity pair being queried.

Submitted 19 April, 2022; v1 submitted 14 October, 2021; originally announced October 2021.

Comments: 15 pages, 6 figures, 4 tables

arXiv:2108.07258 [pdf, other]

On the Opportunities and Risks of Foundation Models

Authors: Rishi Bommasani, Drew A. Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S. Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, Erik Brynjolfsson, Shyamal Buch, Dallas Card, Rodrigo Castellon, Niladri Chatterji, Annie Chen, Kathleen Creel, Jared Quincy Davis, Dora Demszky, Chris Donahue, Moussa Doumbouya, Esin Durmus, Stefano Ermon, John Etchemendy, Kawin Ethayarajh , et al. (89 additional authors not shown)

Abstract: AI is undergoing a paradigm shift with the rise of models (e.g., BERT, DALL-E, GPT-3) that are trained on broad data at scale and are adaptable to a wide range of downstream tasks. We call these models foundation models to underscore their critically central yet incomplete character. This report provides a thorough account of the opportunities and risks of foundation models, ranging from their capabilities (e.g., language, vision, robotics, reasoning, human interaction) and technical principles (e.g., model architectures, training procedures, data, systems, security, evaluation, theory) to their applications (e.g., law, healthcare, education) and societal impact (e.g., inequity, misuse, economic and environmental impact, legal and ethical considerations). Though foundation models are based on standard deep learning and transfer learning, their scale results in new emergent capabilities, and their effectiveness across so many tasks incentivizes homogenization. Homogenization provides powerful leverage but demands caution, as the defects of the foundation model are inherited by all the adapted models downstream. Despite the impending widespread deployment of foundation models, we currently lack a clear understanding of how they work, when they fail, and what they are even capable of due to their emergent properties. To tackle these questions, we believe much of the critical research on foundation models will require deep interdisciplinary collaboration commensurate with their fundamentally sociotechnical nature.

Submitted 12 July, 2022; v1 submitted 16 August, 2021; originally announced August 2021.

Comments: Authored by the Center for Research on Foundation Models (CRFM) at the Stanford Institute for Human-Centered Artificial Intelligence (HAI). Report page with citation guidelines: https://crfm.stanford.edu/report.html

arXiv:2104.09635 [pdf, other]

Refining Targeted Syntactic Evaluation of Language Models

Authors: Benjamin Newman, Kai-Siang Ang, Julia Gong, John Hewitt

Abstract: Targeted syntactic evaluation of subject-verb number agreement in English (TSE) evaluates language models' syntactic knowledge using hand-crafted minimal pairs of sentences that differ only in the main verb's conjugation. The method evaluates whether language models rate each grammatical sentence as more likely than its ungrammatical counterpart. We identify two distinct goals for TSE. First, evaluating the systematicity of a language model's syntactic knowledge: given a sentence, can it conjugate arbitrary verbs correctly? Second, evaluating a model's likely behavior: given a sentence, does the model concentrate its probability mass on correctly conjugated verbs, even if only on a subset of the possible verbs? We argue that current implementations of TSE do not directly capture either of these goals, and propose new metrics to capture each goal separately. Under our metrics, we find that TSE overestimates systematicity of language models, but that models score up to 40% better on verbs that they predict are likely in context.

Submitted 19 April, 2021; originally announced April 2021.

Comments: 14 pages, 5 figures, 3 tables. To appear at NAACL 2021

ACM Class: I.2.7

arXiv:2010.07358 [pdf, other]

Optimal Assistance for Object-Rearrangement Tasks in Augmented Reality

Authors: Benjamin Newman, Kevin Carlberg, Ruta Desai

Abstract: Augmented-reality (AR) glasses that will have access to onboard sensors and an ability to display relevant information to the user present an opportunity to provide user assistance in quotidian tasks. Many such tasks can be characterized as object-rearrangement tasks. We introduce a novel framework for computing and displaying AR assistance that consists of (1) associating an optimal action sequence with the policy of an embodied agent and (2) presenting this sequence to the user as suggestions in the AR system's heads-up display. The embodied agent comprises a "hybrid" between the AR system and the user, with the AR system's observation space (i.e., sensors) and the user's action space (i.e., task-execution actions); its policy is learned by minimizing the task-completion time. In this initial study, we assume that the AR system's observations include the environment's map and localization of the objects and the user. These choices allow us to formalize the problem of computing AR assistance for any object-rearrangement task as a planning problem, specifically as a capacitated vehicle-routing problem. Further, we introduce a novel AR simulator that can enable web-based evaluation of AR-like assistance and associated at-scale data collection via the Habitat simulator for embodied artificial intelligence. Finally, we perform a study that evaluates user response to the proposed form of AR assistance on a specific quotidian object-rearrangement task, house cleaning, using our proposed AR simulator on Mechanical Turk.
In particular, we study the effect of the proposed AR assistance on users' task performance and sense of agency over a range of task difficulties. Our results indicate that providing users with such assistance improves their overall performance and, while users report a negative impact on their agency, they may still prefer the proposed assistance to having no assistance at all.

Submitted 14 October, 2020; originally announced October 2020.

Comments: 19 pages including supplementary. Under review for ACM IUI 2021

arXiv:2010.07174 [pdf, other]

The EOS Decision and Length Extrapolation

Authors: Benjamin Newman, John Hewitt, Percy Liang, Christopher D. Manning

Abstract: Extrapolation to unseen sequence lengths is a challenge for neural generative models of language. In this work, we characterize the effect on length extrapolation of a modeling decision often overlooked: predicting the end of the generative process through the use of a special end-of-sequence (EOS) vocabulary item. We study an oracle setting - forcing models to generate to the correct sequence length at test time - to compare the length-extrapolative behavior of networks trained to predict EOS (+EOS) with networks not trained to (-EOS). We find that -EOS substantially outperforms +EOS, for example extrapolating well to lengths 10 times longer than those seen at training time in a bracket closing task, as well as achieving a 40% improvement over +EOS in the difficult SCAN dataset length generalization task. By comparing the hidden states and dynamics of -EOS and +EOS models, we observe that +EOS models fail to generalize because they (1) unnecessarily stratify their hidden states by their linear position in a sequence (structures we call length manifolds) or (2) get stuck in clusters (which we refer to as length attractors) once the EOS token is the highest-probability prediction.

Submitted 14 October, 2020; originally announced October 2020.

Comments: 16 pages, 7 figures, 9 tables, BlackboxNLP Workshop at EMNLP 2020

arXiv:1909.07290 [pdf, other]

Communication-based Evaluation for Natural Language Generation

Authors: Benjamin Newman, Reuben Cohn-Gordon, Christopher Potts

Abstract: Natural language generation (NLG) systems are commonly evaluated using n-gram overlap measures (e.g. BLEU, ROUGE). These measures do not directly capture semantics or speaker intentions, and so they often turn out to be misaligned with our true goals for NLG. In this work, we argue instead for communication-based evaluations: assuming the purpose of an NLG system is to convey information to a reader/listener, we can directly evaluate its effectiveness at this task using the Rational Speech Acts model of pragmatic language use. We illustrate with a color reference dataset that contains descriptions in pre-defined quality categories, showing that our method better aligns with these quality categories than do any of the prominent n-gram overlap methods.

Submitted 11 October, 2019; v1 submitted 16 September, 2019; originally announced September 2019.

Comments: 11 pages, 2 figures, SCiL, camera-ready - clarified certain points, updated acknowledgements

arXiv:1807.11154 [pdf, other]

HARMONIC: A Multimodal Dataset of Assistive Human-Robot Collaboration

Authors: Benjamin A. Newman, Reuben M. Aronson, Siddartha S. Srinivasa, Kris Kitani, Henny Admoni

Abstract: We present the Human And Robot Multimodal Observations of Natural Interactive Collaboration (HARMONIC) data set. This is a large multimodal data set of human interactions with a robotic arm in a shared autonomy setting designed to imitate assistive eating. The data set provides human, robot, and environmental data views of twenty-four different people engaged in an assistive eating task with a 6 degree-of-freedom (DOF) robot arm. From each participant, we recorded video of both eyes, egocentric video from a head-mounted camera, joystick commands, electromyography from the forearm used to operate the joystick, third person stereo video, and the joint positions of the 6 DOF robot arm. Also included are several features that come as a direct result of these recordings, such as eye gaze projected onto the egocentric video, body pose, hand pose, and facial keypoints. These data streams were collected specifically because they have been shown to be closely related to human mental states and intention. This data set could be of interest to researchers studying intention prediction, human mental state modeling, and shared autonomy. Data streams are provided in a variety of formats such as video and human-readable CSV and YAML files.

Submitted 30 July, 2020; v1 submitted 29 July, 2018; originally announced July 2018.

arXiv:1302.3912 [pdf]

An Online Environment for Democratic Deliberation: Motivations, Principles, and Design

Authors: Todd Davies, Brendan O'Connor, Alex Cochran, Jonathan J. Effrat, Andrew Parker, Benjamin Newman, Aaron Tam

Abstract: We have created a platform for online deliberation called Deme (which rhymes with 'team'). Deme is designed to allow groups of people to engage in collaborative drafting, focused discussion, and decision making using the Internet. The Deme project has evolved greatly from its beginning in 2003. This chapter outlines the thinking behind Deme's initial design: our motivations for creating it, the principles that guided its construction, and its most important design features. The version of Deme described here was written in PHP and was deployed in 2004 and used by several groups (including organizers of the 2005 Online Deliberation Conference). Other papers describe later developments in the Deme project (see Davies et al. 2005, 2008; Davies and Mintz 2009).

Submitted 15 February, 2013; originally announced February 2013.

Comments: Appeared in Todd Davies and Seeta Peña Gangadharan (Editors), Online Deliberation: Design, Research, and Practice, CSLI Publications/University of Chicago Press, October 2009, pp. 275-292; 18 pages, 3 figures

ACM Class: H.5.3; K.4.1; K.4.3

arXiv:1302.3545 [pdf]

Displaying Asynchronous Reactions to a Document: Two Goals and a Design

Authors: Todd Davies, Benjamin Newman, Brendan O'Connor, Aaron Tam, Leo Perry

Abstract: We describe and motivate two goals for the screen display of asynchronous text deliberation pertaining to a document: (1) visibility of relationships between comments and the text they reference, between different comments, and between group members and the document and discussion, and (2) distinguishability of boundaries between contextually related and unrelated text and comments and between individual authors of documents and comments. Interfaces for document-centered discussion generally fail to fulfill one or both of these goals as well as they could. We describe the design of the new version of Deme, a Web-based platform for online deliberation, and argue that it achieves the two goals better than other recent designs.

Submitted 14 February, 2013; originally announced February 2013.

Comments: Appeared as a Poster Paper, Conference on Computer Supported Cooperative Work, 20th Anniversary - Conference Supplement (CSCW 2006, Banff, November 4-8, 2006), pp. 169-170; Modified as "Document Centered Discussion: A Design Pattern for Online Deliberation", in D. Schuler, Liberating Voices: A Pattern Language for Communication Revolution, MIT Press, 2008, pp. 384-386; 2 pages, 1 figure, 1 table

ACM Class: H.5.3; I.7.1

arXiv:cs/0306116 [pdf]

Global Platform for Rich Media Conferencing and Collaboration

Authors: Harvey B. Newman, Philippe Galvez, Gregory Denis, David Collados, Kun Wei, David Adamczyk

Abstract: The Virtual Rooms Videoconferencing Service (VRVS) provides a worldwide videoconferencing service and collaborative environment to the research and education communities. This system provides a low-cost, bandwidth-efficient, extensible means for videoconferencing and remote collaboration over networks within the High Energy and Nuclear Physics communities (HENP). VRVS has become a standard part of the toolset used daily by a large sector of HENP, and it is used increasingly for other DoE/NSF-supported programs. The current features include multi-protocol, multi-OS support for all significant video-enabled clients, including H.323, Mbone, QuickTime, MPEG2, Java Media Framework, and other clients. The current architecture makes VRVS a distributed, highly functional, and efficient software-only system for multipoint audio, video and web conferencing and collaboration over global IP networks. VRVS has developed the VRVS-AG Reflector and a specialized Web interface that enables end users to connect to any Access Grid (AG) session, in any of the AG "virtual venues", from anywhere worldwide. The VRVS system has now been running for the last five and a half years, offering the HENP community a working and reliable tool for collaboration within groups and among physicists dispersed worldwide. The goal of this ongoing effort is to develop the next generation of collaborative systems running over next-generation networks. The new development areas integrate emerging standards, include all security aspects, and will extend the range of supported VRVS video technologies to cover the latest high-end quality standards. We will focus the discussion on the new capabilities provided by the latest version, V3.0, and its future evolution.

Submitted 15 July, 2003; v1 submitted 19 June, 2003; originally announced June 2003.

Comments: CHEP03 Conference

ACM Class: H.5.3

arXiv:cs/0306109 [pdf]

Distributed Heterogeneous Relational Data Warehouse In A Grid Environment

Authors: Saima Iqbal, Julian J. Bunn, Harvey B. Newman

Abstract: This paper examines how a "Distributed Heterogeneous Relational Data Warehouse" can be integrated in a Grid environment that will provide physicists with efficient access to large and small object collections drawn from databases at multiple sites. This paper investigates the requirements of Grid-enabling such a warehouse, and explores how these requirements may be met by extensions to existing Grid middleware. We present initial results obtained with a working prototype warehouse of this kind using both SQLServer and Oracle9i, where a Grid-enabled web-services interface makes it easier for web applications to access the distributed contents of the databases securely. Based on the success of the prototype, we propose a framework for using a heterogeneous relational data warehouse through the web-service interface and creating a single "Virtual Database System" for users. The ability to transparently access data in this way, as shown in the prototype, is likely to be a very powerful facility for HENP and other Grid users wishing to collate and analyze information distributed over the Grid.

Submitted 18 June, 2003; originally announced June 2003.

Comments: 4 pages, 6 figures

ACM Class: H.2.1; H.2.2; H.2.4; H.2.7; H.3.1; H.3.5

arXiv:cs/0306096 [pdf]

MonALISA : A Distributed Monitoring Service Architecture

Authors: H. B. Newman, I. C. Legrand, P. Galvez, R. Voicu, C. Cirstoiu

Abstract: The MonALISA (Monitoring Agents in A Large Integrated Services Architecture) system provides a distributed monitoring service. MonALISA is based on a scalable Dynamic Distributed Services Architecture which is designed to meet the needs of physics collaborations for monitoring global Grid systems, and is implemented using JINI/JAVA and WSDL/SOAP technologies. The scalability of the system derives from the use of multithreaded Station Servers to host a variety of loosely coupled self-describing dynamic services, the ability of each service to register itself and then to be discovered and used by any other services, or clients that require such information, and the ability of all services and clients subscribing to a set of events (state changes) in the system to be notified automatically. The framework integrates several existing monitoring tools and procedures to collect parameters describing computational nodes, applications and network performance. It has built-in SNMP support and network-performance monitoring algorithms that enable it to monitor end-to-end network performance as well as the performance and state of site facilities in a Grid. MonALISA is currently running around the clock on the US CMS test Grid as well as an increasing number of other sites. It is also being used to monitor the performance and optimize the interconnections among the reflectors in the VRVS system.

Submitted 16 June, 2003; originally announced June 2003.

Comments: Talk from the 2003 Computing in High Energy and Nuclear Physics (CHEP03), La Jolla, Ca, USA, March 2003, 8 pages, pdf. PSN MOET001

ACM Class: H4.3; H5.2; J2; D2.8

arXiv:cs/0306002 [pdf, ps, other]

The Clarens web services architecture

Authors: Conrad D. Steenberg, Eric Aslakson, Julian J. Bunn, Harvey B. Newman, Michael Thomas, Frank van Lingen

Abstract: Clarens is a uniquely flexible web services infrastructure providing a unified access protocol to a diverse set of functions useful to the HEP community. It uses the standard HTTP protocol combined with application layer, certificate based authentication to provide single sign-on to individuals, organizations and hosts, with fine-grained access control to services, files and virtual organization (VO) management. This contribution describes the server functionality, while client applications are described in a subsequent talk.

Submitted 14 July, 2003; v1 submitted 30 May, 2003; originally announced June 2003.

Comments: Talk from the 2003 Computing in High Energy and Nuclear Physics (CHEP03), La Jolla, Ca, USA, March 2003, 6 pages, LaTeX, 4 figures, PSN MONT008

ACM Class: H.3.4

arXiv:cs/0306001 [pdf, ps, other]

Clarens Client and Server Applications

Authors: Conrad D. Steenberg, Eric Aslakson, Julian J. Bunn, Harvey B. Newman, Michael Thomas, Frank van Lingen

Abstract: Several applications have been implemented with access via the Clarens web service infrastructure, including virtual organization management, JetMET physics data analysis using relational databases, and Storage Resource Broker (SRB) access. This functionality is accessible transparently from Python scripts, the Root analysis framework and from Java applications and browser applets.

Submitted 14 July, 2003; v1 submitted 30 May, 2003; originally announced June 2003.

Comments: Talk from the 2003 Computing in High Energy and Nuclear Physics (CHEP03), La Jolla, Ca, USA, March 2003, 4 pages, LaTeX, no figures, PSN TUCT005

ACM Class: H.3.4

Showing 1–22 of 22 results for author: Newman, B