SIVA'23 - Workshop on Socially Interactive Human-like Virtual Agents


Morning schedule

Room: Paniolo II

All times are in Hawaii Standard Time

09:00am Welcome speech (SIVA organizers)
09:10am - 09:50am The psychological benefits of virtual human agents
Gale Lucas (Institute for Creative Technologies, University of Southern California)

To explore the benefits of virtual human agents with the ability to engage socially with users, this talk presents research comparing such agents both to non-social machines and to humans. Social agents have the potential to build rapport like humans (which non-social machines cannot do), but do so while assuring anonymity (which humans cannot do). In this way, they may offer the “best of both worlds” in terms of psychological benefits, especially by helping users feel comfortable in situations where they would otherwise fear being negatively evaluated. This has implications for user design and offers possibilities for future research.

09:50am - 10:10am Dynamic face imaging: a novel analysis framework for 4D social face perception and expression
Lukas Snoek (University of Glasgow); Rachael Jack (University of Glasgow); Philippe Schyns (Institute of Neuroscience and Psychology, University of Glasgow)

Measuring facial expressions is a notoriously difficult and time-consuming process, often involving manual labeling of low-dimensional descriptors such as Action Units (AUs). Computer vision algorithms provide automated alternatives for measuring and annotating face shape and expression from 2D images, but often ignore the complexities of dynamic 3D facial expressions. Moreover, open-source implementations are often difficult to use, preventing widespread adoption by the wider scientific community beyond computer vision. To address these issues, we develop dynamic face imaging, a novel analysis framework to study social face perception and expression. We use state-of-the-art 3D face reconstruction models to quantify face movement as temporal shape deviations in a common 3D mesh topology, which disentangles global (head) movement and local (facial) movement. Using a set of validation analyses, we test different reconstruction algorithms and quantify how well they reconstruct facial “action units” and track key facial landmarks in 3D, demonstrating promising performance and highlighting areas for improvement. We provide an open-source software package that implements functionality for easy reconstruction, preprocessing, and analysis of these dynamic facial expression data.
10:10am - 10:45am Coffee break
10:45am - 11:05am Modelling Culturally Diverse Smiles Using Data-Driven Methods
Chaona Chen (University of Glasgow); Oliver Garrod (Institute of Neuroscience and Psychology, University of Glasgow); Philippe Schyns (Institute of Neuroscience and Psychology, University of Glasgow); Rachael Jack (University of Glasgow)

Smiling faces are often preferred in daily social interactions. Many socially interactive human-like virtual agents are equipped with the capability to produce standardized smiles. Such smiles often comprise a specific set of facial movements (i.e., Action Units, AUs), e.g., Lip Corner Puller (AU12) and Cheek Raiser (AU6), that are widely considered to be universal and are often presented on a static image. However, mounting evidence shows that people from different cultures prefer different smiles. To engage a culturally diverse range of human users, socially interactive human-like virtual agents must be equipped with culturally valid dynamic facial expressions. To develop culturally sensitive smiles, we use data-driven, perception-based methods to model the facial expressions of happy in 60 individuals in two distinct cultures (East Asian and Western European). On each experimental trial, we generated a random facial animation composed of a random subset of individual face movements (i.e., AUs), each with a random movement. Each participant categorized 2400 such facial animations according to an emotion label (e.g., happy) if appropriate, otherwise selecting ‘other.’ We derived facial expression models of happy for each participant by measuring the statistical relationship between the dynamic AUs presented on each trial and that participant's responses. Analysis of the facial expression models revealed clear cross-cultural similarity and diversity in smiles: for example, smiling with raised cheeks (AU12-6) is culturally common, while open-mouth smiling (AU25-12) is Western-specific and smiling with eyebrow raising (AU1-2) is East Asian-specific. Analysis of the temporal dynamics of each AU further revealed cultural diversity in smiles: for example, East Asian smiles show higher amplitude and faster acceleration, while Western smiles show earlier peak activation.
Our results therefore demonstrate the general power of using a data-driven, perception-based approach to derive culturally sensitive dynamic facial expressions, which are directly transferable to socially interactive human-like virtual agents. We anticipate that our approach will improve the social signalling capabilities of socially interactive human-like virtual agents and broaden their usability in the global market.

11:05am - 11:25am The Role of the Vocal Persona in Natural and Synthesized Speech
Camille Noufi (Stanford University); Lloyd May (Stanford University); Jonathan Berger (Stanford University)

The inclusion of voice persona in synthesized voice can be significant in a broad range of human-computer-interaction (HCI) applications, including augmentative and assistive communication (AAC), artistic performance, and design of virtual agents. We propose a framework to imbue compelling and contextually-dependent expression within a synthesized voice by introducing the role of the vocal persona within a synthesis system. In this framework, the resultant ‘tone of voice’ is defined as a point existing within a continuous, contextually-dependent probability space that is traversable by the user of the voice. We also present initial findings of a thematic analysis of 10 interviews with vocal studies and performance experts to further understand the role of the vocal persona within a natural communication ecology. The themes identified are then used to inform the design of the aforementioned framework.
11:25am - 11:45am Acceptability and Trustworthiness of Virtual Agents by Effects of Theory of Mind and Social Skills Training
Hiroki Tanaka (Nara Institute of Science and Technology); Takeshi Saga (Nara Institute of Science and Technology); Kota Iwauchi (Nara Institute of Science and Technology); Satoshi Nakamura (Nara Institute of Science and Technology, Japan)

We constructed a social skills training system using virtual agents and developed a new training module for four basic tasks: declining, requesting, praising, and listening. Previous work demonstrated that a virtual agent’s theory of mind influences the building of trust between agents and users. The purpose of this study is to explore how the agents’ theory of mind and the social skills training contents affect trustworthiness, acceptability, familiarity, and likeability. In our experiment, 29 participants rated the trustworthiness and acceptability of the virtual agent after watching a video that featured levels of theory of mind and social skills training. Their system ratings were obtained using self-evaluation measures at each stage. We confirmed that users’ trust in and acceptance of the virtual agent changed significantly depending on the level of the virtual agent’s theory of mind. We also confirmed that users’ trust in and acceptance of the trainer tended to improve after the social skills training.
12pm - 1pm Lunch Break

Afternoon schedule

Room: Paniolo II

All times are in Hawaii Standard Time

1:00pm - 1:40pm On Challenges and Opportunities in Situated Language Interaction
Dan Bohus (Microsoft Research)

Situated language interaction is a complex, multimodal affair that extends well beyond the spoken word. When interacting with each other, we use a wide array of verbal and non-verbal signals to resolve several problems in parallel: we manage engagement, coordinate on taking turns, recognize intentions, and establish and maintain common ground. Proximity and body pose, attention and gaze, head nods and hand gestures, as well as prosody and facial expressions, all play very important roles in this process. Recent advances with deep learning methods on various perceptual tasks promise to create a more robust foundation for tracking these types of signals. Yet, developing agents that can engage in fluid, natural interactions with people in physically situated settings requires not just detecting these signals, but incrementally coordinating with people, in real time, on producing them. In this talk, using a few research vignettes from work we have done over the last decade at Microsoft Research, I will draw attention to some of the challenges and opportunities that lie ahead of us in constructing systems that understand the world around them and collaborate with people in physical space.
1:40pm - 2:00pm Signing Avatars - Multimodal Challenges for Text-to-sign Generation
Sylvie Gibet (Université Bretagne Sud)

This position paper surveys existing technologies for animating signing avatars from written language. The main grammatical mechanisms of sign languages are described, in particular the sign-inflecting mechanisms, in light of the processes of spatialization and iconicity that characterize these visual-gestural languages. The challenges faced by sign language generation systems using signing avatars are then outlined, as well as unresolved issues in building text-to-sign generation systems.
2:00pm - 2:20pm Zero-Shot Style Transfer for Multimodal Data-Driven Gesture Synthesis
Mireille Fares (Sorbonne University); Catherine Pelachaud (Sorbonne Université); Nicolas Obin (STMS (Ircam, CNRS, Sorbonne Université))

We propose a multimodal speech-driven approach to generate 2D upper-body gestures for virtual agents, in the communicative style of different speakers, seen or unseen by our model during training. Upper-body gestures of a source speaker are generated based on the content of his/her multimodal data: speech acoustics and text semantics. The synthesized source speaker’s gestures are conditioned on the multimodal style representation of the target speaker. Our approach is zero-shot and can generalize the style transfer to new unseen speakers without any additional training. An objective evaluation is conducted to validate our approach.
2:20pm - 2:40pm AFFDEX 2.0: A Real-Time Facial Expression Analysis Toolkit
Mina Bishay (SmartEye); Kenneth Preston (SmartEye); Matthew Strafuss (SmartEye); Graham Page (SmartEye); Jay Turcot (Affectiva); Mohammad Mavadati (SmartEye)

In this paper we introduce AFFDEX 2.0, a toolkit for analyzing facial expressions in the wild. It is intended for users aiming to: a) estimate the 3D head pose, b) detect facial Action Units (AUs), c) recognize basic emotions and 2 new emotional states (sentimentality and confusion), and d) detect high-level expressive metrics like blink and attention. AFFDEX 2.0 models are mainly based on Deep Learning, and are trained using a large-scale naturalistic dataset consisting of thousands of participants from different demographic groups. AFFDEX 2.0 is an enhanced version of our previous toolkit [33] that is capable of tracking faces under challenging conditions, detecting facial expressions more accurately, and recognizing new emotional states (sentimentality and confusion). AFFDEX 2.0 outperforms the state-of-the-art methods in AU detection and emotion recognition. AFFDEX 2.0 can process multiple faces in real time, and runs on both Windows and Linux platforms.
2:40pm - 3:00pm Casual chatter or speaking up? Adjusting articulatory effort in generation of speech and animation for conversational characters
Joakim Gustafson (KTH Royal Institute of Technology); Eva Szekely (KTH); Simon Alexanderson (KTH Royal Institute of Technology); Jonas Beskow (KTH Royal Institute of Technology)

Embodied conversational agents and social robots need to be able to generate spontaneous behavior in order to be believable in social interactions. We present a system that can generate spontaneous speech with supporting lip movements. The conversational TTS voice is trained on a podcast corpus that has been prosodically tagged (f0, speaking rate and energy) and transcribed (including tokens for breathing, fillers and laughter). We introduce a speech animation algorithm where articulatory effort can be adjusted. The speech animation is driven by time-stamped phonemes obtained from the internal alignment attention map of the TTS system, and we use prominence estimates from the synthesised speech waveform to modulate the lip and jaw movements accordingly.
3:00pm - 3:30pm Coffee break
03:30pm - 04:00pm Are we in sync during turn switch?
Jieyeon Woo (ISIR, Sorbonne University); Liu Yang (ISIR, Sorbonne University); Catherine Achard (UPMC); Catherine Pelachaud (Sorbonne Université)

During conversation, interlocutors exchange turns by coordinating with their partners. Exchanges can occur smoothly, with pauses between turns, or through interruptions. Previous studies have analyzed various modalities to investigate turn shifts and their types (smooth turn exchange, overlap, and interruption). Modality analyses have also been used to study interpersonal synchronization, which is observed throughout the whole interaction. Likewise, we intend to analyze different modalities to find a relationship between the different turn switch types and interpersonal synchrony. In this study, we provide an analysis of multimodal features, focusing on prosodic features (F0 and loudness), head activity, and facial action units, to characterize different switch types.

04:00pm - 04:20pm Toward a Scoping Review of Social Intelligence in Virtual Humans
Sharon A Mozgai (USC ICT); Sarah Beland (USC Institute for Creative Technologies); Andrew Leeds (USC Institute for Creative Technologies); Jade Winn (USC Libraries); Cari Kaurloto (USC Libraries); Dirk Heylen (University of Twente); Arno Hartholt

As the demand for socially intelligent Virtual Humans (VHs) increases, so does the demand for the effective and efficient cross-discipline collaboration required to bring these VHs “to life”. One avenue for increasing cross-discipline fluency is the aggregation and organization of seemingly disparate areas of research and development (e.g., graphics and emotion models) that are essential to the field of VH research. Our initial investigation (1) identifies and catalogues research streams concentrated in three multidisciplinary VH topic clusters within the domain of social intelligence: Emotion, Social Behavior, and The Face; (2) brings to the forefront key themes and prolific authors within each topic cluster; and (3) provides evidence that a full scoping review is warranted to further map the field, aggregate research findings, and identify gaps in the research. To enable collaboration, we provide full access to the refined VH cluster datasets, keyword and author word clouds, as well as interactive evidence maps.
04:20pm - 04:30pm Closing remarks (SIVA organizers)
