
UniCAE

Unified Cross-modal Affective & Empathetic Computing

AI should not remain a cold parser of instructions. It should learn to read human feelings, reason about emotional context, and respond with empathy across language, speech, face, avatar, vision, and 3D. UniCAE studies a unified paradigm for both affective comprehension and affective generation.

👉🏻 Affective Comprehension

Infer and reason about fine-grained sentiment and emotion from multimodal human signals: dialogue, behavior, speech, face, avatar, vision, and embodied context.

👉🏻 Affective Generation

Generate emotionally aligned text, speech, facial expressions, avatars, and 3D motion that feel coherent to humans.

Concept illustration of unified affective understanding and empathetic generation

Flagship Research

Track A

Cross-modal Affective Understanding & Reasoning

ACL 2023 Findings · Benchmark · Dialogue ABSA

DiaASQ: A Benchmark of Conversational Aspect-based Sentiment Quadruple Analysis

A first step from sentence-level aspect-based sentiment analysis (ABSA) toward dialogue-native, fine-grained conversational sentiment understanding.

  • Introduces conversational aspect-based sentiment quadruple analysis over multi-turn dialogue (quadruple structure sketched below).
  • Builds a bilingual Chinese-English benchmark with high-quality manual annotations.
  • Establishes end-to-end quadruple prediction with dialogue-specific discourse modeling.
Overview figure for DiaASQ
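
To make the task shape concrete, here is a minimal data-structure sketch of a conversational sentiment quadruple under the (target, aspect, opinion, sentiment) formulation above; the class and field names are illustrative assumptions, not the official DiaASQ schema or release format.

```python
from dataclasses import dataclass
from typing import List, Literal

@dataclass
class SentimentQuadruple:
    # Hypothetical fields mirroring the (target, aspect, opinion, sentiment) tuple.
    target: str                              # entity under discussion, e.g. a phone model
    aspect: str                              # attribute of the target, e.g. "battery life"
    opinion: str                             # opinion expression span, e.g. "drains too fast"
    sentiment: Literal["pos", "neg", "neu"]  # polarity of the opinion

@dataclass
class DialogueTurn:
    speaker: str
    utterance: str

@dataclass
class DialogueExample:
    turns: List[DialogueTurn]                # multi-turn conversation
    quadruples: List[SentimentQuadruple]     # elements of one quadruple may come from different turns
```

The dialogue-level setting is harder than sentence-level ABSA precisely because the four elements of one quadruple can be scattered across turns and speakers rather than co-occurring in a single sentence.
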
ACM MM 2024 Oral · Cross-modal · Dialogue Reasoning

PanoSent: A Panoptic Sextuple Extraction Benchmark for Cross-modal Conversational Aspect-based Sentiment Analysis

A broader formulation of sentiment reasoning that treats multimodality, rationales, and sentiment dynamics as first-class citizens.

  • Defines panoptic sentiment sextuple extraction and sentiment flipping analysis in multimodal conversation.
  • Curates a large-scale multilingual dataset with holder, target, aspect, opinion, sentiment, and rationale labels (sextuple sketched below).
  • Proposes Chain-of-Sentiment reasoning with the Sentica multimodal model and a verification mechanism.
Overview figure for PanoSent
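
As a rough illustration of the richer element set, the sketch below spells out the panoptic sextuple and a sentiment-flip record; the class and field names are hypothetical placeholders, not the dataset's actual annotation schema.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class SentimentSextuple:
    holder: str        # who expresses the sentiment
    target: str        # entity the sentiment is about
    aspect: str        # attribute of the target
    opinion: str       # opinion expression (may live in text, image, or audio)
    sentiment: str     # e.g. "positive" / "negative" / "neutral"
    rationale: str     # evidence explaining why the sentiment holds

@dataclass
class SentimentFlip:
    # Sentiment dynamics: the same holder/target pair before and after a change of opinion.
    before: SentimentSextuple
    after: SentimentSextuple
    trigger: Optional[str] = None  # what caused the flip, if identifiable
```
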

Track B

Cross-modal Affective & Empathetic Generation

ACL 2024 Demo · Open-source System · Avatar Chatbot

EmpathyEar: An Open-source Avatar Multimodal Empathetic Chatbot

An open multimodal empathetic chatbot that turns text-only empathetic response generation (ERG) into embodied avatar interaction.

  • Accepts text, speech, and vision inputs in flexible combinations.
  • Produces empathetic responses with text, a talking face, and synchronized speech (I/O interface sketched below).
  • Uses emotion-aware instruction tuning for deeper emotional resonance and human-like response quality.
Overview figure for EmpathyEar
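
A minimal interface sketch of this kind of avatar empathetic chatbot, assuming flexible multimodal input and text + speech + talking-face output as described above; the class and method names are illustrative, not the repository's actual API.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class UserInput:
    # Any combination of modalities may be present.
    text: Optional[str] = None
    speech_wav: Optional[bytes] = None
    image: Optional[bytes] = None

@dataclass
class EmpatheticResponse:
    reply_text: str            # empathetic text reply
    reply_speech_wav: bytes    # synthesized speech aligned with the reply
    talking_face_video: bytes  # avatar video synchronized with the speech

class AvatarEmpatheticChatbot:
    def respond(self, user_input: UserInput) -> EmpatheticResponse:
        """Infer the user's emotional state from whichever modalities are present,
        then return an empathetic reply as text, speech, and a talking-face video."""
        raise NotImplementedError
```
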
WWW 2025 · Benchmark · Text-Speech-Vision

Towards Multimodal Empathetic Response Generation: A Rich Text-Speech-Vision Avatar-based Benchmark

A benchmark and system for end-to-end multimodal empathetic response generation with authentic speech and avatar video.

  • Introduces AvaMERG for avatar-based multimodal ERG with diverse profiles and real-world scenarios.
  • Builds Empatheia on top of an MLLM with multimodal encoder, speech generator, and avatar generator.
  • Adds Chain-of-Empathetic reasoning and empathy-enhanced tuning for emotional accuracy and cross-modal consistency (staged pipeline sketched below).
Overview figure for the multimodal empathetic response generation benchmark
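
The staged, reason-then-generate pattern can be sketched roughly as below; every component name is a placeholder standing in for the corresponding module (emotion reasoning, response generation, speech synthesis, avatar rendering), not the Empatheia implementation.

```python
def empathetic_respond(dialogue_context, emotion_reasoner, responder, tts, avatar_renderer):
    # Step 1: infer the user's emotional state and its likely cause from the multimodal context.
    emotion, cause = emotion_reasoner(dialogue_context)
    # Step 2: generate an empathetic text reply conditioned on the inferred emotion and cause.
    reply_text = responder(dialogue_context, emotion=emotion, cause=cause)
    # Step 3: synthesize speech that carries the intended emotional tone.
    reply_speech = tts(reply_text, emotion=emotion)
    # Step 4: render a talking-face avatar video synchronized with the speech.
    reply_video = avatar_renderer(reply_speech, emotion=emotion)
    return reply_text, reply_speech, reply_video
```

Making the emotion inference explicit before generation is what lets the later modalities (speech tone, facial expression) stay consistent with the text reply.
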
EMNLP 2025 Oral · 3D Expression · Text-to-Face

When Words Smile: Generating Diverse Emotional Facial Expressions from Text

Emotional facial expression generation as a missing piece beyond lip synchronization for digital humans.

  • Learns a continuous latent space for diverse, fluid, and emotionally coherent facial expressions (sampling sketched below).
  • Introduces EmoAva, a dataset of 15,000 text-to-3D-expression pairs.
  • Advances text-driven expressive avatar generation for dialogue, gaming, and interactive agents.
Overview figure for emotional 3D avatar generation
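
A minimal sketch of what sampling from a continuous, text-conditioned expression latent space could look like; the encoder/decoder interfaces and the per-frame 3D expression coefficient output are assumptions for illustration, not the released pipeline.

```python
import numpy as np

def sample_expression_sequences(text, text_encoder, expression_decoder,
                                latent_dim=64, num_samples=3, rng=None):
    """Draw several latent codes for one text prompt to obtain diverse expression sequences."""
    rng = rng or np.random.default_rng()
    condition = text_encoder(text)                # text condition vector (assumed interface)
    sequences = []
    for _ in range(num_samples):
        z = rng.standard_normal(latent_dim)       # different z -> diverse expressions for the same text
        # Decoder maps (condition, latent code) to a sequence of per-frame 3D expression coefficients.
        sequences.append(expression_decoder(condition, z))
    return sequences
```
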
NeurIPS 2025 · Motion LLM · Video RAG

VimoRAG: Video-based Retrieval-augmented 3D Motion Generation for Motion Language Models

Retrieval-augmented motion generation that leverages in-the-wild video as emotional and behavioral grounding for 3D motion.

  • Addresses out-of-domain and out-of-vocabulary failure modes in motion language models.
  • Retrieves motion-centered 2D video evidence with a Gemini Motion Video Retriever.
  • Improves robustness with a motion-centric dual-alignment DPO trainer for generation conditioned on retrieved evidence (retrieve-then-generate loop sketched below).
Overview figure for VimoRAG
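
The retrieve-then-generate loop can be sketched as follows, assuming a generic video retriever and a motion language model with a conditional generate call; the function names and signatures are placeholders, not the released VimoRAG API.

```python
def retrieve_motion_videos(prompt, video_index, top_k=3):
    """Rank candidate in-the-wild videos by how well their motion content matches the prompt."""
    ranked = sorted(video_index.videos,
                    key=lambda video: video_index.score(prompt, video),
                    reverse=True)
    return ranked[:top_k]

def generate_motion(prompt, motion_lm, video_index):
    """Retrieve motion-centered video evidence, then condition the motion language model on it."""
    evidence = retrieve_motion_videos(prompt, video_index)
    return motion_lm.generate(prompt=prompt, video_evidence=evidence)
```

The retrieved 2D video serves as grounding for prompts the motion language model has never seen, which is how the approach targets out-of-domain and out-of-vocabulary failures.
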

Community

Survey · In Progress · UniCAE Roadmap

A Survey on Unified Cross-modal Affective Comprehension and Generation

A living survey that systematizes how future emotionally intelligent AI should jointly comprehend, reason about, and generate affective content across modalities.

  • Scope: language, speech, face, avatar, multimodal dialogue, expressive 3D, and embodied empathy.
  • Focus: unified task taxonomies, datasets, benchmarks, models, evaluation, and open challenges.
  • Status: paper in progress, intended as a roadmap for the UniCAE thread.

Unified Affective Comprehension & Generation