
UniCAE

Unified Cross-modal Affective & Empathetic Computing

AI should not remain a cold parser of instructions. It should learn to read human feelings, reason about emotional context, and respond with empathy across language, speech, face, avatar, vision, and 3D. UniCAE studies a unified paradigm for both affective comprehension and affective generation.

👉🏻 Affective Comprehension

Infer and reason about fine-grained sentiment and emotion from multimodal human signals: dialogue, behavior, speech, face, avatar, vision, and embodied context.

👉🏻 Affective Generation

Generate emotionally aligned text, speech, facial expressions, avatars, and 3D motion that feel coherent to humans.

Concept illustration of unified affective understanding and empathetic generation

Flagship Research

Track A

Cross-modal Affective Understanding & Reasoning

ACL 2023 Findings · Benchmark · Dialogue ABSA

DiaASQ: A Benchmark of Conversational Aspect-based Sentiment Quadruple Analysis

A first step from sentence-level aspect-based sentiment analysis (ABSA) toward dialogue-native, fine-grained conversational sentiment understanding.

  • Introduces conversational aspect-based sentiment quadruple analysis over multi-turn dialogue (quadruple structure sketched below).
  • Builds a bilingual Chinese-English benchmark with high-quality manual annotations.
  • Establishes end-to-end quadruple prediction with dialogue-specific discourse modeling.
Overview figure for DiaASQ
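
To make the task shape concrete, here is a minimal data-structure sketch of a conversational sentiment quadruple under the (target, aspect, opinion, sentiment) formulation above; the class and field names are illustrative assumptions, not the official DiaASQ schema or release format.

```python
from dataclasses import dataclass
from typing import List, Literal

@dataclass
class SentimentQuadruple:
    # Hypothetical fields mirroring the (target, aspect, opinion, sentiment) tuple.
    target: str                              # entity under discussion, e.g. a phone model
    aspect: str                              # attribute of the target, e.g. "battery life"
    opinion: str                             # opinion expression span, e.g. "drains too fast"
    sentiment: Literal["pos", "neg", "neu"]  # polarity of the opinion

@dataclass
class DialogueTurn:
    speaker: str
    utterance: str

@dataclass
class DialogueExample:
    turns: List[DialogueTurn]                # multi-turn conversation
    quadruples: List[SentimentQuadruple]     # elements of one quadruple may come from different turns
```

The dialogue-level setting is harder than sentence-level ABSA precisely because the four elements of one quadruple can be scattered across turns and speakers rather than co-occurring in a single sentence.
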
ACM MM 2024 Oral · Cross-modal · Dialogue Reasoning

PanoSent: A Panoptic Sextuple Extraction Benchmark for Cross-modal Conversational Aspect-based Sentiment Analysis

A broader formulation of sentiment reasoning that treats multimodality, rationales, and sentiment dynamics as first-class citizens.

  • Defines panoptic sentiment sextuple extraction and sentiment flipping analysis in multimodal conversation.
  • Curates a large-scale multilingual dataset with holder, target, aspect, opinion, sentiment, and rationale labels (sextuple sketched below).
  • Proposes Chain-of-Sentiment reasoning with the Sentica multimodal model and a verification mechanism.
Overview figure for PanoSent
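
As a rough illustration of the richer element set, the sketch below spells out the panoptic sextuple and a sentiment-flip record; the class and field names are hypothetical placeholders, not the dataset's actual annotation schema.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class SentimentSextuple:
    holder: str        # who expresses the sentiment
    target: str        # entity the sentiment is about
    aspect: str        # attribute of the target
    opinion: str       # opinion expression (may live in text, image, or audio)
    sentiment: str     # e.g. "positive" / "negative" / "neutral"
    rationale: str     # evidence explaining why the sentiment holds

@dataclass
class SentimentFlip:
    # Sentiment dynamics: the same holder/target pair before and after a change of opinion.
    before: SentimentSextuple
    after: SentimentSextuple
    trigger: Optional[str] = None  # what caused the flip, if identifiable
```
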

Track B

Cross-modal Affective & Empathetic Generation

ACL 2024 Demo · Open-source System · Avatar Chatbot

EmpathyEar: An Open-source Avatar Multimodal Empathetic Chatbot

An open multimodal empathetic chatbot that turns text-only empathetic response generation (ERG) into embodied avatar interaction.

  • Accepts text, speech, and vision inputs in flexible combinations.
  • Produces empathetic responses with text, a talking face, and synchronized speech (I/O interface sketched below).
  • Uses emotion-aware instruction tuning for deeper emotional resonance and human-like response quality.
Overview figure for EmpathyEar
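
A minimal interface sketch of this kind of avatar empathetic chatbot, assuming flexible multimodal input and text + speech + talking-face output as described above; the class and method names are illustrative, not the repository's actual API.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class UserInput:
    # Any combination of modalities may be present.
    text: Optional[str] = None
    speech_wav: Optional[bytes] = None
    image: Optional[bytes] = None

@dataclass
class EmpatheticResponse:
    reply_text: str            # empathetic text reply
    reply_speech_wav: bytes    # synthesized speech aligned with the reply
    talking_face_video: bytes  # avatar video synchronized with the speech

class AvatarEmpatheticChatbot:
    def respond(self, user_input: UserInput) -> EmpatheticResponse:
        """Infer the user's emotional state from whichever modalities are present,
        then return an empathetic reply as text, speech, and a talking-face video."""
        raise NotImplementedError
```
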
WWW 2025 · Benchmark · Text-Speech-Vision

Towards Multimodal Empathetic Response Generation: A Rich Text-Speech-Vision Avatar-based Benchmark

A benchmark and system for end-to-end multimodal empathetic response generation with authentic speech and avatar video.

  • Introduces AvaMERG for avatar-based multimodal ERG with diverse profiles and real-world scenarios.
  • Builds Empatheia on top of an MLLM with multimodal encoder, speech generator, and avatar generator.
  • Adds Chain-of-Empathetic reasoning and empathy-enhanced tuning for emotional accuracy and cross-modal consistency (staged pipeline sketched below).
Overview figure for the multimodal empathetic response generation benchmark
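
The staged, reason-then-generate pattern can be sketched roughly as below; every component name is a placeholder standing in for the corresponding module (emotion reasoning, response generation, speech synthesis, avatar rendering), not the Empatheia implementation.

```python
def empathetic_respond(dialogue_context, emotion_reasoner, responder, tts, avatar_renderer):
    # Step 1: infer the user's emotional state and its likely cause from the multimodal context.
    emotion, cause = emotion_reasoner(dialogue_context)
    # Step 2: generate an empathetic text reply conditioned on the inferred emotion and cause.
    reply_text = responder(dialogue_context, emotion=emotion, cause=cause)
    # Step 3: synthesize speech that carries the intended emotional tone.
    reply_speech = tts(reply_text, emotion=emotion)
    # Step 4: render a talking-face avatar video synchronized with the speech.
    reply_video = avatar_renderer(reply_speech, emotion=emotion)
    return reply_text, reply_speech, reply_video
```

Making the emotion inference explicit before generation is what lets the later modalities (speech tone, facial expression) stay consistent with the text reply.
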
EMNLP 2025 Oral · 3D Expression · Text-to-Face

When Words Smile: Generating Diverse Emotional Facial Expressions from Text

Emotional facial expression generation as a missing piece beyond lip synchronization for digital humans.

  • Learns a continuous latent space for diverse, fluid, and emotionally coherent facial expressions (sampling sketched below).
  • Introduces EmoAva, a dataset of 15,000 text-to-3D-expression pairs.
  • Advances text-driven expressive avatar generation for dialogue, gaming, and interactive agents.
Overview figure for emotional 3D avatar generation
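
A minimal sketch of what sampling from a continuous, text-conditioned expression latent space could look like; the encoder/decoder interfaces and the per-frame 3D expression coefficient output are assumptions for illustration, not the released pipeline.

```python
import numpy as np

def sample_expression_sequences(text, text_encoder, expression_decoder,
                                latent_dim=64, num_samples=3, rng=None):
    """Draw several latent codes for one text prompt to obtain diverse expression sequences."""
    rng = rng or np.random.default_rng()
    condition = text_encoder(text)                # text condition vector (assumed interface)
    sequences = []
    for _ in range(num_samples):
        z = rng.standard_normal(latent_dim)       # different z -> diverse expressions for the same text
        # Decoder maps (condition, latent code) to a sequence of per-frame 3D expression coefficients.
        sequences.append(expression_decoder(condition, z))
    return sequences
```
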
NeurIPS 2025 · Motion LLM · Video RAG

VimoRAG: Video-based Retrieval-augmented 3D Motion Generation for Motion Language Models

Retrieval-augmented motion generation that leverages in-the-wild video as emotional and behavioral grounding for 3D motion.

  • Addresses out-of-domain and out-of-vocabulary failure modes in motion language models.
  • Retrieves motion-centered 2D video evidence with a Gemini Motion Video Retriever.
  • Improves robustness with a motion-centric dual-alignment DPO trainer for generation conditioned on retrieved evidence (retrieve-then-generate loop sketched below).
Overview figure for VimoRAG
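
The retrieve-then-generate loop can be sketched as follows, assuming a generic video retriever and a motion language model with a conditional generate call; the function names and signatures are placeholders, not the released VimoRAG API.

```python
def retrieve_motion_videos(prompt, video_index, top_k=3):
    """Rank candidate in-the-wild videos by how well their motion content matches the prompt."""
    ranked = sorted(video_index.videos,
                    key=lambda video: video_index.score(prompt, video),
                    reverse=True)
    return ranked[:top_k]

def generate_motion(prompt, motion_lm, video_index):
    """Retrieve motion-centered video evidence, then condition the motion language model on it."""
    evidence = retrieve_motion_videos(prompt, video_index)
    return motion_lm.generate(prompt=prompt, video_evidence=evidence)
```

The retrieved 2D video serves as grounding for prompts the motion language model has never seen, which is how the approach targets out-of-domain and out-of-vocabulary failures.
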

Community

Survey · In Progress · UniCAE Roadmap

A Survey on Unified Cross-modal Affective Comprehension and Generation

A living survey that systematizes how future emotionally intelligent AI should jointly comprehend, reason about, and generate affective content across modalities.

  • Scope: language, speech, face, avatar, multimodal dialogue, expressive 3D, and embodied empathy.
  • Focus: unified task taxonomies, datasets, benchmarks, models, evaluation, and open challenges.
  • Status: paper in progress, intended as a roadmap for the UniCAE thread.

Unified Affective Comprehension & Generation