---
arxiv_id: 2603.25728
title: "PixelSmile: Toward Fine-Grained Facial Expression Editing"
authors:
  - Jiabin Hua
  - Hengyuan Xu
  - Aojie Li
  - Wei Cheng
  - Gang Yu
  - Xingjun Ma
  - Yu-Gang Jiang
difficulty: Intermediate
tags:
  - Vision
  - Diffusion
published_at: 2026-03-26
flecto_url: https://flecto.zer0ai.dev/papers/2603.25728/
lang: en
---

Jiabin Hua 1,2,* · Hengyuan Xu 1,2,* · Aojie Li 2,† · Wei Cheng 2 · Gang Yu 2,‡ · Xingjun Ma 1,‡ · Yu-Gang Jiang 1

* Equal contribution  · † Project lead  · ‡ Corresponding authors

arXiv:2603.25728v1 [cs.CV]  ·  26 Mar 2026

## Abstract

> Fine-grained facial expression editing has long been limited by intrinsic semantic overlap. To address this, we construct the Flex Facial Expression (FFE) dataset with continuous affective annotations and establish FFE-Bench to evaluate structural confusion, editing accuracy, linear controllability, and the trade-off between expression editing and identity preservation. We propose PixelSmile, a diffusion framework that disentangles expression semantics via fully symmetric joint training. PixelSmile combines intensity supervision with contrastive learning to produce stronger and more distinguishable expressions, achieving precise and stable linear expression control through textual latent interpolation. Extensive experiments demonstrate that PixelSmile achieves superior disentanglement and robust identity preservation, confirming its effectiveness for continuous, controllable, and fine-grained expression editing, while naturally supporting smooth expression blending.

#### What is "semantic overlap" in facial expressions?

Many emotions share physical cues — "fear" and "surprise" both involve wide eyes and raised brows. This structural overlap means models trained on discrete emotion labels confuse these categories: generating "fear" may accidentally produce "surprise" instead. PixelSmile treats this not as a labeling error but as a fundamental geometry problem in expression space, requiring explicit disentanglement rather than just better classifiers.


### Systematic Analysis of Semantic Overlap

We reveal and formalize the structured semantic overlap between facial expressions, demonstrating that structured semantic overlap—rather than purely classification error—is a primary cause of failures in both recognition and generative editing tasks.


### Dataset & Benchmark (FFE + FFE-Bench)

A large-scale cross-domain collection featuring 12 expression categories with continuous affective annotations. Multi-dimensional evaluation for structural confusion, expression editing accuracy, linear controllability, and identity preservation.


### PixelSmile Framework

Novel diffusion-based framework utilizing fully symmetric joint training and textual latent interpolation. Effectively disentangles overlapping emotions and enables disentangled, linearly controllable expression editing.

## 3. Dataset & Benchmark

### The FFE Dataset

FFE is constructed through a four-stage collect–compose–generate–annotate pipeline designed to ensure expression diversity, cross-domain coverage, and reliable annotations. The final dataset contains 60,000 images across real and anime domains.

#### Continuous vs. discrete labels

Traditional emotion datasets assign a single label ("happy" or "angry"). FFE instead assigns a 12-dimensional vector where every dimension is a real number in [0, 1], e.g., a face can be 0.8 happy and 0.3 surprised simultaneously. This enables the model to learn a smooth emotion manifold rather than hard class boundaries.
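The difference between discrete and continuous supervision can be sketched in a few lines. Note this is an illustration only: the category names below are assumptions, since the summary does not enumerate FFE's 12 labels.

```python
# Hypothetical 12-category schema; names and ordering are illustrative
# assumptions, not FFE's actual annotation format.
CATEGORIES = [
    "happy", "sad", "angry", "fearful", "surprised", "disgusted",
    "contempt", "confused", "bored", "shy", "proud", "neutral",
]

def one_hot(label: str) -> list[float]:
    """Discrete supervision: exactly one dimension is active."""
    return [1.0 if c == label else 0.0 for c in CATEGORIES]

def continuous(scores: dict[str, float]) -> list[float]:
    """FFE-style supervision: every dimension is a real value in [0, 1]."""
    vec = [scores.get(c, 0.0) for c in CATEGORIES]
    assert all(0.0 <= v <= 1.0 for v in vec), "scores must lie in [0, 1]"
    return vec

# A face that is strongly happy and mildly surprised at the same time.
v = continuous({"happy": 0.8, "surprised": 0.3})
```

Because every annotation is a point in [0, 1]^12 rather than a corner of a simplex, nearby expressions occupy nearby points, which is what makes a smooth emotion manifold learnable.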

### FFE-Bench Evaluation

- 📉 mSCR (Mean Structural Confusion Rate): quantifies cross-category confusion between semantically similar expressions. Lower is better.

- 🎯 HES (Harmonic Editing Score): HES = 2 × S_E × S_ID / (S_E + S_ID). Balances expression strength and identity preservation. Higher is better.

- 📏 CLS (Control Linearity Score): Pearson correlation between α and VLM-predicted intensity. Higher indicates more predictable control.

- ✅ Acc (Expression Editing Accuracy): proportion of generated images whose predicted dominant expression matches the target instruction.
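Two of these metrics are simple enough to compute directly from their stated definitions. The sketch below assumes scores are plain floats; variable names (`s_e`, `s_id`, `alphas`, `intensities`) are illustrative, not FFE-Bench's actual API.

```python
from statistics import mean

def hes(s_e: float, s_id: float) -> float:
    """Harmonic Editing Score: HES = 2 * S_E * S_ID / (S_E + S_ID)."""
    return 2 * s_e * s_id / (s_e + s_id)

def cls(alphas: list[float], intensities: list[float]) -> float:
    """Control Linearity Score: Pearson correlation between the control
    coefficient alpha and the predicted expression intensity."""
    ma, mi = mean(alphas), mean(intensities)
    cov = sum((a - ma) * (i - mi) for a, i in zip(alphas, intensities))
    std_a = sum((a - ma) ** 2 for a in alphas) ** 0.5
    std_i = sum((i - mi) ** 2 for i in intensities) ** 0.5
    return cov / (std_a * std_i)
```

The harmonic mean in HES is deliberately unforgiving: a model that maximizes expression strength while destroying identity (or vice versa) scores near zero, so the metric rewards only balanced trade-offs.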

## 4. Method — PixelSmile Framework

### Textual Latent Interpolation

Performs linear interpolation in textual latent space between the neutral and target embeddings:

e(α) = e_neu + α · (e_target − e_neu)

The resulting continuous conditioning embedding enables precise and smooth expression manipulation at inference time without requiring reference images. Setting α > 1 supports extrapolation for stronger expression transfer.

#### How textual latent interpolation works

The diffusion model is conditioned on a text embedding vector e. PixelSmile computes two embeddings: one for "neutral" and one for the target emotion. The conditioning vector is then a weighted blend: e_neu + α × (e_target − e_neu). At α = 0 the face is neutral; at α = 1 it shows the full target expression; at α = 1.5 the emotion is exaggerated. This is all done in the model's internal text-embedding space — no reference face image is needed.
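The blend above is one line of vector arithmetic. A minimal sketch, assuming embeddings are plain float vectors (stand-ins for the frozen text encoder's outputs, not the paper's actual API):

```python
def interpolate(e_neu: list[float], e_tgt: list[float], alpha: float) -> list[float]:
    """Textual latent interpolation: e(alpha) = e_neu + alpha * (e_tgt - e_neu).

    alpha = 0   -> neutral embedding
    alpha = 1   -> full target expression
    alpha > 1   -> extrapolation (exaggerated expression)
    """
    return [n + alpha * (t - n) for n, t in zip(e_neu, e_tgt)]
```

Because the same formula is used for every α, sweeping α from 0 to 1 traces a straight line in embedding space, which is exactly the property the CLS metric measures on the image side.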

### Fully Symmetric Joint Training

During training, PixelSmile samples a confusing expression pair (E_a, E_b) and applies a symmetric contrastive loss, enforcing bidirectional separation of overlapping expressions with an InfoNCE-style objective (τ = 0.07).

#### Why "symmetric" training matters

Naively, you could train the model to push "fear" away from "surprise." But without symmetry, the model might just improve on fear while ignoring the reverse — leaving surprise still confused with fear. Symmetric training simultaneously optimizes both directions: fear images push away from surprise, and surprise images push away from fear. The ablation (Fig. 8) confirms that removing symmetry causes expression confusion even when the contrastive loss is still present.
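The "both directions" idea can be made concrete with a toy symmetric InfoNCE. This is a hedged sketch of the general technique, not the paper's exact loss: each class serves as anchor in its own term, so neither direction can be optimized while the other is ignored.

```python
import math

def cosine(u: list[float], v: list[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def info_nce(anchor, positive, negatives, tau=0.07):
    """One direction: pull anchor toward its positive, push it from negatives."""
    pos = math.exp(cosine(anchor, positive) / tau)
    neg = sum(math.exp(cosine(anchor, n) / tau) for n in negatives)
    return -math.log(pos / (pos + neg))

def symmetric_loss(feat_a, pos_a, feat_b, pos_b, tau=0.07):
    """Both directions: E_a pushed away from E_b, AND E_b pushed away from E_a."""
    return (info_nce(feat_a, pos_a, [feat_b], tau)
            + info_nce(feat_b, pos_b, [feat_a], tau))
```

Dropping the second term recovers the asymmetric failure mode described above: the gradient only ever separates one class from the other.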

### Identity Preservation

A frozen ArcFace model serves as the identity encoder Φ_arc, whose embeddings anchor an identity loss. This stabilizes biometric features under strong expression extrapolation, preventing hairstyle and skin-texture drift.

#### ArcFace and cosine identity loss

ArcFace is a face recognition model that maps any face image into a high-dimensional embedding space where images of the same person land close together regardless of expression. The identity loss penalizes the cosine distance between the embeddings of the generated and input faces, so editing "happy" onto a face cannot inadvertently shift identity-bearing attributes such as face shape or skin texture without incurring a large penalty.
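The loss itself is tiny. A minimal sketch, where the input vectors stand in for outputs of the frozen Φ_arc encoder (the real ArcFace maps an image to a high-dimensional unit vector):

```python
import math

def cosine_identity_loss(emb_generated: list[float], emb_input: list[float]) -> float:
    """L_id = 1 - cos(Phi_arc(x_gen), Phi_arc(x_in)); 0 when embeddings align."""
    dot = sum(a * b for a, b in zip(emb_generated, emb_input))
    norm = (math.sqrt(sum(a * a for a in emb_generated))
            * math.sqrt(sum(b * b for b in emb_input)))
    return 1.0 - dot / norm
```

Keeping the encoder frozen is the key design choice: the generator cannot "cheat" by warping the identity space itself, only by actually preserving the face.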

### Overall Training Objective

The overall objective combines the diffusion loss with the symmetric contrastive and identity terms, where λ_sc controls the trade-off between expression disentanglement and identity preservation. The model is trained on 4 NVIDIA H200 GPUs with LoRA (rank 64, α = 128).

## 5. Experiments & Results

#### Key metrics at a glance

- mSCR (lower is better): How often the model confuses one expression for another. PixelSmile achieves 0.055, far below prior work.

- HES (higher is better): Harmonic mean of expression strength and identity similarity — a single number that collapses the quality/identity trade-off.

- CLS (higher is better): Pearson correlation measuring whether α=0.3 really produces 30% intensity, α=0.7 really produces 70%, etc. A perfect linear controller would score 1.0.

### 5.2 Quantitative Evaluation

Table 1. Quantitative Evaluation of General Editing Models. PixelSmile achieves the lowest mSCR (0.0550) and highest Acc-6 (0.8627).

Table 2. Quantitative Evaluation of Linear Control Models. PixelSmile achieves best CLS-6 (0.8078), CLS-12 (0.7305), and HES (0.4723).

### 5.3 Qualitative Comparison

### 5.4 Ablation Study

### 5.5 User Study

#### What "LoRA rank 64" means for training efficiency

LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning technique that freezes the original large model and adds small trainable matrices of rank r. Rank 64 with α 128 is a moderately large LoRA — it captures enough expressiveness for complex expression editing while keeping added parameters at roughly 1% of the full model, enabling training on just 4 GPUs without catastrophic forgetting of the base model's image generation capability.
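A back-of-envelope check of that parameter claim, for a single weight matrix. The layer dimensions below are illustrative assumptions, not the paper's actual architecture; the overall added-parameter fraction is lower still because LoRA typically touches only a subset of layers.

```python
def lora_param_ratio(d_in: int, d_out: int, rank: int) -> float:
    """LoRA adds A (d_in x r) and B (r x d_out) beside a frozen W (d_in x d_out).

    Returns added params / frozen params for one adapted layer.
    """
    full = d_in * d_out
    added = rank * (d_in + d_out)
    return added / full

# Rank-64 adapter on a hypothetical 4096x4096 attention projection:
ratio = lora_param_ratio(4096, 4096, 64)   # a few percent per adapted layer
```

The α = 128 hyperparameter is a scaling factor: LoRA outputs are multiplied by α / r (here 128 / 64 = 2), which decouples the update magnitude from the chosen rank.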

### 5.6 Expression Blend

#### Why do only 9 of 15 blends work?

With 6 basic emotions, there are 15 pairwise combinations. Some pairs (like "happy+surprised" = excited) produce semantically coherent compound expressions. Others (like "disgust+happy") are physically or psychologically contradictory — the face muscles required are antagonistic. The fact that 9/15 succeed is evidence that PixelSmile has learned a geometrically meaningful emotion manifold — blending is an emergent capability not explicitly trained for.
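If blending reuses the textual latent interpolation machinery, a two-expression blend is a natural extension to two target directions. The blend rule below is an assumption for illustration, not the paper's stated formulation:

```python
def blend(e_neu: list[float], e_a: list[float], e_b: list[float],
          alpha_a: float, alpha_b: float) -> list[float]:
    """Hypothetical compound-expression blend:
    e = e_neu + alpha_a * (e_a - e_neu) + alpha_b * (e_b - e_neu)."""
    return [n + alpha_a * (a - n) + alpha_b * (b - n)
            for n, a, b in zip(e_neu, e_a, e_b)]

# e.g. 6 basic emotions -> C(6, 2) = 15 candidate pairs
pairs = 6 * 5 // 2
```

Under this view, a blend fails exactly when the two direction vectors point toward mutually exclusive regions of the learned manifold, matching the muscular-antagonism intuition above.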

## 6. Conclusion

In this paper, we present PixelSmile, a framework for addressing semantic entanglement in facial expression editing. By shifting from discrete supervision to the continuous expression manifold defined by FFE and evaluated through FFE-Bench, our approach enables precise and linearly controllable editing via symmetric joint training. Extensive experiments demonstrate the effectiveness of PixelSmile across four dimensions: structural confusion, expression accuracy, linear controllability, and identity preservation. Overall, this work establishes a standardized framework for fine-grained facial expression editing and advances research toward continuous and compositional facial affect manipulation.
