*Equal contribution · †Project lead · ‡Corresponding authors
Fine-grained facial expression editing has long been limited by intrinsic semantic overlap. To address this, we construct the Flex Facial Expression (FFE) dataset with continuous affective annotations and establish FFE-Bench to evaluate structural confusion, editing accuracy, linear controllability, and the trade-off between expression editing and identity preservation. We propose PixelSmile, a diffusion framework that disentangles expression semantics via fully symmetric joint training. PixelSmile combines intensity supervision with contrastive learning to produce stronger and more distinguishable expressions, achieving precise and stable linear expression control through textual latent interpolation. Extensive experiments demonstrate that PixelSmile achieves superior disentanglement and robust identity preservation, confirming its effectiveness for continuous, controllable, and fine-grained expression editing, while naturally supporting smooth expression blending.
Many emotions share physical cues — "fear" and "surprise" both involve wide eyes and raised brows. This structural overlap means models trained on discrete emotion labels confuse these categories: generating "fear" may accidentally produce "surprise" instead. PixelSmile treats this not as a labeling error but as a fundamental geometry problem in expression space, requiring explicit disentanglement rather than just better classifiers.
We reveal and formalize the structured semantic overlap between facial expressions, demonstrating that this overlap, rather than pure classification error, is a primary cause of failures in both recognition and generative editing tasks.
A large-scale cross-domain collection featuring 12 expression categories with continuous affective annotations, paired with a multi-dimensional evaluation of structural confusion, expression editing accuracy, linear controllability, and identity preservation.
Novel diffusion-based framework utilizing fully symmetric joint training and textual latent interpolation. Effectively disentangles overlapping emotions and enables disentangled, linearly controllable expression editing.
FFE is constructed through a four-stage collect–compose–generate–annotate pipeline designed to ensure expression diversity, cross-domain coverage, and reliable annotations. The final dataset contains 60,000 images across real and anime domains.
Traditional emotion datasets assign a single label ("happy" or "angry"). FFE instead assigns a 12-dimensional vector where every dimension is a real number in [0, 1], e.g., a face can be 0.8 happy and 0.3 surprised simultaneously. This enables the model to learn a smooth emotion manifold rather than hard class boundaries.
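To make the annotation format concrete, here is a minimal sketch of what such a record could look like; the field names and category indices are hypothetical, not the released FFE schema:

```python
import numpy as np

# Hypothetical FFE-style record; field names and index assignment are
# illustrative, not the released schema.
affect = np.zeros(12)           # one entry per expression category
HAPPY, SURPRISE = 0, 4          # made-up indices for this sketch
affect[HAPPY] = 0.8
affect[SURPRISE] = 0.3          # several dimensions can be active at once

record = {"image": "real/000123.png", "affect": affect}

# Continuous labels describe "mostly happy, slightly surprised" directly,
# instead of forcing a single hard class label.
print(record["affect"].round(2))
```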
Mean Structural Confusion Rate
Quantifies cross-category confusion between semantically similar expressions. Lower is better.
Harmonic Editing Score
HES = 2·S_E·S_ID / (S_E + S_ID). Balances expression strength (S_E) and identity preservation (S_ID). Higher is better.
Control Linearity Score
Pearson correlation between α and VLM-predicted intensity. Higher indicates more predictable control.
Expression Editing Accuracy
Proportion of generated images whose predicted dominant expression matches the target instruction.
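For concreteness, a minimal sketch of how HES and CLS could be computed, assuming the expression score S_E, identity score S_ID, and VLM intensity predictions are already available; all numeric inputs below are made up:

```python
import numpy as np
from scipy.stats import pearsonr

def hes(s_e: float, s_id: float) -> float:
    """Harmonic Editing Score: harmonic mean of the expression score S_E
    and the identity-preservation score S_ID."""
    return 2 * s_e * s_id / (s_e + s_id) if (s_e + s_id) > 0 else 0.0

def cls_score(alphas, vlm_intensity) -> float:
    """Control Linearity Score: Pearson correlation between the control
    coefficient alpha and the VLM-predicted expression intensity."""
    r, _ = pearsonr(alphas, vlm_intensity)
    return r

# Made-up numbers for illustration only.
alphas = np.linspace(0.0, 1.0, 6)
intensity = np.array([0.02, 0.21, 0.39, 0.62, 0.78, 0.97])
print(f"CLS = {cls_score(alphas, intensity):.3f}")
print(f"HES = {hes(0.81, 0.33):.3f}")
```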
Performs linear interpolation in textual latent space:
e_cond(α) = e_neu + α · Δe,  α ∈ [0, 1]
Continuous conditioning embedding enables precise and smooth expression manipulation at inference time without requiring reference images. α > 1 supports extrapolation for stronger expression transfer.
The diffusion model is conditioned on a text embedding vector e. PixelSmile computes two embeddings: one for "neutral" and one for the target emotion. The conditioning vector is then a weighted blend: e_neu + α × (e_target − e_neu). At α=0 the face is neutral; at α=1 it shows the full target expression; at α=1.5 the emotion is exaggerated. This is all done in the model's internal text-embedding space — no reference face image is needed.
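A minimal sketch of this interpolation, assuming a CLIP-style text encoder and illustrative prompts; the paper's actual text backbone and pooling strategy may differ:

```python
import torch
from transformers import CLIPTextModel, CLIPTokenizer

# Assumed encoder for illustration; PixelSmile's backbone may differ.
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

@torch.no_grad()
def embed(prompt: str) -> torch.Tensor:
    tokens = tokenizer(prompt, padding="max_length", truncation=True,
                       return_tensors="pt")
    return encoder(**tokens).last_hidden_state  # (1, seq_len, dim)

e_neu = embed("a photo of a person with a neutral expression")
e_tgt = embed("a photo of a person with a happy expression")
delta = e_tgt - e_neu

def e_cond(alpha: float) -> torch.Tensor:
    # alpha = 0 -> neutral, alpha = 1 -> full target expression,
    # alpha > 1 -> extrapolation for stronger expression transfer.
    return e_neu + alpha * delta

conditioning = e_cond(0.5)  # halfway between neutral and happy
```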
Samples a confusing expression pair (E_a, E_b). Symmetric contrastive loss:
ℒ_SC = ½[𝒯(G_a, P_a, N_surp) + 𝒯(G_b, P_b, N_fear)]
Enforces bidirectional separation of overlapping expressions with an InfoNCE-style objective (τ = 0.07).
Naively, you could train the model to push "fear" away from "surprise." But without symmetry, the model might just improve on fear while ignoring the reverse — leaving surprise still confused with fear. Symmetric training simultaneously optimizes both directions: fear images push away from surprise, and surprise images push away from fear. The ablation (Fig. 8) confirms that removing symmetry causes expression confusion even when the contrastive loss is still present.
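A minimal sketch of a symmetric InfoNCE-style objective, assuming embeddings for generations (G), positives (P), and confusing-class negatives (N) are already extracted; the two-logit form below (one positive, one negative per anchor) is a simplification of a full InfoNCE with many negatives:

```python
import torch
import torch.nn.functional as F

TAU = 0.07  # temperature, as stated above

def info_nce(anchor, positive, negative, tau=TAU):
    """InfoNCE-style term T(G, P, N): pull the generated embedding toward
    its positive and push it away from the confusing negative."""
    anchor, positive, negative = (F.normalize(x, dim=-1)
                                  for x in (anchor, positive, negative))
    logits = torch.stack([(anchor * positive).sum(-1),
                          (anchor * negative).sum(-1)], dim=-1) / tau
    # The positive always sits at index 0 of the logits.
    return F.cross_entropy(logits, torch.zeros(len(anchor), dtype=torch.long))

def symmetric_contrastive_loss(g_a, p_a, neg_a, g_b, p_b, neg_b):
    """L_SC = 0.5 * [T(G_a, P_a, N_b) + T(G_b, P_b, N_a)]: both directions
    of the confusing pair are separated at once, e.g. fear pushed away from
    surprise AND surprise pushed away from fear."""
    return 0.5 * (info_nce(g_a, p_a, neg_a) + info_nce(g_b, p_b, neg_b))
```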
ArcFace frozen as identity encoder Φ_arc:
ℒ_ID = ½ Σ_i [1 − cos(Φ_arc(G_i), Φ_arc(P_i))]
Stabilizes biometric features under strong expression extrapolation, preventing hairstyle and skin texture drift.
ArcFace is a face recognition model that maps any face image into a high-dimensional embedding vector where the same person always maps close together regardless of expression. The identity loss penalizes cosine distance between generated and target face embeddings, so editing "happy" onto a face cannot inadvertently change the hair color, skin tone, or face shape — features that ArcFace is highly sensitive to.
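A minimal sketch of the identity loss, assuming a frozen ArcFace encoder phi_arc is given; this version takes a batch mean where the formula writes ½ Σ, and detaching the reference branch is a design choice of the sketch:

```python
import torch
import torch.nn.functional as F

def identity_loss(phi_arc, generated, paired):
    """L_ID: average (1 - cosine similarity) between frozen-ArcFace
    embeddings of each generated image G_i and its paired image P_i.
    phi_arc is assumed to return (B, 512) embedding vectors."""
    z_gen = phi_arc(generated)           # gradients flow back to the generator
    with torch.no_grad():
        z_ref = phi_arc(paired)          # reference identity, detached
    return (1.0 - F.cosine_similarity(z_gen, z_ref, dim=-1)).mean()
```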
ℒ_total = ½(ℒ_FM^a + ℒ_FM^b) + λ_sc·ℒ_SC + λ_id·ℒ_ID
λ_sc controls the trade-off between expression disentanglement and identity preservation. Trained on 4 NVIDIA H200 GPUs with LoRA (rank 64, α 128).
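Assembling the pieces, the total objective could look like the sketch below, assuming ℒ_FM denotes a per-branch flow-matching term; the λ values are placeholders, not the paper's settings:

```python
# Hypothetical loss weights; the paper does not state the values here.
lambda_sc, lambda_id = 0.1, 0.5

def total_loss(loss_fm_a, loss_fm_b, loss_sc, loss_id):
    """L_total = 0.5*(L_FM^a + L_FM^b) + lambda_sc*L_SC + lambda_id*L_ID,
    with one flow-matching term per branch of the symmetric pair (assumed)."""
    return (0.5 * (loss_fm_a + loss_fm_b)
            + lambda_sc * loss_sc + lambda_id * loss_id)
```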
Table 1. Quantitative Evaluation of General Editing Models. PixelSmile achieves the lowest mSCR (0.0550) and highest Acc-6 (0.8627).
Table 2. Quantitative Evaluation of Linear Control Models. PixelSmile achieves best CLS-6 (0.8078), CLS-12 (0.7305), and HES (0.4723).
LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning technique that freezes the original large model and adds small trainable matrices of rank r. Rank 64 with α 128 is a moderately large LoRA — it captures enough expressiveness for complex expression editing while keeping added parameters at roughly 1% of the full model, enabling training on just 4 GPUs without catastrophic forgetting of the base model's image generation capability.
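A sketch of the stated LoRA setup using the peft library; the target modules listed are typical attention projections for a diffusion backbone and are an assumption, not the paper's exact list:

```python
from peft import LoraConfig

# LoRA setup as stated: rank 64, alpha 128. The target modules below are
# common attention projections in diffusion backbones and are an
# assumption, not the paper's exact configuration.
lora_config = LoraConfig(
    r=64,
    lora_alpha=128,
    target_modules=["to_q", "to_k", "to_v", "to_out.0"],
)
# get_peft_model(base_model, lora_config) would then freeze the backbone
# and train only the low-rank adapter matrices.
```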
With 6 basic emotions, there are 15 pairwise combinations. Some pairs (like "happy+surprised" = excited) produce semantically coherent compound expressions. Others (like "disgust+happy") are physically or psychologically contradictory — the face muscles required are antagonistic. The fact that 9/15 succeed is evidence that PixelSmile has learned a geometrically meaningful emotion manifold — blending is an emergent capability not explicitly trained for.
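One plausible way such blending could fall out of the interpolation scheme is to add two expression directions at once; this composition rule is our assumption, not a confirmed detail of PixelSmile:

```python
import torch

def e_blend(e_neu, deltas, alphas):
    """Hypothetical compound conditioning: start from the neutral embedding
    and add several expression directions at once, reusing per-expression
    deltas of the kind computed in the interpolation sketch above."""
    out = e_neu.clone()
    for delta, alpha in zip(deltas, alphas):
        out = out + alpha * delta
    return out

# e.g. e_blend(e_neu, [delta_happy, delta_surprise], [0.7, 0.5])
# for a "happy + surprised" (excited) compound expression.
```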
In this paper, we present PixelSmile, a framework for addressing semantic entanglement in facial expression editing. By shifting from discrete supervision to the continuous expression manifold defined by FFE and evaluated through FFE-Bench, our approach enables precise and linearly controllable editing via symmetric joint training. Extensive experiments demonstrate the effectiveness of PixelSmile across four dimensions: structural confusion, expression accuracy, linear controllability, and identity preservation. Overall, this work establishes a standardized framework for fine-grained facial expression editing and advances research toward continuous and compositional facial affect manipulation.