
😁 PixelSmile: Toward Fine-Grained Facial Expression Editing

Jiabin Hua1,2,*  Hengyuan Xu1,2,*  Aojie Li2,†  Wei Cheng2  Gang Yu2,‡  Xingjun Ma1,‡  Yu-Gang Jiang1

*Equal contribution  ·  †Project lead  ·  ‡Corresponding authors

1 Fudan University 2 StepFun

arXiv:2603.25728v1 [cs.CV]  ·  26 Mar 2026

Overview of PixelSmile showing fine-grained expression editing, extended expression categories, and expression blending capabilities
Figure 1. Overview of PixelSmile. It enables 1) continuous and precise control of facial expression intensity across real-world and anime domains, 2) editing across 12 distinct expression categories, and 3) seamless blending of multiple expressions.

Abstract

Fine-grained facial expression editing has long been limited by intrinsic semantic overlap. To address this, we construct the Flex Facial Expression (FFE) dataset with continuous affective annotations and establish FFE-Bench to evaluate structural confusion, editing accuracy, linear controllability, and the trade-off between expression editing and identity preservation. We propose PixelSmile, a diffusion framework that disentangles expression semantics via fully symmetric joint training. PixelSmile combines intensity supervision with contrastive learning to produce stronger and more distinguishable expressions, achieving precise and stable linear expression control through textual latent interpolation. Extensive experiments demonstrate that PixelSmile achieves superior disentanglement and robust identity preservation, confirming its effectiveness for continuous, controllable, and fine-grained expression editing, while naturally supporting smooth expression blending.

What is "semantic overlap" in facial expressions?

Many emotions share physical cues — "fear" and "surprise" both involve wide eyes and raised brows. This structural overlap means models trained on discrete emotion labels confuse these categories: generating "fear" may accidentally produce "surprise" instead. PixelSmile treats this not as a labeling error but as a fundamental geometry problem in expression space, requiring explicit disentanglement rather than just better classifiers.

Contributions

  • 🔍

    Systematic Analysis of Semantic Overlap

    We reveal and formalize the structured semantic overlap between facial expressions, demonstrating that this overlap, rather than pure classification error, is a primary cause of failures in both recognition and generative editing tasks.

  • 📊

    Dataset & Benchmark (FFE + FFE-Bench)

    A large-scale cross-domain collection featuring 12 expression categories with continuous affective annotations. Multi-dimensional evaluation for structural confusion, expression editing accuracy, linear controllability, and identity preservation.

  • 🤖

    PixelSmile Framework

    A novel diffusion-based framework built on fully symmetric joint training and textual latent interpolation. It separates overlapping emotions and enables disentangled, linearly controllable expression editing.

3. Dataset & Benchmark

Observation of Expression Semantic Overlap: human and model confusion with semantically adjacent emotions, and PixelSmile's solution via FFE dataset and symmetric training
Figure 2. Observation of Expression Semantic Overlap. Inherent expression overlap causes systematic confusion across human annotators, recognition models, and generative models (top). We resolve this via the FFE dataset (bottom left) and PixelSmile framework (bottom right), utilizing continuous supervision and symmetric training for disentangled editing.

The FFE Dataset

FFE is constructed through a four-stage collect–compose–generate–annotate pipeline designed to ensure expression diversity, cross-domain coverage, and reliable annotations. The final dataset contains 60,000 images across real and anime domains.

🏗 Base Identity Collection
~6,000 real portraits (diverse demographics) + ~6,000 anime portraits from 207 productions, covering 629 characters.
✍ Expression Prompt Composition
12-category taxonomy: 6 basic emotions + 6 extended (Confused, Contempt, Confident, Shy, Sleepy, Anxious), decomposed into facial attribute components.
🎨 Controlled Expression Generation
~60,000 images via Nano Banana Pro dual-part prompts (global expression + localized facial attributes).
📐 Continuous Annotation
12-dimensional continuous score vector v ∈ [0, 1]¹² predicted by Gemini 3 Pro. Human-verified subset for reliability.

Continuous vs. discrete labels

Traditional emotion datasets assign a single label ("happy" or "angry"). FFE instead assigns a 12-dimensional vector where every dimension is a real number in [0, 1], e.g., a face can be 0.8 happy and 0.3 surprised simultaneously. This enables the model to learn a smooth emotion manifold rather than hard class boundaries.
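A minimal sketch of what such a continuous annotation looks like in practice. The category names and ordering are assumptions for illustration (the FFE paper defines 6 basic + 6 extended categories, but the exact vector layout is not specified here):

```python
import numpy as np

# Hypothetical 12-category order: 6 basic emotions + 6 extended (as in FFE).
CATEGORIES = [
    "happy", "sad", "angry", "fear", "surprised", "disgust",
    "confused", "contempt", "confident", "shy", "sleepy", "anxious",
]

# A single FFE-style annotation: every dimension is a score in [0, 1].
v = np.zeros(len(CATEGORIES))
v[CATEGORIES.index("happy")] = 0.8
v[CATEGORIES.index("surprised")] = 0.3

# Unlike a discrete label, several dimensions can be active at once.
dominant = CATEGORIES[int(np.argmax(v))]
active = [c for c, s in zip(CATEGORIES, v) if s > 0.2]
print(dominant, active)  # happy ['happy', 'surprised']
```

The soft threshold (0.2 here) is arbitrary; the point is that the annotation is a point on a continuous manifold, not a one-hot class.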

FFE-Bench Evaluation

  • 📉

    mSCR

    Mean Structural Confusion Rate

    Quantifies cross-category confusion between semantically similar expressions. Lower is better.

  • 🎯

    HES

    Harmonic Editing Score

    HES = 2 · S_E · S_ID / (S_E + S_ID). Balances expression strength and identity preservation. Higher is better.

  • 📏

    CLS

    Control Linearity Score

    Pearson correlation between α and VLM-predicted intensity. Higher indicates more predictable control.

  • Acc

    Expression Editing Accuracy

    Proportion of generated images whose predicted dominant expression matches the target instruction.
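The HES definition above is a plain harmonic mean and can be sketched directly (variable names are illustrative, not from the paper's code):

```python
def hes(s_e: float, s_id: float) -> float:
    """Harmonic Editing Score: harmonic mean of expression strength (S_E)
    and identity similarity (S_ID). Zero if either component is zero."""
    if s_e + s_id == 0:
        return 0.0
    return 2 * s_e * s_id / (s_e + s_id)

# The harmonic mean punishes imbalance: a model that maxes out expression
# strength while destroying identity still scores poorly.
print(round(hes(0.9, 0.9), 4))  # 0.9
print(round(hes(0.9, 0.1), 4))  # 0.18
```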

4. Method — PixelSmile Framework

PixelSmile framework overview: inference stage with textual latent interpolation and training stage with fully symmetric joint training
Figure 3. Framework Overview. (1) Inference Stage: We interpolate between the neutral and target expression embeddings in textual latent space using a controllable coefficient α, enabling continuous adjustment of expression intensity. (2) Training Stage: We adopt a joint fully symmetric training framework using a symmetric contrastive objective, identity loss, and flow matching loss.

Textual Latent Interpolation

Performs linear interpolation in textual latent space:

e_cond(α) = e_neu + α · Δe,  Δe = e_tgt − e_neu,  α ∈ [0, 1]

Continuous conditioning embedding enables precise and smooth expression manipulation at inference time without requiring reference images. α > 1 supports extrapolation for stronger expression transfer.

How textual latent interpolation works

The diffusion model is conditioned on a text embedding vector e. PixelSmile computes two embeddings: one for "neutral" and one for the target emotion. The conditioning vector is then a weighted blend: e_neu + α × (e_target − e_neu). At α=0 the face is neutral; at α=1 it shows the full target expression; at α=1.5 the emotion is exaggerated. This is all done in the model's internal text-embedding space — no reference face image is needed.
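The blend described above is one line of vector arithmetic. A minimal sketch with toy embeddings standing in for real text-encoder outputs (the function name and dimensionality are assumptions):

```python
import numpy as np

def interpolate_condition(e_neu, e_tgt, alpha):
    """Blend neutral and target text embeddings: alpha=0 gives neutral,
    alpha=1 the full target expression, alpha>1 an extrapolated one."""
    return e_neu + alpha * (e_tgt - e_neu)

# Toy 4-d embeddings in place of the model's actual text-embedding space.
e_neu = np.array([0.0, 0.0, 0.0, 0.0])
e_tgt = np.array([1.0, 2.0, -1.0, 0.5])
print(interpolate_condition(e_neu, e_tgt, 0.5).tolist())  # [0.5, 1.0, -0.5, 0.25]
```

Because the blend lives in embedding space, intensity control costs no extra forward passes through the text encoder: the two endpoint embeddings are computed once and reused for every α.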

Fully Symmetric Joint Training

Samples a confusing expression pair (E_a, E_b) and applies a symmetric contrastive loss:

ℒ_SC = ½ [𝒯(G_a, P_a, N_surp) + 𝒯(G_b, P_b, N_fear)]

This enforces bidirectional separation of overlapping expressions using an InfoNCE-style objective (τ = 0.07).

Why "symmetric" training matters

Naively, you could train the model to push "fear" away from "surprise." But without symmetry, the model might just improve on fear while ignoring the reverse — leaving surprise still confused with fear. Symmetric training simultaneously optimizes both directions: fear images push away from surprise, and surprise images push away from fear. The ablation (Fig. 8) confirms that removing symmetry causes expression confusion even when the contrastive loss is still present.
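A minimal numpy sketch of the symmetric objective, assuming 𝒯 is a standard InfoNCE term over embedding vectors (the paper's exact embedding space and batching are not reproduced here; function and argument names are illustrative):

```python
import numpy as np

def info_nce(anchor, positive, negatives, tau=0.07):
    """InfoNCE over one anchor: cross-entropy with the positive similarity
    as the correct class and negative similarities as distractors."""
    def norm(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)
    a, p, n = norm(anchor), norm(positive), norm(negatives)
    logits = np.concatenate([[a @ p], n @ a]) / tau  # positive logit first
    logits -= logits.max()                           # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[0])                         # target index = 0

def symmetric_contrastive(g_a, p_a, n_a, g_b, p_b, n_b, tau=0.07):
    """L_SC: average both directions of a confusing pair (e.g. fear vs.
    surprise), so neither category collapses into the other."""
    return 0.5 * (info_nce(g_a, p_a, n_a, tau) + info_nce(g_b, p_b, n_b, tau))
```

An asymmetric variant would drop one of the two 𝒯 terms, which is exactly the configuration the ablation shows still suffers from confusion.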

Identity Preservation

A frozen ArcFace network serves as the identity encoder Φ_arc:

ℒ_ID = ½ Σ_{i∈{a,b}} [1 − cos(Φ_arc(G_i), Φ_arc(P_i))]

Stabilizes biometric features under strong expression extrapolation, preventing hairstyle and skin texture drift.

ArcFace and cosine identity loss

ArcFace is a face recognition model that maps any face image into a high-dimensional embedding vector where the same person always maps close together regardless of expression. The identity loss penalizes cosine distance between generated and target face embeddings, so editing "happy" onto a face cannot inadvertently change the hair color, skin tone, or face shape — features that ArcFace is highly sensitive to.
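The identity loss itself is a cosine distance averaged over the edited pair. A self-contained sketch with toy embeddings in place of real ArcFace outputs (function names are assumptions):

```python
import numpy as np

def cosine_identity_loss(emb_gen, emb_ref):
    """L_ID: half the summed (1 - cosine similarity) between generated and
    reference face embeddings, over the pair of edited images (i in {a, b})."""
    def cos(u, v):
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
    return 0.5 * sum(1.0 - cos(g, r) for g, r in zip(emb_gen, emb_ref))

# Identical embeddings -> zero loss; orthogonal ones -> loss of 1.0.
same = np.array([1.0, 0.0])
other = np.array([0.0, 1.0])
print(cosine_identity_loss([same, same], [same, same]))    # 0.0
print(cosine_identity_loss([same, same], [other, other]))  # 1.0
```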

Overall Training Objective

ℒ_total = ½ (ℒ_FM^a + ℒ_FM^b) + λ_sc · ℒ_SC + λ_id · ℒ_ID

λ_sc controls the trade-off between expression disentanglement and identity preservation. Trained on 4 NVIDIA H200 GPUs with LoRA (rank 64, α = 128).
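The overall objective is a weighted sum of the three terms; a trivial sketch makes the symmetric flow-matching average explicit (the λ values below are placeholders, not the paper's settings):

```python
def total_loss(fm_a, fm_b, sc, id_loss, lambda_sc=1.0, lambda_id=1.0):
    """L_total = 0.5*(L_FM^a + L_FM^b) + lambda_sc*L_SC + lambda_id*L_ID.
    Both members of the expression pair contribute a flow-matching term;
    lambda_sc tunes disentanglement vs. identity preservation."""
    return 0.5 * (fm_a + fm_b) + lambda_sc * sc + lambda_id * id_loss

print(round(total_loss(0.4, 0.6, 0.2, 0.1), 4))  # 0.8
```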

5. Experiments & Results

Key metrics at a glance

  • mSCR (lower is better): How often the model confuses one expression for another. PixelSmile achieves 0.055, far below prior work.
  • HES (higher is better): Harmonic mean of expression strength and identity similarity — a single number that collapses the quality/identity trade-off.
  • CLS (higher is better): Pearson correlation measuring whether α=0.3 really produces 30% intensity, α=0.7 really produces 70%, etc. A perfect linear controller would score 1.0.
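Since CLS is defined as a Pearson correlation between requested α and measured intensity, it can be computed directly; the sample values below are invented to show the behavior, not results from the paper:

```python
import numpy as np

def control_linearity_score(alphas, intensities):
    """CLS: Pearson correlation between the requested alpha and the
    measured (VLM-predicted) expression intensity. 1.0 = perfectly linear."""
    return float(np.corrcoef(alphas, intensities)[0, 1])

alphas = [0.0, 0.25, 0.5, 0.75, 1.0]
perfect = [0.0, 0.25, 0.5, 0.75, 1.0]  # intensity tracks alpha exactly
noisy = [0.1, 0.2, 0.6, 0.5, 0.9]      # roughly monotone but imprecise
print(round(control_linearity_score(alphas, perfect), 3))  # 1.0
print(round(control_linearity_score(alphas, noisy), 3))
```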

5.2 Quantitative Evaluation

Table 1: Quantitative Evaluation of General Editing Models comparing mSCR, Acc-6, Acc-12, and ID Similarity

Table 1. Quantitative Evaluation of General Editing Models. PixelSmile achieves the lowest mSCR (0.0550) and highest Acc-6 (0.8627).

Table 2: Quantitative Evaluation of Linear Control Models comparing CLS-6, CLS-12, ID Similarity, and HES

Table 2. Quantitative Evaluation of Linear Control Models. PixelSmile achieves best CLS-6 (0.8078), CLS-12 (0.7305), and HES (0.4723).

Scatter plot comparing Expression Score vs ID Similarity across methods: PixelSmile achieves wider expression range with narrower ID impairment
Figure 4. Quantitative Evaluation of Linear Control Methods. Comparison of the trade-off between ID similarity and expression score across different models. PixelSmile achieves an optimal balance, providing a wider expression manipulation range while preserving identity fidelity.

5.3 Qualitative Comparison

Qualitative comparison grid: PixelSmile vs 6 general editing models across Angry/Disgust/Fear/Surprised expressions
Figure 5. Qualitative Comparison with General Editing Models. PixelSmile produces clearer expression changes while preserving facial identity, whereas existing editing models either weaken expression editing or degrade identity consistency.
Qualitative comparison with linear control models showing 5 methods across 6 intensity levels for happy and surprised expressions
Figure 6. Qualitative Comparison with Linear Control Models. PixelSmile achieves smooth and monotonic expression transitions while preserving facial identity. The figure shows two representative expressions: happy (top row) and surprised (bottom row).

5.4 Ablation Study

Ablation on identity loss: without ID loss, large expression intensities cause identity drift in hairstyle and skin texture
Figure 7. Ablation on identity loss. Without ID loss, large expression intensities cause identity drift. Our full method preserves identity consistently.
Ablation on symmetric contrastive learning showing expression confusion without contrastive loss vs precise disentanglement with full method
Figure 8. Ablation on symmetric contrastive learning. Both w/o Contrastive Loss and w/o Symmetric Framework suffer from expression confusion; our full method achieves precise expression disentanglement.
Training dynamics plot showing mSCR and train loss over steps for symmetric vs asymmetric framework
Figure 9. Training dynamics of symmetric contrastive learning. The symmetric framework achieves lower and more stable mSCR despite slower initial convergence.
Table 3: Ablation Study comparing 7 configurations across mSCR, ACC-6, ACC-12, CLS-6, CLS-12, HES, and ID Similarity
Table 3. Ablation Study. Removing Contrastive Loss maximizes structural confusion (mSCR 0.2725). Removing ID Loss improves expression accuracy but degrades identity. Full Setting achieves best overall balance.

5.5 User Study

User study scatter plot: PixelSmile achieves highest continuity (4.48) and strong identity consistency (3.80) with the largest HES bubble
Figure 10. User study results. Trade-off between identity preservation and continuity of editing, annotated by human annotators (N=2,400 images, 10 annotators). Bubble size indicates HES scores.
PixelSmile achieves the best human-judged balance: Continuity 4.48  |  Identity Consistency 3.80 — outperforming K-Slider (1.36, 4.06) and SliderEdit (3.16, 1.14).

What "LoRA rank 64" means for training efficiency

LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning technique that freezes the original large model and adds small trainable matrices of rank r. Rank 64 with α 128 is a moderately large LoRA — it captures enough expressiveness for complex expression editing while keeping added parameters at roughly 1% of the full model, enabling training on just 4 GPUs without catastrophic forgetting of the base model's image generation capability.
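The parameter count a LoRA adapter adds is easy to work out for a single weight matrix. The 4096-dimensional projection below is a hypothetical layer size chosen for illustration; it is not taken from the paper:

```python
def lora_params(d_in, d_out, rank=64):
    """Trainable parameters a rank-r LoRA adds to one d_in x d_out weight:
    two low-rank factors, A (d_in x r) and B (r x d_out)."""
    return d_in * rank + rank * d_out

full = 4096 * 4096                       # a hypothetical attention projection
adapter = lora_params(4096, 4096, 64)
print(adapter, f"{adapter / full:.1%}")  # 524288 3.1%
```

Per adapted matrix the overhead is rank-proportional, which is why even rank 64 stays a small fraction of the frozen base model.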

5.6 Expression Blend

Expression blending results showing pairwise linear interpolation between basic expressions generating plausible compound expressions
Figure 12. Expression Blending Results. Visualizing compositional facial expressions generated by smoothly blending multiple emotional categories in PixelSmile. 9 out of 15 pairwise combinations generate plausible compound expressions, suggesting the learned emotion manifold is continuous and compositional.

Why only 9 of 15 blends work?

With 6 basic emotions, there are 15 pairwise combinations. Some pairs (like "happy+surprised" = excited) produce semantically coherent compound expressions. Others (like "disgust+happy") are physically or psychologically contradictory — the face muscles required are antagonistic. The fact that 9/15 succeed is evidence that PixelSmile has learned a geometrically meaningful emotion manifold — blending is an emergent capability not explicitly trained for.
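The count of 15 is just the number of unordered pairs drawn from 6 basic emotions, C(6, 2):

```python
from itertools import combinations

BASIC = ["happy", "sad", "angry", "fear", "surprised", "disgust"]

# C(6, 2) = 15 unordered pairs of basic emotions available for blending.
pairs = list(combinations(BASIC, 2))
print(len(pairs))  # 15
```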

6. Conclusion

In this paper, we present PixelSmile, a framework for addressing semantic entanglement in facial expression editing. By shifting from discrete supervision to the continuous expression manifold defined by FFE and evaluated through FFE-Bench, our approach enables precise and linearly controllable editing via symmetric joint training. Extensive experiments demonstrate effectiveness of PixelSmile in four dimensions: structural confusion, expression accuracy, linear controllability, and identity preservation. Overall, this work establishes a standardized framework for fine-grained facial expression editing and advances research toward continuous and compositional facial affect manipulation.

Facial Expression Editing · Contrastive Learning · Identity Preservation · Continuous Control · Expression Disentanglement · FFE Dataset · FFE-Bench · Diffusion Models · LoRA
📄 Appendix — Additional Results
Additional linear expression editing results across 10 remaining expressions (Anxious, Contempt, Disgust, Fear, Sad, Angry, Confident, Confused, Shy, Sleepy) showing intensity increase left to right
Figure 11. Additional linear expression editing results. The remaining ten expressions across both real and anime domains. Expression intensity increases from left to right for each expression.
FFE dataset statistics: age distribution in real-world domain (Young Adult 53.5%) and style distribution in anime domain (CG Anime 44.7%, 2D Anime 44.1%)
Figure 13. Statistical distributions of annotated data in FFE. (a) Age distribution in the real-world domain — dominated by young adults (53.5%). (b) Style distribution in the anime domain — CG Anime (44.7%) and 2D Anime (44.1%) are nearly equal.
