*Equal contribution · †Project lead · ‡Corresponding authors
Fine-grained facial expression editing has long been limited by intrinsic semantic overlap. To address this, we construct the Flex Facial Expression (FFE) dataset with continuous affective annotations and establish FFE-Bench to evaluate structural confusion, editing accuracy, linear controllability, and the trade-off between expression editing and identity preservation. We propose PixelSmile, a diffusion framework that disentangles expression semantics via fully symmetric joint training. PixelSmile combines intensity supervision with contrastive learning to produce stronger and more distinguishable expressions, achieving precise and stable linear expression control through textual latent interpolation. Extensive experiments demonstrate that PixelSmile achieves superior disentanglement and robust identity preservation, confirming its effectiveness for continuous, controllable, and fine-grained expression editing, while naturally supporting smooth expression blending.
Many emotions share physical cues — "fear" and "surprise" both involve wide eyes and raised brows. This structural overlap means models trained on discrete emotion labels confuse these categories: generating "fear" may accidentally produce "surprise" instead. PixelSmile treats this not as a labeling error but as a fundamental geometry problem in expression space, requiring explicit disentanglement rather than just better classifiers.
We reveal and formalize the structured semantic overlap between facial expressions, demonstrating that this overlap, rather than pure classification error, is a primary cause of failures in both recognition and generative editing tasks.
A large-scale cross-domain collection featuring 12 expression categories with continuous affective annotations, paired with a multi-dimensional evaluation of structural confusion, expression editing accuracy, linear controllability, and identity preservation.
Novel diffusion-based framework utilizing fully symmetric joint training and textual latent interpolation. Effectively disentangles overlapping emotions and enables disentangled, linearly controllable expression editing.
FFE is constructed through a four-stage collect–compose–generate–annotate pipeline designed to ensure expression diversity, cross-domain coverage, and reliable annotations. The final dataset contains 60,000 images across real and anime domains.
Traditional emotion datasets assign a single label ("happy" or "angry"). FFE instead assigns a 12-dimensional vector where every dimension is a real number in [0, 1], e.g., a face can be 0.8 happy and 0.3 surprised simultaneously. This enables the model to learn a smooth emotion manifold rather than hard class boundaries.
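To make the annotation format concrete, here is a minimal sketch of what such a record could look like; the field names and category indices are hypothetical, not the released FFE schema:

```python
import numpy as np

# Hypothetical FFE-style record; field names and index assignment are
# illustrative, not the released schema.
affect = np.zeros(12)           # one entry per expression category
HAPPY, SURPRISE = 0, 4          # made-up indices for this sketch
affect[HAPPY] = 0.8
affect[SURPRISE] = 0.3          # several dimensions can be active at once

record = {"image": "real/000123.png", "affect": affect}

# Continuous labels describe "mostly happy, slightly surprised" directly,
# instead of forcing a single hard class label.
print(record["affect"].round(2))
```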
Mean Structural Confusion Rate
Quantifies cross-category confusion between semantically similar expressions. Lower is better.
Harmonic Editing Score
HES = 2·S_E·S_ID / (S_E + S_ID). Balances expression strength (S_E) and identity preservation (S_ID). Higher is better.
Control Linearity Score
Pearson correlation between α and VLM-predicted intensity. Higher indicates more predictable control.
Expression Editing Accuracy
Proportion of generated images whose predicted dominant expression matches the target instruction.
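For concreteness, a minimal sketch of how HES and CLS could be computed, assuming the expression score S_E, identity score S_ID, and VLM intensity predictions are already available; all numeric inputs below are made up:

```python
import numpy as np
from scipy.stats import pearsonr

def hes(s_e: float, s_id: float) -> float:
    """Harmonic Editing Score: harmonic mean of the expression score S_E
    and the identity-preservation score S_ID."""
    return 2 * s_e * s_id / (s_e + s_id) if (s_e + s_id) > 0 else 0.0

def cls_score(alphas, vlm_intensity) -> float:
    """Control Linearity Score: Pearson correlation between the control
    coefficient alpha and the VLM-predicted expression intensity."""
    r, _ = pearsonr(alphas, vlm_intensity)
    return r

# Made-up numbers for illustration only.
alphas = np.linspace(0.0, 1.0, 6)
intensity = np.array([0.02, 0.21, 0.39, 0.62, 0.78, 0.97])
print(f"CLS = {cls_score(alphas, intensity):.3f}")
print(f"HES = {hes(0.81, 0.33):.3f}")
```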
Performs linear interpolation in textual latent space:
e_cond(α) = e_neu + α · Δe,  α ∈ [0, 1]
Continuous conditioning embedding enables precise and smooth expression manipulation at inference time without requiring reference images. α > 1 supports extrapolation for stronger expression transfer.
The diffusion model is conditioned on a text embedding vector e. PixelSmile computes two embeddings: one for "neutral" and one for the target emotion. The conditioning vector is then a weighted blend: e_neu + α × (e_target − e_neu). At α=0 the face is neutral; at α=1 it shows the full target expression; at α=1.5 the emotion is exaggerated. This is all done in the model's internal text-embedding space — no reference face image is needed.
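A minimal sketch of this interpolation, assuming a CLIP-style text encoder and illustrative prompts; the paper's actual text backbone and pooling strategy may differ:

```python
import torch
from transformers import CLIPTextModel, CLIPTokenizer

# Assumed encoder for illustration; PixelSmile's backbone may differ.
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

@torch.no_grad()
def embed(prompt: str) -> torch.Tensor:
    tokens = tokenizer(prompt, padding="max_length", truncation=True,
                       return_tensors="pt")
    return encoder(**tokens).last_hidden_state  # (1, seq_len, dim)

e_neu = embed("a photo of a person with a neutral expression")
e_tgt = embed("a photo of a person with a happy expression")
delta = e_tgt - e_neu

def e_cond(alpha: float) -> torch.Tensor:
    # alpha = 0 -> neutral, alpha = 1 -> full target expression,
    # alpha > 1 -> extrapolation for stronger expression transfer.
    return e_neu + alpha * delta

conditioning = e_cond(0.5)  # halfway between neutral and happy
```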
Samples a confusing expression pair (E_a, E_b). Symmetric contrastive loss:
ℒ_SC = ½[𝒯(G_a, P_a, N_surp) + 𝒯(G_b, P_b, N_fear)]
Enforces bidirectional separation of overlapping expressions with an InfoNCE-style objective (τ = 0.07).
Naively, you could train the model to push "fear" away from "surprise." But without symmetry, the model might just improve on fear while ignoring the reverse — leaving surprise still confused with fear. Symmetric training simultaneously optimizes both directions: fear images push away from surprise, and surprise images push away from fear. The ablation (Fig. 8) confirms that removing symmetry causes expression confusion even when the contrastive loss is still present.
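A minimal sketch of a symmetric InfoNCE-style objective, assuming embeddings for generations (G), positives (P), and confusing-class negatives (N) are already extracted; the two-logit form below (one positive, one negative per anchor) is a simplification of a full InfoNCE with many negatives:

```python
import torch
import torch.nn.functional as F

TAU = 0.07  # temperature, as stated above

def info_nce(anchor, positive, negative, tau=TAU):
    """InfoNCE-style term T(G, P, N): pull the generated embedding toward
    its positive and push it away from the confusing negative."""
    anchor, positive, negative = (F.normalize(x, dim=-1)
                                  for x in (anchor, positive, negative))
    logits = torch.stack([(anchor * positive).sum(-1),
                          (anchor * negative).sum(-1)], dim=-1) / tau
    # The positive always sits at index 0 of the logits.
    return F.cross_entropy(logits, torch.zeros(len(anchor), dtype=torch.long))

def symmetric_contrastive_loss(g_a, p_a, neg_a, g_b, p_b, neg_b):
    """L_SC = 0.5 * [T(G_a, P_a, N_b) + T(G_b, P_b, N_a)]: both directions
    of the confusing pair are separated at once, e.g. fear pushed away from
    surprise AND surprise pushed away from fear."""
    return 0.5 * (info_nce(g_a, p_a, neg_a) + info_nce(g_b, p_b, neg_b))
```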
ArcFace frozen as identity encoder Φ_arc:
ℒ_ID = ½ Σ_i [1 − cos(Φ_arc(G_i), Φ_arc(P_i))]
Stabilizes biometric features under strong expression extrapolation, preventing hairstyle and skin texture drift.
ArcFace is a face recognition model that maps any face image into a high-dimensional embedding vector where the same person always maps close together regardless of expression. The identity loss penalizes cosine distance between generated and target face embeddings, so editing "happy" onto a face cannot inadvertently change the hair color, skin tone, or face shape — features that ArcFace is highly sensitive to.
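A minimal sketch of the identity loss, assuming a frozen ArcFace encoder phi_arc is given; this version takes a batch mean where the formula writes ½ Σ, and detaching the reference branch is a design choice of the sketch:

```python
import torch
import torch.nn.functional as F

def identity_loss(phi_arc, generated, paired):
    """L_ID: average (1 - cosine similarity) between frozen-ArcFace
    embeddings of each generated image G_i and its paired image P_i.
    phi_arc is assumed to return (B, 512) embedding vectors."""
    z_gen = phi_arc(generated)           # gradients flow back to the generator
    with torch.no_grad():
        z_ref = phi_arc(paired)          # reference identity, detached
    return (1.0 - F.cosine_similarity(z_gen, z_ref, dim=-1)).mean()
```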
ℒ_total = ½(ℒ_FM^a + ℒ_FM^b) + λ_sc·ℒ_SC + λ_id·ℒ_ID
λ_sc controls the trade-off between expression disentanglement and identity preservation. Trained on 4 NVIDIA H200 GPUs with LoRA (rank 64, α 128).
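Assembling the pieces, the total objective could look like the sketch below, assuming ℒ_FM denotes a per-branch flow-matching term; the λ values are placeholders, not the paper's settings:

```python
# Hypothetical loss weights; the paper does not state the values here.
lambda_sc, lambda_id = 0.1, 0.5

def total_loss(loss_fm_a, loss_fm_b, loss_sc, loss_id):
    """L_total = 0.5*(L_FM^a + L_FM^b) + lambda_sc*L_SC + lambda_id*L_ID,
    with one flow-matching term per branch of the symmetric pair (assumed)."""
    return (0.5 * (loss_fm_a + loss_fm_b)
            + lambda_sc * loss_sc + lambda_id * loss_id)
```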
Table 1. Quantitative Evaluation of General Editing Models. PixelSmile achieves the lowest mSCR (0.0550) and highest Acc-6 (0.8627).
Table 2. Quantitative Evaluation of Linear Control Models. PixelSmile achieves best CLS-6 (0.8078), CLS-12 (0.7305), and HES (0.4723).
LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning technique that freezes the original large model and adds small trainable matrices of rank r. Rank 64 with α 128 is a moderately large LoRA — it captures enough expressiveness for complex expression editing while keeping added parameters at roughly 1% of the full model, enabling training on just 4 GPUs without catastrophic forgetting of the base model's image generation capability.
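A sketch of the stated LoRA setup using the peft library; the target modules listed are typical attention projections for a diffusion backbone and are an assumption, not the paper's exact list:

```python
from peft import LoraConfig

# LoRA setup as stated: rank 64, alpha 128. The target modules below are
# common attention projections in diffusion backbones and are an
# assumption, not the paper's exact configuration.
lora_config = LoraConfig(
    r=64,
    lora_alpha=128,
    target_modules=["to_q", "to_k", "to_v", "to_out.0"],
)
# get_peft_model(base_model, lora_config) would then freeze the backbone
# and train only the low-rank adapter matrices.
```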
With 6 basic emotions, there are 15 pairwise combinations. Some pairs (like "happy+surprised" = excited) produce semantically coherent compound expressions. Others (like "disgust+happy") are physically or psychologically contradictory — the face muscles required are antagonistic. The fact that 9/15 succeed is evidence that PixelSmile has learned a geometrically meaningful emotion manifold — blending is an emergent capability not explicitly trained for.
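One plausible way such blending could fall out of the interpolation scheme is to add two expression directions at once; this composition rule is our assumption, not a confirmed detail of PixelSmile:

```python
import torch

def e_blend(e_neu, deltas, alphas):
    """Hypothetical compound conditioning: start from the neutral embedding
    and add several expression directions at once, reusing per-expression
    deltas of the kind computed in the interpolation sketch above."""
    out = e_neu.clone()
    for delta, alpha in zip(deltas, alphas):
        out = out + alpha * delta
    return out

# e.g. e_blend(e_neu, [delta_happy, delta_surprise], [0.7, 0.5])
# for a "happy + surprised" (excited) compound expression.
```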
In this paper, we present PixelSmile, a framework for addressing semantic entanglement in facial expression editing. By shifting from discrete supervision to the continuous expression manifold defined by FFE and evaluated through FFE-Bench, our approach enables precise and linearly controllable editing via symmetric joint training. Extensive experiments demonstrate the effectiveness of PixelSmile across four dimensions: structural confusion, expression accuracy, linear controllability, and identity preservation. Overall, this work establishes a standardized framework for fine-grained facial expression editing and advances research toward continuous and compositional facial affect manipulation.