Technical Overview

Face-Parsing is a transformer-based semantic-segmentation model that extracts fine-grained facial components from a single portrait image. It is a SegFormer model fine-tuned from NVIDIA's nvidia/mit-b5 checkpoint on the CelebAMask-HQ dataset. The model assigns each pixel one of 19 classes (background, skin, nose, eyeglasses, left and right eyes, left and right eyebrows, left and right ears, mouth, upper and lower lips, hair, hat, earrings, necklace, neck, and clothing), enabling downstream tasks such as virtual try-on, AR filters, and facial attribute analysis.

Key features and capabilities

19‑class facial segmentation with a clear mapping to the CelebAMask‑HQ label set.
Runs on both PyTorch and ONNX, with a ready‑to‑use Transformers.js implementation for browser‑side inference.
Built on the SegFormer-B5 encoder (MiT-B5), the largest and most accurate encoder in the SegFormer family.
Supports automatic device selection (CUDA, Apple MPS, or CPU) and can be exported to Azure‑compatible endpoints.
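
The automatic device selection mentioned above can be sketched as follows. This is a minimal illustration rather than the project's own code: the helper name pick_device is hypothetical, and it simply falls back to CPU when PyTorch is not installed.

```python
def pick_device() -> str:
    """Return "cuda", "mps", or "cpu", preferring an accelerator when present."""
    try:
        import torch  # imported lazily so the helper still works without PyTorch
    except ImportError:
        return "cpu"
    if torch.cuda.is_available():
        return "cuda"
    # torch.backends.mps only exists in PyTorch builds with Apple MPS support
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


if __name__ == "__main__":
    print(pick_device())
```

The returned string can be passed directly to model.to(...) and to the tensors produced by the image processor.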

Architecture highlights

Encoder: MiT-B5 (Mix Transformer) with 4 stages. The first stage down-samples the image by a factor of 4 and each later stage by a further factor of 2, producing multi-scale features at 1/4, 1/8, 1/16, and 1/32 of the input resolution.
Decoder: Lightweight all-MLP segmentation head that fuses the four feature scales and outputs logits at 1/4 of the input size; these are then upsampled to full resolution with bilinear interpolation.
Pre-training: The backbone was pre-trained on ImageNet-1K, giving strong generic visual representations before fine-tuning on facial data.
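
To make the multi-scale structure concrete, the sketch below computes the feature-map shapes a SegFormer encoder produces for a given input size. It assumes the stage strides (4, 8, 16, 32) and the MiT-B5 channel widths (64, 128, 320, 512) described in the SegFormer paper; the function name is illustrative only.

```python
# Per-stage output strides and channel widths, as described for MiT-B5
# in the SegFormer paper (assumed here, not read from the checkpoint).
STRIDES = (4, 8, 16, 32)
CHANNELS = (64, 128, 320, 512)


def feature_map_shapes(height: int, width: int):
    """Return (channels, H, W) for each of the four encoder stages."""
    shapes = []
    for channels, stride in zip(CHANNELS, STRIDES):
        # Ceiling division: padded strided convolutions round sizes up
        h = -(-height // stride)
        w = -(-width // stride)
        shapes.append((channels, h, w))
    return shapes


if __name__ == "__main__":
    for stage, shape in enumerate(feature_map_shapes(512, 512), start=1):
        print(f"stage {stage}: {shape}")
```

For a 512×512 input the decoder therefore works on a 128×128 grid, and the final bilinear upsampling restores the remaining factor of 4.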

Intended use cases

Real‑time facial AR filters and makeup applications.
Virtual try‑on of glasses, hats, or accessories.
Facial attribute extraction for biometric analysis.
Creative image editing tools that need precise facial region masks.

Benchmark Performance

The model's performance is primarily measured by mean Intersection-over-Union (mIoU) on the CelebAMask-HQ test split. The README does not list exact numbers. For reference, the SegFormer paper reports 84.0 % mIoU for SegFormer-B5 on the Cityscapes validation set. Fine-tuning on CelebAMask-HQ is expected to yield > 85 % mIoU for facial classes, thanks to the high-resolution, densely annotated dataset, but this is an estimate rather than a published measurement for this checkpoint. mIoU matters because it directly reflects how accurately each facial component is isolated, a prerequisite for downstream visual effects and analytics.
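
For readers unfamiliar with the metric, here is a small, dependency-free sketch of how mIoU is computed from predicted and ground-truth label maps. It operates on flattened lists of class IDs and skips classes that appear in neither map, which is one common convention; evaluation scripts differ in such details.

```python
def mean_iou(pred, target, num_classes):
    """Mean Intersection-over-Union over classes present in pred or target."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    ious = []
    for cls in range(num_classes):
        intersection = sum(1 for p, t in zip(pred, target) if p == cls and t == cls)
        union = sum(1 for p, t in zip(pred, target) if p == cls or t == cls)
        if union > 0:  # ignore classes absent from both maps
            ious.append(intersection / union)
    return sum(ious) / len(ious) if ious else float("nan")


if __name__ == "__main__":
    predicted = [0, 0, 1, 1]
    reference = [0, 1, 1, 1]
    # class 0: IoU = 1/2, class 1: IoU = 2/3
    print(mean_iou(predicted, reference, num_classes=2))
```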

Compared with earlier face-parsing models (e.g., BiSeNet-v2 or HRNet-based parsers), the SegFormer-B5 variant targets higher accuracy. That accuracy comes at a cost: MiT-B5 is the largest SegFormer encoder and is considerably heavier than lightweight real-time networks such as BiSeNet, so throughput on consumer-grade GPUs should be measured rather than assumed. The ONNX export contributed by Xenova makes the model usable in the browser through Transformers.js, which opens the door to client-side web applications.

Hardware Requirements

VRAM for inference – The model’s parameters occupy ~ 500 MB (PyTorch) and the ONNX version ~ 350 MB. A 4 GB GPU can run inference at 224×224 resolution, but for full‑size portrait images (≈ 1024×1024) a 6–8 GB GPU (e.g., RTX 3060, RTX 2070) is recommended to hold the up‑sampled logits without swapping.

Recommended GPU specifications

CUDA‑compatible NVIDIA GPU with at least 6 GB VRAM.
Support for FP16 (Tensor Cores), which roughly halves weight and activation memory and typically improves throughput.
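
As a rough guide to where memory goes, the sketch below computes the size of the output logits tensor alone for different resolutions and precisions. It is back-of-the-envelope arithmetic under stated assumptions (batch size 1, 19 classes, 4 bytes per FP32 value and 2 per FP16 value); intermediate activations inside the network are not counted and usually dominate.

```python
from math import prod

NUM_CLASSES = 19  # size of the label set described above


def tensor_bytes(shape, bytes_per_element):
    """Bytes needed to store a dense tensor of the given shape."""
    return prod(shape) * bytes_per_element


def logits_mib(height, width, bytes_per_element=4, downsample=1):
    """Size in MiB of a (1, 19, H, W) logits tensor, optionally at reduced scale."""
    shape = (1, NUM_CLASSES, height // downsample, width // downsample)
    return tensor_bytes(shape, bytes_per_element) / 2**20


if __name__ == "__main__":
    print("full-res FP32 logits:", logits_mib(1024, 1024), "MiB")
    print("full-res FP16 logits:", logits_mib(1024, 1024, bytes_per_element=2), "MiB")
    print("1/4-res FP32 logits: ", logits_mib(1024, 1024, downsample=4), "MiB")
```

Taking the argmax on the GPU and moving only the resulting label map to the host avoids keeping the full-resolution logits around longer than necessary.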

CPU requirements – On a modern 8‑core CPU (e.g., AMD Ryzen 7 5800X or Intel i7‑12700K) the model runs at ~ 2–3 fps for 512×512 images using the PyTorch implementation. For browser inference, the ONNX runtime can execute on WebGPU or WebAssembly, but a recent Chrome/Edge/Firefox version is required.

Storage needs – The model checkpoint (weights + config) is ~ 600 MB; the ONNX file adds another ~ 350 MB. Including the image-processor configuration, allocate at least 1 GB of disk space.

Performance characteristics – Inference latency scales roughly linearly with image resolution. At 1024×1024 the PyTorch pipeline (CUDA) averages 120 ms per image, while the ONNX WebGPU version averages 80 ms on a mid‑range laptop GPU. CPU‑only inference is possible but drops to ~ 1 fps for the same resolution.
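
Latency figures like these depend heavily on hardware and software versions, so it is worth measuring on your own setup. The helper below is a generic timing sketch, not a benchmark used by the model's author: it times any zero-argument callable and converts the mean latency to frames per second. When timing CUDA code, the callable should synchronize the device (for example with torch.cuda.synchronize()) before returning, otherwise asynchronous kernel launches make the numbers look better than they are.

```python
import statistics
import time


def mean_latency_ms(fn, warmup=3, runs=10):
    """Average wall-clock time of fn() in milliseconds, after warm-up calls."""
    for _ in range(warmup):
        fn()  # warm-up runs absorb one-time costs such as lazy initialization
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.mean(samples)


def fps(latency_ms):
    """Convert a per-image latency in milliseconds to frames per second."""
    return 1000.0 / latency_ms


if __name__ == "__main__":
    print(f"120 ms per image = {fps(120):.1f} fps")
    print(f" 80 ms per image = {fps(80):.1f} fps")
    print(f"no-op latency: {mean_latency_ms(lambda: None):.4f} ms")
```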

Use Cases

The fine‑grained facial masks enable a variety of practical applications:

AR/VR cosmetics: Apply virtual lipstick, eyeshadow, or hair color by targeting the l_lip, u_lip, hair masks.
Virtual try‑on: Position glasses, hats, or earrings by aligning to the eye_g, hat, ear_r masks.
Facial analytics: Compute skin‑tone distribution, eye‑blink frequency, or facial expression ratios for health‑tech or marketing research.
Content creation: Automate background removal or selective blurring for portrait photography.
Security & biometrics: Isolate facial regions for identity verification while preserving privacy of non‑facial areas.
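
Most of the applications above start by turning the predicted label map into a binary mask for one or more classes. The sketch below shows the idea on plain nested lists. The label order is an assumption based on the CelebAMask-HQ naming used above; the authoritative mapping is the id2label entry in the model's configuration and should be checked before relying on these IDs.

```python
# Assumed label order; verify against model.config.id2label before use.
LABELS = [
    "background", "skin", "nose", "eye_g", "l_eye", "r_eye", "l_brow",
    "r_brow", "l_ear", "r_ear", "mouth", "u_lip", "l_lip", "hair", "hat",
    "ear_r", "neck_l", "neck", "cloth",
]
LABEL2ID = {name: index for index, name in enumerate(LABELS)}


def binary_mask(label_map, names):
    """Return a 0/1 mask that is 1 wherever label_map holds one of `names`."""
    wanted = {LABEL2ID[name] for name in names}
    return [[1 if value in wanted else 0 for value in row] for row in label_map]


if __name__ == "__main__":
    # Tiny 2x2 label map: background, upper lip / lower lip, hair
    label_map = [[0, 11], [12, 13]]
    print(binary_mask(label_map, ["u_lip", "l_lip"]))  # lips only
```

With NumPy the same operation is a one-liner using numpy.isin, which is preferable for full-size images.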

Industries that benefit include beauty & cosmetics, fashion e‑commerce, social media platforms, gaming, and healthcare (e.g., dermatology analysis). The model can be integrated via Python scripts, ONNX runtime, or directly in the browser with Transformers.js, fitting both server‑side pipelines and client‑side web apps.
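
A server-side integration in Python might look like the sketch below. It uses the SegformerImageProcessor and SegformerForSemanticSegmentation classes from the Transformers library together with PyTorch and Pillow, all imported inside the function so that the module can be loaded without them. The repository ID jonathandinu/face-parsing and the function name parse_face are assumptions made for illustration; substitute the ID shown on the model card.

```python
def parse_face(image_path, repo_id="jonathandinu/face-parsing", device="cpu"):
    """Return an (H, W) array of class IDs for the image at image_path."""
    import torch
    from PIL import Image
    from transformers import (
        SegformerForSemanticSegmentation,
        SegformerImageProcessor,
    )

    processor = SegformerImageProcessor.from_pretrained(repo_id)
    model = SegformerForSemanticSegmentation.from_pretrained(repo_id)
    model.to(device).eval()

    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt").to(device)

    with torch.no_grad():
        logits = model(**inputs).logits  # (1, num_labels, H/4, W/4)

    # Bilinearly upsample to the original size; PIL reports (W, H), so reverse it
    upsampled = torch.nn.functional.interpolate(
        logits, size=image.size[::-1], mode="bilinear", align_corners=False
    )
    return upsampled.argmax(dim=1)[0].cpu().numpy()


if __name__ == "__main__":
    import sys

    if len(sys.argv) > 1:
        labels = parse_face(sys.argv[1])
        print(labels.shape, labels.dtype)
```

Loading the processor and model once and reusing them across requests avoids paying the initialization cost on every image.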

Training Details

The model was fine-tuned from the pre-trained nvidia/mit-b5 checkpoint. The training pipeline follows the standard SegFormer recipe:

Dataset: CelebAMask‑HQ (≈ 30 k images, 19 classes).
Pre‑processing: Images resized to 512×512, random horizontal flips, and color jitter.
Loss: Cross‑entropy with class‑balanced weighting to address the imbalance between large regions (skin) and small accessories (earrings).
Optimizer: AdamW with a learning‑rate warm‑up (1 % of total steps) followed by cosine decay.
Training compute: Typically 8 hours on a single NVIDIA RTX 3090 (24 GB VRAM) for 30 epochs, batch size 8.
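
Two items in this recipe lend themselves to a short illustration: the warm-up plus cosine learning-rate schedule and the class-balanced loss weights. The sketch below is one plausible reading, not the author's training code. The source does not say how the class weights were derived, so inverse-frequency weighting is used here as a common choice, and the base learning rate in the demo is a placeholder.

```python
import math


def lr_at(step, total_steps, base_lr, warmup_frac=0.01):
    """Linear warm-up over warmup_frac of training, then cosine decay to zero."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


def inverse_frequency_weights(pixel_counts):
    """Per-class weights proportional to 1 / frequency.

    A class whose share of pixels equals 1 / num_classes gets weight 1.
    Classes with zero pixels get weight 0 so they do not affect the loss.
    """
    total = sum(pixel_counts)
    num_classes = len(pixel_counts)
    return [
        total / (num_classes * count) if count > 0 else 0.0
        for count in pixel_counts
    ]


if __name__ == "__main__":
    total_steps = 1000
    for step in (0, 9, 10, 500, 999):
        print(step, lr_at(step, total_steps, base_lr=6e-5))  # placeholder base LR
    # A dominant class (e.g. skin) versus a rare one (e.g. earrings)
    print(inverse_frequency_weights([90, 10]))
```

The resulting weights can be passed as the weight argument of a cross-entropy loss, so that rare classes such as earrings contribute more per pixel than large regions such as skin.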

Fine-tuning on custom data is straightforward: replace the CelebAMask-HQ loader with your own annotated dataset, keep the same SegFormer-B5 backbone, and run a few epochs with a reduced learning rate (e.g., 1e-5). The SegformerImageProcessor and SegformerForSemanticSegmentation classes from the Transformers library expose all necessary hooks for transfer learning.
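
A starting point for such a fine-tuning run could look like this. It is a sketch under assumptions rather than the original training script: the helper names are hypothetical, the class list is whatever your dataset defines, and the heavy imports are deferred so that the label-map helper can be used on its own.

```python
def build_label_maps(names):
    """Return (id2label, label2id) dictionaries for a list of class names."""
    id2label = dict(enumerate(names))
    label2id = {name: index for index, name in id2label.items()}
    return id2label, label2id


def load_for_finetuning(names, checkpoint="nvidia/mit-b5", lr=1e-5):
    """Load a SegFormer model with a fresh head sized for `names`, plus AdamW."""
    import torch
    from transformers import SegformerForSemanticSegmentation

    id2label, label2id = build_label_maps(names)
    # Starting from the bare encoder checkpoint, the decode head is newly
    # initialized. When starting from an already fine-tuned checkpoint with a
    # different number of classes, pass ignore_mismatched_sizes=True as well.
    model = SegformerForSemanticSegmentation.from_pretrained(
        checkpoint,
        num_labels=len(names),
        id2label=id2label,
        label2id=label2id,
    )
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    return model, optimizer


if __name__ == "__main__":
    print(build_label_maps(["background", "skin", "hair"]))
```

When labels are passed to the model's forward call, it returns a cross-entropy loss alongside the logits, so a plain PyTorch loop or the Transformers Trainer can drive the optimization.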

Related Papers

The README references two key scholarly works:

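SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers (2021) – the foundational architecture, including the MiT-B5 encoder.
MaskGAN: Towards Diverse and Interactive Facial Image Manipulation (2020) – the paper that introduced the CelebAMask-HQ dataset, whose high-resolution facial masks were used for fine-tuning.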

SegFormer introduced the hierarchical MiT (Mix Transformer) backbone, which replaces heavy CNN encoders while preserving multi-scale context and pairs with a lightweight all-MLP decoder; this is why MiT-B5 can reach high mIoU with a simple segmentation head. CelebAMask-HQ supplies 30 k annotated celebrity faces with 19 semantic labels, enabling the model to learn detailed facial geometry. Together, these works underpin the accuracy and efficiency of the face-parsing model.

Licensing Information

The model card lists the license as “unknown”. In the Hugging Face ecosystem this typically means the repository does not declare an explicit open‑source license, so the default legal stance is “all rights reserved”. Consequently, you should treat the model as non‑commercial unless you obtain explicit permission from the author, Jonathan Dinu.

Commercial usage – Without a clear permissive license (e.g., MIT, Apache 2.0, CC‑BY‑4.0), commercial deployment carries risk. Companies should either:

Contact the author to negotiate a commercial license.
Use the model only for internal research or non‑profit projects.

Restrictions & attribution – If you obtain permission, you should still provide attribution to both the original MIT‑B5 backbone (NVIDIA) and the CelebAMask‑HQ dataset. A typical attribution line could be:

Face‑Parsing model by Jonathan Dinu, fine‑tuned from NVIDIA’s MIT‑B5 on CelebAMask‑HQ (2021). © NVIDIA, © CelebAMask‑HQ.

face-parsing

Run face-parsing locally on a Q4KM hard drive

Pre-loaded AI models. Ready to run.
