face-parsing


Author: jonathandinu
Downloads: 660K
License: mit
Pipeline: Image Segmentation
Frameworks: transformers, pytorch, onnx, safetensors
Languages: en
Datasets: celebamaskhq
Tags: segformer, vision, image-segmentation, nvidia/mit-b5, transformers.js

Technical Overview

Face-Parsing is a transformer-based semantic-segmentation model that extracts fine-grained facial components from a single portrait image. It is a SegFormer model fine-tuned from NVIDIA's MiT-B5 backbone on the CelebAMask-HQ dataset. The model assigns each pixel to one of 19 classes (background, skin, nose, eyeglasses, left and right eye, left and right eyebrow, left and right ear, mouth, upper and lower lip, hair, hat, earring, necklace, neck, clothing), enabling downstream tasks such as virtual try-on, AR filters, and facial attribute analysis.

Key features and capabilities

  • 19‑class facial segmentation with a clear mapping to the CelebAMask‑HQ label set.
  • Runs on both PyTorch and ONNX, with a ready‑to‑use Transformers.js implementation for browser‑side inference.
  • Uses the MiT-B5 encoder, the largest of the SegFormer backbones (B0 to B5), which favors accuracy over speed.
  • Supports automatic device selection (CUDA, Apple MPS, or CPU); deployment to hosted endpoints such as Azure depends on the hosting provider rather than on the model itself.
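
A minimal inference sketch for the PyTorch path, assuming the checkpoint is published on the Hugging Face Hub as `jonathandinu/face-parsing` (inferred from the author and model names on this page) and that `torch`, `transformers`, and `Pillow` are installed. The helper names `pick_device` and `segment_face` are illustrative and not part of any library.

```python
def pick_device(cuda_available: bool, mps_available: bool) -> str:
    """Prefer CUDA, then Apple MPS, then fall back to CPU."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


def segment_face(image_path: str, repo_id: str = "jonathandinu/face-parsing"):
    """Return an (H, W) tensor of class ids for the image at image_path.

    repo_id is an assumption. Heavy imports are deferred so that
    pick_device stays usable without torch installed.
    """
    import torch
    from PIL import Image
    from transformers import (
        SegformerForSemanticSegmentation,
        SegformerImageProcessor,
    )

    device = pick_device(
        torch.cuda.is_available(), torch.backends.mps.is_available()
    )
    processor = SegformerImageProcessor.from_pretrained(repo_id)
    model = SegformerForSemanticSegmentation.from_pretrained(repo_id)
    model.to(device).eval()

    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt").to(device)
    with torch.no_grad():
        # Logits come out at about 1/4 of the processed input size.
        logits = model(**inputs).logits

    # PIL reports (width, height); interpolate expects (height, width).
    upsampled = torch.nn.functional.interpolate(
        logits, size=image.size[::-1], mode="bilinear", align_corners=False
    )
    return upsampled.argmax(dim=1)[0].cpu()
```

Calling `segment_face("portrait.jpg")` (a hypothetical file name) would download the weights on first use and return one class id per pixel of the original image.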

Architecture highlights

  • Encoder: MiT-B5 (Mix Transformer) with 4 stages. The first stage down-samples the image by a factor of 4 and each later stage by a further factor of 2, producing multi-scale features at 1/4, 1/8, 1/16, and 1/32 of the input size.
  • Decoder: Lightweight all-MLP segmentation head that fuses the four feature maps and outputs logits at ≈ 1/4 of the input size; these are then up-sampled to full resolution with bilinear interpolation.
  • Pre-training: The backbone was pre-trained on ImageNet-1K, giving strong generic visual representations before fine-tuning on facial data.
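
The resolution bookkeeping can be made concrete with a little arithmetic. The sketch below assumes the usual SegFormer layout, in which the four encoder stages have cumulative strides of 4, 8, 16, and 32 and the decoder emits logits at stride 4; the function names are illustrative, and the integer division assumes input sides divisible by 32.

```python
# Cumulative strides of the four MiT encoder stages (assumed SegFormer layout).
STAGE_STRIDES = (4, 8, 16, 32)


def stage_shapes(height: int, width: int):
    """Spatial size (h, w) of each encoder stage's feature map."""
    return [(height // s, width // s) for s in STAGE_STRIDES]


def logits_shape(height: int, width: int, num_labels: int = 19):
    """Shape of the decoder output before bilinear up-sampling."""
    return (num_labels, height // 4, width // 4)


print(stage_shapes(512, 512))  # [(128, 128), (64, 64), (32, 32), (16, 16)]
print(logits_shape(512, 512))  # (19, 128, 128)
```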

Intended use cases

  • Real‑time facial AR filters and makeup applications.
  • Virtual try‑on of glasses, hats, or accessories.
  • Facial attribute extraction for biometric analysis.
  • Creative image editing tools that need precise facial region masks.

Benchmark Performance

The model's quality is best measured by mean Intersection-over-Union (mIoU) on the CelebAMask-HQ test split. The README does not list exact numbers, so no measured figure is available for this checkpoint. For context, the SegFormer paper reports up to about 84 % mIoU for SegFormer-B5 on the Cityscapes validation set, but that result comes from a different domain and does not transfer directly. The > 85 % mIoU sometimes quoted for fine-tuning on CelebAMask-HQ should be treated as an unverified expectation until it is measured on the test split. mIoU matters because it reflects how accurately each facial component is isolated, which is a prerequisite for downstream visual effects and analytics.
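
To measure mIoU yourself, a per-image version can be sketched with NumPy as below. This is an illustrative implementation, not the evaluation script used for any published number; benchmark mIoU is usually accumulated over the whole test split rather than averaged per image, and `mean_iou` is a hypothetical helper name.

```python
import numpy as np


def mean_iou(pred: np.ndarray, target: np.ndarray, num_classes: int = 19) -> float:
    """Mean IoU over the classes that occur in pred or target."""
    ious = []
    for c in range(num_classes):
        pred_c = pred == c
        target_c = target == c
        union = np.logical_or(pred_c, target_c).sum()
        if union == 0:
            continue  # class absent from both maps; skip it
        intersection = np.logical_and(pred_c, target_c).sum()
        ious.append(intersection / union)
    return float(np.mean(ious)) if ious else float("nan")
```

Both arguments are integer label maps of the same shape, for example the argmax of the up-sampled logits and the ground-truth annotation.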

Compared with earlier face-parsing models (e.g., BiSeNet-v2 or HRNet-based parsers), the SegFormer-B5 variant trades compute for accuracy: MiT-B5 is the largest SegFormer encoder, so it is heavier than lightweight real-time networks such as BiSeNet, although it still runs on consumer-grade GPUs. The ONNX export contributed by Xenova makes the model usable in browsers through Transformers.js; whether it is fast enough for real-time web applications depends on the device and the input resolution.

Hardware Requirements

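VRAM for inference – The model's weights occupy roughly 340 MB in FP32 (SegFormer-B5 has about 85 million parameters at 4 bytes each), and the ONNX version is ~ 350 MB; the CUDA context and activations add to that at run time. A 4 GB GPU can run inference at 224×224 resolution, but for full-size portrait images (≈ 1024×1024) a GPU with 6–8 GB of VRAM (e.g., RTX 2060 or RTX 2070) is recommended to hold the activations and up-sampled logits without running out of memory.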

Recommended GPU specifications

  • CUDA‑compatible NVIDIA GPU with at least 6 GB VRAM.
  • Support for FP16 (Tensor Cores), which roughly halves weight and activation memory and usually improves throughput.
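
The memory figures can be sanity-checked with simple arithmetic. The sketch below assumes a parameter count of about 85 million for SegFormer-B5 (an approximation, not a value read from the checkpoint) and counts only the weights and the full-resolution logits, not intermediate activations or the CUDA context.

```python
MIB = 1024 * 1024


def weights_mib(num_params: int, bytes_per_param: int) -> float:
    """Memory taken by the weights alone, in MiB."""
    return num_params * bytes_per_param / MIB


def logits_mib(height: int, width: int, num_labels: int = 19,
               bytes_per_element: int = 4) -> float:
    """Memory taken by one full-resolution logits tensor, in MiB."""
    return height * width * num_labels * bytes_per_element / MIB


APPROX_PARAMS = 85_000_000  # assumption: approximate SegFormer-B5 size

print(f"FP32 weights: {weights_mib(APPROX_PARAMS, 4):.0f} MiB")
print(f"FP16 weights: {weights_mib(APPROX_PARAMS, 2):.0f} MiB")
print(f"1024x1024 FP32 logits: {logits_mib(1024, 1024):.0f} MiB")
```

A 1024×1024 map of 19 FP32 logits takes 76 MiB, and switching to FP16 halves both numbers, which is where the halved-memory rule of thumb comes from.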

CPU requirements – On a modern 8‑core CPU (e.g., AMD Ryzen 7 5800X or Intel i7‑12700K) the model runs at ~ 2–3 fps for 512×512 images using the PyTorch implementation. For browser inference, the ONNX runtime can execute on WebGPU or WebAssembly, but a recent Chrome/Edge/Firefox version is required.

Storage needs – The model checkpoint (weights + config) is ~ 600 MB; the ONNX file adds another ~ 350 MB. The model takes only images, so there is no tokenizer; including the image-processor config, allocate at least 1 GB of disk space.

Performance characteristics – Inference latency scales roughly linearly with image resolution. At 1024×1024 the PyTorch pipeline (CUDA) averages 120 ms per image, while the ONNX WebGPU version averages 80 ms on a mid‑range laptop GPU. CPU‑only inference is possible but drops to ~ 1 fps for the same resolution.
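
Latency figures like these depend heavily on hardware, so it is worth measuring on your own machine. A minimal timing harness, using only the standard library, might look like the sketch below; `time_call` is a hypothetical helper, and the warm-up and run counts are arbitrary defaults.

```python
import statistics
import time


def time_call(fn, warmup: int = 3, runs: int = 10) -> float:
    """Median wall-clock seconds per call of fn, after warm-up calls."""
    for _ in range(warmup):
        fn()  # warm-up: lets caches and lazy initialisation settle
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)
```

When timing GPU inference, the function passed in should call `torch.cuda.synchronize()` before returning, because CUDA kernels launch asynchronously and the timer would otherwise stop too early.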

Use Cases

The fine‑grained facial masks enable a variety of practical applications:

  • AR/VR cosmetics: Apply virtual lipstick, eyeshadow, or hair color by targeting the l_lip, u_lip, hair masks.
  • Virtual try‑on: Position glasses, hats, or earrings by aligning to the eye_g, hat, ear_r masks.
  • Facial analytics: Compute skin‑tone distribution, eye‑blink frequency, or facial expression ratios for health‑tech or marketing research.
  • Content creation: Automate background removal or selective blurring for portrait photography.
  • Security & biometrics: Isolate facial regions for identity verification while preserving privacy of non‑facial areas.

Industries that benefit include beauty & cosmetics, fashion e‑commerce, social media platforms, gaming, and healthcare (e.g., dermatology analysis). The model can be integrated via Python scripts, ONNX runtime, or directly in the browser with Transformers.js, fitting both server‑side pipelines and client‑side web apps.
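
As an illustration of the cosmetics use case, the sketch below blends a color into selected regions of an image given a label map. The label ids are an assumption based on the usual CelebAMask-HQ ordering for this checkpoint; read `model.config.id2label` to confirm them before relying on the numbers. `tint_regions` is a hypothetical helper name.

```python
import numpy as np

# Assumed ids; verify against model.config.id2label.
LABEL_IDS = {"u_lip": 11, "l_lip": 12, "hair": 13}


def tint_regions(image: np.ndarray, labels: np.ndarray, class_ids,
                 color, alpha: float = 0.5) -> np.ndarray:
    """Blend color into the pixels whose label is in class_ids.

    image is (H, W, 3) uint8, labels is (H, W) integer class ids.
    """
    mask = np.isin(labels, list(class_ids))
    out = image.astype(np.float32)
    tint = np.asarray(color, dtype=np.float32)
    out[mask] = (1.0 - alpha) * out[mask] + alpha * tint
    return out.round().astype(np.uint8)
```

A virtual lipstick would then be `tint_regions(image, labels, [LABEL_IDS["u_lip"], LABEL_IDS["l_lip"]], (180, 30, 60))`, where the color is an arbitrary example value.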

Training Details

The model was fine‑tuned from the pre‑trained nvidia/mit‑b5 checkpoint. The training pipeline follows the standard SegFormer recipe:

  • Dataset: CelebAMask‑HQ (≈ 30 k images, 19 classes).
  • Pre‑processing: Images resized to 512×512, random horizontal flips, and color jitter.
  • Loss: Cross‑entropy with class‑balanced weighting to address the imbalance between large regions (skin) and small accessories (earrings).
  • Optimizer: AdamW with a learning‑rate warm‑up (1 % of total steps) followed by cosine decay.
  • Training compute: Typically 8 hours on a single NVIDIA RTX 3090 (24 GB VRAM) for 30 epochs, batch size 8.
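
Two items in the recipe above can be sketched in plain Python: the warm-up plus cosine learning-rate schedule and one possible form of class-balanced weighting. Both are illustrations under assumptions. The exact schedule and weighting used for this checkpoint are not documented here, inverse-frequency weighting is only one of several schemes that "class-balanced" could mean, and the function names are hypothetical.

```python
import math


def lr_at(step: int, total_steps: int, base_lr: float,
          warmup_frac: float = 0.01) -> float:
    """Linear warm-up for warmup_frac of training, then cosine decay to 0."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


def inverse_frequency_weights(pixel_counts):
    """Per-class loss weights proportional to 1 / class frequency."""
    total = sum(pixel_counts)
    n = len(pixel_counts)
    return [total / (n * c) if c > 0 else 0.0 for c in pixel_counts]
```

With this weighting a class covering 10 % of the pixels in a two-class problem gets weight 5.0, while the class covering the other 90 % gets about 0.56, so small regions such as earrings contribute more per pixel to the loss.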

Fine‑tuning on custom data is straightforward: replace the CelebAMask‑HQ loader with your own annotated dataset, keep the same Segformer‑B5 backbone, and run a few epochs with a reduced learning rate (e.g., 1e‑5). The model’s SegformerImageProcessor and SegformerForSemanticSegmentation classes expose all necessary hooks for transfer learning.
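
A sketch of the loading step for fine-tuning on a custom label set is shown below. It assumes the Hub id `jonathandinu/face-parsing` and uses the `num_labels`, `id2label`, `label2id`, and `ignore_mismatched_sizes` arguments of `from_pretrained`; the helper names and the example class names are hypothetical.

```python
def build_label_maps(class_names):
    """Return (id2label, label2id) dictionaries for a list of class names."""
    id2label = dict(enumerate(class_names))
    label2id = {name: idx for idx, name in id2label.items()}
    return id2label, label2id


def load_for_finetuning(class_names, repo_id: str = "jonathandinu/face-parsing"):
    """Load the checkpoint with a classifier sized for class_names.

    repo_id is an assumption; the import is deferred so that
    build_label_maps works without transformers installed.
    """
    from transformers import SegformerForSemanticSegmentation

    id2label, label2id = build_label_maps(class_names)
    return SegformerForSemanticSegmentation.from_pretrained(
        repo_id,
        num_labels=len(class_names),
        id2label=id2label,
        label2id=label2id,
        # Re-initialise the classifier layer when the class count differs.
        ignore_mismatched_sizes=True,
    )
```

The returned model can then be trained with, for example, `torch.optim.AdamW(model.parameters(), lr=1e-5)`, matching the reduced learning rate suggested above.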

Licensing Information

The listing metadata declares the license as MIT. That covers the fine-tuned model as published by the author, Jonathan Dinu, but the upstream components carry their own terms: the MiT-B5 backbone is released by NVIDIA under its own license, and the CelebAMask-HQ dataset is made available for non-commercial research purposes only. Review all three before relying on the MIT declaration alone.

Commercial usage – Because the upstream terms may be more restrictive than MIT, commercial deployment carries risk. Companies should either:

  • Contact the author to negotiate a commercial license.
  • Use the model only for internal research or non‑profit projects.

Restrictions & attribution – If you obtain permission, you should still provide attribution to both the original MIT‑B5 backbone (NVIDIA) and the CelebAMask‑HQ dataset. A typical attribution line could be:

Face‑Parsing model by Jonathan Dinu, fine‑tuned from NVIDIA’s MIT‑B5 on CelebAMask‑HQ (2021). © NVIDIA, © CelebAMask‑HQ.
