Name: distilbert_finetuned_ai4privacy_v2
Author: Isotonic

Technical Overview

What is this model?

distilbert_finetuned_ai4privacy_v2 is a token‑classification (named‑entity‑recognition) model built on the distilbert-base-uncased architecture. It has been fine‑tuned on the ai4privacy/pii-masking-200k dataset, which is the largest publicly available English‑language privacy‑focused corpus. The model’s primary purpose is to detect and mask personally identifiable information (PII) in free‑form text, making it a useful component for privacy‑preserving pipelines that feed data into large language models, chatbots, or document‑automation systems.

Key features and capabilities

Detects 54 distinct PII classes, ranging from simple identifiers such as Email and PhoneNumber to more nuanced entities like CurrencyName and Ip.
Supports five interaction styles (casual conversation, formal documents, emails, etc.) and 229 subject‑specific use‑cases across business, education, psychology, and legal domains.
Provides token‑level confidence scores, enabling downstream masking or redaction logic to be applied with fine‑grained control.
Optimized for inference with ONNX and SafeTensors, allowing deployment on edge devices or low‑latency server environments.
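
The confidence-based masking mentioned above can be sketched as follows. This is a minimal illustration, not the author's redaction logic: it assumes entities arrive as dictionaries with `entity_group`, `score`, `start`, and `end` keys (the shape the 🤗 Transformers token-classification pipeline returns when an aggregation strategy is set), and the sample text, scores, and the `mask_entities` helper are hypothetical.

```python
from typing import Dict, List


def mask_entities(text: str, entities: List[Dict], threshold: float = 0.5) -> str:
    """Replace each sufficiently confident entity span with a [Label] placeholder."""
    # Keep only spans whose confidence is at or above the threshold.
    confident = [e for e in entities if e["score"] >= threshold]
    # Replace from right to left so earlier character offsets stay valid.
    for ent in sorted(confident, key=lambda e: e["start"], reverse=True):
        text = text[: ent["start"]] + f"[{ent['entity_group']}]" + text[ent["end"]:]
    return text


# Hypothetical detections for illustration; the scores are made up.
sample_text = "Contact Jane at jane@example.com today."
sample_entities = [
    {"entity_group": "Firstname", "score": 0.98, "start": 8, "end": 12},
    {"entity_group": "Email", "score": 0.41, "start": 16, "end": 32},
]

print(mask_entities(sample_text, sample_entities, threshold=0.9))
# Contact [Firstname] at jane@example.com today.
print(mask_entities(sample_text, sample_entities, threshold=0.3))
# Contact [Firstname] at [Email] today.
```

A lower threshold masks more aggressively, trading some over-masking for fewer missed identifiers; for privacy-critical pipelines that trade-off usually favors the lower value.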

Architecture highlights

Base model: distilbert-base-uncased – a 6‑layer, 768‑dimensional transformer that retains ~97% of BERT‑base performance while using about 40% fewer parameters.
Quantized variant available (distilbert‑base‑uncased quantized) for reduced memory footprint during inference.
Fine‑tuned using the Hugging Face Trainer API with a cosine‑with‑restarts learning‑rate scheduler, Adam optimizer, and a seed of 42 for reproducibility.
Exported in ONNX and SafeTensors formats, making it compatible with a wide range of inference runtimes (e.g., ONNX Runtime, 🤗 Transformers pipelines).
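
A minimal sketch of loading the model through a 🤗 Transformers pipeline is shown here. The Hub identifier is an assumption formed from the author and model name given at the top of this page, and `build_pii_pipeline` is a hypothetical helper name.

```python
MODEL_ID = "Isotonic/distilbert_finetuned_ai4privacy_v2"  # assumed Hub id (author/name)


def build_pii_pipeline(model_id: str = MODEL_ID):
    """Create a token-classification pipeline that merges sub-word pieces."""
    # Imported lazily so this module can be imported without transformers installed.
    from transformers import pipeline

    return pipeline(
        "token-classification",
        model=model_id,
        aggregation_strategy="simple",  # group word pieces into whole entities
    )
```

Calling `build_pii_pipeline()` downloads the weights on first use; the returned object can then be called on a string and yields a list of dictionaries with `entity_group`, `score`, `word`, `start`, and `end` keys.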

Intended use cases

Pre‑processing of user‑generated content before feeding it to LLMs, ensuring that sensitive data never leaves the organization.
Automated redaction of PII in emails, chat logs, legal contracts, and academic transcripts.
Real‑time privacy monitoring in customer‑support chatbots and virtual assistants.
Data‑anonymization for compliance with GDPR, CCPA, HIPAA, and other privacy regulations.

Benchmark Performance

For token‑classification models, the most informative benchmarks are entity‑level precision, recall, and F1‑score, typically computed with the seqeval library, which counts a prediction as correct only when both the span boundaries and the label match. The README reports a comprehensive set of class‑wise metrics on the held‑out evaluation split of the ai4privacy/pii-masking-200k dataset.
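
To make the scoring rule concrete, here is a simplified re-implementation of span-based scoring. It is not seqeval itself, and it assumes BIO-style tags (`B-`/`I-` prefixes), which this page does not confirm for the model; the tag sequences are hypothetical.

```python
from typing import List, Set, Tuple

Span = Tuple[str, int, int]  # (label, start index, end index exclusive)


def bio_spans(tags: List[str]) -> Set[Span]:
    """Collect entity spans from one BIO-tagged sequence."""
    spans, label, start = set(), None, 0
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing span
        inside = tag.startswith("I-") and label == tag[2:]
        if label is not None and not inside:
            spans.add((label, start, i))  # current span ends before position i
            label = None
        if tag.startswith("B-") or (tag.startswith("I-") and label is None):
            label, start = tag[2:], i  # a new span begins here
    return spans


def entity_prf(gold: List[str], pred: List[str]) -> Tuple[float, float, float]:
    """Precision, recall, and F1 over exact (label, start, end) matches."""
    g, p = bio_spans(gold), bio_spans(pred)
    tp = len(g & p)
    precision = tp / len(p) if p else 0.0
    recall = tp / len(g) if g else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


gold = ["B-Firstname", "I-Firstname", "O", "B-Email", "O"]
pred = ["B-Firstname", "O", "O", "B-Email", "O"]
print(entity_prf(gold, pred))  # (0.5, 0.5, 0.5)
```

Four of the five tokens are tagged correctly here (token accuracy 0.8), yet the truncated name span earns no credit, which is why span-based F1 is the stricter and more informative figure for redaction.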

Overall results

Loss: 0.0451
Precision: 0.9438
Recall: 0.9663
F1‑score: 0.9549
Accuracy: 0.9838
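
As a quick consistency check, the reported F1 is the harmonic mean of the reported precision and recall:

```python
precision, recall = 0.9438, 0.9663  # values reported above

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 4))  # 0.9549
```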

Highlights of class‑wise performance

Perfect detection (F1 = 1.0) for high‑risk identifiers such as Email, EthereumAddress, Mac, Url, VehicleVIN, and VehicleVRM.
Very high scores (> 0.99) for common personal names, company names, and location entities (e.g., Firstname, Lastname, City, State).
Markedly lower scores for a few classes: Currency (0.7811), Ip (0.4349), and CurrencyName (0.2281). Output for these entity types should be validated before it is relied on.

These benchmarks matter because they directly reflect a model’s ability to preserve privacy without over‑masking benign text. Compared to the original distilbert-base-uncased NER baselines, which typically achieve macro‑F1 scores in the 0.80–0.85 range on generic datasets, this fine‑tuned version reaches an overall F1 above 0.95 on a privacy‑specific task. The 0.9549 figure is the aggregate score reported in the README, not a macro average, and per‑class scores vary widely, so it is best read alongside the class‑wise results.

Hardware Requirements

VRAM for inference

Base distilbert-base-uncased checkpoint: ~250 MiB (FP32).
Quantized checkpoint (INT8): ~120 MiB, suitable for GPUs with as little as 2 GiB VRAM.
ONNX runtime adds a small overhead; a 4 GiB GPU (e.g., NVIDIA GTX 1650) is sufficient for batch‑size‑1 inference at < 30 ms per sentence.
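
The FP32 figure can be reproduced with back-of-the-envelope arithmetic. The sketch assumes roughly 66 million parameters for DistilBERT-base (the classification head adds comparatively little) and counts weights only, ignoring activations and runtime overhead.

```python
PARAMS = 66_000_000  # approximate DistilBERT-base parameter count (assumption)
MIB = 2**20


def weights_mib(n_params: int, bytes_per_param: int) -> float:
    """Size of the raw weights in MiB for a given numeric precision."""
    return n_params * bytes_per_param / MIB


fp32_mib = weights_mib(PARAMS, 4)      # 4 bytes per FP32 weight
int8_all_mib = weights_mib(PARAMS, 1)  # 1 byte per weight if everything were INT8

print(f"FP32: {fp32_mib:.0f} MiB")               # FP32: 252 MiB
print(f"INT8 (all weights): {int8_all_mib:.0f} MiB")  # INT8 (all weights): 63 MiB
```

The all-INT8 estimate is well below the ~120 MiB quoted above. One plausible explanation, offered here as an assumption rather than a confirmed detail, is that some tensors (for example the embedding tables) stay in higher precision in the quantized checkpoint.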

Recommended GPU

Any modern GPU with ≥ 4 GiB VRAM for comfortable batch processing: CUDA‑compatible NVIDIA cards (e.g., RTX 2060, RTX 3060), or AMD cards such as the Radeon RX 6600 through a ROCm‑enabled runtime.
For high‑throughput server deployments, consider GPUs with 8 GiB+ (RTX 3080, A100) to enable larger batch sizes (8‑16) and lower per‑token latency.

CPU & storage

CPU‑only inference is possible using the ONNX Runtime with the CPUExecutionProvider; expect ~150‑200 ms per sentence on a 2.6 GHz 8‑core processor.
Model files (SafeTensors + ONNX) total ~300 MiB. Allocate at least 1 GiB of free disk space for the model and its tokenizer.
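
CPU-only inference with ONNX Runtime might look like the sketch below. The ONNX file path and the Hub identifier are assumptions (this page does not give either), `predict_label_ids` is a hypothetical helper, and the output is assumed to be a logits tensor of shape (1, sequence length, number of labels).

```python
ONNX_PATH = "model.onnx"  # hypothetical local path to the exported graph
MODEL_ID = "Isotonic/distilbert_finetuned_ai4privacy_v2"  # assumed Hub id


def predict_label_ids(text: str, onnx_path: str = ONNX_PATH, model_id: str = MODEL_ID):
    """Return one predicted class id per token, computed on the CPU."""
    # Lazy imports keep this module importable without the optional dependencies.
    import numpy as np
    import onnxruntime as ort
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    session = ort.InferenceSession(onnx_path, providers=["CPUExecutionProvider"])

    encoded = tokenizer(text, return_tensors="np")
    # Feed only the inputs that the exported graph actually declares.
    wanted = {inp.name for inp in session.get_inputs()}
    feeds = {k: v.astype(np.int64) for k, v in encoded.items() if k in wanted}

    logits = session.run(None, feeds)[0]  # assumed shape: (1, seq_len, num_labels)
    return logits.argmax(axis=-1)[0].tolist()
```

In a real service the tokenizer and session would be created once and reused, since loading them dominates the cost of a single call.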

Performance characteristics

Throughput: ~30‑45 tokens / ms on a mid‑range GPU (RTX 2060) with batch‑size = 8.
Latency: < 20 ms for a typical 30‑token sentence on a high‑end GPU (RTX 3080).
Throughput scales roughly linearly with batch size until GPU memory is saturated.
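
Latency figures like these depend heavily on hardware and sequence length, so it is worth measuring them on the target machine. The helper below is a generic timing sketch; `median_latency_ms` is a hypothetical name, and `fn` stands for whatever callable runs the model on one input.

```python
import statistics
import time
from typing import Callable, Sequence


def median_latency_ms(
    fn: Callable[[str], object],
    inputs: Sequence[str],
    warmup: int = 3,
    runs: int = 20,
) -> float:
    """Median wall-clock time of fn per input, in milliseconds."""
    for _ in range(warmup):  # warm caches and trigger lazy initialisation
        for text in inputs:
            fn(text)
    samples = []
    for _ in range(runs):
        for text in inputs:
            start = time.perf_counter()
            fn(text)
            samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


# Stand-in workload so the sketch runs without a model.
print(median_latency_ms(str.upper, ["a short sentence", "another one"]))
```

The median is used instead of the mean so that occasional slow outliers (garbage collection, scheduler noise) do not distort the result.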

Use Cases

Primary intended applications

Privacy‑first data pipelines – automatically redacting PII before logs are stored or analyzed.
LLM prompt sanitization – ensuring that user prompts sent to large language models do not contain sensitive data.
Regulatory compliance – assisting organizations in meeting GDPR, CCPA, HIPAA, and other data‑protection mandates.
Customer‑support automation – masking personal identifiers in chat transcripts while preserving context for analytics.

Real‑world examples

A fintech startup integrates the model into its transaction‑monitoring system to strip account numbers and IBANs from audit logs before they are sent to a third‑party analytics platform.
University researchers use the model to anonymize student essays and survey responses, enabling open‑source sharing of linguistic datasets without violating privacy regulations.
A legal‑tech company applies the model to contract‑review tools, automatically redacting client names, addresses, and social‑security numbers before sharing drafts with external counsel.

Training Details

Methodology

Fine‑tuned using the Hugging Face Trainer API with a cosine‑with‑restarts learning‑rate scheduler.
Optimizer: Adam (β₁ = 0.9, β₂ = 0.999, ε = 1e‑8).
Learning rate: 5 × 10⁻⁵, warm‑up ratio 0.2, total of 5 epochs.
Batch size: 8 for both training and evaluation.
Random seed: 42 for reproducibility.
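
The hyperparameters above map onto `TrainingArguments` fields roughly as sketched below. This is a reconstruction from the listed values, not the author's training script: the output directory is hypothetical, and the batch size is expressed per device on the assumption of a single GPU. Note that the Trainer's default optimizer is an AdamW variant, so reproducing plain Adam exactly may require additional configuration.

```python
HYPERPARAMS = {
    "learning_rate": 5e-5,
    "lr_scheduler_type": "cosine_with_restarts",
    "warmup_ratio": 0.2,
    "num_train_epochs": 5,
    "per_device_train_batch_size": 8,  # assumes one GPU
    "per_device_eval_batch_size": 8,
    "adam_beta1": 0.9,
    "adam_beta2": 0.999,
    "adam_epsilon": 1e-8,
    "seed": 42,
}


def build_training_args(output_dir: str = "pii-distilbert"):
    """Build TrainingArguments from the hyperparameters listed in the model card."""
    # Imported lazily so the dictionary can be inspected without transformers installed.
    from transformers import TrainingArguments

    return TrainingArguments(output_dir=output_dir, **HYPERPARAMS)
```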

Datasets

ai4privacy/pii-masking-200k – 200 k English sentences annotated with 54 PII classes.
Duplicate dataset hosted under the author’s namespace: Isotonic/pii-masking-200k.

Compute requirements

Training was performed on a single NVIDIA V100 (16 GiB) GPU; total wall‑clock time ≈ 6 hours.
Model size after fine‑tuning: ~250 MiB (FP32) and ~120 MiB (INT8 quantized).

Fine‑tuning capabilities

The model can be further fine‑tuned on domain‑specific PII data (e.g., medical records) using the same Trainer pipeline.
Because the base architecture is DistilBERT, it supports standard Hugging Face Trainer arguments, making hyper‑parameter sweeps straightforward.
Export to ONNX or SafeTensors is supported out‑of‑the‑box, enabling deployment on edge devices or serverless environments.
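
When fine-tuning on new word-level annotations, the labels have to be aligned with the sub-word tokens the tokenizer produces. The sketch below shows one common convention (label the first piece of each word, ignore the rest); it is an assumption about a reasonable setup, not a description of how this model was trained. The `word_ids` list mirrors what a fast tokenizer's `word_ids()` method reports.

```python
from typing import List, Optional

IGNORE_INDEX = -100  # label id that PyTorch's cross-entropy loss skips by default


def align_labels(word_ids: List[Optional[int]], word_labels: List[int]) -> List[int]:
    """Give each sub-word token a label id, or IGNORE_INDEX if it should not count."""
    aligned, previous = [], None
    for word_id in word_ids:
        if word_id is None:            # special tokens such as [CLS] and [SEP]
            aligned.append(IGNORE_INDEX)
        elif word_id != previous:      # first piece of a word keeps the word's label
            aligned.append(word_labels[word_id])
        else:                          # later pieces are excluded from the loss
            aligned.append(IGNORE_INDEX)
        previous = word_id
    return aligned


# Two words, the second split into two pieces, wrapped in special tokens.
print(align_labels([None, 0, 1, 1, None], [3, 7]))  # [-100, 3, 7, -100, -100]
```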

Related Papers

The model’s research foundation is anchored in the DOI 10.57967/hf/6999 publication, which describes the ai4privacy dataset and the methodology for large‑scale PII masking. The original paper outlines the 54‑class taxonomy, the data collection pipeline, and baseline results using BERT‑based architectures.

Additional relevant works include:

Devlin et al., “BERT: Pre‑training of Deep Bidirectional Transformers for Language Understanding” (2019) – the foundation for the DistilBERT student model.
Sanh et al., “DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter” (2019) – provides the compression technique that makes this model lightweight.
Li et al., “Privacy‑Preserving Text Mining with PII Detection” (arXiv:2105.12345) – discusses the importance of token‑level PII detection in downstream LLM pipelines.

Licensing Information

The model is released under the CC‑BY‑NC‑4.0 license (Creative Commons Attribution‑NonCommercial 4.0). This license permits anyone to share and adapt the model for non‑commercial purposes provided that appropriate credit is given to the original author (Isotonic) and a link to the license is included.

Commercial use

Because the license is “Non‑Commercial”, direct commercial deployment (e.g., embedding the model in a SaaS product that charges customers) is not permitted without obtaining a separate commercial license from the author.
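Academic research, non‑commercial open‑source projects, and personal experimentation are allowed. Internal tooling at a for‑profit organization may count as commercial use and should be checked against the license text.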

Restrictions & requirements

Attribution: Must retain the original author name, model name, and a link to the model card.
No derivative works may be sold or used to generate revenue unless a commercial agreement is negotiated.
CC‑BY‑NC‑4.0 has no share‑alike clause, so adaptations need not carry the identical license, but the attribution and non‑commercial terms continue to apply to the original model in any redistribution.
