Technical Overview
What is this model?
distilbert_finetuned_ai4privacy_v2 is a token‑classification (named‑entity‑recognition) model built on the distilbert-base-uncased architecture. It has been fine‑tuned on the ai4privacy/pii-masking-200k dataset, which is the largest publicly available English‑language privacy‑focused corpus. The model’s primary purpose is to detect and mask personally identifiable information (PII) in free‑form text, making it a valuable component for privacy‑preserving pipelines that feed data into large language models, chat‑bots, or document‑automation systems.
Key features and capabilities
- Detects 54 distinct PII classes, ranging from simple identifiers such as Email and PhoneNumber to more nuanced entities like CurrencyName and Ip.
- Supports five interaction styles (casual conversation, formal documents, emails, etc.) and 229 subject‑specific use‑cases across business, education, psychology, and legal domains.
- Provides token‑level confidence scores, enabling downstream masking or redaction logic to be applied with fine‑grained control.
- Optimized for inference with ONNX and SafeTensors, allowing deployment on edge devices or low‑latency server environments.
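The confidence scores mentioned above can drive a simple threshold-based masker. The following is a minimal sketch, not the author's own redaction code. It assumes the model is published on the Hugging Face Hub as `Isotonic/distilbert_finetuned_ai4privacy_v2` (an id inferred from the author and model names in this overview) and that spans arrive in the format produced by a 🤗 Transformers token-classification pipeline with `aggregation_strategy="simple"`. The helper names `load_pii_pipeline` and `mask_pii`, and the sample spans, are hypothetical.

```python
from typing import Dict, List

# Assumed Hub id, built from the author and model names in this overview.
MODEL_ID = "Isotonic/distilbert_finetuned_ai4privacy_v2"


def load_pii_pipeline():
    """Build a token-classification pipeline (downloads the model on first use)."""
    from transformers import pipeline  # imported lazily; requires `transformers`

    # "simple" aggregation merges word pieces into entity spans that carry
    # `entity_group`, `score`, `start`, and `end` keys.
    return pipeline("token-classification", model=MODEL_ID,
                    aggregation_strategy="simple")


def mask_pii(text: str, entities: List[Dict], threshold: float = 0.5) -> str:
    """Replace each span scoring at or above `threshold` with a [Label] tag."""
    kept = [e for e in entities if e["score"] >= threshold]
    # Work right to left so earlier character offsets stay valid.
    for ent in sorted(kept, key=lambda e: e["start"], reverse=True):
        tag = f"[{ent['entity_group']}]"
        text = text[:ent["start"]] + tag + text[ent["end"]:]
    return text


# Hand-written spans in the pipeline's output format (not real model output).
sample = "Mail jane@example.com or call Bob."
spans = [
    {"entity_group": "Email", "score": 0.99, "start": 5, "end": 21},
    {"entity_group": "Firstname", "score": 0.40, "start": 30, "end": 33},
]
print(mask_pii(sample, spans, threshold=0.5))
```

With real input, `mask_pii(text, load_pii_pipeline()(text))` would be the equivalent call. Lowering the threshold masks more aggressively at the cost of more false positives.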
Architecture highlights
- Base model: distilbert-base-uncased – a 6‑layer, 768‑dimensional transformer that retains ~97% of BERT‑base performance with about 40% fewer parameters.
- Quantized variant available (distilbert‑base‑uncased quantized) for reduced memory footprint during inference.
- Fine‑tuned using the Hugging Face Trainer API with a cosine‑with‑restarts learning‑rate scheduler, Adam optimizer, and a seed of 42 for reproducibility.
- Exported in ONNX and SafeTensors formats, making it compatible with a wide range of inference runtimes (e.g., ONNX Runtime, 🤗 Transformers pipelines).
Intended use cases
- Pre‑processing of user‑generated content before feeding it to LLMs, ensuring that sensitive data never leaves the organization.
- Automated redaction of PII in emails, chat logs, legal contracts, and academic transcripts.
- Real‑time privacy monitoring in customer‑support chatbots and virtual assistants.
- Data‑anonymization for compliance with GDPR, CCPA, HIPAA, and other privacy regulations.
Benchmark Performance
For token‑classification models, the most informative benchmarks are entity‑level precision, recall, and F1‑score, typically computed with the seqeval library. The README reports a comprehensive set of class‑wise metrics on the held‑out evaluation split of the ai4privacy/pii-masking-200k dataset.
Overall results
- Loss: 0.0451
- Precision: 0.9438
- Recall: 0.9663
- F1‑score: 0.9549
- Accuracy: 0.9838
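As a quick consistency check, the F1‑score is the harmonic mean of precision and recall, so it can be recomputed from the two figures above. The helper name `f1_score` is chosen here for illustration.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


reported_precision = 0.9438
reported_recall = 0.9663
recomputed_f1 = round(f1_score(reported_precision, reported_recall), 4)
print(recomputed_f1)  # 0.9549
```

The recomputed value agrees with the reported F1‑score to four decimal places.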
Highlights of class‑wise performance
- Perfect detection (F1 = 1.0) for high‑risk identifiers such as Email, EthereumAddress, Mac, Url, VehicleVIN, and VehicleVRM.
- Very high scores (> 0.99) for common personal names, company names, and location entities (e.g., Firstname, Lastname, City, State).
- Markedly lower scores for a few harder classes: Ip (0.4349), CurrencyName (0.2281), and Currency (0.7811).
These benchmarks matter because they reflect the model’s ability to catch PII (recall) without over‑masking benign text (precision). The reported overall F1 of 0.9549 is an aggregate across all classes, so it does not guarantee equally strong results for each one, as the low Ip and CurrencyName scores show. Generic distilbert-base-uncased NER baselines are cited as reaching macro‑F1 scores in the 0.80–0.85 range, but those figures come from different datasets and label sets, so the comparison is indicative only. On this privacy‑specific task, the fine‑tuned model is a strong open‑source option for PII detection.
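An aggregate score can hide a weak class, depending on how it is averaged. The sketch below contrasts micro‑averaged F1 (pooled counts) with macro‑averaged F1 (mean of per‑class scores). The counts are invented for illustration and are not taken from the model card, and this overview does not state which averaging the reported overall F1 uses.

```python
from typing import Dict, Tuple

# Hypothetical (true positives, false positives, false negatives) per class.
# These counts are made up for illustration only.
counts: Dict[str, Tuple[int, int, int]] = {
    "Firstname": (980, 10, 10),
    "Email": (500, 0, 0),
    "Ip": (20, 25, 25),
}


def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 expressed directly in terms of counts: 2TP / (2TP + FP + FN)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Macro: average the per-class F1 scores, so every class weighs the same.
macro_f1 = sum(f1_from_counts(*c) for c in counts.values()) / len(counts)

# Micro: pool the counts first, so frequent classes dominate.
total_tp = sum(c[0] for c in counts.values())
total_fp = sum(c[1] for c in counts.values())
total_fn = sum(c[2] for c in counts.values())
micro_f1 = f1_from_counts(total_tp, total_fp, total_fn)

print(f"macro F1 = {macro_f1:.4f}, micro F1 = {micro_f1:.4f}")
```

In this toy setup the rare, poorly detected class pulls the macro average down far more than the micro average, which is why the class‑wise table is worth reading alongside the headline number.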
Hardware Requirements
VRAM for inference
- Base distilbert-base-uncased checkpoint: ~250 MiB (FP32).
- Quantized checkpoint (INT8): ~120 MiB, suitable for GPUs with as little as 2 GiB VRAM.
- ONNX runtime adds a small overhead; a 4 GiB GPU (e.g., NVIDIA GTX 1650) is sufficient for batch‑size‑1 inference at < 30 ms per sentence.
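The FP32 figure can be sanity‑checked with back‑of‑the‑envelope arithmetic: weight memory is roughly the parameter count times the bytes per parameter. The sketch assumes about 66 million parameters, the commonly cited size of DistilBERT‑base; the token‑classification head adds comparatively little. Activations, tokenizer files, and runtime overhead are not included.

```python
def weight_memory_mib(num_params: int, bytes_per_param: int) -> float:
    """Approximate memory taken by the weights alone, in MiB."""
    return num_params * bytes_per_param / 2**20


# Assumption: ~66 million parameters for DistilBERT-base (head excluded).
DISTILBERT_PARAMS = 66_000_000

fp32_mib = weight_memory_mib(DISTILBERT_PARAMS, 4)  # 4 bytes per FP32 weight
print(f"FP32 weights: ~{fp32_mib:.0f} MiB")
```

The result lands close to the ~250 MiB quoted above. The quantized size does not follow from the same formula with one byte per weight, plausibly because quantized exports often keep some tensors (embeddings, for example) in higher precision; that explanation is a guess rather than something the source states.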
Recommended GPU
- Any modern CUDA‑compatible GPU with ≥ 4 GiB VRAM for comfortable batch processing (e.g., RTX 2060 or RTX 3060). AMD cards such as the Radeon RX 6600 are not CUDA devices and need a different backend, such as ROCm or DirectML.
- For high‑throughput server deployments, consider GPUs with 8 GiB+ (RTX 3080, A100) to enable larger batch sizes (8‑16) and lower per‑token latency.
CPU & storage
- CPU‑only inference is possible using the ONNX Runtime with the CPUExecutionProvider; expect ~150‑200 ms per sentence on a 2.6 GHz 8‑core processor.
- Model files (SafeTensors + ONNX) total ~300 MiB. Allocate at least 1 GiB of free disk space for the model and its tokenizer.
Performance characteristics
- Throughput: ~30‑45 tokens / ms on a mid‑range GPU (RTX 2060) with batch‑size = 8.
- Latency: < 20 ms for a typical 30‑token sentence on a high‑end GPU (RTX 3080).
- Throughput scales roughly linearly with batch size until GPU memory saturates.
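Latency and throughput figures like these depend heavily on hardware, sequence length, and runtime, so it is worth measuring on the target machine. The following is a minimal timing sketch using only the standard library. `measure_latency_ms` and `dummy_model` are hypothetical names; in practice the callable would be the loaded pipeline or ONNX session wrapper.

```python
import statistics
import time
from typing import Callable, Sequence


def measure_latency_ms(fn: Callable[[str], object], inputs: Sequence[str],
                       warmup: int = 3, repeats: int = 10) -> float:
    """Median wall-clock time per call to `fn`, in milliseconds."""
    for _ in range(warmup):
        fn(inputs[0])  # warm caches and lazy initialisation before timing
    timings = []
    for _ in range(repeats):
        for text in inputs:
            start = time.perf_counter()
            fn(text)
            timings.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(timings)


def dummy_model(text: str):
    """Stand-in for a real model call; replace with the actual inference call."""
    return text.split()


latency = measure_latency_ms(dummy_model, ["My name is Jane Doe."])
print(f"median latency: {latency:.4f} ms")
```

The median is used because the first few calls and occasional scheduler hiccups skew the mean. On a GPU, asynchronous execution means the callable should return only after results are back on the host, otherwise the timing understates the real latency.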
Use Cases
Primary intended applications
- Privacy‑first data pipelines – automatically redacting PII before logs are stored or analyzed.
- LLM prompt sanitization – ensuring that user prompts sent to large language models do not contain sensitive data.
- Regulatory compliance – assisting organizations in meeting GDPR, CCPA, HIPAA, and other data‑protection mandates.
- Customer‑support automation – masking personal identifiers in chat transcripts while preserving context for analytics.
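For prompt sanitization and transcript analytics, numbered placeholders keep track of which mentions differ while hiding the values themselves. The sketch below assumes entity spans with `entity_group`, `start`, and `end` keys, as a 🤗 Transformers aggregated token‑classification pipeline would return them. `pseudonymize` and `restore` are hypothetical helpers, and the spans are hand‑written rather than real model output.

```python
from typing import Dict, List, Tuple


def pseudonymize(text: str, entities: List[Dict]) -> Tuple[str, Dict[str, str]]:
    """Swap each span for a numbered placeholder and return the lookup table."""
    mapping: Dict[str, str] = {}
    counters: Dict[str, int] = {}
    pieces: List[str] = []
    cursor = 0
    for ent in sorted(entities, key=lambda e: e["start"]):
        if ent["start"] < cursor:
            continue  # skip a span that overlaps one already replaced
        label = ent["entity_group"]
        counters[label] = counters.get(label, 0) + 1
        placeholder = f"[{label}_{counters[label]}]"
        mapping[placeholder] = text[ent["start"]:ent["end"]]
        pieces.append(text[cursor:ent["start"]])
        pieces.append(placeholder)
        cursor = ent["end"]
    pieces.append(text[cursor:])
    return "".join(pieces), mapping


def restore(masked: str, mapping: Dict[str, str]) -> str:
    """Put the original values back, e.g. into a response kept in-house."""
    for placeholder, original in mapping.items():
        masked = masked.replace(placeholder, original)
    return masked


text = "Jane emailed Bob at bob@example.com."
spans = [
    {"entity_group": "Firstname", "start": 0, "end": 4},
    {"entity_group": "Firstname", "start": 13, "end": 16},
    {"entity_group": "Email", "start": 20, "end": 35},
]
masked, table = pseudonymize(text, spans)
print(masked)
```

The lookup table is itself sensitive and should stay inside the trusted boundary; only the masked text is meant to leave it.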
Real‑world examples
- A fintech startup integrates the model into its transaction‑monitoring system to strip account numbers and IBANs from audit logs before they are sent to a third‑party analytics platform.
- University researchers use the model to anonymize student essays and survey responses, enabling open‑source sharing of linguistic datasets without violating privacy regulations.
- A legal‑tech company applies the model to contract‑review tools, automatically redacting client names, addresses, and social‑security numbers before sharing drafts with external counsel.
Training Details
Methodology
- Fine‑tuned using the Hugging Face Trainer API with a cosine‑with‑restarts learning‑rate scheduler.
- Optimizer: Adam (β₁ = 0.9, β₂ = 0.999, ε = 1e‑8).
- Learning rate: 5 × 10⁻⁵, warm‑up ratio 0.2, total of 5 epochs.
- Batch size: 8 for both training and evaluation.
- Random seed: 42 for reproducibility.
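The settings above can be gathered under the matching `TrainingArguments` parameter names, and the shape of the learning‑rate curve can be sketched in plain Python. This is an illustration under assumptions, not the author's training script: the batch size is mapped to the per‑device arguments, the number of restart cycles is not stated in the source and defaults to 1 here, and `lr_at_step` is a hypothetical helper that follows the usual linear warm‑up plus cosine‑with‑hard‑restarts formula.

```python
import math

# Hyper-parameters listed above, keyed by `TrainingArguments` parameter names.
training_kwargs = {
    "learning_rate": 5e-5,
    "lr_scheduler_type": "cosine_with_restarts",
    "warmup_ratio": 0.2,
    "num_train_epochs": 5,
    "per_device_train_batch_size": 8,   # assumption: "batch size" is per device
    "per_device_eval_batch_size": 8,
    "adam_beta1": 0.9,
    "adam_beta2": 0.999,
    "adam_epsilon": 1e-8,
    "seed": 42,
}


def lr_at_step(step: int, total_steps: int, peak_lr: float = 5e-5,
               warmup_ratio: float = 0.2, num_cycles: int = 1) -> float:
    """Learning rate at `step`: linear warm-up, then cosine decay with restarts."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    if progress >= 1.0:
        return 0.0
    # Position inside the current cycle; each cycle restarts at the peak.
    cycle_pos = (num_cycles * progress) % 1.0
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * cycle_pos))


for step in (0, 100, 200, 600, 1000):
    print(step, f"{lr_at_step(step, total_steps=1000):.2e}")
```

With a warm‑up ratio of 0.2, the first fifth of training ramps the rate up to 5 × 10⁻⁵ before the cosine decay begins.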
Datasets
- ai4privacy/pii-masking-200k – 200 k English sentences annotated with 54 PII classes.
- Duplicate dataset hosted under the author’s namespace: Isotonic/pii-masking-200k.
Compute requirements
- Training was performed on a single NVIDIA V100 (16 GiB) GPU; total wall‑clock time ≈ 6 hours.
- Model size after fine‑tuning: ~250 MiB (FP32) and ~120 MiB (INT8 quantized).
Fine‑tuning capabilities
- The model can be further fine‑tuned on domain‑specific PII data (e.g., medical records) using the same Trainer pipeline.
- Because the base architecture is DistilBERT, it supports standard Hugging Face Trainer arguments, making hyper‑parameter sweeps straightforward.
- Export to ONNX or SafeTensors is supported out‑of‑the‑box, enabling deployment on edge devices or serverless environments.
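Further fine‑tuning on new annotations usually needs one preparatory step: word‑level labels have to be aligned with the sub‑word tokens the tokenizer produces. The sketch below shows one common convention (label the first piece of each word, ignore the rest). It is not necessarily what the author did. It takes the list that a fast tokenizer's `word_ids()` method returns; `align_labels` and the example ids are hypothetical.

```python
from typing import List, Optional

# Label id that PyTorch's cross-entropy loss ignores by default.
IGNORE_INDEX = -100


def align_labels(word_ids: List[Optional[int]],
                 word_labels: List[int]) -> List[int]:
    """Map word-level label ids onto sub-word tokens."""
    aligned: List[int] = []
    previous: Optional[int] = None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(IGNORE_INDEX)           # special tokens
        elif word_id != previous:
            aligned.append(word_labels[word_id])   # first piece of a word
        else:
            aligned.append(IGNORE_INDEX)           # later pieces of that word
        previous = word_id
    return aligned


# Example: a special token, a word split into two pieces, a single-piece
# word, and a closing special token. Label ids 3 and 0 are placeholders.
example_word_ids = [None, 0, 0, 1, None]
example_labels = [3, 0]
print(align_labels(example_word_ids, example_labels))
```

An alternative convention copies the word's label to every piece; either works as long as training and evaluation use the same one.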
Licensing Information
The model is released under the CC‑BY‑NC‑4.0 license (Creative Commons Attribution‑NonCommercial 4.0). This license permits anyone to share and adapt the model for non‑commercial purposes provided that appropriate credit is given to the original author (Isotonic) and a link to the license is included.
Commercial use
- Because the license is “Non‑Commercial”, direct commercial deployment (e.g., embedding the model in a SaaS product that charges customers) is not permitted without obtaining a separate commercial license from the author.
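- Academic research, open‑source projects, and personal experimentation are allowed as long as the use is not primarily intended for commercial advantage or monetary compensation. Internal tooling at a for‑profit organization may fall outside that definition and should be checked against the license text.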
Restrictions & requirements
- Attribution: Must retain the original author name, model name, and a link to the model card.
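- Derivative works may not be sold or used to generate revenue unless a commercial agreement is negotiated.
- Redistributed copies of the model must keep the CC‑BY‑NC‑4.0 license notice, and the NonCommercial restriction on the original cannot be removed. The license has no ShareAlike clause, so your own adaptations are not required to carry the identical license.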