Technical Overview
The stanford‑deidentifier‑base model (model ID StanfordAIMI/stanford-deidentifier-base) is a token‑classification transformer hosted on Hugging Face and fine‑tuned for automated de‑identification of radiology and biomedical texts. It scans free‑form clinical narratives and locates protected health information (PHI) at the token level, so that each PHI span can be replaced with a realistic surrogate. This “hide‑in‑plain‑sight” approach preserves readability while reducing the risk of re‑identification.
Key features & capabilities
- Specialised for radiology reports, CT notes, and other biomedical documents.
- Detects a wide range of PHI types: patient names, dates, IDs, phone numbers, provider names, and more.
- Produces token‑level BIO tags, enabling downstream pipelines to mask or replace identified entities.
- Built on a PubMed‑BERT‑style encoder (uncased, biomedical‑vocab) for domain‑specific language understanding.
- Deployable on cloud platforms (e.g., Azure, AWS) or on‑premise inference servers; the repository carries the endpoints_compatible tag.
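As a rough illustration of the masking step mentioned in the list above, the sketch below replaces detected spans with bracketed labels. It assumes the span format returned by the transformers token‑classification pipeline when an aggregation strategy is set (dicts with "start", "end" and "entity_group"). The helper names mask_entities and detect_phi, the sample sentence, and the labels HCW and DATE are illustrative assumptions, not taken from the model card; the real label set should be read from the model's config.

```python
from typing import Dict, List


def mask_entities(text: str, entities: List[Dict]) -> str:
    """Replace each detected span with a bracketed label such as [DATE]."""
    # Work right to left so the character offsets of earlier spans stay valid.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        text = text[: ent["start"]] + f"[{ent['entity_group']}]" + text[ent["end"]:]
    return text


def detect_phi(text: str) -> List[Dict]:
    """Run the model. Needs transformers, a backend such as PyTorch, and network access."""
    from transformers import pipeline  # imported lazily so the rest runs without it

    tagger = pipeline(
        "token-classification",
        model="StanfordAIMI/stanford-deidentifier-base",
        aggregation_strategy="simple",  # merge sub-tokens into whole spans
    )
    return tagger(text)


# Hand-written spans standing in for model output (labels are illustrative).
report = "Seen by Dr. Smith on 03/14/2020."
spans = [
    {"entity_group": "HCW", "start": 12, "end": 17},
    {"entity_group": "DATE", "start": 21, "end": 31},
]
masked = mask_entities(report, spans)
print(masked)
```

In a real pipeline, `mask_entities(report, detect_phi(report))` would chain the two steps; the hand-written spans are used here only so the masking logic can be checked without downloading the model.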
Architecture highlights
- Base architecture: BERT‑base (12 transformer layers, 768 hidden size, 12 attention heads).
- Pre‑trained on PubMed abstracts (PubMedBERT) → strong biomedical token embeddings.
- Fine‑tuned for the token‑classification pipeline, with a label2id mapping for the PHI categories.
- Uncased vocabulary, so input is handled case‑insensitively, which is useful for heterogeneous clinical notes.
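To make the tagging output more concrete, here is a minimal sketch of how per‑token label IDs could be decoded into BIO tags and grouped into spans. The id2label table and the PATIENT and DATE labels are hypothetical placeholders; the actual mapping ships with the model's config and may differ.

```python
from typing import Dict, List, Tuple

# Hypothetical mapping, used only for illustration.
id2label: Dict[int, str] = {
    0: "O",
    1: "B-PATIENT",
    2: "I-PATIENT",
    3: "B-DATE",
    4: "I-DATE",
}


def bio_to_spans(tags: List[str]) -> List[Tuple[str, int, int]]:
    """Group BIO tags into (label, start, end) token spans, end exclusive."""
    spans = []
    label, start = None, 0
    for i, tag in enumerate(tags + ["O"]):  # trailing sentinel flushes the last span
        inside = tag.startswith("I-") and label == tag[2:]
        if label is not None and not inside:
            spans.append((label, start, i))  # close the currently open span
            label = None
        # Open a span on B-, or leniently on an I- that has no open span to continue.
        if tag.startswith("B-") or (tag.startswith("I-") and label is None):
            label, start = tag[2:], i
    return spans


pred_ids = [0, 1, 2, 0, 3]  # e.g. the argmax over the logits for five tokens
tags = [id2label[i] for i in pred_ids]
spans = bio_to_spans(tags)
print(spans)
```

The lenient handling of a stray I‑ tag is a design choice of this sketch: for de‑identification it seems safer to keep a malformed span than to drop it.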
Intended use cases
- Automatic de‑identification of radiology reports before data sharing or research.
- Pre‑processing step for building PHI‑free corpora for machine‑learning research.
- Integration into clinical data pipelines (e.g., HL7/FHIR ingest) to meet HIPAA‑compliant privacy standards.
- Support for “synthetic data” generation by replacing real PHI with realistic surrogates.
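The surrogate replacement mentioned above can be sketched as follows. This is a toy version under stated assumptions, not the authors' implementation: the surrogate pools, the function name replace_with_surrogates, and the PATIENT and DATE labels are all hypothetical, and the spans are hand-written in the shape a token‑classification pipeline would return.

```python
import random
from typing import Dict, List

# Hypothetical surrogate pools; a real system would use far larger lists.
SURROGATES = {
    "PATIENT": ["Alex Morgan", "Jamie Lee", "Sam Rivera"],
    "DATE": ["01/02/2001", "11/23/2015", "07/30/2009"],
}


def replace_with_surrogates(text: str, entities: List[Dict], seed: int = 0) -> str:
    """Swap each PHI span for a surrogate of the same category."""
    rng = random.Random(seed)
    chosen = {}  # (label, original string) -> surrogate, reused within the document
    out = text
    # Right to left, so the character offsets of earlier spans stay valid.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        label = ent["entity_group"]
        key = (label, text[ent["start"]:ent["end"]])
        if key not in chosen:
            pool = SURROGATES.get(label)
            # Fall back to a bracketed label when no pool exists for the category.
            chosen[key] = rng.choice(pool) if pool else f"[{label}]"
        out = out[: ent["start"]] + chosen[key] + out[ent["end"]:]
    return out


note = "John Doe was scanned on 03/14/2020. John Doe tolerated it well."
phi = [
    {"entity_group": "PATIENT", "start": 0, "end": 8},
    {"entity_group": "DATE", "start": 24, "end": 34},
    {"entity_group": "PATIENT", "start": 36, "end": 44},
]
surrogate_note = replace_with_surrogates(note, phi)
print(surrogate_note)
```

Reusing one surrogate per original string keeps repeated mentions consistent, so the rewritten note still reads as being about a single patient.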
Benchmark Performance
Benchmarking for de‑identification focuses on precision, recall, and F1‑score at the token level because false negatives (missed PHI) can cause privacy breaches, while false positives degrade data utility. The authors reported the following results on multiple test sets:
- Radiology reports (known institution): F1 = 97.9 %
- Radiology reports (new institution): F1 = 99.6 %
- i2b2 2006 challenge set: F1 = 99.5 %
- i2b2 2014 challenge set: F1 = 98.9 %
- Recall for core PHI spans on the known‑institution set: 99.1 %
According to the authors, these scores surpass previously released de‑identifiers and even human annotators on the i2b2 2014 data, which suggests the model is suitable for production use where both privacy safety and data fidelity are critical.
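For readers who want to reproduce this kind of metric on their own annotations, the sketch below computes token‑level precision, recall, and F1 in a binary PHI‑versus‑non‑PHI view. That is one common convention and not necessarily the authors' exact evaluation protocol; the function name token_prf and the toy label sequences are illustrative.

```python
from typing import List, Tuple


def token_prf(gold: List[str], pred: List[str], outside: str = "O") -> Tuple[float, float, float]:
    """Token-level precision, recall and F1, treating any non-outside tag as PHI."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    tp = sum(1 for g, p in zip(gold, pred) if g != outside and p != outside)
    fp = sum(1 for g, p in zip(gold, pred) if g == outside and p != outside)  # over-redaction
    fn = sum(1 for g, p in zip(gold, pred) if g != outside and p == outside)  # missed PHI
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


gold = ["O", "PHI", "PHI", "O", "PHI", "O"]
pred = ["O", "PHI", "O", "PHI", "PHI", "O"]
precision, recall, f1 = token_prf(gold, pred)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```

In this toy example one PHI token is missed and one non‑PHI token is flagged, so precision and recall are both 2/3. In practice the missed token is the costlier error, which is why recall is reported separately.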
Hardware Requirements
For inference, the model fits comfortably on a single modern GPU. The BERT‑base encoder (≈110 M parameters) typically requires 4–6 GB VRAM for a batch size of 1–2 sentences. Larger batch sizes (e.g., 8–16) benefit from GPUs with 12 GB or more of VRAM, such as the NVIDIA RTX 3060 (12 GB), V100, or A100.
- GPU recommendation: NVIDIA RTX 3060 (12 GB) or higher; for high‑throughput pipelines, consider A100 (40 GB) or V100 (32 GB).
- CPU: 8‑core Intel Xeon or AMD Ryzen 7+, with at least 16 GB RAM for tokenization and data loading.
- Storage: Model files total ≈ 800 MB (weights, config, tokenizer). SSD preferred for fast loading.
- Latency: ~30 ms per sentence on an RTX 3080; ~80 ms on a mid‑range CPU.
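The ≈110 M parameter figure can be sanity‑checked with a back‑of‑the‑envelope count for a BERT‑base encoder. The sketch below assumes the standard BERT‑base settings (vocabulary of 30,522 entries, 512 positions, feed‑forward size 3,072); these values are assumptions and should be compared against the model's config. The count excludes the pooler and the classification head, and the memory figure covers the weights only, not activations or framework overhead.

```python
def bert_encoder_params(
    layers: int = 12,
    hidden: int = 768,
    intermediate: int = 3072,  # assumed feed-forward size for BERT-base
    vocab: int = 30522,        # assumed vocabulary size
    max_pos: int = 512,        # assumed maximum sequence length
    type_vocab: int = 2,
) -> int:
    """Count the weights and biases in the embeddings and transformer layers."""
    # Word, position and segment embeddings plus one LayerNorm (scale and bias).
    embeddings = (vocab + max_pos + type_vocab) * hidden + 2 * hidden
    # Query, key, value and output projections plus one LayerNorm.
    attention = 4 * (hidden * hidden + hidden) + 2 * hidden
    # Two feed-forward projections plus one LayerNorm.
    ffn = (hidden * intermediate + intermediate) + (intermediate * hidden + hidden) + 2 * hidden
    return embeddings + layers * (attention + ffn)


params = bert_encoder_params()
fp32_mb = params * 4 / 1e6  # 4 bytes per float32 weight
fp16_mb = params * 2 / 1e6  # 2 bytes per float16 weight
print(f"{params:,} parameters, ~{fp32_mb:.0f} MB in fp32, ~{fp16_mb:.0f} MB in fp16")
```

Under these assumptions the weights alone occupy well under 1 GB, so most of the VRAM budget quoted above would go to activations, the CUDA context, and batching. On‑disk size can differ from this estimate, for example when a repository stores more than one serialisation of the weights.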
Use Cases
The model shines in any workflow that must remove PHI from clinical narratives while keeping the text usable for downstream analytics.
- Hospital data warehouses: Automated scrubbing of radiology reports before they are indexed for research.
- Pharmaceutical R&D: De‑identifying trial notes to share with external partners.
- Health‑tech startups: Building HIPAA‑compliant NLP services (e.g., automated report summarization).
- Regulatory compliance tools: Integrating into FHIR pipelines to ensure outbound messages contain no PHI.
Training Details
Training leveraged a large, multi‑institutional corpus of radiology reports and medical notes:
- Datasets: 999 chest X‑ray & CT reports (Nov 2019‑Nov 2020) + 3 001 X‑ray notes + 2 193 medical notes = 6 193 documents, all token‑level annotated for PHI.
- Pre‑training: PubMedBERT (uncased) weights as the base encoder.
- Fine‑tuning: Token‑classification head trained with cross‑entropy loss, using data‑augmentation (synthetic PHI generation) and “hide‑in‑plain‑sight” surrogate replacement.
- Compute: Trained on 4 × NVIDIA V100 GPUs (≈ 32 GB each) for ~ 12 hours (batch size 32, learning rate 3e‑5, 3 epochs).
- Fine‑tuning capability: Users can further fine‑tune the model on institution‑specific PHI patterns by loading the released weights and supplying a small annotated set (few‑shot fine‑tuning).
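One practical detail when fine‑tuning a token‑classification head is aligning word‑level annotations with sub‑word tokens. The sketch below shows a common convention, not necessarily the one the authors used: only the first sub‑token of each word keeps its label, and everything else receives -100, the default ignore_index of PyTorch's cross‑entropy loss. The word_ids list is hand‑written here in the shape that BatchEncoding.word_ids() returns for fast tokenizers; the function name align_labels and the label IDs are illustrative.

```python
from typing import List, Optional


def align_labels(
    word_ids: List[Optional[int]],
    word_labels: List[int],
    ignore_index: int = -100,
) -> List[int]:
    """Expand word-level label IDs to one label per sub-word token."""
    aligned = []
    previous = None
    for wid in word_ids:
        if wid is None:
            aligned.append(ignore_index)      # special tokens such as [CLS] and [SEP]
        elif wid != previous:
            aligned.append(word_labels[wid])  # first sub-token of a word keeps the label
        else:
            aligned.append(ignore_index)      # continuation sub-tokens are skipped by the loss
        previous = wid
    return aligned


# Three words; the second one is split into two sub-tokens by the tokenizer.
word_ids = [None, 0, 1, 1, 2, None]
word_labels = [0, 1, 0]  # hypothetical label IDs, e.g. 0 = O and 1 = PATIENT
token_labels = align_labels(word_ids, word_labels)
print(token_labels)
```

The resulting list can be passed as the labels field of a training example, so that the loss is computed once per annotated word.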
Licensing Information
The repository tags include license:mit, but the model card’s top‑level field shows License: unknown. The tag indicates that the code and weights are distributed under the MIT License, which is permissive:
- Allows commercial, academic, and private use without fee.
- Requires preservation of the original copyright notice and license text in redistributed copies.
- No warranty; you assume all risk.
If you intend to embed the model in a proprietary product, retain the MIT notice in your documentation or “About” section. If the MIT terms cannot be confirmed, treat the model as “source‑available” and contact the authors (StanfordAIMI) before commercial redistribution.