stanford-deidentifier-base

The stanford‑deidentifier‑base model (model ID StanfordAIMI/stanford-deidentifier-base) is a Hugging Face token‑classification transformer fine‑tuned for automated de‑identification of radiology and biomedical texts. It scans free‑form clinical narratives and labels protected health information (PHI) at the token level. In the authors' pipeline, each detected PHI span is then replaced with a realistic surrogate, a “hide‑in‑plain‑sight” approach that keeps the text readable while reducing the risk of re‑identification.

  • Author: StanfordAIMI
  • Downloads: 820K
  • License: mit
  • Pipeline: Token Classification
  • Frameworks: transformers, pytorch
  • Languages: en
  • Datasets: radreports
  • Tags: bert, token-classification, sequence-tagger-model, pubmedbert, uncased, radiology, biomedical, bdf-toolbox

Technical Overview

Key features & capabilities

  • Specialised for radiology reports, CT notes, and other biomedical documents.
  • Detects a wide range of PHI types: patient names, dates, IDs, phone numbers, provider names, and more.
  • Produces token‑level PHI labels (the exact tag set is listed in the id2label mapping of the model's config), so downstream pipelines can mask or replace the identified entities.
  • Built on a PubMed‑BERT‑style encoder (uncased, biomedical‑vocab) for domain‑specific language understanding.
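  • Carries the endpoints_compatible tag, which marks it as deployable through Hugging Face Inference Endpoints; like any transformers model, it can also be served from Azure, AWS, or on‑premise inference servers.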
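
The masking step mentioned above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: detect_phi and mask_entities are hypothetical helper names, the label names in the sample spans are made up for the example, and the span format assumes the transformers token-classification pipeline with aggregation_strategy="simple", which returns character offsets under "start" and "end".

```python
from typing import Dict, List

MODEL_ID = "StanfordAIMI/stanford-deidentifier-base"


def detect_phi(text: str) -> List[Dict]:
    """Run the model and return aggregated PHI spans (needs transformers and torch)."""
    # Imported lazily so that mask_entities below works without the dependency.
    from transformers import pipeline

    tagger = pipeline("token-classification", model=MODEL_ID, aggregation_strategy="simple")
    return tagger(text)


def mask_entities(text: str, entities: List[Dict]) -> str:
    """Replace each detected span with a [LABEL] placeholder."""
    # Work right to left so earlier character offsets stay valid after each replacement.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        text = text[: ent["start"]] + f"[{ent['entity_group']}]" + text[ent["end"]:]
    return text


# Hand-written spans in the pipeline's output format; the label names are illustrative only.
report = "Patient John Smith was seen on 03/04/2020."
spans = [
    {"entity_group": "PATIENT", "start": 8, "end": 18},
    {"entity_group": "DATE", "start": 31, "end": 41},
]
masked = mask_entities(report, spans)
print(masked)  # Patient [PATIENT] was seen on [DATE].
```

In practice the spans would come from detect_phi(report), and the pipeline object would be created once and reused rather than rebuilt on every call.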

Architecture highlights

  • Base architecture: BERT‑base (12 transformer layers, 768 hidden size, 12 attention heads).
  • Pre‑trained on PubMed abstracts (PubMedBERT) → strong biomedical token embeddings.
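  • Fine‑tuned with a token‑classification head that outputs one score per PHI category for every token; the label2id and id2label mappings in the model config translate between class indices and category names.
  • Uncased vocabulary: input is lower‑cased before tokenization, so predictions do not depend on capitalization, which varies widely across clinical notes.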
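
As a rough check on the architecture figures above, the parameter count can be derived from the layer dimensions. The sketch below assumes a 30,522‑entry vocabulary, 512 positions, and a 3,072‑wide feed‑forward layer (the standard BERT‑base values, not stated on this page), and it leaves out the pooler, which token‑classification heads normally do not use. The number of labels passed to classifier_params is a placeholder.

```python
def bert_encoder_params(vocab=30522, hidden=768, layers=12, ffn=3072, max_pos=512, type_vocab=2):
    """Count the weights of a BERT-style encoder, excluding the pooler."""
    layer_norm = 2 * hidden  # scale and bias
    embeddings = (vocab + max_pos + type_vocab) * hidden + layer_norm
    # Q, K, V and output projections; the head count does not change the total.
    attention = 4 * (hidden * hidden + hidden) + layer_norm
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden) + layer_norm
    return embeddings + layers * (attention + feed_forward)


def classifier_params(num_labels, hidden=768):
    """Weights of a linear token-classification head."""
    return hidden * num_labels + num_labels


encoder = bert_encoder_params()
print(encoder)  # 108891648
print(round(encoder * 4 / 1e6), "MB of weights in float32")  # 436 MB of weights in float32
print(classifier_params(num_labels=8))  # 6152 (8 labels is a placeholder value)
```

The result is consistent with the ≈110 M parameters usually quoted for BERT‑base. A repository may ship more than one weight format, so the download size can exceed the float32 figure.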

Intended use cases

  • Automatic de‑identification of radiology reports before data sharing or research.
  • Pre‑processing step for building PHI‑free corpora for machine‑learning research.
  • Integration into clinical data pipelines (e.g., HL7/FHIR ingest) to meet HIPAA‑compliant privacy standards.
  • Support for “synthetic data” generation by replacing real PHI with realistic surrogates.
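
The surrogate replacement mentioned in the last item can be sketched in a few lines. The idea behind “hide in plain sight” is that any PHI the detector misses becomes hard to tell apart from the surrogates around it. The surrogate pools, label names, and the replace_with_surrogates helper below are hypothetical; the authors' generator is not described on this page and is certainly more elaborate.

```python
import random
from typing import Dict, List

# Hypothetical surrogate pools; a real system would use larger, format-aware generators.
SURROGATES = {
    "PATIENT": ["Alex Morgan", "Jamie Lee", "Sam Carter"],
    "DATE": ["01/15/2021", "07/22/2019", "11/03/2020"],
}


def replace_with_surrogates(text: str, entities: List[Dict], seed: int = 0) -> str:
    """Swap each detected span for a random surrogate of the same category."""
    rng = random.Random(seed)  # seeded so repeated runs give the same output
    # Right-to-left replacement keeps the remaining character offsets valid.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        pool = SURROGATES.get(ent["entity_group"])
        # Fall back to a placeholder when no pool exists for the category.
        surrogate = rng.choice(pool) if pool else f"[{ent['entity_group']}]"
        text = text[: ent["start"]] + surrogate + text[ent["end"]:]
    return text


report = "Patient John Smith was seen on 03/04/2020."
spans = [
    {"entity_group": "PATIENT", "start": 8, "end": 18},
    {"entity_group": "DATE", "start": 31, "end": 41},
]
surrogate_report = replace_with_surrogates(report, spans)
print(surrogate_report)
```

Unlike bracketed placeholders, the output still reads like an ordinary report, which is what keeps it usable for downstream NLP.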

Benchmark Performance

Benchmarking for de‑identification focuses on precision, recall, and F1‑score at the token level because false negatives (missed PHI) can cause privacy breaches, while false positives degrade data utility. The authors reported the following results on multiple test sets:

  • Radiology reports (known institution): F1 = 97.9 %
  • Radiology reports (new institution): F1 = 99.6 %
  • i2b2 2006 challenge set: F1 = 99.5 %
  • i2b2 2014 challenge set: F1 = 98.9 %
  • Recall for core PHI spans on the known‑institution set: 99.1 %

According to the authors, the model outperformed every de‑identifier it was compared with on all four test sets, as well as human labelers on the i2b2 2014 data, which makes it a strong candidate for deployments where both privacy and data fidelity matter.
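
For readers who want to score their own annotated sample, token‑level precision, recall, and F1 can be computed as below. This is a generic micro‑averaged sketch that requires an exact label match per token; the authors' exact scoring protocol may differ, and the label names are illustrative.

```python
from typing import List, Tuple


def token_prf(gold: List[str], pred: List[str], outside: str = "O") -> Tuple[float, float, float]:
    """Micro-averaged token-level precision, recall and F1 over PHI labels."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    pairs = list(zip(gold, pred))
    tp = sum(g != outside and p == g for g, p in pairs)  # PHI token with the right label
    fp = sum(p != outside and p != g for g, p in pairs)  # predicted PHI that is wrong
    fn = sum(g != outside and p != g for g, p in pairs)  # gold PHI that was missed or mislabeled
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


gold = ["O", "PATIENT", "PATIENT", "O", "DATE", "O"]
pred = ["O", "PATIENT", "O", "O", "DATE", "DATE"]
precision, recall, f1 = token_prf(gold, pred)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")  # P=0.667 R=0.667 F1=0.667
```

In this toy example one PHI token is missed (a false negative) and one non‑PHI token is flagged (a false positive), so precision and recall both come out at 2/3.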

Hardware Requirements

For inference the model fits comfortably on a single modern GPU. The BERT‑base encoder (≈110 M parameters, roughly 440 MB of weights in 32‑bit floats) runs with ample headroom on a 4–6 GB card at a batch size of 1–2 sentences. Larger batch sizes (e.g., 8–16) benefit from GPUs with 12 GB or more, such as the NVIDIA RTX 3060 (12 GB), V100, or A100.

  • GPU recommendation: NVIDIA RTX 3060 (12 GB) or higher; for high‑throughput pipelines, consider A100 (40 GB) or V100 (32 GB).
  • CPU: 8‑core Intel Xeon or AMD Ryzen 7+, with at least 16 GB RAM for tokenization and data loading.
  • Storage: Model files total ≈ 800 MB (weights, config, tokenizer). SSD preferred for fast loading.
  • Latency (indicative): ~ 30 ms per sentence on an RTX 3080; ~ 80 ms on a mid‑range CPU.
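
Latency depends heavily on sentence length, batch size, and hardware, so it is worth measuring on your own machine. The helper below is a minimal sketch: measure_latency is a hypothetical name, and the lambda is only a stand‑in so the example runs without the model. To benchmark the real model, pass in a callable that runs the token‑classification pipeline on one sentence.

```python
import statistics
import time
from typing import Callable, Dict, List


def measure_latency(predict: Callable[[str], object], sentences: List[str], warmup: int = 2) -> Dict[str, float]:
    """Time predict() on each sentence and report latency in milliseconds."""
    if not sentences:
        raise ValueError("need at least one sentence")
    for sentence in sentences[:warmup]:
        predict(sentence)  # warm-up calls are not timed (lazy initialisation, caches)
    timings = []
    for sentence in sentences:
        start = time.perf_counter()
        predict(sentence)
        timings.append((time.perf_counter() - start) * 1000.0)
    timings.sort()
    p95_index = min(len(timings) - 1, int(0.95 * len(timings)))
    return {
        "mean_ms": statistics.mean(timings),
        "median_ms": statistics.median(timings),
        "p95_ms": timings[p95_index],
    }


# Stand-in predictor; replace with the real tagger to benchmark the model.
stats = measure_latency(lambda s: s.upper(), ["No acute findings."] * 20)
print(stats)
```

Reporting the median and the 95th percentile alongside the mean shows whether a few slow calls dominate the average.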

Use Cases

The model shines in any workflow that must remove PHI from clinical narratives while keeping the text usable for downstream analytics.

  • Hospital data warehouses: Automated scrubbing of radiology reports before they are indexed for research.
  • Pharmaceutical R&D: De‑identifying trial notes to share with external partners.
  • Health‑tech startups: Building HIPAA‑compliant NLP services (e.g., automated report summarization).
  • Regulatory compliance tools: Integrating into FHIR pipelines to ensure outbound messages contain no PHI.
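
As an illustration of the FHIR use case, the sketch below scrubs the free‑text conclusion field of a DiagnosticReport resource. It is deliberately narrow: scrub_diagnostic_report is a hypothetical helper, the scrub callable stands in for the model plus masking step, and structured fields such as subject references and identifiers also carry PHI and need separate handling.

```python
import copy
from typing import Callable, Dict


def scrub_diagnostic_report(resource: Dict, scrub: Callable[[str], str]) -> Dict:
    """Return a copy of a FHIR DiagnosticReport with its conclusion text scrubbed."""
    cleaned = copy.deepcopy(resource)  # never modify the caller's resource in place
    if isinstance(cleaned.get("conclusion"), str):
        cleaned["conclusion"] = scrub(cleaned["conclusion"])
    return cleaned


diagnostic_report = {
    "resourceType": "DiagnosticReport",
    "status": "final",
    "conclusion": "John Smith: no acute cardiopulmonary findings.",
}

# Stand-in scrubber for the example; a real one would call the de-identification model.
cleaned_report = scrub_diagnostic_report(diagnostic_report, lambda t: t.replace("John Smith", "[PATIENT]"))
print(cleaned_report["conclusion"])  # [PATIENT]: no acute cardiopulmonary findings.
```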

Training Details

Training leveraged a large, multi‑institutional corpus of radiology reports and medical notes:

  • Datasets: 999 chest X‑ray and CT reports (collected Nov 2019‑Nov 2020) annotated for PHI at the token level, combined with 3 001 X‑ray reports and 2 193 medical notes that had been labeled previously, for 6 193 documents in total.
  • Pre‑training: PubMedBERT (uncased) weights as the base encoder.
  • Fine‑tuning: Token‑classification head trained with cross‑entropy loss, using data‑augmentation (synthetic PHI generation) and “hide‑in‑plain‑sight” surrogate replacement.
  • Compute: Trained on 4 × NVIDIA V100 GPUs (≈ 32 GB each) for ~ 12 hours (batch size 32, learning rate 3e‑5, 3 epochs).
  • Fine‑tuning capability: Users can re‑train the classification head on institution‑specific PHI patterns by loading the base weights and supplying a small annotated set (few‑shot fine‑tuning).
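
One step that any such fine‑tuning needs is mapping word‑level annotations onto subword tokens. The sketch below follows the common Hugging Face recipe, not necessarily the authors' procedure: only the first subword of each word keeps its label, and everything else gets -100, which PyTorch's cross‑entropy loss ignores by default. The word_ids list has the shape returned by BatchEncoding.word_ids() for a fast tokenizer; the label names and the subword split in the example are assumptions.

```python
from typing import Dict, List, Optional

IGNORE_INDEX = -100  # ignored by PyTorch's cross-entropy loss by default


def align_labels(word_ids: List[Optional[int]], word_labels: List[str], label2id: Dict[str, int]) -> List[int]:
    """Map word-level labels to subword tokens, labelling only each word's first subword."""
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None:
            aligned.append(IGNORE_INDEX)  # special tokens such as [CLS], [SEP], padding
        elif word_id != previous:
            aligned.append(label2id[word_labels[word_id]])  # first subword carries the label
        else:
            aligned.append(IGNORE_INDEX)  # later subwords of the same word are skipped
        previous = word_id
    return aligned


# Hypothetical label set and tokenization: the second word is assumed to split into two subwords.
label2id = {"O": 0, "PATIENT": 1}
word_labels = ["O", "PATIENT", "O"]
word_ids = [None, 0, 1, 1, 2, None]
token_labels = align_labels(word_ids, word_labels, label2id)
print(token_labels)  # [-100, 0, 1, -100, 0, -100]
```

The aligned label list is what would be passed as labels to a token‑classification model during training.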

Licensing Information

The repository carries the license:mit tag, although some listings show the model card's license field as unknown. Going by the tag, the code and weights are distributed under the MIT License, which is permissive:

  • Allows commercial, academic, and private use without fee.
  • Requires preservation of the original copyright notice and license text in redistributed copies.
  • No warranty; you assume all risk.

If you embed the model in a proprietary product, keep the MIT copyright notice and license text with it, for example in your documentation or “About” section. If the license status is unclear for your use, contact the authors (StanfordAIMI) before redistributing the model commercially.
