Technical Overview
dslim/bert-base-NER is a fine‑tuned bert-base‑cased model that performs Named Entity Recognition (NER) on English text. The model classifies each token into one of nine BIO‑style tags (O, B‑LOC, I‑LOC, B‑ORG, I‑ORG, B‑PER, I‑PER, B‑MISC, I‑MISC) and therefore can extract four entity types – locations, organizations, persons and miscellaneous entities – from raw sentences.
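The BIO scheme turns per-token tags into entity spans: a B‑ tag opens an entity and following I‑ tags of the same type extend it. The sketch below shows one way to do that grouping in plain Python. The function name bio_to_spans and the sample sentence are illustrative assumptions, not part of the model or of 🤗 Transformers.

```python
from typing import List, Optional, Tuple

# The nine tags listed above. The order is for illustration only and is not
# claimed to match the checkpoint's id-to-label mapping.
TAGS = ["O", "B-LOC", "I-LOC", "B-ORG", "I-ORG", "B-PER", "I-PER", "B-MISC", "I-MISC"]


def bio_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group (token, tag) pairs into (entity_type, text) spans."""
    spans: List[Tuple[str, str]] = []
    current_type: Optional[str] = None
    current_tokens: List[str] = []

    def flush() -> None:
        # Close the entity that is currently open, if any.
        nonlocal current_type, current_tokens
        if current_type is not None:
            spans.append((current_type, " ".join(current_tokens)))
        current_type, current_tokens = None, []

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            flush()
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            current_tokens.append(token)
        else:
            # "O", or a stray I- tag that does not continue the open entity
            # (this sketch simply drops stray I- tags).
            flush()
    flush()
    return spans


tokens = ["Wolfgang", "lives", "in", "Berlin", "and", "works", "for", "Siemens", "AG"]
tags = ["B-PER", "O", "O", "B-LOC", "O", "O", "O", "B-ORG", "I-ORG"]
print(bio_to_spans(tokens, tags))
# [('PER', 'Wolfgang'), ('LOC', 'Berlin'), ('ORG', 'Siemens AG')]
```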
Key features & capabilities
- Ready‑to‑use with the pipeline("ner") API from 🤗 Transformers.
- F1 score of 0.926 on the CoNLL‑2003 test split.
- Compact 110 M‑parameter footprint (bert‑base) while retaining high accuracy.
- Cased checkpoint; an uncased counterpart is published separately as dslim/bert-base-NER-uncased.
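A minimal sketch of the pipeline route follows. It assumes the transformers package is installed and the checkpoint can be downloaded. The helper names load_ner_pipeline and summarize are hypothetical, and the sample entities are hand-written placeholders in the shape the aggregated pipeline returns, not real model output.

```python
from typing import Dict, List


def load_ner_pipeline():
    """Build the NER pipeline; downloads the checkpoint on first use."""
    # Imported lazily so summarize() below works without transformers installed.
    from transformers import pipeline

    # aggregation_strategy="simple" merges word pieces into whole entities.
    return pipeline("ner", model="dslim/bert-base-NER", aggregation_strategy="simple")


def summarize(entities: List[Dict]) -> List[str]:
    """Render aggregated pipeline output as 'TYPE: text (score)' strings."""
    return [f"{e['entity_group']}: {e['word']} ({e['score']:.2f})" for e in entities]


# Placeholder entities with made-up scores, used only to show the output shape.
sample = [
    {"entity_group": "PER", "word": "Wolfgang", "score": 0.99, "start": 11, "end": 19},
    {"entity_group": "LOC", "word": "Berlin", "score": 0.98, "start": 34, "end": 40},
]
print(summarize(sample))

# With network access, the real call would look like:
#   ner = load_ner_pipeline()
#   print(summarize(ner("My name is Wolfgang and I live in Berlin")))
```

Without an aggregation strategy the pipeline returns one dictionary per word piece, keyed by entity rather than entity_group, so the helper above would need adjusting.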
Architecture highlights
- 12‑layer Transformer encoder with 768 hidden units and 12 attention heads.
- Pre‑trained on the original BERT‑cased corpus (BooksCorpus + English Wikipedia).
- Fine‑tuned on the CoNLL‑2003 NER dataset using a token‑classification head (linear layer + softmax).
- Weights are published for PyTorch, TensorFlow and JAX, alongside ONNX and safetensors files.
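The token‑classification head described above can be sketched in NumPy as a linear layer followed by a softmax. The weights and hidden states here are random stand-ins, not the trained parameters or real encoder outputs, and classify_tokens is an illustrative name.

```python
import numpy as np

HIDDEN_SIZE = 768  # hidden units per token, from the architecture above
NUM_LABELS = 9     # the nine BIO tags

rng = np.random.default_rng(0)
# Random stand-ins: NOT the trained weights of the checkpoint.
W = rng.normal(scale=0.02, size=(HIDDEN_SIZE, NUM_LABELS))
b = np.zeros(NUM_LABELS)


def classify_tokens(hidden_states: np.ndarray) -> np.ndarray:
    """Map (seq_len, 768) encoder outputs to (seq_len, 9) tag probabilities."""
    logits = hidden_states @ W + b
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    exp = np.exp(logits)
    return exp / exp.sum(axis=-1, keepdims=True)


hidden = rng.normal(size=(6, HIDDEN_SIZE))  # six placeholder token vectors
probs = classify_tokens(hidden)
predicted_tag_ids = probs.argmax(axis=-1)   # one tag index per token
head_params = W.size + b.size               # 768 * 9 + 9 = 6921
print(probs.shape, head_params)
```

The head adds only 6,921 parameters, so almost all of the model's size comes from the encoder.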
Intended use cases
- Information extraction from news articles, reports, or any English prose.
- Pre‑processing step for downstream tasks such as relation extraction, knowledge‑graph construction, or question answering.
- Real‑time NER in chat‑bots, virtual assistants, and document automation pipelines.
Benchmark Performance
The most relevant benchmark for this model is the CoNLL‑2003 NER test set. The README reports the following verified metrics:
- Accuracy: 0.9118
- Precision: 0.9212
- Recall: 0.9306
- F1‑score: 0.9259
- Loss: 0.4833
These numbers place bert-base-NER on par with other BERT‑base NER baselines and ahead of smaller models such as DistilBERT‑NER (≈0.90 F1). The high recall indicates the model rarely misses entities, while the precision shows it keeps false positives low – a crucial balance for downstream pipelines that rely on clean entity spans.
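As a consistency check, F1 is the harmonic mean of precision and recall, so the reported F1 can be recomputed from the other two figures. A short sketch (f1_score is an illustrative helper name):

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported above.
f1 = f1_score(0.9212, 0.9306)
print(round(f1, 4))  # 0.9259
```

The result matches the reported F1 to four decimal places.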
Hardware Requirements
VRAM for inference: The model’s 110 M parameters occupy roughly 420 MB of GPU memory when loaded in FP32. Loading the weights in FP16 (torch.float16) halves this to ~210 MB. In either precision the weights fit comfortably on a single 4 GB GPU.
- Recommended GPU: NVIDIA V100, RTX 3080, or any GPU with ≥6 GB VRAM for comfortable batch‑size ≥ 8.
- CPU: Modern x86_64 CPU with at least 8 GB RAM; inference speed scales with core count but is not a bottleneck for typical sentence‑level NER.
- Storage: Model files (~420 MB for FP32, ~210 MB for FP16) plus tokenizer files (~30 MB). SSD storage is recommended for fast loading.
- Performance: On a V100, latency is roughly 1 ms per token at batch size 1, and larger batches (e.g., 32 sentences) can sustain >200 tokens/s. Actual figures depend on sequence length and numeric precision.
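The weight-memory figures follow from parameter count times bytes per parameter. The sketch below reproduces them, assuming the "MB" values above are mebibytes (2^20 bytes) and using the rounded 110 M parameter count. It covers weights only; activations and framework overhead come on top.

```python
NUM_PARAMS = 110_000_000  # rounded parameter count quoted above
MIB = 1024 ** 2           # assumption: "MB" in the text means MiB


def weights_mib(num_params: int, bytes_per_param: int) -> float:
    """Memory needed for the weights alone, in MiB."""
    return num_params * bytes_per_param / MIB


fp32 = weights_mib(NUM_PARAMS, 4)  # 4 bytes per FP32 value
fp16 = weights_mib(NUM_PARAMS, 2)  # 2 bytes per FP16 value
print(f"FP32: {fp32:.0f} MiB, FP16: {fp16:.0f} MiB")
# FP32: 420 MiB, FP16: 210 MiB
```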
Use Cases
Primary applications include any scenario that needs to extract structured entities from unstructured English text.
- News & media monitoring: Detect people, places, and organizations in real‑time streams.
- Legal document analysis: Highlight parties, locations, and miscellaneous entities in contracts or case files.
- Customer support automation: Identify product names, user names, and locations in tickets for routing.
- Healthcare record anonymization: Flag person, location, and organization names as a first pass before de‑identification.
The model can be integrated via the 🤗 Transformers pipeline API, exported to ONNX for edge deployment, or wrapped in a REST API using FastAPI or Flask.
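For the anonymization use case, the character offsets in the aggregated pipeline output can drive a simple redaction step. The sketch below uses hand-written entities in that output shape, and redact is a hypothetical helper. Because the model tags only four entity types, this covers names but not dates, record numbers, or other identifiers.

```python
from typing import Dict, List, Sequence


def redact(text: str, entities: List[Dict], types: Sequence[str] = ("PER",)) -> str:
    """Replace spans of the selected entity types with [TYPE] placeholders."""
    # Work right-to-left so earlier character offsets stay valid.
    for ent in sorted(entities, key=lambda e: e["start"], reverse=True):
        if ent["entity_group"] in types:
            placeholder = f"[{ent['entity_group']}]"
            text = text[: ent["start"]] + placeholder + text[ent["end"]:]
    return text


text = "Wolfgang was admitted in Berlin."
# Hand-written entities in the shape the aggregated pipeline returns.
entities = [
    {"entity_group": "PER", "word": "Wolfgang", "start": 0, "end": 8},
    {"entity_group": "LOC", "word": "Berlin", "start": 25, "end": 31},
]
print(redact(text, entities))                  # [PER] was admitted in Berlin.
print(redact(text, entities, ("PER", "LOC")))  # [PER] was admitted in [LOC].
```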
Training Details
The model was fine‑tuned on a single NVIDIA V100 GPU using the hyper‑parameters suggested in the original BERT paper (learning rate ≈ 2e‑5, batch size ≈ 32, 3–4 epochs). The training pipeline:
- Base model: bert-base-cased (12 layers, 110 M parameters).
- Dataset: English CoNLL‑2003 (≈203 k training tokens, 51 k dev tokens, 46 k test tokens).
- Loss: Cross‑entropy over the nine BIO tags.
- Evaluation: Accuracy, Precision, Recall, F1, and loss on the test split.
- Fine‑tuning capability: Users can continue training on domain‑specific NER corpora (e.g., biomedical or financial) by loading the checkpoint with AutoModelForTokenClassification and providing a new Trainer configuration.
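A hedged sketch of continued fine‑tuning follows. The label set NEW_LABELS, the output directory, and the helper names are hypothetical, and the hyper‑parameters mirror the approximate values quoted above rather than a confirmed training recipe. align_labels shows the usual step of mapping word-level labels onto sub-word tokens; build_trainer assumes the datasets are already tokenized and aligned.

```python
from typing import List, Optional

# Hypothetical label set for a new domain; replace with your corpus's tags.
NEW_LABELS = ["O", "B-DRUG", "I-DRUG"]
IGNORE_INDEX = -100  # positions with this label are skipped by the loss


def align_labels(word_ids: List[Optional[int]], word_labels: List[int]) -> List[int]:
    """Expand word-level label ids to sub-word tokens.

    Only the first sub-word of each word keeps its label; special tokens
    (word id None) and continuation pieces get IGNORE_INDEX.
    """
    aligned: List[int] = []
    previous: Optional[int] = None
    for word_id in word_ids:
        if word_id is None or word_id == previous:
            aligned.append(IGNORE_INDEX)
        else:
            aligned.append(word_labels[word_id])
        previous = word_id
    return aligned


def build_trainer(train_dataset, eval_dataset, output_dir: str = "ner-finetuned"):
    """Assemble a Trainer for a new label set (downloads the checkpoint)."""
    # Lazy import so align_labels() works without transformers installed.
    from transformers import (
        AutoModelForTokenClassification,
        AutoTokenizer,
        DataCollatorForTokenClassification,
        Trainer,
        TrainingArguments,
    )

    tokenizer = AutoTokenizer.from_pretrained("dslim/bert-base-NER")
    model = AutoModelForTokenClassification.from_pretrained(
        "dslim/bert-base-NER",
        num_labels=len(NEW_LABELS),
        id2label=dict(enumerate(NEW_LABELS)),
        label2id={label: i for i, label in enumerate(NEW_LABELS)},
        ignore_mismatched_sizes=True,  # the 9-way head is replaced by a new one
    )
    args = TrainingArguments(
        output_dir=output_dir,
        learning_rate=2e-5,
        per_device_train_batch_size=32,
        num_train_epochs=3,
    )
    return Trainer(
        model=model,
        args=args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        data_collator=DataCollatorForTokenClassification(tokenizer),
    )


# Hypothetical word ids for [CLS] + three pieces of word 0 + word 1 + [SEP].
print(align_labels([None, 0, 0, 0, 1, None], [1, 0]))
# [-100, 1, -100, -100, 0, -100]
```

If the new corpus keeps the original nine tags, the num_labels, id2label, label2id, and ignore_mismatched_sizes arguments can be dropped so the existing head is reused.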
Licensing Information
The model card lists the MIT license for the underlying code and data, but the hub’s overall “license: unknown” field means the exact distribution terms for the fine‑tuned weights are not explicitly stated.
In practice, the MIT license permits:
- Free commercial and non‑commercial use.
- Modification, redistribution, and integration into proprietary software.
- All of the above, provided the original copyright notice and license text are retained.
Because the license is marked “unknown”, users should:
- Check the model card for any updates.
- Contact the author (dslim) if a commercial deployment raises legal concerns.
- Provide attribution to “dslim/bert-base-NER” and the MIT‑licensed source code.