indonesian-roberta-base-posp-tagger

Author: w11wo
Downloads: 2.6M
License: mit
Pipeline: Token Classification
Frameworks: transformers, pytorch, tf, safetensors
Datasets: indonlu
Tags: tensorboard, roberta, token-classification, generated_from_trainer, ind, base_model:flax-community/indonesian-roberta-base, base_model:finetune:flax-community/indonesian-roberta-base, model-index

Technical Overview

The indonesian‑roberta‑base‑posp‑tagger is a fine‑tuned token‑classification model built on top of the flax‑community/indonesian‑roberta‑base checkpoint. It is specifically trained to perform part‑of‑speech (POS) tagging on Indonesian text, using the posp configuration of the IndonLU dataset. The model takes a sequence of tokens as input and returns a label for each token, indicating its grammatical role (noun, verb, adjective, etc.).
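
As a rough illustration of that input/output behaviour, the sketch below runs the model through the 🤗 Transformers token-classification pipeline. The Hub ID w11wo/indonesian-roberta-base-posp-tagger is assumed from the author and model name listed on this page, the helper names are hypothetical, and the stripping of "Ġ" assumes the tokenizer marks word starts the way byte-level BPE tokenizers usually do.

```python
from typing import Dict, List, Tuple

# Assumed Hub ID, built from the author (w11wo) and the model name.
MODEL_ID = "w11wo/indonesian-roberta-base-posp-tagger"


def to_pairs(predictions: List[Dict]) -> List[Tuple[str, str]]:
    """Reduce raw pipeline output to (token, tag) pairs.

    Each prediction is expected to carry a "word" and an "entity" key,
    which is what the pipeline returns when no aggregation is requested.
    """
    # "Ġ" is the byte-level BPE marker for a preceding space (assumption
    # about this tokenizer); strip it for readability.
    return [(p["word"].lstrip("Ġ"), p["entity"]) for p in predictions]


def tag_sentence(text: str) -> List[Tuple[str, str]]:
    """Download the model (network access required) and tag one sentence."""
    from transformers import pipeline  # imported lazily; heavy dependency

    tagger = pipeline("token-classification", model=MODEL_ID)
    return to_pairs(tagger(text))
```

Calling tag_sentence("Saya sedang membaca buku.") should return one pair per subword token. Words that the tokenizer splits into several pieces appear as several pairs, so a real application would probably merge them back into words.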

Key Features & Capabilities

  • High‑precision POS tagging for Bahasa Indonesia (Precision/Recall/F1 ≈ 0.9625).
  • Based on the RoBERTa‑base architecture (12 transformer layers, 768‑dim hidden size).
  • Supports the token-classification pipeline in 🤗 Transformers, making inference straightforward in Python.
  • Ships with both PyTorch and TensorFlow weights; the generated_from_trainer tag only indicates that the model card was produced by the 🤗 Trainer.
  • Includes safetensors weights for faster loading and reduced memory overhead.

Architecture Highlights

  • Base model: RoBERTa‑base pre‑trained on a large Indonesian corpus (≈ 160 M tokens).
  • Classification head: A linear layer on top of the final hidden state, outputting the POS tag set defined in the posp config.
  • Training framework: Hugging Face Trainer with Adam optimizer, linear LR scheduler, and a batch size of 16.
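
To make the classification head concrete, here is a minimal pure-Python sketch of a linear layer followed by an argmax over labels. The sizes, weights, and tag names are toy placeholders chosen for illustration; the real head operates on 768-dimensional hidden states and the posp tag set, and the actual implementation may include details (such as dropout) not shown here.

```python
from typing import List

Vector = List[float]


def linear_head(hidden: List[Vector], weight: List[Vector], bias: Vector) -> List[Vector]:
    """Apply logits = h @ W^T + b to every token's hidden vector.

    hidden: seq_len x hidden_size, weight: num_labels x hidden_size.
    """
    return [
        [sum(h_i * w_i for h_i, w_i in zip(h, w_row)) + b for w_row, b in zip(weight, bias)]
        for h in hidden
    ]


def predict_tags(logits: List[Vector], id2label: List[str]) -> List[str]:
    """Pick the highest-scoring label for each token."""
    return [id2label[max(range(len(row)), key=row.__getitem__)] for row in logits]


# Toy sizes: 2 tokens, hidden size 3, 2 labels (the real model uses 768 dims).
hidden_states = [[1.0, 0.0, 2.0], [0.0, 3.0, 0.0]]
weights = [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
biases = [0.0, 0.5]
labels = ["TAG_A", "TAG_B"]  # placeholder names, not the posp tag set

logits = linear_head(hidden_states, weights, biases)
print(predict_tags(logits, labels))  # prints ['TAG_A', 'TAG_B']
```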

Intended Use Cases

  • Grammar checking and proofreading tools for Indonesian.
  • Pre‑processing for downstream NLP tasks such as named‑entity recognition, dependency parsing, or sentiment analysis.
  • Educational software that teaches Indonesian grammar.
  • Voice‑assistant pipelines that need syntactic awareness for better intent extraction.

Benchmark Performance

The model was evaluated on the test split of the indonlu dataset (POS‑P configuration). The reported metrics are:

  • Precision: 0.9625
  • Recall: 0.9625
  • F1‑Score: 0.9625
  • Accuracy: 0.9625
  • Loss: 0.1395

These metrics matter for token-classification models because they directly measure how often the model assigns the correct grammatical tag to each token, which is a core requirement for any downstream syntactic analysis. Compared with earlier Indonesian POS taggers (e.g., CRF-based or BiLSTM-CRF models, which reportedly hover around 0.90 F1), this RoBERTa-based tagger offers a noticeable gain in both precision and recall, placing it among the more accurate publicly available Indonesian POS taggers as of 2024.
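
The precision, recall, F1, and accuracy values listed above are identical. One setting in which that happens is micro-averaging over a task where every token has exactly one gold tag and one predicted tag: each wrong prediction counts once as a false positive (for the predicted tag) and once as a false negative (for the gold tag), so all four scores coincide. Whether the reported scores were computed this way is an assumption, not something the source states. A small check with placeholder tags:

```python
from typing import List, Tuple


def micro_scores(gold: List[str], pred: List[str]) -> Tuple[float, float, float, float]:
    """Return micro-averaged (precision, recall, f1, accuracy) for single-label tagging."""
    assert len(gold) == len(pred) and gold, "need two equal-length, non-empty sequences"
    labels = set(gold) | set(pred)
    tp = sum(1 for g, p in zip(gold, pred) if g == p)
    # Sum per-label false positives and false negatives over all labels.
    fp = sum(sum(1 for g, p in zip(gold, pred) if p == lab and g != lab) for lab in labels)
    fn = sum(sum(1 for g, p in zip(gold, pred) if g == lab and p != lab) for lab in labels)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    accuracy = tp / len(gold)
    return precision, recall, f1, accuracy


# Placeholder tags; three of four predictions are correct.
scores = micro_scores(["A", "B", "A", "C"], ["A", "B", "C", "C"])
print(scores)
```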

Hardware Requirements

Inference with the indonesian‑roberta‑base‑posp‑tagger is lightweight enough to run on consumer‑grade GPUs while still delivering sub‑second latency for typical sentence lengths (≤ 128 tokens). Recommended hardware:

  • GPU: 8 GB VRAM or more (e.g., NVIDIA RTX 2070, RTX 3060) – comfortably enough for single-sentence and small-batch inference.
  • CPU: Modern multi‑core CPU (Intel i5‑10600K or AMD Ryzen 5 5600X) for batch inference when GPU is unavailable.
  • RAM: 8 GB minimum; 16 GB+ for large‑scale batch processing.
  • Storage: Model files (~400 MB) plus tokenizers (~50 MB). safetensors format reduces disk I/O.
  • Performance: ~30 ms per sentence on RTX 3060 (FP16) and ~120 ms on a mid‑range CPU.
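
Latency figures like these depend heavily on hardware, drivers, and sentence length, so it is worth measuring on the target machine. The sketch below is a generic timing harness built on the standard library; the function names are hypothetical, and it times whatever zero-argument callable it is given.

```python
import statistics
import time
from typing import Callable, List


def measure_latency_ms(fn: Callable[[], object], warmup: int = 3, runs: int = 20) -> List[float]:
    """Time repeated calls to fn and return per-call latencies in milliseconds."""
    for _ in range(warmup):  # warm-up calls are not timed (caches, lazy init)
        fn()
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        timings.append((time.perf_counter() - start) * 1000.0)
    return timings


def summarize(timings_ms: List[float]) -> dict:
    """Median and worst-case latency from a list of timings."""
    return {"median_ms": statistics.median(timings_ms), "max_ms": max(timings_ms)}
```

Given a loaded pipeline object named tagger, a call such as summarize(measure_latency_ms(lambda: tagger("Saya sedang membaca buku."))) would give a per-sentence median to compare against the numbers above.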

Use Cases

The model excels in any scenario where accurate Indonesian POS tagging is needed. Typical applications include:

  • Grammar‑checking tools: Integrate the tagger to highlight syntactic errors in word processors or web editors.
  • Search engine indexing: Use POS tags to improve query parsing and relevance ranking for Indonesian content.
  • Chatbot/NLP pipelines: Enhance intent detection by feeding POS information into downstream classifiers.
  • Digital humanities: Analyze large corpora of Indonesian literature for linguistic research.

Training Details

Methodology

  • Fine‑tuned from flax‑community/indonesian‑roberta‑base using the 🤗 Transformers Trainer API.
  • Optimized with Adam (β₁=0.9, β₂=0.999, ε=1e‑8) and a linear learning‑rate scheduler.
  • Learning rate: 2 × 10⁻⁵, batch size: 16 (both training and evaluation).
  • Training ran for 10 epochs with a fixed random seed of 42 for reproducibility.
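
For anyone trying to reproduce the run, the settings above map onto TrainingArguments roughly as follows. This is a sketch, not the author's training script: the output directory is up to the caller, the linear_lr helper assumes no warmup (the source does not mention any), and note that the Trainer's default optimizer is AdamW, which may or may not match what is listed as "Adam".

```python
# Hyperparameters as listed above, keyed by TrainingArguments parameter names.
HPARAMS = {
    "learning_rate": 2e-5,
    "per_device_train_batch_size": 16,
    "per_device_eval_batch_size": 16,
    "num_train_epochs": 10,
    "seed": 42,
    "lr_scheduler_type": "linear",
    "adam_beta1": 0.9,
    "adam_beta2": 0.999,
    "adam_epsilon": 1e-8,
}


def build_training_args(output_dir: str):
    """Create TrainingArguments from HPARAMS (requires transformers and torch)."""
    from transformers import TrainingArguments  # lazy import of a heavy dependency

    return TrainingArguments(output_dir=output_dir, **HPARAMS)


def linear_lr(step: int, total_steps: int, base_lr: float = HPARAMS["learning_rate"]) -> float:
    """Linearly decay the learning rate from base_lr to 0, assuming no warmup."""
    return base_lr * max(0.0, (total_steps - step) / max(1, total_steps))
```

Under this schedule the learning rate starts at 2e-5 and reaches half that value at the midpoint of training.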

Dataset

  • IndonLU posp configuration – a curated set of Indonesian sentences annotated with POS tags.
  • Training/validation split follows the default split provided by the dataset.
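
POS annotations are typically given per word, while RoBERTa operates on subword pieces, so a fine-tuning script has to line the two up. A common convention, assumed here and not confirmed by the source, is to keep the tag on the first piece of each word and mask the remaining pieces and special tokens with -100 so the loss skips them. The helper name is hypothetical.

```python
from typing import List, Optional

IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default


def align_labels(word_ids: List[Optional[int]], word_labels: List[int]) -> List[int]:
    """Spread word-level tag ids over subword positions.

    word_ids maps each subword position to the index of its source word, or
    None for special tokens (the shape returned by BatchEncoding.word_ids()
    for fast tokenizers). Only the first piece of each word keeps the tag.
    """
    aligned = []
    previous = None
    for word_id in word_ids:
        if word_id is None or word_id == previous:
            aligned.append(IGNORE_INDEX)  # special token or continuation piece
        else:
            aligned.append(word_labels[word_id])
        previous = word_id
    return aligned
```

For example, with word_ids [None, 0, 1, 1, 2, None] and word-level tag ids [5, 7, 9], only positions 1, 2, and 4 keep a tag; the rest are masked.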

Compute

  • Training performed on a single GPU (NVIDIA RTX 3090, 24 GB VRAM) – total wall‑time ≈ 2 hours.
  • Peak VRAM usage ≈ 7 GB (FP16) or 13 GB (FP32).

Fine‑tuning Capabilities

  • Users can further fine‑tune on domain‑specific Indonesian corpora by loading the model with AutoModelForTokenClassification and continuing training with a lower learning rate.
  • Because the checkpoint loads through AutoModelForTokenClassification, the classification head can be replaced for other sequence-labeling tasks (e.g., NER) by loading it with a new label set and fine-tuning, with minimal code changes.
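
A minimal sketch of the head swap described above, assuming the Hub ID w11wo/indonesian-roberta-base-posp-tagger (inferred from the author and model name on this page) and hypothetical helper names. The new label list is supplied by the caller; the NER labels in the usage note are placeholders.

```python
from typing import Dict, List, Tuple

# Assumed Hub ID, built from the author (w11wo) and the model name.
MODEL_ID = "w11wo/indonesian-roberta-base-posp-tagger"


def build_label_maps(labels: List[str]) -> Tuple[Dict[int, str], Dict[str, int]]:
    """Create the id2label / label2id mappings stored in the model config."""
    id2label = dict(enumerate(labels))
    label2id = {label: idx for idx, label in id2label.items()}
    return id2label, label2id


def load_with_new_head(labels: List[str]):
    """Load the encoder weights and attach a fresh head sized for `labels`.

    Requires transformers, torch, and network access.
    """
    from transformers import AutoModelForTokenClassification

    id2label, label2id = build_label_maps(labels)
    return AutoModelForTokenClassification.from_pretrained(
        MODEL_ID,
        num_labels=len(labels),
        id2label=id2label,
        label2id=label2id,
        ignore_mismatched_sizes=True,  # discard the POS head whose shape no longer fits
    )
```

Calling load_with_new_head(["O", "B-PER", "I-PER"]) would return a model whose encoder starts from the POS tagger's weights and whose head is newly initialised, ready for further training at a lower learning rate. If the new label count happens to equal the original one, the old head weights are kept rather than discarded.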

Licensing Information

The README states an MIT license for the model weights and code, even though the Hugging Face card lists the license as “unknown”. Under the MIT license you may:

  • Use the model for commercial or non‑commercial purposes without fee.
  • Modify, redistribute, or embed the model in proprietary software.
  • Combine the model with other datasets or frameworks.

The only requirement is attribution: you must retain the original copyright notice and license text in any distribution. No warranty is provided, and you are responsible for ensuring compliance with any downstream data licenses (e.g., the indonlu dataset).
