Name: fullstop-punctuation-multilang-large
Author: oliverguhr

Technical Overview

The fullstop-punctuation-multilang-large model, hosted under the identifier oliverguhr/fullstop-punctuation-multilang-large, is a token‑classification transformer that restores punctuation in raw, unpunctuated text. It is designed for spoken‑language transcripts and other streams where punctuation has been stripped, delivering human‑readable output in four major European languages – English, German, French, and Italian – as well as a broader multilingual mode.

Key capabilities include:

Multilingual punctuation prediction: Recognises and inserts five punctuation symbols – period ., comma ,, question mark ?, hyphen -, and colon : – across all supported languages.
Token‑level confidence scores: Each token receives a label and a probability, enabling downstream filtering or confidence‑based post‑processing.
End‑to‑end Python package: The deepmultilingualpunctuation library wraps the model, handling preprocessing, inference, and post‑processing for texts of arbitrary length.
Compatibility with major runtimes: Model files are provided in PyTorch, TensorFlow, ONNX, and safetensors formats, making it easy to deploy on Azure, AWS, or on‑premise environments.
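The confidence-based post-processing mentioned above can be sketched as follows. This is a minimal illustration, not the package's own logic: it assumes the [word, label, score] triples that the package README shows as the output of predict(), and the helper names apply_confident_labels and restore_with_threshold, as well as the 0.5 default threshold, are hypothetical.

```python
from typing import Iterable, List, Sequence, Union

# Each item mirrors the [word, label, score] triples shown in the package README.
LabeledWord = Sequence[Union[str, float]]


def apply_confident_labels(labeled_words: Iterable[LabeledWord],
                           min_score: float = 0.5) -> str:
    """Rebuild text, keeping only punctuation predicted with score >= min_score."""
    out: List[str] = []
    for word, label, score in labeled_words:
        # "0" is the no-punctuation label; low-confidence marks are dropped.
        if label != "0" and score >= min_score:
            out.append(f"{word}{label}")
        else:
            out.append(str(word))
    return " ".join(out)


def restore_with_threshold(text: str, min_score: float = 0.5) -> str:
    """Run the model, then keep only confident punctuation.

    Downloads the model weights on first use.
    """
    # Imported lazily so apply_confident_labels works without the package installed.
    from deepmultilingualpunctuation import PunctuationModel

    model = PunctuationModel(model="oliverguhr/fullstop-punctuation-multilang-large")
    labeled = model.predict(model.preprocess(text))
    return apply_confident_labels(labeled, min_score)
```

With the package installed, restore_with_threshold("my name is clara and i live in berkeley california", 0.9) returns the text with only high-confidence marks applied. If no filtering is needed, the package's own restore_punctuation(text) method does the whole job in one call.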

Architecture highlights:

Based on the XLM‑RoBERTa encoder, a multilingual variant of RoBERTa that shares a single transformer backbone across languages.
Fine‑tuned with a token‑classification head that predicts one of six classes: no punctuation (0), ., ,, ?, :, and -.
Trained on the Europarl corpus (see wmt/europarl), which provides high‑quality, aligned parliamentary speeches in the target languages.

Intended use cases range from automatic transcript polishing for podcasts, webinars, and call‑center recordings to preprocessing steps for downstream NLP pipelines such as sentiment analysis, translation, or summarisation where proper punctuation improves model performance.

Benchmark Performance

The model’s quality is measured using the F1 score for each punctuation class and a macro‑average across all classes. These metrics are essential because punctuation restoration is a highly imbalanced classification problem – most tokens receive the “no‑punctuation” label, while punctuation marks appear sparsely.

Label	EN	DE	FR	IT
0 (no punctuation)	0.991	0.997	0.992	0.989
.	0.948	0.961	0.945	0.942
?	0.890	0.893	0.871	0.832
,	0.819	0.945	0.831	0.798
:	0.575	0.652	0.620	0.588
-	0.425	0.435	0.431	0.421
Macro avg	0.775	0.814	0.782	0.762

These scores show that the model is strongest at sentence boundaries (., F1 0.942 to 0.961) and question marks (?, 0.832 to 0.893). Comma performance varies by language, from 0.798 (IT) to 0.945 (DE), while colons (0.575 to 0.652) and hyphens (0.421 to 0.435) are markedly weaker and pull the macro average down. Compared with smaller multilingual baselines (e.g., XLM‑RoBERTa‑base fine‑tuned on the same data), the large variant offers a 2‑4 % boost in macro‑average F1, making it a competitive choice for production‑grade transcription pipelines.
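The macro average is the unweighted mean of the per-class F1 scores, so the rare classes count as much as the dominant "0" label. The short check below recomputes the last table row from the six per-class rows. The dictionary name F1 and the function macro_f1 are illustrative only.

```python
from typing import Dict, List

# Per-class F1 scores copied from the table, in row order: 0  .  ?  ,  :  -
F1: Dict[str, List[float]] = {
    "EN": [0.991, 0.948, 0.890, 0.819, 0.575, 0.425],
    "DE": [0.997, 0.961, 0.893, 0.945, 0.652, 0.435],
    "FR": [0.992, 0.945, 0.871, 0.831, 0.620, 0.431],
    "IT": [0.989, 0.942, 0.832, 0.798, 0.588, 0.421],
}


def macro_f1(scores: List[float]) -> float:
    """Unweighted mean over classes: every class counts equally."""
    return sum(scores) / len(scores)


macro = {lang: round(macro_f1(scores), 3) for lang, scores in F1.items()}
print(macro)
```

Because each class has equal weight, the low hyphen and colon scores lower the macro average far more than their frequency in real text would suggest.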

Hardware Requirements

Inference with fullstop-punctuation-multilang-large is memory‑intensive due to the large XLM‑RoBERTa backbone (≈ 550 M parameters). The following hardware guidelines ensure smooth operation:

GPU VRAM: Minimum 8 GB for batch size = 1; 12 GB+ recommended for processing longer documents or higher batch sizes.
Recommended GPUs: NVIDIA RTX 3080/3090, A100 (40 GB), or any recent Ampere/RTX‑6000 series.
CPU: A modern multi‑core CPU (e.g., Intel i7‑12700K or AMD Ryzen 7 5800X) can run inference in CPU‑only mode, but expect 5‑10× lower throughput than on a GPU.
Storage: A single 32‑bit checkpoint of an XLM‑RoBERTa‑large model is roughly 2.2 GB, and each additional format (PyTorch, TensorFlow, ONNX, safetensors) adds a comparable amount. Allocate free disk space for every format you download, plus room for temporary preprocessing buffers.
Performance: On an RTX 3080, the model processes ~150 tokens / ms (≈ 90 words / ms) with a latency of < 200 ms for a typical 30‑second transcript.

Use Cases

The primary purpose of fullstop-punctuation-multilang-large is to restore punctuation in raw transcripts, but its utility extends to several domains:

Podcast & video captioning: Convert speech‑to‑text output into readable subtitles for English, German, French, and Italian audiences.
Call‑center analytics: Clean agent‑customer dialogues before sentiment or intent analysis, improving downstream model accuracy.
Legal & parliamentary archives: Re‑punctuate historic parliamentary debates (the same domain as the training data) for searchable archives.
Machine translation preprocessing: Restoring punctuation before translation gives the translation system clearer sentence boundaries, which can improve output quality.
Multilingual chat‑bots: Ensure bot responses are properly punctuated when generating text in any of the four supported languages.

Integration is straightforward via the deepmultilingualpunctuation Python package or by exporting the model to ONNX for use in JavaScript, C++, or mobile runtimes.
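For downstream steps such as sentiment or intent analysis, restored text is usually split into sentences first. The sketch below shows one simple way to do that on the model's output. The helper split_sentences is hypothetical and not part of the deepmultilingualpunctuation package, and it assumes that only . and ? end a sentence.

```python
import re
from typing import List

# Split after "." or "?" when followed by whitespace.
# Commas, colons and hyphens are treated as sentence-internal.
_SENTENCE_END = re.compile(r"(?<=[.?])\s+")


def split_sentences(punctuated: str) -> List[str]:
    """Split punctuation-restored text into sentence-sized units."""
    return [s for s in _SENTENCE_END.split(punctuated.strip()) if s]


restored = "how are you? i am fine. thanks"
for sentence in split_sentences(restored):
    print(sentence)
```

This rule also splits after abbreviations such as "e.g.", so a production pipeline would likely need a proper sentence segmenter.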

Training Details

The model was fine‑tuned on the Europarl corpus, a large collection of European Parliament proceedings covering the target languages. Training steps included:

Pre‑processing: Removal of existing punctuation, tokenisation with the XLM‑RoBERTa tokenizer, and alignment of tokens with original punctuation labels.
Label set: Six classes: no punctuation (0), period, comma, question mark, colon, and hyphen.
Optimization: AdamW optimizer with a learning rate of 5e‑5, linear warm‑up over 10 % of steps, and weight decay of 0.01.
Compute: Trained on a single NVIDIA V100 (16 GB) for approximately 12 hours, using a batch size of 32 sequences (max length 256 tokens).
Fine‑tuning capability: The published weights can be loaded with the standard Hugging Face transformers token‑classification classes and trained further on domain‑specific data (e.g., medical dictations); the PunctuationModel wrapper itself covers inference.
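The pre-processing step (stripping punctuation while keeping it as a per-word label) can be illustrated at word level. This is a sketch under stated assumptions, not the author's training code: the function text_to_labels is hypothetical, it uses the five marks listed in the overview plus "0", and it leaves out the subsequent subword alignment with the XLM‑RoBERTa tokenizer.

```python
from typing import List, Tuple

# The five punctuation marks the model predicts; "0" means no punctuation.
PUNCT_LABELS = {".", ",", "?", "-", ":"}


def text_to_labels(text: str) -> List[Tuple[str, str]]:
    """Turn punctuated text into (word, label) pairs with punctuation removed."""
    pairs: List[Tuple[str, str]] = []
    for token in text.split():
        if token in PUNCT_LABELS:
            # Free-standing mark (e.g. " - "): attach it to the previous word.
            if pairs:
                pairs[-1] = (pairs[-1][0], token)
            continue
        # The label is the trailing mark, if the token ends with one.
        label = token[-1] if token[-1] in PUNCT_LABELS else "0"
        word = token.rstrip("".join(PUNCT_LABELS)) if label != "0" else token
        if word:  # skip tokens made only of punctuation, such as "..."
            pairs.append((word, label))
    return pairs


print(text_to_labels("Hello, how are you? Fine - thanks."))
```

Joining the words gives the unpunctuated model input, and the labels are the classification targets. Hyphens inside a word (as in "well-known") are left untouched because only trailing marks are treated as labels.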

Related Papers

While the README does not list explicit arXiv or DOI references, the model builds upon two well‑known research foundations:

SEPP‑NLG Shared Task – The sentence segmentation and punctuation restoration challenge that provided the benchmark framework for the Europarl dataset.
XLM‑RoBERTa – The multilingual transformer architecture introduced in “Unsupervised Cross‑lingual Representation Learning at Scale” (Conneau et al., 2020). This backbone supplies the cross‑lingual representations that enable simultaneous punctuation prediction across four languages.

These works collectively inform the model’s design choices, especially the multilingual token‑classification head and the use of political speech corpora for training.

Licensing Information

The code in the deepmultilingualpunctuation package and the underlying dataset (Europarl) are described as being under MIT‑compatible terms. The model card itself, however, lists the licence as unknown. In practice, this means:

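Commercial use: The MIT licence permits commercial use of the code in the associated Python package, provided you retain the original copyright notice. Because the model card lists the licence for the weights as unknown, confirm their terms before any commercial deployment.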
Attribution: You must credit the original author (oliverguhr) and cite the Europarl dataset in any derivative work or publication.
Restrictions: No explicit patent or trademark claims are indicated, but you should verify that any downstream deployment complies with the terms under which the Europarl dataset is distributed.
Redistribution: You may share the model or host it on your own platform, as long as the original licence text is included.

If your organization requires a more formal licence audit, consider contacting the author via the Hugging Face Discussions page.
