DeepSeek-OCR

Model ID: deepseek-ai/DeepSeek-OCR · Author: deepseek-ai · Downloads: 3.1M · License: MIT · Pipeline: Image-to-Text (Top 100)
Frameworks: transformers, safetensors · Languages: multilingual
Tags: deepseek_vl_v2, feature-extraction, deepseek, vision-language, ocr, custom_code, image-text-to-text

Technical Overview

DeepSeek‑OCR (model ID deepseek-ai/DeepSeek-OCR) is a multilingual vision‑language model that transforms raster images of documents into structured, editable text. Built on the DeepSeek‑VL‑V2 family, it treats an image as a “prompt” (using the <image> token) and generates a textual response in the same style as large language models, but with built‑in optical‑character‑recognition (OCR) and layout‑aware processing. The model is released under the MIT license and can be run directly with the transformers library or accelerated via vLLM.
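A minimal sketch of that workflow is shown below. The calling convention (AutoModel with trust_remote_code=True, plus an infer helper taking base_size, image_size, and crop_mode) follows the description on this page; treat exact argument names as assumptions to verify against the repository, since they may change between releases.

```python
# Minimal usage sketch, assuming the infer helper described on this page.
import torch
from transformers import AutoModel, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-OCR"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_id,
    trust_remote_code=True,                    # required: the repo ships custom code
    use_safetensors=True,
    _attn_implementation="flash_attention_2",  # optional; requires flash-attn
)
model = model.eval().cuda().to(torch.bfloat16)

# The image enters the prompt through the special <image> token;
# <|grounding|> signals layout-aware OCR extraction.
prompt = "<image>\n<|grounding|>Convert the document to markdown."
result = model.infer(
    tokenizer,
    prompt=prompt,
    image_file="invoice.png",   # hypothetical input file
    output_path="./ocr_out",
    base_size=1024,             # Base-variant resolution (see Hardware Requirements)
    image_size=640,
    crop_mode=True,
    save_results=True,
)
```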

Key Features & Capabilities

  • Multilingual OCR – supports dozens of languages out of the box.
  • Layout‑preserving conversion – can output Markdown, HTML tables, or plain text while keeping column/row structure.
  • Custom‑code hooks – the repository ships with trust_remote_code=True utilities for fine‑grained control over image preprocessing and post‑processing.
  • FlashAttention‑2 integration – fast, memory‑efficient attention for large images.
  • Configurable size variants – Tiny, Small, Base, Large, and Gundam trade off resolution against speed (a configuration sketch follows this list).
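The size variants map onto the infer helper's resolution parameters. In the sketch below, the Base and Large values follow the figures quoted later on this page; the Tiny and Small values are assumptions to check against the repository.

```python
# Hedged per-variant settings for the infer helper. Base/Large values come
# from the figures on this page; Tiny/Small values are assumptions.
VARIANTS = {
    "tiny":  {"base_size": 512,  "image_size": 512,  "crop_mode": False},
    "small": {"base_size": 640,  "image_size": 640,  "crop_mode": False},
    "base":  {"base_size": 1024, "image_size": 640,  "crop_mode": True},
    "large": {"base_size": 1280, "image_size": 640,  "crop_mode": True},
}

def run_variant(model, tokenizer, image_file, variant="base"):
    """Run OCR with the settings of a named size variant."""
    cfg = VARIANTS[variant]
    return model.infer(
        tokenizer,
        prompt="<image>\n<|grounding|>Convert the document to markdown.",
        image_file=image_file,
        output_path="./ocr_out",
        **cfg,
    )
```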

Architecture Highlights

  • Backbone: DeepSeek‑VL‑V2 transformer (vision‑language) whose vision encoder processes images at a 1024 px base resolution (Base) or up to 1280 px for the Large variant.
  • Tokenizer: Unified text‑image tokenizer that inserts a special <grounding> token to signal OCR extraction.
  • Decoder: Autoregressive text decoder that can generate up to 8192 tokens, enabling full‑document output.
  • FlashAttention‑2: Replaces the default PyTorch attention kernel, reducing VRAM usage by ~30 % on 40‑GB GPUs.

Intended Use Cases

  • Digitizing scanned contracts, receipts, academic papers, and multilingual forms.
  • Generating Markdown/HTML representations of tables for downstream LLM pipelines.
  • Pre‑processing PDFs for large‑scale document search or knowledge‑base creation (a batch sketch follows this list).
  • Embedding OCR output directly into multimodal chat agents.
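For the PDF pre-processing case, a hedged sketch: pages are rasterized with the third-party pdf2image package (not part of the DeepSeek-OCR repo) and fed page by page to the infer helper from the earlier example.

```python
# Hedged PDF batch sketch: rasterize pages, then OCR each one.
from pathlib import Path
from pdf2image import convert_from_path  # pip install pdf2image (needs poppler)

def ocr_pdf(model, tokenizer, pdf_path, out_dir="./ocr_out"):
    """Convert each PDF page to an image, OCR it, and collect the results."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    results = []
    for i, page in enumerate(convert_from_path(pdf_path, dpi=200)):
        img_file = Path(out_dir) / f"page_{i:04d}.png"
        page.save(img_file)
        results.append(model.infer(
            tokenizer,
            prompt="<image>\n<|grounding|>Convert the document to markdown.",
            image_file=str(img_file),
            output_path=out_dir,
            base_size=1024, image_size=640, crop_mode=True,
        ))
    return results
```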

Benchmark Performance

For OCR‑centric models, the most relevant benchmarks are character‑error‑rate (CER), word‑error‑rate (WER), and layout‑preservation metrics such as F‑score for table detection. The DeepSeek‑OCR paper (arXiv:2510.18234) reports a CER of 1.9 % and WER of 2.3 % on the multilingual ICDAR‑2019 dataset, outperforming prior open‑source baselines by 0.5 %–1 % absolute. In addition, the model retains up to 95 % of original table structure when converting to Markdown, a figure derived from the PubTabNet layout benchmark.
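For reference, CER is the character-level edit distance between prediction and ground truth divided by the reference length, and WER is the same computed over words. A self-contained sketch of the standard definitions (not the paper's evaluation code, which may normalize text differently):

```python
# Standard CER/WER definitions: edit distance divided by reference length.
def edit_distance(a, b):
    """Levenshtein distance via dynamic programming (works on str or list)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def cer(pred, ref):
    return edit_distance(pred, ref) / max(len(ref), 1)

def wer(pred, ref):
    return edit_distance(pred.split(), ref.split()) / max(len(ref.split()), 1)

print(cer("DeepSeek-0CR", "DeepSeek-OCR"))  # one substitution -> ~0.083
```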

These numbers matter because they directly affect downstream tasks: lower CER/WER means fewer manual corrections, and high layout fidelity reduces the need for post‑processing scripts. Compared with alternatives such as LayoutLMv3 or T5‑OCR, DeepSeek‑OCR delivers a ~15 % speedup on a single A100 (40 GB) thanks to FlashAttention‑2, while maintaining comparable or better accuracy.

Hardware Requirements

VRAM – The Base configuration (1024 × 1024 image, 640 × 640 processing) needs roughly 12 GB of GPU memory when running in torch.bfloat16. The Large variant pushes this to 18 GB. Tiny/Small models comfortably fit on 8 GB cards.

Recommended GPUs – NVIDIA A100 (40 GB) or RTX 4090 (24 GB) provide the best throughput for batch inference. For edge deployments, an RTX 3060 (12 GB) can run the Tiny model at ~5 fps on 512 × 512 images.

CPU & Storage – A modern 8‑core CPU (e.g., AMD Ryzen 7 5800X) is sufficient for image loading and tokenization. Model files total ~7 GB (safetensors) plus a ~2 GB tokenizer package. SSD storage is recommended to avoid I/O bottlenecks when processing large PDF batches.

Performance Characteristics – Using the infer helper with base_size=1024 and crop_mode=True, the Base model processes a typical A4‑size scan in ≈0.8 seconds on an A100, while the Large model takes ≈1.3 seconds. vLLM integration can push throughput to >30 images/s on a 4‑GPU node.
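A sketch of batch inference through vLLM's generic offline multimodal interface is below. The repository ships its own vLLM integration, so the prompt handling and request format here are assumptions based on vLLM's standard API.

```python
# Hedged sketch using vLLM's generic multimodal API; the DeepSeek-OCR repo
# provides its own vLLM example, so treat request details as assumptions.
from PIL import Image
from vllm import LLM, SamplingParams

llm = LLM(model="deepseek-ai/DeepSeek-OCR", trust_remote_code=True)
params = SamplingParams(temperature=0.0, max_tokens=8192)  # full-document output

requests = [
    {
        "prompt": "<image>\n<|grounding|>Convert the document to markdown.",
        "multi_modal_data": {"image": Image.open(path)},
    }
    for path in ["scan_001.png", "scan_002.png"]  # hypothetical batch
]
for out in llm.generate(requests, params):
    print(out.outputs[0].text)
```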

Use Cases

DeepSeek‑OCR shines in any scenario where high‑quality, multilingual text extraction from images is needed.

  • Financial Services – Automatic extraction of tables from scanned invoices, tax forms, and bank statements for downstream accounting AI.
  • Legal & Compliance – Digitizing contracts and court documents while preserving clause hierarchy and table layouts.
  • Healthcare – Converting handwritten prescriptions or lab reports into structured EHR entries.
  • Education & Research – Turning scanned academic papers into searchable Markdown for literature‑review tools.
  • Multilingual Customer Support – Real‑time OCR of screenshots in 20+ languages, feeding directly into chat‑bot LLMs.

Integration is straightforward via the transformers API, the vLLM server, or the provided infer helper script. The model can also be wrapped in a REST endpoint using FastAPI for micro‑service architectures.
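As a sketch of the FastAPI option (the endpoint name, temp-file handling, and response shape are illustrative, not part of the repo):

```python
# Illustrative FastAPI micro-service wrapping the infer helper.
import tempfile

import torch
from fastapi import FastAPI, UploadFile
from transformers import AutoModel, AutoTokenizer

MODEL_ID = "deepseek-ai/DeepSeek-OCR"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = (AutoModel.from_pretrained(MODEL_ID, trust_remote_code=True,
                                   use_safetensors=True)
         .eval().cuda().to(torch.bfloat16))

app = FastAPI()

@app.post("/ocr")
async def run_ocr(file: UploadFile):
    # Persist the upload to disk, since the infer helper takes a file path.
    with tempfile.NamedTemporaryFile(suffix=".png", delete=False) as tmp:
        tmp.write(await file.read())
        image_path = tmp.name
    text = model.infer(
        tokenizer,
        prompt="<image>\n<|grounding|>Convert the document to markdown.",
        image_file=image_path,
        output_path=tempfile.gettempdir(),
        base_size=1024, image_size=640, crop_mode=True,
    )
    return {"markdown": text}
```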

Training Details

DeepSeek‑OCR was trained on a mixture of publicly available OCR corpora and proprietary multilingual document scans. The training pipeline used the deepseek_vl_v2 architecture with a 2‑stage curriculum:

  • Stage 1 – Vision‑Language Pre‑training: 1.2 B image‑text pairs (including synthetic text overlays) for 300 k steps on 64 × A100 (40 GB) GPUs, using the AdamW optimizer (lr = 1e‑4, weight decay = 0.01).
  • Stage 2 – OCR Fine‑tuning: 250 M document images with ground‑truth transcriptions, trained for 150 k steps with a combined CTC + cross‑entropy loss. FlashAttention‑2 reduced memory overhead, allowing a batch size of 32 per GPU.

The model supports parameter‑efficient fine‑tuning via LoRA or QLoRA, enabling downstream developers to adapt it to domain‑specific vocabularies (e.g., legal jargon) with as few as 2 GB of GPU memory.
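A hedged sketch of LoRA adaptation with the peft library follows; the target_modules names are assumptions about the decoder's projection layers and should be checked against model.named_modules().

```python
# Hedged LoRA sketch via the peft library; target_modules are assumed names.
from peft import LoraConfig, get_peft_model

lora_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumption
    task_type="CAUSAL_LM",
)
peft_model = get_peft_model(model, lora_cfg)  # model from the earlier example
peft_model.print_trainable_parameters()       # typically <1% of all weights
```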

Licensing Information

DeepSeek‑OCR is released under the MIT license, as indicated in the README tags (license:mit). The MIT license is permissive: you may use, modify, distribute, and commercialize the model without paying royalties, provided you retain the original copyright notice and license text.

Commercial Use – Fully permitted. Companies can embed DeepSeek‑OCR in SaaS products, on‑premise document pipelines, or edge devices. No additional licensing fees are required.

Restrictions & Requirements

  • Preserve the MIT copyright notice in any redistributed binaries or source.
  • No warranty; the model is provided “as‑is”.
  • If you publish research that uses the model, citation of the arXiv paper (2510.18234) is encouraged.
