DeepSeek-OCR: Contexts Optical Compression
DeepSeek-OCR: Contexts Optical Compression — A Deep Dive into Visual-Text Compression
DeepSeek-OCR explores how vision and language models can work together to compress and interpret text-heavy visual scenes. Built as an experiment in “contexts optical compression,” it goes beyond traditional OCR by examining vision encoders from an LLM-centric perspective. The project introduces an OCR model and shows how to deploy it with vLLM and with standard Transformers pipelines. This post covers the core ideas, release milestones, setup, inference modes, and sample visualizations.
What is DeepSeek-OCR?
DeepSeek-OCR emerges from the idea that reading text embedded in images is more than a character-level task; it is about understanding context. The model treats text as a stream of information that lives inside a larger visual scene, requiring attention to layout, typography, and surrounding cues to extract meaning accurately. By adopting an LLM-centric viewpoint, DeepSeek-OCR investigates how vision encoders—responsible for interpreting the image—can collaborate with large language models to produce coherent, context-aware OCR outputs.
Key notions include:
- Context-aware OCR: capturing not just the words, but their arrangement, hierarchy, and relationships within a document or scene.
- Multimodal integration: fusing image data with textual prompts to guide extraction, formatting, and downstream reasoning.
- Visual compression of content: exploring how visual representations can be compressed into textual outputs or structured formats with high fidelity.
To illustrate the project’s vision, a striking visual introduction accompanies the project materials, including an image that spotlights the central idea of visual-text compression. See the figure referenced in the project materials to get a sense of how information density is managed across modalities.
Figure: Contexts Optical Compression concept
Release Timeline and Milestones
DeepSeek-OCR has evolved through a series of coordinated releases and integrations, reflecting a strong emphasis on practical usability and ecosystem compatibility.
- Release: 2026-01-27 — DeepSeek-OCR2 arrives, expanding capabilities and performance. This newer version signals a continued commitment to pushing the boundaries of multimodal OCR and its applications.
- Release: 2025-10-23 — DeepSeek-OCR becomes officially supported in upstream vLLM. The collaboration with the vLLM project enables more robust and scalable inference workflows, particularly for streaming and high-throughput scenarios.
- Release: 2025-10-20 — Initial release of DeepSeek-OCR, focusing on the role of vision encoders from an LLM-centric viewpoint. This milestone marks the transition from concept to deployable tooling, inviting practitioners to experiment with real-world documents and scenes.
These releases reflect a broader strategy: to blend vision, language, and efficient inference in a way that makes context-aware OCR practical for researchers and developers.
Key Features and Capabilities
DeepSeek-OCR is designed to be both powerful and adaptable. Its feature set centers on enabling accurate text extraction while preserving the context and structure that give text meaning in documents and scenes.
- Contextual OCR: Beyond word-for-word transcription, the model emphasizes layout, sections, and relationships between textual elements.
- Multimodal prompts: The system supports prompts that nudge the model to format outputs (e.g., convert to markdown), describe figures, or locate references within the image.
- Streaming and batch inference: Different modes cater to real-time streaming outputs (for images) and high-throughput batch processing (for PDFs and large document collections).
- Compatibility with vLLM: DeepSeek-OCR is designed to work well with vLLM for efficient, scalable inference, including image streaming and large prompt handling.
- Transformers-based inference: A conventional pathway using Hugging Face Transformers enables straightforward experimentation with familiar tooling.
- Native resolution support: The model offers several native resolutions to balance quality and performance:
- Tiny: 512×512 (64 vision tokens)
- Small: 640×640 (100 vision tokens)
- Base: 1024×1024 (256 vision tokens)
- Large: 1280×1280 (400 vision tokens)
- Dynamic resolution options: In addition to fixed sizes, the model supports dynamic resolution strategies to optimize for different documents and devices.
- Documentation and prompts: A set of example prompts helps users guide the model to perform tasks like conversion to markdown, OCR for general images, or locating references within the content.
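The native resolutions listed above can be captured in a small lookup table. The sketch below is illustrative only: the formula `(side // 64) ** 2` is inferred from the four listed pairs rather than taken from the project documentation, and `compression_ratio` with its 1000-token page is a hypothetical helper for reasoning about how many text tokens each vision token stands for.

```python
# Native resolution modes as listed above: (side length in pixels, vision tokens).
NATIVE_MODES = {
    "tiny": (512, 64),
    "small": (640, 100),
    "base": (1024, 256),
    "large": (1280, 400),
}


def vision_tokens(side: int) -> int:
    """Token count implied by the listed modes: (side / 64) squared.

    Assumption: this formula is inferred from the four listed pairs.
    """
    return (side // 64) ** 2


def compression_ratio(text_tokens: int, mode: str) -> float:
    """Text tokens represented per vision token for a given mode (hypothetical helper)."""
    _, tokens = NATIVE_MODES[mode]
    return text_tokens / tokens


if __name__ == "__main__":
    for name, (side, tokens) in NATIVE_MODES.items():
        # Check that the inferred formula reproduces every listed token count.
        assert vision_tokens(side) == tokens
        print(f"{name}: {side}x{side} -> {tokens} vision tokens")
    # Hypothetical page with 1000 text tokens encoded in Small mode.
    print(compression_ratio(1000, "small"))
```

A smaller mode spends fewer vision tokens per page, so the same page is compressed more aggressively; whether the text survives that compression is an empirical question for your documents.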
Visual sample outputs and demonstrations accompany the release, offering tangible illustrations of how the model handles complex pages, diagrams, and mixed content.
Getting Started: Install and Setup
The project provides a practical guide for setting up DeepSeek-OCR in a CUDA-enabled environment. Here is a distilled, user-friendly overview of the recommended steps, reflecting the official guidance.
Prerequisites
The reference environment is CUDA 11.8 with torch 2.6.0; the install commands below target that combination.
Install torchvision and torchaudio builds that match torch 2.6.0.
Clone the repository and prepare the environment
git clone https://github.com/deepseek-ai/DeepSeek-OCR.git
Navigate to the DeepSeek-OCR directory and create a conda environment:
- conda create -n deepseek-ocr python=3.12.9 -y
- conda activate deepseek-ocr
Install dependencies
Obtain the vLLM wheel compatible with CUDA 11.8:
- https://github.com/vllm-project/vllm/releases/tag/v0.8.5
Install PyTorch stack with CUDA 11.8 support:
- pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu118
Install the vLLM wheel:
- pip install vllm-0.8.5+cu118-cp38-abi3-manylinux1_x86_64.whl
Install remaining dependencies:
- pip install -r requirements.txt
Optional for performance:
- pip install flash-attn==2.7.3 --no-build-isolation
Note on compatibility
If you want the vLLM and Transformers code to run in the same environment, pip may report a dependency conflict such as "vllm 0.8.5+cu118 requires transformers>=4.51.1". The project's guidance is that you do not need to worry about this message.
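After installing, it can help to confirm which versions actually ended up in the environment. The following is a minimal sketch using only the standard library; `EXPECTED` simply mirrors the versions named in the install steps, and `check_environment` is a hypothetical helper, not part of the project.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Dict, Optional

# Versions named in the install steps above.
EXPECTED = {
    "torch": "2.6.0",
    "torchvision": "0.21.0",
    "torchaudio": "2.6.0",
    "vllm": "0.8.5",
    "flash-attn": "2.7.3",
}


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def check_environment(expected: Dict[str, str] = EXPECTED) -> Dict[str, str]:
    """Report 'ok', 'missing', or a mismatch message for each package."""
    report = {}
    for package, wanted in expected.items():
        found = installed_version(package)
        if found is None:
            report[package] = "missing"
        elif found.split("+")[0] == wanted:
            # Local build tags such as '+cu118' are ignored in the comparison.
            report[package] = "ok"
        else:
            report[package] = f"found {found}, expected {wanted}"
    return report


if __name__ == "__main__":
    for package, status in check_environment().items():
        print(f"{package}: {status}")
```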
vLLM Inference workflow
Change INPUT_PATH/OUTPUT_PATH and other settings in the configuration:
- DeepSeek-OCR-master/DeepSeek-OCR-vllm/config.py
Run the relevant scripts:
- For image streaming: python run_dpsk_ocr_image.py
- For PDF concurrency (high throughput): python run_dpsk_ocr_pdf.py
- For batch evaluation: python run_dpsk_ocr_eval_batch.py
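A common failure when running the scripts above is a mistyped input or output path in the configuration. The sketch below shows one way to validate the two paths before launching a long job; `prepare_paths` is a hypothetical helper and is not part of the repository's config.py.

```python
from pathlib import Path
from typing import Tuple


def prepare_paths(input_path: str, output_path: str) -> Tuple[Path, Path]:
    """Check that the input exists and create the output directory if needed."""
    src = Path(input_path)
    dst = Path(output_path)
    if not src.exists():
        raise FileNotFoundError(f"Input path does not exist: {src}")
    # Create the output directory, including missing parents.
    dst.mkdir(parents=True, exist_ok=True)
    return src, dst
```

Calling it with the same values you set in the configuration file surfaces path errors immediately rather than after the model has loaded.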
Note on vLLM versions
- To use the upstream vLLM support instead of the 0.8.5 wheel, install vLLM from the nightly builds until v0.11.1 is released:
- uv pip install -U vllm --pre --extra-index-url https://wheels.vllm.ai/nightly
Transformers Inference workflow
A typical setup uses Hugging Face Transformers:
- from transformers import AutoModel, AutoTokenizer
- Load the model and tokenizer with remote code trusted, optionally enable flash attention and safetensors weight loading, and place the model on CUDA in bf16 precision.
Example prompts and outputs can be produced via a dedicated interface, with the model configured to produce structured outputs (e.g., markdown) or to perform doc-level extraction tasks.
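The loading step described above might look like the sketch below. Several things here are assumptions: the Hugging Face model id `deepseek-ai/DeepSeek-OCR`, the use of `use_safetensors=True`, and the helper names `build_load_kwargs` and `load_model`. The actual inference call is defined by the model's remote code and is deliberately not shown; take it from the repository.

```python
def build_load_kwargs(use_flash_attention: bool = True) -> dict:
    """Keyword arguments for AutoModel.from_pretrained.

    trust_remote_code lets Transformers run the model code shipped with the
    checkpoint; use_safetensors loads weights from the safetensors format.
    """
    kwargs = {"trust_remote_code": True, "use_safetensors": True}
    if use_flash_attention:
        # Requires the flash-attn package to be installed.
        kwargs["attn_implementation"] = "flash_attention_2"
    return kwargs


def load_model(model_id: str = "deepseek-ai/DeepSeek-OCR"):
    """Load tokenizer and model on CUDA in bf16 (model_id is an assumption)."""
    # Imported lazily so the helper above can be used without a GPU stack.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
    model = AutoModel.from_pretrained(model_id, **build_load_kwargs())
    model = model.eval().cuda().to(torch.bfloat16)
    return tokenizer, model
```

Passing `use_flash_attention=False` falls back to the default attention implementation, which is slower but avoids the flash-attn build.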
Sample usage snippet
The project provides code-ready instructions to assemble model inputs with multi-modal data and to set up a batching pipeline for OCR tasks. You’ll find detailed examples in the documentation and codebase, including how to compose a batch with multiple images and a shared prompt.
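As a rough illustration of composing a batch with multiple images and a shared prompt, the sketch below builds inputs in vLLM's multimodal prompt format and passes them to `LLM.generate`. The model id, the `max_tokens` value, and the helper names are assumptions; the project's own scripts may set additional options that are not reproduced here.

```python
from typing import Any, Dict, List


def build_batch(images: List[Any], prompt: str) -> List[Dict[str, Any]]:
    """Pair every image with the same prompt in vLLM's multimodal input format."""
    return [{"prompt": prompt, "multi_modal_data": {"image": image}} for image in images]


def run_ocr(
    image_paths: List[str],
    prompt: str,
    model_id: str = "deepseek-ai/DeepSeek-OCR",  # assumed Hugging Face model id
) -> List[str]:
    """Run one generation pass over all images and return the decoded texts."""
    # Imported lazily so build_batch can be used without vLLM installed.
    from PIL import Image
    from vllm import LLM, SamplingParams

    images = [Image.open(path).convert("RGB") for path in image_paths]
    llm = LLM(model=model_id)
    # Greedy decoding; the token budget is an illustrative value.
    params = SamplingParams(temperature=0.0, max_tokens=8192)
    outputs = llm.generate(build_batch(images, prompt), params)
    return [out.outputs[0].text for out in outputs]
```

The prompt string usually needs an image placeholder in front of the instruction; the exact placeholder is model specific, so check the repository's examples before relying on one.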
vLLM vs. Transformers: choosing a path
vLLM: Optimized for streaming and high-throughput workloads, suitable for large PDFs and real-time OCR pipelines.
Transformers: Useful for rapid experimentation and smaller-scale workflows, benefitting from familiar APIs and tooling.
Hardware considerations
The only hardware named in the project's guidance is an A100-40G, used for the PDF throughput figure. Expect lower throughput on smaller GPUs, especially for concurrent PDF and batch tasks.
The installation guidance is designed to help practitioners move from concept to practical deployment, enabling experimentation with both the vLLM and Transformer inference routes.
Inference Modes: vLLM and Transformers
DeepSeek-OCR supports two principal pathways for inference, each with its own strengths and suited use cases.
vLLM Inference
- Streaming image outputs
- Ideal for processing individual images in near real time, producing OCR results that can be consumed progressively.
- PDF processing
- Throughput of roughly 2500 tokens per second with concurrent requests on a single A100-40G, as reported by the project, which makes large-scale OCR of dense PDFs practical.
- Batch evaluation
- Designed to benchmark performance across multiple samples, enabling systematic comparisons and optimization.
Operational notes:
- Ensure the configuration points to the right input and output paths.
- You can adjust batch sizes, prompt lengths, and token budgets to balance speed and fidelity.
- The project demonstrates a practical code path with a few lines of Python to instantiate the model, prepare batched inputs, and run generation with custom sampling parameters.
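The throughput figure quoted above lends itself to a quick back-of-the-envelope estimate. The sketch below assumes the roughly 2500 tokens per second applies to generated tokens and that you know the average output length per page; both the helper name and the 1000-tokens-per-page example are hypothetical.

```python
def estimated_seconds(pages: int, tokens_per_page: int, tokens_per_second: float = 2500.0) -> float:
    """Estimate wall-clock seconds to OCR a document set at a steady throughput."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return pages * tokens_per_page / tokens_per_second


if __name__ == "__main__":
    # Hypothetical workload: 1000 pages at about 1000 output tokens each.
    seconds = estimated_seconds(1000, 1000)
    print(f"{seconds:.0f} s (about {seconds / 60:.1f} minutes)")
```

Real runs will deviate from this because throughput depends on page density, resolution mode, and batch size, so treat the result as an order-of-magnitude guide.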
Transformers Inference
Standard Hugging Face workflow
Using AutoModel and AutoTokenizer, you load the DeepSeek-OCR model with trust_remote_code enabled and, if desired, use the flash attention variant for efficiency.
The model is prepared for bf16 precision on CUDA devices, providing a familiar and robust path for researchers who prefer the Transformers ecosystem.
Prompting and outputs
The Transformer path facilitates prompts such as:
- A request to convert content to markdown.
- General OCR for images.
- Descriptions for figures within documents.
It also supports tasks like locating references or parsing specific content blocks, enabling flexible experimentation with document understanding.
Practical tips
When using Transformers, you’ll likely rely on standard tooling for data loading, batching, and result saving, which can simplify integration with existing OCR pipelines.
The balance between model size, resolution, and prompt complexity will influence performance and accuracy.
Both inference modes are designed to complement each other, offering options for streaming, batch processing, and experimental workflows depending on your hardware and application needs.
Prompts and Use Cases
Prompts guide the model's behavior and shape the kinds of outputs you receive. DeepSeek-OCR ships with a set of prompt templates and examples that demonstrate how to get the most out of your OCR tasks.
- Document-to-markdown
- Prompt example: “<|grounding|>Convert the document to markdown.”
- Output: A structured markdown representation capturing headings, lists, sections, and textual content while preserving layout semantics.
- General OCR for images
- Prompt example: “OCR this image.”
- Output: A plain-text transcription with attention to text blocks and ordering.
- Without layouts
- Prompt example: “Free OCR.”
- Output: A straightforward text extraction without explicit layout information.
- Figures and diagrams
- Prompt example: “Parse the figure.”
- Output: Descriptions or extracted figure captions and annotations.
- Descriptive tasks
- Prompt example: “Describe this image in detail.”
- Output: Rich, scene-level descriptions that go beyond OCR to capture visual context.
- Relational and locating tasks
- Prompt example: “Locate <|ref|>xxxx<|/ref|> in the image.”
- Output: Location and reference extraction, helpful for technical documents with cross-references.
- Multilingual and cultural coverage
- While the base prompts emphasize English content, similar templates can be extended to other languages and scripts, enabling broader OCR usability.
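The prompt templates above can be collected in one place so that pipelines refer to them by name. This is a minimal sketch: the dictionary keys and helper names are made up for illustration, and the `<image>` placeholder added by `with_image` is an assumption about how the image slot is marked, so verify it against the repository's examples.

```python
# Prompt strings as quoted in the list above; the keys are illustrative names.
PROMPTS = {
    "markdown": "<|grounding|>Convert the document to markdown.",
    "ocr": "OCR this image.",
    "free_ocr": "Free OCR.",
    "figure": "Parse the figure.",
    "describe": "Describe this image in detail.",
}


def locate_prompt(target: str) -> str:
    """Build the locating prompt for a given reference string."""
    return f"Locate <|ref|>{target}<|/ref|> in the image."


def with_image(prompt: str, image_token: str = "<image>") -> str:
    """Prefix a prompt with an image placeholder (the token is an assumption)."""
    return f"{image_token}\n{prompt}"


if __name__ == "__main__":
    print(with_image(PROMPTS["markdown"]))
    print(locate_prompt("Table 2"))
```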
These prompts reflect a design philosophy that emphasizes not just text extraction, but also the contextual and descriptive dimensions of complex documents and scenes.
Visualizations and Examples
Visual evidence helps convey what DeepSeek-OCR can achieve. The project provides a collection of sample outputs and visualizations to illustrate how the model handles real-world content.
- Visual sample gallery
- show1.jpg, show2.jpg, show3.jpg, show4.jpg
- These visuals demonstrate OCR outputs, layout awareness, and content parsing across diverse document types and scenes.
- Additional imagery
- fig1.png (contextual compression concept) provides a macro view of how text and images interact within a scene.
- The logo and badge images at the top of the page underline branding and project affiliations.
In practice, these visuals help users assess a model’s ability to maintain textual fidelity while simultaneously preserving document structure and figure content.
Acknowledgments
DeepSeek-OCR stands on the shoulders of many contributors and related projects. The authors recognize and thank a number of collaborators and predecessors who have shaped the field of OCR and multimodal reasoning:
- Vary (Ucas-HaoranWei/Vary)
- GOT-OCR2.0 (Ucas-HaoranWei/GOT-OCR2.0)
- MinerU (opendatalab/MinerU)
- PaddleOCR (PaddlePaddle/PaddleOCR)
- OneChart (LingyvKong/OneChart)
- Slow Perception (Ucas-HaoranWei/Slow-Perception)
The project also acknowledges benchmarks and datasets from Fox and OmniDocBench, which provide valuable context for evaluating OCR systems and multi-document understanding.
Citations
For academic referencing, the DeepSeek-OCR work is documented in a BibTeX entry:
@article{wei2025deepseek,
  title={DeepSeek-OCR: Contexts Optical Compression},
  author={Wei, Haoran and Sun, Yaofeng and Li, Yukun},
  journal={arXiv preprint arXiv:2510.18234},
  year={2025}
}
This citation captures the core contribution and connects readers with the arXiv preprint for detailed methodology, experiments, and theoretical framing.
Practical Takeaways
- DeepSeek-OCR represents a thoughtful integration of vision and language for OCR tasks, emphasizing context, structure, and modality fusion.
- The project’s dual-inference pathways—vLLM and Transformers—offer flexibility for researchers and practitioners working in diverse environments and with different performance requirements.
- The ongoing releases, including DeepSeek-OCR2 and upstream vLLM support, signal a trajectory toward more scalable and robust multimodal OCR solutions.
- The emphasis on prompts, modular prompts, and structured outputs makes DeepSeek-OCR a practical tool for document ingestion pipelines, archival projects, and research into visual-text compression.
Conclusion
DeepSeek-OCR stands at the intersection of visual processing and language reasoning, offering a compelling exploration of contexts optical compression. By treating vision encoders as integral partners to language models and by providing practical guidance for deployment via vLLM and Transformers, the project makes a strong case for a more integrated approach to OCR. The release cadence demonstrates an active, collaborative effort to improve performance, scalability, and usability, ensuring that researchers and developers can experiment with real-world documents, diagrams, and mixed-media pages.
As OCR continues to evolve—from simple text extraction to nuanced understanding of layout, semantics, and references—DeepSeek-OCR invites you to experiment, measure, and push the boundaries of how machines read the world. The collaboration-friendly ecosystem, together with clear installation guidance and robust inference pathways, makes this an appealing entry point for exploring the next generation of multimodal document understanding.
Repository: https://github.com/deepseek-ai/DeepSeek-OCR