OCR (Optical Character Recognition) - Document Converter API

OCR (Optical Character Recognition) enables the Document Converter to extract text from images, scanned documents, and PDFs that contain non-selectable text.

OCR Providers

The Document Converter supports multiple OCR providers, each with different strengths:

PaddleOCR

Recommended for production

80+ languages supported
High accuracy
Free and open source
Works offline
GPU acceleration support

EasyOCR

Easy to set up

80+ languages supported
Simple installation
Good performance
PyTorch-based
Active community

Mistral AI

AI-powered

Context-aware processing
Excellent accuracy
Multi-modal understanding
Requires API key
Cloud-based

Configuration

PaddleOCR

Environment Variables:

# Enable PaddleOCR
PADDLE_OCR_USE_GPU=false
PADDLE_OCR_LANG=en
PADDLE_MODELS_DIR=./models/paddle_models

# PaddlePaddle optimization
FLAGS_USE_MKLDNN=0
FLAGS_ENABLE_EAGER_MODE=1
PADDLE_LOG_LEVEL=3
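
As a rough illustration of how a service might consume these settings, the sketch below parses the PaddleOCR variables from an environment mapping. The names `PaddleSettings`, `parse_bool`, and `load_paddle_settings` are hypothetical and not part of the Document Converter; the fallback values simply mirror the example values shown above.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class PaddleSettings:
    """Hypothetical container for the PaddleOCR variables shown above."""
    use_gpu: bool
    lang: str
    models_dir: str


def parse_bool(value: str) -> bool:
    """Treat 'true', '1', 'yes', and 'on' (any case) as True; anything else is False."""
    return value.strip().lower() in {"true", "1", "yes", "on"}


def load_paddle_settings(env: Mapping[str, str] = os.environ) -> PaddleSettings:
    """Read the PaddleOCR settings, falling back to the example values."""
    return PaddleSettings(
        use_gpu=parse_bool(env.get("PADDLE_OCR_USE_GPU", "false")),
        lang=env.get("PADDLE_OCR_LANG", "en"),
        models_dir=env.get("PADDLE_MODELS_DIR", "./models/paddle_models"),
    )
```

Passing a plain dictionary instead of `os.environ` makes the parsing easy to check in isolation, for example `load_paddle_settings({"PADDLE_OCR_LANG": "ch"})`.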

Supported Languages:

English (en)
Chinese Simplified (ch)
Chinese Traditional (chinese_cht)
French (french)
German (german)
Japanese (japan)
Korean (korean)
Spanish (spanish)
And 70+ more languages

Installation:

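# PaddleOCR package (provides the paddleocr Python module)
pip install paddleocr

# PaddlePaddle runtime, CPU version (default)
pip install paddlepaddle

# PaddlePaddle runtime, GPU version (requires CUDA)
pip install paddlepaddle-gpu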

Usage Example:

# Using PaddleOCR provider
import requests

# Open the file in a context manager so the handle is closed after the upload
with open('scanned_document.pdf', 'rb') as f:
    response = requests.post(
        'http://localhost:8000/api/v1/jobs',
        files={'file': f},
        data={
            'output_format': 'md',
            'use_ocr': 'true',
            'ocr_provider': 'paddle'
        }
    )

EasyOCR

Environment Variables:

# Enable EasyOCR
EASY_OCR_USE_GPU=false
EASY_OCR_LANG=en
EASY_OCR_MODEL_STORAGE=./models/easy_ocr_models

# Advanced settings
EASY_OCR_TEXT_THRESHOLD=0.7
EASY_OCR_LINK_THRESHOLD=0.4
EASY_OCR_LOW_TEXT=0.4

Language Configuration:

# Single language
EASY_OCR_LANG=en

# Multiple languages (comma-separated)
# Note: EasyOCR only accepts certain combinations; for example, ch_sim can be paired only with en
EASY_OCR_LANG=en,fr,de

Common Language Codes:

en - English
ch_sim - Chinese Simplified
ch_tra - Chinese Traditional
fr - French
de - German
ja - Japanese
ko - Korean
es - Spanish
ru - Russian
ar - Arabic
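
A comma-separated value such as `EASY_OCR_LANG` has to be split into individual codes before it can be handed to an OCR engine. The sketch below shows one way to do that; `parse_lang_list` is a hypothetical helper, not a function of the Document Converter or of EasyOCR.

```python
def parse_lang_list(raw: str) -> list[str]:
    """Split a comma-separated language setting into unique, ordered codes."""
    codes: list[str] = []
    for code in raw.split(","):
        code = code.strip()
        # Skip empty entries (e.g. a trailing comma) and duplicates.
        if code and code not in codes:
            codes.append(code)
    if not codes:
        raise ValueError("at least one language code is required")
    return codes
```

The resulting list has the shape that EasyOCR's `Reader` takes as its first argument, for example `easyocr.Reader(parse_lang_list("en,fr"), gpu=False)`.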

GPU Setup:

# For GPU support
EASY_OCR_USE_GPU=true

# Requires CUDA; use the index URL that matches your CUDA version (cu118 = CUDA 11.8)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

Mistral AI

Environment Variables:

# Mistral AI configuration
MISTRAL_API_KEY=your_api_key_here
MISTRAL_API_URL=https://api.mistral.ai/v1/chat/completions
MISTRAL_MODEL=pixtral-12b-2409

Getting API Key:

Sign up at Mistral AI
Navigate to API Keys section
Create a new API key
Add to your environment variables

Model Options:

pixtral-12b-2409 - 12B vision model (recommended default)
pixtral-large-2411 - Larger model for higher accuracy

Usage Example:

# Using Mistral AI provider
import requests

# Open the file in a context manager so the handle is closed after the upload
with open('complex_image.png', 'rb') as f:
    response = requests.post(
        'http://localhost:8000/api/v1/jobs',
        files={'file': f},
        data={
            'output_format': 'json',
            'use_ocr': 'true',
            'ocr_provider': 'mistral'
        }
    )

Pricing:

Pay-per-use based on API calls
Check Mistral AI pricing for current rates
More expensive than the local providers, but typically more accurate

When to Use OCR

Required for:

Scanned documents (PDF, images)
Screenshots of text
Photos of documents
Handwritten text (limited support)
PDFs with image-based text
Old or legacy documents

Not needed for:

Modern PDF files with selectable text
Word documents (.docx)
PowerPoint presentations (.pptx)
Excel spreadsheets (.xlsx)
Plain text files (.txt)
Web pages (HTML)
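
The two lists above can be turned into a simple pre-check before submitting a job. This is a minimal sketch: `suggest_use_ocr` and the extension set are illustrative assumptions, and whether a PDF already has a selectable text layer must be determined separately by the caller.

```python
from pathlib import Path

# Illustrative set of image extensions; adjust to the formats you actually accept.
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".tif", ".tiff", ".bmp"}


def suggest_use_ocr(filename: str, pdf_has_text_layer: bool = False) -> bool:
    """Return True when the file type suggests that OCR is required."""
    ext = Path(filename).suffix.lower()
    if ext in IMAGE_EXTENSIONS:
        # Scans, screenshots, and photos contain no selectable text.
        return True
    if ext == ".pdf":
        # PDFs need OCR only when their text is image-based.
        return not pdf_has_text_layer
    # .docx, .pptx, .xlsx, .txt, and HTML already carry their text.
    return False
```

The boolean can then be mapped to the `use_ocr` form field, for example `'true' if suggest_use_ocr(name) else 'false'`.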

OCR Quality Factors

Image Quality

Resolution:

Minimum: 150 DPI
Recommended: 300+ DPI
Higher resolution generally improves accuracy, at the cost of slower processing
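
When a scan's DPI is not recorded in the file, it can be estimated as pixel count divided by physical size in inches. The sketch below applies that arithmetic and compares the result with the 150 and 300 DPI thresholds listed above; the function names and rating labels are hypothetical.

```python
def effective_dpi(pixels: int, inches: float) -> float:
    """Estimate DPI along one axis from its pixel count and physical length."""
    if inches <= 0:
        raise ValueError("inches must be positive")
    return pixels / inches


def resolution_rating(dpi: float) -> str:
    """Classify a DPI value against the thresholds from this section."""
    if dpi < 150:
        return "below minimum"
    if dpi < 300:
        return "acceptable"
    return "recommended"
```

For instance, a US Letter page (8.5 inches wide) scanned to 2550 pixels across gives `effective_dpi(2550, 8.5)`, which is 300.0 and rates as "recommended".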

Contrast:

High contrast between text and background
Black text on white background is ideal
Avoid low contrast color combinations

Clarity:

Sharp, focused images
Avoid blurry or pixelated text
Good lighting conditions

Text Characteristics

Font Size:

Minimum: 10pt font size
Recommended: 12pt or larger
Very small text may be missed

Font Type:

Sans-serif fonts work better
Avoid decorative or script fonts
Standard fonts (Arial, Times) are ideal

Text Orientation:

Horizontal text works best
Vertical text supported but less accurate
Avoid skewed or rotated text

Layout Considerations

Structure:

Clear column separation
Consistent spacing
Avoid overlapping text

Background:

Clean, uniform background
Avoid patterns or textures
Remove noise and artifacts

Margins:

Adequate white space around text
Clear boundaries between sections
Avoid text near edges

Performance Comparison

Provider	Speed	Accuracy	Languages	Offline	GPU Support
PaddleOCR	Fast	High	80+	✅	✅
EasyOCR	Medium	Good	80+	✅	✅
Mistral AI	Slow	Very High	All	❌	N/A

Best Practices

Choose the Right Provider

PaddleOCR: Production workloads, batch processing
EasyOCR: Development, simple setups
Mistral AI: Complex documents, maximum accuracy needed
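
The guidance above can be expressed as a small decision function. This is only a sketch under stated assumptions: `choose_provider` and its parameters are hypothetical, and while `'paddle'` and `'mistral'` appear as `ocr_provider` values in this document, `'easyocr'` is a guessed identifier that should be checked against the API reference.

```python
def choose_provider(complex_document: bool, can_use_cloud: bool, production: bool) -> str:
    """Map the provider guidance to an ocr_provider value."""
    # Mistral AI is cloud-based and requires an API key, so only pick it
    # when sending documents to an external service is acceptable.
    if complex_document and can_use_cloud:
        return "mistral"
    # PaddleOCR is the recommended choice for production and batch workloads.
    if production:
        return "paddle"
    # Assumed identifier: the examples here only show 'paddle' and 'mistral'.
    return "easyocr"
```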

Optimize Images

Scan at 300+ DPI resolution
Use high contrast settings
Ensure proper lighting
Crop to relevant areas

Configure Languages

Specify exact languages for better accuracy
Avoid unnecessary languages (slows processing)
Use multiple languages only when needed

Monitor Performance

Track processing times
Monitor accuracy rates
Adjust settings based on results
Consider hardware upgrades for GPU acceleration

Advanced Configuration

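from paddleocr import PaddleOCR

# Initialize with custom settings.
# Note: these keyword arguments follow the PaddleOCR 2.x API and may differ
# in newer major releases, so check them against your installed version.
ocr = PaddleOCR(
    use_angle_cls=True,  # Load the classifier that detects rotated text
    lang='en',           # Recognition language
    use_gpu=False,       # GPU acceleration
    show_log=False,      # Suppress PaddleOCR log output

    # Detection parameters
    det_db_thresh=0.3,        # Binarization threshold for the text probability map
    det_db_box_thresh=0.5,    # Minimum score for a detected box to be kept
    det_db_unclip_ratio=1.6,  # How far detected boxes are expanded

    # Recognition parameters
    rec_batch_num=6,     # Images per recognition batch
    max_text_length=25,  # Maximum characters recognized per text line

    # Text direction classification
    cls_batch_num=6,     # Images per classification batch
    cls_thresh=0.9,      # Confidence needed to accept a detected rotation
)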

Troubleshooting

PaddleOCR Issues

Common Problems:

Model download failures
CUDA compatibility issues
Memory errors with large images

Solutions:

# Pre-download models
python -c "from paddleocr import PaddleOCR; PaddleOCR(use_angle_cls=True, lang='en')"

# Disable GPU if issues
PADDLE_OCR_USE_GPU=false

# Reduce memory overhead by capping the number of glibc malloc arenas
export MALLOC_ARENA_MAX=4

EasyOCR Issues

Common Problems:

PyTorch installation issues
CUDA version mismatches
Model loading failures

Solutions:

# Reinstall PyTorch
pip uninstall torch torchvision torchaudio
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu

# Clear model cache
rm -rf ~/.EasyOCR/

# Use CPU only
EASY_OCR_USE_GPU=false

Mistral AI Issues

Common Problems:

API key authentication failures
Rate limiting errors
Network connectivity issues

Solutions:

# Test API key
curl -H "Authorization: Bearer $MISTRAL_API_KEY" \
     "https://api.mistral.ai/v1/models"

# Check rate limits
# Implement exponential backoff
# Monitor API usage
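
Exponential backoff for rate-limit errors can be sketched as follows. This is a generic illustration rather than the Document Converter's own retry logic: `RetryableError` and `call_with_backoff` are hypothetical names, and the caller is assumed to raise `RetryableError` when the API answers with a rate-limit response such as HTTP 429.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RetryableError(Exception):
    """Hypothetical marker for errors worth retrying, e.g. an HTTP 429 response."""


def call_with_backoff(
    func: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func, retrying on RetryableError with exponentially growing delays."""
    for attempt in range(max_attempts):
        try:
            return func()
        except RetryableError:
            if attempt == max_attempts - 1:
                raise  # Out of attempts: surface the last error.
            # The delay doubles on each attempt (1s, 2s, 4s, ...), capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Add up to 10% random jitter so parallel clients do not retry in lockstep.
            sleep(delay + random.uniform(0, delay * 0.1))
    raise ValueError("max_attempts must be at least 1")
```

The `sleep` parameter is injectable so that the retry schedule can be checked without actually waiting.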

General OCR Issues

Poor Accuracy:

Improve image quality
Use correct language settings
Try different OCR providers
Preprocess images (noise reduction, contrast enhancement)

Slow Processing:

Use GPU acceleration
Reduce image resolution
Process in batches
Consider hardware upgrades

Memory Issues:

Reduce image size
Process images sequentially
Increase system memory
Use image compression

API Examples

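import requests

# Simple OCR conversion
# Open the file in a context manager so the handle is closed after the upload
with open('scanned_document.pdf', 'rb') as f:
    response = requests.post(
        'http://localhost:8000/api/v1/jobs',
        files={'file': f},
        data={
            'output_format': 'md',
            'use_ocr': 'true',
            'ocr_provider': 'paddle'
        }
    )

# Fail early on HTTP errors instead of parsing an error body as a job
response.raise_for_status()

job = response.json()
print(f"Job ID: {job['id']}")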

Monitoring OCR Performance

Quality Metrics

Character accuracy rate
Word accuracy rate
Processing time per page
Error rate by document type
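
Character and word accuracy are commonly computed from the edit distance between the OCR output and a hand-checked reference transcript. The sketch below uses that common definition (1 minus errors divided by reference length); it is an assumption about how these metrics might be measured, not a description of what the Document Converter reports.

```python
def edit_distance(a, b) -> int:
    """Levenshtein distance between two sequences (strings or lists of words)."""
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def accuracy(reference, hypothesis) -> float:
    """Return 1 - (edit errors / reference length), clamped at 0."""
    if not reference:
        return 1.0 if not hypothesis else 0.0
    return max(0.0, 1.0 - edit_distance(reference, hypothesis) / len(reference))


def character_accuracy(reference: str, ocr_text: str) -> float:
    """Accuracy measured over individual characters."""
    return accuracy(reference, ocr_text)


def word_accuracy(reference: str, ocr_text: str) -> float:
    """Accuracy measured over whitespace-separated words."""
    return accuracy(reference.split(), ocr_text.split())
```

Word accuracy is usually the lower of the two, because a single wrong character makes the whole word count as an error.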

System Metrics

CPU usage during OCR
Memory consumption
GPU utilization (if enabled)
Network usage (for Mistral AI)

Example Monitoring Script:

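import time
import psutil
import requests

def monitor_ocr_job(job_id, timeout=300, poll_interval=1):
    """Poll a job until it completes, then report elapsed time and memory change."""
    start_time = time.time()
    # System-wide memory in use, not just the OCR process, so treat it as a rough signal
    start_memory = psutil.virtual_memory().used

    # Poll job status until completion or timeout
    while time.time() - start_time < timeout:
        response = requests.get(f"http://localhost:8000/api/v1/jobs/{job_id}")
        response.raise_for_status()
        job = response.json()

        if job['status'] == 'completed':
            end_time = time.time()
            end_memory = psutil.virtual_memory().used

            print(f"Processing time: {end_time - start_time:.2f}s")
            print(f"Memory usage: {(end_memory - start_memory) / 1024 / 1024:.2f}MB")
            return job

        time.sleep(poll_interval)

    # Give up instead of looping forever if the job fails or stalls
    raise TimeoutError(f"Job {job_id} did not complete within {timeout}s")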

Next Steps

Output Formats

Learn about structured JSON and Markdown outputs

Webhooks

Set up real-time notifications for OCR jobs

Production Deployment

Deploy OCR processing at scale

Examples

See complete OCR integration examples
