OCR (Optical Character Recognition) enables the Document Converter to extract text from images, scanned documents, and PDFs that contain non-selectable text.

OCR Providers

The Document Converter supports multiple OCR providers, each with different strengths:

PaddleOCR

Recommended for production
  • 80+ languages supported
  • High accuracy
  • Free and open source
  • Works offline
  • GPU acceleration support

EasyOCR

Easy to set up
  • 80+ languages supported
  • Simple installation
  • Good performance
  • PyTorch-based
  • Active community

Mistral AI

AI-powered
  • Context-aware processing
  • Excellent accuracy
  • Multi-modal understanding
  • Requires API key
  • Cloud-based

Configuration

Environment Variables (PaddleOCR):
# PaddleOCR settings
PADDLE_OCR_USE_GPU=false
PADDLE_OCR_LANG=en
PADDLE_MODELS_DIR=./models/paddle_models

# PaddlePaddle optimization
FLAGS_USE_MKLDNN=0
FLAGS_ENABLE_EAGER_MODE=1
PADDLE_LOG_LEVEL=3
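
If your own scripts need the same settings, the variables above can be read with a small helper. This is a minimal sketch: the `PaddleSettings` class and `load_paddle_settings` function are hypothetical names, and how the Document Converter itself parses these variables is not shown on this page.

```python
import os
from dataclasses import dataclass


def _env_bool(name: str, default: bool) -> bool:
    """Interpret common truthy strings; any other value counts as False."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class PaddleSettings:  # hypothetical helper, not part of the Document Converter
    use_gpu: bool
    lang: str
    models_dir: str


def load_paddle_settings() -> PaddleSettings:
    """Read the PaddleOCR variables, falling back to the values shown above."""
    return PaddleSettings(
        use_gpu=_env_bool("PADDLE_OCR_USE_GPU", False),
        lang=os.environ.get("PADDLE_OCR_LANG", "en"),
        models_dir=os.environ.get("PADDLE_MODELS_DIR", "./models/paddle_models"),
    )
```

Parsing the boolean explicitly matters because any non-empty string, including "false", is truthy in Python.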
Supported Languages:
  • English (en)
  • Chinese Simplified (ch)
  • Chinese Traditional (chinese_cht)
  • French (french)
  • German (german)
  • Japanese (japan)
  • Korean (korean)
  • Spanish (spanish)
  • And 70+ more languages
Installation:
# CPU version (default)
pip install paddlepaddle paddleocr

# GPU version (requires CUDA)
pip install paddlepaddle-gpu paddleocr
Usage Example:
import requests

# Using PaddleOCR provider
with open('scanned_document.pdf', 'rb') as f:
    response = requests.post(
        'http://localhost:8000/api/v1/jobs',
        files={'file': f},
        data={
            'output_format': 'md',
            'use_ocr': 'true',
            'ocr_provider': 'paddle'
        }
    )

When to Use OCR

Required for:

  • Scanned documents (PDF, images)
  • Screenshots of text
  • Photos of documents
  • Handwritten text (limited support)
  • PDFs with image-based text
  • Old or legacy documents

Not needed for:

  • Modern PDF files with selectable text
  • Word documents (.docx)
  • PowerPoint presentations (.pptx)
  • Excel spreadsheets (.xlsx)
  • Plain text files (.txt)
  • Web pages (HTML)
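
The two lists above can be turned into a rough pre-check before submitting a job. The sketch below is an illustration only: `needs_ocr` is a hypothetical helper, and the set of image extensions is an assumption about what you are likely to upload. It deliberately returns `None` for PDFs, because the extension cannot tell a text PDF from a scanned one.

```python
from pathlib import Path
from typing import Optional

# Extensions derived from the two lists above; the image set is an assumption.
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".tif", ".tiff", ".bmp"}
TEXT_BASED_EXTENSIONS = {".docx", ".pptx", ".xlsx", ".txt", ".html", ".htm"}


def needs_ocr(filename: str) -> Optional[bool]:
    """Return True/False when the extension settles it, None when it does not.

    PDFs return None: a PDF may hold selectable text or only page images,
    and the extension alone cannot tell the two apart.
    """
    suffix = Path(filename).suffix.lower()
    if suffix in IMAGE_EXTENSIONS:
        return True
    if suffix in TEXT_BASED_EXTENSIONS:
        return False
    return None
```

For the `None` case, you still have to decide per document, for example by checking whether text in the PDF can be selected.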

OCR Quality Factors

Resolution:
  • Minimum: 150 DPI
  • Recommended: 300+ DPI
  • Higher resolution generally improves accuracy
Contrast:
  • High contrast between text and background
  • Black text on white background is ideal
  • Avoid low contrast color combinations
Clarity:
  • Sharp, focused images
  • Avoid blurry or pixelated text
  • Good lighting conditions
Font Size:
  • Minimum: 10pt font size
  • Recommended: 12pt or larger
  • Very small text may be missed
Font Type:
  • Sans-serif fonts work better
  • Avoid decorative or script fonts
  • Standard fonts (Arial, Times) are ideal
Text Orientation:
  • Horizontal text works best
  • Vertical text supported but less accurate
  • Avoid skewed or rotated text
Structure:
  • Clear column separation
  • Consistent spacing
  • Avoid overlapping text
Background:
  • Clean, uniform background
  • Avoid patterns or textures
  • Remove noise and artifacts
Margins:
  • Adequate white space around text
  • Clear boundaries between sections
  • Avoid text near edges
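
The resolution and font-size guidance above comes down to simple arithmetic: DPI is pixel count divided by physical length, and one point is 1/72 inch. The helpers below are a sketch with hypothetical names; they only restate the thresholds from this section.

```python
def effective_dpi(pixels: int, inches: float) -> float:
    """Dots per inch along one axis: pixel count divided by physical length."""
    if inches <= 0:
        raise ValueError("inches must be positive")
    return pixels / inches


def resolution_rating(dpi: float) -> str:
    """Classify a resolution against the 150 / 300 DPI guidance above."""
    if dpi < 150:
        return "below minimum"
    if dpi < 300:
        return "acceptable"
    return "recommended"


def font_height_px(point_size: float, dpi: float) -> float:
    """Nominal height of a font in pixels: one point is 1/72 inch."""
    return point_size / 72 * dpi


# A US Letter page (8.5 in wide) scanned to 2550 pixels across is 300 DPI.
letter_dpi = effective_dpi(2550, 8.5)
print(letter_dpi, resolution_rating(letter_dpi))
# Nominal pixel height of 10pt text at that resolution.
print(round(font_height_px(10, letter_dpi), 1))
```

At the 150 DPI minimum, 10pt text is only about 21 pixels tall, which is one reason small fonts and low-resolution scans combine badly.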

Performance Comparison

Provider   | Speed  | Accuracy  | Languages | Offline | GPU Support
PaddleOCR  | Fast   | High      | 80+       | Yes     | Yes
EasyOCR    | Medium | Good      | 80+       | Yes     | Yes
Mistral AI | Slow   | Very High | All       | No      | N/A

Best Practices

1. Choose the Right Provider

  • PaddleOCR: Production workloads, batch processing
  • EasyOCR: Development, simple setups
  • Mistral AI: Complex documents, maximum accuracy needed
2. Optimize Images

  • Scan at 300+ DPI resolution
  • Use high contrast settings
  • Ensure proper lighting
  • Crop to relevant areas
3. Configure Languages

  • Specify exact languages for better accuracy
  • Avoid unnecessary languages (slows processing)
  • Use multiple languages only when needed
4. Monitor Performance

  • Track processing times
  • Monitor accuracy rates
  • Adjust settings based on results
  • Consider hardware upgrades for GPU acceleration
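
The provider guidance in the first practice above can be written as a small decision function. This is a sketch, not the converter's own logic: `choose_ocr_provider` is a hypothetical helper, `'paddle'` is the identifier used in this page's API examples, and `'easyocr'` and `'mistral'` are assumed identifiers that you should check against your deployment.

```python
def choose_ocr_provider(complex_layout: bool, offline_required: bool,
                        production: bool) -> str:
    """Pick a provider identifier following the provider guidance above.

    'paddle' appears in this page's API examples; 'mistral' and 'easyocr'
    are assumed identifiers.
    """
    # Mistral AI is cloud-based, so it is only an option with network access.
    if complex_layout and not offline_required:
        return "mistral"
    # PaddleOCR is the recommendation for production and batch workloads.
    if production:
        return "paddle"
    # EasyOCR covers development and simple setups.
    return "easyocr"
```

The returned string would then be passed as the `ocr_provider` form field when creating a job.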

Advanced Configuration

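from paddleocr import PaddleOCR

# Initialize with custom settings.
# These keyword arguments follow the PaddleOCR 2.x constructor; newer
# releases may rename or drop some of them, so check your installed version.
ocr = PaddleOCR(
    use_angle_cls=True,  # Load the angle classifier for rotated text lines
    lang='en',           # Recognition language
    use_gpu=False,       # GPU acceleration
    show_log=False,      # Disable logging

    # Detection parameters (DB text detector)
    det_db_thresh=0.3,        # Pixel threshold for the text probability map
    det_db_box_thresh=0.5,    # Minimum score for a detected box to be kept
    det_db_unclip_ratio=1.6,  # How far to expand each detected box

    # Recognition parameters
    rec_batch_num=6,     # Text lines recognized per batch
    max_text_length=25,  # Maximum characters per recognized line

    # Text direction classification
    cls_batch_num=6,  # Text lines classified per batch
    cls_thresh=0.9,   # Confidence needed to accept a predicted angle
)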

Troubleshooting

Common Problems (PaddleOCR):
  • Model download failures
  • CUDA compatibility issues
  • Memory errors with large images
Solutions:
# Pre-download models
python -c "from paddleocr import PaddleOCR; PaddleOCR(use_angle_cls=True, lang='en')"

# Disable GPU if CUDA issues persist
PADDLE_OCR_USE_GPU=false

# Cap glibc malloc arenas to reduce memory overhead
export MALLOC_ARENA_MAX=4
Common Problems (EasyOCR):
  • PyTorch installation issues
  • CUDA version mismatches
  • Model loading failures
Solutions:
# Reinstall PyTorch
pip uninstall torch torchvision torchaudio
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu

# Clear model cache
rm -rf ~/.EasyOCR/

# Use CPU only
EASY_OCR_USE_GPU=false
Common Problems (Mistral AI):
  • API key authentication failures
  • Rate limiting errors
  • Network connectivity issues
Solutions:
# Test API key
curl -H "Authorization: Bearer $MISTRAL_API_KEY" \
     "https://api.mistral.ai/v1/models"

# Check rate limits
# Implement exponential backoff
# Monitor API usage
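
The exponential backoff mentioned above could look roughly like the following. This is a generic sketch under stated assumptions: `RateLimitError` and `call_with_backoff` are hypothetical names, and you would raise `RateLimitError` from your own client code when the API responds with HTTP 429.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimitError(Exception):
    """Hypothetical exception; raise it from your own client code on HTTP 429."""


def call_with_backoff(
    func: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func, retrying on RateLimitError with exponentially growing delays."""
    for attempt in range(max_attempts):
        try:
            return func()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # The delay ceiling doubles each attempt (1s, 2s, 4s, ...), capped at max_delay.
            delay = min(base_delay * 2 ** attempt, max_delay)
            # Full jitter spreads out retries from concurrent clients.
            sleep(random.uniform(0, delay))
    raise ValueError("max_attempts must be at least 1")
```

The `sleep` parameter is injectable so the retry logic can be exercised in tests without real waiting.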
Poor Accuracy:
  • Improve image quality
  • Use correct language settings
  • Try different OCR providers
  • Preprocess images (noise reduction, contrast enhancement)
Slow Processing:
  • Use GPU acceleration
  • Reduce image resolution (while staying above the 150 DPI minimum)
  • Process in batches
  • Consider hardware upgrades
Memory Issues:
  • Reduce image size
  • Process images sequentially
  • Increase system memory
  • Use image compression

API Examples

import requests

# Simple OCR conversion
with open('scanned_document.pdf', 'rb') as f:
    response = requests.post(
        'http://localhost:8000/api/v1/jobs',
        files={'file': f},
        data={
            'output_format': 'md',
            'use_ocr': 'true',
            'ocr_provider': 'paddle'
        }
    )

response.raise_for_status()  # Fail early on HTTP errors
job = response.json()
print(f"Job ID: {job['id']}")

Monitoring OCR Performance

Quality Metrics

  • Character accuracy rate
  • Word accuracy rate
  • Processing time per page
  • Error rate by document type
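
Character and word accuracy are commonly computed from the edit distance between a hand-checked reference transcription and the OCR output. The sketch below uses that common definition (1 minus edit distance divided by reference length); it is an assumption, since this page does not state which formula the Document Converter uses, and all function names are hypothetical.

```python
def edit_distance(a, b) -> int:
    """Levenshtein distance between two sequences (strings or word lists)."""
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def accuracy(reference, hypothesis) -> float:
    """1 - (edit distance / reference length), floored at 0."""
    if len(reference) == 0:
        return 1.0 if len(hypothesis) == 0 else 0.0
    return max(0.0, 1 - edit_distance(reference, hypothesis) / len(reference))


def character_accuracy(reference: str, ocr_text: str) -> float:
    """Accuracy measured over individual characters."""
    return accuracy(reference, ocr_text)


def word_accuracy(reference: str, ocr_text: str) -> float:
    """Accuracy measured over whitespace-separated words."""
    return accuracy(reference.split(), ocr_text.split())
```

A single misread character lowers word accuracy more than character accuracy, because it invalidates the whole word; tracking both helps separate occasional glyph errors from larger layout failures.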

System Metrics

  • CPU usage during OCR
  • Memory consumption
  • GPU utilization (if enabled)
  • Network usage (for Mistral AI)
Example Monitoring Script:
import time
import psutil
import requests

def monitor_ocr_job(job_id, timeout=300):
    start_time = time.time()
    # System-wide memory in use, not just the OCR process
    start_memory = psutil.virtual_memory().used

    # Poll job status until it completes or the timeout expires
    while time.time() - start_time < timeout:
        response = requests.get(f"http://localhost:8000/api/v1/jobs/{job_id}")
        response.raise_for_status()
        job = response.json()

        if job['status'] == 'completed':
            end_time = time.time()
            end_memory = psutil.virtual_memory().used

            print(f"Processing time: {end_time - start_time:.2f}s")
            print(f"Memory change: {(end_memory - start_memory) / 1024 / 1024:.2f}MB")
            return

        time.sleep(1)

    raise TimeoutError(f"Job {job_id} did not complete within {timeout}s")

Next Steps

  • Output Formats: Learn about structured JSON and Markdown outputs
  • Webhooks: Set up real-time notifications for OCR jobs
  • Production Deployment: Deploy OCR processing at scale
  • Examples: See complete OCR integration examples