Document Conversion - Document Converter API

The Document Converter accepts documents, presentations, spreadsheets, and images as input and produces Markdown or JSON output.

Supported Input Formats

Documents

PDF: Portable Document Format (.pdf)
Word: Microsoft Word (.docx)
RTF: Rich Text Format (.rtf)
Text: Plain text (.txt), Markdown (.md), Log files

Presentations

PowerPoint: .pptx, .pptm, .potx, .potm
OpenDocument: .odp (planned)
Google Slides: Via export (planned)

Spreadsheets

Excel: .xlsx, .xlsm, .xls
CSV: Comma-separated values (.csv)
OpenDocument: .ods (planned)

Images

Raster: PNG, JPEG, GIF, BMP, WebP, ICO, TIFF
Vector: SVG (planned)
Note: extracting text from images requires OCR.

Output Formats

Markdown
JSON

Markdown: human-readable format with embedded images.

Features:

Preserves document structure with headers
Maintains formatting (bold, italic, lists)
Embeds images as base64 data URLs
Compatible with documentation systems
Easy to read and edit

Example Output:

# Document Title

## Section 1

This is a paragraph with **bold** and *italic* text.

- List item 1
- List item 2
- List item 3

![Image](data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAYAAAAfFcSJ...)

## Section 2

More content here...

Use Cases:

Documentation generation
Content management systems
Static site generators
Wiki systems
README files
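
Because images are embedded as base64 data URLs, a Markdown result can be large, and some downstream tools may prefer the images as separate files. The following is a minimal sketch, not part of the API: the helper names extract_images and strip_images are hypothetical, and the regular expression assumes the image syntax shown in the example output above.

```python
import base64
import re

# Matches Markdown images whose target is a base64 data URL,
# e.g. ![Alt](data:image/png;base64,AAAA...)
DATA_URL_IMAGE = re.compile(
    r"!\[(?P<alt>[^\]]*)\]"
    r"\(data:image/(?P<subtype>[a-zA-Z0-9.+-]+);base64,(?P<data>[A-Za-z0-9+/=]+)\)"
)


def extract_images(markdown: str) -> list:
    """Return (alt_text, image_subtype, decoded_bytes) for each embedded image."""
    images = []
    for match in DATA_URL_IMAGE.finditer(markdown):
        raw = base64.b64decode(match.group("data"))
        images.append((match.group("alt"), match.group("subtype"), raw))
    return images


def strip_images(markdown: str) -> str:
    """Replace each embedded image with a short text placeholder."""
    return DATA_URL_IMAGE.sub(lambda m: f"[image: {m.group('alt')}]", markdown)
```

extract_images is useful when the images need to be written to disk, while strip_images keeps the text small for indexing or diffing.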

JSON: structured data preserving document hierarchy.

Features:

Maintains original document structure
Programmatically accessible
Includes metadata and images
Type-specific formatting
Supports complex data extraction

Example Output:

{
  "document": {
    "filename": "report.pdf",
    "title": "Annual Report",
    "type": "PDF Document",
    "metadata": {
      "pages": 25,
      "author": "John Doe",
      "created": "2024-01-15T10:30:00Z"
    }
  },
  "content": {
    "type": "multi_page_document",
    "total_pages": 25,
    "pages": [
      {
        "page_number": 1,
        "text": "Page content...",
        "word_count": 245
      }
    ]
  },
  "images": {
    "chart1.png": "base64_image_data"
  }
}

Use Cases:

Data extraction pipelines
Search indexing
Content analysis
Database import
API integration
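
For pipelines such as search indexing, the JSON result can be reduced to plain text plus decoded image files. This is a sketch under two assumptions: the result has the shape of the example output above, and each value under "images" is plain base64 (not a data URL). The function names are illustrative.

```python
import base64


def full_text(result: dict) -> str:
    """Join the text of every page, ordered by page_number."""
    pages = sorted(result["content"]["pages"], key=lambda p: p["page_number"])
    return "\n\n".join(page["text"] for page in pages)


def decode_images(result: dict) -> dict:
    """Map each image filename to its decoded bytes."""
    return {
        name: base64.b64decode(data)
        for name, data in result.get("images", {}).items()
    }
```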

Format-Specific Features

PDF Processing

Capabilities:

Multi-page document extraction
Table detection and extraction
Image extraction with OCR
Metadata preservation (author, creation date)
Bookmark and outline preservation

Limitations:

Scanned PDFs require OCR
Complex layouts may need manual review
Password-protected PDFs not supported
Some fonts may not render correctly

Configuration:

# Enable OCR for scanned PDFs
use_ocr=true
ocr_provider=paddle
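
One way these options might be supplied is as form fields on the job request. This is an assumption: the page does not say whether use_ocr and ocr_provider are request fields or server-side settings, so check the API reference first. The endpoint is the one used under Integration Examples; build_job_fields and submit_pdf are hypothetical helper names.

```python
from typing import Optional


def build_job_fields(output_format: str = "json",
                     use_ocr: bool = False,
                     ocr_provider: Optional[str] = None) -> dict:
    """Build the form fields for a conversion request.

    Assumption: use_ocr and ocr_provider are accepted as form fields
    next to output_format.
    """
    fields = {
        "output_format": output_format,
        "use_ocr": "true" if use_ocr else "false",
    }
    if use_ocr and ocr_provider:
        fields["ocr_provider"] = ocr_provider
    return fields


def submit_pdf(path: str, base_url: str = "http://localhost:8000/api/v1") -> dict:
    """Upload a scanned PDF with OCR enabled and return the job description."""
    import requests  # third-party; imported lazily so build_job_fields has no dependencies

    with open(path, "rb") as handle:
        response = requests.post(
            f"{base_url}/jobs",
            files={"file": handle},
            data=build_job_fields(use_ocr=True, ocr_provider="paddle"),
        )
    response.raise_for_status()
    return response.json()
```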

Word Document Processing

Capabilities:

Heading hierarchy preservation
Table extraction
Image and embedded object handling
Style and formatting preservation
Comments and tracked changes (basic support)

Supported Elements:

Paragraphs and headings
Lists (bulleted and numbered)
Tables with headers
Images and shapes
Headers and footers

Example Structure:

{
  "content": {
    "type": "structured_document",
    "sections": [
      {
        "heading": "Introduction",
        "level": 1,
        "content": "Document introduction..."
      }
    ]
  }
}
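
The heading and level fields are enough to rebuild a readable outline. The sketch below turns the sections into Markdown headings; it assumes every section carries the three fields shown in the example structure, and sections_to_markdown is an illustrative name.

```python
def sections_to_markdown(result: dict) -> str:
    """Render each section as a Markdown heading followed by its content."""
    blocks = []
    for section in result["content"]["sections"]:
        # Markdown supports heading levels 1 through 6, so clamp the level.
        level = min(max(int(section.get("level", 1)), 1), 6)
        blocks.append("#" * level + " " + section["heading"])
        if section.get("content"):
            blocks.append(section["content"])
    return "\n\n".join(blocks)
```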

PowerPoint Processing

Capabilities:

Slide-by-slide extraction
Title and content separation
Speaker notes extraction
Image and media handling
Slide layout preservation

Slide Elements:

Slide titles and subtitles
Bullet points and lists
Text boxes and shapes
Images and charts
Tables and diagrams

JSON Structure:

{
  "content": {
    "type": "presentation",
    "slides": [
      {
        "slide_number": 1,
        "title": "Slide Title",
        "content": "Slide content...",
        "notes": "Speaker notes..."
      }
    ]
  }
}
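
Since titles, content, and speaker notes arrive as separate fields, a presenter script can be assembled directly from the slides array. This is a small sketch that assumes the structure above; the function name speaker_script is hypothetical.

```python
def speaker_script(result: dict) -> str:
    """Build a plain-text script: one block per slide with its notes."""
    blocks = []
    for slide in result["content"]["slides"]:
        notes = slide.get("notes") or "(no speaker notes)"
        blocks.append(f"Slide {slide['slide_number']}: {slide['title']}\n{notes}")
    return "\n\n".join(blocks)
```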

Excel Processing

Capabilities:

Multi-sheet workbook support
Table and data extraction
Formula preservation (as text)
Chart and graph handling
Metadata extraction

Data Handling:

Header row detection
Data type inference
Empty cell handling
Large dataset sampling

Sample Output:

{
  "content": {
    "type": "spreadsheet",
    "sheets": [
      {
        "name": "Sales Data",
        "headers": ["Date", "Product", "Amount"],
        "sample_data": [
          ["2024-01-01", "Widget A", "150.00"],
          ["2024-01-02", "Widget B", "200.00"]
        ],
        "total_rows": 1000
      }
    ]
  }
}
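
Pairing each row with the headers gives records that are easier to load into a database or data frame. The sketch below assumes the sheet shape shown in the sample output; note that cell values arrive as strings and that sample_data may hold fewer rows than total_rows. Both helper names are illustrative.

```python
def sheet_records(sheet: dict) -> list:
    """Turn each sample row into a dict keyed by the sheet's headers."""
    headers = sheet["headers"]
    return [dict(zip(headers, row)) for row in sheet["sample_data"]]


def is_sampled(sheet: dict) -> bool:
    """True when the output contains fewer rows than the sheet reports."""
    return len(sheet["sample_data"]) < sheet.get("total_rows", 0)
```

A caller that needs every row should check is_sampled before treating the records as complete.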

Image Processing

Capabilities:

OCR text extraction
Image format standardization
Compression and optimization
Metadata preservation
Multi-language support

OCR Providers:

PaddleOCR: High accuracy, 80+ languages
EasyOCR: Simple setup, good performance
Mistral AI: AI-powered, context-aware

Image Optimization:

# Configuration
IMAGE_COMPRESSION_QUALITY=95
IMAGE_MAX_WIDTH=2048
IMAGE_MAX_HEIGHT=2048
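
To estimate what these limits mean for a given image, the arithmetic below scales an image down to fit both bounds while keeping its aspect ratio. Aspect-ratio-preserving downscaling is an assumption about how the limits are applied; the page does not describe the actual resize behavior, and fit_within is a hypothetical helper.

```python
def fit_within(width: int, height: int,
               max_width: int = 2048, max_height: int = 2048) -> tuple:
    """Return the (width, height) after shrinking to fit the bounds.

    Images already inside the bounds are returned unchanged (never upscaled).
    """
    scale = min(max_width / width, max_height / height, 1.0)
    return max(1, round(width * scale)), max(1, round(height * scale))
```

Under this assumption, a 4096x2048 scan would come out as 2048x1024.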

Conversion Quality

High Quality

Text documents and modern Office files: 99%+ accuracy for text extraction.

Good Quality

PDFs with standard fonts and simple layouts: 95%+ accuracy with proper formatting.

Variable Quality

Scanned documents, complex layouts, and images: accuracy depends on OCR quality and image resolution.

Best Practices

Choose the Right Format

Use Markdown for documentation and content management
Use JSON for data processing and API integration
Consider your downstream processing needs

Optimize for OCR

Use high-resolution images (300+ DPI)
Ensure good contrast and lighting
Avoid skewed or rotated text
Choose appropriate OCR provider

Handle Large Files

Monitor file size limits
Consider chunking very large documents
Use appropriate timeout settings
Implement progress monitoring
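
The timeout and progress-monitoring advice can be sketched as a polling helper with a deadline. The helper is transport-agnostic: fetch_status is any callable that returns the job's JSON as a dict. The "completed" status comes from the integration example on this page; "failed" is an assumed name for the failure status, and wait_for_job is a hypothetical helper.

```python
import time

# "completed" appears in the integration example; "failed" is an assumption.
TERMINAL_STATUSES = ("completed", "failed")


def wait_for_job(fetch_status, timeout_s: float = 300.0, interval_s: float = 2.0,
                 sleep=time.sleep, clock=time.monotonic) -> dict:
    """Poll fetch_status() until the job finishes or timeout_s elapses."""
    deadline = clock() + timeout_s
    while True:
        job = fetch_status()
        if job.get("status") in TERMINAL_STATUSES:
            return job
        if clock() >= deadline:
            raise TimeoutError(f"job still {job.get('status')!r} after {timeout_s}s")
        sleep(interval_s)
```

With the requests library, fetch_status could be a lambda that performs the GET on the job URL and returns response.json(). Given the processing times listed on this page, OCR jobs may need a timeout of several minutes.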

Validate Results

Review converted content for accuracy
Test with representative documents
Implement quality checks in your pipeline
Monitor conversion success rates
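
A quality check in a pipeline can start with simple structural assertions on the JSON output. The sketch below assumes the multi-page JSON shape shown under Output Formats and that the pages array is returned in full; check_result is a hypothetical name, and the checks are examples rather than a complete validation suite.

```python
def check_result(result: dict) -> list:
    """Return a list of human-readable problems; an empty list means no issues found."""
    problems = []
    content = result.get("content", {})
    pages = content.get("pages", [])
    total = content.get("total_pages")
    if total is not None and total != len(pages):
        problems.append(f"total_pages is {total} but {len(pages)} pages were returned")
    for page in pages:
        if not page.get("text", "").strip():
            problems.append(f"page {page.get('page_number')} has no text")
    return problems
```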

File Size Limits

The default maximum file size is 100MB. To change it, set the MAX_FILE_SIZE environment variable.

File Type	Recommended Size	Maximum Size
Text files	< 10MB	100MB
PDFs	< 50MB	100MB
Images	< 20MB	100MB
Office files	< 30MB	100MB
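
A client can check a file against these limits before uploading. The sketch below encodes the table above; it assumes "MB" means 1024 * 1024 bytes and that the server uses the default 100MB maximum. The category keys and the function name classify_size are illustrative.

```python
MB = 1024 * 1024  # assumption: the documented limits are in mebibytes

# Recommended sizes from the table above, in MB.
RECOMMENDED_MB = {"text": 10, "pdf": 50, "image": 20, "office": 30}
DEFAULT_MAX_MB = 100  # documented default maximum


def classify_size(size_bytes: int, file_type: str,
                  max_mb: int = DEFAULT_MAX_MB) -> str:
    """Return "ok", "above_recommended", or "too_large" for a file size."""
    if size_bytes > max_mb * MB:
        return "too_large"
    if size_bytes >= RECOMMENDED_MB[file_type] * MB:
        return "above_recommended"
    return "ok"
```

In practice the size would come from os.path.getsize(path), and max_mb should match whatever the server's MAX_FILE_SIZE is set to.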

Performance Considerations

Processing Time

Text files: < 1 second
Simple PDFs: 2-10 seconds
Complex documents: 30-60 seconds
OCR processing: 1-5 minutes

Resource Usage

CPU: High during OCR processing
Memory: 500MB-2GB per job
Storage: 2-3x original file size
Network: Minimal for local processing

Error Handling

Common conversion errors and solutions:

Unsupported Format

Error: “No suitable converter found”

Solution:

Check file extension and MIME type
Verify file is not corrupted
Convert to supported format first

Corrupted File

Error: “Failed to read file”

Solution:

Verify file integrity
Try opening in original application
Re-export or re-save the file

Memory Issues

Error: “Out of memory”

Solution:

Reduce file size
Increase system memory
Process in smaller chunks

OCR Failures

Error: “OCR extraction failed”

Solution:

Try different OCR provider
Improve image quality
Check language settings
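
In client code, the four errors above can be mapped to their suggested fixes so that logs or user-facing messages carry actionable advice. This sketch matches on the error text quoted on this page; how the API actually delivers the message (field name, status code) is not documented here, so suggest_fixes simply takes the message string.

```python
# Error text and remedies as listed in this section.
REMEDIES = {
    "No suitable converter found": [
        "Check file extension and MIME type",
        "Verify file is not corrupted",
        "Convert to supported format first",
    ],
    "Failed to read file": [
        "Verify file integrity",
        "Try opening in original application",
        "Re-export or re-save the file",
    ],
    "Out of memory": [
        "Reduce file size",
        "Increase system memory",
        "Process in smaller chunks",
    ],
    "OCR extraction failed": [
        "Try different OCR provider",
        "Improve image quality",
        "Check language settings",
    ],
}


def suggest_fixes(message: str) -> list:
    """Return the documented fixes for an error message, or [] if unknown."""
    lowered = message.lower()
    for known, fixes in REMEDIES.items():
        if known.lower() in lowered:
            return fixes
    return []
```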

Integration Examples

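import time

import requests

BASE_URL = 'http://localhost:8000/api/v1'

# Convert a document (the with-block closes the file after the upload)
with open('document.pdf', 'rb') as f:
    response = requests.post(
        f'{BASE_URL}/jobs',
        files={'file': f},
        data={'output_format': 'json'}
    )
response.raise_for_status()

job = response.json()
print(f"Job ID: {job['id']}")

# Monitor progress
while True:
    status = requests.get(f"{BASE_URL}/jobs/{job['id']}")
    status.raise_for_status()
    job_status = status.json()

    if job_status['status'] == 'completed':
        # Download result
        result = requests.get(f"{BASE_URL}/jobs/{job['id']}/result")
        result.raise_for_status()
        with open('converted.json', 'w', encoding='utf-8') as f:
            f.write(result.text)
        break
    if job_status['status'] == 'failed':  # assumed name of the failure status
        raise RuntimeError(f"Conversion failed: {job_status}")

    time.sleep(2)  # wait between polls instead of busy-looping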

Next Steps

Configure OCR

Set up OCR providers for image processing

Output Formats

Learn about JSON structure and Markdown features

Webhooks

Set up real-time notifications

API Reference

Explore the complete API documentation
