The Document Converter supports a wide range of input formats and provides flexible output options to meet your needs.

Supported Input Formats

Documents

  • PDF: Portable Document Format (.pdf)
  • Word: Microsoft Word (.docx)
  • RTF: Rich Text Format (.rtf)
  • Text: Plain text (.txt), Markdown (.md), Log files

Presentations

  • PowerPoint: .pptx, .pptm, .potx, .potm
  • OpenDocument: .odp (planned)
  • Google Slides: Via export (planned)

Spreadsheets

  • Excel: .xlsx, .xlsm, .xls
  • CSV: Comma-separated values (.csv)
  • OpenDocument: .ods (planned)

Images

  • Raster: PNG, JPEG, GIF, BMP, WebP, ICO, TIFF
  • Vector: SVG (planned)
  • Note: extracting text from images requires OCR
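
As a rough illustration, a client could check a file's extension against the lists above before uploading. The `SUPPORTED_EXTENSIONS` table and the `input_category` helper are hypothetical names and not part of the converter. The image extensions (for example `.jpg` and `.tif`) are assumed spellings of the listed formats, and planned formats and log files are left out because no extension is given for them.

```python
from pathlib import Path
from typing import Optional

# Hypothetical client-side table built from the lists above (planned formats omitted).
SUPPORTED_EXTENSIONS = {
    "document": {".pdf", ".docx", ".rtf", ".txt", ".md"},
    "presentation": {".pptx", ".pptm", ".potx", ".potm"},
    "spreadsheet": {".xlsx", ".xlsm", ".xls", ".csv"},
    # Assumed extension spellings for the listed raster formats.
    "image": {".png", ".jpg", ".jpeg", ".gif", ".bmp", ".webp", ".ico", ".tif", ".tiff"},
}


def input_category(filename: str) -> Optional[str]:
    """Return the input category for a filename, or None if the extension is not listed."""
    suffix = Path(filename).suffix.lower()
    for category, extensions in SUPPORTED_EXTENSIONS.items():
        if suffix in extensions:
            return category
    return None
```

A `None` result is a hint to convert the file to a supported format first; the server remains the authority on what it accepts.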

Output Formats

Markdown

Human-readable format with embedded images.
Features:
  • Preserves document structure with headers
  • Maintains formatting (bold, italic, lists)
  • Embeds images as base64 data URLs
  • Compatible with documentation systems
  • Easy to read and edit
Example Output:
# Document Title

## Section 1

This is a paragraph with **bold** and *italic* text.

- List item 1
- List item 2
- List item 3

![Image](data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAYAAAAfFcSJ...)

## Section 2

More content here...
Use Cases:
  • Documentation generation
  • Content management systems
  • Static site generators
  • Wiki systems
  • README files
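
Because images are embedded as base64 data URLs, a downstream tool may want to pull them back out of the Markdown. The following is a minimal sketch using only the standard library; `extract_embedded_images` is a hypothetical helper, and the regular expression assumes the `![alt](data:image/...;base64,...)` form shown in the example output.

```python
import base64
import re
from typing import List, Tuple

# Matches Markdown images whose target is a base64 data URL.
DATA_URL_IMAGE = re.compile(
    r"!\[[^\]]*\]\(data:(image/[\w.+-]+);base64,([A-Za-z0-9+/=]+)\)"
)


def extract_embedded_images(markdown: str) -> List[Tuple[str, bytes]]:
    """Return (mime_type, raw_bytes) for each embedded image in the Markdown text."""
    return [
        (mime, base64.b64decode(payload))
        for mime, payload in DATA_URL_IMAGE.findall(markdown)
    ]
```

The decoded bytes can then be written to files and the data URLs replaced with relative paths if the target system handles large inline images poorly.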

Format-Specific Features

PDF

Capabilities:
  • Multi-page document extraction
  • Table detection and extraction
  • Image extraction with OCR
  • Metadata preservation (author, creation date)
  • Bookmark and outline preservation
Limitations:
  • Scanned PDFs require OCR
  • Complex layouts may need manual review
  • Password-protected PDFs not supported
  • Some fonts may not render correctly
Configuration:
# Enable OCR for scanned PDFs
use_ocr=true
ocr_provider=paddle

Word

Capabilities:
  • Heading hierarchy preservation
  • Table extraction
  • Image and embedded object handling
  • Style and formatting preservation
  • Comments and track changes (basic)
Supported Elements:
  • Paragraphs and headings
  • Lists (bulleted and numbered)
  • Tables with headers
  • Images and shapes
  • Headers and footers
Example Structure:
{
  "content": {
    "type": "structured_document",
    "sections": [
      {
        "heading": "Introduction",
        "level": 1,
        "content": "Document introduction..."
      }
    ]
  }
}
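
One way to consume this structure is to turn the sections into an indented outline. The sketch below assumes only the fields shown in the example (`heading`, `level`) and that `level` starts at 1; `outline` is a hypothetical helper name.

```python
from typing import Any, Dict, List


def outline(result: Dict[str, Any]) -> List[str]:
    """Build an indented outline from the sections of a structured document."""
    lines = []
    for section in result["content"]["sections"]:
        # Indent two spaces per heading level below the top level.
        indent = "  " * (section["level"] - 1)
        lines.append(f"{indent}{section['heading']}")
    return lines
```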

PowerPoint

Capabilities:
  • Slide-by-slide extraction
  • Title and content separation
  • Speaker notes extraction
  • Image and media handling
  • Slide layout preservation
Slide Elements:
  • Slide titles and subtitles
  • Bullet points and lists
  • Text boxes and shapes
  • Images and charts
  • Tables and diagrams
JSON Structure:
{
  "content": {
    "type": "presentation",
    "slides": [
      {
        "slide_number": 1,
        "title": "Slide Title",
        "content": "Slide content...",
        "notes": "Speaker notes..."
      }
    ]
  }
}
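
As a small example of working with this structure, the sketch below collects speaker notes by slide number. It relies only on the `slide_number` and `notes` fields shown above and assumes slides without notes either omit the field or leave it empty; `speaker_notes` is a hypothetical helper.

```python
from typing import Any, Dict


def speaker_notes(result: Dict[str, Any]) -> Dict[int, str]:
    """Map slide_number to speaker notes, skipping slides that have none."""
    return {
        slide["slide_number"]: slide["notes"]
        for slide in result["content"]["slides"]
        if slide.get("notes")
    }
```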

Excel

Capabilities:
  • Multi-sheet workbook support
  • Table and data extraction
  • Formula preservation (as text)
  • Chart and graph handling
  • Metadata extraction
Data Handling:
  • Header row detection
  • Data type inference
  • Empty cell handling
  • Large dataset sampling
Sample Output:
{
  "content": {
    "type": "spreadsheet",
    "sheets": [
      {
        "name": "Sales Data",
        "headers": ["Date", "Product", "Amount"],
        "sample_data": [
          ["2024-01-01", "Widget A", "150.00"],
          ["2024-01-02", "Widget B", "200.00"]
        ],
        "total_rows": 1000
      }
    ]
  }
}
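
The sample output pairs naturally with a helper that turns each sampled row into a record keyed by header. This is a sketch under the assumption that every sheet carries `headers`, `sample_data`, and `total_rows` as shown; `sample_rows` and `is_sampled` are hypothetical names. Note that cell values in the sample are strings, so numeric conversion is left to the caller.

```python
from typing import Any, Dict, List


def sample_rows(sheet: Dict[str, Any]) -> List[Dict[str, str]]:
    """Pair each sampled row with the sheet headers."""
    return [dict(zip(sheet["headers"], row)) for row in sheet["sample_data"]]


def is_sampled(sheet: Dict[str, Any]) -> bool:
    """True when the output holds fewer rows than the sheet contains."""
    return len(sheet["sample_data"]) < sheet["total_rows"]
```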

Images

Capabilities:
  • OCR text extraction
  • Image format standardization
  • Compression and optimization
  • Metadata preservation
  • Multi-language support
OCR Providers:
  • PaddleOCR: High accuracy, 80+ languages
  • EasyOCR: Simple setup, good performance
  • Mistral AI: AI-powered, context-aware
Image Optimization:
# Configuration
IMAGE_COMPRESSION_QUALITY=95
IMAGE_MAX_WIDTH=2048
IMAGE_MAX_HEIGHT=2048
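
To get a feel for what the size limits imply, the sketch below computes the dimensions an image would have after being fitted inside a bounding box. It assumes the converter preserves aspect ratio and never upscales, which this page does not confirm; `fit_within` is a hypothetical helper, not the converter's own code.

```python
from typing import Tuple


def fit_within(width: int, height: int,
               max_width: int = 2048, max_height: int = 2048) -> Tuple[int, int]:
    """Scale (width, height) down to fit the bounding box, keeping aspect ratio."""
    # Never scale above 1.0, so small images keep their original size.
    scale = min(1.0, max_width / width, max_height / height)
    return max(1, round(width * scale)), max(1, round(height * scale))
```

For example, a 4096x2048 image would come out as 2048x1024 under these assumptions, while an 800x600 image would be left unchanged.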

Conversion Quality

High Quality

Text documents, modern Office files
99%+ accuracy for text extraction

Good Quality

PDFs with standard fonts, simple layouts
95%+ accuracy with proper formatting

Variable Quality

Scanned documents, complex layouts, images
Depends on OCR quality and image resolution

Best Practices

1. Choose the Right Format

  • Use Markdown for documentation and content management
  • Use JSON for data processing and API integration
  • Consider your downstream processing needs

2. Optimize for OCR

  • Use high-resolution images (300+ DPI)
  • Ensure good contrast and lighting
  • Avoid skewed or rotated text
  • Choose appropriate OCR provider
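
The 300 DPI guideline can be checked from a scan's pixel width and the physical width of the page. The helper names below are hypothetical, and the calculation is plain arithmetic (pixels divided by inches) rather than anything the converter exposes.

```python
def effective_dpi(pixels_wide: int, page_width_inches: float) -> float:
    """Dots per inch implied by a scan's pixel width and the page's physical width."""
    return pixels_wide / page_width_inches


def meets_ocr_resolution(pixels_wide: int, page_width_inches: float,
                         minimum_dpi: float = 300.0) -> bool:
    """True when the scan reaches the recommended resolution."""
    return effective_dpi(pixels_wide, page_width_inches) >= minimum_dpi
```

A US Letter page (8.5 inches wide) scanned at 2550 pixels across works out to exactly 300 DPI; the same page at 1275 pixels is only 150 DPI and would likely benefit from rescanning.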

3. Handle Large Files

  • Monitor file size limits
  • Consider chunking very large documents
  • Use appropriate timeout settings
  • Implement progress monitoring
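
One possible way to chunk a very large document is to split it into page ranges and convert each range as its own job. The sketch below only computes the ranges; how the pages are actually split out (for example with a PDF library) is left open, and `page_ranges` is a hypothetical helper.

```python
from typing import List, Tuple


def page_ranges(total_pages: int, pages_per_chunk: int) -> List[Tuple[int, int]]:
    """Split 1..total_pages into consecutive inclusive (start, end) ranges."""
    if pages_per_chunk < 1:
        raise ValueError("pages_per_chunk must be at least 1")
    return [
        (start, min(start + pages_per_chunk - 1, total_pages))
        for start in range(1, total_pages + 1, pages_per_chunk)
    ]
```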

4. Validate Results

  • Review converted content for accuracy
  • Test with representative documents
  • Implement quality checks in your pipeline
  • Monitor conversion success rates
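
Monitoring success rates can start as simply as counting finished jobs. The sketch below treats `completed` as the only success status, since that is the one status value that appears in the integration example; any other status name is counted as not successful. `success_rate` is a hypothetical helper.

```python
from typing import Iterable


def success_rate(statuses: Iterable[str]) -> float:
    """Fraction of jobs whose status is 'completed'; 0.0 when there are no jobs."""
    statuses = list(statuses)
    if not statuses:
        return 0.0
    completed = sum(1 for status in statuses if status == "completed")
    return completed / len(statuses)
```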

File Size Limits

The default maximum file size is 100MB for all file types. It can be changed via the MAX_FILE_SIZE environment variable.
| File Type | Recommended Size | Maximum Size |
| --- | --- | --- |
| Text files | < 10MB | 100MB |
| PDFs | < 50MB | 100MB |
| Images | < 20MB | 100MB |
| Office files | < 30MB | 100MB |
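
A client can apply these limits before uploading. The following sketch hard-codes the values from the table above; it assumes "MB" means 1024 * 1024 bytes and does not read MAX_FILE_SIZE, because the expected format of that variable is not documented here. `classify_size` and the type keys are hypothetical.

```python
MB = 1024 * 1024  # assumption: the table's "MB" means mebibytes

# Recommended upper bounds from the table, in MB.
RECOMMENDED_MB = {"text": 10, "pdf": 50, "image": 20, "office": 30}


def classify_size(size_bytes: int, file_type: str, max_mb: int = 100) -> str:
    """Return 'ok', 'above_recommended', or 'too_large' for a file of the given type."""
    if size_bytes > max_mb * MB:
        return "too_large"
    if size_bytes >= RECOMMENDED_MB[file_type] * MB:
        return "above_recommended"
    return "ok"
```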

Performance Considerations

Processing Time

  • Text files: < 1 second
  • Simple PDFs: 2-10 seconds
  • Complex documents: 30-60 seconds
  • OCR processing: 1-5 minutes
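
These ranges can inform client-side polling timeouts. The sketch below takes the upper bound of each range and multiplies it by a safety factor; the workload keys, the `polling_timeout` helper, and the default factor of 2 are illustrative assumptions rather than recommended settings.

```python
# Upper bounds of the typical processing times listed above, in seconds.
TYPICAL_MAX_SECONDS = {
    "text": 1,
    "simple_pdf": 10,
    "complex": 60,
    "ocr": 300,
}


def polling_timeout(workload: str, safety_factor: float = 2.0) -> float:
    """Seconds to keep polling before treating a job of this kind as stuck."""
    return TYPICAL_MAX_SECONDS[workload] * safety_factor
```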

Resource Usage

  • CPU: High during OCR processing
  • Memory: 500MB-2GB per job
  • Storage: 2-3x original file size
  • Network: Minimal for local processing
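
For rough capacity planning, the memory and storage figures above can be turned into two back-of-the-envelope estimates. Both helpers are hypothetical and deliberately pessimistic: they assume the worst case of 2GB (taken as 2048MB) per job and 3x the original file size for storage.

```python
def max_parallel_jobs(available_memory_mb: int, per_job_mb: int = 2048) -> int:
    """How many jobs fit in memory if each one needs the worst-case amount."""
    return available_memory_mb // per_job_mb


def scratch_space_bytes(file_size_bytes: int, factor: int = 3) -> int:
    """Worst-case storage needed to process one file."""
    return file_size_bytes * factor
```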

Error Handling

Common conversion errors and solutions:
Error: “No suitable converter found”
Solution:
  • Check file extension and MIME type
  • Verify file is not corrupted
  • Convert to supported format first
Error: “Failed to read file”
Solution:
  • Verify file integrity
  • Try opening in original application
  • Re-export or re-save the file
Error: “Out of memory”
Solution:
  • Reduce file size
  • Increase system memory
  • Process in smaller chunks
Error: “OCR extraction failed”
Solution:
  • Try different OCR provider
  • Improve image quality
  • Check language settings
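
In a pipeline, the table of errors above can be kept as a lookup so that logs carry a suggested next step. The sketch assumes the error text returned by the service contains the quoted message somewhere in it, which is not confirmed here; `REMEDIES` and `suggest_fixes` are hypothetical names.

```python
from typing import List

# Error messages and solutions copied from the list above (keys lowercased).
REMEDIES = {
    "no suitable converter found": [
        "Check file extension and MIME type",
        "Verify file is not corrupted",
        "Convert to supported format first",
    ],
    "failed to read file": [
        "Verify file integrity",
        "Try opening in original application",
        "Re-export or re-save the file",
    ],
    "out of memory": [
        "Reduce file size",
        "Increase system memory",
        "Process in smaller chunks",
    ],
    "ocr extraction failed": [
        "Try different OCR provider",
        "Improve image quality",
        "Check language settings",
    ],
}


def suggest_fixes(error_message: str) -> List[str]:
    """Return the documented solutions for a known error, or an empty list."""
    text = error_message.lower()
    for known, fixes in REMEDIES.items():
        if known in text:
            return fixes
    return []
```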

Integration Examples

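import time

import requests

BASE_URL = 'http://localhost:8000/api/v1'

# Submit a document for conversion; the with-block closes the file handle
with open('document.pdf', 'rb') as f:
    response = requests.post(
        f'{BASE_URL}/jobs',
        files={'file': f},
        data={'output_format': 'json'},
        timeout=60,
    )
response.raise_for_status()

job = response.json()
print(f"Job ID: {job['id']}")

# Poll every 2 seconds instead of busy-looping, and give up after 10 minutes
deadline = time.monotonic() + 600
while time.monotonic() < deadline:
    status = requests.get(f"{BASE_URL}/jobs/{job['id']}", timeout=30)
    status.raise_for_status()
    job_status = status.json()

    if job_status['status'] == 'completed':
        # Download result
        result = requests.get(f"{BASE_URL}/jobs/{job['id']}/result", timeout=60)
        result.raise_for_status()
        with open('converted.json', 'w', encoding='utf-8') as out:
            out.write(result.text)
        break

    time.sleep(2)
else:
    # Reached only when the deadline passes without a completed job
    raise TimeoutError(f"Job {job['id']} did not complete within 10 minutes")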

Next Steps

Configure OCR

Set up OCR providers for image processing

Output Formats

Learn about JSON structure and Markdown features

Webhooks

Set up real-time notifications

API Reference

Explore the complete API documentation