JSON Output Format - Document Converter API

When you choose JSON as the output format, the document converter returns structured, hierarchical data that preserves the organization of the original document and makes it programmatically accessible for further processing.

Base Structure

All JSON outputs follow this base structure:

{
  "document": {
    "filename": "original_filename.ext",
    "title": "Document Title",
    "type": "Document Type",
    "metadata": { /* document-specific metadata */ }
  },
  "content": { /* structured content based on document type */ },
  "images": { /* base64-encoded images if present */ }
}

Document

Metadata about the original document

Content

Structured content based on document type

Images

Base64-encoded images found in the document
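
As a rough illustration of how the three top-level keys fit together, the sketch below summarizes a converter result. The function name summarizeOutput and the sample object are hypothetical, and the code assumes the images key may be absent when a document has no images.

```javascript
// Hypothetical converter output, trimmed to the top-level keys described above.
const output = {
  document: {
    filename: "report.pdf",
    title: "Annual Report",
    type: "PDF Document",
    metadata: { num_pages: 10 },
  },
  content: { type: "multi_page_document", total_pages: 10, pages: [] },
  // "images" is left out on purpose: this sketch treats it as optional.
};

// Summarize the three top-level keys, defaulting "images" to an empty object.
function summarizeOutput(output) {
  const { document, content, images = {} } = output;
  return {
    filename: document.filename,
    title: document.title,
    documentType: document.type,
    contentType: content.type,
    imageCount: Object.keys(images).length,
  };
}

const summary = summarizeOutput(output);
console.log(summary);
```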

Document-Specific Structures

PDF Documents

PDF documents are structured page by page, with each page's text and word count:

{
  "document": {
    "filename": "report.pdf",
    "title": "Annual Report",
    "type": "PDF Document",
    "metadata": {
      "num_pages": 10,
      "author": "John Doe",
      "created": "2024-01-15T10:30:00Z"
    }
  },
  "content": {
    "type": "multi_page_document",
    "total_pages": 10,
    "pages": [
      {
        "page_number": 1,
        "text": "Page 1 complete text content...",
        "word_count": 245
      },
      {
        "page_number": 2,
        "text": "Page 2 complete text content...",
        "word_count": 312
      }
    ]
  }
}
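
A minimal sketch of working with the pages array, using the two pages from the example above. The helper names totalWordCount and fullText are illustrative and not part of the API.

```javascript
// Sample "content" object copied from the PDF example above (two pages shown).
const content = {
  type: "multi_page_document",
  total_pages: 10,
  pages: [
    { page_number: 1, text: "Page 1 complete text content...", word_count: 245 },
    { page_number: 2, text: "Page 2 complete text content...", word_count: 312 },
  ],
};

// Sum the per-page word counts.
function totalWordCount(pages) {
  return pages.reduce((sum, page) => sum + page.word_count, 0);
}

// Join page texts in page order, separated by blank lines.
function fullText(pages) {
  return [...pages]
    .sort((a, b) => a.page_number - b.page_number)
    .map((page) => page.text)
    .join("\n\n");
}

console.log(totalWordCount(content.pages)); // 557
console.log(fullText(content.pages));
```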

PowerPoint Presentations

Presentations are structured slide by slide, with each slide's text and a concise summary:

{
  "document": {
    "filename": "presentation.pptx",
    "title": "Marketing Strategy",
    "type": "PowerPoint Presentation",
    "metadata": {
      "num_slides": 15,
      "slide_width": 9144000,
      "slide_height": 6858000
    }
  },
  "content": {
    "type": "presentation",
    "total_slides": 15,
    "slides": [
      {
        "slide_number": 1,
        "text": "## Introduction\n- Point 1\n- Point 2\n- Point 3",
        "summary": "Introduction"
      },
      {
        "slide_number": 2,
        "text": "## Market Analysis\n- Current trends\n- Competition overview",
        "summary": "Market Analysis"
      }
    ]
  }
}
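
The sketch below builds a simple outline from the slide summaries and converts the slide dimensions to inches. The unit is not stated here; the values look like English Metric Units (914,400 per inch), so the conversion rests on that assumption. The helper names are hypothetical.

```javascript
// Slides and metadata copied from the presentation example above.
const metadata = { num_slides: 15, slide_width: 9144000, slide_height: 6858000 };
const slides = [
  { slide_number: 1, text: "## Introduction\n- Point 1\n- Point 2\n- Point 3", summary: "Introduction" },
  { slide_number: 2, text: "## Market Analysis\n- Current trends\n- Competition overview", summary: "Market Analysis" },
];

// One outline line per slide: "<number>. <summary>".
function buildOutline(slides) {
  return slides.map((slide) => `${slide.slide_number}. ${slide.summary}`);
}

// Assumption: slide_width and slide_height are expressed in EMU.
const EMU_PER_INCH = 914400;

function slideSizeInInches(metadata) {
  return {
    width: metadata.slide_width / EMU_PER_INCH,
    height: metadata.slide_height / EMU_PER_INCH,
  };
}

console.log(buildOutline(slides)); // [ '1. Introduction', '2. Market Analysis' ]
console.log(slideSizeInInches(metadata)); // { width: 10, height: 7.5 }
```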

Excel Spreadsheets

Spreadsheets provide summary information for each sheet along with a few sample rows rather than the full data:

{
  "document": {
    "filename": "data.xlsx",
    "title": "Sales Data",
    "type": "Excel Workbook",
    "metadata": {
      "num_sheets": 3,
      "sheet_names": ["Sales", "Summary", "Charts"]
    }
  },
  "content": {
    "type": "spreadsheet",
    "total_sheets": 3,
    "sheet_names": ["Sales", "Summary", "Charts"],
    "sheets": [
      {
        "sheet_name": "Sales",
        "row_count": 1000,
        "column_count": 5,
        "headers": ["Date", "Product", "Amount", "Region", "Salesperson"],
        "has_data": true,
        "sample_data": [
          ["2024-01-01", "Widget A", "150.00", "North", "John"],
          ["2024-01-02", "Widget B", "200.00", "South", "Jane"],
          ["2024-01-03", "Widget C", "175.00", "East", "Bob"]
        ]
      }
    ]
  }
}
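
Because headers and sample_data are parallel arrays, pairing them into keyed records is often convenient. The sketch below does that for the Sales sheet from the example; rowsToObjects is a hypothetical helper. Note that the amounts arrive as strings in the example, and that sample_data covers only a few of the sheet's rows.

```javascript
// Sheet copied from the spreadsheet example above.
const sheet = {
  sheet_name: "Sales",
  row_count: 1000,
  column_count: 5,
  headers: ["Date", "Product", "Amount", "Region", "Salesperson"],
  has_data: true,
  sample_data: [
    ["2024-01-01", "Widget A", "150.00", "North", "John"],
    ["2024-01-02", "Widget B", "200.00", "South", "Jane"],
    ["2024-01-03", "Widget C", "175.00", "East", "Bob"],
  ],
};

// Pair each sample row with the header names to get one object per row.
function rowsToObjects(sheet) {
  return sheet.sample_data.map((row) =>
    Object.fromEntries(sheet.headers.map((header, i) => [header, row[i]]))
  );
}

const records = rowsToObjects(sheet);

// Amounts are strings in the example, so convert before doing arithmetic.
// This is the total of the sample rows only, not of all 1000 rows.
const sampleTotal = records.reduce((sum, r) => sum + Number(r.Amount), 0);

console.log(records[0].Product, sampleTotal); // Widget A 525
```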

Word Documents

Word documents are organized by sections and headings:

{
  "document": {
    "filename": "report.docx",
    "title": "Project Report",
    "type": "Word Document",
    "metadata": {
      "author": "Jane Smith",
      "created": "2024-01-10T14:30:00Z",
      "num_paragraphs": 25
    }
  },
  "content": {
    "type": "structured_document",
    "total_sections": 5,
    "sections": [
      {
        "heading": "Executive Summary",
        "content": "This section contains the executive summary..."
      },
      {
        "heading": "Introduction",
        "content": "The introduction section explains..."
      }
    ]
  }
}
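
A small sketch of looking up a section by its heading, using the two sections from the example. The helper findSection is hypothetical, and the case-insensitive match is a design choice of this sketch rather than documented behavior.

```javascript
// Content copied from the Word example above (two of five sections shown).
const content = {
  type: "structured_document",
  total_sections: 5,
  sections: [
    { heading: "Executive Summary", content: "This section contains the executive summary..." },
    { heading: "Introduction", content: "The introduction section explains..." },
  ],
};

// Return the first section whose heading matches, ignoring case and
// surrounding whitespace, or null when no section matches.
function findSection(sections, heading) {
  const wanted = heading.trim().toLowerCase();
  return sections.find((s) => s.heading.trim().toLowerCase() === wanted) ?? null;
}

const intro = findSection(content.sections, "introduction");
console.log(intro ? intro.content : "Section not found");
```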

Images

When documents contain images, they are included as base64-encoded data:

{
  "images": {
    "chart1.png": "iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAYAAAAfFcSJAAAADUlEQVR42mP8/5+hHgAHggJ/PchI7wAAAABJRU5ErkJggg==",
    "logo.jpg": "base64_encoded_image_data_here..."
  }
}

Images are base64-encoded, which can result in large JSON files. Consider the trade-off between completeness and file size.
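
The following sketch decodes the images map into raw bytes so the size cost can be inspected. It assumes a Node.js runtime (for the global Buffer) and uses the chart1.png string from the example above; decodeImages is a hypothetical helper.

```javascript
// The "chart1.png" entry copied from the example above.
const images = {
  "chart1.png":
    "iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAYAAAAfFcSJAAAADUlEQVR42mP8/5+hHgAHggJ/PchI7wAAAABJRU5ErkJggg==",
};

// Decode every image and report both the decoded and the encoded size.
function decodeImages(images) {
  return Object.entries(images).map(([name, data]) => {
    const bytes = Buffer.from(data, "base64");
    return { name, bytes, byteLength: bytes.length, base64Length: data.length };
  });
}

const [chart] = decodeImages(images);

// A PNG file starts with the bytes 0x89 'P' 'N' 'G'.
const isPng = chart.bytes.subarray(0, 4).equals(Buffer.from([0x89, 0x50, 0x4e, 0x47]));

console.log(chart.name, chart.byteLength, chart.base64Length, isPng);
```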

Element Types

Slide Elements (PowerPoint)

heading: Text with heading formatting
bullet_point: List items and bullet points
text: Regular text content
table_row: Table row data

Table Data (Spreadsheets)

header: Table header row
data: Data rows with both cell arrays and key-value mapping
empty: Empty sheets or tables

Document Sections

heading: Section headings at various levels
content: Paragraph content within sections
metadata: Word count, formatting information
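
How these element types are represented in the output is not shown here. Purely as an illustration, the sketch below tags the lines of a slide's text with the slide element types, assuming the markdown-like prefixes seen in the presentation example ("##" for headings, "-" for bullets). Treating lines that start with "|" as table rows is a further assumption of this sketch.

```javascript
// Assumption: slide text uses markdown-like prefixes, as in the presentation example.
function classifyLine(line) {
  const trimmed = line.trim();
  if (/^#{1,6}\s/.test(trimmed)) return "heading";
  if (/^[-*]\s/.test(trimmed)) return "bullet_point";
  if (trimmed.startsWith("|")) return "table_row"; // hypothetical table syntax
  return "text";
}

// Split slide text into non-empty lines and tag each with an element type.
function classifySlideText(text) {
  return text
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => ({ type: classifyLine(line), value: line.trim() }));
}

const elements = classifySlideText("## Introduction\n- Point 1\n- Point 2\n- Point 3");
console.log(elements.map((e) => e.type));
```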

Usage Examples

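The following example reads one page from a PDF result; jsonData is assumed to hold the parsed converter output:

// Get content from page 5; find() returns undefined when the page is missing
const page5Content = jsonData.content.pages.find(p => p.page_number === 5);
console.log(page5Content ? page5Content.text : "Page 5 not found");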
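
When one code path has to handle several document types, branching on content.type is a natural approach. The sketch below uses the four content.type values that appear in the examples in this document; describeContent is a hypothetical helper, and other type values may exist.

```javascript
// Return a one-line description based on content.type.
function describeContent(content) {
  switch (content.type) {
    case "multi_page_document":
      return `${content.total_pages} pages`;
    case "presentation":
      return `${content.total_slides} slides`;
    case "spreadsheet":
      return `${content.total_sheets} sheets: ${content.sheet_names.join(", ")}`;
    case "structured_document":
      return `${content.total_sections} sections`;
    default:
      // Types not covered by the examples fall through to here.
      return `unrecognized content type: ${content.type}`;
  }
}

console.log(
  describeContent({
    type: "spreadsheet",
    total_sheets: 3,
    sheet_names: ["Sales", "Summary", "Charts"],
  })
); // 3 sheets: Sales, Summary, Charts
```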

Schema Validation

For applications requiring strict validation, here’s a basic JSON schema structure:

{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "type": "object",
  "required": ["document", "content"],
  "properties": {
    "document": {
      "type": "object",
      "required": ["filename", "title", "type"],
      "properties": {
        "filename": {"type": "string"},
        "title": {"type": "string"},
        "type": {"type": "string"},
        "metadata": {"type": "object"}
      }
    },
    "content": {
      "type": "object",
      "required": ["type"],
      "properties": {
        "type": {"type": "string"}
      }
    },
    "images": {
      "type": "object",
      "patternProperties": {
        "^.*\\.(png|jpg|jpeg|gif|bmp)$": {
          "type": "string",
          "contentEncoding": "base64"
        }
      }
    }
  }
}
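
In practice a JSON Schema validator library would normally apply this schema. To avoid assuming a particular dependency, the sketch below hand-codes the same required-field checks; validateOutput and isPlainObject are hypothetical helpers, and the checks mirror only what the schema above states.

```javascript
// Same file-name pattern as the schema's patternProperties key.
const IMAGE_NAME = /^.*\.(png|jpg|jpeg|gif|bmp)$/;

function isPlainObject(value) {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

// Return a list of problems; an empty list means the checks passed.
function validateOutput(output) {
  if (!isPlainObject(output)) return ["output must be an object"];
  const errors = [];

  if (!isPlainObject(output.document)) {
    errors.push("document is required and must be an object");
  } else {
    for (const key of ["filename", "title", "type"]) {
      if (typeof output.document[key] !== "string") {
        errors.push(`document.${key} must be a string`);
      }
    }
  }

  if (!isPlainObject(output.content)) {
    errors.push("content is required and must be an object");
  } else if (typeof output.content.type !== "string") {
    errors.push("content.type must be a string");
  }

  // "images" is optional; only names matching the pattern are constrained.
  if (output.images !== undefined) {
    if (!isPlainObject(output.images)) {
      errors.push("images must be an object");
    } else {
      for (const [name, data] of Object.entries(output.images)) {
        if (IMAGE_NAME.test(name) && typeof data !== "string") {
          errors.push(`images["${name}"] must be a string`);
        }
      }
    }
  }
  return errors;
}

console.log(validateOutput({ document: { filename: "a.pdf" }, content: {} }));
```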

Best Practices

Memory Management

Large documents can produce substantial JSON files. Consider streaming or chunked processing for very large files.
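
One way to approach chunked processing is to walk the pages array in fixed-size batches, as sketched below. This only bounds the work done per step on an already-parsed result; it does not reduce the memory needed to parse the JSON itself, which would require a streaming parser (not shown). The generator inChunks and the sample pages are hypothetical.

```javascript
// Yield consecutive slices of `items`, each with at most `size` elements.
function* inChunks(items, size) {
  if (!Number.isInteger(size) || size <= 0) {
    throw new RangeError("size must be a positive integer");
  }
  for (let i = 0; i < items.length; i += size) {
    yield items.slice(i, i + size);
  }
}

// Hypothetical pages array standing in for content.pages of a large PDF.
const pages = Array.from({ length: 10 }, (_, i) => ({
  page_number: i + 1,
  text: `Page ${i + 1} text`,
  word_count: 100,
}));

// Process four pages at a time and record each batch's word count.
const batchWordCounts = [];
for (const batch of inChunks(pages, 4)) {
  batchWordCounts.push(batch.reduce((sum, p) => sum + p.word_count, 0));
}

console.log(batchWordCounts); // [ 400, 400, 200 ]
```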

Image Handling

Base64 encoding makes image data roughly 33% larger than the original binary. Consider separate image storage for production use.
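
A minimal sketch of separating images from the rest of the result, assuming a Node.js runtime (for Buffer). The helper splitImages and the image_refs key are hypothetical and not part of the output format; where the decoded bytes are ultimately stored (disk, object storage) is left to the caller.

```javascript
// Split a converter result into image-free JSON plus a map of decoded images.
function splitImages(output) {
  const { images = {}, ...rest } = output;
  const files = new Map();
  const refs = {};
  for (const [name, data] of Object.entries(images)) {
    const bytes = Buffer.from(data, "base64");
    files.set(name, bytes);
    refs[name] = { bytes: bytes.length }; // hypothetical reference record
  }
  return { json: { ...rest, image_refs: refs }, files };
}

// Example input using the "chart1.png" string from the Images section.
const output = {
  document: { filename: "report.pdf", title: "Annual Report", type: "PDF Document" },
  content: { type: "multi_page_document", total_pages: 1, pages: [] },
  images: {
    "chart1.png":
      "iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAYAAAAfFcSJAAAADUlEQVR42mP8/5+hHgAHggJ/PchI7wAAAABJRU5ErkJggg==",
  },
};

const { json, files } = splitImages(output);
console.log(Object.keys(json), [...files.keys()]);
```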

Data Processing

The sample_data arrays are limited to prevent enormous JSON files. Full data extraction may require custom processing.

Type Safety

Use TypeScript interfaces or JSON schema validation for robust applications.

This structured approach makes it easy to programmatically process documents while preserving their original organization and hierarchy.
