When you choose JSON as the output format, the document converter returns structured, hierarchical data that preserves the organization and meaning of the original document.
The output keeps the original document structure and makes it programmatically accessible for further processing.

Base Structure

All JSON outputs follow this base structure:
{
  "document": {
    "filename": "original_filename.ext",
    "title": "Document Title",
    "type": "Document Type",
    "metadata": { /* document-specific metadata */ }
  },
  "content": { /* structured content based on document type */ },
  "images": { /* base64-encoded images if present */ }
}
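
As a rough sketch of how this base structure might be consumed, the function below reads the top-level keys and tolerates a missing `images` object. The name `summarizeOutput` and the sample object are hypothetical; only the field names are taken from the structure above, and the assumption that `images` may be absent is ours.

```javascript
// Hypothetical helper: collect the top-level fields of a converter result.
function summarizeOutput(jsonData) {
  const { document, content, images } = jsonData;
  return {
    filename: document.filename,
    title: document.title,
    type: document.type,
    contentType: content.type,
    // Assumption: "images" is only present when the document contains images.
    imageNames: Object.keys(images ?? {}),
  };
}

// Illustrative sample following the base structure.
const sample = {
  document: {
    filename: "report.pdf",
    title: "Annual Report",
    type: "PDF Document",
    metadata: {},
  },
  content: { type: "multi_page_document" },
};

const summary = summarizeOutput(sample);
console.log(summary.title, summary.imageNames.length);
```

Reading `images` defensively keeps the same code path usable for documents with and without embedded images.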

Document

Metadata about the original document

Content

Structured content based on document type

Images

Base64-encoded images found in the document

Document-Specific Structures

PDF documents are structured page by page, with the text and word count of each page:
{
  "document": {
    "filename": "report.pdf",
    "title": "Annual Report",
    "type": "PDF Document",
    "metadata": {
      "num_pages": 10,
      "author": "John Doe",
      "created": "2024-01-15T10:30:00Z"
    }
  },
  "content": {
    "type": "multi_page_document",
    "total_pages": 10,
    "pages": [
      {
        "page_number": 1,
        "text": "Page 1 complete text content...",
        "word_count": 245
      },
      {
        "page_number": 2,
        "text": "Page 2 complete text content...",
        "word_count": 312
      }
    ]
  }
}
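
The page-wise layout lends itself to simple aggregation. The following is a minimal sketch, assuming the `pages` array has the shape shown above; the helper names `totalWordCount` and `fullText` are hypothetical and not part of the converter.

```javascript
// Hypothetical helper: sum the per-page word counts.
function totalWordCount(content) {
  return content.pages.reduce((sum, page) => sum + page.word_count, 0);
}

// Hypothetical helper: join the text of all pages in page order.
function fullText(content) {
  // Sort a copy by page_number in case pages arrive out of order (assumption).
  return [...content.pages]
    .sort((a, b) => a.page_number - b.page_number)
    .map((page) => page.text)
    .join("\n\n");
}

// Sample data copied from the PDF example above.
const pdfContent = {
  type: "multi_page_document",
  total_pages: 2,
  pages: [
    { page_number: 2, text: "Page 2 complete text content...", word_count: 312 },
    { page_number: 1, text: "Page 1 complete text content...", word_count: 245 },
  ],
};

console.log(totalWordCount(pdfContent)); // 557
console.log(fullText(pdfContent));
```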

Images

When documents contain images, they are included as base64-encoded data:
{
  "images": {
    "chart1.png": "iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAYAAAAfFcSJAAAADUlEQVR42mP8/5+hHgAHggJ/PchI7wAAAABJRU5ErkJggg==",
    "logo.jpg": "base64_encoded_image_data_here..."
  }
}
Images are base64-encoded, which can result in large JSON files. Consider the trade-off between completeness and file size.
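
To turn the encoded strings back into binary data, something along these lines may work. This is a sketch that assumes a Node.js environment (where `Buffer` is a global) and well-formed base64 input; `decodeImages` is a hypothetical helper name.

```javascript
// Hypothetical helper: decode every entry of the "images" object into raw bytes.
// Assumes Node.js, where Buffer is available globally.
function decodeImages(images) {
  const decoded = {};
  for (const [name, base64Data] of Object.entries(images ?? {})) {
    decoded[name] = Buffer.from(base64Data, "base64");
  }
  return decoded;
}

// The sample PNG string from the example above.
const images = {
  "chart1.png":
    "iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAYAAAAfFcSJAAAADUlEQVR42mP8/5+hHgAHggJ/PchI7wAAAABJRU5ErkJggg==",
};

const decodedImages = decodeImages(images);
const pngBytes = decodedImages["chart1.png"];

// A PNG file starts with the byte 0x89 followed by the ASCII letters "PNG".
console.log(pngBytes.subarray(1, 4).toString("latin1")); // PNG
```

Each resulting `Buffer` could then be written to disk or uploaded to object storage instead of being kept inside the JSON.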

Element Types

  • heading: Text with heading formatting
  • bullet_point: List items and bullet points
  • text: Regular text content
  • table_row: Table row data
  • header: Table header row
  • data: Data rows with both cell arrays and key-value mapping
  • empty: Empty sheets or tables
  • heading: Section headings at various levels
  • content: Paragraph content within sections
  • metadata: Word count, formatting information

Usage Examples

Here is an example of working with a PDF document:
// Get content from page 5 (find returns undefined if the page does not exist)
const page5Content = jsonData.content.pages.find(p => p.page_number === 5);
if (page5Content) {
  console.log(page5Content.text);
}

Schema Validation

For applications requiring strict validation, here’s a basic JSON schema structure:
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "type": "object",
  "required": ["document", "content"],
  "properties": {
    "document": {
      "type": "object",
      "required": ["filename", "title", "type"],
      "properties": {
        "filename": {"type": "string"},
        "title": {"type": "string"},
        "type": {"type": "string"},
        "metadata": {"type": "object"}
      }
    },
    "content": {
      "type": "object",
      "required": ["type"],
      "properties": {
        "type": {"type": "string"}
      }
    },
    "images": {
      "type": "object",
      "patternProperties": {
        "^.*\\.(png|jpg|jpeg|gif|bmp)$": {
          "type": "string",
          "contentEncoding": "base64"
        }
      }
    }
  }
}
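
In practice a schema like this would normally be handed to a JSON Schema validator library. Purely for illustration, the sketch below hand-checks the same `required` and `type` constraints without any dependency. `validateOutput` is a hypothetical name, and the function covers only a subset of what the schema expresses.

```javascript
// Hypothetical, dependency-free check mirroring the main rules of the schema above.
function isPlainObject(value) {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

// Same file-name pattern as the schema's patternProperties key.
const IMAGE_NAME = /^.*\.(png|jpg|jpeg|gif|bmp)$/;

function validateOutput(data) {
  if (!isPlainObject(data)) return ["root must be an object"];
  const errors = [];

  if (!isPlainObject(data.document)) {
    errors.push("document must be an object");
  } else {
    for (const key of ["filename", "title", "type"]) {
      if (typeof data.document[key] !== "string") {
        errors.push(`document.${key} must be a string`);
      }
    }
  }

  if (!isPlainObject(data.content)) {
    errors.push("content must be an object");
  } else if (typeof data.content.type !== "string") {
    errors.push("content.type must be a string");
  }

  // "images" is optional; only names matching the pattern are constrained.
  if (data.images !== undefined) {
    if (!isPlainObject(data.images)) {
      errors.push("images must be an object");
    } else {
      for (const [name, value] of Object.entries(data.images)) {
        if (IMAGE_NAME.test(name) && typeof value !== "string") {
          errors.push(`images.${name} must be a string`);
        }
      }
    }
  }
  return errors;
}

const problems = validateOutput({
  document: { filename: "a.pdf", title: "A", type: "PDF Document" },
  content: {},
});
console.log(problems);
```

Returning a list of messages rather than throwing on the first failure makes it easier to report every problem in a malformed file at once.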

Best Practices

Memory Management

Large documents can produce substantial JSON files. Consider streaming or chunked processing for very large files.
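
One way to approach chunked processing, once the JSON has been parsed, is to hand pages to downstream code in small batches. The generator below is a minimal sketch with a hypothetical name (`pageBatches`). It does not reduce the memory needed to parse the file itself; true streaming would need an incremental JSON parser, which is outside this sketch.

```javascript
// Hypothetical generator: yield pages in fixed-size batches so that downstream
// work (indexing, uploading, and so on) handles a few pages at a time.
function* pageBatches(pages, batchSize) {
  if (!Number.isInteger(batchSize) || batchSize < 1) {
    throw new RangeError("batchSize must be a positive integer");
  }
  for (let start = 0; start < pages.length; start += batchSize) {
    yield pages.slice(start, start + batchSize);
  }
}

// Illustrative data: five pages in the PDF page format.
const pages = Array.from({ length: 5 }, (_, i) => ({
  page_number: i + 1,
  text: `Page ${i + 1} text`,
  word_count: 3,
}));

const batchSizes = [];
for (const batch of pageBatches(pages, 2)) {
  batchSizes.push(batch.length);
}
console.log(batchSizes); // [ 2, 2, 1 ]
```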

Image Handling

Base64 encoding makes image data about 33% larger than the original binary. Consider separate image storage for production use.
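
The overhead can be estimated directly from the encoded string, and the images can be split off before the rest of the JSON is stored. The sketch below assumes padded, well-formed base64 without whitespace; `decodedByteLength` and `separateImages` are hypothetical helper names.

```javascript
// Hypothetical helper: number of raw bytes represented by a padded base64 string.
// Every 4 base64 characters encode 3 bytes; each "=" of padding removes one byte.
function decodedByteLength(base64Data) {
  const padding = base64Data.endsWith("==") ? 2 : base64Data.endsWith("=") ? 1 : 0;
  return (base64Data.length / 4) * 3 - padding;
}

// Hypothetical helper: split the images out of the converter output so they can
// be stored elsewhere, leaving the remaining JSON small.
function separateImages(jsonData) {
  const { images = {}, ...withoutImages } = jsonData;
  return { withoutImages, images };
}

const output = {
  document: { filename: "report.pdf", title: "Annual Report", type: "PDF Document" },
  content: { type: "multi_page_document" },
  images: {
    "chart1.png":
      "iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAYAAAAfFcSJAAAADUlEQVR42mP8/5+hHgAHggJ/PchI7wAAAABJRU5ErkJggg==",
  },
};

const split = separateImages(output);
const encoded = split.images["chart1.png"];
const rawBytes = decodedByteLength(encoded);
console.log(`encoded: ${encoded.length} chars, raw: ${rawBytes} bytes`);
console.log(`overhead: ${((encoded.length / rawBytes - 1) * 100).toFixed(0)}%`);
```

For very small images the padding pushes the overhead slightly above one third; for larger files it approaches 33%.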

Data Processing

The sample_data arrays are limited to prevent enormous JSON files. Full data extraction may require custom processing.

Type Safety

Use TypeScript interfaces or JSON schema validation for robust applications.
This structured approach makes it easy to programmatically process documents while preserving their original organization and hierarchy.