When you choose JSON as the output format, the document converter returns structured, hierarchical data that preserves the organization and meaning of the original document.
The output keeps the original document structure while making it programmatically accessible for further processing.
Base Structure
All JSON outputs follow this base structure:
{
  "document": {
    "filename": "original_filename.ext",
    "title": "Document Title",
    "type": "Document Type",
    "metadata": { /* document-specific metadata */ }
  },
  "content": { /* structured content based on document type */ },
  "images": { /* base64-encoded images if present */ }
}
Document: Metadata about the original document
Content: Structured content based on document type
Images: Base64-encoded images found in the document
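As a rough illustration of how the three top-level keys can be read, here is a minimal JavaScript sketch. It assumes the converter output is already available as a JSON string; `rawJson` and `describeDocument` are hypothetical names, and `images` is treated as optional because it is only present when the document contains images.

```javascript
// Hypothetical converter output, shaped like the base structure above.
const rawJson = JSON.stringify({
  document: {
    filename: "original_filename.ext",
    title: "Document Title",
    type: "Document Type",
    metadata: {},
  },
  content: { type: "multi_page_document" },
});

// Summarize the top-level keys; a missing "images" key counts as zero images.
function describeDocument(json) {
  const { document, content, images = {} } = JSON.parse(json);
  return {
    title: document.title,
    documentType: document.type,
    contentType: content.type,
    imageCount: Object.keys(images).length,
  };
}

const summary = describeDocument(rawJson);
console.log(summary);
```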
Document-Specific Structures
PDF Documents
PDF documents are structured page by page, with the full text and a word count for each page:
{
  "document": {
    "filename": "report.pdf",
    "title": "Annual Report",
    "type": "PDF Document",
    "metadata": {
      "num_pages": 10,
      "author": "John Doe",
      "created": "2024-01-15T10:30:00Z"
    }
  },
  "content": {
    "type": "multi_page_document",
    "total_pages": 10,
    "pages": [
      {
        "page_number": 1,
        "text": "Page 1 complete text content...",
        "word_count": 245
      },
      {
        "page_number": 2,
        "text": "Page 2 complete text content...",
        "word_count": 312
      }
    ]
  }
}
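The page array lends itself to simple aggregation. The sketch below, which is illustrative only, sums the reported word counts and joins the page texts in page order. `pdfContent` copies the `content` object from the PDF example; `totalWordCount` and `fullText` are hypothetical helper names.

```javascript
// Sample data copied from the "content" object of the PDF example.
const pdfContent = {
  type: "multi_page_document",
  total_pages: 10,
  pages: [
    { page_number: 1, text: "Page 1 complete text content...", word_count: 245 },
    { page_number: 2, text: "Page 2 complete text content...", word_count: 312 },
  ],
};

// Sum the per-page word counts reported by the converter.
function totalWordCount(content) {
  return content.pages.reduce((sum, page) => sum + page.word_count, 0);
}

// Join page texts in ascending page order, separated by blank lines.
function fullText(content) {
  return [...content.pages]
    .sort((a, b) => a.page_number - b.page_number)
    .map((page) => page.text)
    .join("\n\n");
}

console.log(totalWordCount(pdfContent));
console.log(fullText(pdfContent));
```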
PowerPoint Presentations
Presentations are structured slide by slide, with the text and a short summary for each slide:
{
  "document": {
    "filename": "presentation.pptx",
    "title": "Marketing Strategy",
    "type": "PowerPoint Presentation",
    "metadata": {
      "num_slides": 15,
      "slide_width": 9144000,
      "slide_height": 6858000
    }
  },
  "content": {
    "type": "presentation",
    "total_slides": 15,
    "slides": [
      {
        "slide_number": 1,
        "text": "## Introduction\n- Point 1\n- Point 2\n- Point 3",
        "summary": "Introduction"
      },
      {
        "slide_number": 2,
        "text": "## Market Analysis\n- Current trends\n- Competition overview",
        "summary": "Market Analysis"
      }
    ]
  }
}
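A short sketch of two things a consumer might do with this structure: build an outline from the slide summaries, and convert the slide dimensions to inches. The unit of `slide_width` and `slide_height` is not stated here; the conversion assumes they are English Metric Units (EMUs, 914,400 per inch) as used by Office Open XML, which would make the sample a 10 in by 7.5 in slide. `slideSizeInInches` and `buildOutline` are hypothetical names.

```javascript
// Assumption: dimensions are EMUs (914,400 per inch), as in Office Open XML.
const EMU_PER_INCH = 914400;

// Sample data copied from the presentation example.
const presentation = {
  document: {
    metadata: { num_slides: 15, slide_width: 9144000, slide_height: 6858000 },
  },
  content: {
    type: "presentation",
    total_slides: 15,
    slides: [
      { slide_number: 1, summary: "Introduction" },
      { slide_number: 2, summary: "Market Analysis" },
    ],
  },
};

// Convert the slide dimensions to inches under the EMU assumption.
function slideSizeInInches(metadata) {
  return {
    width: metadata.slide_width / EMU_PER_INCH,
    height: metadata.slide_height / EMU_PER_INCH,
  };
}

// Build a numbered outline from the per-slide summaries.
function buildOutline(content) {
  return content.slides.map((slide) => `${slide.slide_number}. ${slide.summary}`);
}

console.log(slideSizeInInches(presentation.document.metadata));
console.log(buildOutline(presentation.content).join("\n"));
```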
Excel Spreadsheets
Spreadsheets are summarized sheet by sheet, with a few sample rows rather than the full data:
{
  "document": {
    "filename": "data.xlsx",
    "title": "Sales Data",
    "type": "Excel Workbook",
    "metadata": {
      "num_sheets": 3,
      "sheet_names": ["Sales", "Summary", "Charts"]
    }
  },
  "content": {
    "type": "spreadsheet",
    "total_sheets": 3,
    "sheet_names": ["Sales", "Summary", "Charts"],
    "sheets": [
      {
        "sheet_name": "Sales",
        "row_count": 1000,
        "column_count": 5,
        "headers": ["Date", "Product", "Amount", "Region", "Salesperson"],
        "has_data": true,
        "sample_data": [
          ["2024-01-01", "Widget A", "150.00", "North", "John"],
          ["2024-01-02", "Widget B", "200.00", "South", "Jane"],
          ["2024-01-03", "Widget C", "175.00", "East", "Bob"]
        ]
      }
    ]
  }
}
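Because `sample_data` rows are positional arrays, it is often convenient to pair each cell with its header. The following is a minimal sketch of that pairing using the sample sheet; `sampleRowsAsObjects` is a hypothetical helper, and the result covers only the sample rows, not the 1000 rows reported in `row_count`.

```javascript
// Sample sheet copied from the spreadsheet example.
const salesSheet = {
  sheet_name: "Sales",
  row_count: 1000,
  column_count: 5,
  headers: ["Date", "Product", "Amount", "Region", "Salesperson"],
  has_data: true,
  sample_data: [
    ["2024-01-01", "Widget A", "150.00", "North", "John"],
    ["2024-01-02", "Widget B", "200.00", "South", "Jane"],
    ["2024-01-03", "Widget C", "175.00", "East", "Bob"],
  ],
};

// Turn each positional row into an object keyed by the sheet headers.
function sampleRowsAsObjects(sheet) {
  if (!sheet.has_data) return [];
  return sheet.sample_data.map((row) =>
    Object.fromEntries(sheet.headers.map((header, i) => [header, row[i]]))
  );
}

const sampleRows = sampleRowsAsObjects(salesSheet);

// Amounts are strings in the sample, so convert them before doing arithmetic.
const sampleTotal = sampleRows.reduce((sum, row) => sum + Number(row.Amount), 0);

console.log(sampleRows[0]);
console.log(sampleTotal);
```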
Word Documents
Word documents are organized by sections and headings:
{
  "document": {
    "filename": "report.docx",
    "title": "Project Report",
    "type": "Word Document",
    "metadata": {
      "author": "Jane Smith",
      "created": "2024-01-10T14:30:00Z",
      "num_paragraphs": 25
    }
  },
  "content": {
    "type": "structured_document",
    "total_sections": 5,
    "sections": [
      {
        "heading": "Executive Summary",
        "content": "This section contains the executive summary..."
      },
      {
        "heading": "Introduction",
        "content": "The introduction section explains..."
      }
    ]
  }
}
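Looking up a section by its heading is a common access pattern for this structure. Here is one possible sketch; `findSection` is a hypothetical helper, and the case-insensitive, whitespace-trimmed comparison is a design choice of this example rather than documented converter behavior.

```javascript
// Sample data copied from the "content" object of the Word example.
const wordContent = {
  type: "structured_document",
  total_sections: 5,
  sections: [
    { heading: "Executive Summary", content: "This section contains the executive summary..." },
    { heading: "Introduction", content: "The introduction section explains..." },
  ],
};

// Return the first section whose heading matches, ignoring case and outer whitespace.
// Returns undefined when no section matches.
function findSection(content, heading) {
  const wanted = heading.trim().toLowerCase();
  return content.sections.find(
    (section) => section.heading.trim().toLowerCase() === wanted
  );
}

const intro = findSection(wordContent, "introduction");
console.log(intro ? intro.content : "Section not found");
```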
Images
When documents contain images, they are included as base64-encoded data:
{
  "images": {
    "chart1.png": "iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAYAAAAfFcSJAAAADUlEQVR42mP8/5+hHgAHggJ/PchI7wAAAABJRU5ErkJggg==",
    "logo.jpg": "base64_encoded_image_data_here..."
  }
}
Images are base64-encoded, which can result in large JSON files. Consider the trade-off between completeness and file size.
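To get the raw bytes back, each value can be decoded from base64. The sketch below assumes a Node.js runtime (it uses the built-in `Buffer`) and uses the `chart1.png` sample value from above; `decodeImages` and `looksLikePng` are hypothetical helpers. The PNG check only inspects the 8-byte file signature and is not a full image validation.

```javascript
// The "images" object, reduced to the one entry with real sample data.
const imagesField = {
  "chart1.png":
    "iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAYAAAAfFcSJAAAADUlEQVR42mP8/5+hHgAHggJ/PchI7wAAAABJRU5ErkJggg==",
};

// Decode every entry into a { name, bytes } pair, where bytes is a Node.js Buffer.
function decodeImages(imageMap) {
  return Object.entries(imageMap).map(([name, data]) => ({
    name,
    bytes: Buffer.from(data, "base64"),
  }));
}

// Check for the standard 8-byte PNG signature at the start of the buffer.
function looksLikePng(bytes) {
  const signature = Buffer.from([0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a]);
  return bytes.length >= 8 && bytes.subarray(0, 8).equals(signature);
}

const decodedImages = decodeImages(imagesField);
for (const { name, bytes } of decodedImages) {
  console.log(name, bytes.length, looksLikePng(bytes));
}
```

Each decoded buffer could then be written to disk or uploaded to separate storage instead of being kept inline.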
Element Types
Slide Elements (PowerPoint)
heading: Text with heading formatting
bullet_point: List items and bullet points
text: Regular text content
table_row: Table row data
Table Data (Spreadsheets)
header: Table header row
data: Data rows with both cell arrays and key-value mapping
empty: Empty sheets or tables
Document Sections (Word)
heading: Section headings at various levels
content: Paragraph content within sections
metadata: Word count, formatting information
Usage Examples
Here are code examples for working with different document types:
PDF Processing
// Get content from page 5 (jsonData is the parsed converter output)
const page5Content = jsonData.content.pages.find((p) => p.page_number === 5);
// find() returns undefined when the document has no page 5
console.log(page5Content?.text ?? "Page 5 not found");
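Since every output carries a `content.type` discriminator, one function can handle all four document types by switching on it. The following is a sketch of that pattern using the type values from the examples above; `extractTextParts` is a hypothetical name, and what counts as "text" for a spreadsheet (here, the sheet name and headers) is an arbitrary choice made for this example.

```javascript
// Return the text-bearing parts of any converter output as an array of strings.
function extractTextParts(jsonData) {
  const { content } = jsonData;
  switch (content.type) {
    case "multi_page_document":
      return content.pages.map((page) => page.text);
    case "presentation":
      return content.slides.map((slide) => slide.text);
    case "spreadsheet":
      // Only headers and sample rows are available, so list the headers per sheet.
      return content.sheets.map(
        (sheet) => `${sheet.sheet_name}: ${(sheet.headers ?? []).join(", ")}`
      );
    case "structured_document":
      return content.sections.map(
        (section) => `${section.heading}\n${section.content}`
      );
    default:
      throw new Error(`Unsupported content type: ${content.type}`);
  }
}

const wordExample = {
  content: {
    type: "structured_document",
    sections: [
      { heading: "Introduction", content: "The introduction section explains..." },
    ],
  },
};
console.log(extractTextParts(wordExample));
```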
Schema Validation
For applications requiring strict validation, here’s a basic JSON schema structure:
Base Schema
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "type": "object",
  "required": ["document", "content"],
  "properties": {
    "document": {
      "type": "object",
      "required": ["filename", "title", "type"],
      "properties": {
        "filename": { "type": "string" },
        "title": { "type": "string" },
        "type": { "type": "string" },
        "metadata": { "type": "object" }
      }
    },
    "content": {
      "type": "object",
      "required": ["type"],
      "properties": {
        "type": { "type": "string" }
      }
    },
    "images": {
      "type": "object",
      "patternProperties": {
        "^.*\\.(png|jpg|jpeg|gif|bmp)$": {
          "type": "string",
          "contentEncoding": "base64"
        }
      }
    }
  }
}
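For a quick sanity check without a schema library, the required fields from the base schema can be tested by hand. The sketch below mirrors only the `required` and `type` constraints shown above; it is not a JSON Schema validator, and `validateBase` is a hypothetical helper. A dedicated validator is the better choice where strict conformance matters.

```javascript
// True for plain (non-null, non-array) objects.
const isObject = (value) =>
  typeof value === "object" && value !== null && !Array.isArray(value);

// Return a list of problems; an empty list means the base checks passed.
function validateBase(data) {
  if (!isObject(data)) return ["root must be an object"];
  const errors = [];

  for (const key of ["document", "content"]) {
    if (!isObject(data[key])) errors.push(`"${key}" must be an object`);
  }
  if (isObject(data.document)) {
    for (const key of ["filename", "title", "type"]) {
      if (typeof data.document[key] !== "string") {
        errors.push(`document.${key} must be a string`);
      }
    }
  }
  if (isObject(data.content) && typeof data.content.type !== "string") {
    errors.push("content.type must be a string");
  }
  // "images" is optional, but must be an object when present.
  if (data.images !== undefined && !isObject(data.images)) {
    errors.push('"images" must be an object');
  }
  return errors;
}

console.log(validateBase({ document: { filename: "a.pdf" }, content: {} }));
```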
Best Practices
Memory Management: Large documents can produce substantial JSON files. Consider streaming or chunked processing for very large files.
Image Handling: Base64 encoding increases image size by about 33%. Consider separate image storage for production use.
Data Processing: The sample_data arrays are truncated to keep the JSON output small. Full data extraction may require custom processing.
Type Safety: Use TypeScript interfaces or JSON Schema validation for robust applications.
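The image-handling advice can be sketched as follows: base64 emits 4 characters for every 3 input bytes, which is where the roughly 33% overhead comes from, and the `images` key can be split off so the payloads are stored separately. This assumes a Node.js runtime; `base64Length` and `separateImages` are hypothetical helpers, and the sample payload is a placeholder that decodes to the text "hello" rather than a real image.

```javascript
// Base64 emits 4 output characters for every 3 input bytes (rounded up with padding).
function base64Length(byteCount) {
  return 4 * Math.ceil(byteCount / 3);
}

// Split converter output into a slim document (no "images" key) and decoded image payloads.
function separateImages(jsonData) {
  const { images = {}, ...slim } = jsonData;
  const files = Object.entries(images).map(([name, data]) => ({
    name,
    bytes: Buffer.from(data, "base64"),
  }));
  return { slim, files };
}

const withImages = {
  document: { filename: "report.pdf", title: "Annual Report", type: "PDF Document" },
  content: { type: "multi_page_document" },
  images: { "logo.png": "aGVsbG8=" }, // placeholder payload, not a real PNG
};

const { slim, files } = separateImages(withImages);
console.log(Object.keys(slim), files.map((file) => file.name));
console.log(base64Length(3000)); // encoded size in characters for 3000 raw bytes
```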
This structured approach makes it easy to programmatically process documents while preserving their original organization and hierarchy.