Processing Multi-Document Files

Learn how to process PDF files that bundle multiple logical documents into a single file. DocDigitizer automatically detects document boundaries and extracts data from each document individually.

Overview

Multi-document files are single PDF files that contain multiple logical documents. Common examples include:

  • Scanned batches – Multiple invoices scanned together into one PDF
  • Email attachments – Combined PDFs from email threads
  • Archive exports – Bulk exports from document management systems
  • Bank statements – Monthly statements combined into yearly files
  • Invoice bundles – Multiple invoices from the same vendor

Why Use Multi-Document Processing?

Benefit              Description
-------------------  ------------------------------------------------
Reduced API Calls    Process many documents with a single request
Automatic Splitting  No need to manually split PDFs before processing
Preserved Context    Related documents maintain their relationship
Batch Efficiency     Optimized processing for high-volume workflows

How It Works

When you submit a multi-document PDF, the DocDigitizer pipeline:

  1. Receives the PDF – The API accepts your file upload
  2. Analyzes structure – The Document Partitioner examines page layouts and content
  3. Detects boundaries – AI identifies where one document ends and another begins
  4. Classifies each document – Each segment is classified independently
  5. Extracts data – Data extraction runs on each identified document
  6. Returns combined results – All extractions are returned in a single response

Processing Flow

┌─────────────────────────────────────────────────────────────────┐
│                     Multi-Document PDF                          │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐       │
│  │ Page 1-2 │  │ Page 3-4 │  │ Page 5   │  │ Page 6-8 │       │
│  │Invoice #1│  │Invoice #2│  │Receipt   │  │Invoice #3│       │
│  └──────────┘  └──────────┘  └──────────┘  └──────────┘       │
└─────────────────────────────────────────────────────────────────┘
                              │
                              ▼
                    ┌─────────────────┐
                    │   Partitioner   │
                    │  (AI Analysis)  │
                    └─────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────────┐
│                      API Response                               │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐       │
│  │Extract 1 │  │Extract 2 │  │Extract 3 │  │Extract 4 │       │
│  │Pages: 1-2│  │Pages: 3-4│  │Pages: 5  │  │Pages: 6-8│       │
│  └──────────┘  └──────────┘  └──────────┘  └──────────┘       │
└─────────────────────────────────────────────────────────────────┘
    

Enabling Document Partitioning

Document partitioning is controlled by your Context configuration. Contact your account manager to enable or configure partitioning for your Context ID.

Configuration Options

Option                Description                                                Default
--------------------  ---------------------------------------------------------  -----------
Partitioning Enabled  Whether to detect and split multi-document files           Enabled
Max Documents         Maximum number of documents to extract per file            50
Classification Mode   Classify each document independently or use a single type  Independent

Standard API Request

No special parameters are needed. Simply submit your multi-document PDF:

curl -X POST https://apix.docdigitizer.com/sync \
  -H "x-api-key: YOUR_API_KEY" \
  -F "files=@multi_document_batch.pdf" \
  -F "id=550e8400-e29b-41d4-a716-446655440000" \
  -F "contextId=your-context-id"
    

Response Structure

When processing multi-document files, the API returns an array of extractions, one for each detected document.

Response Format

{
    "StateText": "DONE",
    "ExternalId": "550e8400-e29b-41d4-a716-446655440000",
    "TraceId": "ABC123XYZ",
    "Output": [
        {
            "DocumentType": "Invoice",
            "Pages": "1-2",
            "PageCount": 2,
            "Confidence": 0.95,
            "Fields": {
                "InvoiceNumber": "INV-2024-001",
                "InvoiceDate": "2024-01-15",
                "TotalAmount": 1250.00,
                "VendorName": "Supplier Corp"
            }
        },
        {
            "DocumentType": "Invoice",
            "Pages": "3-4",
            "PageCount": 2,
            "Confidence": 0.92,
            "Fields": {
                "InvoiceNumber": "INV-2024-002",
                "InvoiceDate": "2024-01-16",
                "TotalAmount": 875.50,
                "VendorName": "Another Vendor"
            }
        },
        {
            "DocumentType": "Receipt",
            "Pages": "5",
            "PageCount": 1,
            "Confidence": 0.88,
            "Fields": {
                "ReceiptNumber": "RCP-5544",
                "Date": "2024-01-17",
                "Amount": 45.99,
                "Merchant": "Office Supplies Inc"
            }
        }
    ]
}
    

Key Response Fields

Field                  Type     Description
---------------------  -------  ------------------------------------------------------
Output                 Array    Array of extraction objects, one per detected document
Output[].DocumentType  String   Classified document type for this segment
Output[].Pages         String   Page range in the original PDF (e.g., "1-2", "5")
Output[].PageCount     Integer  Number of pages in this document segment
Output[].Confidence    Float    Classification confidence score (0.0 to 1.0)
Output[].Fields        Object   Extracted fields specific to the document type

Processing Examples

Python

import requests
import uuid

def process_multi_document(file_path, api_key, context_id):
    """Process a multi-document PDF and return all extractions."""

    url = "https://apix.docdigitizer.com/sync"

    headers = {
        "x-api-key": api_key
    }

    with open(file_path, "rb") as f:
        files = {"files": f}
        data = {
            "id": str(uuid.uuid4()),
            "contextId": context_id
        }

        response = requests.post(url, headers=headers, files=files, data=data)

    if response.status_code == 200:
        result = response.json()

        if result.get("StateText") == "DONE":
            extractions = result.get("Output", [])
            print(f"Found {len(extractions)} documents in file")
            return extractions
        else:
            print(f"Processing failed: {result.get('Messages', [])}")
            return None
    else:
        print(f"HTTP Error: {response.status_code}")
        return None


# Process a batch file
extractions = process_multi_document(
    "invoice_batch.pdf",
    "dd_live_your_api_key",
    "your-context-id"
)

# Handle each extraction
if extractions:
    for i, doc in enumerate(extractions):
        print(f"\n--- Document {i + 1} ---")
        print(f"Type: {doc.get('DocumentType')}")
        print(f"Pages: {doc.get('Pages')}")

        fields = doc.get("Fields", {})
        for field_name, field_value in fields.items():
            print(f"  {field_name}: {field_value}")
    

PowerShell

function Process-MultiDocumentFile {
    param(
        [string]$FilePath,
        [string]$ApiKey,
        [string]$ContextId
    )

    $headers = @{
        "x-api-key" = $ApiKey
    }

    $form = @{
        files = Get-Item -Path $FilePath
        id = [guid]::NewGuid().ToString()
        contextId = $ContextId
    }

    try {
        $response = Invoke-RestMethod -Uri "https://apix.docdigitizer.com/sync" `
            -Method Post `
            -Headers $headers `
            -Form $form

        if ($response.StateText -eq "DONE") {
            $documents = $response.Output
            Write-Host "Found $($documents.Count) documents in file"
            return $documents
        }
        else {
            Write-Warning "Processing failed: $($response.Messages -join ', ')"
            return $null
        }
    }
    catch {
        Write-Error "Request failed: $_"
        return $null
    }
}

# Process a batch file
$documents = Process-MultiDocumentFile `
    -FilePath "C:\Documents\invoice_batch.pdf" `
    -ApiKey "dd_live_your_api_key" `
    -ContextId "your-context-id"

# Display results
if ($documents) {
    $docNumber = 1
    foreach ($doc in $documents) {
        Write-Host "`n--- Document $docNumber ---"
        Write-Host "Type: $($doc.DocumentType)"
        Write-Host "Pages: $($doc.Pages)"

        $doc.Fields.PSObject.Properties | ForEach-Object {
            Write-Host "  $($_.Name): $($_.Value)"
        }

        $docNumber++
    }
}
    

JavaScript (Node.js)

const FormData = require('form-data');
const fs = require('fs');
// Note: require('node-fetch') needs node-fetch v2 (v3 is ESM-only);
// on Node 18+ you can use the built-in global fetch instead.
const fetch = require('node-fetch');
const { v4: uuidv4 } = require('uuid');

async function processMultiDocument(filePath, apiKey, contextId) {
    const form = new FormData();
    form.append('files', fs.createReadStream(filePath));
    form.append('id', uuidv4());
    form.append('contextId', contextId);

    const response = await fetch('https://apix.docdigitizer.com/sync', {
        method: 'POST',
        headers: {
            'x-api-key': apiKey
        },
        body: form
    });

    const result = await response.json();

    if (result.StateText === 'DONE') {
        console.log(`Found ${result.Output.length} documents`);
        return result.Output;
    } else {
        console.error('Processing failed:', result.Messages);
        return null;
    }
}

// Process and handle results
(async () => {
    const documents = await processMultiDocument(
        'invoice_batch.pdf',
        'dd_live_your_api_key',
        'your-context-id'
    );

    if (documents) {
        documents.forEach((doc, index) => {
            console.log(`\n--- Document ${index + 1} ---`);
            console.log(`Type: ${doc.DocumentType}`);
            console.log(`Pages: ${doc.Pages}`);

            Object.entries(doc.Fields || {}).forEach(([key, value]) => {
                console.log(`  ${key}: ${value}`);
            });
        });
    }
})();
    

Handling Multiple Results

Iterating Through Documents

Always check if the response contains an array of extractions:

# Python example
output = result.get("Output", [])

# Ensure we have a list
if not isinstance(output, list):
    output = [output]

for index, document in enumerate(output):
    document_type = document.get("DocumentType")
    page_range = document.get("Pages")
    fields = document.get("Fields", {})

    # Process each document...
    print(f"Document {index + 1}: {document_type} (pages {page_range})")
    

Grouping by Document Type

# Python - Group extractions by document type
from collections import defaultdict

def group_by_type(extractions):
    grouped = defaultdict(list)

    for doc in extractions:
        doc_type = doc.get("DocumentType", "Unknown")
        grouped[doc_type].append(doc)

    return dict(grouped)

# Usage
extractions = result.get("Output", [])
grouped = group_by_type(extractions)

print(f"Invoices: {len(grouped.get('Invoice', []))}")
print(f"Receipts: {len(grouped.get('Receipt', []))}")
print(f"Other: {len(grouped.get('Unknown', []))}")
    

Aggregating Financial Data

# Python - Sum totals from multiple invoices
def aggregate_invoice_totals(extractions):
    total = 0.0
    invoice_count = 0

    for doc in extractions:
        if doc.get("DocumentType") == "Invoice":
            fields = doc.get("Fields", {})
            amount = fields.get("TotalAmount", 0)

            # Handle string amounts
            if isinstance(amount, str):
                amount = float(amount.replace(",", "").replace("$", ""))

            total += amount
            invoice_count += 1

    return {
        "invoice_count": invoice_count,
        "total_amount": total
    }

# Usage
summary = aggregate_invoice_totals(extractions)
print(f"Processed {summary['invoice_count']} invoices")
print(f"Total value: ${summary['total_amount']:,.2f}")
    

Document Boundary Detection

The Document Partitioner uses multiple signals to detect where documents begin and end:

Detection Methods

Signal            Description                                         Example
----------------  --------------------------------------------------  --------------------------------------
Visual Layout     Significant changes in page layout or structure     Letterhead appearing after body text
Header Detection  Recognizable document headers (logo, company name)  Invoice header, letter greeting
Page Numbering    Page numbers restarting at 1                        "Page 1 of 3" after "Page 2 of 2"
Document IDs      New document identifiers appearing                  Different invoice numbers
Date Patterns     Significant date changes in headers                 January invoice followed by February
Content Analysis  AI analysis of semantic content shifts              Invoice content changing to contract

Page Range Information

Each extraction includes page range information to help you:

  • Identify which pages belong to which document
  • Cross-reference with original file if needed
  • Verify correct splitting for quality assurance
  • Generate split PDFs for archival

{
    "DocumentType": "Invoice",
    "Pages": "3-5",
    "PageCount": 3,
    "StartPage": 3,
    "EndPage": 5,
    "Fields": { ... }
}
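
The page-range fields above can be used to regenerate split PDFs from the original file. A minimal sketch, assuming the third-party pypdf package (pip install pypdf); the helper names and output filenames are illustrative, not part of the API:

```python
def parse_pages(pages):
    """Parse an API page range such as "3-5" or "5" into (start, end)."""
    if "-" in pages:
        start, end = map(int, pages.split("-"))
        return start, end
    page = int(pages)
    return page, page

def split_by_extractions(pdf_path, extractions, out_prefix="doc"):
    """Write one PDF per extraction, using each document's "Pages" range."""
    from pypdf import PdfReader, PdfWriter  # third-party: pip install pypdf

    reader = PdfReader(pdf_path)
    for i, doc in enumerate(extractions, start=1):
        start, end = parse_pages(doc.get("Pages", ""))
        writer = PdfWriter()
        for page_index in range(start - 1, end):  # response pages are 1-based
            writer.add_page(reader.pages[page_index])
        out_name = f"{out_prefix}_{i}_{doc.get('DocumentType', 'Unknown')}.pdf"
        with open(out_name, "wb") as fh:
            writer.write(fh)
```

Any PDF library that can copy individual pages would work equally well here.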
    

Mixed Document Types

Multi-document files can contain different document types. Each segment is classified independently.

Example: Mixed Batch Processing

A single PDF containing two invoices, a purchase order, and a receipt:
{
    "StateText": "DONE",
    "Output": [
        {
            "DocumentType": "Invoice",
            "Pages": "1-2",
            "Fields": {
                "InvoiceNumber": "INV-001",
                "TotalAmount": 1500.00
            }
        },
        {
            "DocumentType": "PurchaseOrder",
            "Pages": "3",
            "Fields": {
                "PONumber": "PO-2024-555",
                "OrderTotal": 2200.00
            }
        },
        {
            "DocumentType": "Receipt",
            "Pages": "4",
            "Fields": {
                "ReceiptNumber": "RCP-789",
                "Amount": 49.99
            }
        },
        {
            "DocumentType": "Invoice",
            "Pages": "5-6",
            "Fields": {
                "InvoiceNumber": "INV-002",
                "TotalAmount": 875.00
            }
        }
    ]
}
    

Processing Mixed Types

# Python - Route documents by type
def process_mixed_batch(extractions):
    results = {
        "invoices": [],
        "purchase_orders": [],
        "receipts": [],
        "other": []
    }

    for doc in extractions:
        doc_type = doc.get("DocumentType", "").lower()

        if doc_type == "invoice":
            results["invoices"].append(process_invoice(doc))
        elif doc_type == "purchaseorder":
            results["purchase_orders"].append(process_po(doc))
        elif doc_type == "receipt":
            results["receipts"].append(process_receipt(doc))
        else:
            results["other"].append(doc)

    return results

def process_invoice(doc):
    fields = doc.get("Fields", {})
    return {
        "number": fields.get("InvoiceNumber"),
        "amount": fields.get("TotalAmount"),
        "pages": doc.get("Pages"),
        "type": "invoice"
    }

def process_po(doc):
    fields = doc.get("Fields", {})
    return {
        "number": fields.get("PONumber"),
        "amount": fields.get("OrderTotal"),
        "pages": doc.get("Pages"),
        "type": "purchase_order"
    }

def process_receipt(doc):
    fields = doc.get("Fields", {})
    return {
        "number": fields.get("ReceiptNumber"),
        "amount": fields.get("Amount"),
        "pages": doc.get("Pages"),
        "type": "receipt"
    }
    

Best Practices

File Preparation

  • Consistent orientation – Ensure all pages have correct rotation
  • Reasonable file size – Keep files under 50MB for optimal processing
  • Clear scans – Use at least 200 DPI for scanned documents
  • Document limits – Stay within 50 documents per file for best results
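
The guidelines above can be turned into a lightweight pre-flight check before uploading. A sketch using only the standard library; the function name and return shape are illustrative, not part of the API:

```python
import os

MAX_FILE_BYTES = 50 * 1024 * 1024  # 50 MB, per the guideline above

def preflight_check(file_path, max_bytes=MAX_FILE_BYTES):
    """Return a list of problems found before uploading; empty means OK."""
    issues = []
    if not file_path.lower().endswith(".pdf"):
        issues.append("file does not have a .pdf extension")
    size = os.path.getsize(file_path)
    if size > max_bytes:
        issues.append(f"file is {size / (1024 * 1024):.1f} MB; keep under 50 MB")
    return issues
```

Document count and scan resolution can only be checked with a PDF library, so this sketch limits itself to the file-level checks.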

Processing Strategies

Strategy                  When to Use
------------------------  ------------------------------------------------------------
Single large batch        When documents are related and need to be processed together
Multiple smaller batches  When processing very large volumes (100+ documents)
Pre-split files           When you need precise control over document boundaries
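
For the "multiple smaller batches" strategy, one way to pre-split a large PDF into fixed-size chunks, assuming the third-party pypdf package. Note that fixed-size chunking can cut a logical document in half, so prefer splitting at known document boundaries when you have them:

```python
def chunk_ranges(total_pages, pages_per_chunk=25):
    """1-based (start, end) page ranges covering total_pages in fixed chunks."""
    return [
        (start + 1, min(start + pages_per_chunk, total_pages))
        for start in range(0, total_pages, pages_per_chunk)
    ]

def chunk_pdf(pdf_path, pages_per_chunk=25, out_prefix="batch"):
    """Write each chunk as its own PDF and return the output paths."""
    from pypdf import PdfReader, PdfWriter  # third-party: pip install pypdf

    reader = PdfReader(pdf_path)
    paths = []
    for i, (start, end) in enumerate(
        chunk_ranges(len(reader.pages), pages_per_chunk), start=1
    ):
        writer = PdfWriter()
        for page in reader.pages[start - 1:end]:
            writer.add_page(page)
        out_path = f"{out_prefix}_{i}.pdf"
        with open(out_path, "wb") as fh:
            writer.write(fh)
        paths.append(out_path)
    return paths
```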

Error Handling

# Python - Robust multi-document processing
def process_with_validation(extractions):
    valid_documents = []
    issues = []

    for i, doc in enumerate(extractions):
        # Check for required fields
        if not doc.get("DocumentType"):
            issues.append(f"Document {i+1}: Missing document type")
            continue

        # Check confidence threshold
        confidence = doc.get("Confidence", 0)
        if confidence < 0.7:
            issues.append(
                f"Document {i+1}: Low confidence ({confidence:.2f})"
            )

        # Validate fields are present
        fields = doc.get("Fields", {})
        if not fields:
            issues.append(f"Document {i+1}: No fields extracted")
            continue

        valid_documents.append(doc)

    return {
        "valid": valid_documents,
        "issues": issues,
        "success_rate": len(valid_documents) / len(extractions) if extractions else 0
    }
    

Performance Tips

  • Batch related documents – Combine invoices from the same vendor
  • Monitor processing time – Large files take longer; plan accordingly
  • Use async processing – For very large batches, consider async workflows
  • Cache results – Store extractions to avoid reprocessing
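
For very large workloads, a thread pool is a simple way to run several batch files concurrently against the same /sync endpoint. A sketch that takes any per-file processing function, such as the process_multi_document helper from the Python example above:

```python
from concurrent.futures import ThreadPoolExecutor

def process_files_in_parallel(file_paths, process_fn, max_workers=4):
    """Apply process_fn to each file path concurrently; returns {path: result}."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(process_fn, path): path for path in file_paths}
        return {path: future.result() for future, path in futures.items()}
```

Here process_fn would be a small wrapper, e.g. lambda p: process_multi_document(p, api_key, context_id). Keep max_workers modest so you stay within your account's rate limits.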

Troubleshooting

Common Issues

Issue                        Possible Cause                                Solution
---------------------------  --------------------------------------------  -----------------------------------------------
Only one document detected   Documents too similar or no clear boundaries  Add separator pages or pre-split the file
Too many documents detected  Multi-page documents being split incorrectly  Review page layouts; contact support for tuning
Wrong page assignments       Ambiguous document boundaries                 Use clearer document separators or headers
Mixed classifications        Similar document types being confused         Request Context tuning for your document types
Timeout errors               File too large or too many documents          Split into smaller batches (25-50 pages each)

Debugging Tips

  1. Check page counts – Verify the sum of all PageCount values equals your PDF’s total pages
  2. Review confidence scores – Low confidence may indicate boundary issues
  3. Test with smaller files – Start with 2-3 documents to validate behavior
  4. Compare page ranges – Ensure no pages are missing or duplicated

# Validation helper
def validate_page_coverage(extractions, total_pages):
    covered_pages = set()

    for doc in extractions:
        pages = doc.get("Pages", "")
        if not pages:
            continue  # skip documents without page range information

        # Parse page range (e.g., "1-3" or "5")
        if "-" in pages:
            start, end = map(int, pages.split("-"))
            covered_pages.update(range(start, end + 1))
        else:
            covered_pages.add(int(pages))

    expected = set(range(1, total_pages + 1))
    missing = expected - covered_pages
    extra = covered_pages - expected

    if missing:
        print(f"Warning: Pages not covered: {sorted(missing)}")
    if extra:
        print(f"Warning: Invalid pages referenced: {sorted(extra)}")

    return len(missing) == 0 and len(extra) == 0
    

Getting Help

If you encounter persistent issues with multi-document processing:

  1. Note the TraceId from the API response
  2. Save a sample file that demonstrates the issue
  3. Document the expected vs. actual number of documents
  4. Contact support@docdigitizer.com with these details