Processing Multi-Document Files

Learn how to process PDF files that bundle multiple logical documents into a single file. DocDigitizer automatically detects document boundaries and extracts data from each document individually.

Overview

Multi-document files are single PDF files that contain multiple logical documents. Common examples include:

  • Scanned batches – Multiple invoices scanned together into one PDF
  • Email attachments – Combined PDFs from email threads
  • Archive exports – Bulk exports from document management systems
  • Bank statements – Monthly statements combined into yearly files
  • Invoice bundles – Multiple invoices from the same vendor

Why Use Multi-Document Processing?

Benefit              Description
-------------------  ------------------------------------------------
Reduced API Calls    Process many documents with a single request
Automatic Splitting  No need to manually split PDFs before processing
Preserved Context    Related documents maintain their relationship
Batch Efficiency     Optimized processing for high-volume workflows

How It Works

When you submit a multi-document PDF, the DocDigitizer pipeline:

  1. Receives the PDF – The API accepts your file upload
  2. Analyzes structure – The Document Partitioner examines page layouts and content
  3. Detects boundaries – AI identifies where one document ends and another begins
  4. Classifies each document – Each segment is classified independently
  5. Extracts data – Data extraction runs on each identified document
  6. Returns combined results – All extractions are returned in a single response

Processing Flow

┌─────────────────────────────────────────────────────────────────┐
│                     Multi-Document PDF                          │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐       │
│  │ Page 1-2 │  │ Page 3-4 │  │ Page 5   │  │ Page 6-8 │       │
│  │Invoice #1│  │Invoice #2│  │Receipt   │  │Invoice #3│       │
│  └──────────┘  └──────────┘  └──────────┘  └──────────┘       │
└─────────────────────────────────────────────────────────────────┘
                              │
                              ▼
                    ┌─────────────────┐
                    │   Partitioner   │
                    │  (AI Analysis)  │
                    └─────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────────┐
│                      API Response                               │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐       │
│  │Extract 1 │  │Extract 2 │  │Extract 3 │  │Extract 4 │       │
│  │Pages: 1-2│  │Pages: 3-4│  │Pages: 5  │  │Pages: 6-8│       │
│  └──────────┘  └──────────┘  └──────────┘  └──────────┘       │
└─────────────────────────────────────────────────────────────────┘
    

Enabling Document Partitioning

Document partitioning is controlled by your Context configuration. Contact your account manager to enable or configure partitioning for your Context ID.

Configuration Options

Option                Description                                                Default
--------------------  ---------------------------------------------------------  -----------
Partitioning Enabled  Whether to detect and split multi-document files           Enabled
Max Documents         Maximum number of documents to extract per file            50
Classification Mode   Classify each document independently or use a single type  Independent

Standard API Request

No special parameters are needed. Simply submit your multi-document PDF:

curl -X POST https://apix.docdigitizer.com/sync \
  -H "x-api-key: YOUR_API_KEY" \
  -F "files=@multi_document_batch.pdf" \
  -F "id=550e8400-e29b-41d4-a716-446655440000" \
  -F "contextId=your-context-id"
    

Response Structure

When processing multi-document files, the API returns an array of extractions, one for each detected document.

Response Format

{
    "StateText": "DONE",
    "ExternalId": "550e8400-e29b-41d4-a716-446655440000",
    "TraceId": "ABC123XYZ",
    "Output": [
        {
            "DocumentType": "Invoice",
            "Pages": "1-2",
            "PageCount": 2,
            "Confidence": 0.95,
            "Fields": {
                "InvoiceNumber": "INV-2024-001",
                "InvoiceDate": "2024-01-15",
                "TotalAmount": 1250.00,
                "VendorName": "Supplier Corp"
            }
        },
        {
            "DocumentType": "Invoice",
            "Pages": "3-4",
            "PageCount": 2,
            "Confidence": 0.92,
            "Fields": {
                "InvoiceNumber": "INV-2024-002",
                "InvoiceDate": "2024-01-16",
                "TotalAmount": 875.50,
                "VendorName": "Another Vendor"
            }
        },
        {
            "DocumentType": "Receipt",
            "Pages": "5",
            "PageCount": 1,
            "Confidence": 0.88,
            "Fields": {
                "ReceiptNumber": "RCP-5544",
                "Date": "2024-01-17",
                "Amount": 45.99,
                "Merchant": "Office Supplies Inc"
            }
        }
    ]
}
    

Key Response Fields

Field                  Type     Description
---------------------  -------  ------------------------------------------------------
Output                 Array    Array of extraction objects, one per detected document
Output[].DocumentType  String   Classified document type for this segment
Output[].Pages         String   Page range in the original PDF (e.g., "1-2", "5")
Output[].PageCount     Integer  Number of pages in this document segment
Output[].Confidence    Float    Classification confidence score (0.0 to 1.0)
Output[].Fields        Object   Extracted fields specific to the document type

Processing Examples

Python

import requests
import uuid

def process_multi_document(file_path, api_key, context_id):
    """Process a multi-document PDF and return all extractions."""

    url = "https://apix.docdigitizer.com/sync"

    headers = {
        "x-api-key": api_key
    }

    with open(file_path, "rb") as f:
        files = {"files": f}
        data = {
            "id": str(uuid.uuid4()),
            "contextId": context_id
        }

        response = requests.post(url, headers=headers, files=files, data=data)

    if response.status_code == 200:
        result = response.json()

        if result.get("StateText") == "DONE":
            extractions = result.get("Output", [])
            print(f"Found {len(extractions)} documents in file")
            return extractions
        else:
            print(f"Processing failed: {result.get('Messages', [])}")
            return None
    else:
        print(f"HTTP Error: {response.status_code}")
        return None


# Process a batch file
extractions = process_multi_document(
    "invoice_batch.pdf",
    "dd_live_your_api_key",
    "your-context-id"
)

# Handle each extraction
if extractions:
    for i, doc in enumerate(extractions):
        print(f"\n--- Document {i + 1} ---")
        print(f"Type: {doc.get('DocumentType')}")
        print(f"Pages: {doc.get('Pages')}")

        fields = doc.get("Fields", {})
        for field_name, field_value in fields.items():
            print(f"  {field_name}: {field_value}")
    

PowerShell

function Process-MultiDocumentFile {
    param(
        [string]$FilePath,
        [string]$ApiKey,
        [string]$ContextId
    )

    $headers = @{
        "x-api-key" = $ApiKey
    }

    $form = @{
        files = Get-Item -Path $FilePath
        id = [guid]::NewGuid().ToString()
        contextId = $ContextId
    }

    try {
        $response = Invoke-RestMethod -Uri "https://apix.docdigitizer.com/sync" `
            -Method Post `
            -Headers $headers `
            -Form $form

        if ($response.StateText -eq "DONE") {
            $documents = $response.Output
            Write-Host "Found $($documents.Count) documents in file"
            return $documents
        }
        else {
            Write-Warning "Processing failed: $($response.Messages -join ', ')"
            return $null
        }
    }
    catch {
        Write-Error "Request failed: $_"
        return $null
    }
}

# Process a batch file
$documents = Process-MultiDocumentFile `
    -FilePath "C:\Documents\invoice_batch.pdf" `
    -ApiKey "dd_live_your_api_key" `
    -ContextId "your-context-id"

# Display results
if ($documents) {
    $docNumber = 1
    foreach ($doc in $documents) {
        Write-Host "`n--- Document $docNumber ---"
        Write-Host "Type: $($doc.DocumentType)"
        Write-Host "Pages: $($doc.Pages)"

        $doc.Fields.PSObject.Properties | ForEach-Object {
            Write-Host "  $($_.Name): $($_.Value)"
        }

        $docNumber++
    }
}
    

JavaScript (Node.js)

const FormData = require('form-data');
const fs = require('fs');
// Note: require('node-fetch') needs node-fetch v2 (v3 is ESM-only);
// on Node 18+ you can use the built-in global fetch instead.
const fetch = require('node-fetch');
const { v4: uuidv4 } = require('uuid');

async function processMultiDocument(filePath, apiKey, contextId) {
    const form = new FormData();
    form.append('files', fs.createReadStream(filePath));
    form.append('id', uuidv4());
    form.append('contextId', contextId);

    const response = await fetch('https://apix.docdigitizer.com/sync', {
        method: 'POST',
        headers: {
            'x-api-key': apiKey
        },
        body: form
    });

    const result = await response.json();

    if (result.StateText === 'DONE') {
        console.log(`Found ${result.Output.length} documents`);
        return result.Output;
    } else {
        console.error('Processing failed:', result.Messages);
        return null;
    }
}

// Process and handle results
(async () => {
    const documents = await processMultiDocument(
        'invoice_batch.pdf',
        'dd_live_your_api_key',
        'your-context-id'
    );

    if (documents) {
        documents.forEach((doc, index) => {
            console.log(`\n--- Document ${index + 1} ---`);
            console.log(`Type: ${doc.DocumentType}`);
            console.log(`Pages: ${doc.Pages}`);

            Object.entries(doc.Fields || {}).forEach(([key, value]) => {
                console.log(`  ${key}: ${value}`);
            });
        });
    }
})();
    

Handling Multiple Results

Iterating Through Documents

Always check if the response contains an array of extractions:

# Python example
output = result.get("Output", [])

# Ensure we have a list
if not isinstance(output, list):
    output = [output]

for index, document in enumerate(output):
    document_type = document.get("DocumentType")
    page_range = document.get("Pages")
    fields = document.get("Fields", {})

    # Process each document...
    print(f"Document {index + 1}: {document_type} (pages {page_range})")
    

Grouping by Document Type

# Python - Group extractions by document type
from collections import defaultdict

def group_by_type(extractions):
    grouped = defaultdict(list)

    for doc in extractions:
        doc_type = doc.get("DocumentType", "Unknown")
        grouped[doc_type].append(doc)

    return dict(grouped)

# Usage
extractions = result.get("Output", [])
grouped = group_by_type(extractions)

print(f"Invoices: {len(grouped.get('Invoice', []))}")
print(f"Receipts: {len(grouped.get('Receipt', []))}")
print(f"Other: {len(grouped.get('Unknown', []))}")
    

Aggregating Financial Data

# Python - Sum totals from multiple invoices
def aggregate_invoice_totals(extractions):
    total = 0.0
    invoice_count = 0

    for doc in extractions:
        if doc.get("DocumentType") == "Invoice":
            fields = doc.get("Fields", {})
            amount = fields.get("TotalAmount", 0)

            # Handle string amounts
            if isinstance(amount, str):
                amount = float(amount.replace(",", "").replace("$", ""))

            total += amount
            invoice_count += 1

    return {
        "invoice_count": invoice_count,
        "total_amount": total
    }

# Usage
summary = aggregate_invoice_totals(extractions)
print(f"Processed {summary['invoice_count']} invoices")
print(f"Total value: ${summary['total_amount']:,.2f}")
    

Document Boundary Detection

The Document Partitioner uses multiple signals to detect where documents begin and end:

Detection Methods

Signal            Description                                         Example
----------------  --------------------------------------------------  --------------------------------------
Visual Layout     Significant changes in page layout or structure     Letterhead appearing after body text
Header Detection  Recognizable document headers (logo, company name)  Invoice header, letter greeting
Page Numbering    Page numbers restarting at 1                        "Page 1 of 3" after "Page 2 of 2"
Document IDs      New document identifiers appearing                  Different invoice numbers
Date Patterns     Significant date changes in headers                 January invoice followed by February
Content Analysis  AI analysis of semantic content shifts              Invoice content changing to contract

Page Range Information

Each extraction includes page range information to help you:

  • Identify which pages belong to which document
  • Cross-reference with original file if needed
  • Verify correct splitting for quality assurance
  • Generate split PDFs for archival

{
    "DocumentType": "Invoice",
    "Pages": "3-5",
    "PageCount": 3,
    "StartPage": 3,
    "EndPage": 5,
    "Fields": { ... }
}
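
The page-range fields above can be used to regenerate split PDFs from the original file. A minimal sketch, assuming the third-party pypdf package (pip install pypdf); the helper names and output filenames are illustrative, not part of the API:

```python
def parse_pages(pages):
    """Parse an API page range such as "3-5" or "5" into (start, end)."""
    if "-" in pages:
        start, end = map(int, pages.split("-"))
        return start, end
    page = int(pages)
    return page, page

def split_by_extractions(pdf_path, extractions, out_prefix="doc"):
    """Write one PDF per extraction, using each document's "Pages" range."""
    from pypdf import PdfReader, PdfWriter  # third-party: pip install pypdf

    reader = PdfReader(pdf_path)
    for i, doc in enumerate(extractions, start=1):
        start, end = parse_pages(doc.get("Pages", ""))
        writer = PdfWriter()
        for page_index in range(start - 1, end):  # response pages are 1-based
            writer.add_page(reader.pages[page_index])
        out_name = f"{out_prefix}_{i}_{doc.get('DocumentType', 'Unknown')}.pdf"
        with open(out_name, "wb") as fh:
            writer.write(fh)
```

Any PDF library that can copy individual pages would work equally well here.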
    

Mixed Document Types

Multi-document files can contain different document types. Each segment is classified independently.

Example: Mixed Batch Processing

A single PDF containing two invoices, a purchase order, and a receipt:
{
    "StateText": "DONE",
    "Output": [
        {
            "DocumentType": "Invoice",
            "Pages": "1-2",
            "Fields": {
                "InvoiceNumber": "INV-001",
                "TotalAmount": 1500.00
            }
        },
        {
            "DocumentType": "PurchaseOrder",
            "Pages": "3",
            "Fields": {
                "PONumber": "PO-2024-555",
                "OrderTotal": 2200.00
            }
        },
        {
            "DocumentType": "Receipt",
            "Pages": "4",
            "Fields": {
                "ReceiptNumber": "RCP-789",
                "Amount": 49.99
            }
        },
        {
            "DocumentType": "Invoice",
            "Pages": "5-6",
            "Fields": {
                "InvoiceNumber": "INV-002",
                "TotalAmount": 875.00
            }
        }
    ]
}
    

Processing Mixed Types

# Python - Route documents by type
def process_mixed_batch(extractions):
    results = {
        "invoices": [],
        "purchase_orders": [],
        "receipts": [],
        "other": []
    }

    for doc in extractions:
        doc_type = doc.get("DocumentType", "").lower()

        if doc_type == "invoice":
            results["invoices"].append(process_invoice(doc))
        elif doc_type == "purchaseorder":
            results["purchase_orders"].append(process_po(doc))
        elif doc_type == "receipt":
            results["receipts"].append(process_receipt(doc))
        else:
            results["other"].append(doc)

    return results

def process_invoice(doc):
    fields = doc.get("Fields", {})
    return {
        "number": fields.get("InvoiceNumber"),
        "amount": fields.get("TotalAmount"),
        "pages": doc.get("Pages"),
        "type": "invoice"
    }

def process_po(doc):
    fields = doc.get("Fields", {})
    return {
        "number": fields.get("PONumber"),
        "amount": fields.get("OrderTotal"),
        "pages": doc.get("Pages"),
        "type": "purchase_order"
    }

def process_receipt(doc):
    fields = doc.get("Fields", {})
    return {
        "number": fields.get("ReceiptNumber"),
        "amount": fields.get("Amount"),
        "pages": doc.get("Pages"),
        "type": "receipt"
    }
    

Best Practices

File Preparation

  • Consistent orientation – Ensure all pages have correct rotation
  • Reasonable file size – Keep files under 50MB for optimal processing
  • Clear scans – Use at least 200 DPI for scanned documents
  • Document limits – Stay within 50 documents per file for best results
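
The guidelines above can be turned into a lightweight pre-flight check before uploading. A sketch using only the standard library; the function name and return shape are illustrative, not part of the API:

```python
import os

MAX_FILE_BYTES = 50 * 1024 * 1024  # 50 MB, per the guideline above

def preflight_check(file_path, max_bytes=MAX_FILE_BYTES):
    """Return a list of problems found before uploading; empty means OK."""
    issues = []
    if not file_path.lower().endswith(".pdf"):
        issues.append("file does not have a .pdf extension")
    size = os.path.getsize(file_path)
    if size > max_bytes:
        issues.append(f"file is {size / (1024 * 1024):.1f} MB; keep under 50 MB")
    return issues
```

Document count and scan resolution can only be checked with a PDF library, so this sketch limits itself to the file-level checks.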

Processing Strategies

Strategy                  When to Use
------------------------  ------------------------------------------------------------
Single large batch        When documents are related and need to be processed together
Multiple smaller batches  When processing very large volumes (100+ documents)
Pre-split files           When you need precise control over document boundaries
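
For the "multiple smaller batches" strategy, one way to pre-split a large PDF into fixed-size chunks, assuming the third-party pypdf package. Note that fixed-size chunking can cut a logical document in half, so prefer splitting at known document boundaries when you have them:

```python
def chunk_ranges(total_pages, pages_per_chunk=25):
    """1-based (start, end) page ranges covering total_pages in fixed chunks."""
    return [
        (start + 1, min(start + pages_per_chunk, total_pages))
        for start in range(0, total_pages, pages_per_chunk)
    ]

def chunk_pdf(pdf_path, pages_per_chunk=25, out_prefix="batch"):
    """Write each chunk as its own PDF and return the output paths."""
    from pypdf import PdfReader, PdfWriter  # third-party: pip install pypdf

    reader = PdfReader(pdf_path)
    paths = []
    for i, (start, end) in enumerate(
        chunk_ranges(len(reader.pages), pages_per_chunk), start=1
    ):
        writer = PdfWriter()
        for page in reader.pages[start - 1:end]:
            writer.add_page(page)
        out_path = f"{out_prefix}_{i}.pdf"
        with open(out_path, "wb") as fh:
            writer.write(fh)
        paths.append(out_path)
    return paths
```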

Error Handling

# Python - Robust multi-document processing
def process_with_validation(extractions):
    valid_documents = []
    issues = []

    for i, doc in enumerate(extractions):
        # Check for required fields
        if not doc.get("DocumentType"):
            issues.append(f"Document {i+1}: Missing document type")
            continue

        # Check confidence threshold
        confidence = doc.get("Confidence", 0)
        if confidence < 0.7:
            issues.append(
                f"Document {i+1}: Low confidence ({confidence:.2f})"
            )

        # Validate fields are present
        fields = doc.get("Fields", {})
        if not fields:
            issues.append(f"Document {i+1}: No fields extracted")
            continue

        valid_documents.append(doc)

    return {
        "valid": valid_documents,
        "issues": issues,
        "success_rate": len(valid_documents) / len(extractions) if extractions else 0
    }
    

Performance Tips

  • Batch related documents – Combine invoices from the same vendor
  • Monitor processing time – Large files take longer; plan accordingly
  • Use async processing – For very large batches, consider async workflows
  • Cache results – Store extractions to avoid reprocessing
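
For very large workloads, a thread pool is a simple way to run several batch files concurrently against the same /sync endpoint. A sketch that takes any per-file processing function, such as the process_multi_document helper from the Python example above:

```python
from concurrent.futures import ThreadPoolExecutor

def process_files_in_parallel(file_paths, process_fn, max_workers=4):
    """Apply process_fn to each file path concurrently; returns {path: result}."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(process_fn, path): path for path in file_paths}
        return {path: future.result() for future, path in futures.items()}
```

Here process_fn would be a small wrapper, e.g. lambda p: process_multi_document(p, api_key, context_id). Keep max_workers modest so you stay within your account's rate limits.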

Troubleshooting

Common Issues

Issue                        Possible Cause                                Solution
---------------------------  --------------------------------------------  -----------------------------------------------
Only one document detected   Documents too similar or no clear boundaries  Add separator pages or pre-split the file
Too many documents detected  Multi-page documents being split incorrectly  Review page layouts; contact support for tuning
Wrong page assignments       Ambiguous document boundaries                 Use clearer document separators or headers
Mixed classifications        Similar document types being confused         Request Context tuning for your document types
Timeout errors               File too large or too many documents          Split into smaller batches (25-50 pages each)

Debugging Tips

  1. Check page counts – Verify the sum of all PageCount values equals your PDF’s total pages
  2. Review confidence scores – Low confidence may indicate boundary issues
  3. Test with smaller files – Start with 2-3 documents to validate behavior
  4. Compare page ranges – Ensure no pages are missing or duplicated

# Validation helper
def validate_page_coverage(extractions, total_pages):
    covered_pages = set()

    for doc in extractions:
        pages = doc.get("Pages", "")
        if not pages:
            continue  # skip documents without page range information

        # Parse page range (e.g., "1-3" or "5")
        if "-" in pages:
            start, end = map(int, pages.split("-"))
            covered_pages.update(range(start, end + 1))
        else:
            covered_pages.add(int(pages))

    expected = set(range(1, total_pages + 1))
    missing = expected - covered_pages
    extra = covered_pages - expected

    if missing:
        print(f"Warning: Pages not covered: {sorted(missing)}")
    if extra:
        print(f"Warning: Invalid pages referenced: {sorted(extra)}")

    return len(missing) == 0 and len(extra) == 0
    

Getting Help

If you encounter persistent issues with multi-document processing:

  1. Note the TraceId from the API response
  2. Save a sample file that demonstrates the issue
  3. Document the expected vs. actual number of documents
  4. Contact support@docdigitizer.com with these details