DocDigitizer Multi-Document Files API Guide

Learn how to process a single PDF file that contains multiple documents. DocDigitizer automatically detects document boundaries and extracts data from each document individually.

Overview

Multi-document files are single PDF files that contain multiple logical documents. Common examples include:

  • Scanned batches – Multiple invoices scanned together into one PDF
  • Email attachments – Combined PDFs from email threads
  • Archive exports – Bulk exports from document management systems
  • Bank statements – Monthly statements combined into yearly files
  • Invoice bundles – Multiple invoices from the same vendor

Why Use Multi-Document Processing?

Benefit              Description
Reduced API Calls    Process many documents with a single request
Automatic Splitting  No need to manually split PDFs before processing
Preserved Context    Related documents maintain their relationship
Batch Efficiency     Optimized processing for high-volume workflows

How It Works

When you submit a multi-document PDF, the DocDigitizer pipeline:

  1. Receives the PDF – The API accepts your file upload
  2. Analyzes structure – The Document Partitioner examines page layouts and content
  3. Detects boundaries – AI identifies where one document ends and another begins
  4. Classifies each document – Each segment is classified independently
  5. Extracts data – Data extraction runs on each identified document
  6. Returns combined results – All extractions are returned in a single response

Processing Flow

┌─────────────────────────────────────────────────────────────────┐
│                     Multi-Document PDF                          │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐       │
│  │ Page 1-2 │  │ Page 3-4 │  │ Page 5   │  │ Page 6-8 │       │
│  │Invoice #1│  │Invoice #2│  │Receipt   │  │Invoice #3│       │
│  └──────────┘  └──────────┘  └──────────┘  └──────────┘       │
└─────────────────────────────────────────────────────────────────┘
                              │
                              ▼
                    ┌─────────────────┐
                    │   Partitioner   │
                    │  (AI Analysis)  │
                    └─────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────────┐
│                      API Response                               │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐       │
│  │Extract 1 │  │Extract 2 │  │Extract 3 │  │Extract 4 │       │
│  │Pages: 1-2│  │Pages: 3-4│  │Pages: 5  │  │Pages: 6-8│       │
│  └──────────┘  └──────────┘  └──────────┘  └──────────┘       │
└─────────────────────────────────────────────────────────────────┘
    

Enabling Document Partitioning

Document partitioning is controlled by your Context configuration. Contact your account manager to enable or configure partitioning for your Context ID.

Configuration Options

Option                 Description                                                 Default
Partitioning Enabled   Whether to detect and split multi-document files            Enabled
Max Documents          Maximum number of documents to extract per file             50
Classification Mode    Classify each document independently or use a single type   Independent

Standard API Request

No special parameters are needed. Simply submit your multi-document PDF:

curl -X POST https://apix.docdigitizer.com/sync \
  -H "x-api-key: YOUR_API_KEY" \
  -F "files=@multi_document_batch.pdf" \
  -F "id=550e8400-e29b-41d4-a716-446655440000" \
  -F "contextId=your-context-id"
    

Response Structure

When processing multi-document files, the API returns an array of extractions, one for each detected document.

Response Format

{
    "StateText": "DONE",
    "ExternalId": "550e8400-e29b-41d4-a716-446655440000",
    "TraceId": "ABC123XYZ",
    "Output": [
        {
            "DocumentType": "Invoice",
            "Pages": "1-2",
            "PageCount": 2,
            "Confidence": 0.95,
            "Fields": {
                "InvoiceNumber": "INV-2024-001",
                "InvoiceDate": "2024-01-15",
                "TotalAmount": 1250.00,
                "VendorName": "Supplier Corp"
            }
        },
        {
            "DocumentType": "Invoice",
            "Pages": "3-4",
            "PageCount": 2,
            "Confidence": 0.92,
            "Fields": {
                "InvoiceNumber": "INV-2024-002",
                "InvoiceDate": "2024-01-16",
                "TotalAmount": 875.50,
                "VendorName": "Another Vendor"
            }
        },
        {
            "DocumentType": "Receipt",
            "Pages": "5",
            "PageCount": 1,
            "Confidence": 0.88,
            "Fields": {
                "ReceiptNumber": "RCP-5544",
                "Date": "2024-01-17",
                "Amount": 45.99,
                "Merchant": "Office Supplies Inc"
            }
        }
    ]
}
    

Key Response Fields

Field                    Type      Description
Output                   Array     Array of extraction objects, one per detected document
Output[].DocumentType    String    Classified document type for this segment
Output[].Pages           String    Page range in the original PDF (e.g., “1-2”, “5”)
Output[].PageCount       Integer   Number of pages in this document segment
Output[].Confidence      Float     Classification confidence score (0.0 to 1.0)
Output[].Fields          Object    Extracted fields specific to the document type
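
The fields above can be mapped onto a small typed wrapper so downstream code works with attributes instead of raw dictionary keys. This is an illustrative sketch, not an official client; the attribute names simply mirror the documented response keys:

```python
from dataclasses import dataclass, field
from typing import Any

@dataclass
class Extraction:
    """One detected document segment from the Output array."""
    document_type: str
    pages: str
    page_count: int
    confidence: float
    fields: dict = field(default_factory=dict)

def parse_output(response: dict) -> list:
    """Convert a raw API response into a list of Extraction objects."""
    return [
        Extraction(
            document_type=doc.get("DocumentType", "Unknown"),
            pages=doc.get("Pages", ""),
            page_count=doc.get("PageCount", 0),
            confidence=doc.get("Confidence", 0.0),
            fields=doc.get("Fields", {}),
        )
        for doc in response.get("Output", [])
    ]
```

Using `.get()` with defaults keeps the parser tolerant of segments where optional keys are absent.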

Processing Examples

Python

import requests
import uuid

def process_multi_document(file_path, api_key, context_id):
    """Process a multi-document PDF and return all extractions."""

    url = "https://apix.docdigitizer.com/sync"

    headers = {
        "x-api-key": api_key
    }

    with open(file_path, "rb") as f:
        files = {"files": f}
        data = {
            "id": str(uuid.uuid4()),
            "contextId": context_id
        }

        response = requests.post(url, headers=headers, files=files, data=data)

    if response.status_code == 200:
        result = response.json()

        if result.get("StateText") == "DONE":
            extractions = result.get("Output", [])
            print(f"Found {len(extractions)} documents in file")
            return extractions
        else:
            print(f"Processing failed: {result.get('Messages', [])}")
            return None
    else:
        print(f"HTTP Error: {response.status_code}")
        return None


# Process a batch file
extractions = process_multi_document(
    "invoice_batch.pdf",
    "dd_live_your_api_key",
    "your-context-id"
)

# Handle each extraction
if extractions:
    for i, doc in enumerate(extractions):
        print(f"\n--- Document {i + 1} ---")
        print(f"Type: {doc.get('DocumentType')}")
        print(f"Pages: {doc.get('Pages')}")

        fields = doc.get("Fields", {})
        for field_name, field_value in fields.items():
            print(f"  {field_name}: {field_value}")
    

PowerShell

function Process-MultiDocumentFile {
    param(
        [string]$FilePath,
        [string]$ApiKey,
        [string]$ContextId
    )

    $headers = @{
        "x-api-key" = $ApiKey
    }

    $form = @{
        files = Get-Item -Path $FilePath
        id = [guid]::NewGuid().ToString()
        contextId = $ContextId
    }

    try {
        # Note: the -Form parameter requires PowerShell 6.1 or later
        $response = Invoke-RestMethod -Uri "https://apix.docdigitizer.com/sync" `
            -Method Post `
            -Headers $headers `
            -Form $form

        if ($response.StateText -eq "DONE") {
            $documents = $response.Output
            Write-Host "Found $($documents.Count) documents in file"
            return $documents
        }
        else {
            Write-Warning "Processing failed: $($response.Messages -join ', ')"
            return $null
        }
    }
    catch {
        Write-Error "Request failed: $_"
        return $null
    }
}

# Process a batch file
$documents = Process-MultiDocumentFile `
    -FilePath "C:\Documents\invoice_batch.pdf" `
    -ApiKey "dd_live_your_api_key" `
    -ContextId "your-context-id"

# Display results
if ($documents) {
    $docNumber = 1
    foreach ($doc in $documents) {
        Write-Host "`n--- Document $docNumber ---"
        Write-Host "Type: $($doc.DocumentType)"
        Write-Host "Pages: $($doc.Pages)"

        $doc.Fields.PSObject.Properties | ForEach-Object {
            Write-Host "  $($_.Name): $($_.Value)"
        }

        $docNumber++
    }
}
    

JavaScript (Node.js)

const FormData = require('form-data');
const fs = require('fs');
const fetch = require('node-fetch'); // node-fetch v2 (v3 is ESM-only; Node 18+ has a built-in fetch)
const { v4: uuidv4 } = require('uuid');

async function processMultiDocument(filePath, apiKey, contextId) {
    const form = new FormData();
    form.append('files', fs.createReadStream(filePath));
    form.append('id', uuidv4());
    form.append('contextId', contextId);

    const response = await fetch('https://apix.docdigitizer.com/sync', {
        method: 'POST',
        headers: {
            'x-api-key': apiKey
        },
        body: form
    });

    const result = await response.json();

    if (result.StateText === 'DONE') {
        console.log(`Found ${result.Output.length} documents`);
        return result.Output;
    } else {
        console.error('Processing failed:', result.Messages);
        return null;
    }
}

// Process and handle results
(async () => {
    const documents = await processMultiDocument(
        'invoice_batch.pdf',
        'dd_live_your_api_key',
        'your-context-id'
    );

    if (documents) {
        documents.forEach((doc, index) => {
            console.log(`\n--- Document ${index + 1} ---`);
            console.log(`Type: ${doc.DocumentType}`);
            console.log(`Pages: ${doc.Pages}`);

            Object.entries(doc.Fields || {}).forEach(([key, value]) => {
                console.log(`  ${key}: ${value}`);
            });
        });
    }
})();
    

Handling Multiple Results

Iterating Through Documents

Always check if the response contains an array of extractions:

# Python example
output = result.get("Output", [])

# Ensure we have a list
if not isinstance(output, list):
    output = [output]

for index, document in enumerate(output):
    document_type = document.get("DocumentType")
    page_range = document.get("Pages")
    fields = document.get("Fields", {})

    # Process each document...
    print(f"Document {index + 1}: {document_type} (pages {page_range})")
    

Grouping by Document Type

# Python - Group extractions by document type
from collections import defaultdict

def group_by_type(extractions):
    grouped = defaultdict(list)

    for doc in extractions:
        doc_type = doc.get("DocumentType", "Unknown")
        grouped[doc_type].append(doc)

    return dict(grouped)

# Usage
extractions = result.get("Output", [])
grouped = group_by_type(extractions)

print(f"Invoices: {len(grouped.get('Invoice', []))}")
print(f"Receipts: {len(grouped.get('Receipt', []))}")
print(f"Other: {len(grouped.get('Unknown', []))}")
    

Aggregating Financial Data

# Python - Sum totals from multiple invoices
def aggregate_invoice_totals(extractions):
    total = 0.0
    invoice_count = 0

    for doc in extractions:
        if doc.get("DocumentType") == "Invoice":
            fields = doc.get("Fields", {})
            amount = fields.get("TotalAmount", 0)

            # Handle string amounts
            if isinstance(amount, str):
                amount = float(amount.replace(",", "").replace("$", ""))

            total += amount
            invoice_count += 1

    return {
        "invoice_count": invoice_count,
        "total_amount": total
    }

# Usage
summary = aggregate_invoice_totals(extractions)
print(f"Processed {summary['invoice_count']} invoices")
print(f"Total value: ${summary['total_amount']:,.2f}")
    

Document Boundary Detection

The Document Partitioner uses multiple signals to detect where documents begin and end:

Detection Methods

Signal             Description                                          Example
Visual Layout      Significant changes in page layout or structure      Letterhead appearing after body text
Header Detection   Recognizable document headers (logo, company name)   Invoice header, letter greeting
Page Numbering     Page numbers restarting at 1                         “Page 1 of 3” after “Page 2 of 2”
Document IDs       New document identifiers appearing                   Different invoice numbers
Date Patterns      Significant date changes in headers                  January invoice followed by February
Content Analysis   AI analysis of semantic content shifts               Invoice content changing to contract

Page Range Information

Each extraction includes page range information to help you:

  • Identify which pages belong to which document
  • Cross-reference with original file if needed
  • Verify correct splitting for quality assurance
  • Generate split PDFs for archival

Example extraction with page range metadata:

{
    "DocumentType": "Invoice",
    "Pages": "3-5",
    "PageCount": 3,
    "StartPage": 3,
    "EndPage": 5,
    "Fields": { ... }
}
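
To work with these ranges programmatically, for example to split the original PDF for archival, you first need to expand the Pages string into individual page numbers. A minimal stdlib-only sketch (the actual PDF splitting would then be done with a PDF library of your choice):

```python
def parse_page_range(pages: str) -> list:
    """Expand a "Pages" string like "3-5" or "5" into a list of page numbers."""
    if "-" in pages:
        start, end = map(int, pages.split("-"))
        return list(range(start, end + 1))
    return [int(pages)]
```

The same parsing logic appears in the validation helper in the Troubleshooting section below.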
    

Mixed Document Types

Multi-document files can contain different document types. Each segment is classified independently.

Example: Mixed Batch Processing

# A single PDF containing invoices, receipts, and purchase orders
{
    "StateText": "DONE",
    "Output": [
        {
            "DocumentType": "Invoice",
            "Pages": "1-2",
            "Fields": {
                "InvoiceNumber": "INV-001",
                "TotalAmount": 1500.00
            }
        },
        {
            "DocumentType": "PurchaseOrder",
            "Pages": "3",
            "Fields": {
                "PONumber": "PO-2024-555",
                "OrderTotal": 2200.00
            }
        },
        {
            "DocumentType": "Receipt",
            "Pages": "4",
            "Fields": {
                "ReceiptNumber": "RCP-789",
                "Amount": 49.99
            }
        },
        {
            "DocumentType": "Invoice",
            "Pages": "5-6",
            "Fields": {
                "InvoiceNumber": "INV-002",
                "TotalAmount": 875.00
            }
        }
    ]
}
    

Processing Mixed Types

# Python - Route documents by type
def process_mixed_batch(extractions):
    results = {
        "invoices": [],
        "purchase_orders": [],
        "receipts": [],
        "other": []
    }

    for doc in extractions:
        doc_type = doc.get("DocumentType", "").lower()

        if doc_type == "invoice":
            results["invoices"].append(process_invoice(doc))
        elif doc_type == "purchaseorder":
            results["purchase_orders"].append(process_po(doc))
        elif doc_type == "receipt":
            results["receipts"].append(process_receipt(doc))
        else:
            results["other"].append(doc)

    return results

def process_invoice(doc):
    fields = doc.get("Fields", {})
    return {
        "number": fields.get("InvoiceNumber"),
        "amount": fields.get("TotalAmount"),
        "pages": doc.get("Pages"),
        "type": "invoice"
    }

def process_po(doc):
    fields = doc.get("Fields", {})
    return {
        "number": fields.get("PONumber"),
        "amount": fields.get("OrderTotal"),
        "pages": doc.get("Pages"),
        "type": "purchase_order"
    }

def process_receipt(doc):
    fields = doc.get("Fields", {})
    return {
        "number": fields.get("ReceiptNumber"),
        "amount": fields.get("Amount"),
        "pages": doc.get("Pages"),
        "type": "receipt"
    }
    

Best Practices

File Preparation

  • Consistent orientation – Ensure all pages have correct rotation
  • Reasonable file size – Keep files under 50MB for optimal processing
  • Clear scans – Use at least 200 DPI for scanned documents
  • Document limits – Stay within 50 documents per file for best results
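
The preparation rules above can be enforced with a small pre-flight check before upload. The 50MB limit comes from this guide; the checks themselves are an illustrative sketch, not part of the API:

```python
import os

MAX_FILE_BYTES = 50 * 1024 * 1024  # 50MB, per the guideline above

def preflight_check(file_path: str) -> list:
    """Return a list of warnings; an empty list means the file looks safe to upload."""
    if not os.path.isfile(file_path):
        return [f"File not found: {file_path}"]

    warnings = []
    if os.path.getsize(file_path) > MAX_FILE_BYTES:
        warnings.append("File exceeds 50MB; consider splitting into smaller batches")
    if not file_path.lower().endswith(".pdf"):
        warnings.append("File does not have a .pdf extension")
    return warnings
```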

Processing Strategies

Strategy                   When to Use
Single large batch         When documents are related and need to be processed together
Multiple smaller batches   When processing very large volumes (100+ documents)
Pre-split files            When you need precise control over document boundaries
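
For the multiple-smaller-batches strategy, a simple chunking helper keeps each submission within the recommended limits. The batch size of 25 is an illustrative default, not an API requirement:

```python
def chunk_batches(items: list, batch_size: int = 25):
    """Yield successive slices of at most batch_size items."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]
```

Each chunk can then be submitted as its own request, or combined into its own multi-document PDF before upload.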

Error Handling

# Python - Robust multi-document processing
def process_with_validation(extractions):
    valid_documents = []
    issues = []

    for i, doc in enumerate(extractions):
        # Check for required fields
        if not doc.get("DocumentType"):
            issues.append(f"Document {i+1}: Missing document type")
            continue

        # Check confidence threshold
        confidence = doc.get("Confidence", 0)
        if confidence < 0.7:
            issues.append(
                f"Document {i+1}: Low confidence ({confidence:.2f})"
            )

        # Validate fields are present
        fields = doc.get("Fields", {})
        if not fields:
            issues.append(f"Document {i+1}: No fields extracted")
            continue

        valid_documents.append(doc)

    return {
        "valid": valid_documents,
        "issues": issues,
        "success_rate": len(valid_documents) / len(extractions) if extractions else 0
    }
    

Performance Tips

  • Batch related documents – Combine invoices from the same vendor
  • Monitor processing time – Large files take longer; plan accordingly
  • Use async processing – For very large batches, consider async workflows
  • Cache results – Store extractions to avoid reprocessing
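
The cache-results tip can be implemented by keying stored extractions on a content hash of the file, so an unchanged file is never reprocessed. A minimal stdlib-only sketch (the cache directory name is arbitrary):

```python
import hashlib
import json
import os

def cached_process(file_path: str, process_fn, cache_dir: str = ".dd_cache"):
    """Run process_fn(file_path) once per unique file content, caching the JSON result."""
    with open(file_path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()

    os.makedirs(cache_dir, exist_ok=True)
    cache_path = os.path.join(cache_dir, f"{digest}.json")

    if os.path.exists(cache_path):
        with open(cache_path) as f:
            return json.load(f)  # cache hit: skip reprocessing

    result = process_fn(file_path)
    with open(cache_path, "w") as f:
        json.dump(result, f)
    return result
```

Here process_fn would be a function like process_multi_document from the Python example above.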

Troubleshooting

Common Issues

Issue                        Possible Cause                                  Solution
Only one document detected   Documents too similar or no clear boundaries    Add separator pages or pre-split the file
Too many documents detected  Multi-page documents being split incorrectly    Review page layouts; contact support for tuning
Wrong page assignments       Ambiguous document boundaries                   Use clearer document separators or headers
Mixed classifications        Similar document types being confused           Request Context tuning for your document types
Timeout errors               File too large or too many documents            Split into smaller batches (25-50 pages each)

Debugging Tips

  1. Check page counts – Verify the sum of all PageCount values equals your PDF’s total pages
  2. Review confidence scores – Low confidence may indicate boundary issues
  3. Test with smaller files – Start with 2-3 documents to validate behavior
  4. Compare page ranges – Ensure no pages are missing or duplicated

# Validation helper
def validate_page_coverage(extractions, total_pages):
    covered_pages = set()

    for doc in extractions:
        pages = doc.get("Pages", "")

        # Parse page range (e.g., "1-3" or "5")
        if "-" in pages:
            start, end = map(int, pages.split("-"))
            covered_pages.update(range(start, end + 1))
        else:
            covered_pages.add(int(pages))

    expected = set(range(1, total_pages + 1))
    missing = expected - covered_pages
    extra = covered_pages - expected

    if missing:
        print(f"Warning: Pages not covered: {sorted(missing)}")
    if extra:
        print(f"Warning: Invalid pages referenced: {sorted(extra)}")

    return len(missing) == 0 and len(extra) == 0
    

Getting Help

If you encounter persistent issues with multi-document processing:

  1. Note the TraceId from the API response
  2. Save a sample file that demonstrates the issue
  3. Document the expected vs. actual number of documents
  4. Contact support@docdigitizer.com with these details