Processing Multi-Document Files
Overview
Multi-document files are single PDF files that contain multiple logical documents. Common examples include:
- Scanned batches – Multiple invoices scanned together into one PDF
- Email attachments – Combined PDFs from email threads
- Archive exports – Bulk exports from document management systems
- Bank statements – Monthly statements combined into yearly files
- Invoice bundles – Multiple invoices from the same vendor
Why Use Multi-Document Processing?
| Benefit | Description |
|---|---|
| Reduced API Calls | Process many documents with a single request |
| Automatic Splitting | No need to manually split PDFs before processing |
| Preserved Context | Related documents maintain their relationship |
| Batch Efficiency | Optimized processing for high-volume workflows |
How It Works
When you submit a multi-document PDF, the DocDigitizer pipeline:
1. Receives the PDF – The API accepts your file upload
2. Analyzes structure – The Document Partitioner examines page layouts and content
3. Detects boundaries – AI identifies where one document ends and another begins
4. Classifies each document – Each segment is classified independently
5. Extracts data – Data extraction runs on each identified document
6. Returns combined results – All extractions are returned in a single response
Processing Flow
┌──────────────────────────────────────────────────────────────┐
│                      Multi-Document PDF                      │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐      │
│  │ Page 1-2 │  │ Page 3-4 │  │ Page 5   │  │ Page 6-8 │      │
│  │Invoice #1│  │Invoice #2│  │Receipt   │  │Invoice #3│      │
│  └──────────┘  └──────────┘  └──────────┘  └──────────┘      │
└──────────────────────────────────────────────────────────────┘
                               │
                               ▼
                     ┌─────────────────┐
                     │   Partitioner   │
                     │  (AI Analysis)  │
                     └─────────────────┘
                               │
                               ▼
┌──────────────────────────────────────────────────────────────┐
│                         API Response                         │
│  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌──────────┐      │
│  │Extract 1 │  │Extract 2 │  │Extract 3 │  │Extract 4 │      │
│  │Pages: 1-2│  │Pages: 3-4│  │Pages: 5  │  │Pages: 6-8│      │
│  └──────────┘  └──────────┘  └──────────┘  └──────────┘      │
└──────────────────────────────────────────────────────────────┘
Enabling Document Partitioning
Document partitioning is controlled by your Context configuration. Contact your account manager to enable or configure partitioning for your Context ID.
Configuration Options
| Option | Description | Default |
|---|---|---|
| Partitioning Enabled | Whether to detect and split multi-document files | Enabled |
| Max Documents | Maximum number of documents to extract per file | 50 |
| Classification Mode | Classify each document independently or use single type | Independent |
Standard API Request
No special parameters are needed. Simply submit your multi-document PDF:
curl -X POST https://apix.docdigitizer.com/sync \
-H "x-api-key: YOUR_API_KEY" \
-F "files=@multi_document_batch.pdf" \
-F "id=550e8400-e29b-41d4-a716-446655440000" \
-F "contextId=your-context-id"
Response Structure
When processing multi-document files, the API returns an array of extractions, one for each detected document.
Response Format
{
  "StateText": "DONE",
  "ExternalId": "550e8400-e29b-41d4-a716-446655440000",
  "TraceId": "ABC123XYZ",
  "Output": [
    {
      "DocumentType": "Invoice",
      "Pages": "1-2",
      "PageCount": 2,
      "Confidence": 0.95,
      "Fields": {
        "InvoiceNumber": "INV-2024-001",
        "InvoiceDate": "2024-01-15",
        "TotalAmount": 1250.00,
        "VendorName": "Supplier Corp"
      }
    },
    {
      "DocumentType": "Invoice",
      "Pages": "3-4",
      "PageCount": 2,
      "Confidence": 0.92,
      "Fields": {
        "InvoiceNumber": "INV-2024-002",
        "InvoiceDate": "2024-01-16",
        "TotalAmount": 875.50,
        "VendorName": "Another Vendor"
      }
    },
    {
      "DocumentType": "Receipt",
      "Pages": "5",
      "PageCount": 1,
      "Confidence": 0.88,
      "Fields": {
        "ReceiptNumber": "RCP-5544",
        "Date": "2024-01-17",
        "Amount": 45.99,
        "Merchant": "Office Supplies Inc"
      }
    }
  ]
}
Key Response Fields
| Field | Type | Description |
|---|---|---|
| Output | Array | Array of extraction objects, one per detected document |
| Output[].DocumentType | String | Classified document type for this segment |
| Output[].Pages | String | Page range in the original PDF (e.g., "1-2", "5") |
| Output[].PageCount | Integer | Number of pages in this document segment |
| Output[].Confidence | Float | Classification confidence score (0.0 to 1.0) |
| Output[].Fields | Object | Extracted fields specific to the document type |
Processing Examples
Python
import requests
import uuid
def process_multi_document(file_path, api_key, context_id):
    """Process a multi-document PDF and return all extractions."""
    url = "https://apix.docdigitizer.com/sync"
    headers = {
        "x-api-key": api_key
    }
    with open(file_path, "rb") as f:
        files = {"files": f}
        data = {
            "id": str(uuid.uuid4()),
            "contextId": context_id
        }
        response = requests.post(url, headers=headers, files=files, data=data)
    if response.status_code == 200:
        result = response.json()
        if result.get("StateText") == "DONE":
            extractions = result.get("Output", [])
            print(f"Found {len(extractions)} documents in file")
            return extractions
        else:
            print(f"Processing failed: {result.get('Messages', [])}")
            return None
    else:
        print(f"HTTP Error: {response.status_code}")
        return None

# Process a batch file
extractions = process_multi_document(
    "invoice_batch.pdf",
    "dd_live_your_api_key",
    "your-context-id"
)

# Handle each extraction
if extractions:
    for i, doc in enumerate(extractions):
        print(f"\n--- Document {i + 1} ---")
        print(f"Type: {doc.get('DocumentType')}")
        print(f"Pages: {doc.get('Pages')}")
        fields = doc.get("Fields", {})
        for field_name, field_value in fields.items():
            print(f"  {field_name}: {field_value}")
PowerShell
function Process-MultiDocumentFile {
    param(
        [string]$FilePath,
        [string]$ApiKey,
        [string]$ContextId
    )
    $headers = @{
        "x-api-key" = $ApiKey
    }
    $form = @{
        files     = Get-Item -Path $FilePath
        id        = [guid]::NewGuid().ToString()
        contextId = $ContextId
    }
    try {
        $response = Invoke-RestMethod -Uri "https://apix.docdigitizer.com/sync" `
            -Method Post `
            -Headers $headers `
            -Form $form
        if ($response.StateText -eq "DONE") {
            $documents = $response.Output
            Write-Host "Found $($documents.Count) documents in file"
            return $documents
        }
        else {
            Write-Warning "Processing failed: $($response.Messages -join ', ')"
            return $null
        }
    }
    catch {
        Write-Error "Request failed: $_"
        return $null
    }
}

# Process a batch file
$documents = Process-MultiDocumentFile `
    -FilePath "C:\Documents\invoice_batch.pdf" `
    -ApiKey "dd_live_your_api_key" `
    -ContextId "your-context-id"

# Display results
if ($documents) {
    $docNumber = 1
    foreach ($doc in $documents) {
        Write-Host "`n--- Document $docNumber ---"
        Write-Host "Type: $($doc.DocumentType)"
        Write-Host "Pages: $($doc.Pages)"
        $doc.Fields.PSObject.Properties | ForEach-Object {
            Write-Host "  $($_.Name): $($_.Value)"
        }
        $docNumber++
    }
}
JavaScript (Node.js)
const FormData = require('form-data');
const fs = require('fs');
const fetch = require('node-fetch');
const { v4: uuidv4 } = require('uuid');
async function processMultiDocument(filePath, apiKey, contextId) {
  const form = new FormData();
  form.append('files', fs.createReadStream(filePath));
  form.append('id', uuidv4());
  form.append('contextId', contextId);

  const response = await fetch('https://apix.docdigitizer.com/sync', {
    method: 'POST',
    headers: {
      'x-api-key': apiKey
    },
    body: form
  });

  const result = await response.json();
  if (result.StateText === 'DONE') {
    console.log(`Found ${result.Output.length} documents`);
    return result.Output;
  } else {
    console.error('Processing failed:', result.Messages);
    return null;
  }
}

// Process and handle results
(async () => {
  const documents = await processMultiDocument(
    'invoice_batch.pdf',
    'dd_live_your_api_key',
    'your-context-id'
  );

  if (documents) {
    documents.forEach((doc, index) => {
      console.log(`\n--- Document ${index + 1} ---`);
      console.log(`Type: ${doc.DocumentType}`);
      console.log(`Pages: ${doc.Pages}`);
      Object.entries(doc.Fields || {}).forEach(([key, value]) => {
        console.log(`  ${key}: ${value}`);
      });
    });
  }
})();
Handling Multiple Results
Iterating Through Documents
Always check if the response contains an array of extractions:
# Python example
output = result.get("Output", [])

# Ensure we have a list
if not isinstance(output, list):
    output = [output]

for index, document in enumerate(output):
    document_type = document.get("DocumentType")
    page_range = document.get("Pages")
    fields = document.get("Fields", {})
    # Process each document...
    print(f"Document {index + 1}: {document_type} (pages {page_range})")
Grouping by Document Type
# Python - Group extractions by document type
from collections import defaultdict
def group_by_type(extractions):
    grouped = defaultdict(list)
    for doc in extractions:
        doc_type = doc.get("DocumentType", "Unknown")
        grouped[doc_type].append(doc)
    return dict(grouped)
# Usage
extractions = result.get("Output", [])
grouped = group_by_type(extractions)
print(f"Invoices: {len(grouped.get('Invoice', []))}")
print(f"Receipts: {len(grouped.get('Receipt', []))}")
print(f"Other: {len(grouped.get('Unknown', []))}")
Aggregating Financial Data
# Python - Sum totals from multiple invoices
def aggregate_invoice_totals(extractions):
    total = 0.0
    invoice_count = 0
    for doc in extractions:
        if doc.get("DocumentType") == "Invoice":
            fields = doc.get("Fields", {})
            amount = fields.get("TotalAmount", 0)
            # Handle string amounts
            if isinstance(amount, str):
                amount = float(amount.replace(",", "").replace("$", ""))
            total += amount
            invoice_count += 1
    return {
        "invoice_count": invoice_count,
        "total_amount": total
    }
# Usage
summary = aggregate_invoice_totals(extractions)
print(f"Processed {summary['invoice_count']} invoices")
print(f"Total value: ${summary['total_amount']:,.2f}")
Document Boundary Detection
The Document Partitioner uses multiple signals to detect where documents begin and end:
Detection Methods
| Signal | Description | Example |
|---|---|---|
| Visual Layout | Significant changes in page layout or structure | Letterhead appearing after body text |
| Header Detection | Recognizable document headers (logo, company name) | Invoice header, letter greeting |
| Page Numbering | Page numbers restarting at 1 | “Page 1 of 3” after “Page 2 of 2” |
| Document IDs | New document identifiers appearing | Different invoice numbers |
| Date Patterns | Significant date changes in headers | January invoice followed by February |
| Content Analysis | AI analysis of semantic content shifts | Invoice content changing to contract |
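Boundary detection runs entirely server-side, so you never implement these signals yourself. Still, the page-numbering signal is easy to illustrate. The sketch below is a hypothetical client-side approximation of that one heuristic (useful, for example, when sanity-checking a split), not the actual partitioner logic:

```python
import re

# Matches "Page 1 of 3"-style footers; one of the signals listed above.
PAGE_OF = re.compile(r"Page\s+(\d+)\s+of\s+(\d+)", re.IGNORECASE)

def boundaries_from_page_numbers(page_texts):
    """Given the extracted text of each PDF page, return the 0-based
    indices where a new document likely starts, based on page
    numbering restarting at 1."""
    starts = [0]  # the first page always starts a document
    for i, text in enumerate(page_texts[1:], start=1):
        match = PAGE_OF.search(text)
        if match and int(match.group(1)) == 1:
            starts.append(i)
    return starts
```

The real partitioner combines all six signals, so do not expect this single heuristic to reproduce its output.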
Page Range Information
Each extraction includes page range information to help you:
- Identify which pages belong to which document
- Cross-reference with original file if needed
- Verify correct splitting for quality assurance
- Generate split PDFs for archival
{
  "DocumentType": "Invoice",
  "Pages": "3-5",
  "PageCount": 3,
  "StartPage": 3,
  "EndPage": 5,
  "Fields": { ... }
}
Mixed Document Types
Multi-document files can contain different document types. Each segment is classified independently.
Example: Mixed Batch Processing
# A single PDF containing invoices, receipts, and purchase orders
{
  "StateText": "DONE",
  "Output": [
    {
      "DocumentType": "Invoice",
      "Pages": "1-2",
      "Fields": {
        "InvoiceNumber": "INV-001",
        "TotalAmount": 1500.00
      }
    },
    {
      "DocumentType": "PurchaseOrder",
      "Pages": "3",
      "Fields": {
        "PONumber": "PO-2024-555",
        "OrderTotal": 2200.00
      }
    },
    {
      "DocumentType": "Receipt",
      "Pages": "4",
      "Fields": {
        "ReceiptNumber": "RCP-789",
        "Amount": 49.99
      }
    },
    {
      "DocumentType": "Invoice",
      "Pages": "5-6",
      "Fields": {
        "InvoiceNumber": "INV-002",
        "TotalAmount": 875.00
      }
    }
  ]
}
Processing Mixed Types
# Python - Route documents by type
def process_mixed_batch(extractions):
    results = {
        "invoices": [],
        "purchase_orders": [],
        "receipts": [],
        "other": []
    }
    for doc in extractions:
        doc_type = doc.get("DocumentType", "").lower()
        if doc_type == "invoice":
            results["invoices"].append(process_invoice(doc))
        elif doc_type == "purchaseorder":
            results["purchase_orders"].append(process_po(doc))
        elif doc_type == "receipt":
            results["receipts"].append(process_receipt(doc))
        else:
            results["other"].append(doc)
    return results

def process_invoice(doc):
    fields = doc.get("Fields", {})
    return {
        "number": fields.get("InvoiceNumber"),
        "amount": fields.get("TotalAmount"),
        "pages": doc.get("Pages"),
        "type": "invoice"
    }

def process_po(doc):
    fields = doc.get("Fields", {})
    return {
        "number": fields.get("PONumber"),
        "amount": fields.get("OrderTotal"),
        "pages": doc.get("Pages"),
        "type": "purchase_order"
    }

def process_receipt(doc):
    fields = doc.get("Fields", {})
    return {
        "number": fields.get("ReceiptNumber"),
        "amount": fields.get("Amount"),
        "pages": doc.get("Pages"),
        "type": "receipt"
    }
Best Practices
File Preparation
- Consistent orientation – Ensure all pages have correct rotation
- Reasonable file size – Keep files under 50MB for optimal processing
- Clear scans – Use at least 200 DPI for scanned documents
- Document limits – Stay within 50 documents per file for best results
Processing Strategies
| Strategy | When to Use |
|---|---|
| Single large batch | When documents are related and need to be processed together |
| Multiple smaller batches | When processing very large volumes (100+ documents) |
| Pre-split files | When you need precise control over document boundaries |
Error Handling
# Python - Robust multi-document processing
def process_with_validation(extractions):
    valid_documents = []
    issues = []
    for i, doc in enumerate(extractions):
        # Check for required fields
        if not doc.get("DocumentType"):
            issues.append(f"Document {i+1}: Missing document type")
            continue
        # Check confidence threshold
        confidence = doc.get("Confidence", 0)
        if confidence < 0.7:
            issues.append(
                f"Document {i+1}: Low confidence ({confidence:.2f})"
            )
        # Validate fields are present
        fields = doc.get("Fields", {})
        if not fields:
            issues.append(f"Document {i+1}: No fields extracted")
            continue
        valid_documents.append(doc)
    return {
        "valid": valid_documents,
        "issues": issues,
        "success_rate": len(valid_documents) / len(extractions) if extractions else 0
    }
Performance Tips
- Batch related documents – Combine invoices from the same vendor
- Monitor processing time – Large files take longer; plan accordingly
- Use async processing – For very large batches, consider async workflows
- Cache results – Store extractions to avoid reprocessing
Troubleshooting
Common Issues
| Issue | Possible Cause | Solution |
|---|---|---|
| Only one document detected | Documents too similar or no clear boundaries | Add separator pages or pre-split the file |
| Too many documents detected | Multi-page documents being split incorrectly | Review page layouts; contact support for tuning |
| Wrong page assignments | Ambiguous document boundaries | Use clearer document separators or headers |
| Mixed classifications | Similar document types being confused | Request Context tuning for your document types |
| Timeout errors | File too large or too many documents | Split into smaller batches (25-50 pages each) |
Debugging Tips
- Check page counts – Verify the sum of all PageCount values equals your PDF’s total pages
- Review confidence scores – Low confidence may indicate boundary issues
- Test with smaller files – Start with 2-3 documents to validate behavior
- Compare page ranges – Ensure no pages are missing or duplicated
# Validation helper
def validate_page_coverage(extractions, total_pages):
    covered_pages = set()
    for doc in extractions:
        pages = doc.get("Pages", "")
        # Parse page range (e.g., "1-3" or "5")
        if "-" in pages:
            start, end = map(int, pages.split("-"))
            covered_pages.update(range(start, end + 1))
        else:
            covered_pages.add(int(pages))
    expected = set(range(1, total_pages + 1))
    missing = expected - covered_pages
    extra = covered_pages - expected
    if missing:
        print(f"Warning: Pages not covered: {sorted(missing)}")
    if extra:
        print(f"Warning: Invalid pages referenced: {sorted(extra)}")
    return len(missing) == 0 and len(extra) == 0
Getting Help
If you encounter persistent issues with multi-document processing:
- Note the TraceId from the API response
- Save a sample file that demonstrates the issue
- Document the expected vs. actual number of documents
- Contact support@docdigitizer.com with these details