Custom Schemas

Learn how to define custom extraction schemas to capture the specific fields your business needs from documents. Custom schemas allow you to tailor DocDigitizer's extraction to your exact requirements.

Overview

A schema defines the structure of data extracted from a document type. It specifies which fields to extract, their data types, validation rules, and how they should be formatted in the response.

Why Use Custom Schemas?

Benefit                  | Description
-------------------------|-------------------------------------------------------
Business-Specific Fields | Extract fields unique to your industry or workflows
Consistent Output        | Ensure all extractions follow the same structure
Integration Ready        | Match output format to your system’s requirements
Validation Rules         | Define constraints and data quality requirements
Reduced Post-Processing  | Get data in the format you need without transformation

Schema Flow

┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│   Your Schema   │ ──▶ │ Schema Registry │ ──▶ │  Your Context   │
│   Definition    │     │    (Storage)    │     │  Configuration  │
└─────────────────┘     └─────────────────┘     └─────────────────┘
                                                         │
                                                         ▼
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│  Structured     │ ◀── │   Extraction    │ ◀── │   Document      │
│  JSON Output    │     │    Engine       │     │   Upload        │
└─────────────────┘     └─────────────────┘     └─────────────────┘
    

Understanding Schemas

A schema is a JSON document that describes the expected structure of extracted data. It includes:

  • Field definitions – Names and types of data to extract
  • Hierarchies – How fields are organized (flat, nested, arrays)
  • Data types – String, number, date, boolean, etc.
  • Constraints – Required fields, formats, value ranges
  • Descriptions – Help text for extraction guidance

Simple Schema Example

{
    "name": "SimpleInvoice",
    "version": "1.0",
    "description": "Basic invoice extraction schema",
    "fields": [
        {
            "name": "InvoiceNumber",
            "type": "string",
            "required": true,
            "description": "Unique invoice identifier"
        },
        {
            "name": "InvoiceDate",
            "type": "date",
            "required": true,
            "description": "Date the invoice was issued"
        },
        {
            "name": "TotalAmount",
            "type": "currency",
            "required": true,
            "description": "Total amount due"
        },
        {
            "name": "VendorName",
            "type": "string",
            "required": true,
            "description": "Name of the vendor/supplier"
        }
    ]
}
    

Resulting Extraction

{
    "DocumentType": "SimpleInvoice",
    "Fields": {
        "InvoiceNumber": "INV-2024-0042",
        "InvoiceDate": "2024-01-15",
        "TotalAmount": 1234.56,
        "VendorName": "Acme Corporation"
    }
}
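
To illustrate how the schema and the extraction relate, the sketch below checks that every field marked required in the schema is present in the Fields object of a result. The helper name missing_required_fields is hypothetical and not part of any DocDigitizer SDK; it assumes only the schema and result shapes shown above.

```python
def missing_required_fields(schema, document):
    """Return names of required top-level schema fields absent from the result."""
    fields = document.get("Fields", {})
    return [
        field["name"]
        for field in schema.get("fields", [])
        if field.get("required") and fields.get(field["name"]) is None
    ]


# Trimmed copies of the schema and result shown above
schema = {
    "name": "SimpleInvoice",
    "fields": [
        {"name": "InvoiceNumber", "type": "string", "required": True},
        {"name": "InvoiceDate", "type": "date", "required": True},
        {"name": "TotalAmount", "type": "currency", "required": True},
        {"name": "VendorName", "type": "string", "required": True},
    ],
}
document = {
    "DocumentType": "SimpleInvoice",
    "Fields": {
        "InvoiceNumber": "INV-2024-0042",
        "InvoiceDate": "2024-01-15",
        "TotalAmount": 1234.56,
        "VendorName": "Acme Corporation",
    },
}

print(missing_required_fields(schema, document))
```

An empty list means every required field came back with a value; any names in the list are candidates for manual review.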
    

Standard Schemas

DocDigitizer provides pre-built schemas for common document types. These can be used as-is or as a starting point for customization.

Available Standard Schemas

Document Type  | Key Fields                                            | Use Case
---------------|-------------------------------------------------------|-----------------------------
Invoice        | Invoice number, date, vendor, line items, totals, tax | Accounts payable automation
Receipt        | Merchant, date, items, total, payment method          | Expense management
PurchaseOrder  | PO number, vendor, items, delivery date, terms        | Procurement workflows
BankStatement  | Account info, period, transactions, balances          | Financial reconciliation
Contract       | Parties, effective date, terms, signatures            | Legal document processing
IDDocument     | Name, ID number, dates, nationality, photo            | Identity verification
CitizenCard    | Personal info, document numbers, validity, filiation  | Government ID processing
DriversLicense | Name, license number, categories, expiry              | License verification

Contact your account manager to see which standard schemas are available for your Context.


Requesting Custom Schemas

To create a custom schema for your organization:

Step 1: Define Your Requirements

Document the fields you need to extract:

  1. List all fields required for your business process
  2. Identify the data type for each field
  3. Note which fields are required vs. optional
  4. Describe any validation requirements
  5. Provide sample documents

Step 2: Create a Field Specification

Use this template to document each field:

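Field Name       | Type     | Required | Format/Rules               | Description
-----------------|----------|----------|----------------------------|-------------------------------------
CustomerPONumber | string   | Yes      | Alphanumeric, max 20 chars | Customer’s purchase order reference
DeliveryDate     | date     | No       | ISO 8601 format            | Expected delivery date
UnitPrice        | currency | Yes      | 2 decimal places           | Price per unit before tax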
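
If you keep the field specification in a spreadsheet or script, a short helper can turn its rows into draft field definitions to attach to your request. This is a sketch only: spec_to_fields is a hypothetical helper, and folding the Format/Rules column into the description is an assumption made for illustration, not a documented schema convention.

```python
import json

# Rows follow the template columns:
# (Field Name, Type, Required, Format/Rules, Description)
SPEC_ROWS = [
    ("CustomerPONumber", "string", "Yes", "Alphanumeric, max 20 chars",
     "Customer's purchase order reference"),
    ("DeliveryDate", "date", "No", "ISO 8601 format", "Expected delivery date"),
    ("UnitPrice", "currency", "Yes", "2 decimal places", "Price per unit before tax"),
]


def spec_to_fields(rows):
    """Convert specification rows into draft schema field definitions."""
    fields = []
    for name, field_type, required, rules, description in rows:
        fields.append({
            "name": name,
            "type": field_type,
            # The template uses Yes/No; the schema uses booleans
            "required": required.strip().lower() == "yes",
            # Assumption: keep the free-text rules alongside the description
            "description": f"{description} ({rules})",
        })
    return fields


draft_fields = spec_to_fields(SPEC_ROWS)
print(json.dumps(draft_fields, indent=4))
```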

Step 3: Submit Your Request

Send your field specification to:

  • Email: support@docdigitizer.com
  • Subject: “Custom Schema Request – [Your Company Name]”
  • Include: Field specification, sample documents (3-5 examples), Context ID

Step 4: Review and Approval

Our team will:

  1. Review your field specification
  2. Analyze sample documents for field locations
  3. Configure the schema in the registry
  4. Test extraction accuracy
  5. Deploy to your Context
  6. Provide confirmation and documentation

Typical turnaround: 3-5 business days for straightforward schemas, longer for complex requirements.


Schema Structure

A complete schema definition includes metadata and field definitions:

Full Schema Example

{
    "name": "CustomInvoice",
    "version": "2.1",
    "description": "Custom invoice schema for manufacturing industry",
    "documentType": "Invoice",
    "metadata": {
        "author": "DocDigitizer",
        "created": "2024-01-01",
        "modified": "2024-06-15"
    },
    "fields": [
        {
            "name": "InvoiceNumber",
            "type": "string",
            "required": true,
            "description": "Invoice identifier",
            "aliases": ["Invoice #", "Invoice No", "Inv Number"]
        },
        {
            "name": "InvoiceDate",
            "type": "date",
            "required": true,
            "format": "ISO8601",
            "description": "Invoice issue date"
        },
        {
            "name": "CustomerPO",
            "type": "string",
            "required": false,
            "description": "Customer purchase order reference"
        },
        {
            "name": "Vendor",
            "type": "object",
            "required": true,
            "fields": [
                {"name": "Name", "type": "string", "required": true},
                {"name": "TaxID", "type": "string", "required": false},
                {"name": "Address", "type": "string", "required": false}
            ]
        },
        {
            "name": "LineItems",
            "type": "array",
            "required": true,
            "items": {
                "type": "object",
                "fields": [
                    {"name": "Description", "type": "string", "required": true},
                    {"name": "PartNumber", "type": "string", "required": false},
                    {"name": "Quantity", "type": "number", "required": true},
                    {"name": "UnitPrice", "type": "currency", "required": true},
                    {"name": "Amount", "type": "currency", "required": true}
                ]
            }
        },
        {
            "name": "Subtotal",
            "type": "currency",
            "required": true
        },
        {
            "name": "TaxAmount",
            "type": "currency",
            "required": false
        },
        {
            "name": "TotalAmount",
            "type": "currency",
            "required": true
        }
    ]
}
    

Schema Properties

Property     | Required | Description
-------------|----------|--------------------------------------------
name         | Yes      | Unique identifier for the schema
version      | Yes      | Schema version number (semver recommended)
description  | No       | Human-readable description
documentType | No       | Base document type (Invoice, Receipt, etc.)
metadata     | No       | Additional schema metadata
fields       | Yes      | Array of field definitions
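
Before submitting a draft schema, it can help to confirm that the required properties listed above are present. The following sketch uses a hypothetical helper, check_schema_properties; it reflects only the table above and is not a DocDigitizer validation routine.

```python
REQUIRED_SCHEMA_PROPERTIES = ("name", "version", "fields")


def check_schema_properties(schema):
    """Return a list of problems found in a draft schema definition."""
    problems = [
        f"missing required property: {key}"
        for key in REQUIRED_SCHEMA_PROPERTIES
        if key not in schema
    ]
    # 'fields' must be an array of field definitions
    if "fields" in schema and not isinstance(schema["fields"], list):
        problems.append("'fields' must be an array of field definitions")
    return problems


draft = {"name": "CustomInvoice", "description": "Draft without version or fields"}
for problem in check_schema_properties(draft):
    print(problem)
```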

Field Types

Each field has a type that determines how the extracted value is processed and formatted.

Primitive Types

Type       | Description                        | Example Value
-----------|------------------------------------|------------------------
string     | Text value                         | "INV-2024-001"
number     | Numeric value (integer or decimal) | 42, 3.14159
integer    | Whole number only                  | 100
boolean    | True or false                      | true, false
date       | Date value (ISO 8601)              | "2024-01-15"
datetime   | Date and time                      | "2024-01-15T14:30:00Z"
currency   | Monetary amount                    | 1234.56
percentage | Percentage value                   | 23.5 (for 23.5%)
email      | Email address                      | "contact@example.com"
phone      | Phone number                       | "+1-555-123-4567"
url        | Web URL                            | "https://example.com"

Complex Types

Type   | Description               | Use Case
-------|---------------------------|---------------------------------------
object | Nested group of fields    | Address, Vendor info, Contact details
array  | List of values or objects | Line items, Transactions, Attachments

Type Examples

// String field
{
    "name": "InvoiceNumber",
    "type": "string",
    "required": true
}

// Currency field
{
    "name": "TotalAmount",
    "type": "currency",
    "required": true,
    "format": {
        "decimals": 2
    }
}

// Date field with format
{
    "name": "DueDate",
    "type": "date",
    "required": false,
    "format": "ISO8601"
}

// Percentage field
{
    "name": "TaxRate",
    "type": "percentage",
    "required": false,
    "description": "Tax rate as percentage (e.g., 23 for 23%)"
}
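
On the consuming side, typed values arrive as JSON strings and numbers, so it is often convenient to convert them into richer Python types. The sketch below shows one possible mapping. The to_python helper and the CONVERTERS table are hypothetical, and the choice of Decimal for currency and percentage values is a client-side preference rather than something the API prescribes.

```python
from datetime import date, datetime
from decimal import Decimal


def parse_datetime(value):
    """Parse an ISO 8601 timestamp, normalizing a trailing 'Z' to an explicit UTC offset."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


# Schema type name -> converter; types not listed are returned unchanged
CONVERTERS = {
    "date": date.fromisoformat,
    "datetime": parse_datetime,
    "currency": lambda v: Decimal(str(v)),
    "percentage": lambda v: Decimal(str(v)),
    "integer": int,
}


def to_python(field_type, value):
    """Convert an extracted JSON value according to its schema type."""
    if value is None:
        return None
    converter = CONVERTERS.get(field_type)
    return converter(value) if converter else value


print(to_python("date", "2024-01-15"))
print(to_python("datetime", "2024-01-15T14:30:00Z"))
print(to_python("currency", 1234.56))
```

Going through str() before Decimal avoids carrying binary floating-point noise into monetary calculations.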
    

Field Validation

Fields can include validation rules to ensure data quality.

Common Validation Rules

Rule      | Applies To       | Description
----------|------------------|-----------------------------
required  | All types        | Field must have a value
minLength | string           | Minimum string length
maxLength | string           | Maximum string length
pattern   | string           | Regex pattern to match
minimum   | number, currency | Minimum numeric value
maximum   | number, currency | Maximum numeric value
enum      | string           | List of allowed values
format    | Various          | Specific format requirement

Validation Examples

// String with length constraints
{
    "name": "PONumber",
    "type": "string",
    "required": true,
    "minLength": 5,
    "maxLength": 20
}

// String with pattern (regex)
{
    "name": "TaxID",
    "type": "string",
    "required": true,
    "pattern": "^[0-9]{9}$",
    "description": "9-digit tax identification number"
}

// Number with range
{
    "name": "Quantity",
    "type": "number",
    "required": true,
    "minimum": 1,
    "maximum": 10000
}

// Enum (fixed values)
{
    "name": "PaymentMethod",
    "type": "string",
    "required": false,
    "enum": ["Cash", "Credit Card", "Bank Transfer", "Check"]
}

// Currency with constraints
{
    "name": "TotalAmount",
    "type": "currency",
    "required": true,
    "minimum": 0,
    "format": {
        "decimals": 2
    }
}
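
The same rules can be mirrored on the client to double-check values after extraction. The sketch below is illustrative only: validate_value is a hypothetical helper, and treating minimum and maximum as inclusive bounds and pattern as an unanchored regex search are assumptions borrowed from common JSON Schema conventions, not confirmed platform behavior.

```python
import re


def validate_value(field, value):
    """Return a list of rule violations for one extracted value."""
    if value is None:
        return ["value is required"] if field.get("required") else []

    errors = []
    if isinstance(value, str):
        if "minLength" in field and len(value) < field["minLength"]:
            errors.append(f"shorter than minLength {field['minLength']}")
        if "maxLength" in field and len(value) > field["maxLength"]:
            errors.append(f"longer than maxLength {field['maxLength']}")
        if "pattern" in field and not re.search(field["pattern"], value):
            errors.append(f"does not match pattern {field['pattern']}")
    # bool is a subclass of int in Python, so exclude it from numeric checks
    if isinstance(value, (int, float)) and not isinstance(value, bool):
        if "minimum" in field and value < field["minimum"]:
            errors.append(f"below minimum {field['minimum']}")
        if "maximum" in field and value > field["maximum"]:
            errors.append(f"above maximum {field['maximum']}")
    if "enum" in field and value not in field["enum"]:
        errors.append("not one of the allowed values")
    return errors


tax_id = {"name": "TaxID", "type": "string", "required": True, "pattern": "^[0-9]{9}$"}
quantity = {"name": "Quantity", "type": "number", "required": True,
            "minimum": 1, "maximum": 10000}

print(validate_value(tax_id, "12345"))
print(validate_value(quantity, 0))
```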
    

Nested Objects and Arrays

Object Fields

Use object fields to group related data:

{
    "name": "Vendor",
    "type": "object",
    "required": true,
    "description": "Vendor/supplier information",
    "fields": [
        {
            "name": "Name",
            "type": "string",
            "required": true,
            "description": "Company name"
        },
        {
            "name": "TaxID",
            "type": "string",
            "required": false,
            "description": "Tax identification number"
        },
        {
            "name": "Address",
            "type": "object",
            "required": false,
            "fields": [
                {"name": "Street", "type": "string"},
                {"name": "City", "type": "string"},
                {"name": "PostalCode", "type": "string"},
                {"name": "Country", "type": "string"}
            ]
        },
        {
            "name": "Contact",
            "type": "object",
            "required": false,
            "fields": [
                {"name": "Email", "type": "email"},
                {"name": "Phone", "type": "phone"}
            ]
        }
    ]
}
    

Result:

{
    "Vendor": {
        "Name": "Acme Corporation",
        "TaxID": "123456789",
        "Address": {
            "Street": "123 Main Street",
            "City": "New York",
            "PostalCode": "10001",
            "Country": "USA"
        },
        "Contact": {
            "Email": "billing@acme.com",
            "Phone": "+1-555-123-4567"
        }
    }
}
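
Some downstream systems (CSV exports, flat database tables) cannot store nested objects directly. One common approach, sketched below, is to flatten the result into dotted keys. The flatten helper is hypothetical and the dotted-path naming is a design choice of this example, not a DocDigitizer output format.

```python
def flatten(fields, prefix=""):
    """Flatten nested objects into a single dict keyed by dotted paths."""
    flat = {}
    for key, value in fields.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into nested objects such as Address or Contact
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


result = {
    "Vendor": {
        "Name": "Acme Corporation",
        "TaxID": "123456789",
        "Address": {
            "Street": "123 Main Street",
            "City": "New York",
            "PostalCode": "10001",
            "Country": "USA",
        },
        "Contact": {"Email": "billing@acme.com", "Phone": "+1-555-123-4567"},
    }
}

for path, value in flatten(result).items():
    print(f"{path} = {value}")
```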
    

Array Fields

Use array fields for repeating items like line items or transactions:

{
    "name": "LineItems",
    "type": "array",
    "required": true,
    "description": "Invoice line items",
    "minItems": 1,
    "maxItems": 100,
    "items": {
        "type": "object",
        "fields": [
            {
                "name": "LineNumber",
                "type": "integer",
                "required": false
            },
            {
                "name": "Description",
                "type": "string",
                "required": true
            },
            {
                "name": "SKU",
                "type": "string",
                "required": false
            },
            {
                "name": "Quantity",
                "type": "number",
                "required": true,
                "minimum": 0
            },
            {
                "name": "UnitPrice",
                "type": "currency",
                "required": true
            },
            {
                "name": "Discount",
                "type": "percentage",
                "required": false
            },
            {
                "name": "TotalPrice",
                "type": "currency",
                "required": true
            }
        ]
    }
}
    

Result:

{
    "LineItems": [
        {
            "LineNumber": 1,
            "Description": "Widget A - Premium Grade",
            "SKU": "WGT-A-001",
            "Quantity": 10,
            "UnitPrice": 25.00,
            "Discount": 5,
            "TotalPrice": 237.50
        },
        {
            "LineNumber": 2,
            "Description": "Service Fee",
            "Quantity": 1,
            "UnitPrice": 50.00,
            "TotalPrice": 50.00
        }
    ]
}
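
Array results lend themselves to simple arithmetic cross-checks. The sketch below recomputes each line total from quantity, unit price, and discount and reports lines that disagree. It assumes Discount is a percentage applied to the line amount, which is consistent with the sample data above (10 × 25.00 less 5% = 237.50) but is not stated by the schema; expected_total and mismatched_lines are hypothetical helpers.

```python
from decimal import Decimal


def expected_total(item):
    """Recompute a line total: quantity x unit price, less a percentage discount."""
    quantity = Decimal(str(item["Quantity"]))
    unit_price = Decimal(str(item["UnitPrice"]))
    # Discount is optional; treat a missing value as 0%
    discount = Decimal(str(item.get("Discount") or 0))
    total = quantity * unit_price * (Decimal(100) - discount) / Decimal(100)
    return total.quantize(Decimal("0.01"))


def mismatched_lines(line_items):
    """Return the LineNumber of every item whose TotalPrice differs from the recomputed total."""
    return [
        item.get("LineNumber")
        for item in line_items
        if expected_total(item) != Decimal(str(item["TotalPrice"]))
    ]


line_items = [
    {"LineNumber": 1, "Description": "Widget A - Premium Grade", "SKU": "WGT-A-001",
     "Quantity": 10, "UnitPrice": 25.00, "Discount": 5, "TotalPrice": 237.50},
    {"LineNumber": 2, "Description": "Service Fee",
     "Quantity": 1, "UnitPrice": 50.00, "TotalPrice": 50.00},
]

print(mismatched_lines(line_items))
```

A non-empty list is a useful signal to route the document for manual review rather than proof of an extraction error, since invoices sometimes apply discounts differently.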
    

Simple Arrays

Arrays can also contain simple values:

{
    "name": "Tags",
    "type": "array",
    "required": false,
    "items": {
        "type": "string"
    }
}

// Result: {"Tags": ["urgent", "reviewed", "q1-2024"]}
    

Working with Schema Results

Python Example

import requests
import uuid

def extract_with_schema(file_path, api_key, context_id):
    """Extract document using custom schema."""

    url = "https://apix.docdigitizer.com/sync"

    headers = {"x-api-key": api_key}

    with open(file_path, "rb") as f:
        files = {"files": f}
        data = {
            "id": str(uuid.uuid4()),
            "contextId": context_id
        }

        response = requests.post(url, headers=headers, files=files, data=data)

    return response.json()


def process_custom_invoice(result):
    """Process extraction result from custom invoice schema."""

    if result.get("StateText") != "DONE":
        print(f"Error: {result.get('Messages', [])}")
        return None

    output = result.get("Output", [])
    if not output:
        return None

    # Handle single or multiple documents
    doc = output[0] if isinstance(output, list) else output
    fields = doc.get("Fields", {})

    # Access nested vendor object
    vendor = fields.get("Vendor", {})
    vendor_name = vendor.get("Name", "Unknown")
    vendor_address = vendor.get("Address", {})

    # Process line items array
    line_items = fields.get("LineItems", [])

    invoice_data = {
        "invoice_number": fields.get("InvoiceNumber"),
        "invoice_date": fields.get("InvoiceDate"),
        "vendor": {
            "name": vendor_name,
            "tax_id": vendor.get("TaxID"),
            "city": vendor_address.get("City")
        },
        "line_items": [
            {
                "description": item.get("Description"),
                "quantity": item.get("Quantity"),
                "unit_price": item.get("UnitPrice"),
                "total": item.get("TotalPrice")
            }
            for item in line_items
        ],
        "subtotal": fields.get("Subtotal"),
        "tax": fields.get("TaxAmount"),
        "total": fields.get("TotalAmount")
    }

    return invoice_data


# Usage
result = extract_with_schema("invoice.pdf", "your_api_key", "your_context_id")
invoice = process_custom_invoice(result)

if invoice:
    print(f"Invoice: {invoice['invoice_number']}")
    print(f"Vendor: {invoice['vendor']['name']}")
    # Amounts can be None when a field was not extracted; fall back to 0 before formatting
    total = invoice['total'] or 0
    print(f"Total: ${total:.2f}")

    print("\nLine Items:")
    for item in invoice['line_items']:
        unit_price = item['unit_price'] or 0
        print(f"  - {item['description']}: {item['quantity']} x ${unit_price:.2f}")
    

PowerShell Example

function Process-CustomInvoice {
    param(
        [PSObject]$Result
    )

    if ($Result.StateText -ne "DONE") {
        Write-Warning "Extraction failed: $($Result.Messages -join ', ')"
        return $null
    }

    $doc = $Result.Output
    if ($doc -is [array]) { $doc = $doc[0] }

    $fields = $doc.Fields

    # Create structured output
    $invoice = [PSCustomObject]@{
        InvoiceNumber = $fields.InvoiceNumber
        InvoiceDate = $fields.InvoiceDate
        VendorName = $fields.Vendor.Name
        VendorTaxID = $fields.Vendor.TaxID
        VendorCity = $fields.Vendor.Address.City
        LineItemCount = $fields.LineItems.Count
        Subtotal = $fields.Subtotal
        Tax = $fields.TaxAmount
        Total = $fields.TotalAmount
    }

    return $invoice
}

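# Usage ($ApiKey and $ContextId are assumed to be set; -Form requires PowerShell 6.1 or later)
# The form mirrors the fields sent in the Python example above
$form = @{
    files     = Get-Item -Path "invoice.pdf"
    id        = [guid]::NewGuid().ToString()
    contextId = $ContextId
}
$response = Invoke-RestMethod -Uri "https://apix.docdigitizer.com/sync" `
    -Method Post -Headers @{"x-api-key" = $ApiKey} -Form $form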

$invoice = Process-CustomInvoice -Result $response

if ($invoice) {
    Write-Host "Invoice: $($invoice.InvoiceNumber)"
    Write-Host "Vendor: $($invoice.VendorName)"
    Write-Host "Total: $($invoice.Total)"
}
    

Handling Missing Fields

# Python - Safe field access with defaults
def safe_get(data, *keys, default=None):
    """Safely navigate nested dictionaries."""
    result = data
    for key in keys:
        if isinstance(result, dict):
            result = result.get(key)
        else:
            return default
        if result is None:
            return default
    return result

# Usage
vendor_city = safe_get(fields, "Vendor", "Address", "City", default="N/A")
tax_amount = safe_get(fields, "TaxAmount", default=0.0)
    

Schema Versioning

Schemas are versioned to manage changes over time without breaking existing integrations.

Version Format

Schema versions use a two-part scheme based on semantic versioning (Major.Minor):

  • Major – Breaking changes (removed fields, type changes)
  • Minor – Backward-compatible additions (new optional fields)
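
A small sketch of how a client might interpret these version strings follows. The helper names parse_version and is_breaking_change are hypothetical. Parsing into integer tuples, rather than comparing strings, keeps "2.10" ordered after "2.9".

```python
def parse_version(version):
    """Split a 'Major.Minor' string into a tuple of integers."""
    major, _, minor = version.partition(".")
    return int(major), int(minor or 0)


def is_breaking_change(built_against, received):
    """True when the received schema has a different major version than the client expects."""
    return parse_version(received)[0] != parse_version(built_against)[0]


print(parse_version("2.1"))
print(is_breaking_change("2.1", "2.3"))
print(is_breaking_change("1.0", "2.1"))
```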

Version in Response

{
    "StateText": "DONE",
    "Output": [
        {
            "DocumentType": "CustomInvoice",
            "SchemaVersion": "2.1",
            "Fields": { ... }
        }
    ]
}
    

Handling Version Changes

# Python - Version-aware processing
def process_invoice(doc):
    version = doc.get("SchemaVersion", "1.0")
    fields = doc.get("Fields", {})

    major_version = int(version.split(".")[0])

    if major_version >= 2:
        # v2.x has nested Vendor object
        vendor_name = fields.get("Vendor", {}).get("Name")
    else:
        # v1.x had flat VendorName field
        vendor_name = fields.get("VendorName")

    # Add any other fields your integration needs to the returned dict
    return {"vendor": vendor_name}
    

Best Practices

Schema Design

  • Start simple – Begin with essential fields, add more as needed
  • Use clear names – Field names should be self-explanatory
  • Group logically – Use nested objects for related fields
  • Document everything – Add descriptions to help extraction accuracy
  • Plan for nulls – Most fields should be optional unless truly required

Field Naming Conventions

Convention                 | Example
---------------------------|-----------------------------------------------
PascalCase for field names | InvoiceNumber, TotalAmount
Descriptive but concise    | VendorTaxID not VendorTaxIdentificationNumber
Consistent prefixes        | VendorName, VendorAddress, VendorPhone
No abbreviations (mostly)  | Quantity not Qty

Integration Tips

  • Validate responses – Check required fields are present
  • Handle nulls gracefully – Optional fields may be missing
  • Log schema versions – Track which version produced each extraction
  • Test with edge cases – Verify behavior with unusual documents
  • Monitor accuracy – Review extractions regularly for quality

Common Pitfalls

Pitfall                   | Solution
--------------------------|-----------------------------------------------------
Too many required fields  | Only mark fields required if truly essential
Overly specific patterns  | Allow for format variations in documents
Missing aliases           | Provide alternative field names for better matching
Deep nesting              | Keep nesting to 2-3 levels maximum
Ignoring sample diversity | Provide varied sample documents for training
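
The deep-nesting guideline can be checked mechanically on a draft schema. The sketch below counts levels by treating top-level fields as level 1 and each nested object (or array of objects) as one level deeper. That counting convention and the nesting_depth helper are assumptions of this example.

```python
def nesting_depth(fields):
    """Return the depth of the deepest field; top-level fields count as level 1."""
    depth = 0
    for field in fields:
        # Objects keep children under 'fields'; arrays of objects under items.fields
        children = field.get("fields") or field.get("items", {}).get("fields") or []
        depth = max(depth, 1 + nesting_depth(children))
    return depth


schema_fields = [
    {"name": "InvoiceNumber", "type": "string"},
    {
        "name": "Vendor",
        "type": "object",
        "fields": [
            {"name": "Name", "type": "string"},
            {
                "name": "Address",
                "type": "object",
                "fields": [{"name": "City", "type": "string"}],
            },
        ],
    },
    {"name": "Tags", "type": "array", "items": {"type": "string"}},
]

depth = nesting_depth(schema_fields)
print(f"Deepest nesting level: {depth}")
if depth > 3:
    print("Consider flattening the schema")
```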