Custom Schemas
Overview
A schema defines the structure of data extracted from a document type. It specifies which fields to extract, their data types, validation rules, and how they should be formatted in the response.
Why Use Custom Schemas?
| Benefit | Description |
|---|---|
| Business-Specific Fields | Extract fields unique to your industry or workflows |
| Consistent Output | Ensure all extractions follow the same structure |
| Integration Ready | Match output format to your system’s requirements |
| Validation Rules | Define constraints and data quality requirements |
| Reduced Post-Processing | Get data in the format you need without transformation |
Schema Flow
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ Your Schema │ ──▶ │ Schema Registry │ ──▶ │ Your Context │
│ Definition │ │ (Storage) │ │ Configuration │
└─────────────────┘ └─────────────────┘ └─────────────────┘
│
▼
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ Structured │ ◀── │ Extraction │ ◀── │ Document │
│ JSON Output │ │ Engine │ │ Upload │
└─────────────────┘ └─────────────────┘ └─────────────────┘
Understanding Schemas
A schema is a JSON document that describes the expected structure of extracted data. It includes:
- Field definitions – Names and types of data to extract
- Hierarchies – How fields are organized (flat, nested, arrays)
- Data types – String, number, date, boolean, etc.
- Constraints – Required fields, formats, value ranges
- Descriptions – Help text for extraction guidance
Simple Schema Example
{
"name": "SimpleInvoice",
"version": "1.0",
"description": "Basic invoice extraction schema",
"fields": [
{
"name": "InvoiceNumber",
"type": "string",
"required": true,
"description": "Unique invoice identifier"
},
{
"name": "InvoiceDate",
"type": "date",
"required": true,
"description": "Date the invoice was issued"
},
{
"name": "TotalAmount",
"type": "currency",
"required": true,
"description": "Total amount due"
},
{
"name": "VendorName",
"type": "string",
"required": true,
"description": "Name of the vendor/supplier"
}
]
}
Resulting Extraction
{
"DocumentType": "SimpleInvoice",
"Fields": {
"InvoiceNumber": "INV-2024-0042",
"InvoiceDate": "2024-01-15",
"TotalAmount": 1234.56,
"VendorName": "Acme Corporation"
}
}
Standard Schemas
DocDigitizer provides pre-built schemas for common document types. These can be used as-is or as a starting point for customization.
Available Standard Schemas
| Document Type | Key Fields | Use Case |
|---|---|---|
| Invoice | Invoice number, date, vendor, line items, totals, tax | Accounts payable automation |
| Receipt | Merchant, date, items, total, payment method | Expense management |
| PurchaseOrder | PO number, vendor, items, delivery date, terms | Procurement workflows |
| BankStatement | Account info, period, transactions, balances | Financial reconciliation |
| Contract | Parties, effective date, terms, signatures | Legal document processing |
| IDDocument | Name, ID number, dates, nationality, photo | Identity verification |
| CitizenCard | Personal info, document numbers, validity, filiation | Government ID processing |
| DriversLicense | Name, license number, categories, expiry | License verification |
Contact your account manager to see which standard schemas are available for your Context.
Requesting Custom Schemas
To create a custom schema for your organization:
Step 1: Define Your Requirements
Document the fields you need to extract:
- List all fields required for your business process
- Identify the data type for each field
- Note which fields are required vs. optional
- Describe any validation requirements
- Provide sample documents
Step 2: Create a Field Specification
Use this template to document each field:
| Field Name | Type | Required | Format/Rules | Description |
|---|---|---|---|---|
| CustomerPONumber | string | Yes | Alphanumeric, max 20 chars | Customer’s purchase order reference |
| DeliveryDate | date | No | ISO 8601 format | Expected delivery date |
| UnitPrice | currency | Yes | 2 decimal places | Price per unit before tax |
Step 3: Submit Your Request
Send your field specification to:
- Email: support@docdigitizer.com
- Subject: “Custom Schema Request – [Your Company Name]”
- Include: Field specification, sample documents (3-5 examples), Context ID
Step 4: Review and Approval
Our team will:
- Review your field specification
- Analyze sample documents for field locations
- Configure the schema in the registry
- Test extraction accuracy
- Deploy to your Context
- Provide confirmation and documentation
Typical turnaround: 3-5 business days for standard schemas, longer for complex requirements.
Schema Structure
A complete schema definition includes metadata and field definitions:
Full Schema Example
{
"name": "CustomInvoice",
"version": "2.1",
"description": "Custom invoice schema for manufacturing industry",
"documentType": "Invoice",
"metadata": {
"author": "DocDigitizer",
"created": "2024-01-01",
"modified": "2024-06-15"
},
"fields": [
{
"name": "InvoiceNumber",
"type": "string",
"required": true,
"description": "Invoice identifier",
"aliases": ["Invoice #", "Invoice No", "Inv Number"]
},
{
"name": "InvoiceDate",
"type": "date",
"required": true,
"format": "ISO8601",
"description": "Invoice issue date"
},
{
"name": "CustomerPO",
"type": "string",
"required": false,
"description": "Customer purchase order reference"
},
{
"name": "Vendor",
"type": "object",
"required": true,
"fields": [
{"name": "Name", "type": "string", "required": true},
{"name": "TaxID", "type": "string", "required": false},
{"name": "Address", "type": "string", "required": false}
]
},
{
"name": "LineItems",
"type": "array",
"required": true,
"items": {
"type": "object",
"fields": [
{"name": "Description", "type": "string", "required": true},
{"name": "PartNumber", "type": "string", "required": false},
{"name": "Quantity", "type": "number", "required": true},
{"name": "UnitPrice", "type": "currency", "required": true},
{"name": "Amount", "type": "currency", "required": true}
]
}
},
{
"name": "Subtotal",
"type": "currency",
"required": true
},
{
"name": "TaxAmount",
"type": "currency",
"required": false
},
{
"name": "TotalAmount",
"type": "currency",
"required": true
}
]
}
Schema Properties
| Property | Required | Description |
|---|---|---|
name |
Yes | Unique identifier for the schema |
version |
Yes | Schema version number (semver recommended) |
description |
No | Human-readable description |
documentType |
No | Base document type (Invoice, Receipt, etc.) |
metadata |
No | Additional schema metadata |
fields |
Yes | Array of field definitions |
Field Types
Each field has a type that determines how the extracted value is processed and formatted.
Primitive Types
| Type | Description | Example Value |
|---|---|---|
string |
Text value | "INV-2024-001" |
number |
Numeric value (integer or decimal) | 42, 3.14159 |
integer |
Whole number only | 100 |
boolean |
True or false | true, false |
date |
Date value (ISO 8601) | "2024-01-15" |
datetime |
Date and time | "2024-01-15T14:30:00Z" |
currency |
Monetary amount | 1234.56 |
percentage |
Percentage value | 23.5 (for 23.5%) |
email |
Email address | "contact@example.com" |
phone |
Phone number | "+1-555-123-4567" |
url |
Web URL | "https://example.com" |
Complex Types
| Type | Description | Use Case |
|---|---|---|
object |
Nested group of fields | Address, Vendor info, Contact details |
array |
List of values or objects | Line items, Transactions, Attachments |
Type Examples
// String field
{
"name": "InvoiceNumber",
"type": "string",
"required": true
}
// Currency field
{
"name": "TotalAmount",
"type": "currency",
"required": true,
"format": {
"decimals": 2
}
}
// Date field with format
{
"name": "DueDate",
"type": "date",
"required": false,
"format": "ISO8601"
}
// Percentage field
{
"name": "TaxRate",
"type": "percentage",
"required": false,
"description": "Tax rate as percentage (e.g., 23 for 23%)"
}
Field Validation
Fields can include validation rules to ensure data quality.
Common Validation Rules
| Rule | Applies To | Description |
|---|---|---|
required |
All types | Field must have a value |
minLength |
string | Minimum string length |
maxLength |
string | Maximum string length |
pattern |
string | Regex pattern to match |
minimum |
number, currency | Minimum numeric value |
maximum |
number, currency | Maximum numeric value |
enum |
string | List of allowed values |
format |
Various | Specific format requirement |
Validation Examples
// String with length constraints
{
"name": "PONumber",
"type": "string",
"required": true,
"minLength": 5,
"maxLength": 20
}
// String with pattern (regex)
{
"name": "TaxID",
"type": "string",
"required": true,
"pattern": "^[0-9]{9}$",
"description": "9-digit tax identification number"
}
// Number with range
{
"name": "Quantity",
"type": "number",
"required": true,
"minimum": 1,
"maximum": 10000
}
// Enum (fixed values)
{
"name": "PaymentMethod",
"type": "string",
"required": false,
"enum": ["Cash", "Credit Card", "Bank Transfer", "Check"]
}
// Currency with constraints
{
"name": "TotalAmount",
"type": "currency",
"required": true,
"minimum": 0,
"format": {
"decimals": 2
}
}
Nested Objects and Arrays
Object Fields
Use object fields to group related data:
{
"name": "Vendor",
"type": "object",
"required": true,
"description": "Vendor/supplier information",
"fields": [
{
"name": "Name",
"type": "string",
"required": true,
"description": "Company name"
},
{
"name": "TaxID",
"type": "string",
"required": false,
"description": "Tax identification number"
},
{
"name": "Address",
"type": "object",
"required": false,
"fields": [
{"name": "Street", "type": "string"},
{"name": "City", "type": "string"},
{"name": "PostalCode", "type": "string"},
{"name": "Country", "type": "string"}
]
},
{
"name": "Contact",
"type": "object",
"required": false,
"fields": [
{"name": "Email", "type": "email"},
{"name": "Phone", "type": "phone"}
]
}
]
}
Result:
{
"Vendor": {
"Name": "Acme Corporation",
"TaxID": "123456789",
"Address": {
"Street": "123 Main Street",
"City": "New York",
"PostalCode": "10001",
"Country": "USA"
},
"Contact": {
"Email": "billing@acme.com",
"Phone": "+1-555-123-4567"
}
}
}
Array Fields
Use array fields for repeating items like line items or transactions:
{
"name": "LineItems",
"type": "array",
"required": true,
"description": "Invoice line items",
"minItems": 1,
"maxItems": 100,
"items": {
"type": "object",
"fields": [
{
"name": "LineNumber",
"type": "integer",
"required": false
},
{
"name": "Description",
"type": "string",
"required": true
},
{
"name": "SKU",
"type": "string",
"required": false
},
{
"name": "Quantity",
"type": "number",
"required": true,
"minimum": 0
},
{
"name": "UnitPrice",
"type": "currency",
"required": true
},
{
"name": "Discount",
"type": "percentage",
"required": false
},
{
"name": "TotalPrice",
"type": "currency",
"required": true
}
]
}
}
Result:
{
"LineItems": [
{
"LineNumber": 1,
"Description": "Widget A - Premium Grade",
"SKU": "WGT-A-001",
"Quantity": 10,
"UnitPrice": 25.00,
"Discount": 5,
"TotalPrice": 237.50
},
{
"LineNumber": 2,
"Description": "Service Fee",
"Quantity": 1,
"UnitPrice": 50.00,
"TotalPrice": 50.00
}
]
}
Simple Arrays
Arrays can also contain simple values:
{
"name": "Tags",
"type": "array",
"required": false,
"items": {
"type": "string"
}
}
// Result: {"Tags": ["urgent", "reviewed", "q1-2024"]}
Working with Schema Results
Python Example
import requests
import uuid
def extract_with_schema(file_path, api_key, context_id):
"""Extract document using custom schema."""
url = "https://apix.docdigitizer.com/sync"
headers = {"x-api-key": api_key}
with open(file_path, "rb") as f:
files = {"files": f}
data = {
"id": str(uuid.uuid4()),
"contextId": context_id
}
response = requests.post(url, headers=headers, files=files, data=data)
return response.json()
def process_custom_invoice(result):
"""Process extraction result from custom invoice schema."""
if result.get("StateText") != "DONE":
print(f"Error: {result.get('Messages', [])}")
return None
output = result.get("Output", [])
if not output:
return None
# Handle single or multiple documents
doc = output[0] if isinstance(output, list) else output
fields = doc.get("Fields", {})
# Access nested vendor object
vendor = fields.get("Vendor", {})
vendor_name = vendor.get("Name", "Unknown")
vendor_address = vendor.get("Address", {})
# Process line items array
line_items = fields.get("LineItems", [])
invoice_data = {
"invoice_number": fields.get("InvoiceNumber"),
"invoice_date": fields.get("InvoiceDate"),
"vendor": {
"name": vendor_name,
"tax_id": vendor.get("TaxID"),
"city": vendor_address.get("City")
},
"line_items": [
{
"description": item.get("Description"),
"quantity": item.get("Quantity"),
"unit_price": item.get("UnitPrice"),
"total": item.get("TotalPrice")
}
for item in line_items
],
"subtotal": fields.get("Subtotal"),
"tax": fields.get("TaxAmount"),
"total": fields.get("TotalAmount")
}
return invoice_data
# Usage
result = extract_with_schema("invoice.pdf", "your_api_key", "your_context_id")
invoice = process_custom_invoice(result)
if invoice:
print(f"Invoice: {invoice['invoice_number']}")
print(f"Vendor: {invoice['vendor']['name']}")
print(f"Total: ${invoice['total']:.2f}")
print("\nLine Items:")
for item in invoice['line_items']:
print(f" - {item['description']}: {item['quantity']} x ${item['unit_price']:.2f}")
PowerShell Example
function Process-CustomInvoice {
param(
[PSObject]$Result
)
if ($Result.StateText -ne "DONE") {
Write-Warning "Extraction failed: $($Result.Messages -join ', ')"
return $null
}
$doc = $Result.Output
if ($doc -is [array]) { $doc = $doc[0] }
$fields = $doc.Fields
# Create structured output
$invoice = [PSCustomObject]@{
InvoiceNumber = $fields.InvoiceNumber
InvoiceDate = $fields.InvoiceDate
VendorName = $fields.Vendor.Name
VendorTaxID = $fields.Vendor.TaxID
VendorCity = $fields.Vendor.Address.City
LineItemCount = $fields.LineItems.Count
Subtotal = $fields.Subtotal
Tax = $fields.TaxAmount
Total = $fields.TotalAmount
}
return $invoice
}
# Usage
$response = Invoke-RestMethod -Uri "https://apix.docdigitizer.com/sync" `
-Method Post -Headers @{"x-api-key" = $ApiKey} -Form $form
$invoice = Process-CustomInvoice -Result $response
if ($invoice) {
Write-Host "Invoice: $($invoice.InvoiceNumber)"
Write-Host "Vendor: $($invoice.VendorName)"
Write-Host "Total: $($invoice.Total)"
}
Handling Missing Fields
# Python - Safe field access with defaults
def safe_get(data, *keys, default=None):
"""Safely navigate nested dictionaries."""
result = data
for key in keys:
if isinstance(result, dict):
result = result.get(key)
else:
return default
if result is None:
return default
return result
# Usage
vendor_city = safe_get(fields, "Vendor", "Address", "City", default="N/A")
tax_amount = safe_get(fields, "TaxAmount", default=0.0)
Schema Versioning
Schemas are versioned to manage changes over time without breaking existing integrations.
Version Format
Schemas use semantic versioning (Major.Minor):
- Major – Breaking changes (removed fields, type changes)
- Minor – Backward-compatible additions (new optional fields)
Version in Response
{
"StateText": "DONE",
"Output": [
{
"DocumentType": "CustomInvoice",
"SchemaVersion": "2.1",
"Fields": { ... }
}
]
}
Handling Version Changes
# Python - Version-aware processing
def process_invoice(doc):
version = doc.get("SchemaVersion", "1.0")
fields = doc.get("Fields", {})
major_version = int(version.split(".")[0])
if major_version >= 2:
# v2.x has nested Vendor object
vendor_name = fields.get("Vendor", {}).get("Name")
else:
# v1.x had flat VendorName field
vendor_name = fields.get("VendorName")
return {"vendor": vendor_name, ...}
Best Practices
Schema Design
- Start simple – Begin with essential fields, add more as needed
- Use clear names – Field names should be self-explanatory
- Group logically – Use nested objects for related fields
- Document everything – Add descriptions to help extraction accuracy
- Plan for nulls – Most fields should be optional unless truly required
Field Naming Conventions
| Convention | Example |
|---|---|
| PascalCase for field names | InvoiceNumber, TotalAmount |
| Descriptive but concise | VendorTaxID not VendorTaxIdentificationNumber |
| Consistent prefixes | VendorName, VendorAddress, VendorPhone |
| No abbreviations (mostly) | Quantity not Qty |
Integration Tips
- Validate responses – Check required fields are present
- Handle nulls gracefully – Optional fields may be missing
- Log schema versions – Track which version produced each extraction
- Test with edge cases – Verify behavior with unusual documents
- Monitor accuracy – Review extractions regularly for quality
Common Pitfalls
| Pitfall | Solution |
|---|---|
| Too many required fields | Only mark fields required if truly essential |
| Overly specific patterns | Allow for format variations in documents |
| Missing aliases | Provide alternative field names for better matching |
| Deep nesting | Keep nesting to 2-3 levels maximum |
| Ignoring sample diversity | Provide varied sample documents for training |