DocDigitizer Invoice Processing API Guide

This guide walks you through processing invoices with the DocDigitizer API. Learn how to extract invoice data, handle different invoice formats, and integrate invoice processing into your applications.

Overview

DocDigitizer’s invoice processing extracts structured data from commercial invoices, including header information, line items, totals, and payment details.

Supported Invoice Types

  • Commercial invoices
  • Tax invoices
  • Pro forma invoices
  • Credit notes
  • Debit notes
  • Self-billing invoices

Supported Countries

Invoice extraction supports country-specific formats for:

Region    Countries
Europe    Portugal (PT), Spain (ES), France (FR), Germany (DE), Italy (IT), UK (GB), Netherlands (NL), Belgium (BE)
Americas  United States (US), Brazil (BR), Mexico (MX), Canada (CA)
Other     Generic format for unlisted countries

Invoice Fields

The following fields are extracted from invoices. Field availability depends on the invoice content.

Header Fields

Field                Type           Description
invoiceNumber        String         Invoice number or reference
invoiceDate          String (Date)  Invoice issue date (YYYY-MM-DD)
dueDate              String (Date)  Payment due date
purchaseOrderNumber  String         Related PO number
currency             String         Currency code (EUR, USD, GBP, etc.)

Vendor (Seller) Fields

Field          Type    Description
vendorName     String  Vendor/supplier company name
vendorTaxId    String  Vendor tax ID / VAT number
vendorAddress  String  Full vendor address
vendorEmail    String  Vendor email address
vendorPhone    String  Vendor phone number

Customer (Buyer) Fields

Field            Type    Description
customerName     String  Customer/buyer company name
customerTaxId    String  Customer tax ID / VAT number
customerAddress  String  Full customer address
shippingAddress  String  Shipping/delivery address (if different)

Financial Fields

Field        Type    Description
subtotal     Number  Total before tax
taxRate      Number  Tax rate percentage
taxAmount    Number  Total tax amount
discount     Number  Discount amount
totalAmount  Number  Grand total including tax

Payment Fields

Field          Type    Description
paymentTerms   String  Payment terms (Net 30, etc.)
paymentMethod  String  Payment method
bankName       String  Bank name
bankAccount    String  Bank account number / IBAN
swiftCode      String  SWIFT/BIC code

Line Item Fields

Field                    Type    Description
lineItems                Array   Array of line item objects
lineItems[].description  String  Item description
lineItems[].quantity     Number  Quantity
lineItems[].unit         String  Unit of measure
lineItems[].unitPrice    Number  Price per unit
lineItems[].amount       Number  Line total
lineItems[].taxRate      Number  Tax rate for this item
lineItems[].productCode  String  Product/SKU code
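
Because field availability depends on the invoice content, it can help to check which fields are present before using an extraction. The sketch below is one way to do that. The helper name and the choice of "required" fields are assumptions made for illustration; the API does not define a required set.

```python
# Hypothetical helper: the "required" set is an illustrative assumption,
# not something defined by the API.
REQUIRED_FIELDS = ("invoiceNumber", "invoiceDate", "vendorName", "totalAmount")

def missing_fields(extraction, required=REQUIRED_FIELDS):
    """Return the required field names that are absent, None, or empty."""
    return [name for name in required if extraction.get(name) in (None, "", [])]

sample = {
    "invoiceNumber": "FT 2024/00156",
    "invoiceDate": "2024-01-15",
    "totalAmount": 3075.00,
}
print(missing_fields(sample))  # ['vendorName']
```

A numeric value of 0 is treated as present, so a zero total is not reported as missing.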

Processing Invoices

Basic Invoice Processing

cURL

curl -X POST https://apix.docdigitizer.com/sync \
  -H "x-api-key: YOUR_API_KEY" \
  -F "files=@invoice.pdf" \
  -F "id=$(uuidgen)" \
  -F "contextId=YOUR_CONTEXT_ID"
    

PowerShell

# Process an invoice
$headers = @{ "x-api-key" = "YOUR_API_KEY" }

$form = @{
    files = Get-Item -Path "invoice.pdf"
    id = [guid]::NewGuid().ToString()
    contextId = "YOUR_CONTEXT_ID"
}

$response = Invoke-RestMethod -Uri "https://apix.docdigitizer.com/sync" `
    -Method Post -Headers $headers -Form $form

# Check result
if ($response.StateText -eq "COMPLETED") {
    $invoice = $response.Output | Where-Object { $_.docType -eq "Invoice" } | Select-Object -First 1
    Write-Host "Invoice Number: $($invoice.extraction.invoiceNumber)"
    Write-Host "Total: $($invoice.extraction.totalAmount) $($invoice.extraction.currency)"
}
    

Python

import requests
import uuid

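def process_invoice(pdf_path, api_key, context_id):
    # Open the file in a context manager so the handle is always closed
    with open(pdf_path, "rb") as f:
        response = requests.post(
            "https://apix.docdigitizer.com/sync",
            headers={"x-api-key": api_key},
            files={"files": f},
            data={
                "id": str(uuid.uuid4()),
                "contextId": context_id
            },
            timeout=300  # avoid waiting indefinitely on a stalled connection
        )

    result = response.json()

    if result.get("StateText") == "COMPLETED":
        # Find the first invoice in the output
        for doc in result.get("Output", []):
            if doc.get("docType") == "Invoice":
                return doc.get("extraction")

    # Processing did not complete, or no invoice was found
    return None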

# Usage
invoice_data = process_invoice("invoice.pdf", "YOUR_API_KEY", "YOUR_CONTEXT_ID")
if invoice_data:
    print(f"Invoice: {invoice_data.get('invoiceNumber')}")
    print(f"Total: {invoice_data.get('totalAmount')} {invoice_data.get('currency')}")
    

Working with Results

Example Response

{
    "StateText": "COMPLETED",
    "TraceId": "INV4567",
    "NumberPages": 2,
    "Output": [
        {
            "docType": "Invoice",
            "country": "PT",
            "pages": [1, 2],
            "schema": "Invoice_PT.json",
            "extraction": {
                "invoiceNumber": "FT 2024/00156",
                "invoiceDate": "2024-01-15",
                "dueDate": "2024-02-14",
                "vendorName": "Tech Solutions, Lda",
                "vendorTaxId": "PT509876543",
                "vendorAddress": "Av. da Liberdade, 100, 1250-096 Lisboa",
                "customerName": "Global Corp, SA",
                "customerTaxId": "PT501234567",
                "customerAddress": "Rua do Ouro, 50, 1100-063 Lisboa",
                "subtotal": 2500.00,
                "taxRate": 23,
                "taxAmount": 575.00,
                "totalAmount": 3075.00,
                "currency": "EUR",
                "paymentTerms": "30 dias",
                "bankAccount": "PT50 0035 0000 12345678901 94",
                "lineItems": [
                    {
                        "description": "Software Development Services",
                        "quantity": 40,
                        "unit": "hours",
                        "unitPrice": 50.00,
                        "amount": 2000.00,
                        "taxRate": 23
                    },
                    {
                        "description": "Cloud Hosting - January",
                        "quantity": 1,
                        "unit": "month",
                        "unitPrice": 500.00,
                        "amount": 500.00,
                        "taxRate": 23
                    }
                ]
            }
        }
    ]
}
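
Amounts in the response are JSON numbers. When exact decimal arithmetic matters (for example when comparing totals), one option is to parse them as `Decimal` using the standard library's `parse_float` hook. This is a sketch on a trimmed copy of the example above, not a requirement of the API.

```python
import json
from decimal import Decimal

# Trimmed copy of the financial fields from the example response
raw = '{"subtotal": 2500.00, "taxRate": 23, "taxAmount": 575.00, "totalAmount": 3075.00}'

# parse_float receives each JSON float as a string, so no binary rounding occurs
extraction = json.loads(raw, parse_float=Decimal)

total = extraction["subtotal"] + extraction["taxAmount"]
print(total == extraction["totalAmount"])    # True
print(type(extraction["taxRate"]).__name__)  # int (JSON integers are left unchanged)
```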
    

Accessing Invoice Data

PowerShell

# Assuming $response contains the API response

$invoice = $response.Output[0].extraction

# Header information
$invoiceNumber = $invoice.invoiceNumber
$invoiceDate = $invoice.invoiceDate
$dueDate = $invoice.dueDate

# Vendor information
$vendor = @{
    Name = $invoice.vendorName
    TaxId = $invoice.vendorTaxId
    Address = $invoice.vendorAddress
}

# Financial summary
$totals = @{
    Subtotal = $invoice.subtotal
    Tax = $invoice.taxAmount
    Total = $invoice.totalAmount
    Currency = $invoice.currency
}

# Display summary
Write-Host "=== Invoice $invoiceNumber ==="
Write-Host "Date: $invoiceDate | Due: $dueDate"
Write-Host "Vendor: $($vendor.Name) ($($vendor.TaxId))"
Write-Host "Total: $($totals.Total) $($totals.Currency)"
    

Python

def display_invoice(extraction):
    """Display invoice data in a readable format."""

    print(f"{'='*50}")
    print(f"INVOICE: {extraction.get('invoiceNumber', 'N/A')}")
    print(f"{'='*50}")

    print(f"\nDate: {extraction.get('invoiceDate', 'N/A')}")
    print(f"Due:  {extraction.get('dueDate', 'N/A')}")

    print(f"\n--- VENDOR ---")
    print(f"Name:    {extraction.get('vendorName', 'N/A')}")
    print(f"Tax ID:  {extraction.get('vendorTaxId', 'N/A')}")
    print(f"Address: {extraction.get('vendorAddress', 'N/A')}")

    print(f"\n--- CUSTOMER ---")
    print(f"Name:    {extraction.get('customerName', 'N/A')}")
    print(f"Tax ID:  {extraction.get('customerTaxId', 'N/A')}")

    print(f"\n--- TOTALS ---")
    # Do not assume a currency when none was extracted
    currency = extraction.get('currency', 'N/A')
    print(f"Subtotal: {extraction.get('subtotal', 0):.2f} {currency}")
    print(f"Tax:      {extraction.get('taxAmount', 0):.2f} {currency}")
    print(f"TOTAL:    {extraction.get('totalAmount', 0):.2f} {currency}")
    

Handling Line Items

Line items are returned as an array. Here’s how to process them:

PowerShell

# Process line items
$lineItems = $invoice.lineItems

Write-Host "`n=== Line Items ==="
Write-Host ("-" * 80)
Write-Host ("{0,-40} {1,10} {2,12} {3,12}" -f "Description", "Qty", "Unit Price", "Amount")
Write-Host ("-" * 80)

$total = 0
foreach ($item in $lineItems) {
    Write-Host ("{0,-40} {1,10} {2,12:N2} {3,12:N2}" -f `
        $item.description.Substring(0, [Math]::Min(40, $item.description.Length)),
        $item.quantity,
        $item.unitPrice,
        $item.amount)
    $total += $item.amount
}

Write-Host ("-" * 80)
Write-Host ("{0,-40} {1,10} {2,12} {3,12:N2}" -f "TOTAL", "", "", $total)
    

Python

def process_line_items(extraction):
    """Process and display line items."""

    line_items = extraction.get('lineItems', [])

    if not line_items:
        print("No line items found")
        return []  # same return type as the non-empty case

    print(f"\n{'Description':<40} {'Qty':>8} {'Price':>12} {'Amount':>12}")
    print("-" * 75)

    total = 0
    for item in line_items:
        desc = item.get('description', 'N/A')[:40]
        qty = item.get('quantity', 0)
        price = item.get('unitPrice', 0)
        amount = item.get('amount', 0)

        print(f"{desc:<40} {qty:>8} {price:>12.2f} {amount:>12.2f}")
        total += amount

    print("-" * 75)
    print(f"{'TOTAL':<40} {'':>8} {'':>12} {total:>12.2f}")

    return line_items
    

Validating Line Items

def validate_line_items(extraction):
    """Validate that line items sum to subtotal."""

    line_items = extraction.get('lineItems', [])
    subtotal = extraction.get('subtotal', 0)

    calculated_total = sum(item.get('amount', 0) for item in line_items)

    # Allow small rounding differences
    tolerance = 0.01
    if abs(calculated_total - subtotal) > tolerance:
        print(f"Warning: Line items ({calculated_total:.2f}) don't match subtotal ({subtotal:.2f})")
        return False

    return True
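
A complementary check works per line: quantity times unit price should equal the line amount. The sketch below assumes line amounts carry no per-line discount or tax, which may not hold for every invoice layout. The function name is hypothetical.

```python
from decimal import Decimal

def check_line_arithmetic(extraction, tolerance=Decimal("0.01")):
    """Return the indexes of line items where quantity * unitPrice != amount."""
    mismatches = []
    for index, item in enumerate(extraction.get("lineItems", [])):
        quantity = item.get("quantity")
        price = item.get("unitPrice")
        amount = item.get("amount")
        if quantity is None or price is None or amount is None:
            continue  # an incomplete line cannot be checked
        # Convert via str() so float inputs keep their printed decimal value
        expected = Decimal(str(quantity)) * Decimal(str(price))
        if abs(expected - Decimal(str(amount))) > tolerance:
            mismatches.append(index)
    return mismatches

example = {
    "lineItems": [
        {"quantity": 40, "unitPrice": 50.00, "amount": 2000.00},
        {"quantity": 1, "unitPrice": 500.00, "amount": 500.00},
    ]
}
print(check_line_arithmetic(example))  # []
```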
    

Multi-Page Invoices

DocDigitizer handles multi-page invoices automatically, combining data from all pages into a single extraction result.

How It Works

  • All pages of an invoice are analyzed together
  • Header information is typically taken from page 1
  • Line items are collected from all pages
  • Totals are extracted from the summary section (often on the last page)

Identifying Page Numbers

The pages array in the response shows which pages contain the invoice:

{
    "docType": "Invoice",
    "pages": [1, 2, 3],  // Invoice spans pages 1-3
    "extraction": { ... }
}
    

Example: Processing Multi-Page Invoice

# Check if invoice spans multiple pages
$invoice = $response.Output[0]
$pageCount = $invoice.pages.Count

if ($pageCount -gt 1) {
    Write-Host "Multi-page invoice detected: Pages $($invoice.pages -join ', ')"
}

# Line items from all pages are combined
$totalLineItems = $invoice.extraction.lineItems.Count
Write-Host "Total line items extracted: $totalLineItems"
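
The same check can be sketched in Python. The function name and the keys of the summary dictionaries are chosen for illustration; only `Output`, `docType`, `pages`, `extraction`, and `lineItems` come from the response format shown earlier.

```python
def summarize_pages(result):
    """Return one summary dict per document found in the response."""
    summaries = []
    for doc in result.get("Output", []):
        pages = doc.get("pages", [])
        line_items = doc.get("extraction", {}).get("lineItems", [])
        summaries.append({
            "docType": doc.get("docType"),
            "pages": pages,
            "multiPage": len(pages) > 1,
            "lineItemCount": len(line_items),
        })
    return summaries

# Minimal stand-in for an API response
response = {
    "Output": [
        {"docType": "Invoice", "pages": [1, 2, 3],
         "extraction": {"lineItems": [{"description": "A"}, {"description": "B"}]}}
    ]
}
for summary in summarize_pages(response):
    print(summary)
```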
    

Batch Processing

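To process many invoices, loop over the files in a folder and submit them one at a time, pausing between requests to respect rate limits. The functions below write one summary row per file to a CSV.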

PowerShell Batch Processing

function Process-InvoiceBatch {
    param(
        [string]$FolderPath,
        [string]$ApiKey,
        [string]$ContextId,
        [string]$OutputCsv = "results.csv"
    )

    $pdfFiles = Get-ChildItem -Path $FolderPath -Filter "*.pdf"
    $results = @()

    foreach ($pdf in $pdfFiles) {
        Write-Host "Processing: $($pdf.Name)..."

        try {
            $form = @{
                files = Get-Item -Path $pdf.FullName
                id = [guid]::NewGuid().ToString()
                contextId = $ContextId
            }

            $response = Invoke-RestMethod -Uri "https://apix.docdigitizer.com/sync" `
                -Method Post `
                -Headers @{ "x-api-key" = $ApiKey } `
                -Form $form

            # Pick the first invoice document, if processing completed
            $invoice = $null
            if ($response.StateText -eq "COMPLETED") {
                $invoice = $response.Output | Where-Object { $_.docType -eq "Invoice" } | Select-Object -First 1
            }

            if ($invoice) {
                $results += [PSCustomObject]@{
                    FileName = $pdf.Name
                    Status = "Success"
                    InvoiceNumber = $invoice.extraction.invoiceNumber
                    VendorName = $invoice.extraction.vendorName
                    InvoiceDate = $invoice.extraction.invoiceDate
                    TotalAmount = $invoice.extraction.totalAmount
                    Currency = $invoice.extraction.currency
                    TraceId = $response.TraceId
                }
            } else {
                # Processing did not complete, or no invoice was found in the output
                $results += [PSCustomObject]@{
                    FileName = $pdf.Name
                    Status = "Error"
                    InvoiceNumber = ""
                    VendorName = ""
                    InvoiceDate = ""
                    TotalAmount = ""
                    Currency = ""
                    TraceId = $response.TraceId
                }
            }
        }
        catch {
            $results += [PSCustomObject]@{
                FileName = $pdf.Name
                Status = "Failed: $_"
                InvoiceNumber = ""
                VendorName = ""
                InvoiceDate = ""
                TotalAmount = ""
                Currency = ""
                TraceId = ""
            }
        }

        # Respect rate limits
        Start-Sleep -Milliseconds 500
    }

    # Export results
    $results | Export-Csv -Path $OutputCsv -NoTypeInformation
    Write-Host "`nProcessed $($results.Count) invoices. Results saved to $OutputCsv"

    return $results
}

# Usage
Process-InvoiceBatch -FolderPath "C:\Invoices" -ApiKey "YOUR_API_KEY" -ContextId "YOUR_CONTEXT_ID"
    

Python Batch Processing

import csv
import uuid
import time
import requests
from pathlib import Path

def process_invoice_batch(folder_path, api_key, context_id, output_csv="results.csv"):
    """Process all PDF invoices in a folder."""

    pdf_files = list(Path(folder_path).glob("*.pdf"))
    results = []

    for pdf_file in pdf_files:
        print(f"Processing: {pdf_file.name}...")

        try:
            with open(pdf_file, "rb") as f:
                response = requests.post(
                    "https://apix.docdigitizer.com/sync",
                    headers={"x-api-key": api_key},
                    files={"files": f},
                    data={
                        "id": str(uuid.uuid4()),
                        "contextId": context_id
                    },
                    timeout=300
                )

            result = response.json()

            if result.get("StateText") == "COMPLETED":
                invoice = next(
                    (doc for doc in result.get("Output", [])
                     if doc.get("docType") == "Invoice"),
                    None
                )

                if invoice:
                    ext = invoice.get("extraction", {})
                    results.append({
                        "filename": pdf_file.name,
                        "status": "Success",
                        "invoice_number": ext.get("invoiceNumber", ""),
                        "vendor_name": ext.get("vendorName", ""),
                        "invoice_date": ext.get("invoiceDate", ""),
                        "total_amount": ext.get("totalAmount", ""),
                        "currency": ext.get("currency", ""),
                        "trace_id": result.get("TraceId", "")
                    })
                    continue

            results.append({
                "filename": pdf_file.name,
                "status": "Error",
                "invoice_number": "",
                "vendor_name": "",
                "invoice_date": "",
                "total_amount": "",
                "currency": "",
                "trace_id": result.get("TraceId", "")
            })

        except Exception as e:
            results.append({
                "filename": pdf_file.name,
                "status": f"Failed: {e}",
                "invoice_number": "",
                "vendor_name": "",
                "invoice_date": "",
                "total_amount": "",
                "currency": "",
                "trace_id": ""
            })

        # Respect rate limits
        time.sleep(0.5)

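    # Export to CSV (explicit field names, so an empty folder still yields a valid file)
    fieldnames = ["filename", "status", "invoice_number", "vendor_name",
                  "invoice_date", "total_amount", "currency", "trace_id"]
    with open(output_csv, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(results)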

    print(f"\nProcessed {len(results)} invoices. Results saved to {output_csv}")
    return results

# Usage
process_invoice_batch("/path/to/invoices", "YOUR_API_KEY", "YOUR_CONTEXT_ID")
    

Integration Examples

Export to Accounting Software

def export_to_accounting_format(extraction):
    """Convert extraction to common accounting import format."""

    return {
        "document_type": "INVOICE",
        "document_number": extraction.get("invoiceNumber"),
        "document_date": extraction.get("invoiceDate"),
        "due_date": extraction.get("dueDate"),
        "supplier": {
            "name": extraction.get("vendorName"),
            "tax_id": extraction.get("vendorTaxId"),
            "address": extraction.get("vendorAddress")
        },
        "amounts": {
            "net_amount": extraction.get("subtotal"),
            "tax_amount": extraction.get("taxAmount"),
            "gross_amount": extraction.get("totalAmount"),
            "currency": extraction.get("currency")
        },
        "lines": [
            {
                "description": item.get("description"),
                "quantity": item.get("quantity"),
                "unit_price": item.get("unitPrice"),
                "amount": item.get("amount"),
                "tax_rate": item.get("taxRate")
            }
            for item in extraction.get("lineItems", [])
        ],
        "payment": {
            "terms": extraction.get("paymentTerms"),
            "bank_account": extraction.get("bankAccount")
        }
    }
    

Database Integration

import sqlite3

def save_invoice_to_db(extraction, db_path="invoices.db"):
    """Save extracted invoice to SQLite database."""

    conn = sqlite3.connect(db_path)
    cursor = conn.cursor()

    # Create tables if not exist
    cursor.execute('''
        CREATE TABLE IF NOT EXISTS invoices (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            invoice_number TEXT,
            invoice_date TEXT,
            due_date TEXT,
            vendor_name TEXT,
            vendor_tax_id TEXT,
            customer_name TEXT,
            subtotal REAL,
            tax_amount REAL,
            total_amount REAL,
            currency TEXT,
            created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
        )
    ''')

    cursor.execute('''
        CREATE TABLE IF NOT EXISTS invoice_lines (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            invoice_id INTEGER,
            description TEXT,
            quantity REAL,
            unit_price REAL,
            amount REAL,
            FOREIGN KEY (invoice_id) REFERENCES invoices (id)
        )
    ''')

    # Insert invoice
    cursor.execute('''
        INSERT INTO invoices (
            invoice_number, invoice_date, due_date, vendor_name, vendor_tax_id,
            customer_name, subtotal, tax_amount, total_amount, currency
        ) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)
    ''', (
        extraction.get("invoiceNumber"),
        extraction.get("invoiceDate"),
        extraction.get("dueDate"),
        extraction.get("vendorName"),
        extraction.get("vendorTaxId"),
        extraction.get("customerName"),
        extraction.get("subtotal"),
        extraction.get("taxAmount"),
        extraction.get("totalAmount"),
        extraction.get("currency")
    ))

    invoice_id = cursor.lastrowid

    # Insert line items
    for item in extraction.get("lineItems", []):
        cursor.execute('''
            INSERT INTO invoice_lines (invoice_id, description, quantity, unit_price, amount)
            VALUES (?, ?, ?, ?, ?)
        ''', (
            invoice_id,
            item.get("description"),
            item.get("quantity"),
            item.get("unitPrice"),
            item.get("amount")
        ))

    conn.commit()
    conn.close()

    return invoice_id
    

Best Practices

Document Quality

  • Scan invoices at a minimum of 150 DPI (300 DPI recommended)
  • Ensure text is legible and not blurry
  • Avoid skewed or rotated documents
  • Native PDFs (not scanned) provide the best results

Data Validation

  • Validate extracted totals against line item sums
  • Check that tax calculations are correct
  • Verify that vendor and customer tax IDs are in the correct format
  • Validate date formats before database insertion
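
Some of the checks above can be sketched as a single function. This is a minimal illustration: it assumes one tax rate for the whole invoice and no discount, and it skips tax ID validation because those rules are country-specific. The function name and messages are hypothetical.

```python
from datetime import date

def validate_invoice(extraction, tolerance=0.01):
    """Return a list of human-readable problems (empty if none were found)."""
    problems = []
    subtotal = extraction.get("subtotal")
    tax_rate = extraction.get("taxRate")
    tax_amount = extraction.get("taxAmount")
    total = extraction.get("totalAmount")

    # Assumption: totalAmount = subtotal + taxAmount (no discount applied)
    if subtotal is not None and tax_amount is not None and total is not None:
        if abs(subtotal + tax_amount - total) > tolerance:
            problems.append("subtotal + taxAmount does not match totalAmount")

    # Assumption: a single tax rate applies to the whole subtotal
    if subtotal is not None and tax_rate is not None and tax_amount is not None:
        if abs(subtotal * tax_rate / 100 - tax_amount) > tolerance:
            problems.append("taxAmount does not match subtotal * taxRate")

    # Dates are documented as YYYY-MM-DD
    for name in ("invoiceDate", "dueDate"):
        value = extraction.get(name)
        if value is not None:
            try:
                date.fromisoformat(value)
            except (TypeError, ValueError):
                problems.append(f"{name} is not a valid YYYY-MM-DD date: {value!r}")

    return problems

example = {
    "invoiceDate": "2024-01-15", "dueDate": "2024-02-14",
    "subtotal": 2500.00, "taxRate": 23, "taxAmount": 575.00, "totalAmount": 3075.00,
}
print(validate_invoice(example))  # []
```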

Error Handling

  • Always check StateText for “COMPLETED”
  • Handle missing fields gracefully (use defaults)
  • Log TraceId for troubleshooting
  • Implement retry logic for transient failures
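
Retry logic can be sketched as a small wrapper with exponential backoff. Here `send` is a hypothetical zero-argument callable that performs one request and returns the parsed response. Which failures count as transient is an assumption: this sketch retries only on `OSError`, the base class of the built-in connection and timeout errors and of the exceptions raised by `requests`.

```python
import time

def with_retries(send, attempts=3, base_delay=1.0, retry_on=(OSError,), sleep=time.sleep):
    """Call send() and retry with exponential backoff when it raises retry_on."""
    for attempt in range(attempts):
        try:
            return send()
        except retry_on as exc:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            delay = base_delay * 2 ** attempt  # 1s, 2s, 4s, ...
            print(f"Attempt {attempt + 1} failed ({exc}); retrying in {delay}s")
            sleep(delay)

# Usage with a stand-in that fails twice before succeeding
state = {"calls": 0}

def flaky_send():
    state["calls"] += 1
    if state["calls"] < 3:
        raise ConnectionError("temporary network problem")
    return {"StateText": "COMPLETED", "TraceId": "INV4567"}

result = with_retries(flaky_send, sleep=lambda seconds: None)  # skip real waiting in the demo
print(result["StateText"], "after", state["calls"], "calls")
```

The wrapper returns whatever `send` returns, so the caller still checks `StateText` and logs `TraceId` as usual.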

Performance

  • Respect rate limits (use delays between requests)
  • Process invoices in parallel where possible (within limits)
  • Cache results for already-processed invoices
  • Use batch processing for large volumes
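
Caching can be sketched by keying stored results on a hash of the file contents, so an unchanged file is never submitted twice. The cache layout (one JSON file per SHA-256 digest) and the `process` callable are assumptions made for illustration; `process` stands in for any function that takes a path and returns a JSON-serializable result or `None`.

```python
import hashlib
import json
from pathlib import Path

def file_digest(path):
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

def process_with_cache(pdf_path, process, cache_dir="cache"):
    """Return the cached result for pdf_path, calling process() only on a miss."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    cache_file = cache_dir / f"{file_digest(pdf_path)}.json"

    if cache_file.exists():
        return json.loads(cache_file.read_text(encoding="utf-8"))

    result = process(pdf_path)
    if result is not None:  # do not cache failures
        cache_file.write_text(json.dumps(result), encoding="utf-8")
    return result
```

Hashing the contents rather than the file name means a renamed copy of the same invoice still hits the cache.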